In this section, we first describe the CW-EdgeConv and global attention modules. Then, we present an overview of the whole model for classification and segmentation. Finally, we compare several structures of attention modules.
3.1. Channel Weighting Edge Convolution (CW-EdgeConv) Module
Our CW-EdgeConv module is an extension of EdgeConv and consists of four steps: (1) calculate the k nearest neighbors of each point using a kNN query, (2) map low-dimensional geometrical features to high-dimensional features using a Multilayer Perceptron (MLP) [35], (3) apply channel weighting, and (4) aggregate the features of the nearest neighbor points into the feature of a single point. The original EdgeConv is described at the end of this subsection.
The first step is the kNN query, which takes a set of points as input and calculates the k nearest neighbors of each point. Specifically, consider an input point set $P = \{p_1, p_2, \dots, p_N\} \subseteq \mathbb{R}^C$, where $N$ is the total number of points and $C$ is the dimension of the geometrical features of a point, such as the 3D location and normal. Given that our model does not resample points before each CW-EdgeConv layer, the number of points considered remains $N$. For each $p_i$, we define a subset centered at it and choose the $k$ nearest points, excluding the center $p_i$ itself. Therefore, the kNN query of $p_i$ can be calculated as
$$\mathrm{kNN}(p_i) = \{p_i^1, p_i^2, \dots, p_i^k\},$$
where $p_i^k$ is the $k$-th nearest point from $p_i$, calculated using the kNN query. Therefore, the grouped input can be represented by
$$G = \{\mathrm{kNN}(p_i) \mid i = 1, 2, \dots, N\}.$$
We apply the kNN query to group the point set in each layer because of its simplicity and low inference time.
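The kNN grouping step can be sketched as follows. This is an illustrative brute-force NumPy version (the function name `knn_group` and the toy data are ours, not from the paper); a practical implementation would run batched on the GPU.

```python
import numpy as np

def knn_group(points, k):
    """For each point, return the indices of its k nearest neighbors
    (excluding the point itself), via brute-force pairwise distances."""
    # points: (N, C) array of per-point geometric features
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(-1)          # (N, N) squared Euclidean distances
    # Exclude the center point itself by pushing its distance to infinity.
    np.fill_diagonal(dist2, np.inf)
    # Indices of the k smallest distances per row, in ascending order.
    return np.argsort(dist2, axis=1)[:, :k]   # (N, k)

pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.1, 0.0],
                [5.0, 5.0, 5.0]])
nbrs = knn_group(pts, k=2)   # nbrs[i] holds the 2 nearest neighbors of point i
```

The isolated point at (5, 5, 5) still receives a full group of k neighbors; the kNN query fixes the group size, not the group radius.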
The second step uses an MLP to map the low-dimensional geometrical features to high-dimensional features. These low-dimensional geometrical features include the edge features of the form $(p_i^j - p_i)$ and the original input points $p_i$, where $j = 1, 2, \dots, k$. The choice of these features strictly follows the best option in EdgeConv [16]. Given such features, we use an MLP to calculate the high-dimensional features. Specifically, we apply a 2D convolutional layer followed by a batch normalization layer [36] and a ReLU activation function [37]. We use $h(\cdot)$ to denote this convolutional operation on one group:
$$m_i = h\big(\,[\,p_i,\; p_i^j - p_i\,]\,\big), \quad j = 1, 2, \dots, k.$$
Note that $h(\cdot)$ is shared across all groups, in that it works as a nonlinear function that discovers the intrinsic features of each group in high-dimensional space, such as the density, mean distance, etc. This is achieved by extracting the correlations among the input geometric features.
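The construction of the concatenated edge features can be sketched as follows (the function name `edge_features` and the toy values are ours, for illustration only; the shared MLP then maps the last axis channel-wise):

```python
import numpy as np

def edge_features(points, nbr_idx):
    """Build EdgeConv-style inputs [p_i, p_i^j - p_i] for every group.
    points: (N, C); nbr_idx: (N, k) neighbor indices from the kNN query."""
    N, k = nbr_idx.shape
    neighbors = points[nbr_idx]                      # (N, k, C) gathered neighbors
    center = np.repeat(points[:, None, :], k, 1)     # (N, k, C) center, tiled
    # Concatenate the center point with the relative edge vector.
    return np.concatenate([center, neighbors - center], axis=-1)   # (N, k, 2C)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
idx = np.array([[1], [0], [0]])        # k = 1 toy neighborhood
feat = edge_features(pts, idx)         # shape (3, 1, 4)
```

The relative part $(p_i^j - p_i)$ is translation-invariant, while the absolute part $p_i$ keeps the global position, which is why EdgeConv concatenates both.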
For the third step, given the middle features $m_i$ output by the convolutional layers, we apply channel weighting on these middle features by adapting a squeeze-and-excitation block (SE-Block2d) [5] layer. The architecture of the SE-Block is shown in Figure 3. Here, we simply abbreviate the SE-Block as $\mathrm{SE}(\cdot)$:
$$s_i = \mathrm{SE}(m_i) \in \mathbb{R}^{k \times C'},$$
where $C'$ is the number of output channels of $h(\cdot)$.
We made two modifications to the SE-Block [5]: (1) we adapt a 1D channel-weighting model to fit the dimension of the concatenated feature; (2) we keep the original channel size of each layer in the block, because reducing the layer parameters limits the performance of channel weighting.
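The channel-weighting idea can be sketched with a toy squeeze-and-excitation computation. This is a minimal NumPy stand-in (function names, weight shapes, and values are ours): the squeeze is a mean over the group axis, and, per modification (2), the two excitation layers keep the channel size instead of reducing it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_weighting(feat, w1, w2):
    """SE-style channel weighting for one group (toy version).
    feat: (k, C) features; w1, w2: (C, C) -- channel size kept, no reduction."""
    squeeze = feat.mean(axis=0)                            # (C,) channel descriptor
    excite = sigmoid(np.maximum(squeeze @ w1, 0.0) @ w2)   # (C,) weights in (0, 1)
    return feat * excite                                   # re-scale each channel

feat = np.array([[1.0, 2.0], [3.0, 4.0]])   # k = 2 neighbors, C = 2 channels
w1 = np.eye(2)
w2 = np.eye(2)
out = se_channel_weighting(feat, w1, w2)
```

Each channel is scaled by a single learned weight in (0, 1), so the block can suppress uninformative channels of a group while leaving the group structure intact.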
In the fourth step, we aggregate the features of the $k$ nearest neighbor points, $s_i$, into the feature of a single point $p_i$. This is similar to 2D convolutional networks, in which each output pixel value is aggregated from several values within a kernel. Here, we follow the convention of PointNet, PointNet++, and EdgeConv: the aggregation function is $\max$ instead of $\sum$. The output for a group centered at $p_i$ is calculated as
$$x_i = \max_{j = 1, \dots, k} s_i^j.$$
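The max-aggregation reduces each $(k, C')$ group to a single $C'$-dimensional point feature; a minimal sketch (shapes and values are ours, for illustration only):

```python
import numpy as np

# Max-aggregation over the neighbor axis: each point keeps, per channel,
# the strongest response among its k neighbors (cf. PointNet/EdgeConv).
group_feat = np.array([[[1.0, 5.0],
                        [3.0, 2.0]],       # group of point 0 (k = 2, C' = 2)
                       [[0.0, 1.0],
                        [4.0, 0.5]]])      # group of point 1
point_feat = group_feat.max(axis=1)        # (N, C') = (2, 2)
```

Because max is taken per channel, different channels of the same point may come from different neighbors, which makes the result insensitive to the ordering of the neighbors.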
Finally, the output of the whole CW-EdgeConv layer is
$$X = \{x_1, x_2, \dots, x_N\}.$$
We denote the output of the $l$-th CW-EdgeConv layer as $X^l$. After obtaining the output $X^L$ of the last CW-EdgeConv layer, we further utilize a shared MLP and an SE-1d block to obtain the global feature $F$.
Remarks: The only difference between the first CW-EdgeConv++ layer and the following CW-EdgeConv layers is that the former outputs additional geometric features for the global attention module (see the next subsection for details). The form of this additional output is represented as
$$g_i = \big[\,p_i,\; p_i^j - p_i,\; d(p_i, p_i^j)\,\big], \quad j = 1, 2, \dots, k,$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance and $k$ specifies the number of points in a group.
Remarks: The original EdgeConv module only contains steps 1, 2, and 4 of CW-EdgeConv. Compared with Equation (4), the output for a group centered at $p_i$ in EdgeConv can be calculated as
$$\tilde{x}_i = \max_{j = 1, \dots, k} m_i^j,$$
that is, the middle features are aggregated directly, without channel weighting.
3.2. Global Attention Module
The input of this module is the output $\{g_i\}$ of the CW-EdgeConv++ module (Figure 2). Similar to the channel attention in SENet [5], we utilize two 2D convolutional layers to reduce the dimensions of the grouped features (the input of this module) and one sigmoid function to generate the soft attention mask (Figure 2). For a specific point group centered at $p_i$, the importance $a_i$ is calculated as
$$a_i = \mathrm{Sigmoid}\big(h_2(h_1(g_i))\big),$$
where the number of output channels of $h_2$ is one and $\mathrm{Sigmoid}$ denotes the sigmoid activation function, $\mathrm{Sigmoid}(x) = 1/(1 + e^{-x})$. Finally, the module outputs a learned soft mask $A = \{a_1, a_2, \dots, a_N\}$.
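The computation of a per-point importance score can be sketched as follows. This is a simplified NumPy stand-in, not the paper's implementation: the two convolutions are replaced by two shared linear maps (the weight names `w1`, `w2` and the max-pooling over the group are our assumptions), followed by the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_mask(grouped, w1, w2):
    """Per-point soft attention scores (toy stand-in for the two conv layers).
    grouped: (N, k, C) grouped features; w1: (C, H); w2: (H, 1)."""
    h = np.maximum(grouped @ w1, 0.0)    # (N, k, H) shared nonlinear map
    score = (h @ w2).squeeze(-1)         # (N, k) single output channel
    return sigmoid(score.max(axis=1))    # (N,) pool over the group, then squash

grouped = np.array([[[1.0, 0.0], [0.0, 1.0]],
                    [[2.0, 2.0], [0.0, 0.0]]])   # N = 2 points, k = 2, C = 2
a = attention_mask(grouped, np.ones((2, 3)), np.ones((3, 1)))
```

Each score lies in (0, 1), so the mask can only attenuate, never amplify, a point's contribution to the global feature.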
The motivation behind this design is simple; consider the classification task as an example. Each object class has characteristic patterns that make it distinct from other classes, such as the strings of guitars or the wings of airplanes. Such characteristic patterns may be neglected because of the excessive number of features extracted during the pooling aggregation process. Therefore, it is necessary to measure the importance of each group and use it to weight the global feature via our learned soft mask $A$.
The reason why we feed more pivotal geometric information (i.e., the additional features in Equation (7)) into the global attention module is to accelerate and improve the learning of the global soft mask $A$. Although an MLP can theoretically approximate any nonlinear function, including high-order information such as the square of the Euclidean distance (2-order: $d^2(p_i, p_i^j)$) within a group, experiments show that models with high-order convolutional filters can achieve higher classification accuracy on several benchmarks [20]. Inspired by this idea, and to address the same problem in our model, we feed additional pivotal geometric information (i.e., the features in Equation (7)) to help the shared MLP efficiently find the features of characteristic patterns and determine the importance $a_i$ of every input point $p_i$.
In summary, this module aims to automatically discover the characteristic patterns of point clouds and generate a point-wise soft attention mask $A$ that multiplies the global feature $F$.
3.3. Architecture for Classification and Segmentation
After obtaining the mask $A$ from the global attention module and the global feature $F$, we perform an element-wise multiplication between them and apply the ReLU activation function to generate the new global feature, i.e., $F$ after being masked.
For classification (Figure 2), we use both max-pooling and average-pooling to aggregate all points in the global feature and concatenate the results. Finally, we use a 3-layer MLP to output the classification scores. C, C/R, and C are the dimensions of the three layers of the MLP, respectively, where R is the reduction factor used to reduce the number of parameters.
For segmentation, similar to other approaches, we first tile the one-hot category label and concatenate it with the global feature and the output of the ReLU and max-pooling operations (Figure 2). The following 4-layer MLP eventually outputs the point-wise segmentation scores.
The selection of the aggregation function over all points has been discussed in a few studies [16]. Most models use max-pooling rather than average-pooling because of the convention inherited from PointNet. Intuitively, max-pooling ought to be better than average-pooling because the strongest activation is probably the most prominent feature of a class. However, the output of average-pooling can also reflect important traits of a class; otherwise, models using average-pooling would not achieve reasonable results. To gather more valuable information, we concatenate the results of both the average-pooling and max-pooling layers into a complete vector for classification, whose dimension is 2048.
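The dual-pooling concatenation described above can be sketched as follows (the feature sizes are toy values; in the model the concatenation yields the 2048-dimensional classification vector):

```python
import numpy as np

# Concatenating max- and average-pooled global features: max-pooling keeps the
# strongest per-channel activation, average-pooling keeps the mean response,
# and the concatenation doubles the descriptor dimension (e.g. 1024 -> 2048).
global_feat = np.array([[1.0, 0.0],
                        [3.0, 2.0],
                        [2.0, 4.0]])                 # (N, C) = (3, 2)
pooled = np.concatenate([global_feat.max(axis=0),    # per-channel maximum
                         global_feat.mean(axis=0)])  # per-channel mean
```

Both halves are permutation-invariant over the points, so the combined descriptor remains a valid set-level feature.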