1. Introduction
A point cloud is an unordered, sparse set of points defined in a coordinate space and sampled from object surfaces to capture their spatial and semantic information [1]. Point clouds are obtained with 3D sensors such as LiDAR scanners and RGB-D cameras. They are used in human–machine interaction [2], autonomous driving [3], and robotics [4], and therefore have high practical value.
Manually labeling point cloud targets is very expensive. Unsupervised visual representation learning can learn effective visual representations of 3D point cloud targets without label information. Therefore, unsupervised visual representation learning for point clouds [5] has become a research hotspot.
Unsupervised visual representation learning of point clouds is mainly based on generative methods. Architectures such as autoencoders and generative adversarial networks have achieved success in learning visual representations of point cloud targets and in generating realistic samples from complex underlying distributions. Yang et al. proposed FoldingNet [6], which trains an end-to-end deep autoencoder directly on unordered point clouds. A new decoding operation called folding is also proposed, which achieves a low reconstruction error even for objects with fine structure, and it is shown theoretically that folding is universal in point cloud reconstruction. Achlioptas et al. proposed LatentGAN [7], in which an autoencoder is first trained to learn a latent representation and a generative model, l-GAN, is then trained in the fixed latent space; shape editing is realized through simple algebraic operations such as semantic part editing, shape analogies, shape interpolation, and shape completion. This is easier to train than the original generative adversarial network and achieves better reconstruction and better coverage of the data distribution.
However, generative methods for point clouds are trained by minimizing the reconstruction error between the reconstructed point cloud and the original point cloud. They therefore pay more attention to the details of each point than to abstract semantic information [8]. Point-based analysis usually assumes independence between points, which reduces the ability to model correlations or complex structures.
Discriminative methods in unsupervised representation learning use objective functions similar to those of supervised learning to learn visual representations. Compared with generative methods, discriminative methods focus more on the feature space than on the details of each point. Discriminative methods generally train the network with one or more custom pretext tasks [9], where the inputs and labels come from unlabeled data sets; the pretext task provides a strong supervision signal for semantic information, thereby promoting the learning of semantic information. Doersch et al. [5] extracted random patch pairs from each image and trained a discriminative model to predict their relative positions in the image; experiments show that the feature representation learned from this intra-image context prediction task can indeed capture visual similarity between images. Komodakis et al. [10] used recognizing the two-dimensional rotation of the input image as a pretext task to learn semantic information from images; a comparison of attention maps showed that the regions of interest of this method match those of supervised learning. Doersch et al. [11] used a very large ResNet-101 network to jointly train on four different pretext tasks. Lasso regularization is also explored to encourage the network to factorize the information in its representation and to coordinate the network inputs in order to learn a more unified representation. The experimental results show that deeper networks perform better and that performance can be improved even with a naïve, multi-head architecture.
As a discriminative method, contrastive learning in latent space [12] can effectively learn semantic similarity and can address the tendency of generative methods to ignore abstract semantic information [8]. Recently, contrastive learning has shown great promise, and unprecedented results have been achieved in two-dimensional visual representation learning. Chen et al. proposed the contrastive learning method SimCLR [8]; on ImageNet, a linear classifier trained on SimCLR features with a ResNet-50 backbone reaches a top-1 accuracy of 76.5%, close to supervised performance, and even surpasses some supervised methods on certain data sets. However, SimCLR relies on very large batches and deeper networks, which consumes considerable computing resources. Kaiming He et al. proposed the contrastive learning method MoCo [13], which views contrastive learning as dictionary look-up. It builds a dynamic dictionary with a queue and a moving-average encoder, which makes it possible to construct a large and consistent dictionary that supports training with small batches and has promoted the development of contrastive learning. Xie et al. [14] proposed PointContrast, which performs contrastive learning on a large number of three-dimensional point cloud scene data sets and achieves the best segmentation and classification results on six different indoor/outdoor and real/synthetic benchmarks; experiments show that the learned representations generalize across domains.
In the field of two-dimensional images, ResNet [15] is mainly used as the backbone network of contrastive learning frameworks for feature extraction. Point clouds, however, are unordered: the ordering of the points does not affect the properties of the object. A network that extracts point cloud features must therefore be a symmetric function, so the commonly used ResNet cannot be applied to point clouds directly. Qi et al. proposed PointNet [16], which uses a symmetric function to handle the disorder of point clouds. Subsequently, Qi et al. proposed PointNet++ [17], which extracts local features of fine geometric structures from point neighborhoods and thus solves the local feature extraction problem that PointNet cannot. Since PointNet++ was proposed, deep learning methods based on it have appeared widely in three-dimensional point cloud processing. Most unsupervised visual representation learning frameworks for point clouds also use networks similar to PointNet++ as the backbone, such as LatentGAN [7], 3DAAE [18], SO-Net [19], and other methods.
However, the scalar features used by PointNet++ do not encode the spatial arrangement of the point cloud data and do not consider the geometric relationships between parts [20], which are very important for interpreting and describing three-dimensional shapes. The capsule network proposed by Hinton et al. [21] replaces the scalar features of a CNN with vector features. The vector features preserve feature information across different dimensions through a dynamic routing method, which can preserve the geometric relationships between components in an image. However, dynamic routing in the capsule network harms robustness to affine transformations of the input [22]. The inputs of a contrastive learning framework are positive and negative sample pairs generated through affine transformations; therefore, dynamic routing in the capsule network degrades the performance of contrastive learning.
To solve the above problems, this paper proposes a contrastive learning method for three-dimensional point cloud visual representation. Combining the ideas of PointNet++ and the capsule network, a self-attention point cloud capsule network is designed as the backbone of the contrastive learning framework. The FM (factorization machines) routing algorithm [23] replaces the traditional dynamic routing algorithm so that the network can follow the geometric relationships between components, giving better learning capability and generalization. Because the FM routing algorithm is non-iterative, it also speeds up network computation. A self-attention mechanism [24] is introduced to correlate the capsules fed into the FM routing algorithm, which, combined with FM routing, improves the network's ability to learn three-dimensional point cloud representations. By compressing the digital capsule layer output by the backbone network, the capacity of the queue for storing capsules is increased, and the compressed capsules pay more attention to the transformations applied to the sample, which improves the generalization ability of the model. Considering the equivariance of the capsule network, this paper also proposes a Jaccard contrast loss that uses the Jaccard similarity coefficient to measure the similarity between features, which helps the model distinguish positive and negative samples and improves the performance of the contrastive learning method.
3. Experiment and Results
3.1. Experimental Environment
This paper uses three data augmentation methods (rotation, random movement of the point cloud, and random scaling of the point cloud size) to generate positive and negative sample pairs. The rotation angle is chosen from {60°, 120°, 180°, 240°, 300°}, the random movement distance is drawn from [−0.1, 0.1], and the scaling factor is drawn from [0.8, 1.2].
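As an illustration, the following is a minimal NumPy sketch of these three augmentations under the stated parameter ranges; the function names and the choice of rotating about the z-axis are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def random_rotate_z(points, angles_deg=(60, 120, 180, 240, 300)):
    """Rotate an (N, 3) point cloud about the z-axis by an angle chosen from the set."""
    theta = np.deg2rad(np.random.choice(angles_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def random_move(points, max_shift=0.1):
    """Shift the whole cloud by a random offset in [-max_shift, max_shift] per axis."""
    return points + np.random.uniform(-max_shift, max_shift, size=(1, 3))

def random_scale(points, low=0.8, high=1.2):
    """Scale the cloud by a single random factor in [low, high]."""
    return points * np.random.uniform(low, high)

def augment(points):
    """Apply the three augmentations in sequence to produce one view of the sample."""
    return random_scale(random_move(random_rotate_z(points)))
```

Two independent calls to augment on the same cloud would then yield the two views that form a positive pair.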
To speed up convergence and prevent vanishing gradients, batch normalization and the ReLU activation function are applied at each MLP layer. The model is implemented with the PyTorch framework on a server equipped with two NVIDIA 2080 Ti graphics cards with 8 GB of video memory each. Limited by the GPU video memory, the number of randomly sampled points is 2048. The Adam optimizer, which has small memory requirements, is used for pre-training with a learning rate of 0.001, a batch size of 8, and 100 training epochs. The maximum length of the feature queue is 640, so the queue stores at most 640 encoded features. Based on extensive experiments, the network performs best when the momentum update parameter is 0.998 and the temperature coefficient in the Jaccard contrast loss is 0.07.
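For reference, the following is a minimal PyTorch sketch of the optimizer and momentum-encoder configuration described above, following the generic MoCo-style recipe; the stand-in encoders and variable names are placeholders, not the paper's actual self-attention capsule backbone.

```python
import torch
import torch.nn as nn

LR = 1e-3          # Adam learning rate used for pre-training
MOMENTUM = 0.998   # momentum coefficient for the key-encoder update
TEMPERATURE = 0.07 # temperature in the Jaccard contrast loss
QUEUE_LEN = 640    # maximum number of features kept in the queue

# Stand-in encoders; the real backbone is the self-attention capsule network.
encoder_q = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 64))
encoder_k = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 64))
encoder_k.load_state_dict(encoder_q.state_dict())  # start from identical weights

optimizer = torch.optim.Adam(encoder_q.parameters(), lr=LR)

@torch.no_grad()
def momentum_update(q, k, m=MOMENTUM):
    """Key encoder follows the query encoder as an exponential moving average."""
    for p_q, p_k in zip(q.parameters(), k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```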
The pre-training data set is ShapeNetCore, which covers 55 categories and contains about 57,000 three-dimensional models. With this setting, one training epoch on ShapeNetCore takes about 41 min, and 100 epochs take 68 h. After pre-training, the weights of the contrastive learning network are frozen and the model is transferred to two downstream tasks: shape classification and part segmentation.
3.2. Performance of Representation Space
Tongzhou Wang et al. [26] proposed that alignment and uniformity are the key properties for contrastive learning to learn image representations effectively. Alignment measures how close the features of a positive pair are, and uniformity measures how uniformly the features are distributed. This paper uses these two properties to evaluate the quality of the representation space.
We visualize the alignment and uniformity of the learned features on ModelNet40. ModelNet40 contains 13,843 models in 40 categories, divided into 9843 training samples and 3991 test samples. Positive sample pairs are generated from the ModelNet40 validation set with the mixture of the three data augmentation methods used in this paper and are fed to the trained contrastive learning network to compute features. Alignment is visualized with the distance between the features of each positive pair. Uniformity is visualized by kernel density estimation on the unit circle: the feature distribution is plotted, the angle of each point on the unit circle is computed (for each point (x, y), arctan2(y, x) is used), and a von Mises-Fisher kernel density estimate is used to draw the probability distribution on the circle.
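As a sketch of how these two quantities can be computed for 2-D normalized features, the following uses a hand-rolled von Mises kernel density estimate; the function names and the concentration parameter kappa are illustrative choices, not values taken from the paper.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function, normalizes the von Mises kernel

def alignment(feat_a, feat_b):
    """Mean Euclidean distance between the features of each positive pair (lower is better)."""
    return np.linalg.norm(feat_a - feat_b, axis=1).mean()

def angular_density(features_2d, kappa=20.0, grid_size=360):
    """Von Mises kernel density of feature angles on the unit circle (flatter is more uniform)."""
    feats = features_2d / np.linalg.norm(features_2d, axis=1, keepdims=True)
    angles = np.arctan2(feats[:, 1], feats[:, 0])   # angle of each feature on the circle
    grid = np.linspace(-np.pi, np.pi, grid_size)
    # Average von Mises kernels centred at each feature angle.
    kernels = np.exp(kappa * np.cos(grid[None, :] - angles[:, None])) / (2 * np.pi * i0(kappa))
    return grid, kernels.mean(axis=0)
```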
By visualizing alignment and uniformity, the representation spaces learned by the generative method PointCapsNet [20] and by our method are compared, as shown in Figure 4 and Figure 5. Our method keeps the distances between positive sample pairs small, and the feature distribution on the unit circle is very uniform. Both the alignment and the uniformity of our features are better than those of PointCapsNet, which shows that our method learns better three-dimensional representations than this generative method.
3.3. Classification Performance
The performance of unsupervised visual representation learning is reflected in downstream tasks. The trained model is transferred to the classification task, and its performance is evaluated by classification accuracy. The digital capsule layer of the self-attention capsule network is reshaped into a one-dimensional feature vector and used as the input of a linear SVM classifier, which is trained on ModelNet40.
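A minimal sketch of this linear evaluation protocol using scikit-learn; extract_features is a hypothetical helper standing in for the frozen backbone, and the flattening of the digital capsule output is shown only schematically.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def extract_features(backbone, point_clouds):
    """Hypothetical helper: run the frozen backbone and flatten the digital capsule
    layer of each sample into a one-dimensional feature vector (NumPy arrays assumed)."""
    capsules = backbone(point_clouds)                 # e.g. (B, num_capsules, capsule_dim)
    return capsules.reshape(capsules.shape[0], -1)    # (B, num_capsules * capsule_dim)

def linear_evaluation(train_feats, train_labels, test_feats, test_labels):
    """Train a linear SVM on frozen features and report test accuracy."""
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(train_feats, train_labels)
    return accuracy_score(test_labels, svm.predict(test_feats))
```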
To evaluate the classification performance of the proposed method, we compare it with the commonly used unsupervised learning methods VConv-DAE [27], LatentGAN [7], FoldingNet [6], and PointCapsNet [20], as well as with the classical supervised methods PointNet [16] and PointNet++ [17]. The results in Table 1 show that the classification performance of the proposed method on ModelNet40 is better than that of the commonly used unsupervised methods and is close to that of classical supervised learning.
Unsupervised visual representation learning is not necessarily better than training the model directly with supervision. To verify the effectiveness of our method, we also evaluate the backbone network without contrastive pre-training, with the linear SVM classifier trained directly on ModelNet40 in a supervised manner. The results in Table 1 show that contrastive pre-training with our method increases the classification accuracy of the backbone network by 3.3%, which shows that the contrastive learning of this model effectively improves the quality of the visual representation.
The same protocol is used to train an SVM on a different data set, ShapeNetParts. ShapeNetParts contains 16,881 models in 16 categories, divided into 12,149 training samples and 2874 test samples; only 20% of the training samples are used for training. The results in Table 2 show that the model achieves state-of-the-art shape classification accuracy on ShapeNetParts, an increase of 2.6% over the best comparison method, MulUnsupervised [1]. This shows that the visual representation obtained by contrastive learning can handle smaller data sets and generalizes well to new tasks.
A common setting for studying unsupervised classification is to pre-train on a large amount of unlabeled data and perform transfer learning on a small amount of labeled data. Our experiments match this setting: the unlabeled ShapeNet data set used for contrastive learning is relatively large, with about 57,000 samples, while the labeled ModelNet data set used for transfer learning is relatively small, with about 13,800 samples. Since manually labeled data are usually difficult to obtain, we test how the performance of the model degrades with fewer labels. The ShapeNet data set is still used to train the contrastive learning network; then only a% of the training data in ModelNet40 is used to train the linear SVM, where a is 1, 2, 5, 10, 20, or 30. The test set is the full ModelNet test set.
The results of this experiment are shown in Figure 6. When only 1% of the labeled samples are used, the classification accuracy is still above 50%; when 10% of the labeled data is used, the accuracy exceeds 80%, which is close to most of the unsupervised learning methods in Table 1. This shows that the features obtained by the capsule network are linearly separable, that the amount of labeled data required to train the SVM can be small, and that the model still performs very well with a small amount of labeled data.
3.4. Part Segmentation Performance
Part segmentation is a fine-grained, point-wise classification task; its goal is to predict the part category label of each point in a given shape. The data set used here is ShapeNetParts, which contains 16,881 objects in 16 categories, divided into 12,149 training objects, 2874 test objects, and 1858 validation objects. Each object consists of 2 to 6 parts, and there are 50 distinct parts across all categories.
This paper uses mIoU as the evaluation criterion for part segmentation. For each shape in category A, the shape mIoU is computed as follows: for each part type in category A, the IoU between the ground truth and the prediction is calculated, and the IoUs of all part types in category A are averaged to give the mIoU of the shape. The mIoU of a category is the average of the mIoUs of all shapes in that category.
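A minimal sketch of this per-shape and per-category mIoU computation; the convention that a part absent from both the ground truth and the prediction counts as IoU 1 is a common choice assumed here, not something stated in the paper.

```python
import numpy as np

def shape_miou(pred, gt, part_ids):
    """mIoU of one shape: average IoU over all part types of the shape's category.

    pred, gt: (N,) arrays of per-point part labels; part_ids: part labels of the category.
    """
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        # A part missing from both prediction and ground truth counts as IoU 1 (assumed convention).
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def category_miou(shape_mious):
    """Category mIoU: mean of the per-shape mIoUs within the category."""
    return float(np.mean(shape_mious))
```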
The network is pre-trained on ShapeNetCore with the method in this paper; 1% and 5% of the ShapeNetParts training samples are randomly sampled, the digital capsule layer of the self-attention capsule network is replicated 32 times to obtain 2048 per-point features, a four-layer MLP (2048, 4096, 1024, 50) is trained on these per-point features, and the test data are evaluated.
Figure 7 shows the visualized results of part segmentation.
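Under our reading of this description, the segmentation head tiles the capsule feature to all 2048 points and applies a shared four-layer MLP (2048, 4096, 1024, 50) per point; the tensor shapes and the use of shared 1-D convolutions in the sketch below are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PartSegHead(nn.Module):
    """Per-point part classifier (sketch). Feature sizes are assumptions based on the
    description: a 2048-dim feature per point, classified by an MLP (2048, 4096, 1024, 50)."""
    def __init__(self, num_parts=50):
        super().__init__()
        # Shared per-point MLP implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv1d(2048, 4096, 1), nn.BatchNorm1d(4096), nn.ReLU(),
            nn.Conv1d(4096, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Conv1d(1024, num_parts, 1),
        )

    def forward(self, capsule_feat, num_points=2048):
        # capsule_feat: (B, 2048) flattened/tiled digital capsule feature.
        per_point = capsule_feat.unsqueeze(-1).expand(-1, -1, num_points)  # (B, 2048, N)
        return self.mlp(per_point)  # (B, num_parts, N) per-point part logits
```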
To evaluate the part segmentation performance of the proposed method, we compare it with the unsupervised methods SO-Net [19], PointCapsNet [20], and MulUnsupervised [1]. The results in Table 3 show that when 1% and 5% of the training data are used for part segmentation, the model outperforms the best comparison method, indicating that it retains very good part segmentation performance with a small amount of labeled data.
Table 4 compares the part segmentation results of this model with those of supervised learning models. The mIoU achieved by this model is 2.2% lower than that of the classic method PointNet and 6.2% lower than that of the best supervised model, narrowing the gap with supervised models.
3.5. Comparative Experiment
The factors that may affect the performance of the method include the data augmentation mode, the contrastive loss, and the routing algorithm. This paper analyzes these factors through comparative experiments and also compares the computational complexity of the method with that of other unsupervised methods.
3.5.1. Data Augmentation Methods
In order to study the influence of data augmentation methods on the results, three data augmentation methods (random rotation, random scaling, and random movement) were applied to the input point cloud separately and then mixed for training.
Figure 8 shows the impact of the data augmentation methods on the classification accuracy of ModelNet40, where A is random rotation, B is random scaling, C is random movement, A + B is the combination of random rotation and random scaling, A + C is the combination of random rotation and random movement, B + C is the combination of random scaling and random movement, and A + B + C is the combination of all three operations.
As shown in Figure 8, random rotation works best when only a single data augmentation operation is used. When random rotation is combined with either of the other two operations, the accuracy exceeds 80%, while the combination of random scaling and random movement reaches only about 70%. When all three data augmentation methods are combined, the classification accuracy improves slightly, which shows that training with multiple data augmentation methods outperforms training with a single one. The classification accuracy when using only random scaling or random movement is low because the original point cloud and the transformed point cloud have similar spatial distributions. When random rotation is combined with the other transformations, it is easier for the network to distinguish positive and negative samples. Therefore, to learn generalizable features, it is important to combine random rotation with the other data augmentation methods.
3.5.2. Contrastive Loss
To verify the advantage of the Jaccard contrast loss proposed in this paper, we compare it with the contrastive losses used in MoCo [13] and SimCLR [8]. The loss function in MoCo uses the vector dot product to represent the similarity between capsules, and the loss function in SimCLR uses cosine similarity. The contrastive loss of our model is replaced with each of these two loss functions and trained separately, and the trained models are then evaluated on the average distance between positive sample features and on the classification task. The experimental setup is the same as in Section 3.2 and Section 3.3.
Table 5 shows the impact of the different contrastive losses on the average distance between positive sample features and on the classification accuracy. Compared with the loss functions of MoCo and SimCLR, the Jaccard contrast loss proposed in this paper brings the features of positive samples closer together, improves the ability of the contrastive learning method to distinguish positive and negative sample features, and increases the classification accuracy by 1.4%, indicating that the Jaccard contrast loss is better suited to the backbone network and improves the performance of contrastive learning.
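To make the comparison concrete, the sketch below contrasts the three similarity functions inside an InfoNCE-style loss with temperature 0.07. The extended Jaccard (Tanimoto) coefficient is used as one common generalization of the Jaccard similarity to real-valued vectors; this is an assumption for illustration, not necessarily the exact form of the Jaccard contrast loss defined in this paper.

```python
import torch
import torch.nn.functional as F

def dot_sim(q, k):
    """MoCo-style similarity: plain dot product."""
    return q @ k.t()

def cosine_sim(q, k):
    """SimCLR-style similarity: cosine of the angle between features."""
    return F.normalize(q, dim=1) @ F.normalize(k, dim=1).t()

def jaccard_sim(q, k):
    """Extended Jaccard (Tanimoto) similarity, one generalization of the Jaccard
    coefficient to real-valued vectors: q.k / (|q|^2 + |k|^2 - q.k)."""
    dot = q @ k.t()
    sq_q = (q ** 2).sum(dim=1, keepdim=True)      # (B, 1)
    sq_k = (k ** 2).sum(dim=1, keepdim=True).t()  # (1, K)
    return dot / (sq_q + sq_k - dot)

def contrastive_loss(q, k_pos, queue, sim=jaccard_sim, tau=0.07):
    """InfoNCE-style loss: matched pairs are positives, queue entries are negatives."""
    pos = sim(q, k_pos).diag().unsqueeze(1)   # (B, 1) similarity to the positive key
    neg = sim(q, queue)                       # (B, K) similarities to queued negatives
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```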
3.5.3. Routing Algorithm
To verify the advantage of the FM routing algorithm used in this paper, we compare it with other commonly used routing algorithms. Keeping the contrastive learning architecture unchanged, the different routing algorithms (three-iteration dynamic routing, three-iteration EM routing, and FM routing) are each trained on ShapeNetCore for 100 epochs. The trained models are then evaluated on ModelNet40 classification using the method described in Section 3.3.
Table 6 shows the impact of the different routing methods on time consumption and classification accuracy. In terms of time, one training epoch with the FM routing algorithm takes 41 min, 12 min less than the best comparison method, which shows that its computational complexity is lower than that of dynamic routing and EM routing. In terms of classification accuracy, the FM routing algorithm achieves the best results of the three methods, indicating that it enables the model to learn better visual representations.
3.5.4. Computational Complexity
In addition to its good performance on the standard data sets, the algorithm has advantages in model parameter count and training speed because it uses a non-iterative routing algorithm and compresses the digital capsule layer. Table 7 compares the model parameters and training time of the proposed algorithm with those of two point cloud unsupervised learning methods. The tests were conducted under the same experimental setting with a batch size of four, and the time to train one epoch on the ShapeNet data set was measured.
The comparison in Table 7 shows that, compared with the best comparison method, PointCapsNet, the parameter count of our method is reduced by 28.3% and the training time by 17.8%. The algorithm effectively reduces the model parameters and the amount of computation, which is advantageous in practical applications. Benefiting from the lower parameter and computation requirements, and when making full use of the 8 GB of video memory, the algorithm has higher concurrency and can process point cloud models with a batch size of 10, which demonstrates its efficiency.