Article

Point CNN: 3D Face Recognition with Local Feature Descriptor and Feature Enhancement Mechanism

School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(18), 7715; https://doi.org/10.3390/s23187715
Submission received: 25 July 2023 / Revised: 23 August 2023 / Accepted: 5 September 2023 / Published: 6 September 2023
(This article belongs to the Special Issue Advanced Computer Vision Systems 2023)

Abstract

Three-dimensional face recognition is an important part of the field of computer vision. Point clouds are widely used in 3D vision due to their simple mathematical expression. However, the disorder of the points means they lack the ordered indexing that convolutional neural networks rely on. In addition, point clouds lack detailed textures, which makes the facial features easily affected by expression or head pose changes. To solve the above problems, this paper constructs a new face recognition network, which mainly consists of two parts. The first part is a novel operator based on a local feature descriptor that realizes fine-grained feature extraction and the permutation invariance of point clouds. The second part is a feature enhancement mechanism that enhances the discrimination of facial features. In order to verify the performance of our method, we conducted experiments on three public datasets: CASIA-3D, Bosphorus, and Lock3Dface. The results show that the accuracy of our method is improved by 0.7%, 0.4%, and 0.8% compared with the latest methods on these three datasets, respectively.

1. Introduction

Face recognition, as an important part of the field of computer vision, is widely used in daily life. However, most related studies are based on common RGB images, and it is difficult for common digital cameras to obtain effective RGB images under large illumination changes [1]. Point cloud acquisition devices, such as lidar and Kinect (which is based on infrared), often do not rely on visible light, which makes this kind of data robust to illumination changes and applicable to some special scenes. The mathematical expression of a point cloud is simple (a group of points in 3D space). However, the disorder of the point clouds makes it difficult for them to have an ordered index like ordinary 2D images, so it is difficult to use deep learning networks for feature extraction [2]. Deep learning is widely used in various research fields due to its powerful perception. Refs. [3,4] applied deep learning to real engineering problems and achieved outstanding performance. Since the pioneering work of Qi et al. [5], who used a symmetric function to construct PointNet and thereby solved the disorder of point clouds in deep learning, many networks based on PointNet have been proposed, such as PointNet++ [6], PPFNet [7], PointCNN [8], etc. Subsequently, point clouds have also been widely used in face analysis tasks, such as face detection, pose estimation, face recognition and verification, etc. Particularly in the field of face recognition, a large number of methods have been proposed. However, due to the lack of detailed textures in point clouds, the fine-grained expression of facial features is still the focus of research in this field. Relying on powerful perception capabilities, convolutional neural networks (CNNs) have made breakthroughs in the field of 2D images. In order to make point clouds effectively utilize the perceptual power of CNNs, Li et al. [8] constructed a convolution operator that realizes the permutation invariance of the disordered points through a permutation matrix. Based on [7,8], we combine the convolution operator with a local feature descriptor to construct a new operator, ψconv, to extract fine-grained features of point cloud faces. Furthermore, we propose a novel feature enhancement mechanism to further enhance the discrimination of facial features and introduce a triplet loss function based on the feature enhancement mechanism for efficient 3D face recognition.
In order to verify the effectiveness of our method, we conduct experiments on three public datasets: CASIA-3D, Bosphorus, and Lock3Dface.
The main novelty and contribution of this paper are summarized as follows:
  • We construct a new operator based on local feature descriptors to achieve fine-grained feature extraction from disordered point clouds;
  • A new feature enhancement mechanism is introduced, which effectively improves the accuracy of the point face recognition;
  • The experimental results on public datasets prove that the accuracy of our proposed method outperforms current advanced algorithms. Additionally, our method can better deal with the interference of facial expressions, partial occlusions, and head pose changes.

2. Related Works

In this section, we briefly review some typical and relevant works in the field of 2D face recognition and 3D face recognition.

2.1. Two-Dimensional Face Recognition

In recent years, the most widely used face recognition methods have mainly been based on 2D RGB images. Schroff et al. [9] used a convolutional neural network to extract features and introduced the triplet loss function to build the famous FaceNet for RGB face recognition, which outperforms humans in accuracy. In order to deal with occlusion and illumination variations, Yang et al. [10] presented a 2D image matrix-based error model (NMR) for face representation and classification. Focusing on the illumination change challenge, Guo et al. [11] proposed a deep network model that takes both visible light images and near-infrared images into account to perform face recognition. Unlike conventional feature descriptors, Lu et al. [12] proposed a new joint feature learning (JFL) approach to automatically learn feature representations from raw pixels for face recognition. Deng et al. [13] proposed an additive angular margin loss to obtain highly discriminative features for face recognition. Aiming at inferring genuine emotions through micro-expression recognition, Zong et al. [14] designed a hierarchical spatial division scheme for spatiotemporal descriptor extraction. Wenhui et al. [15] studied the combination of 2D discriminant analysis and 1D discriminant analysis and proposed a stable framework, MMC + LDA, for face recognition. Zhang et al. [16] proposed a high-order local pattern descriptor (LDP) for face recognition, which achieves good performance under various conditions.

2.2. Three-Dimensional Face Recognition

With the development of 3D sensors, more and more methods have been proposed for 3D face analysis. Zhang et al. [17] proposed a general approach to deal with the 3D face recognition problem by making use of multiple key point descriptors (MKD) and the sparse representation-based classification (SRC). Chouchane et al. [18] presented an automatic face recognition system in the presence of illumination, expressions, and pose variations based on 2D and 3D information. In [19], Szegedy et al. explored ways to scale up CNNs that aimed at utilizing the added computation for computer vision. In order to optimize deeper neural networks for image recognition, He et al. [20] presented a residual learning framework to ease the training of networks. Based on local derivative pattern (LDP), Soltanpour et al. [21] proposed a descriptor for 3D face recognition. Focusing on the intrinsic invariance to pose and illumination changes, Mu et al. [22] designed a lightweight yet powerful CNN with low-quality data to achieve an efficient and accurate deep learning solution. Dutta et al. [23] constructed a sparse principal component analysis network (SpPCANet) to extract 3D face features for recognition.
In the field of 3D vision, since PointNet [5] solved the disorder of point clouds in deep learning, this kind of data has been widely used due to its simple mathematical expression, and more algorithms have been proposed for 3D face recognition. Bhople et al. [24] combined PointNet and a Siamese network for similarity learning of point cloud faces and achieved encouraging performance in the field of face recognition. Atik et al. [25] mapped point clouds to feature maps and used 2D methods to solve 3D face recognition. In order to enhance the robustness of a 3D point cloud face recognition system to multiple expressions and multiple poses, Gao et al. [26] used point clouds as input and constructed a deep learning feature extraction network, ResPoint. Yu et al. [27] modified PointNet and supplemented a few-data-guided learning framework based on a Gaussian process morphable model for 3D face recognition. Cao et al. [28] utilized PointNet++ and RoPS local descriptors to extract local features of a 3D face. In order to deal with the lack of large-scale 3D facial data, Zhang et al. [29] established a statistical 3D morphable model-based 3D face synthesizing strategy to generate large-scale unreal facial scans to train the proposed network. Yu et al. [30] proposed a meta learning-based adversarial training (MLAT) algorithm for deep 3D face recognition on point clouds, which consists of two alternating modules: adversarial sample generation for 3D face data augmentation and meta learning-based deep network training. Jiang et al. [31] used two weight-shared encoders and a feature similarity loss to guide the encoders to obtain discriminative face representations and achieved good performance on different datasets. Apart from face recognition, point clouds are also used for other 3D face analysis tasks such as face verification and head pose estimation [1,2,32].

3. Methods

The convolutional neural network (CNN) is highly invariant to image translation, scaling, and tilting through multi-layer feature extraction and regional weight sharing [8]. However, due to the disorder of the point clouds, a CNN cannot directly perform feature extraction on them. In this section, we first introduce a local feature descriptor for fine-grained feature representation and then introduce ψconv for the convolution operation on point clouds. Thirdly, based on ψconv, we construct a new convolutional neural network for facial feature extraction. Fourthly, a new feature enhancement mechanism is proposed to enhance the discrimination of facial features. Finally, based on the feature enhancement mechanism, we adopt a triplet loss function for training and construct an efficient face recognition network.

3.1. Local Feature Descriptor

In this part, inspired by [7], in order to obtain the fine-grained representation of features, we use a hand-crafted descriptor to describe the local geometric features of the point clouds.
For a point pair (p_i, p_j), the geometric relationship between the two points is represented by a four-dimensional descriptor:
$$ \psi_{ij} = \left( \|d\|_2,\ \angle(n_i, d),\ \angle(n_j, d),\ \angle(n_i, n_j) \right) \quad (1) $$
where \|d\|_2 represents the Euclidean distance between the two points (d = p_j − p_i), n_i and n_j are the normal vectors of p_i and p_j, respectively, and ∠(·,·) is the angle between two vectors:
$$ \angle(v_i, v_j) = \tan^{-1}\left( \frac{\|v_i \times v_j\|}{v_i \cdot v_j} \right) \quad (2) $$
where ∠(v_i, v_j) ∈ [0, π), × represents the cross product, and · represents the dot product. As described above, ψ_ij describes in detail the geometric relationship between two points through normal vectors and angles.
For a local region {p_1, p_2, p_3, ..., p_n}, we choose a center point p_i, which forms a total of n point pairs (including the pair (p_i, p_i)); the geometric feature of this local region is expressed as follows:
$$ F_i = \left( p_1, n_1, p_2, n_2, \ldots, p_j, n_j,\ \psi_{i1}, \psi_{i2}, \ldots, \psi_{ij} \right) \quad (3) $$
where p_j is a point in the local region and n_j is the normal vector of point p_j. ψ_ij is the four-dimensional descriptor between p_j and the center point p_i. As shown in Figure 1, F_i uses all point pairs with the center point p_i to describe the spatial geometric characteristics of the local region.
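To make the descriptor concrete, the following is a minimal NumPy sketch of Formulas (1)–(3). It assumes that the point normals have already been estimated (e.g., by PCA over local neighborhoods); the function names and the stacking of each point, its normal, and its pair descriptor into the rows of F_i are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def angle(v1, v2):
    # Formula (2): angle in [0, pi] computed from the cross and dot products.
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

def psi(p_i, n_i, p_j, n_j):
    # Formula (1): four-dimensional point pair descriptor.
    d = p_j - p_i
    return np.array([np.linalg.norm(d),
                     angle(n_i, d),
                     angle(n_j, d),
                     angle(n_i, n_j)])

def local_feature(points, normals, center_idx):
    # Formula (3): every point, its normal, and its pair descriptor with the center.
    p_i, n_i = points[center_idx], normals[center_idx]
    rows = [np.concatenate([p_j, n_j, psi(p_i, n_i, p_j, n_j)])
            for p_j, n_j in zip(points, normals)]
    return np.stack(rows)                      # shape: (n, 3 + 3 + 4)

# Toy usage: 8 random points with unit-length random normals.
pts = np.random.rand(8, 3)
nrm = np.random.rand(8, 3)
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
print(local_feature(pts, nrm, center_idx=0).shape)   # (8, 10)
```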

3.2. ψConv Operator

As mentioned above, because of the disorder of the point clouds, the convolution operation cannot be applied to them directly. To deal with this problem, Li et al. [8] trained a permutation matrix through a multi-layer perceptron (MLP) to realize the permutation invariance of the point clouds. As shown in Figure 2, the points in Figure 2a,b have the same distribution but their orders are different.
In Figure 2, f_a, f_b, f_c, and f_d represent the features of the corresponding points and the number represents the order of each point. We use the same convolution kernel K = [k_α, k_β, k_γ, k_δ]^T to operate on the above two point clouds:
$$ F_a = Conv\left(K, [f_a, f_b, f_c, f_d]^T\right) \quad (4) $$
$$ F_b = Conv\left(K, [f_c, f_a, f_b, f_d]^T\right) \quad (5) $$
$$ F_a \neq F_b \quad (6) $$
where F_a and F_b denote the convolution results for the point clouds in Figure 2a and Figure 2b, respectively.
As shown above, the two sets of point clouds have the same distribution, but the convolution results are different. As shown in Figure 3, in order to make the convolution result only related to the distribution but not to the order, we use a permutation matrix to adjust the order of the points.
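As a toy illustration of why a permutation matrix resolves this, the snippet below treats the convolution as a dot product with the kernel (a deliberate simplification of Formulas (4)–(6)); the kernel and feature values are made up for the demonstration.

```python
import numpy as np

K = np.array([0.5, -1.0, 2.0, 0.3])            # convolution kernel [k_alpha .. k_delta]
f = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}   # per-point features

order_1 = np.array([f["a"], f["b"], f["c"], f["d"]])   # ordering of Figure 2a
order_2 = np.array([f["c"], f["a"], f["b"], f["d"]])   # ordering of Figure 2b

print(K @ order_1, K @ order_2)   # different results for the same point set

# A permutation matrix that maps order_2 back to order_1 restores equality.
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
print(np.allclose(K @ order_1, K @ (P @ order_2)))     # True
```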
Based on the local feature descriptor and the permutation matrix, we construct a new operator, ψconv, which achieves permutation invariance and fine-grained feature extraction for a local region of the point clouds. The ψconv operator is given in Algorithm 1 below:
Algorithm 1  ψconv operator
Input: P, p, K
Output: F_P
1: P* ← P − p                        ▷ Local coordinate transformation.
2: {ψ_1, ψ_2, ..., ψ_n} ← (P*, p)    ▷ Encode point pairs with the descriptor ψ_ij.
3: F_l ← [ψ, P*]                     ▷ Local feature descriptor.
4: F_β ← PointNet(F_l)               ▷ PointNet to extract local geometric features.
5: F_α ← MLP_α(P*)                   ▷ MLP_α performs point-by-point feature extraction.
6: F* ← [F_β, F_α]                   ▷ Concatenate F_β and F_α.
7: χ ← MLP_χ(P*)                     ▷ Obtain the weight matrix χ through MLP_χ.
8: F_χ ← χ × F*                      ▷ Achieve feature permutation invariance.
9: F_P ← Conv(K, F_χ)                ▷ Feature extraction using the convolution kernel K.
The input of ψconv is the set of feature points in the local region P = {p_1, p_2, p_3, ..., p_k} and p is the center of P (we take p as the center and use the k-nearest neighbors (KNN) algorithm to sample the nearest k points, each with feature dimension C_1). K represents the convolution kernel and the size of K is k (the size of the convolution kernel is the same as the number of points in the local region).
In the first step, the spatial coordinates of P = {p_1, p_2, p_3, ..., p_k} are transformed into relative coordinates based on the center point p (relative coordinates make the local points translation invariant).
The second step is to encode the point pairs in the local region according to Formula (1).
In the third step, according to Formula (3), the local feature descriptor is used to encode the local geometric feature.
In the fourth step, PointNet is used to extract the local geometric features. The structure of the PointNet is shown in Figure 4.
The PointNet consists of an MLP and a max pooling layer. The MLP has three layers and the number of nodes in each layer is the same, namely C_γ. After feature extraction, we obtain a local feature F_β ∈ ℝ^{C_γ}.
In the fifth step, we use MLP_α(P*) to increase the feature dimension of each point. The structure of MLP_α is shown in Figure 5.
In Figure 5, k is the number of points in the local region and C_1 is the feature dimension of the points. MLP_α has two convolutional layers. Due to the disorder of the points, only a 1 × 1 convolution kernel can be used to increase the dimension of the points (point by point). The numbers of channels of the two convolutional layers are C_2 and C_3, respectively (C_3 is the output dimension and C_2 = (C_1 + C_3)/2).
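A minimal PyTorch sketch of this point-wise MLP is given below. It assumes the k points of a local region are laid out as a 1 × k "image" with C_1 channels so that 1 × 1 convolutions act point by point; the ReLU activations are our assumption, and the layer widths follow the C_2 = (C_1 + C_3)/2 rule above.

```python
import torch
import torch.nn as nn

class MLPAlpha(nn.Module):
    def __init__(self, c1, c3):
        super().__init__()
        c2 = (c1 + c3) // 2               # intermediate width, C2 = (C1 + C3) / 2
        self.net = nn.Sequential(         # two 1x1 convolutions: point-by-point lifting
            nn.Conv2d(c1, c2, kernel_size=1), nn.ReLU(),
            nn.Conv2d(c2, c3, kernel_size=1), nn.ReLU(),
        )

    def forward(self, x):                 # x: (batch, C1, 1, k)
        return self.net(x)                # (batch, C3, 1, k)

x = torch.randn(2, 3, 1, 8)               # 8 neighbor points with 3D coordinates
print(MLPAlpha(3, 8)(x).shape)            # torch.Size([2, 8, 1, 8])
```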
In the sixth step, the high-dimensional feature F_α of each point obtained in the fifth step is concatenated with the local geometric feature F_β obtained in the fourth step (each point receives the same F_β).
In the seventh step, according to [8], we use MLP_χ to train a permutation matrix (as shown in Figure 3, it is only related to the distribution of the points; k is the number of points in the local region) that redistributes the weight of each point to eliminate the influence of different orders. The structure of MLP_χ is shown in Figure 6.
In Figure 6, a fully connected layer (FC) maps the k points (each of dimension Dim) to a k·k vector, FC: (Dim·k) → (k·k), which is reshaped into a k × k matrix. Then, we adopt two layers of depth-wise convolution (DC; unlike a normal convolutional layer, each kernel of a depth-wise convolution is responsible for one channel, and the feature map has the same number of channels as the input layer) and reshape the feature maps; a k × k permutation matrix χ can be obtained:
$$ DC(k \times k \to k \ast k) \to reshape(k \ast k \to k \times k) \to DC(k \times k \to k \ast k) \to reshape(k \ast k \to k \times k) \to \chi \quad (7) $$
Ideally, the permutation matrix is a binary matrix, as shown in Figure 3, but the matrix obtained by MLP_χ is a weight matrix, as shown in Figure 7. The weight matrix can approximate the permutation invariance of the local region.
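The following PyTorch sketch mirrors the Figure 6 pipeline (FC, reshape, two depth-wise convolutions with reshapes). The interpretation of the k × k map as k channels of length k, the "same" padding, and the absence of activations are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class MLPChi(nn.Module):
    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(dim * k, k * k)            # FC: (Dim * k) -> (k * k)
        # Depth-wise convolutions: groups == channels, one kernel per channel.
        self.dc1 = nn.Conv1d(k, k, kernel_size=k, groups=k, padding="same")
        self.dc2 = nn.Conv1d(k, k, kernel_size=k, groups=k, padding="same")

    def forward(self, pts):                            # pts: (B, k, Dim), local coordinates
        x = self.fc(pts.flatten(1)).view(-1, self.k, self.k)   # reshape to k x k
        x = self.dc1(x).view(-1, self.k, self.k)       # DC + reshape
        x = self.dc2(x).view(-1, self.k, self.k)       # DC + reshape -> weight matrix chi
        return x

chi = MLPChi(dim=3, k=8)(torch.randn(4, 8, 3))
print(chi.shape)                                       # torch.Size([4, 8, 8])
```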
In the eighth step, F_χ ← χ × F*, where χ is the weight matrix obtained in the seventh step, F* is the concatenated feature of each point from the sixth step, and "×" represents matrix multiplication. During this step, as shown in Figure 7, the point clouds achieve permutation invariance through the weight matrix χ, and we obtain the weighted features F_χ of the local region.
In the ninth step, we directly perform the convolution operation on F_χ to obtain F_P (the feature map of this local region).
The above steps can be represented as follows:
$$ F_P = \psi Conv(K, p, P) = Conv\left(K,\ MLP_\chi(P - p) \times \left[ MLP_\alpha(P - p),\ PointNet(P - p) \right] \right) \quad (8) $$
where K, p, and P are the inputs of ψconv, Conv(·) is the convolution operation, and PointNet, MLP_α, and MLP_χ are shown in Figure 4, Figure 5, and Figure 6, respectively.
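Putting the steps together, the following is a self-contained PyTorch sketch of ψconv in the spirit of Formula (8). Several simplifications are ours: the descriptor branch only sees relative coordinates (point normals are omitted), MLP_χ is reduced to two fully connected layers instead of the depth-wise variant sketched above, and the activation functions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class PsiConv(nn.Module):
    """Sketch of the psi-conv operator (Algorithm 1 / Formula (8))."""
    def __init__(self, k, c_in, c_gamma, c_alpha, c_out):
        super().__init__()
        self.k = k
        # PointNet branch (Figure 4): shared point-wise MLP followed by max pooling.
        self.pointnet = nn.Sequential(
            nn.Conv1d(c_in, c_gamma, 1), nn.ReLU(),
            nn.Conv1d(c_gamma, c_gamma, 1), nn.ReLU(),
            nn.Conv1d(c_gamma, c_gamma, 1), nn.ReLU(),
        )
        # MLP_alpha branch (Figure 5): point-by-point feature lifting.
        c_mid = (c_in + c_alpha) // 2
        self.mlp_alpha = nn.Sequential(
            nn.Conv1d(c_in, c_mid, 1), nn.ReLU(),
            nn.Conv1d(c_mid, c_alpha, 1), nn.ReLU(),
        )
        # MLP_chi branch: predicts a k x k weight matrix from the local coordinates
        # (reduced here to two FC layers; see the Figure 6 sketch above for the DC variant).
        self.mlp_chi = nn.Sequential(
            nn.Flatten(), nn.Linear(c_in * k, k * k), nn.ReLU(), nn.Linear(k * k, k * k),
        )
        # Final convolution over the k weighted neighbor features.
        self.conv = nn.Conv1d(c_gamma + c_alpha, c_out, kernel_size=k)

    def forward(self, neighbors, center):
        # neighbors: (B, k, c_in); center: (B, c_in)
        rel = neighbors - center.unsqueeze(1)               # step 1: local coordinates
        rel_t = rel.transpose(1, 2)                         # (B, c_in, k)
        f_beta = self.pointnet(rel_t).max(dim=2).values     # steps 2-4: local geometric feature
        f_alpha = self.mlp_alpha(rel_t)                     # step 5: (B, c_alpha, k)
        f_beta = f_beta.unsqueeze(2).expand(-1, -1, self.k) # every point gets the same F_beta
        f_star = torch.cat([f_beta, f_alpha], dim=1)        # step 6: (B, c_gamma + c_alpha, k)
        chi = self.mlp_chi(rel_t).view(-1, self.k, self.k)  # step 7: weight matrix
        f_chi = torch.bmm(f_star, chi)                      # step 8: reweight the k points
        return self.conv(f_chi).squeeze(2)                  # step 9: (B, c_out)

# First-layer configuration from Section 3.3: k = 8 neighbors, 3D input, F_P in R^32.
op = PsiConv(k=8, c_in=3, c_gamma=8, c_alpha=8, c_out=32)
print(op(torch.randn(4, 8, 3), torch.randn(4, 3)).shape)   # torch.Size([4, 32])
```

Stacking such operators, with downsampling between layers, yields the five-layer hierarchy described in Section 3.3.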

3.3. CNN for Feature Extraction

In Section 3.2, we used the local feature descriptor to describe the fine-grained features of a local region of the point clouds and adopted ψconv to weight the disordered points to achieve permutation invariance. In this section, based on ψconv, we construct a convolutional neural network (CNN) for facial feature extraction. The structure of our network is shown in Figure 8.
The network consists of 5 convolutional layers; the parameters of each layer are shown in Figure 8, where K is the number of points in a local region of this layer, C is the output feature dimension, N is the number of feature points passed to the next layer, and D is the dilation rate, which determines the receptive field of the convolutional layer: K × D / N_p (N_p is the number of feature points in the previous layer). For each layer, we also list the dimensions of F_β and F_α, which define the size of the PointNet in Figure 4 and of MLP_α in Figure 5.
Take the first layer as an example. The input point cloud has 1024 points (in our method, according to [32], we use the farthest point sampling (FPS) algorithm to sample 1024 points for each face). We use the k-nearest neighbors (KNN) algorithm to sample the 8 nearest points for each point (each local region has 8 points) and then adopt ψconv to extract F_P (the convolution result, i.e., the feature map) of each local region, where F_β ∈ ℝ^8 and F_α ∈ ℝ^8, which means that C_γ in the PointNet of this layer is 8 (as shown in Figure 4) and C_3 in MLP_α of this layer is 8 (as shown in Figure 5). After the ψconv operation, each local region becomes a feature map F_P ∈ ℝ^32 and is regarded as a new point in ℝ^32 for the next convolutional layer.
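The sampling and grouping that feed the first layer can be sketched in NumPy as below. This is a generic FPS/KNN implementation in the spirit of [32], not the authors' code; the raw scan is random data and the brute-force neighbor search is only for illustration.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from the already chosen set."""
    chosen = [0]                                         # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)    # distance to the chosen set
    for _ in range(n_samples - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def knn_group(points, center_indices, k):
    """For each sampled center, gather its k nearest neighbors (one local region)."""
    return np.array([np.argsort(np.linalg.norm(points - points[c], axis=1))[:k]
                     for c in center_indices])

face = np.random.rand(5000, 3)                  # raw facial scan (illustrative)
centers = farthest_point_sampling(face, 1024)   # 1024 input points, as in the text
regions = knn_group(face, centers, k=8)         # 8-point local regions for layer 1
print(regions.shape)                            # (1024, 8)
```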
After the 5 convolutional layers, the number of feature points changes as follows: 1024 → 1024 → 512 → 256 → 128 → 32. The feature dimension changes as follows: 3 → 32 → 64 → 128 → 256 → 512. In the last convolutional layer, the receptive field K × D / N_p = 1, which means the last 32 feature points "see" the whole region of the previous layer. Then, we use global average pooling to extract the global feature F_g ∈ ℝ^512 from the 32 feature points. According to [9], in order to avoid large differences between facial features, we normalize the features with the 2-norm (L_2):
$$ F_L = L_2(F_g), \quad F_L \in \mathbb{R}^{512} \quad (9) $$

3.4. Feature Enhancement Mechanism

In Section 3.3, we obtained the normalized facial feature F_L ∈ ℝ^512 (the value of each dimension lies between −1 and 1). However, not every dimension plays the same role in the recognition task: the larger the absolute value of a dimension, the more that dimension contributes to recognition; conversely, the smaller the absolute value of a dimension, the less it contributes. Based on the above observation, we propose a new feature enhancement mechanism to enhance the discrimination of the features.
First, we take the absolute value of each dimension of the feature in Formula (10). Then, we use softmax to map the result to a probability distribution between 0 and 1. In this step, according to Formula (11), the numerator of an eigenvalue with a large absolute value grows fast and the numerator of an eigenvalue with a small absolute value grows slowly (since the derivative of e^x is e^x itself, larger values are stretched more). The stretched eigenvalues improve the discrimination of the features. Finally, as shown in Formula (12), we restore the eigenvalues to their original signs.
$$ F_L = \left( x_1, x_2, x_3, \ldots, x_{512} \right) \quad (10) $$
$$ F_S = \left( f_1, f_2, f_3, \ldots, f_{512} \right) = softmax\left( |F_L| \right), \quad f_i = \frac{e^{|x_i|}}{\sum_{k=1}^{512} e^{|x_k|}} \quad (11) $$
$$ F_S^* = \left( f_1^*, f_2^*, f_3^*, \ldots, f_{512}^* \right), \quad f_i^* = \begin{cases} f_i, & x_i \ge 0 \\ -f_i, & x_i < 0 \end{cases} \quad (12) $$
We use softmax to enhance the eigenvalues in F_L but, in order to avoid discarding some of the original information in F_L, we utilize the enhancement parameter λ to linearly combine F_L and the enhanced feature F_S*:
$$ F_T = F_L + \lambda F_S^* \quad (13) $$
In Formula (13), the eigenvalues in F_L and F_S* both lie between −1 and 1, but there is still a large gap in magnitude between them. The parameter λ determines the degree of coupling of the two features and therefore the contribution of the proposed feature enhancement mechanism to F_T. The structure of the feature enhancement mechanism is shown in Figure 9.
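A minimal NumPy sketch of Formulas (10)–(13) follows, reading the softmax as being applied to the absolute values as described above; the random input feature and the choice λ = 50 (the best value found on CASIA-3D, Table 1) are only for illustration.

```python
import numpy as np

def enhance(f_l, lam):
    """Feature enhancement: softmax over |F_L|, restore the signs, add back with weight lam."""
    e = np.exp(np.abs(f_l))
    f_s = e / e.sum()                          # Formula (11)
    f_s_star = np.where(f_l >= 0, f_s, -f_s)   # Formula (12): restore original signs
    return f_l + lam * f_s_star                # Formula (13)

f_g = np.random.randn(512)
f_l = f_g / np.linalg.norm(f_g)                # L2-normalized feature F_L (Formula (9))
f_t = enhance(f_l, lam=50.0)
print(f_t.shape)                               # (512,)
```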

3.5. Triplet Loss Function

In the feature space, the metric distance between objects reflects their similarity; the training goal of the face recognition network is to give samples of the same object a small metric distance and samples of different objects a large one.
In the field of 2D face recognition, FaceNet [9] constructed a triplet loss function and has surpassed humans in accuracy. In this section, we construct a triplet loss based on enhancement parameter λ .
The triplet loss function involves three types of samples: anchor samples (Anchor), positive samples (Positive), and negative samples (Negative). The anchor samples and positive samples come from the same object, while the negative samples come from different objects. As shown in Figure 10, the purpose of the network is to make the metric distance between the anchor sample (F_A) and its farthest positive sample (F_P) smaller than the distance between the anchor sample and its closest negative sample (F_N).
According to Formula (13), the face feature F_T is composed of two parts: F_L and λF_S*. As shown in Formula (12), the mapping from F_L to λF_S* is non-linear. If F_T were used directly for measurement, some original details of the features would be ignored. Therefore, in this section, we construct a new triplet loss according to the parameter λ, and the training goal in Figure 10 can be expressed as follows:
$$ \|F_L^A - F_L^P\|_2^2 + \|\lambda F_S^{*A} - \lambda F_S^{*P}\|_2^2 + \beta < \|F_L^A - F_L^N\|_2^2 + \|\lambda F_S^{*A} - \lambda F_S^{*N}\|_2^2 \quad (14) $$
where F_L^A, F_L^P, and F_L^N represent the F_L feature (Formula (9)) of the Anchor, Positive, and Negative samples, respectively; F_S^{*A}, F_S^{*P}, and F_S^{*N} represent the F_S* feature (Formula (12)) of the Anchor, Positive, and Negative samples, respectively; λ is the enhancement parameter (Formula (13)); and β is a margin, i.e., the minimum gap required between the left-hand and right-hand distances of Formula (14).
In the training process, only samples that do not satisfy Formula (14) are used to optimize the model (the loss of a triplet that satisfies Formula (14) is 0):
$$ \|F_L^A - F_L^P\|_2^2 + \|\lambda F_S^{*A} - \lambda F_S^{*P}\|_2^2 + \beta > \|F_L^A - F_L^N\|_2^2 + \|\lambda F_S^{*A} - \lambda F_S^{*N}\|_2^2 \quad (15) $$
The loss function of our model is defined as follows:
$$ Loss = \sum_{N} \left[ AP - AN + \beta \right]_{+} \quad (16) $$
$$ AP = \|F_{norm}^A - F_{norm}^P\|_2^2 + \|\lambda F_{en}^A - \lambda F_{en}^P\|_2^2 \quad (17) $$
$$ AN = \|F_{norm}^A - F_{norm}^N\|_2^2 + \|\lambda F_{en}^A - \lambda F_{en}^N\|_2^2 \quad (18) $$
where N represents the total number of triplet samples satisfying Formula (15), and F_norm and F_en denote the normalized feature F_L and the enhanced feature F_S*, respectively. During the training process, according to the loss function, Anchor and Positive samples that are far apart are pulled closer, while Anchor and Negative samples that are close together are pushed farther apart. The whole structure of our network is shown in Figure 11.
Ideally, we want the farthest pair of samples of the same object (hard positive pair) to have a smaller metric distance than the closest pair of samples of different objects (hard negative pair). However, with a large number of training samples, it is difficult to find the hard positive pair and the hard negative pair exhaustively, so sample selection is very important for the performance of the model. As described in Section 3.3, each point cloud face is sampled to 1024 points as input. According to [8,9], we organize each batch into mini-batches. For a mini-batch, 40 samples are selected from the same subject and the hard positive pair is found among these 40 samples; the hard negative pair is randomly selected from other subjects. The margin β in Formula (14) is computed in each mini-batch. The size of each batch in our network is fixed at 1800. We train our model with the ADAM optimizer with an initial learning rate of 0.01.
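The loss of Formulas (16)–(18) can be sketched as follows. The batch is random data, β is an arbitrary constant here (the paper computes the margin per mini-batch), and λ = 50 is only an example value.

```python
import numpy as np

def triplet_loss(fl_a, fs_a, fl_p, fs_p, fl_n, fs_n, lam, beta):
    """Triplet loss measuring F_L and lambda * F_S* separately (Formulas (16)-(18))."""
    ap = ((fl_a - fl_p) ** 2).sum(axis=1) + ((lam * fs_a - lam * fs_p) ** 2).sum(axis=1)
    an = ((fl_a - fl_n) ** 2).sum(axis=1) + ((lam * fs_a - lam * fs_n) ** 2).sum(axis=1)
    return np.maximum(ap - an + beta, 0.0).sum()   # only violating triplets contribute

# Toy batch of 16 triplets with 512-dimensional features.
rnd = lambda: np.random.randn(16, 512)
loss = triplet_loss(rnd(), rnd(), rnd(), rnd(), rnd(), rnd(), lam=50.0, beta=0.2)
print(float(loss))
```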

4. Experiments

In this section, we conduct a series of experiments on public datasets to verify the effectiveness of our proposed method. Firstly, we introduce three public datasets: CASIA-3D, Lock3Dface, and Bosphorus. Then, we conduct ablation experiments and explore the influence of the enhancement parameter λ. Finally, we use our best results for comparison with current advanced methods and analyze the comparison results.

4.1. Datasets

CASIA-3D [33]: This dataset was collected with a Minolta Vivid 910 scanner from 123 subjects, with 37 or 38 face scans captured per subject under different facial expressions, head poses, and light intensities. The dataset has a total of 4626 face samples.
We divide the training set and test set of CASIA-3D according to the method in [26]. Only the frontal face and small pose interference samples are used for experiments, including 1784 samples in the training set and 1783 samples in the test set.
Bosphorus: Savran et al. [34] collected this dataset for studying 2D and 3D face analysis tasks. Acquired with a structured-light 3D system, it contains a total of 4666 facial samples from 105 subjects; one-third of the subjects were professional actors and each subject provided 35 types of expressions.
We divide the training set and test set according to the method in [23], in which the training set contains 2403 samples and the test set contains 2263 samples.
Lock3DFace: Zhang et al. [35] collected this dataset by Kinect V2 for 3D face analysis. A total of 5671 samples from 509 subjects were included. According to different scenarios, this dataset is divided into five subsets covering variations in expression (FE), neutral face (NU), occlusion (OC), pose (PS), and time lapse (TM).
We divide the training set and test set of Lock3DFace according to the method in [31], in which 340 subjects are randomly selected as the training set and the remaining 169 subjects are used as the test set.

4.2. Ablation Experiments

In this section, we first investigate the effectiveness of the proposed feature enhancement mechanism and explore enhancement parameter λ in Formula (13).
In this step, we set λ as a fixed value and explore the effect of λ on the accuracy of the network. The results on CASIA-3D, Bosphorus, and Lock3DFace are reported in Table 1, Table 2 and Table 3.
According to Table 1, Table 2 and Table 3, with λ = 50 on CASIA-3D and λ = 55 on Bosphorus and Lock3DFace, our network achieves its best accuracies of 98.9%, 98.9%, and 88.0%, respectively. Figure 12 intuitively presents the relationship between λ (x-axis) and accuracy (y-axis).
As shown in Figure 12, when λ = 0, according to Formula (13), the proposed feature enhancement mechanism is not utilized. With the increase in λ, the feature enhancement mechanism begins to enhance the features and the accuracy of the network begins to increase, which proves that our feature enhancement mechanism can effectively enhance the discrimination of the features and improve the recognition accuracy of the network. As λ continues to increase, the accuracy begins to decline. This is because the contribution of F_L (Formula (9)) becomes small; in this case, features with smaller absolute values are ignored (F_S* mainly enhances the features with large absolute values), which interferes with the accuracy of the network.
Although the best accuracy on the three datasets corresponds to different values of λ, according to Figure 12, when λ ∈ [40, 55], the accuracy curves reach a stable peak. In this interval, F_L and F_S* have the best degree of coupling, which provides the best discrimination for facial features. Following the evaluation method in [9], Table 4 shows the accuracy for λ in the peak interval and lists the mean accuracy with the standard error of the mean. According to Table 4, the accuracy is relatively stable in this interval for each dataset, which shows that our method has good generalization ability.
As the experimental results above show, we explored the relationship between λ and accuracy and demonstrated the effectiveness of the feature enhancement mechanism. In the second step, we continue to examine the distance metric utilized in the triplet loss function. In Section 3.5, instead of taking F_T (Formula (13)) as a whole, we measure the distance with F_L and F_S* separately:
$$ \|F_T^A - F_T^P\|_2^2 \neq \|F_L^A - F_L^P\|_2^2 + \|\lambda F_S^{*A} - \lambda F_S^{*P}\|_2^2 \quad (19) $$
As Formula (19) indicates, the right part is in general not equal to the left part. In order to compare the two measurement methods, we conduct a comparison experiment on the three datasets; the results are listed in Table 5.
In Table 5, L* denotes using the left part of Formula (19) to measure the distance between two features, while L denotes the right part. Table 5 lists the mean accuracy with the standard error of the mean on the three datasets. The results of the two measurement methods are very close, but L is higher. This is because eigenvalues with smaller absolute values in F_T tend to be ignored when F_T is measured as a whole, whereas L treats F_L and F_S* as two kinds of features for measuring the distance between two samples, which captures more of the differences.
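The inequality in Formula (19) is easy to check numerically. The snippet below compares the two distance measurements (L*, on the combined feature F_T, versus L, on the two parts separately) for random feature vectors standing in for F_L and F_S*, with λ chosen arbitrarily.

```python
import numpy as np

lam = 50.0
fl_a, fl_p = np.random.randn(512), np.random.randn(512)   # stand-ins for F_L^A, F_L^P
fs_a, fs_p = np.random.randn(512), np.random.randn(512)   # stand-ins for F_S*^A, F_S*^P

ft_a, ft_p = fl_a + lam * fs_a, fl_p + lam * fs_p          # combined features F_T

l_star = np.sum((ft_a - ft_p) ** 2)                        # left part of (19)
l_sep = np.sum((fl_a - fl_p) ** 2) + np.sum((lam * fs_a - lam * fs_p) ** 2)   # right part

print(l_star, l_sep, np.isclose(l_star, l_sep))            # generally not equal
```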

4.3. Comparison Experiments

The results of ablation experiments prove the effectiveness of our proposed method. In this section, according to [31], we use our best results to conduct comparison experiments with current advanced methods on three public datasets and analyze the results.
Firstly, in order to verify the effectiveness of the proposed ψconv network, we use different point cloud-based networks to extract facial features and perform face recognition under the same settings on CASIA-3D. The accuracy curves in the training process are shown in Figure 13 and the results are listed in Table 6.
As shown in Figure 13, during the training process, our accuracy curve is higher than those of the other methods, and, as listed in Table 6, our method also achieves the best accuracy on the test set, which proves the effectiveness of our ψconv network. Compared with the method in [8], our network has a similar architecture but adds a local feature descriptor. The comparison with [8] proves that our network based on a local feature descriptor can better obtain fine-grained facial features and is more conducive to improving the accuracy of the model.
Table 7, Table 8 and Table 9 list the comparison results with the latest face recognition methods on three datasets, respectively.
The results in Table 7 and Table 8 show that, on different datasets, our accuracy is higher than that of the other methods.
As described in Section 4.1, Lock3DFace has five subsets: expression changes (FE), neutral face (NU), partial occlusion (OC), head pose changes (PS), and time lapse (TM). In order to intuitively verify the performance of our method in different scenarios, we conduct a comparison experiment on the first four subsets: FE, NU, OC, and PS. The results are shown in Table 9. According to the results, on the NU subset, which has no other interference, Jiang et al. [31] achieve the best accuracy, but on the OC and PS subsets our method achieves the best accuracy, which shows that our network copes better with partial occlusions and head pose interference. Figure 14 shows t-SNE examples of our network's recognition results on the three datasets (each dataset selects five subjects for classification and each color represents one subject). As shown in Figure 14, the classification results on CASIA-3D and Bosphorus are more compact, while those on Lock3DFace are more scattered. This is because there are fewer samples per subject in Lock3DFace and the samples contain more interference. Nevertheless, according to Figure 14c, our method can still clearly distinguish different subjects on Lock3DFace.
Apart from accuracy, the time cost is also an important indicator of the efficiency of the network. Table 10 lists the time costs of the different methods, where "Ours *" represents our method without the feature enhancement mechanism. The time cost of our full method is very close to that of "Ours *" because the feature enhancement mechanism only uses the softmax function to stretch the features, its computational complexity is low, and it adds no network parameters. The comparison results in Table 10 show that our network also maintains good real-time performance.

5. Conclusions

Since point clouds lack detailed textures and face recognition requires a fine-grained representation of features, this paper proposes a new operator, ψconv, based on a local feature descriptor to realize fine-grained feature extraction from disordered point clouds with a convolutional neural network, and constructs a feature enhancement mechanism to improve feature discrimination; meanwhile, a triplet loss function is adopted to optimize the network. In order to verify the performance of our method, we conducted experiments on the CASIA-3D, Lock3Dface, and Bosphorus datasets. The results of the ablation experiments prove that the feature enhancement mechanism and the triplet loss can effectively improve the recognition accuracy of the model. The results of the comparison experiments show that our network outperforms current advanced methods and copes better with the interference of facial expressions, partial occlusions, and head pose changes. Meanwhile, our network has good real-time performance and can be applied in real scenarios. However, when the pose interference is too large and some facial features are missing, the accuracy of our method is still insufficient. We will further explore new methods to improve the accuracy under large pose interference and investigate new algorithms for 3D face analysis, such as head pose estimation, expression recognition, face detection, and other 3D visual tasks in real applications.

Author Contributions

Conceptualization, Q.W.; data curation, Q.W.; formal analysis, Q.W. and W.Q.; investigation, Q.W.; methodology, Q.W.; project administration, H.L.; resources, Q.W.; software, Q.W.; supervision, H.L.; visualization, Q.W.; writing—original draft, Q.W.; writing—review and editing W.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61802052), the Innovative Research Foundation of Ship General Performance (26422206), and the Sichuan Science and Technology Program (2023YFSY0040).

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xiao, S.; Sang, N.; Wang, X.; Ma, X. Leveraging Ordinal Regression with Soft Labels for 3D Head Pose Estimation from Point Sets. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1883–1887. [Google Scholar]
  2. Ma, X.; Sang, N.; Xiao, S.; Wang, X. Learning a Deep Regression Forest for Head Pose Estimation from a Single Depth Image. J. Circuits Syst. Comput. 2021, 30, 2150139. [Google Scholar] [CrossRef]
  3. Yu, Y.; Hoshyar, A.N.; Samali, B.; Zhang, G.; Rashidi, M.; Mohammadi, M. Corrosion and coating defect assessment of coal handling and preparation plants (CHPP) using an ensemble of deep convolutional neural networks and decision-level data fusion. Neural Comput. Appl. 2023, 35, 18697–18718. [Google Scholar] [CrossRef]
  4. Yu, Y.; Li, J.; Li, J.; Xia, Y.; Ding, Z.; Samali, B. Automated damage diagnosis of concrete jack arch beam using optimized deep stacked autoencoders and multi-sensor fusion. Dev. Built Environ. 2023, 14, 100128. [Google Scholar] [CrossRef]
  5. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  6. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
  7. Deng, H.; Birdal, T.; Ilic, S. Ppfnet: Global Context Aware Local Features for Robust 3D Point Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 195–205. [Google Scholar]
  8. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Proceedings of the Annual Conference on Neural Information Processing Systems 2018 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  9. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  10. Yang, J.; Luo, L.; Qian, J.; Tai, Y.; Zhang, F.; Xu, Y. Nuclear Norm Based Matrix Regression with Applications to Face Recognition with Occlusion and Illumination Changes. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 156–171. [Google Scholar] [CrossRef] [PubMed]
  11. Guo, K.; Wu, S.; Xu, Y. Face recognition using both visible light image and near-infrared image and a deep network. CAAI Trans. Intell. Technol. 2017, 2, 39–47. [Google Scholar] [CrossRef]
  12. Lu, J.; Liong, V.E.; Wang, G.; Moulin, P. Joint feature learning for face recognition. IEEE Trans. Inf. Forensics Secur. 2015, 10, 1371–1383. [Google Scholar] [CrossRef]
  13. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699. [Google Scholar]
  14. Zong, Y.; Huang, X.; Zheng, W.; Cui, Z.; Zhao, G. Learning from Hierarchical Spatiotemporal Descriptors for Micro-Expression Recognition. IEEE Trans. Multimed. 2018, 20, 3160–3172. [Google Scholar] [CrossRef]
  15. Yang, W.H.; Dai, D.Q. Two-dimensional maximum margin feature extraction for face recognition. IEEE Trans. Syst. Man Cybern. Part B 2009, 39, 1002–1012. [Google Scholar] [CrossRef] [PubMed]
  16. Zhang, B.; Gao, Y.; Zhao, S.; Liu, J. Local Derivative Pattern Versus Local Binary Pattern: Face Recognition With High-Order Local Pattern Descriptor. IEEE Trans. Image Process. 2009, 19, 533–544. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, L.; Ding, Z.; Li, H.; Shen, Y.; Lu, J. 3D Face Recognition Based on Multiple Keypoint Descriptors and Sparse Representation. PLoS ONE 2014, 9, e100120. [Google Scholar] [CrossRef] [PubMed]
  18. Chouchane, A.; Belahcene, M.; Bourennane, S. 3D and 2D face recognition using integral projection curves based depth and intensity images. Int. J. Intell. Syst. Technol. Appl. 2015, 14, 50–69. [Google Scholar] [CrossRef]
  19. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Soltanpour, S.; Wu, Q.J. High-order local normal derivative pattern (LNDP) for 3D face recognition. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 2811–2815. [Google Scholar]
  22. Mu, G.; Huang, D.; Hu, G.; Sun, J.; Wang, Y. Led3d: A lightweight and efficient deep approach to recognizing low-quality 3d faces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5773–5782. [Google Scholar]
  23. Dutta, K.; Bhattacharjee, D.; Nasipuri, M. SpPCANet: A simple deep learning-based feature extraction approach for 3D face recognition. Multimed. Tools Appl. 2020, 79, 31329–31352. [Google Scholar] [CrossRef]
  24. Bhople, A.R.; Shrivastava, A.M.; Prakash, S. Point cloud based deep convolutional neural network for 3D face recognition. Multimed. Tools Appl. 2021, 80, 30237–30259. [Google Scholar] [CrossRef]
  25. Atik, M.E.; Duran, Z. Deep learning-based 3D face recognition using derived features from point cloud. In Innovations in Smart Cities Applications Volume 4: The Proceedings of the 5th International Conference on Smart City Applications; Springer International Publishing: Cham, Switzerland, 2021; pp. 797–808. [Google Scholar]
  26. Gao, G.; Yang, H.; Liu, H. 3D point cloud face recognition based on deep learning. J. Comput. Appl. 2021, 41, 2736. [Google Scholar]
  27. Yu, Y.; Da, F.; Zhang, Z. Few-data guided learning upon end-to-end point cloud network for 3D face recognition. Multimed. Tools Appl. 2022, 81, 12795–12814. [Google Scholar] [CrossRef]
  28. Cao, Y.; Liu, S.; Zhao, P.; Zhu, H. RP-Net: A PointNet++ 3D Face Recognition Algorithm Integrating RoPS Local Descriptor. IEEE Access 2022, 10, 91245–91252. [Google Scholar] [CrossRef]
  29. Zhang, Z.; Da, F.; Yu, Y. Learning directly from synthetic point clouds for “in-the-wild” 3D face recognition. Pattern Recognit. 2022, 123, 108394. [Google Scholar] [CrossRef]
  30. Yu, C.; Zhang, Z.; Li, H.; Sun, J.; Xu, Z. Meta-learning-based adversarial training for deep 3D face recognition on point clouds. Pattern Recognit. 2023, 134, 109065. [Google Scholar] [CrossRef]
  31. Jiang, C.; Lin, S.; Chen, W.; Liu, F.; Shen, L. PointFace: Point Cloud Encoder-Based Feature Embedding for 3-D Face Recognition. IEEE Trans. Biom. Behav. Identity Sci. 2022, 4, 486–497. [Google Scholar] [CrossRef]
  32. Xiao, S.; Sang, N.; Wang, X. 3D point cloud head pose estimation based on deep learning. J. Comput. Appl. 2020, 40, 996. [Google Scholar]
  33. Institute of Automation of Chinese Academy of Sciences. Note on CASIA-3D FaceV1. 2004. Available online: http://biometrics.idealtest.org (accessed on 10 November 2021).
  34. Savran, A.; Alyüz, N.; Dibeklioğlu, H.; Çeliktutan, O.; Gökberk, B.; Sankur, B.; Akarun, L. Bosphorus database for 3D face analysis. In Biometrics and Identity Management; Springer: Berlin/Heidelberg, Germany, 2008; pp. 47–56. [Google Scholar]
  35. Zhang, J.; Huang, D.; Wang, Y.; Sun, J. Lock3DFace: A large-scale database of low-cost kinect 3D faces. In Proceedings of the 2016 International Conference on Biometrics (ICB), Halmstad, Sweden, 13–16 June 2016; pp. 1–8. [Google Scholar]
  36. Serafin, J.; Grisetti, G. Nicp: Dense normal based point cloud registration. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 742–749. [Google Scholar]
  37. Liu, Y.; Fan, B.; Xiang, S.; Pan, C. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8887–8896. [Google Scholar]
  38. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–26 June 2018; pp. 4510–4520. [Google Scholar]
Figure 1. Example of the local feature descriptor with the center point p i .
Figure 2. Example of the disorder of the point clouds, where (a,b) represent point clouds with different index orders under the same distribution.
Figure 3. The permutation matrix to adjust the order of the points.
Figure 4. The PointNet for local feature extraction.
Figure 5. The structure of MLP_α.
Figure 6. The structure of MLP_χ.
Figure 7. The weight matrix for permutation invariance.
Figure 8. The convolutional neural network for facial feature extraction.
Figure 9. The structure of the feature enhancement mechanism.
Figure 10. The schematic diagram of triplet loss training process.
Figure 11. The complete pipeline of our proposed network for face recognition.
Figure 12. Accuracy change curves with different λ on three datasets.
Figure 13. Accuracy change curves during training with different feature extraction networks.
Figure 14. t-SNE examples of face recognition, where (a–c) represent the classification results on CASIA-3D, Bosphorus, and Lock3DFace, respectively.
Table 1. Performance evaluation with different λ on CASIA-3D.
λ       | 0    | 10   | 15   | 20   | 25   | 30   | 35   | 40   | 45   | 50
Acc (%) | 89.5 | 94.8 | 95.9 | 97.1 | 97.8 | 98.3 | 98.6 | 98.7 | 98.8 | 98.9
λ       | 55   | 60   | 65   | 70   | 75   | 80   | 85   | 90   | 95   | 100
Acc (%) | 98.6 | 97.9 | 97.1 | 96.3 | 95.7 | 95.1 | 94.5 | 93.7 | 93.6 | 93.6
Table 2. Performance evaluation with different λ on Bosphorus.
λ       | 0    | 10   | 15   | 20   | 25   | 30   | 35   | 40   | 45   | 50
Acc (%) | 90.1 | 93.3 | 94.7 | 95.6 | 96.6 | 97.3 | 98.0 | 98.3 | 98.5 | 98.8
λ       | 55   | 60   | 65   | 70   | 75   | 80   | 85   | 90   | 95   | 100
Acc (%) | 98.9 | 98.6 | 97.8 | 97.2 | 96.7 | 96.2 | 95.5 | 94.9 | 94.0 | 93.9
Table 3. Performance evaluation with different λ on Lock3DFace.
λ       | 0    | 10   | 15   | 20   | 25   | 30   | 35   | 40   | 45   | 50
Acc (%) | 82.2 | 84.5 | 85.3 | 86.0 | 86.5 | 86.9 | 87.3 | 87.5 | 87.6 | 87.8
λ       | 55   | 60   | 65   | 70   | 75   | 80   | 85   | 90   | 95   | 100
Acc (%) | 88.0 | 87.7 | 86.9 | 86.2 | 85.5 | 84.9 | 84.5 | 84.1 | 83.9 | 83.8
Table 4. Accuracy of the different λ on three datasets.
λ                  | 40           | 45           | 50           | 55
CASIA-3D (Acc %)   | 98.66 ± 0.04 | 98.77 ± 0.03 | 98.89 ± 0.01 | 98.55 ± 0.05
Bosphorus (Acc %)  | 98.23 ± 0.07 | 98.47 ± 0.03 | 98.78 ± 0.02 | 98.88 ± 0.02
Lock3DFace (Acc %) | 87.38 ± 0.12 | 87.48 ± 0.12 | 87.70 ± 0.10 | 87.91 ± 0.09
Table 5. Accuracy of the different metric distances on the three datasets.
Dataset    | L* Acc (%)   | L Acc (%)
CASIA-3D   | 98.27 ± 0.03 | 98.89 ± 0.01
Bosphorus  | 98.08 ± 0.02 | 98.88 ± 0.02
Lock3DFace | 87.39 ± 0.11 | 87.91 ± 0.09
Table 6. Comparison of accuracy achieved by different feature extraction networks on the test set of CASIA-3D.
Methods        | Acc (%)
PointNet++ [6] | 95.6
NICP [36]      | 90.3
RSCNN [37]     | 95.9
PointCNN [8]   | 97.5
Ours           | 98.9
Table 7. Comparison of accuracy achieved by different methods on CASIA-3D.
Methods               | Acc (%)
Chouchane et al. [18] | 96.8
Dutta et al. [23]     | 98.2
Gao et al. [26]       | 97.6
Cao et al. [28]       | 97.9
Ours                  | 98.9
Table 8. Comparison of accuracy achieved by different methods on Bosphorus.
Methods                | Acc (%)
Zhang et al. [17]      | 93.0
Soltanpour et al. [21] | 97.3
Dutta et al. [23]      | 98.5
Cao et al. [28]        | 98.0
Ours                   | 98.9
Table 9. Comparison of accuracy achieved by different methods on Lock3DFace.
Methods             | FE (%) | NU (%) | OC (%) | PS (%) | Total (%)
He et al. [20]      | 96.1   | 99.3   | 54.9   | 61.4   | 76.6
Szegedy et al. [19] | 93.6   | 99.0   | 57.0   | 54.1   | 74.4
Sandler et al. [38] | 95.7   | 98.9   | 61.4   | 69.9   | 79.5
Mu et al. [22]      | 98.1   | 99.6   | 78.1   | 70.4   | 84.2
Jiang et al. [31]   | 98.5   | 99.5   | 80.1   | 73.7   | 87.2
Ours                | 98.5   | 99.3   | 82.6   | 74.9   | 88.0
Table 10. Comparison of time costs, where "Ours *" represents our method without the feature enhancement mechanism.
Methods             | fps
Dutta et al. [23]   | 7
Sandler et al. [38] | 6
Xiao et al. [1]     | 117
Xiao et al. [32]    | 125
Mu et al. [22]      | 136
Jiang et al. [31]   | 118
Ours *              | 139
Ours                | 125
