Article

MMPCANet: An Improved PCANet for Occluded Face Recognition

by Zewei Wang, Yongjun Zhang, Chengchang Pan and Zhongwei Cui

1 Key Laboratory of Intelligent Medical Image Analysis and Precise Diagnosis of Guizhou Province, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
2 School of Mathematics and Big Data, Guizhou Education University, Guiyang 550018, China
3 Big Data Science and Intelligent Engineering Research Institute, Guizhou Education University, Guiyang 550018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(6), 3144; https://doi.org/10.3390/app12063144
Submission received: 24 January 2022 / Revised: 9 March 2022 / Accepted: 15 March 2022 / Published: 19 March 2022

Abstract

Principal Component Analysis Network (PCANet) is a lightweight deep learning network that is fast and effective in face recognition. However, its accuracy on occluded faces falls short of requirements for two reasons: (1) PCANet stretches two-dimensional images into column vectors, which discards essential spatial information; (2) when the training samples are few, the recognition accuracy of PCANet is low. To address these problems, this paper proposes a multi-scale and multi-layer feature fusion-based PCANet (MMPCANet) for occluded face recognition. First, the original image features and the output features of the first layer are concatenated channel-wise, and the concatenated result is used as the input of the second layer, so that more image feature information is exploited. In addition, to avoid the loss of image spatial information, a spatial pyramid is used as the feature pooling layer of the network. Finally, the feature vector is sent to a random forest classifier for classification. The proposed algorithm is tested on several widely used facial image databases and compared with other similar algorithms. Our experimental results show that, under the same training and testing datasets, the proposed algorithm improves both the efficiency of network training and the recognition accuracy on occluded faces. The average accuracies are 98.78% on CelebA, 97.58% on AR, and 97.15% on FERET.

1. Introduction

During the COVID-19 pandemic, almost everyone wore masks, which undermined face recognition as an important means of authentication. Occluded face recognition is therefore an urgent problem. Facial occlusion is the obstruction of facial information by external factors. Complete facial information is often the key to the normal recognition of faces by a security system; incomplete information creates loopholes for criminals. Occlusion takes different forms, including occlusion under complex illumination, objects in front of the face, and self-occlusion. These circumstances can distort the distribution of images between the training and testing sets [1]. In general, face recognition includes three steps: face detection [2], feature extraction [3], and face classification [4]. Face detection segments the face region from the whole image for subsequent feature extraction. Feature extraction extracts the most discriminative features from each face image. Face classification uses the extracted features to determine whether an unknown face image belongs to a given person. At present, the key to occluded face recognition is still the extraction of effective features [5,6,7]. Our proposed method addresses both feature extraction and classification.
In recent years, researchers have investigated a series of occluded face recognition methods, which are mainly divided into deep learning methods and traditional methods. For example, Wright et al. [8] proposed sparse representation classification (SRC) for partially occluded face recognition. On the basis of SRC, Deng et al. [9] proposed an extended SRC (ESRC), whose basic idea is that intraclass variations of the same person can be shared by others. In order to solve the problem of weak discrimination and the large scale of the occlusion dictionary, Du et al. [10] proposed nuclear norm based adaptive occlusion dictionary learning (NNAODL). Gao et al. [11] proposed an approach named multilayer locality-constrained structural orthogonal Procrustes regression (MLCSOPR). One of the advantages of these traditional occluded face recognition techniques is that they can achieve high recognition accuracy even with few training images [12,13]. Methods based on deep learning have been the main research direction in recent years, but deep learning often brings challenges, including a large number of training parameters and a complex neural network structure. To alleviate these problems, PCANet was proposed by Chan et al. [14] at the end of 2015, and its most important contribution is to introduce subspace learning into deep learning. In other words, it connects deep learning with traditional feature extraction, which provides a new way to learn the convolution kernels of a convolutional neural network (CNN) [15]. Compared with a CNN, PCANet has a simple structure and does not need to adjust parameters by stochastic gradient descent when learning the filters. To some extent, it avoids the overfitting problem caused by too few training samples, and it also eliminates the tedious parameter tuning process. At present, PCANet has been widely used in face recognition and texture recognition [16,17]. A large number of experiments have shown that PCANet performs very well on many different types of face databases [18,19] and that it can extract deep features that are well suited for classification [20,21].
It should be noted that, for occluded faces, PCANet easily loses image information during the two-stage PCA decomposition. In addition, compared with the nonlinear mapping provided by the activation functions in a CNN, PCANet has poor nonlinear fitting ability and cannot guarantee that the original data set is linearly separable [22]. Experiments by Alahmadi et al. [23] show that PCANet features are still strongly affected by occlusion. At the same time, the recognition accuracy of PCANet is low when the training samples are few. Commonly used neural networks include DeepID2 [24], DeepID2+ [25], PCANet, FaceNet [26], and FROM [5]. Although they are inspired by the principles of human vision, a neural network needs a great deal of training to learn occluded face information, whereas the human eye only needs to observe a small number of pictures. Therefore, it is still necessary to train a neural network for occluded face recognition from small samples. In order to solve the above problems, this paper proposes a multi-scale and multi-layer feature fusion based PCANet (MMPCANet) for occluded face recognition. Table 1 shows key differences between a number of previous methods and the proposed method. Our contributions are as follows:
(1) A multi-scale, multi-layer feature fusion PCANet is proposed. First, the original image features and the output features of the first layer are concatenated along the channel dimension, and the concatenated result is used as the input of the second layer to retain more spatial information. In addition, in order to avoid the loss of spatial information of the image, a spatial pyramid is used as the feature pooling layer of the network.
(2) The features extracted from the original image by MMPCANet are often high-dimensional. A random forest not only processes high-dimensional data efficiently but also provides an unbiased estimate of the generalization error. Using it instead of a support vector machine (SVM) [27] classifier avoids many parameter tuning problems and achieves very high accuracy.
(3) Experimental results show that our proposed MMPCANet outperforms PCANet on multiple benchmark datasets.
The remainder of this paper is organized as follows. Section 2 describes MMPCANet in detail. Section 3 analyzes the proposed algorithm and compares it with PCANet. Section 4 reports comparative experiments against other algorithms on several databases. Section 5 summarizes the findings of this study.

2. Proposed Method

2.1. Multi-Layer Feature Fusion PCANet

Inspired by DenseNet [28], a multi-layer feature fusion PCANet (MPCANet) is proposed to fuse feature information from different levels. MPCANet first concatenates the original image features and the output features of the first layer of the network along the channel dimension (channel concatenation), and then takes the concatenated result as the input of the second layer of the network; therefore, MPCANet uses more feature information than PCANet. MPCANet includes an input layer, convolution layers, an output layer, and a classification layer. Its specific design is as follows:
(1) Input layer of MPCANet: First, block sampling is performed on each input training image. Assume that there are N training images of size $m \times n$ and that the filter size is $k_1 \times k_2$. Taking each pixel of image $I_i^1$ as the center, all $k_1 \times k_2$ blocks are collected and vectorized, so that each input image yields a matrix $X_i$:

$$X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,\tilde{m}\tilde{n}}] \in \mathbb{R}^{k_1 k_2 \times \tilde{m}\tilde{n}}$$

where $\tilde{m} = m - k_1 + 1$ and $\tilde{n} = n - k_2 + 1$. Then, the mean of each column is subtracted from that column to obtain the de-meaned matrix $\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \ldots, \bar{x}_{i,\tilde{m}\tilde{n}}]$. This is repeated for all N training images, and all $\bar{X}_i$ are put together to form a large matrix X:

$$X = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_N] \in \mathbb{R}^{k_1 k_2 \times N\tilde{m}\tilde{n}}$$

X is the input of MPCANet.
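As an illustration of this step, the following NumPy sketch builds $X_i$ and $X$ under the assumption of dense sampling with stride 1 and no padding; the helper names `im2col_mean_removed` and `build_X` are ours, not from the original paper.

```python
import numpy as np

def im2col_mean_removed(img, k1, k2):
    """Collect all k1 x k2 patches of a 2-D image (one per valid centre
    position), vectorise them as columns, and subtract each column's mean."""
    m, n = img.shape
    m_t, n_t = m - k1 + 1, n - k2 + 1           # \tilde{m}, \tilde{n}
    cols = np.empty((k1 * k2, m_t * n_t))
    idx = 0
    for i in range(m_t):
        for j in range(n_t):
            cols[:, idx] = img[i:i + k1, j:j + k2].ravel()
            idx += 1
    return cols - cols.mean(axis=0, keepdims=True)  # de-mean every patch

def build_X(images, k1, k2):
    """Stack the de-meaned patch matrices of all N training images
    column-wise to form the large matrix X."""
    return np.hstack([im2col_mean_removed(I, k1, k2) for I in images])
```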
(2) The first-layer convolution: Let the number of filters be $L_1$. First, $XX^T$ is decomposed by SVD, and the eigenvectors $V_l^1 = \{V_1^1, V_2^1, \ldots, V_{L_1}^1\}$ corresponding to the $L_1$ largest eigenvalues of $XX^T$ are obtained. Each eigenvector is then reshaped into a matrix of size $k_1 \times k_2$, and these matrices form the filters of the first layer of MPCANet. In other words, the filters of the first-layer convolution can be expressed as follows:

$$W_l^1 = \mathrm{mat}_{k_1, k_2}(V_l^1), \quad W_l^1 \in \mathbb{R}^{k_1 \times k_2}$$

Here, $\mathrm{mat}$ is a function that maps the $l$-th eigenvector $V_l^1$ to the matrix $W_l^1$. Therefore, for image $I_i^1$, the output of the first-layer convolution is

$$I_{i,l}^1 = I_i^1 * W_l^1, \quad i = 1, 2, \ldots, N; \; l = 1, 2, \ldots, L_1$$

where $*$ denotes two-dimensional convolution [14]. For each input image, $L_1$ output maps are obtained after the first-layer convolution.
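A minimal sketch of the first-stage filter learning and convolution follows. It assumes the eigenvectors of $XX^T$ are obtained with `numpy.linalg.eigh` (equivalent to the SVD step described above) and that zero-padded 'same' convolution keeps the output maps the same size as the input; the function names are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

def pca_filters(X, k1, k2, L):
    """Learn L convolution kernels as the leading eigenvectors of X X^T,
    each reshaped into a k1 x k2 filter W_l."""
    cov = X @ X.T                                   # (k1*k2) x (k1*k2)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:L]           # indices of L largest
    return [eigvecs[:, idx].reshape(k1, k2) for idx in order]

def convolve_stage(img, filters):
    """First-stage outputs I_{i,l}^1 = I_i^1 * W_l^1, one map per filter;
    'same' zero-padding keeps the spatial size of the input image."""
    return [convolve2d(img, W, mode='same', boundary='fill') for W in filters]
```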
(3) The second-layer convolution: Unlike the original PCANet, MPCANet concatenates each original input image with the outputs of the first convolution layer along the channel dimension and uses the concatenated result as the input of the second convolution layer; i.e., the input of the second convolution layer of MPCANet is $I^2$:

$$I^2 = [I_1^1, I_2^1, \ldots, I_N^1, I_{1,1}^1, \ldots, I_{1,L_1}^1, \ldots, I_{N,1}^1, \ldots, I_{N,L_1}^1] \in \mathbb{R}^{k_1 k_2 \times (L_1+1)N\tilde{m}\tilde{n}}$$

$L_2$ is the number of filters in the second convolution layer [14]. $I_i^1$ and $I_{i,l}^1$ denote the $i$-th original image and its first-layer convolution outputs, respectively. Taking $I^2$ as input and repeating almost the same operation as in the first layer, the filter kernels of the second layer are learned and denoted as $W_j^2 = \{W_1^2, W_2^2, \ldots, W_{L_2}^2\}$. Therefore, each $I_i^2$ has $L_2$ corresponding outputs:

$$U_i^l = \{I_i^2 * W_j^2\}_{j=1}^{L_2}, \quad i = 1, 2, \ldots, (L_1+1)N; \; l = 1, 2, \ldots, L_1+1$$
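The sketch below shows one way the MPCANet second-stage input $I^2$ could be assembled, reusing `build_X`, `convolve_stage`, and `pca_filters` from the sketches above; it reflects our reading of the channel concatenation described here and is not the authors' code.

```python
def second_stage_input(images, filters_1, k1, k2):
    """For each training image, channel-concatenate the original image with
    its L1 first-stage maps, then collect de-meaned patches from every map,
    giving a (k1*k2) x ((L1+1)*N*m~*n~) matrix as in the definition of I^2."""
    stacked = []
    for I in images:
        stacked.append(I)                              # the original image
        stacked.extend(convolve_stage(I, filters_1))   # its L1 first-stage maps
    return build_X(stacked, k1, k2)

# The second-stage filters W_j^2 can then be learned exactly as in stage one:
# filters_2 = pca_filters(second_stage_input(images, filters_1, k1, k2), k1, k2, L2)
```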
(4) Output layer: The output layer includes nonlinear processing, block histograms, and feature concatenation. After the second-layer convolution, each original input image $I_i^1$ has $L_1 + 1$ groups of output maps, and each group contains $L_2$ feature maps; i.e., a total of $(L_1+1)L_2$ feature maps are generated. For each group, the Heaviside step function $H$ is used to binarize each output map, and a binary hash code then converts the $L_2$ output maps into an output matrix $D$ whose elements take values in $[0, 2^{L_2}-1]$:

$$D_i^l = \sum_{j=1}^{L_2} 2^{j-1} H(U_i^l)$$

A total of $L_1 + 1$ output matrices $D_i^l$ ($l = 1, 2, \ldots, L_1+1$) are obtained. Finally, a block-based histogram [29,30] is used to obtain the feature of each image. $D_i^l$ is divided into $B$ blocks, the histogram of each block is computed, and the $B$ block histograms are concatenated into $\mathrm{Bhist}(D_i^l)$. Finally, all $\mathrm{Bhist}(D_i^l)$ are connected into the extended histogram feature vector, which is regarded as the feature representation $F_i$ of input image $I_i^1$:

$$F_i = [\mathrm{Bhist}(D_i^1), \mathrm{Bhist}(D_i^2), \ldots, \mathrm{Bhist}(D_i^{L_1+1})]^T, \quad F_i \in \mathbb{R}^{2^{L_2}(L_1+1)B}$$
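A possible implementation of the binary hashing and block-histogram step is sketched below, assuming the Heaviside step is taken as $H(x)=1$ for $x>0$ and that the non-overlapping $7 \times 7$ blocks from the experimental setup are used; the names are illustrative.

```python
import numpy as np

def hash_maps(maps_L2):
    """Binarise the L2 second-stage maps of one group and combine them into
    one integer map D with values in [0, 2**L2 - 1] (binary hashing)."""
    D = np.zeros_like(maps_L2[0], dtype=np.int64)
    for j, U in enumerate(maps_L2):
        D += (2 ** j) * (U > 0).astype(np.int64)   # 2^{j-1} with 1-based j
    return D

def block_histogram(D, L2, block=(7, 7)):
    """Bhist(D): histogram of each non-overlapping block, concatenated.
    Pixels that do not fill a whole block are simply dropped here."""
    bh, bw = block
    h, w = D.shape
    feats = []
    for i in range(0, h - bh + 1, bh):
        for j in range(0, w - bw + 1, bw):
            hist, _ = np.histogram(D[i:i + bh, j:j + bw],
                                   bins=2 ** L2, range=(0, 2 ** L2))
            feats.append(hist)
    return np.concatenate(feats)
```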
(5) Classification layer: The feature vector $F_i$ output by the network is used as the input of a random forest classifier, and the classifier is trained to obtain the classification results [14]. Kremic and Subasi [27] used two methods to demonstrate that the random forest algorithm is effective for face data.
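A minimal sketch of the classification layer with scikit-learn's RandomForestClassifier is given below; the number of trees, the feature dimension, and the placeholder arrays `F_train`, `y_train`, `F_test`, and `y_test` are illustrative assumptions, as the paper does not report these settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the network's feature vectors F_i and the
# identity labels; in practice these come from the feature pooling layer.
rng = np.random.default_rng(0)
F_train, y_train = rng.random((200, 512)), rng.integers(0, 10, size=200)
F_test, y_test = rng.random((50, 512)), rng.integers(0, 10, size=50)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(F_train, y_train)                      # train on the fused features
accuracy = (clf.predict(F_test) == y_test).mean()
print(f"recognition rate: {accuracy:.3f}")
```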
In addition, although MPCANet is only a two-layer network model, it is as easy to extend to three or more layers as PCANet. A three-layer MPCANet (named MPCANet3) is also available: on the basis of the first two layers of the MPCANet structure, the outputs of the first and second layers are cascaded, and the cascaded result is used as the input of the third layer. Although we only extend to three layers here, the model can in theory be extended to more layers. A simple comparison of PCANet, MPCANet, and MPCANet3 is shown in Figure 1.

2.2. Multi-Scale PCANet

Section 2.1 establishes a mapping of the input data from low-level features to high-level features using the convolution layers, the nonlinear processing layer, and block histograms, which extract the high-level semantic features of the input samples. However, using only the high-level semantic features as the final feature vector ignores the low-level features of the data, which contain more local detail. The authors of [31] show that a multi-scale feature representation combining the local detail information of low-level features with the abstract semantic information of high-level features can effectively reduce the information loss caused by repeated convolution and pooling operations. At the same time, spatial pyramid max pooling aggregates the extracted feature vectors and retains the important information in the image while discarding irrelevant information; therefore, the network obtains better noise resistance and robustness [32].
Based on the above analysis, a multi-scale and multi-layer feature fusion PCANet (MMPCANet) is proposed. A feature pooling layer is added to MPCANet, and the resulting structure is shown in Figure 2. $I_i$ is the input sample; $W_{L_1}^1$ and $W_{L_2}^2$ are the convolution kernels of the first and second convolution layers, respectively; and $I_{i,l}^1$ and $U_i^l$ are the outputs of the first and second convolution layers, respectively. In the nonlinear processing layer, $D_i^1(L_1)$ and $D_i^2$ are obtained by nonlinear processing of $I_{i,l}^1$ and $U_i^l$. The specific steps are as follows:
(1) A spatial pyramid pooling strategy is introduced into the feature pooling layer to construct the feature descriptor of the image level by level. $F_{i,1}$ and $F_{i,2}$ are the feature vectors obtained by spatial pyramid pooling of $D_i^1(L_1)$ and $D_i^2$, respectively. The feature vectors of different scales obtained by spatial pyramid pooling are concatenated to obtain the multi-scale features of the image: $F_i = [F_{i,1}, F_{i,2}]$ is the feature vector obtained by concatenating $F_{i,1}$ and $F_{i,2}$.
(2) $F_i$ is sent to the random forest algorithm for image classification. Experiments have demonstrated the effectiveness of the random forest algorithm on face data. Therefore, the random forest algorithm, which is better at processing high-dimensional data, is used instead of an SVM [33] for classification on the fused features, and it achieves good classification results. The random forest algorithm used here is described in detail in [34].
The multi-scale strategy extracts the low-level feature $F_{i,2}$ of the input sample by binary weighting and spatial pyramid pooling of the first-layer convolution outputs. By concatenating the low-level feature $F_{i,2}$ with the high-level feature $F_{i,1}$, MMPCANet extracts both high-level semantic information and low-level detail information, thereby obtaining a multi-scale, multi-layer fused feature representation of the input samples.
The multi-scale features described in this paper represent the multi-scale relationship of the input samples from low-level features to high-level features, whereas the image pyramid represents the multi-level relationship at the spatial pixel level, as shown in Figure 3. $U$ is a set of feature vectors extracted from local histograms, and $l$ ($0 \le l \le L$, with $L$ a non-negative integer) denotes the $l$-th level of the spatial partition; at level $l$, the image is divided into $2^l$ grids along the horizontal and vertical directions, giving $2^l \times 2^l = 4^l$ image blocks. A pooling algorithm is applied to the feature vectors at each level, and the pooling function is expressed as

$$Z = F(U)$$

Since mean pooling easily loses the details of the target object in the image, this paper uses the maximum function as the pooling function. The maximum value of each dimension of the feature vectors extracted from an image block represents the response strength of that block. The maximum value of the local image block coding is calculated as follows:

$$z_j = \max\{|u_{1j}|, |u_{2j}|, \ldots, |u_{mj}|\}$$

where $z_j$ is the $j$-th element of $Z$, $u_{mj}$ is the element in the $m$-th row and $j$-th column of the matrix $U$, and $m$ is the number of feature vectors. The maximum pooling function is applied to the features of each region at each level of the pyramid, and, after pooling at all levels, the vectors are concatenated to complete the high-level local feature representation. The advantage of introducing a spatial pyramid based on local histograms is that the maximum pooling function gives strong robustness to local noise. At the same time, representing the image feature vector through the spatial pyramid during pooling attaches position information to the image features and makes classification more accurate. As the level of division increases, the density of the regions increases and the corresponding vector dimension grows, which increases the computational complexity. The experimental results show that, when $L \ge 3$, the computational cost is large and the accuracy is not significantly improved. Therefore, this paper uses a spatial pyramid model with $L = 2$; i.e., the image is divided into three levels ($l = 0, 1, 2$), and, after pooling at the three levels, the feature vectors are concatenated to form the high-level feature representation.
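To make the pooling step concrete, here is a hedged sketch of the three-level pyramid max pooling, assuming the block-histogram vectors are arranged on a regular 2-D grid of local blocks (one vector per block, grid at least 4 x 4); the grid handling and function names are ours.

```python
import numpy as np

def cell_max_pool(U):
    """z_j = max_m |u_{mj}|: element-wise maximum absolute response over the
    block feature vectors (rows of U) falling inside one pyramid cell."""
    return np.abs(U).max(axis=0)

def pyramid_pool(block_feats, levels=(0, 1, 2)):
    """block_feats has shape (gh, gw, d): one d-dimensional histogram vector
    per local block on a gh x gw grid. Level l splits the grid into
    2^l x 2^l cells; the pooled vectors of all cells at all levels are
    concatenated (for L = 2 this gives 1 + 4 + 16 = 21 cells)."""
    gh, gw, d = block_feats.shape
    pooled = []
    for l in levels:
        g = 2 ** l
        for rows in np.array_split(np.arange(gh), g):
            for cols in np.array_split(np.arange(gw), g):
                cell = block_feats[np.ix_(rows, cols)].reshape(-1, d)
                pooled.append(cell_max_pool(cell))
    return np.concatenate(pooled)
```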

3. The Analysis of the Proposed Algorithm

This section mainly analyzes the differences between PCANet and MMPCANet.
The original PCANet only considers the eigenvectors corresponding to the $k$ largest eigenvalues and discards the remaining eigenvectors, which also contain useful discriminative information; this part of the information is therefore lost. A simple way to avoid this loss is to use as many eigenvectors as possible in the PCA to construct convolution kernels. However, such an operation greatly increases the computational complexity and the final feature dimension. Therefore, a random forest, which is good at handling high-dimensional data, is used to improve classification efficiency. The input of each layer of MMPCANet comes from the outputs of all previous layers, and feature reuse is realized through the parallel connection of features, thus achieving better performance with fewer parameters. In other words, combining the feature information of each layer improves the discriminative ability of the features extracted by the network, and the transfer of features is thereby strengthened.
It has been found that, even if large parts of the image are occluded, taking full advantage of the non-occluded features can still improve face recognition performance [25]. The low-level features of a neural network capture the structural information of the image, also known as edge and texture information, whereas high-level features capture the abstract semantic information of the image. Fusing the two has proven to be more effective than a simple deep feature representation alone. Our experimental results also show that robust feature extraction from occluded faces can reduce the impact of occlusion on face recognition and improve the robustness and accuracy of occluded face recognition.
Different from a traditional convolutional neural network, in the convolution layers of the MMPCANet model the convolution kernels are obtained from the training set by the PCA algorithm rather than adjusted by the Stochastic Gradient Descent (SGD) algorithm. Compared with stochastic gradient descent, our model avoids the drawbacks of excessive parameters, long training time, and the experience required for parameter initialization and fine-tuning.
In order to test the effect of the random forest algorithm in face recognition, this paper compares the accuracy and training time of the random forest and the SVM on the CelebA dataset [35], and it defines a factor $\alpha$ to represent the network's average training benefit per sample:

$$\alpha = p / c$$

where $c$ is the time spent training one sample and $p$ is the recognition rate of the network. Figure 4 shows the curves of the average factor for the random forest and the SVM on the CelebA dataset (with the same numbers of training and test samples). It can be seen from Figure 4 that the average factor of the random forest is always greater than that of the SVM regardless of the number of training samples. When the number of training samples exceeds 500, the average factor of the random forest converges significantly faster than that of the SVM. This indicates that the random forest performs better than the SVM on small datasets. Given the shortage of occluded face datasets, the random forest is more practical than the SVM, and its average training time is 0.15 s shorter than that of the SVM.
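For illustration only, the factor can be computed as below; the accuracy and timing numbers are made-up placeholders, not the measured values behind Figure 4.

```python
# alpha = p / c: recognition rate divided by training time per sample.
p_rf, c_rf = 0.95, 0.02      # hypothetical random forest: accuracy, s/sample
p_svm, c_svm = 0.93, 0.05    # hypothetical SVM: accuracy, s/sample
alpha_rf = p_rf / c_rf       # 47.5
alpha_svm = p_svm / c_svm    # 18.6 -> the larger alpha, the better the benefit
```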

4. Experiments and Results

In this section, the proposed algorithm is evaluated on three public face databases. These face databases are the CelebA dataset, the AR face database [36] and the FERET database [29]. The detailed comparison table of datasets is shown in Table 2.
In order to illustrate the performance of MMPCANet in recognition tasks, the algorithm is compared with classical non-deep-learning methods (ESRC, NNAODL), PCANet, 2DPCANet [30], and L1-2D2PCANet [19]. For a fair comparison, PCANet, 2DPCANet, L1-2D2PCANet, and the network proposed in this paper use the same network parameters. The number of filters in the first and second stages is $L_1 = L_2 = 8$, the filter size is set to $5 \times 5$, the local histogram block size of the output layer is $7 \times 7$, and the block overlap ratio is zero. Each experiment was run 20 times, and the average was taken as the result. The results show that, compared with other similar algorithms, our algorithm has a clear advantage in accuracy. The specific experimental analysis for each database follows.

4.1. Experiments on CelebA

CelebA includes unoccluded frontal faces and occluded faces. It contains a total of 202,599 face images of 10,177 individuals, annotated with 40 attributes such as wearing sunglasses and smiling. Due to the diversity of occlusion, positive samples are defined as unoccluded frontal faces, including faces wearing ordinary glasses, and are manually labeled as 1 in the sample annotation; negative samples are the other face images, including faces wearing sunglasses, side faces, downward-looking faces, and similar images, and are manually labeled as −1, as shown in Figure 5. A total of 100 people were selected from the CelebA dataset as test subjects; 50 positive samples and 50 negative samples per person were selected as training samples, and 1000 images (500 positive samples, 500 negative samples) were selected as testing samples. The 218 × 178 color images in the CelebA dataset are preprocessed into 200 × 150 grayscale images.
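A minimal preprocessing sketch, assuming a plain grayscale conversion and resize with Pillow (the paper does not specify the cropping or interpolation details):

```python
from PIL import Image
import numpy as np

def preprocess_celeba(path):
    """Convert a 218 x 178 CelebA colour image to a 200 x 150 grayscale array."""
    img = Image.open(path).convert('L')   # grayscale
    img = img.resize((150, 200))          # PIL expects (width, height)
    return np.asarray(img, dtype=np.float64)
```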
Table 3 shows the comparison of experimental results on the 1000 testing samples. As shown, our algorithm obtains a higher average recognition rate than the original algorithm, with an increase of 3.87% on positive samples and 3.56% on negative samples. This indicates that MMPCANet is more suitable for processing complex images and demonstrates the advantage of multi-layer fused features over traditional single-layer features. Compared with other similar algorithms, our algorithm also has clear advantages: on negative samples, it is 12.09%, 2.91%, 2.16%, and 0.99% higher than ESRC, NNAODL, 2DPCANet, and L1-2D2PCANet, respectively.
In order to evaluate the proposed algorithm more comprehensively, we conducted time–cost experiments. Table 4 shows the average training and testing times of the different methods on the CelebA dataset. Compared with PCANet, 2DPCANet extracts image features directly from the two-dimensional image matrix, which saves computation; its training time is therefore shorter than that of PCANet. MMPCANet has a longer training time but in turn extracts more useful image features. The short testing time of MMPCANet benefits from the random forest classifier.

4.2. Experiments on the AR Database

The AR database contains 3276 images of 126 people, an average of 26 images per person, collected in two different sessions and including different expressions, illumination, and occlusion changes. The experiment was set up as follows: 100 individuals (50 males and 50 females) were selected from the AR database as subjects. Eight face images per person were selected as training samples, one of which was unoccluded and had no illumination change; the other seven contained occlusion or illumination factors, as shown in Figure 6a. Testing set I is composed of 300 face images with normal light, strong light on the left, and strong light on the right, as shown in Figure 6b. Testing set II is composed of 300 face images with sunglasses occlusion mixed with normal light, strong light on the left, and strong light on the right, as shown in Figure 6c. Testing set III is composed of 300 face images with scarf occlusion under normal light, light on the left, and strong light on the right, as shown in Figure 6d.
All images were cropped and aligned before the experiment. In the selection of experimental data, testing sets I, II, and III cover strong illumination changes, sunglasses occlusion, and scarf occlusion, respectively, as commonly encountered in a real uncontrolled environment, and the strong illumination makes them more consistent with real scenes. In order to verify that the model can effectively deal with occluded faces, we compared it with the commonly used face recognition methods ESRC, NNAODL, PCANet, 2DPCANet, and L1-2D2PCANet, as shown in Table 5.
The experimental results show that the recognition rates of the original algorithm and the other methods are lower than those of our method. In testing set I, the recognition rate of MMPCANet is 7.9%, 3.21%, 5.68%, 3.91%, and 2.9% higher than that of ESRC, NNAODL, PCANet, 2DPCANet, and L1-2D2PCANet, respectively. In testing set II, it is 7.42%, 4.12%, 6.08%, 5.67%, and 0.68% higher, respectively. In testing set III, it is 12.59%, 5.54%, 9.57%, 4.33%, and 2.8% higher, respectively. It is easy to see that the recognition rate of MMPCANet on testing set I (illumination changes) is higher than that on testing set III (scarf occlusion), which in turn is higher than that on testing set II (sunglasses occlusion). From these three cases, we suspected that eye features are more discriminative than mouth features. For further verification, we added testing set IV, in which we manually added occlusion to the face images. As shown in Figure 7, we divided the images into eight occlusion levels: 0%, 10%, 20%, 30%, 40%, 50%, 60%, and 70% occlusion. At 80%, 90%, and 100% occlusion, the image no longer contains facial features, so these levels were not considered. The face recognition accuracy was then compared on this dataset. The experimental results are shown in Table 6, and Figure 8 shows the recognition accuracy of the different algorithms under the different occlusion rates.
From the test samples and experimental results, it can be seen that, when the face image is unoccluded or when the occlusion rate is 10%, every deep learning network achieves a very high face recognition rate. From 20% to 60% occlusion, the recognition rates of the other deep learning networks decrease rapidly: the more occlusion there is, the fewer useful features can be extracted. However, in the range of 20%~60% occlusion, eye features can still be extracted, so the proposed method can still effectively capture eye features for face recognition. When the occlusion increases to 70%, the recognition rates of all networks drop to a similarly low level (only about 12%), which indicates that almost no facial features can be learned by any network.

4.3. Experiments on FERET Database

In this section, the proposed algorithm is tested on the FERET database. The FERET database was established in 1993 under the FERET program of the United States Department of Defense. Because its images show complex intra-class variability, it is often used to evaluate the performance of face recognition algorithms. It is divided into five parts: Fa (one frontal image per subject, 1199 subjects, used as the gallery), Fb (with expression variation), Fc (images taken under different lighting conditions), DupI (with age variation), and DupII (a subset of DupI). Figure 9 shows a performance comparison between the various methods and the proposed algorithm. We used face detection tools to crop and normalize the images. The recognition rates of the different algorithms on the FERET database are compared in Table 7.
From Figure 9 and Table 7, it can be seen that, in most cases, the proposed method outperforms the other methods. This is particularly evident on DupI and DupII, indicating that MMPCANet is robust to samples with age variation. The average correct recognition rate on DupII is 8.9% higher than that of PCANet, and the average correct recognition rate on DupI is 8.37% higher than that of PCANet. The smallest improvement is on Fc, where the proposed algorithm is only 2.05% higher than L1-2D2PCANet. These experiments show that the proposed algorithm can effectively improve the accuracy of face recognition.
The above experiments suggest that most traditional algorithms focus on the high-level features of the image when representing it, which is effective for clean images. However, for occluded images, since the occlusion is also part of the features, it is easily confused with facial features and can even seriously interfere with the extraction of key features; occluded features are also easily classified as key features, which is the main reason for the low recognition rate on occluded face images. The input of each layer of MMPCANet comes from the outputs of all previous layers, and feature reuse is realized through the parallel connection of features. This is why MMPCANet can achieve good results with a low training time. The current bottleneck that keeps MMPCANet from becoming deeper is that the dimension of the resulting feature increases exponentially with the number of stages. Fortunately, the random forest classifier can handle this high dimensionality. Experiments on three datasets verify the effectiveness of the proposed method, which obtains better classification results than the other compared methods.

5. Conclusions

This paper proposes a multi-scale and multi-layer feature fusion based PCANet (MMPCANet) for occluded face recognition. The original image features and the output features of the first layer are concatenated along the channel dimension, and the concatenated result is used as the input of the second layer to retain more spatial information. In addition, in order to avoid the loss of spatial information of the image, a spatial pyramid is used as the feature pooling layer of the network. Finally, a random forest, which is good at handling high-dimensional data, is used for classification. Compared with the original PCANet, our algorithm extracts more low-level features of the image and does not require a large number of training samples, although it is more time-consuming to train; it therefore has clear advantages for occluded face recognition, since the features of the non-occluded facial areas are almost fully extracted. The average accuracies are 98.78% on CelebA, 97.58% on AR, and 97.15% on FERET. The experiments compared the proposed method with ESRC, NNAODL, PCANet, 2DPCANet, and L1-2D2PCANet, and it achieved the expected results in recognition rate and training efficiency. Our next goal is to add occlusion detection to MMPCANet so that feature extraction can be targeted at the occluded face, thereby reducing the computational cost.

Author Contributions

C.P. and Y.Z. conceptualized the methodology and all the research and wrote large sections of the paper. Z.W., Y.Z. and Z.C. were involved in the interpretation of the results. C.P. was responsible for the visualization and presentation of the mobility indicators. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research Foundation for Advanced Talents of Guizhou University under Grant: (2016) No. 49, Key Disciplines of Guizhou Province Computer Science and Technology (ZDXK [2018]007), Research Projects of Innovation Group of Education (QianJiaoHeKY[2021]022), Project supported by the Guizhou Province Graduate Research Fund (YJSCXJH[2020]53, YJSCXJH[2020]189), and supported by the National Natural Science Foundation of China (62062023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

CelebA dataset: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 10 December 2021); FERET database: https://www.nist.gov/itl/products-and-services/color-feret-database (accessed on 10 December 2021); AR database: http://web.mit.edu/emeyers/www/face_databases.html (accessed on 10 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ou, W.; Luan, X.; Gou, J.; Zhou, Q.; Xiao, W.; Xiong, X.; Zeng, W. Robust discriminative nonnegative dictionary learning for occluded face recognition. Pattern Recognit. Lett. 2018, 107, 41–49. [Google Scholar] [CrossRef]
  2. Kumar, A.; Kaur, A.; Kumar, M. Face detection techniques: A review. Artif. Intell. Rev. 2019, 52, 927–948. [Google Scholar] [CrossRef]
  3. Wang, H.; Hu, J.; Deng, W. Face feature extraction: A complete review. IEEE Access 2017, 6, 6001–6039. [Google Scholar] [CrossRef]
  4. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep face recognition. Br. Mach. Vis. Assoc. 2015, 1–12. [Google Scholar]
  5. Qiu, H.; Gong, D.; Li, Z.; Liu, W.; Tao, D. End2End occluded face recognition by masking corrupted features. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 1. [Google Scholar] [CrossRef] [PubMed]
  6. Martinez, A.M. Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 748–763. [Google Scholar] [CrossRef] [Green Version]
  7. Cament, L.A.; Castillo, L.E.; Perez, J.P.; Galdames, F.J.; Perez, C.A. Fusion of local normalization and Gabor entropy weighted features for face identification. Pattern Recognit. 2014, 47, 568–577. [Google Scholar] [CrossRef]
  8. Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Ma, Y. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 210–227. [Google Scholar] [CrossRef] [Green Version]
  9. Deng, W.; Hu, J.; Guo, J. Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1864–1870. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Du, L.; Hu, H. Nuclear norm based adapted occlusion dictionary learning for face recognition with occlusion and illumination changes. Neurocomputing 2019, 340, 133–144. [Google Scholar] [CrossRef]
  11. Gao, G.; Yu, Y.; Yang, M.; Chang, H.; Huang, P.; Yue, D. Cross-resolution face recognition with pose variations via multilayer locality-constrained structural orthogonal procrustes regression. Inf. Sci. 2020, 506, 19–36. [Google Scholar] [CrossRef]
  12. Wang, J.; Lu, C.; Wang, M.; Li, P.; Yan, S.; Hu, X. Robust face recognition via adaptive sparse representation. IEEE Trans. Cybern. 2014, 44, 2368–2378. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Chen, Z.; Wu, X.J.; Kittler, J. A sparse regularized nuclear norm based matrix regression for face recognition with contiguous occlusion. Pattern Recognit. Lett. 2019, 125, 494–499. [Google Scholar] [CrossRef]
  14. Chan, T.H.; Jia, K.; Gao, S.; Lu, J.; Zeng, Z.; Ma, Y. PCANet: A simple deep learning baseline for image classification? IEEE Trans. Image Process. 2015, 24, 5017–5032. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6. [Google Scholar]
  16. Zhang, P.; Boudaren, M.E.Y.; Jiang, Y.; Song, W.; Li, B.; Li, M.; Wu, Y. High-Order Triplet CRF-PCANet for Unsupervised Segmentation of Nonstationary SAR Image. IEEE Trans. Geosci. Remote Sens. 2020, 59, 8433–8454. [Google Scholar] [CrossRef]
  17. Korichi, A.; Slatnia, S.; Aiadi, O. TR-ICANet: A Fast Unsupervised Deep-Learning-Based Scheme for Unconstrained Ear Recognition. Arab. J. Sci. Eng. 2022, 1–12. [Google Scholar] [CrossRef]
  18. Ng, C.J.; Teoh, A.B.J. DCTNet: A simple learning-free approach for face recognition. In Proceedings of the 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, China, 16–19 December 2015; pp. 761–768. [Google Scholar]
  19. Li, Y.K.; Wu, X.J.; Kittler, J. L1-2D2PCANet: A deep learning network for face recognition. J. Electron. Imaging 2019, 28, 023016. [Google Scholar] [CrossRef]
  20. Sang, H.; Zhou, Q.; Zhao, Y. Pcanet: Pyramid convolutional attention network for semantic segmentation. Image Vis. Comput. 2020, 103, 103997. [Google Scholar] [CrossRef]
  21. Abdelbaky, A.; Aly, S. Human action recognition using short-time motion energy template images and PCANet features. Neural Comput. Appl. 2020, 32, 12561–12574. [Google Scholar] [CrossRef]
  22. Zeng, R.; Wu, J.; Shao, Z.; Chen, Y.; Chen, B.; Senhadji, L.; Shu, H. Color image classification via quaternion principal component analysis network. Neurocomputing 2016, 216, 416–428. [Google Scholar] [CrossRef] [Green Version]
  23. Alahmadi, A.; Hussain, M.; Aboalsamh, H.A.; Zuair, M. PCAPooL: Unsupervised feature learning for face recognition using PCA, LBP, and pyramid pooling. Pattern Anal. Appl. 2020, 23, 673–682. [Google Scholar] [CrossRef]
  24. Sun, Y. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems; The Chinese University of Hong Kong: Hong Kong, China, 2015. [Google Scholar]
  25. Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2892–2900. [Google Scholar]
  26. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  27. Kremic, E.; Subasi, A. Performance of random forest and SVM in face recognition. Int. Arab J. Inf. Technol. 2016, 13, 287–293. [Google Scholar]
  28. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869. [Google Scholar]
  29. Phillips, P.J.; Moon, H.; Rizvi, S.A.; Rauss, P.J. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1090–1104. [Google Scholar] [CrossRef]
  30. Yu, D.; Wu, X.J. 2DPCANet: A deep leaning network for face recognition. Multimed. Tools Appl. 2018, 77, 12919–12934. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  32. Hu, J.; Wu, X.; Zhou, J. Noise robust single image super-resolution using a multiscale image pyramid. Signal Process. 2018, 148, 157–171. [Google Scholar] [CrossRef]
  33. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef] [PubMed]
  34. Guehairia, O.; Ouamane, A.; Dornaika, F.; Taleb-Ahmed, A. Feature fusion via Deep Random Forest for facial age estimation. Neural Netw. 2020, 130, 238–252. [Google Scholar] [CrossRef] [PubMed]
  35. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Large-scale celebfaces attributes (celeba) dataset. Retrieved August 2018, 15, 11. [Google Scholar]
  36. Martinez, A.; Benavente, R. The AR Face Database: CVC Technical Report, 24; Computer Vision Center: Barcelona, Spain, 1998. [Google Scholar]
Figure 1. Illustration of the architecture of the PCANet, MPCANet, and MPCANet3 models. The first row represents PCANet, the second row represents MPCANet, and the third row represents MPCANet3.
Figure 2. The block diagram of the proposed MMPCANet.
Figure 3. Feature pyramid pooling processing.
Figure 4. Curve of average factor between the random forest and the SVM in CelebA datasets (the same number of training samples and test samples). The red line represents SVM, and the blue line represents random forest.
Figure 5. Some examples of CelebA: the first line represents positive samples, and the second represents negative samples.
Figure 6. Some examples of the AR database. Eight face images per person were selected as training samples, one of which was not occluded and had no illumination change; the other seven contained occlusion or illumination factors, as shown in (a). Testing set I is composed of 300 face images with normal light, strong light on the left, and strong light on the right, as shown in (b); testing set II is composed of 300 face images with sunglasses occlusion mixed with normal light, strong light on the left, and strong light on the right, as shown in (c); testing set III is composed of 300 face images with scarf occlusion under normal light, light on the left, and strong light on the right, as shown in (d).
Figure 7. Some examples of face images under different occlusions in AR database.
Figure 8. Average Recognition Rate (%) on AR database.
Figure 9. Average accuracy on FERET database.
Table 1. Key differences between a number of previous methods and the proposed method.

Methods | Advantages | Disadvantages
ESRC [9] | Intraclass variations of the same person can be shared by others | Weak discrimination and a large-scale occlusion dictionary
NNAODL [10] | Integrates the error image with training samples to construct the dictionary | In the case of extremely dark illumination or unconstrained settings, the accuracy is low
PCANet [14] | Can be designed and learned extremely easily and efficiently | Easily causes the loss of image information
L1-2D2PCANet [19] | Uses the L1-norm-based 2DPCA for filter learning | The training time is long
Methods based on deep learning [5,24,25,26] | Good performance and strong learning ability | Require many training samples and much time
MMPCANet | Makes full use of the locality and continuity of the occlusion space to obtain better spatial information | High storage overhead
Table 2. Detailed comparison of datasets.

Dataset | Subjects | Images | Description
CelebA [35] | 10,177 | 202,599 | Includes unoccluded frontal faces and occluded faces, such as faces wearing sunglasses or with smiling expressions.
AR [36] | 126 | 3276 | Includes different expressions, illumination, and occlusion changes.
FERET [29] | 1199 | 14,126 | The images of the same person have different expressions, illumination, posture, and age changes.
Table 3. Average Recognition Rate (%) on CelebA dataset.

Method | Positive Samples | Negative Samples
ESRC [9] | 89.30 | 86.40
NNAODL [10] | 96.09 | 95.58
PCANet [14] | 95.20 | 94.93
2DPCANet [30] | 98.02 | 96.33
L1-2D2PCANet [19] | 97.58 | 97.50
MMPCANet | 99.07 | 98.49
Table 4. Performance on CelebA dataset.

Method | Positive: Training Time (s) | Positive: Testing Time (s) | Negative: Training Time (s) | Negative: Testing Time (s)
PCANet [14] | 454.83 | 0.31 | 499.48 | 0.30
2DPCANet [30] | 374.07 | 0.23 | 404.80 | 0.23
L1-2D2PCANet [19] | 554.81 | 0.17 | 577.59 | 0.18
MMPCANet | 493.03 | 0.16 | 512.69 | 0.17
Table 5. Average Recognition Rate (%) on AR database.

Method | Testing Set I | Testing Set II | Testing Set III | Avg.
ESRC [9] | 91.05 | 89.14 | 84.64 | 88.27
NNAODL [10] | 95.74 | 92.44 | 91.69 | 93.29
PCANet [14] | 93.27 | 90.48 | 87.66 | 90.47
2DPCANet [30] | 95.04 | 90.89 | 92.90 | 92.94
L1-2D2PCANet [19] | 96.05 | 95.88 | 94.43 | 95.45
MMPCANet | 98.95 | 96.56 | 97.23 | 97.58
Table 6. Average Recognition Rate (%) on AR database under different occlusion percentages.

Method | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70%
ESRC [9] | 92.79 | 89.94 | 85.46 | 79.70 | 70.76 | 53.73 | 30.52 | 10.36
NNAODL [10] | 95.71 | 93.48 | 89.62 | 83.49 | 79.99 | 71.96 | 44.05 | 12.58
PCANet [14] | 96.46 | 94.14 | 91.29 | 81.72 | 77.21 | 70.93 | 41.98 | 11.27
2DPCANet [30] | 97.21 | 95.51 | 91.09 | 82.19 | 76.85 | 67.79 | 39.39 | 11.34
L1-2D2PCANet [19] | 98.04 | 97.32 | 92.78 | 85.86 | 78.34 | 69.93 | 44.48 | 12.16
MMPCANet | 98.93 | 97.51 | 93.53 | 89.37 | 85.60 | 79.32 | 66.55 | 13.99
Table 7. Average Recognition Rate (%) on FERET database.

Method | Fb | Fc | DupI | DupII
ESRC [9] | 91.98 | 84.05 | 79.85 | 72.94
NNAODL [10] | 96.21 | 92.78 | 88.26 | 86.22
PCANet [14] | 94.44 | 91.84 | 85.33 | 86.40
2DPCANet [30] | 95.65 | 98.98 | 93.52 | 90.37
L1-2D2PCANet [19] | 97.93 | 97.82 | 93.15 | 92.48
MMPCANet | 99.74 | 99.87 | 93.70 | 95.30
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
