1. Introduction
Cameras have long played an important role in security, and recent advances in image sensor technology have only strengthened that role. Cameras are usually installed in fixed positions, but realizing a smart monitoring system also requires a highly accurate embedded system with substantial computing capability that operates reliably under low power consumption. In recent decades, science and technology have advanced rapidly, and large demands have arisen in the surveillance industry [
1]. As a result, multi-camera systems have become widely used for surveillance and access control in various situations, in the hope of deterring crime and improving both public and private security. Relying on human operators to monitor the camera views is unrealistic and usually insufficient, since manual review typically happens only on request. Automatic face recognition can therefore be considered a non-intrusive and relatively reliable method of user authentication. Furthermore, with the emergence of deep learning methodologies and advances in hardware, it has become feasible to perform automatic face recognition on high-end processors easily and without the aforementioned limitations. Recently, deep feature extraction approaches based on deep learning, especially the convolutional neural network (CNN), have shown remarkable advantages and driven in-depth development in face recognition technology. Among them, FaceNet [
2], proposed by Google researchers in 2015, is one of the representative methods based on the deep learning architecture of CNN in the literature.
Nowadays, computing devices using human face recognition are being increasingly applied under various contexts, such as secure access and time attendance systems.
Figure 1 depicts the architecture of a typical embedded face recognition system on a security device. In a common scenario, a computing device with user-friendly face recognition stores one or more enrollment facial images of the authorized users. When a face is presented to the camera of a security device with face recognition enabled, the captured image is converted into mathematical data by the algorithm. Thereafter, the unique mathematical representation extracted from the user's face image is compared against the underlying face database. If the probed facial features of the user match the stored facial features with a high matching score, access is granted, and shift data for recording attendance can also be collected effectively. Since deep learning (DL) with CNNs is becoming the leading technology with promising prospects for a wide range of applications, face recognition on embedded platforms with limited onboard resources, such as processors, memory, and batteries, is in urgent demand [
3].
To make automatic camera surveillance or access control systems available with the aid of face recognition, many extended devices with high computing power in the CPU or even the graphics processing unit (GPU) have been successfully applied to perform recognition and verification tasks via deep learning networks. Although several deep convolutional networks have exhibited promising accuracy of over 99% when executed on CPUs and GPUs, implementing such complex architectures directly on resource-constrained embedded devices remains a great challenge. Owing to attractive features such as high throughput and energy efficiency, embedded devices with CNN acceleration are promising platforms for image processing and deep learning applications. However, deploying DL and CNNs on embedded devices still presents challenges, mainly resulting from the limited hardware resources and the strong demand for careful design and performance optimization [
4]. Thus, a lightweight face recognition model, specifically compliant with a resource-constrained development environment, is expected to make automatic camera surveillance systems more feasible on embedded devices.
The remainder of this work is organized as follows. Related works on face recognition are discussed in
Section 2.
Section 3 develops a lighter face recognition framework conceptualized on the basis of FaceNet.
Section 4 presents the experiments and their results, along with further discussion. The conclusions and future work are summarized in
Section 5.
2. Related Works
Traditional face recognition is composed of three tasks: face detection, feature extraction, and face matching/face classification. To perform automatic face recognition, many methods have been proposed to represent faces or extract face features, such as principal component analysis (PCA) [
5], Fisher’s discriminant analysis (FDA), linear discriminant analysis (LDA) [
6], neural networks [
7], scale-invariant feature transform (SIFT), discrete cosine transform (DCT), wavelet transform, and other feature extraction methods. Among them, PCA has been one of the most widely used techniques for extracting face features, as it can be applied simply in practical problems. PCA reduces the dimensionality of face features and makes the matching/classification task easy to perform. However, PCA suffers from posture changes and variations in lighting. FDA and LDA are statistical approaches that attempt to maximize the distance between different identities and minimize the variance among samples of the same identity projected onto the face feature space. The first neural network applied to face recognition was a single-layer network, which distinguished identities by dedicating a separate network to each one. Other machine learning techniques represent a face by a predefined feature expression that characterizes the face through multiple processing modules. Once the features are extracted, classifiers, such as Support Vector Machine (SVM) [
8], Random Forest [
9], and K-nearest neighbor (KNN) [
10], can be applied to distinguish patterns of the sample images of the same person from those of different people. However, the choice of extracted features greatly influences the recognition result.
More recently, CNNs were proposed to perform feature extraction and pattern classification in a single multi-layer network. A CNN takes the original image, rather than handcrafted features, as input and learns the best feature representation automatically. As the most representative CNN-based face recognition method in the literature, FaceNet consists of a set of trained layers, known as face identifiers, and an intermediate bottleneck layer representing generalized recognition beyond the set of identifiers. It converts face images into a
k-dimensional feature space as face feature embeddings, similar to word embeddings. To discriminate the faces of one identity from those of other identities, FaceNet employs a triplet loss that enforces margin distances between pairs of faces, so that similarities and differences can be computed among various faces. The
k-dimensional embedding technology established by this model can cluster faces effectively and accurately. With FaceNet-generated face embeddings as features, face recognition and verification can subsequently be performed. Similar images result in closer distances in the embedding space, while dissimilar images have their corresponding embeddings much further apart. Although FaceNet achieves over 99% accuracy on LFW, its large model size makes it unrealistic to implement in mobile environments, let alone on embedded devices. Furthermore, OpenFace [
11] and some other models originating from FaceNet, but with smaller sizes, as stated in [
12], have been designed for use in mobile applications. These models successfully reduced the model size by about 60–80% (to 12.5–30 MB) compared with FaceNet (90 MB); however, the error rate of the reduced versions increased by up to 16 times, and they still did not fit the common hardware limitations of embedded devices.
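As a concrete illustration of the triplet loss mentioned above, the following minimal NumPy sketch computes the FaceNet-style loss max(0, ||a − p||² − ||a − n||² + α) for an (anchor, positive, negative) triple. The toy 3-D vectors and the margin value 0.2 are illustrative assumptions only; FaceNet itself operates on high-dimensional L2-normalized embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over embeddings: pull the anchor toward the positive
    (same identity) and push it away from the negative (different
    identity) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # squared distance to other identity
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy 3-D embeddings standing in for real high-dimensional ones
a = np.array([1.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0])
n = np.array([0.0, 1.0, 0.0])
# Here the positive is already much closer than the negative, so the loss is zero
loss = triplet_loss(a, p, n)
```

Swapping the roles of the positive and negative in this example yields a positive loss, which is the gradient signal that reorganizes the embedding space during training.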
Other than FaceNet, many other CNNs have been applied and modified for different applications with face recognition as the foundation. Additionally, more and more facial databases have been created to reach the goal. Almabdy et al. [
13] investigated the performance of a hybrid model in which a pre-trained CNN performs feature extraction, followed by linear SVMs for multi-class classification. In their research, two networks, AlexNet and ResNet-50, were used to extract facial features. AlexNet was additionally applied for transfer learning, with the network acting simultaneously as both feature extractor and classifier, and the results were compared with the hybrid model. To make the SVM learn the best discrimination among subjects, multiple images were needed for every identity to construct the networks. The results showed an accuracy of 94% on the LFW dataset for the SVM classifying identities based on the network-extracted features, while the transfer learning network using AlexNet achieved 95.63%. In another study, Yang et al. [
14], the CNN model for face recognition was improved by fusing the face features extracted via AlexNet with features extracted through processes of scale-invariant feature transformation (SIFT) and rotation-invariant texture features (RITF). In their study, Random Forest was used as a classifier for the fused features. According to their experiments, the enhanced fused features helped improve the model, with a true positive rate (TPR) increase of 10.97–13.24%, achieving an accuracy of 98.98% on the LFW dataset. Due to the introduction of SIFT-RITF features, the model’s computing time was greatly increased. Therefore, the study made use of a graphics processing unit (GPU) to reduce the computing time, reaching 5–6 times acceleration compared to Central Processing Unit (CPU)-based computing.
Cuculo et al. [
15] proposed a method based on a VGG-face network. The authors took advantage of face image augmentation to increase the recognition ability of their network, which requires only a single reference image per subject. With a sparse sub-dictionary learning process, the model derives a concise description for each face image, and the identity is recognized via an
L0-norm minimization algorithm with a majority voting optimizer. The results showed effectiveness on large datasets, with the model outperforming other state-of-the-art approaches on very low-resolution images and images with some disguises. Abdallah et al. [
16] proposed a zero-shot learning model consisting of 19 CNN layers for person spotting and face clustering in video stream data. The proposed network extracts face feature vectors similar to FaceNet-extracted embeddings from the pre-whitening processed video frames. Prior to clustering, softmax loss was applied to calculate the face similarity among all feature vectors. New face clusters would be created if the similarity values did not match any existing clusters under a preset threshold. Their model outperformed conventional clustering methods, including
k-means, spectral clustering, and hierarchical clustering, with the F-measure of 0.935 on the LFW dataset.
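The cluster-creation rule described above (join an existing cluster when the similarity is high enough, otherwise start a new one) can be sketched as follows. This is an illustrative simplification using cosine similarity to running centroids and a hypothetical threshold of 0.7, not the authors' exact zero-shot model:

```python
import numpy as np

def incremental_face_clustering(embeddings, threshold=0.7):
    """Assign each embedding to the cluster whose centroid is most
    similar (cosine similarity); create a new cluster when no similarity
    reaches `threshold`. Returns one cluster label per embedding."""
    centroids, labels = [], []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)  # L2-normalize so dot product = cosine
        if centroids:
            sims = [float(c @ e) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                # running-mean centroid update, re-normalized
                c = centroids[best] + e
                centroids[best] = c / np.linalg.norm(c)
                continue
        centroids.append(e)               # no match: open a new cluster
        labels.append(len(centroids) - 1)
    return labels
```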
Liu et al. [
17] modified FaceNet and constructed a liveness detection model by attaching a Kinect infrared (IR) sensor to prevent face spoofing attacks while simultaneously performing face recognition. They mainly introduced a classification method, a support vector machine (SVM) following FaceNet, to determine the similarity of the recognized face to various other faces. Their approach effectively prevents face spoofing while recognizing a face for authentication purposes. However, the structure of FaceNet was not reduced in their work, so its computational burden remained heavy and unsuitable for deployment on embedded devices. For most similar studies using multiple spectral images, such as visible and IR images, image registration is the main issue, as discussed in [
18,
19]. Images from multiple sensors must be fused before applying face recognition. Additionally, a threshold needs to be predefined for distinguishing the recognized face from other faces in the database on the basis of the predicted similarity values. The authors also built an unknown category to tackle the issue of false recognition. The obtained results showed a tremendous reduction in false recognition rate, and the model with liveness detection and face recognition was able to be deployed for identity authentication. Lee et al. [
20] constructed a lightweight and computationally efficient model to perform face recognition for a stand-alone access control system. The proposed model was based on the framework composed of the local binary pattern (LBP) and the AdaBoost classifier. The Gabor-LBP histogram was modified by applying Gaussian derivative filters as alternatives to Gabor wavelets to extract facial features. In addition, AdaBoost was used to perform a rapid face and eye detection with the model invariant to illumination changes. The results showed an accuracy of 97.27% and 99.06% on the E-face and the XM2VTS datasets, respectively.
Three-dimensional (3D) face recognition is another branch of face recognition techniques for overcoming issues arising from ambient light, background, and shooting angle. Supported by auxiliary 3D face images and 3D imaging technologies, more accurate results can be obtained [
21,
22,
23,
24]. However, the requirements for computational power and equipment increase heavily and become dominant. To achieve a lightweight architecture while retaining efficient and highly accurate face recognition, MobiFace [
25] and ShuffleFaceNet [
26] have recently been proposed. Their accuracy is maintained or even improved despite their lighter structures, which significantly reduce the number of network parameters. However, both networks were still developed without assuming any hardware limitations, let alone taking hardware supportability into consideration.
Although several face recognition models based on deep learning approaches, as reviewed above, have already achieved over 99% accuracy, their architectures make them unsuitable for embedded devices with limited hardware resources. Despite the dramatic performance of DL and CNNs, most embedded devices cannot support their application while retaining low latency, low power usage, and high precision within the computational resource constraints. To make face recognition implementable on embedded devices using only grayscale images, this study proposes a lightweight learning network conceptualized on FaceNet, called FN13, designed around the main hardware limitations encountered when deploying deep learning on an embedded target. On the basis of a project cooperation with Holtek Semiconductor, a leading professional IC design house in Taiwan, the proposed FN13 model can be deployed in a fully integrated device (HT82V82) as an embedded system featuring AI computing for facial recognition applications [
27]. To comply with the development constraints for making FN13 work in the integrated HT82V82 package, three limitations are considered: first, the constraints on the convolutional layers; second, the constraints on the pooling layers; and last but not least, the requirement of using padding for all layers. It is also worth mentioning that the proposed FN13 model greatly diminishes the large number of parameters required by FaceNet with very little sacrifice of accuracy on the dataset collected from the Labeled Faces in the Wild (LFW) database [
28], which was used for evaluating the performance of the proposed face recognition scheme. By using one-shot learning without the need for a retraining process, FN13 can effectively distinguish faces of various identities even when the postures and the visible portions of the faces vary. Prior to detailing our proposed method,
Figure 2 depicts the considerable variations in pose and lighting conditions that are possible when recognizing a face, and shows the feasible recognition ability of FN13.
4. Experiments
The recognition models were implemented on a general PC development setup running the Microsoft Windows 10 operating system. To compare the performance of FN13 and FaceNet in various computing environments, the experiments were conducted on a CPU-based device with and without GPU support.
4.1. LFW Dataset
The dataset used to conduct experiments on the proposed FN13 was collected from the LFW. LFW was designed for unconstrained face recognition studies, with more than 13 thousand name-labeled face images, including 1680 individuals having more than one distinct photo, all collected from the web [
28]. In order to test FN13's capability for face recognition, the entire dataset of more than 13 thousand photos was used as the test data in the experiments.
Figure 8 shows a few image examples from the LFW dataset. The FN13 model was pre-trained using the CASIA-WebFace [
31], crawled from the Internet by the Institute of Automation, Chinese Academy of Sciences, so that FN13 learns the best feature representation of each face photo as an embedding. Thereby, FN13 can generate the feature embedding corresponding to each test image.
4.2. Performance Evaluation
Figure 9 depicts the confusion matrix for a binary classifier, which is the face recognition system in this manuscript, to evaluate the overall performance. All subjects are categorized into two groups, namely, condition positive and condition negative. The condition positive group consists of all sample images belonging to a subject that is to be identified, while the condition negative group includes other sample images. Accordingly, four compound conditions, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), can be defined to analyze the correctness and robustness of the proposed FN13 model. To evaluate the performance of FN13 and compare it with other state-of-the-art face recognition models, the model accuracy, precision, recall, and F
1-score are computed.
Precision and recall indicate, respectively, the proportion of true positives among all predicted positives and the proportion of correctly predicted positives among all actual positive samples. The higher the precision, the greater the degree of correctness for positive predictions; similarly, the higher the recall, the stronger the ability of the recognition model to retrieve all positive samples. In addition, the F1 score is the harmonic mean of precision and recall, taking both measures into account to deal with the issue of uneven class samples.
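The four measures can be computed directly from the confusion-matrix counts; a small sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from the
    confusion-matrix counts (TP, TN, FP, FN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # correctness of positive predictions
    recall = tp / (tp + fn)     # coverage of actual positive samples
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```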
Table 3 summarizes the performance comparison among some typical CNN-based recognition models, including computational complexity in FLOPs, the number of parameters, and the memory storage of models. As can be seen, FN13 is superior to the others in terms of complexity in FLOPs. This is due to the fact that FN13 is under the constraint of limited hardware support in computing resources, where no more complicated arithmetic is allowed. As a consequence, FN13 has a slightly larger model size compared to FaceNet, since there exists a trade-off between reducing complexity and reducing the size of the model.
4.3. Experimental Results
Firstly, the LFW dataset was used to evaluate the proposed FN13 model, with ten-fold cross-validation used to select an appropriate L2-normalized distance threshold for discriminating among the different classes based on FN13-generated face embeddings. The LFW dataset was split into 10 folds, with 9 folds for training and one for testing, and 0.6 was chosen as the optimal threshold for all tests. The performance of FN13 was compared to FaceNet, and the results are listed in
Table 4. To cope with the constrained hardware resources of embedded devices, FN13 exhibits a lighter structure with fewer layers and fewer parameters than FaceNet, which results in a higher false acceptance rate (FAR). Nevertheless, using only grayscale images, the overall accuracy for recognizing known faces remains high, at 96.65%.
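The verification step behind these results can be sketched as follows, assuming L2-normalized embeddings and the cross-validated threshold of 0.6 described above; the embedding values in the test are placeholders:

```python
import numpy as np

def same_identity(emb1, emb2, threshold=0.6):
    """Verify a face pair: L2-normalize both embeddings and compare their
    Euclidean distance against the cross-validated threshold."""
    e1 = np.asarray(emb1, dtype=float)
    e2 = np.asarray(emb2, dtype=float)
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(np.linalg.norm(e1 - e2)) < threshold
```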
To compare the time performance of FN13 and FaceNet, both models were implemented in computing environments with and without GPU acceleration, and the experiment was conducted using 300 input images. As shown in
Table 5, with its lighter structure, FN13 took less time than FaceNet to recognize a face in both computing environments and showed a great improvement in time when using a GPU for acceleration. FN13 was further improved by introducing more image samples per individual subject to enhance its recognition ability. For instance, FN13 became more accurate when taking three samples rather than one into account, while sacrificing little in terms of time.
Once the device captures a face image, the corresponding 512-D facial descriptor is generated via FN13. To further identify the identity of the face, Mahalanobis distances between the generated facial descriptor and the stored multi-sample embeddings for each identity class are then calculated. The face image should be assigned to the identity class with the shortest distance among all classes. However, the face image could represent an unregistered face. Thus, to avoid such misidentification, a threshold is preset as a criterion. If the minimum distance between the facial descriptor and stored embeddings is less than the threshold, the face image would be assigned to the ID belonging to the identity class with the shortest distance from any of its multi-samples. Otherwise, the recognition system would deny access to this unrecognized identity. Algorithm 1 describes the steps to perform embedding matching with multiple samples.
Algorithm 1. Embedding Matching with Multiple Samples
Input: a 512-D facial descriptor x (the embedding extracted by FN13).
Output: recognized ID y.
1. Get the stored facial embedding set E_i of identity class i, where N_i denotes the number of sample embeddings in identity class i.
2. Calculate the Mahalanobis distance (M-distance) between x and each facial embedding e_ij in E_i as d_ij.
3. Update the minimum distance d_min and y if any d_ij < d_min exists; that is, d_min ← d_ij and y ← i.
4. Let i ← i + 1 and repeat steps 1–3 until no more identity classes are found.
5. Given a preset threshold θ: if d_min < θ, then assign the output y. Otherwise, set the output to −1 as an unrecognized identity.
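A minimal sketch of the matching procedure in Algorithm 1, assuming the gallery embeddings and the inverse covariance matrix of the embedding space have been prepared offline; the 2-D test vectors are toy placeholders for the 512-D descriptors:

```python
import numpy as np

def match_embedding(x, gallery, inv_cov, threshold):
    """Find the registered identity whose stored sample embedding is
    nearest to descriptor `x` under the Mahalanobis distance; return -1
    when the minimum distance exceeds `threshold` (unrecognized).

    gallery: dict mapping identity id -> list of stored embeddings
    inv_cov: inverse covariance matrix of the embedding space (assumed
             estimated offline from enrollment data)"""
    best_id, best_dist = -1, float("inf")
    for identity, samples in gallery.items():
        for e in samples:
            diff = np.asarray(x, dtype=float) - np.asarray(e, dtype=float)
            dist = float(np.sqrt(diff @ inv_cov @ diff))  # Mahalanobis distance
            if dist < best_dist:
                best_id, best_dist = identity, dist
    return best_id if best_dist < threshold else -1
```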
Under these circumstances, FN13 obtained 98.41% accuracy, about 2% more accurate than using one sample, with a delay of only 0.00097 s. This is because the model generates the corresponding embedding for the input image only once. In addition, FN13 with more samples per identity can accommodate the grayscale input, which is less informative than the color images used by FaceNet, without losing accuracy. This makes FN13 superior to FaceNet. With respect to the sizes of the input and of the embedding used by the two recognition approaches, FN13 notably requires much less than FaceNet. This also demonstrates the potential for porting the FN13 model to real-time onboard processing.
For recognizing a face in a real-time face recognition system, the process consists of four stages after a captured facial image is loaded into the system: face detection, embedding generation, face recognition, and output response. For detecting a face, Haar-like features that capture the structural similarities within faces work effectively. The detected face can then be recognized via the face embedding generated by FN13. The recognition result is stored for further analysis if a known face exists in the input image, or an alert may be sent if a possible intrusion is detected.
Figure 10 illustrates the real-time processing flowchart with FN13 implemented in a vision-based human face recognition system. The raw input for our system is the face image of a user, acquired in real time from an optical camera.
Figure 11 shows an example of recognizing a known face using three samples via FN13 under various conditions, including the variation in terms of occlusion, lighting, and posture changes.
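The four-stage flow described above can be sketched as a framework-free pipeline. The detection, embedding, and matching stages are injected as callables, since in practice detection would use a Haar-feature cascade and embedding would be the FN13 forward pass; all names and the stub stages in the example are illustrative:

```python
def recognize_frame(frame, detect_face, embed, match, threshold=0.6):
    """Four-stage flow: detect -> embed -> match -> respond.

    detect_face(frame) -> cropped face or None
    embed(face)        -> facial descriptor (e.g., FN13's 512-D embedding)
    match(descriptor)  -> (identity, distance) for the nearest gallery entry
    """
    face = detect_face(frame)
    if face is None:
        return {"status": "no_face"}
    descriptor = embed(face)
    identity, distance = match(descriptor)
    if distance < threshold:
        return {"status": "granted", "id": identity}  # known face: log attendance
    return {"status": "alert"}  # possible intrusion: unregistered face
```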
In addition to the performance evaluation for real-time recognition, generated feature embeddings can be inspected via various classifiers as mentioned in [
36]. The features are obtained by the same CNN structure with the softmax and center loss functions. The classification results of various classifiers using the generated features are then compared with those of FN13 under the same loss functions.
Table 6 presents the performance comparison of FN13 with various typical classifiers, including eXtreme Gradient Boosting (XGBoost) [
37], Light Gradient Boosted Machine (LightGBM) [
38], and Gradient Boosting Decision Tree (GBDT) [
39], on the LFW dataset. As can be observed, FN13-generated face embeddings provide good face representations and enable most classifiers to distinguish different faces effectively; moreover, FN13 outperforms the other models.
Figure 12 shows the similarity between faces via the Pearson product–moment correlation, with values ranging within [−1, 1]. The vertical and horizontal axes are formed by the first eight sample images listed in the LFW dataset. The correlation between two samples is calculated based on their corresponding face embeddings, so all the values on the diagonal, which represent pairs of identical sample images, have the darkest background color. The darker the background color (the larger the value), the greater the similarity of the image pair; zero correlation suggests that the two images share nothing in common in their face embeddings. Apart from the diagonal in
Figure 12, the correlations for image pairs of two different samples lie within the range [−0.24, 0.33]. This implies that FN13-generated face embeddings can easily specify and separate different identities into face classes. Consequently, the face recognition system can achieve a good true negative rate, while the overall accuracy is maintained.
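A similarity matrix of this kind can be computed for any set of embeddings with NumPy's built-in Pearson product–moment correlation; a brief sketch:

```python
import numpy as np

def embedding_correlations(embeddings):
    """Pairwise Pearson product-moment correlation matrix of face
    embeddings (one embedding per row); values lie in [-1, 1], with 1
    on the diagonal for identical samples."""
    return np.corrcoef(np.asarray(embeddings, dtype=float))
```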
4.4. Discussion
To further analyze the correlation among different identities, the image pair sharing a correlation coefficient of 0.33 in the LFW example is discussed. As can be seen in
Figure 13, the image of a man wearing sunglasses on the left has a larger similarity value with the image of another man with a mustache. This may result from significant facial features being covered and disguised by accessories or facial hair. As a consequence, the facial pattern recognition model has difficulty extracting facial descriptors, such as the shape and depth of the eye contour, the distance between the eyes, or even the eyebrows, which means the model has to rely on other descriptors, lowering its discrimination ability. Additionally, it turns out that the Euclidean distance measure used by FaceNet cannot distinguish the patterns among various classes due to dependencies between the generated embedding vectors. The Mahalanobis distance measure was therefore considered as an alternative to remove the abovementioned relations among the signatures, via the following equation:
D = √((x − μ)ᵀ C⁻¹ (x − μ)),
where
D is the Mahalanobis distance,
x is the vector of the facial descriptor,
μ represents the mean vector of the facial descriptors, and
C is the covariance matrix of the vectors.
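A minimal sketch of this distance follows; note that when C is the identity matrix, the Mahalanobis distance reduces to the ordinary Euclidean distance, which is the special case the FaceNet measure corresponds to:

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """D = sqrt((x - mu)^T C^{-1} (x - mu)): Euclidean distance after
    whitening by the covariance, removing dependencies between the
    embedding dimensions."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```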
Next, a self-collected, non-public dataset was used to further validate FN13.
Figure 14 shows a collection of images that were tested and recognized by FN13. To test the model's capability and observe conditions that cause recognition errors, subjects were asked to alter their usual appearance or to intentionally avoid facing the camera. The experiments revealed two types of errors. The first is false acceptance, where the recognition model misidentifies one subject as another registered subject. The other is false rejection, where a registered subject cannot be identified.
Figure 15 and
Figure 16 show some case images that resulted in false acceptance and false rejection, respectively. It can be observed from the images that the recognition errors occurred while the test subjects were wearing accessories or had their vital facial signatures disguised. Additionally, it was found that the signatures surrounding the eyes are the most decisive: closed eyes, shades overlaying the eye area, and hair covering the eyes all caused the model to fail. As the corresponding similarity analysis shows in
Figure 17 for those false acceptance cases, the correlation among them was higher than in the normal situation. In other words, the similarity becomes higher if two identities intentionally wear the same accessories. Despite this observation, it is worth noting that an individual with or without an accessory still yields the highest similarity with themselves, for instance, Daniel with or without glasses, and Peter with or without glasses. This intriguing finding again validates that the correctness and robustness of our proposed model can be improved by adding a few more samples for recognizing one identity.
5. Conclusions
In this paper, FN13, a novel lightweight face recognition model based on FaceNet, was proposed to overcome the hardware limitations that arise when the recognition model is implemented on an embedded target. To achieve real-world deployment of deep learning on the target device HT82V82, developed by Holtek, FN13 was designed in consideration of the constrained resources of the embedded environment, such as the constraints on input size, on the specific layer sizes usable in deep learning models, and on the supported operations. In contrast to FaceNet, FN13 directly trains its output as a compact 512-D embedding and uses the center loss function for training. Through center loss, we can effectively reduce the distances between samples of the same category and enlarge those between different categories. Center loss enables FN13 to achieve better balance and performance for recognizing faces with constrained hardware resources.
With its light structure and computational efficiency, FN13 can also be implemented in a real-time camera surveillance system to track, identify, and monitor subjects. Based on the proposed method, the theoretical design principles of deep learning are clarified to achieve more efficient simplification and robustness in the execution of face recognition algorithms. As shown in the experiments, FN13 generates a good representation of faces in sample images via face embeddings, and it achieves great performance, while the size and the parameters used to construct the face recognition model are significantly reduced.