Review

A Systematic Review of CNN Architectures, Databases, Performance Metrics, and Applications in Face Recognition

by Andisani Nemavhola 1,†, Colin Chibaya 2,*,‡ and Serestina Viriri 3,‡

1 School of Consumer Intelligence and Information Systems, University of Johannesburg, Johannesburg 2197, South Africa
2 School of Natural and Applied Sciences, Sol Plaatje University, Kimberley 8300, South Africa
3 School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban 4000, South Africa
* Author to whom correspondence should be addressed.
† This author contributed 80% to this work.
‡ These authors contributed equally to this work.
Information 2025, 16(2), 107; https://doi.org/10.3390/info16020107
Submission received: 27 November 2024 / Revised: 25 January 2025 / Accepted: 27 January 2025 / Published: 5 February 2025
(This article belongs to the Special Issue Machine Learning and Data Mining for User Classification)

Abstract:
This study provides a comparative evaluation of face recognition databases and Convolutional Neural Network (CNN) architectures used in training and testing face recognition systems. The databases span from early datasets like Olivetti Research Laboratory (ORL) and Facial Recognition Technology (FERET) to more recent collections such as MegaFace and Ms-Celeb-1M, offering a range of sizes, subject diversity, and image quality. Older databases, such as ORL and FERET, are smaller and cleaner, while newer datasets enable large-scale training with millions of images but pose challenges like inconsistent data quality and high computational costs. The study also examines CNN architectures, including FaceNet and Visual Geometry Group 16 (VGG16), which show strong performance on large datasets like Labeled Faces in the Wild (LFW) and VGGFace, achieving accuracy rates above 98%. In contrast, earlier models like Support Vector Machine (SVM) and Gabor Wavelets perform well on smaller datasets but lack scalability for larger, more complex datasets. The analysis highlights the growing importance of multi-task learning and ensemble methods, as seen in Multi-Task Cascaded Convolutional Networks (MTCNNs). Overall, the findings emphasize the need for advanced algorithms capable of handling large-scale, real-world challenges while optimizing accuracy and computational efficiency in face recognition systems.

1. Introduction

Face recognition has piqued the interest of researchers in the fields of artificial intelligence and computer vision, and it has made tremendous progress over the past four decades [1]. Face detection, face alignment, feature extraction, and classification are the steps of a practical face recognition system, with feature extraction being one of the most important problems [2]. To obtain excellent recognition performance, it is crucial to find strong descriptors of face-region appearance that are distinctive, robust, and computationally inexpensive [3].
Face recognition has been widely used in a variety of applications such as surveillance, face-based attendance, border control gates, entry and exit from public communities, and facial security checks at airports and railway stations [1]. Face identification in unconstrained situations is a tough challenge due to differences in head pose, lighting, age, and facial expression. Additionally, cosmetics, facial hair, and accessories (such as scarves or spectacles) may alter one's appearance. The resemblance between people (e.g., twins and relatives) presents another challenge to face recognition [4,5]. Face recognition is the most favoured biometric for identity recognition for the following reasons: (1) high accuracy, (2) cross-platform use, (3) reliability [6], and (4) consent-free capture: unlike other biometric systems, face recognition does not require consent from the subject [7].
In our scoping review [8], we consulted 266 CNN face recognition articles published from 2013 to 2023 and found the following gaps in the literature: (1) most researchers study face recognition on images rather than videos; (2) researchers prefer clean images over occluded ones, which is a problem because it degrades model performance in real-world settings, where occlusion exists; and (3) far more research has been conducted using traditional CNNs than other CNN architectures.
The objectives of this systematic review are as follows:
  • To determine which techniques have been applied in the face recognition domain.
  • To identify which databases of face recognition are most common.
  • To find out which areas have adopted face recognition.
  • To assess and identify suitable evaluation metrics to use when comparative studies are carried out in the field of face recognition.
This study seeks to review CNN architectures, databases, metrics, and applications for face recognition.

2. Face Recognition History

  • 1964: American researchers investigated computer programming for face recognition. They envisioned a semi-automated process in which users input twenty computer measurements, such as the length of the mouth or the width of the eyes [9]. The computer would then automatically compare the distances in each picture, determine how much they differed, and propose a potential match from stored records [10].
  • 1970: Takeo Kanade introduced a facial recognition system that considered the spacing between facial features to identify anatomical elements, including features like the chin. Subsequent trials showed that the system’s ability to accurately recognize face characteristics was not always consistent. Yet, as curiosity about the topic increased, Kanade produced the first comprehensive book on face recognition technology in 1977 [11].
  • 1990: Research on face recognition increased dramatically due to advancements in technology and the growing significance of applications connected to security [5].
  • 1991: Eigenfaces [12], a facial recognition system that uses the statistical principal component analysis (PCA) approach, was introduced by Alex Pentland and Matthew Turk of the Massachusetts Institute of Technology (MIT) as the first effective example of facial identification technology [9].
  • 1993: The Defense Advanced Research Project Agency (DARPA) and the Army Research Laboratory (ARL) launched the Face Recognition Technology Program (FERET) with the goal of creating “automatic face recognition capabilities” that could be used in a real-world setting to productively support law enforcement, security, and intelligence personnel in carrying out their official duties [13].
  • 1997: The PCA Eigenface technique of face recognition was refined by employing linear discriminant analysis (LDA) to generate Fisherfaces [14].
  • 2000s: Hand-crafted features such as Gabor features [14,15], local binary patterns (LBPs), and variations became popular for face recognition. The Viola–Jones object identification framework for faces was developed in 2001, making it feasible to recognize faces in real time from video material [16].
  • 2011: Deep learning, a machine learning technology based on artificial neural networks [17], accelerated everything. The computer chooses which points to compare: it learns faster when more photos are provided. Later studies aimed to enhance the performance of existing approaches by exploring novel loss functions such as ArcFace.
  • 2015: The Viola–Jones method was implemented on portable devices and embedded systems employing tiny low-power detectors. As a result, the Viola–Jones method has been used to enable new features in user interfaces and teleconferencing, further broadening the practical use of face recognition systems [18].
  • 2022: Ukraine is utilizing Clearview AI face recognition software from the United States to identify deceased Russian servicemen. Ukraine has undertaken 8600 searches and identified the families of 582 Russian troops who died in action [19].

3. Application of Face Recognition

3.1. Security and Surveillance

Face recognition is frequently used in surveillance systems for security, especially in places like banks, government buildings, and airports. By detecting criminals, people of interest, or those on watchlists, it increases safety. Furthermore, cellphones and other gadgets include this technology for user identification, providing safe and practical access [20].

3.2. Law Enforcement

Through the comparison of facial photographs with criminal databases, law enforcement organizations use face recognition technology to identify criminals or find missing individuals. This technology speeds up case resolution by automating the process of looking through vast amounts of photos and videos [21].

3.3. Healthcare

Face recognition may be used in the healthcare industry to identify and monitor patients, guaranteeing that the appropriate person receives the proper therapy. Furthermore, facial expressions and characteristics can be examined to track patient reactions to medicine, identify early indicators of mental health conditions like depression, and monitor emotional states [22].

3.4. Access Control

Access control in secure facilities also uses facial recognition to make sure that only people with permission may enter areas that are off-limits. Automating time-tracking and attendance procedures in offices with facial recognition software lowers the possibility of fraud and mistakes that come with manual approaches [20].

3.5. Automotive Industry

Face recognition technology in automobiles may be used to identify drivers [23] and personalize settings like temperature, seat position, and favorite routes. To improve road safety, it may also track driver attentiveness and identify signs of exhaustion or distraction [24].
Table 1 summarizes the different uses of facial recognition technology, which include security, surveillance, healthcare, and mobile devices.

4. Face Recognition Systems

A face recognition system consists of four steps: face detection, face pre-processing, feature extraction, and face matching [26]. The stages are illustrated in Figure 1 below.

4.1. Face Recognition Systems Traditionally Consist of Four Main Stages

  • The face is captured in an input photograph or video.
  • Pre-processing is the process of applying several techniques to an image or video, such as alignment, noise reduction, contrast enhancement, or video frame selection.
  • Extracting facial features from a picture or video. Holistic, model-based, or texture-based feature extraction approaches are used in image-based methods, whilst set-based or sequence-based approaches are used in video-based methods.
  • Face matching is performed against a database of stored images. If a corresponding image exists in the database, a match is returned; otherwise, no match is found.
We provide a quick overview of face detection and facial land-marking approaches in the literature below. Face detection and facial land-marking techniques that are accurate and effective improve the accuracy of face recognition systems.
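For illustration, the following minimal sketch outlines these four stages in code; the detector, alignment, and embedding functions are placeholders standing in for whichever concrete components a given system uses, not an implementation from any of the cited studies.

```python
import numpy as np

def recognize(image, detect_faces, align, embed, gallery, threshold=0.6):
    """detect_faces, align, and embed are assumed callables supplied by the user;
    gallery maps an identity name to a reference embedding (1-D numpy array)."""
    results = []
    for box in detect_faces(image):               # 1. face detection
        face = align(image, box)                  # 2. pre-processing / alignment
        query = embed(face)                       # 3. feature extraction
        # 4. face matching: nearest gallery identity under Euclidean distance
        name, dist = min(((n, np.linalg.norm(query - ref)) for n, ref in gallery.items()),
                         key=lambda item: item[1])
        results.append((name, dist) if dist < threshold else ("unknown", dist))
    return results
```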

4.1.1. Face Detection

The process of face detection involves determining the bounding-box of a face within a certain picture or video frame. Every face in the pictures is identified if there are several. In addition to removing the backdrop as much as possible, face detection should be resistant to changes in position, lighting, and size [5,27].
The ability to identify faces at various scales is a significant difficulty in face detection. Even with the usage of deep CNNs, this problem still exists. Deeper networks alone cannot solve the problem of detection across scales [28]. Despite these challenges, great progress has been achieved in the past decade, with several systems demonstrating excellent real-time performance. These algorithms’ recent advancements have also made major contributions to recognizing other objects such as humans/pedestrians and automobiles [29].

4.1.2. Face Preprocessing

This stage's goal is to reduce the characteristics that cause photographs of the same person to differ from one another (intra-class differences) while preserving those that distinguish them from other people (inter-class differences) [30].

4.1.3. Face Extraction

The technique of extracting face component characteristics such as the eyes, nose, and mouth from a human face picture is known as facial feature extraction. Facial feature extraction is critical for the start-up of processing techniques such as face tracking, facial emotion detection, and face identification [31]. Every face is unique and may be recognized by its structure, size, and shape. Common approaches therefore identify a face by extracting the shape of the mouth, eyes, or nose and using their size and relative distances [32].

4.1.4. Face Matching

The process of face matching involves comparison of a digital target image against a stored image.

5. Methodology

Our systematic review’s primary objective is to identify the databases, approaches, and issues that face recognition is now facing. Future research in this field will be based on the knowledge gathered from this study. The goals and research topics covered in this study are displayed in Table 2.
In the following subsection, we will cover the steps suggested to identify and screen the studies in our systematic review.

5.1. Data Selection

Twelve internet databases provided the data used in this study, which were utilized to generate a population of pertinent papers for the systematic review. These included EBSCOhost (GreenFile), ScienceDirect, MasterFile, Emerald ERIC, ProQuest, Taylor & Francis, Cambridge Core, JSTOR, ACM Digital Library, Springer, IEEE Xplore Digital Library, and Premier Scopus. To ensure literature currency, only articles published between 2013 and 2023, inclusive, were considered. Following the PRISMA technique [33], the records were screened using the four stages of a systematic review, which are detailed below.

5.1.1. Inclusion and Exclusion Criteria

The research questions served as guidance throughout the evaluation process. Table 3 and Table 4 show the inclusion and exclusion criteria used to find relevant papers for the investigation.

5.1.2. Search Strategy

A detailed search strategy was created employing keywords and Boolean operators. The primary search phrases were the following:
  • “Facial recognition”;
  • “Face recognition technology”;
  • “Biometric identification”;
  • “Deep learning in facial recognition”;
  • “Privacy and facial recognition”;
  • “Bias in facial recognition”;
  • “Face recognition applications”;
  • “Ethics of facial recognition”;
  • “Convolutional Neural Networks for facial recognition”;
  • “Deep learning”.
Based on the research questions, a search string was created to specify the studies’ parameters. To integrate pertinent terms, boolean operators were employed:
(“Facial recognition” OR “Face recognition technology” OR “Biometric identification”) AND (“Deep learning” OR “Deep learning in facial recognition” OR “Convolutional Neural Networks”) AND (“Privacy and facial recognition” OR “Bias in facial recognition” OR “Ethics of facial recognition” OR “Face recognition applications”).
The search did not use any population filter; instead, all fields were explored. A total of 3622 records were found across the twelve databases using this search strategy.

5.1.3. Screening

In order to eliminate duplicate records, the first filter was applied during the screening phase. Across the twelve databases, two duplicate records were identified and removed, leaving 3620 records.

5.1.4. Eligibility and Inclusion of Articles

Four categories were created from the 266 articles that met the requirements for inclusion. We first examined the distribution of these papers by year of publication and noted their primary objectives and areas of emphasis. We then examined the distribution of the articles according to the CNN types they employed for facial recognition. The outcomes for each category are described in the remainder of this section.

5.1.5. Quality Assessment Rules (QAR)

A quality evaluation procedure ensures that only the most credible and relevant studies are included in a review or analysis. The questions in Table 5 give a systematic approach to evaluating research based on essential characteristics such as clarity, study design, and the validity of outcome measures. Figure 2 below shows the PRISMA-ScR methodology that was applied when filtering the articles.

6. Databases

Researchers studying face recognition have access to many face datasets. The databases can be offered with free access to data or for sale. Figure 3 shows the databases used in face recognition systems.

6.1. ORL Database

The ORL Database of Faces provides a collection of lab-taken facial photographs from April 1992 to April 1994 [5]. The database was utilized as part of a face recognition experiment conducted in partnership with Cambridge University Engineering Department's Speech, Vision, and Robotics Group. There are 10 different photos for each of the forty subjects. Some subjects were photographed at different times, with variable lighting, facial expressions (open or closed eyes, smiling or not smiling), and facial details (glasses or no glasses). All photographs were taken against a dark, homogeneous background, with the individuals standing upright and facing forward [5]. Figure 4 below shows a preview picture of the database.

6.2. FERET Database

The Face Recognition Technology (FERET) database is a dataset used to evaluate face recognition systems as part of the Face Recognition Technology (FERET) initiative. It was collected from 1993 to 1996 as a collaboration between Harry Wechsler at George Mason University and Jonathan Phillips at the Army Research Laboratory in Adelphi, Maryland [34]. The FERET database is a standard library of face photos that academics may use to design algorithms and report findings. The usage of a common database also enables one to evaluate the effectiveness of different methodologies and measure their strengths and shortcomings [34]. The objective was to create machine-based facial recognition for authentication, security, and forensic applications. Facial pictures were captured during 15 sessions in a semi-controlled setting. The dataset includes 1564 sets of images comprising 14,126 pictures of 1199 individuals, including 365 duplicate sets [5]. Figure 5 below shows a preview picture of the database.

6.3. Yale Face Database

The Yale Face Database, which was established in 1997, has 165 GIF-formatted grayscale photos of 15 different people [35]. There are 11 pictures for each subject, one for each distinct configuration of expression or detail: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink [35].

6.4. AR Database

Aleix Martinez and Robert Benavente created the AR Face database in 1998 at the Computer Vision Center (CVC) of the Universitat Autònoma de Barcelona in Barcelona, Spain [20,36,37]. This database focuses on face recognition but can also be used for facial expression recognition. The AR database includes almost 4000 frontal pictures from 126 participants (70 men and 56 women). Each subject had 26 samples recorded in two sessions on different days, with photos measuring 768 × 576 pixels [20,36,37]. Figure 6 below shows a preview picture of the database.

6.5. CVL Database

The CVL Face Database was established in 1999 by Peter Peer [38]. The database features 114 people, each associated with 7 images that were captured in a controlled environment [38].

6.6. XM2VTS Databases

The XM2VTS frontal data collection includes 2360 mug images of 295 individuals gathered during four sessions [39]. Figure 7 below shows a preview picture of the database.

6.7. BANCA Database

The BANCA database was created in 2003 to evaluate multi-modal verification systems in a variety of settings (controlled, degraded, and adverse) using two cameras and two microphones as acquisition devices [40]. Video and audio data were gathered for 52 respondents (26 males and 26 females) across twelve sessions in four distinct languages (English, French, Italian, and Spanish), for a total of 208 people [40]. Every population that was specific to a language and a gender was further split into two groups of thirteen participants each [40].

6.8. FRGC (Face Recognition Grand Challenge) Database

The FRGC Database was collected between May 2004 and March 2006; it consists of 50,000 recordings separated into training and validation divisions [41]. The controlled photographs were captured in a studio environment and are full frontal facial images with two lighting settings and two facial emotions (smiling and neutral). The uncontrolled photographs were taken in various lighting circumstances, such as corridors, atriums, and outdoors [41]. Figure 8 below shows a preview picture of the database.

6.9. LFW (Labeled Faces in the Wild) Database

Labeled Faces in the Wild (LFW) is a library of facial photos created in 2007 to investigate the topic of unconstrained face recognition [36]. Researchers at the University of Massachusetts, Amherst established and maintain this database (particular references are provided in the Acknowledgements section). The Viola–Jones face detector recognized and centered 13,233 photos of 5749 people taken from the internet. In the dataset, 1680 persons are represented by two or more photographs [36]. Figure 9 below shows a preview picture of the database.

6.10. The MUCT Face Database

Stephen Milborrow, John Morkel, and Fred Nicolls of the University of Cape Town created the MUCT database in 2008 [42]. The MUCT database has 3755 face images, each annotated with 76 manual landmarks. The database was developed to increase variety in lighting, age, and ethnicity [42].

6.11. CMU Multi-PIE Database

The CMU Multi-PIE face database comprises over 750,000 photos of 337 persons captured in up to four sessions over the course of five months [43]. Subjects were photographed using 15 view angles and 19 lighting settings while making a variety of facial expressions. High-resolution frontal pictures were also captured [43]. Figure 10 below shows a preview picture of the database.

6.12. CASIA-Webface Database

The CASIA-WebFace large-scale dataset was proposed by Yi et al. [44] in 2014 for the facial recognition challenge. It was acquired semi-automatically from the IMDb website and contains 494,414 photographs of the faces of 10,575 individuals. CASIA-WebFace can be considered an independent training set for LFW [44].

6.13. IARPA Janus Benchmark-A Database

The IARPA Janus Benchmark A (IJB-A) dataset contains 5712 face images and 2085 face videos collected from the web [45]. A total of 500 distinct people contributed these facial data, with each subject represented on average by about 11 images and 4 videos [45].

6.14. Megaface

The MegaFace dataset is a massive facial recognition dataset intended to test and enhance face recognition capabilities at scale [46]. It is one of the biggest datasets for face recognition system training and benchmarking, with over 1 million tagged photos representing 690,000 distinct identities [46]. MegaFace provides a strong platform for testing face verification and identification algorithms, including a probe set for testing and a gallery set containing the majority of identities [46]. Despite the fact that its magnitude offers important insights into actual face recognition situations, the dataset’s web-based collection presents problems including noise and poor image quality [46].

6.15. IARPA Janus Benchmark-B Database

The IARPA Janus Benchmark-B dataset pushes the boundaries of unconstrained face recognition technology with its extensive annotated corpus of facial imagery [47]. The collection includes both video and photos, as well as facial imagery that is Creative Commons licensed (i.e., free to be reused as long as due credit is given to the original data source). Face photos and videos of 1500 more people were gathered as an addition to the IJB-A dataset [47].

6.16. VGGFACE Database

The 2015 edition of the VGGFace collection, one of the biggest publicly accessible datasets, has 2.6 million photos of 2622 individuals [48]. There are 800,000 photos in the curated edition, with about 305 photos per identity after label noise was eliminated by human annotators [48].

6.17. CFP Database

The challenging and publicly available CFP (Celebrities in Frontal-Profile) dataset was created in 2016 by Sengupta et al. [49] at the University of Maryland. It has 7000 images of 500 different subjects [49]. Each individual has 10 frontal images and 4 profile images [49].

6.18. Ms-Celeb-1M Database

In 2016, Microsoft made available for training and testing the massive Ms-Celeb-1M dataset, which has 10 million photos from 100,000 celebrities [50].

6.19. DMFD Database

The Disguise and Makeup Face Database (DMFD), with 410 subjects and 2460 total photos, is presented in [51]. These pictures include celebrities (movie/TV stars, athletes, or politicians) wearing disguises and/or makeup (beards, mustaches, eyeglasses, goggles, etc.) that alter their appearance [51]. The primary task of this database is to match face photos taken in natural settings with many variables, including light, occlusion, distance, pose, and expression change [51].

6.20. VGGFACE 2 Database

The VGGFace2 dataset is made up of 3.31 million photos from 9131 celebrities representing a diverse range of professions (such as politicians and athletes) and ethnicities (for instance, more Chinese and Indian faces than VGGFace, though the distribution of celebrities and public figures still limits the ethnic balance) [52]. Pose, age, backdrop, lighting, and other aspects of the images vary greatly; they were acquired from Google Image Search. The dataset has a roughly balanced gender distribution, with 59.3% of the identities being male, and the number of photos per identity ranges from 80 to 843 [52].

6.21. IARPA Janus Benchmark C Database

In 2018, Maze et al. developed the IARPA Janus Benchmark-C (IJB-C) database, which is an extension of IJB-B. It has 31,334 still photos (21,294 faces and 10,040 non-faces), with an average of 6 pictures per person, and 11,779 full-motion videos (117,542 frames, with an average of 33 frames and 3 videos per person) [53].

6.22. MF2 Database

Nech and Shlizerman of the University of Washington generated the public facial recognition dataset known as MF2 (MegaFace 2) in 2017 [54]. It features 4.7 million photos and 672,000 identities [54].

6.23. DFW (Disguised Faces in the Wild) Database

The Disguised Faces in the Wild (DFW) collection comprises approximately 11,000 photos of 1000 people [55]. It was captured in uncontrolled situations in order to support facial recognition research. The dataset includes variations in disguise related to hairstyles, spectacles, facial hair, beards, hats, turbans, veils, masquerades, and ball masks. The dataset is difficult for face identification because of these changes, in addition to those related to pose, lighting, emotion, backdrop, ethnicity, age, gender, clothing, hairstyles, and camera quality [55]. There are 1001 normal face images, 903 validation face photos, 4814 disguised face images, and 4440 impersonator images in the DFW collection overall. Any individual who, whether knowingly or unknowingly, assumes the identity of a subject is considered an impersonator of that subject.

6.24. LFR Database

The Left–Front–Right (LFR) pose dataset was created by combining four datasets (LFW, CFP, CASIA-WebFace, and VGGFace2) into one [56].

6.25. CASIA-Mask Database

There are 494,414 photos of 10,575 people with masked faces in the CASIA-Mask database [57].
Table 6 presents a summary of various face recognition databases, including key details such as the number of images, videos, subjects, data accessibility, and whether the data includes clean or occluded faces. These databases, ranging from early datasets like ORL to more recent ones like CASIA Mask, are pivotal in evaluating and training face recognition systems.
The face recognition databases presented in the table exhibit a wide variety of characteristics, offering different datasets for training and testing. These databases, spanning from the early ORL database in 1994 to more recent datasets like CASIA Mask in 2021, provide various combinations of images, videos, and subjects. The ORL and FERET databases, for instance, primarily feature clean data with a relatively smaller number of subjects (40 and 1199, respectively), making them suitable for early face recognition tasks. In contrast, newer datasets like MegaFace (2016) and Ms-Celeb-1M (2016) include vast numbers of images (up to millions) and subjects (hundreds of thousands), allowing for large-scale training of face recognition models.
Many of the older datasets, such as Yale (1997) and FRGC (2006), focus on clean images, while recent datasets like CASIA Webface (2014) and IARPA Janus Benchmark-A (2015) include a mix of clean and occluded images, providing a more challenging environment for testing face recognition systems under varying conditions. Furthermore, IARPA Janus Benchmark-A and IARPA Janus Benchmark-B (2015, 2017) introduce datasets with both images and videos, adding a dynamic element for evaluating algorithms. The CASIA Mask database (2021), specifically focusing on occluded faces, poses an additional challenge to recognition systems, testing their ability to handle face occlusions effectively.
The majority of these datasets are publicly accessible, which has facilitated the development and benchmarking of various face recognition algorithms, while a few like DMFD (2016) are private, restricting broader accessibility. This diversity in the databases enables researchers to evaluate face recognition models under a range of conditions, including varying numbers of subjects, image quality, and occlusion types, making these datasets valuable for advancing the field of face recognition.
Table 7 provides an overview of the strengths and limitations of various face recognition databases used for training and testing face recognition systems. These databases vary in terms of the diversity of subjects, quality of images, and environmental conditions, each presenting unique advantages and challenges for researchers in the field.
The above table presents a comparison of various face recognition databases used for training and testing systems, focusing on their strengths and limitations. Some datasets, such as ORL, FERET, and Yale, are smaller and suitable for controlled experiments, but they are limited in terms of subject diversity and environmental conditions, often with low-resolution images. On the other hand, databases like FRGC, MegaFace, and Ms-Celeb-1M offer large-scale datasets with diverse subjects, poses, and lighting conditions, though they can suffer from issues like high computational costs, data quality inconsistencies, or limited diversity in real-world scenarios. Datasets like CASIA Webface and VGGFACE provide large collections of high-quality images, but some have imbalances in data distribution or limited pose variation. Certain specialized databases like AR, CASIA Mask, and DMFD focus on specific challenges, such as occlusion or manipulation detection, but may have limited generalizability to unconstrained environments. Overall, while larger datasets provide better scalability, the limitations related to data quality, environmental conditions, and pose variation must be considered when choosing a database for training face recognition systems.
The next section concerns face recognition methods and will inform us about traditional and deep learning architectures.

7. Face Recognition Methods

7.1. Traditional

7.1.1. Principal Component Analysis (PCA)

Matthew Turk and Alex Pentland created the PCA approach for detecting faces; they coupled the conceptual approach of the Karhunen–Loève theorem with factor analysis to create a linear model [16]. Eigenfaces are determined using global and orthogonal characteristics from human faces. A human face is calculated as a weighted mixture of many Eigenfaces. However, this technique is not extremely accurate when the lighting and position of facial photos change significantly [58]. With PCA, the face database represents all images as long vectors that are correlated, rather than the typical matrix structure [58]. The PCA Eigenface method of face recognition was improved by utilizing linear discriminant analysis (LDA) to obtain Fisherfaces [14]. The most popular applications of LDA are feature extraction and dimensionality reduction [59,60]. For supervised classification problems, it is more reliable than PCA. LDA requires labeled data and struggles with pose variation and large datasets [35].
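As a minimal illustration of the Eigenface idea, the following sketch uses scikit-learn's copy of the ORL (Olivetti) faces: PCA learns the eigenfaces, and a nearest-neighbour classifier matches the projected test faces. The number of components and the split are arbitrary choices, not values from the cited works.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

faces = fetch_olivetti_faces()                      # 400 images of 40 subjects (ORL)
X_tr, X_te, y_tr, y_te = train_test_split(
    faces.data, faces.target, test_size=0.25, stratify=faces.target, random_state=0)

pca = PCA(n_components=100, whiten=True).fit(X_tr)  # the top 100 eigenfaces
clf = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(X_tr), y_tr)
print("recognition accuracy:", clf.score(pca.transform(X_te), y_te))
```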

7.1.2. Gabor Filter

The Gabor filter optimizes spatial and frequency resolution by acting as a band-pass filter for local frequency distributions [61]. In texture analysis, the Gabor filter is a linear filter that effectively determines if the picture contains any certain frequency content in particular directions within a small area surrounding the point or region of examination [62]. Gabor filter-based feature selection is resilient, but computationally expensive due to high-dimensional Gabor features [61]. In a variety of image-based applications, Gabor filters have demonstrated exceptional performance. Gabor filters are capable of achieving excellent results in both the frequency and spatial domains [63].
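A simple Gabor filter bank can be built with OpenCV as sketched below; the kernel size, wavelength, and the eight orientations are illustrative choices rather than parameters from the cited studies.

```python
import cv2
import numpy as np

def gabor_features(gray_face, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
    """Filter a grayscale face with 8 Gabor orientations and pool simple statistics."""
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 8):          # 8 orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, psi=0)
        response = cv2.filter2D(gray_face, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.var()])    # mean/variance per band
    return np.array(feats)
```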

7.1.3. Viola–Jones Object Detection Framework

Paul Viola and Michael Jones introduced the Viola–Jones object detection framework, a machine learning object identification framework, in 2001 [64]. It is made up of several classifiers. A single perceptron with several binary masks (Haar features) makes up each classifier. Although it is less accurate than more recent techniques like convolutional neural networks, it is effective and computationally inexpensive.
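The framework is available through OpenCV's pretrained Haar cascades, as in the following sketch; the input image path and detection parameters are assumptions for illustration.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("photo.jpg")                        # assumed input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
for (x, y, w, h) in faces:                           # draw each detected bounding box
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", img)
```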

7.1.4. Support Vector Machine (SVM)

An SVM classifier seeks the hyperplane with the greatest distance to the closest training-data point of any class (the so-called functional margin); intuitively, this achieves a good separation since, generally speaking, the bigger the margin, the lower the classifier's generalization error [65]. A lower generalization error means the model is less likely to overfit. There are two types of support vector machines: linear and nonlinear [65].
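The following sketch shows how an SVM is typically applied to face classification with scikit-learn; the random arrays merely stand in for real face descriptors (e.g., PCA, Gabor, or HOG features), and the kernel and C value are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 128))    # 200 face descriptors, 128-D each (placeholder)
y_train = rng.integers(0, 10, size=200)  # 10 identities (placeholder labels)
X_test = rng.normal(size=(20, 128))

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
model.fit(X_train, y_train)
predicted_ids = model.predict(X_test)    # one identity per test descriptor
```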

7.1.5. Histogram of Oriented Gradients (HOG)

An image’s gradient information, or edge-like characteristics, can be captured via the Histogram of Oriented Gradients (HOG) approach. Face recognition and detection have been effectively implemented using it. Each cell’s gradient is calculated once the picture is split up into tiny cells [66]. Histograms, which serve as an image feature descriptor, are created by grouping these gradients. HOG is good at capturing texture information but struggles with handling many poses and occlusion [66].
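A HOG descriptor for a face image can be extracted with scikit-image as sketched below; the cell and block sizes are common defaults, and the input path is an assumption.

```python
from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("face.jpg"))       # assumed RGB face image
descriptor = hog(image,
                 orientations=9,                    # 9 gradient orientation bins
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm="L2-Hys")
# `descriptor` is a 1-D feature vector that can feed a classifier such as an SVM.
```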
Modern facial recognition systems have their roots in traditional technologies. Despite their notable achievements in controlled settings, techniques such as Eigenfaces (PCA), Fisherfaces (LDA), LBP, Gabor filter, and HOG are limited in their ability to handle variables in the real world, such as posture, illumination, and occlusion. These conventional techniques have mostly been superseded by more current deep learning techniques, especially CNNs and Transformers, which can extract more intricate and discriminative features from big datasets.

7.2. Deep Learning

7.2.1. AlexNet

The convolutional neural network (CNN) architecture known as AlexNet was created by Alex Krizhevsky in association with Ilya Sutskever and Geoffrey Hinton [67]. On 30 September 2012, AlexNet participated in and won the ImageNet Large Scale Visual Recognition Challenge. Eight layers make up the basic architecture of AlexNet, three of which are fully connected and five of which are convolutional. The design is similar to LeNet but has a deeper architecture with more filters and stacked convolutional layers. The depth of the AlexNet model enhanced the efficacy of Levi and Hassner's studies [68].

7.2.2. VGGNet

In the publication “Very Deep Convolutional Networks for Large-Scale Image Recognition”, K. Simonyan and A. Zisserman introduced the convolutional neural network model known as the VGG-Network [69]. Their primary contribution was to employ the VGGNet design, which has modest (3 × 3) convolution filters and doubles the amount of feature maps following the (2 × 2) pooling. To improve the deep architecture’s ability to learn continuous nonlinear mappings, the network’s depth was raised to 16–19 weight layers. Figure 11 below shows an example of a VGG architecture.
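The VGG design pattern (stacked 3 × 3 convolutions followed by 2 × 2 pooling, with feature maps doubling between stages) can be sketched in PyTorch as follows; the layer counts shown are illustrative.

```python
import torch.nn as nn

def vgg_stage(in_channels, out_channels, num_convs):
    """One VGG-style stage: num_convs 3x3 convolutions, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# e.g., the first two stages of a VGG-16-like feature extractor
features = nn.Sequential(vgg_stage(3, 64, 2), vgg_stage(64, 128, 2))
```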

7.2.3. ResNet

In 2016, He et al., designed the residual neural network (also known as a residual network or ResNet) architecture largely to improve the performance of existing CNN architectures such as VGGNet, GoogLeNet, and AlexNet [70]. A ResNet is a deep learning model in which weight layers train residual functions based on the layer inputs. It functions like a highway network, with gates opened using significantly positive bias weights [71]. ResNet uses “global average pooling” instead of “fully connected” layers, resulting in a significantly reduced model size compared to the VGG network [70].
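The core idea of ResNet, adding a layer's input back to its output, is captured by the following basic residual block sketched in PyTorch; the channel count and input size are arbitrary.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the identity (skip) input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                 # residual connection

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))    # output shape equals the input shape
```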

7.2.4. FaceNet

Google researchers Florian Schroff, Dmitry Kalenichenko, and James Philbin created the FaceNet face recognition technology. The technology was initially demonstrated at the 2015 IEEE Conference on Computer Vision and Pattern Recognition [72]. Using the Labeled Faces in the Wild (LFW) and YouTube Faces databases, FaceNet was able to recognize faces with an accuracy of 99.63% and 95.12%, respectively [72].

7.2.5. LBPNet

LBPNet uses two filters that are based on Principal Component Analysis (PCA) and LBP methodologies, respectively [73]. The two components of LBPNet's architecture are the (i) regular network for classification and (ii) deep network for feature extraction [73].

7.2.6. Lightweight Convolutional Neural Network (LWCNN)

Lightweight convolutional neural networks are frameworks with reduced inference times and parameter counts that learn 256-D compact embeddings from large-scale face data with very noisy labels [74].

7.2.7. YOLO

A convolutional neural network architecture called YOLO was developed by Joseph Redmon and his colleagues [75]. It predicts bounding-box positions and class labels for many candidates in a single stage. YOLO treats object recognition as a regression problem, simplifying the pipeline from picture input to category and position output [75].

7.2.8. MTCNN

MTCNNs, or Multi-Task Cascaded Convolutional Neural Networks, are neural networks that detect faces and facial landmarks in images; they were published in 2016 by Zhang et al. [76]. Systems built on this architecture have also used one-shot learning, in which a single image of an offender is enough to identify him. The goal is to recognize the criminal's face, retrieve the information stored in the database for that criminal, and notify the police of all the facts, including the place where the criminal was observed by cameras [76].

7.2.9. DeepMaskNet

Ullah et al. presented DeepMaskNet in 2021 during COVID-19 [77]. It is a powerful system that can distinguish between individuals who are wearing face masks and those who are not. With an accuracy of 93.33%, DeepMaskNet outperformed various cutting-edge CNN models, including VGG19, AlexNet, and Resnet18 [77].

7.2.10. DenseNet

DenseNet was suggested as a solution to the vanishing gradient problem and is comparable to ResNet [78]. DenseNet improves on ResNet by using feed-forward connections between each layer and all subsequent layers, utilizing cross-layer connectivity; ResNet, by contrast, explicitly retains information through additive identity transformations, which adds to its complexity. DenseNet makes use of dense blocks, so all subsequent layers receive feature maps from all preceding layers [78].

7.2.11. MobileNetV2

A convolutional neural network design called MobileNetV2 aims to function well on mobile devices [79]. Its foundation is an inverted residual structure, in which the bottleneck layers are connected by residuals. Lightweight depthwise convolutions are used in the intermediate expansion layer to filter features as a source of non-linearity. MobileNetV2's design begins with a full convolution layer with 32 filters, followed by 19 residual bottleneck layers [79].

7.2.12. MobileFaceNets

MobileFaceNets are a type of very efficient CNN model designed for high-accuracy real-time face verification on mobile and embedded devices [80]. These models have less than one million parameters. The superiority of MobileFaceNets over MobileNetV2 has been established. MobileFaceNets require smaller amounts of data while maintaining superior accuracy when compared to other cutting-edge CNNs [80].

7.2.13. Vision Transformer (ViT)

The Vision Transformer treats pictures as a series of patches, adapting the Transformer design from Natural Language Processing to computer vision. Like word embeddings in natural language processing, these patches are flattened and linearly projected into embeddings [81]. In order to preserve spatial information, positional embeddings are added. Global relationship capture is then made possible by a typical Transformer encoder with multi-head self-attention layers processing the patch embedding sequence. To classify images, a classification token aggregates the information [81]. ViT works well for applications requiring comprehensive picture interpretation because of its ability to capture global context and long-range interdependence [81]. In some situations, Vision Transformers (ViTs) have outperformed traditional convolutional neural networks (CNNs) in image identification tests, exhibiting impressive performance [81]. To function at their best, however, ViTs often demand large amounts of data and processing power; new developments such as hybrid models, which combine CNNs and Transformers, have lessened this drawback [82].
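The patch-embedding step described above can be sketched in PyTorch as follows; the 16 × 16 patch size, 768-D embeddings, and two encoder layers are illustrative values, not those of any specific ViT variant.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                          # one RGB image
patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)     # 16x16 patches -> 768-D
tokens = to_patches(img).flatten(2).transpose(1, 2)        # (1, 196, 768) patch embeddings

cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # classification token
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))  # positional embeddings
x = torch.cat([cls_token, tokens], dim=1) + pos_embed      # sequence fed to the encoder

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2)
out = encoder(x)                                           # out[:, 0] aggregates for classification
```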

7.2.14. Face Transformer for Recognition Model

By treating face photos as a series of patches, Face Transformer for Recognition models, as investigated in [83], make use of the Transformer architecture’s prowess in processing sequential data. These picture patches are flattened and linearly projected into embeddings, much like the tokenization of words in natural language processing. In order to maintain spatial information that is essential for comprehending face features, positional embeddings are used. After that, the patch embedding sequence is sent into a conventional Transformer encoder made up of feed-forward and multi-head self-attention networks. This makes it possible for the model to depict the face holistically by capturing the global correlations between various facial characteristics. Face recognition tasks are subsequently performed using the final representation, which is frequently obtained from a classification token. With this method, Face Transformers may efficiently learn from massive datasets and may even threaten CNNs’ hegemony in face recognition [84].

7.2.15. DeepFace

Facebook’s research group developed DeepFace, a deep learning face recognition technology. It recognizes human faces in digital photos. The approach employs a deep convolutional neural network (CNN) to develop robust face representations for the purpose of face verification [85]. DeepFace achieved a human-level accuracy of 97.35% on the LFW (Labeled Faces in the Wild) dataset, which was a substantial advance above existing approaches at the time [85]. The network comprises nine layers and pre-processes photos using 3D face alignment, which aligns them based on facial landmarks and helps with position variations. It uses a softmax loss function during training to discriminate between distinct identities [85].

7.2.16. Attention Mechanism

Attention mechanisms increase the performance of face recognition models by allowing them to focus on the most relevant facial characteristics, deal with occlusions, illumination differences, and position changes, and improve generalization. Self-attention [86], channel [87] and spatial attention [87] processes, and multi-head self-attention [88] are effective tools for improving feature maps, increasing model resilience, and attaining high accuracy even in difficult situations such as low-resolution inputs or enormous datasets. Attention processes are an important component of contemporary face recognition systems due to their strengths.
The channel mechanism describes the relevance or link between various channels in feature maps. CNNs process input data (such as an image) through many filters (or kernels), resulting in distinct feature maps. Each feature map represents a distinct channel. For example, in a color picture, there might be three channels representing the RGB (Red, Green, Blue) components [87]. However, when pictures progress through deeper layers of the network, each channel carries more abstract and high-level characteristics. The problem is determining which channels carry the most useful features for a specific activity [87]. The spatial mechanism focuses on the interactions between multiple spatial locations (pixels) in a feature map. In CNNs, spatial mechanisms are critical for capturing spatial hierarchies and patterns in the input data, such as the relative location of objects or elements in an image [87]. Self-attention, also known as intra-attention, is a process in which a series of input items attend to themselves. This implies that each element in the sequence is compared to every other element, allowing the model to determine how important other items are for each element in the sequence [86]. Multi-head attention is a variation of the self-attention process that enables the model to pay attention to several sections of the sequence at once. Rather than employing a single set of attention weights, multi-head attention employs many sets (or “heads”) to capture various relationships or aspects in the data [88].
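The self-attention operation underlying these mechanisms can be expressed compactly as scaled dot-product attention; the sketch below, with arbitrary token and feature dimensions, is for illustration only.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, dim); w_q/w_k/w_v: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # pairwise similarities
    weights = F.softmax(scores, dim=-1)                       # attention weights
    return weights @ v                                        # weighted sum of values

x = torch.randn(1, 49, 64)                 # e.g., a 7x7 feature map flattened to 49 tokens
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)     # same shape as the input: (1, 49, 64)
```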

7.2.17. Swin Face Recognition

The Swin Transformer for recognition uses a multi-stage, hierarchical design to handle picture patches [89]. Like a Vision Transformer, it begins by segmenting the vision into tiny sections. However, Swin Transformer creates a hierarchical representation that captures both local and global information by combining neighboring patches at each level, in contrast to ViT [89]. A shifted windowing method is used in this merger to provide effective cross-window information transmission. The Swin Transformer block, which models interactions inside windows via shifting window-based multi-head self-attention, is the central component of each stage [89]. This makes it scalable by lowering the computing complexity in comparison to global attention. The Swin Transformer performs well on tasks like facial recognition because of its hierarchical structure and shifting windows, which let it collect global context and fine details [89].
Table 8 summarizes the strengths and weaknesses of various face recognition methods. These methods, including PCA, Gabor filters, and Viola–Jones, each have distinct advantages, such as resource efficiency and robustness to certain variations, but also face limitations in handling large datasets, non-frontal faces, and occlusions.
The above are the most commonly used CNN architectures for training and testing face recognition systems. We can observe that some CNNs require small or large datasets in order to achieve high accuracy, that some CNNs have more layers than others, and that some CNNs are best suited for real-time verification or masked faces. The corresponding open-source resources are available at this link: https://github.com/ddlee-cn/awesome_cnn (accessed on 17 December 2024). When it comes to facial recognition, Transformers encounter a number of difficulties. When compared to CNNs, they may be less successful since they need big datasets and a lot of processing power [82]. They may also be computationally costly due to their quadratic self-attention complexity, which restricts real-time applications. To address these shortcomings, hybrid models that combine CNNs and Transformers have been suggested; nevertheless, optimization is still difficult [90]. Only a small number of studies have so far used Transformers for face recognition.
The next section covers performance measures; it will describe the performance metrics used in face recognition systems.

8. Performance Measures

8.1. Accuracy

Accuracy quantifies the model's overall correctness by comparing the number of correct predictions to the total number of predictions. Below is the mathematical computation of accuracy:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In face recognition systems, a prediction is regarded as valid when it properly identifies or confirms the person in the image. Incorrect predictions occur when the algorithm does not recognize the person or incorrectly labels them as someone else.

8.2. Precision

Precision refers to the fraction of correctly identified positives. In face recognition systems, it refers to the percentage of correct positive matches out of all those recognized by the system. Below is the mathematical computation of precision:
$$\text{Precision} = \frac{TP}{TP + FP}$$

8.3. Recall

Recall is a statistic that indicates how often a machine learning model accurately detects positive examples (true positives) among all of the actual positive samples in the dataset. Below is the mathematical computation of recall:
$$\text{Recall} = \frac{TP}{TP + FN}$$

8.4. F1-Score

The F1-score is a critical assessment parameter for facial recognition algorithms, particularly when dealing with unbalanced datasets. It offers a single measurement that balances precision and recall. The F1-score is a more trustworthy estimate of a model's performance than accuracy since it takes into account both false positives and false negatives. The F1-score represents the harmonic mean of precision and recall. The harmonic mean is utilized instead of the standard arithmetic mean since it penalizes extreme values more. Below is the mathematical computation of the F1-score:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
A perfect F1-score of 1 indicates excellent precision and recall, whereas the poorest possible F1-score is 0.

8.5. Sensitivity

This is the system’s capacity to accurately recognize positives. This metric measures the system’s ability to identify subject persons. Below is the mathematical computation of sensitivity:
$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN}$$

8.6. Specificity

This is the system’s capacity to accurately recognize negatives. This metric measures the system’s ability to identify non-subject persons. Below is the mathematical computation of specificity:
$$\text{Specificity} = \frac{TN}{FP + TN}$$

8.7. AUC (Area Under the Curve)

The area under the ROC curve is a single measure that represents a binary classification model’s overall performance.

8.8. ROC Curve (Receiver Operating Characteristic Curve)

A graphical representation of the trade-off between True Positive Rate and False Positive Rate at different categorization levels.
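For reference, the metrics above can be computed with scikit-learn as in the following sketch; the labels and scores are toy values for a verification task (1 = same identity, 0 = different identity).

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                     # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # thresholded decisions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]    # match scores used for the ROC/AUC

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_score))
```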
Table 9 provides a summary of the strengths and weaknesses of various performance metrics commonly used in face recognition systems. These metrics, including accuracy, precision, recall, F1-score, ROC curve, and AUC, each have specific advantages and limitations, especially when dealing with skewed datasets, false positives, and false negatives.
The above are the most commonly used performance measures for face recognition systems. The next section is on Face Recognition Loss Functions; the section will inform us about how faces are mapped in order to achieve the best accuracy.

9. Face Recognition Loss Functions

In face recognition training, loss functions play a critical role in directing models to maximize face representations. In order to directly impact model performance, they seek to map faces into a feature space where intra-class distances (same person) are minimized and inter-class distances (different persons) are maximized.

9.1. Softmax Cross-Entropy Loss

$$L_{\text{softmax}} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$$
where $L_{\text{softmax}}$ is the loss, $N$ is the number of classes, $y_i$ is the ground-truth label (1 for the correct class, 0 for others), and $\hat{y}_i$ is the predicted probability for class $i$, given by the softmax function. The goal is to minimize this loss function, which means maximizing the likelihood of the true class being predicted with high probability.

9.2. Triplet Loss

$$L_{\text{triplet}} = \max\big(d(a, p) - d(a, n) + \alpha,\ 0\big)$$
In the triplet loss function, the variables represent key components for learning discriminative embeddings. $a$ is the anchor, which is the reference input, typically a face image. $p$ is the positive sample, another image of the same identity or class as the anchor, while $n$ is the negative sample, from a different class or identity. The term $d(a, p)$ measures the distance between the anchor and the positive sample, which should be kept small, while $d(a, n)$ measures the distance between the anchor and the negative sample, which should be made large. The parameter $\alpha$ is a margin that ensures the anchor-negative distance is larger than the anchor-positive distance by at least $\alpha$, preventing trivial solutions. The $\max$ function ensures the loss is zero if the margin condition is met; otherwise, it penalizes the model for not achieving sufficient separation between positive and negative pairs.
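In practice this objective is available directly in deep learning frameworks; the PyTorch sketch below, with arbitrary batch and embedding sizes, uses nn.TripletMarginLoss, which implements the same max(d(a, p) - d(a, n) + alpha, 0) formulation.

```python
import torch
import torch.nn as nn

criterion = nn.TripletMarginLoss(margin=0.2)          # alpha = 0.2, Euclidean distance
anchor = torch.randn(32, 128, requires_grad=True)     # 32 anchor embeddings (128-D)
positive = torch.randn(32, 128)                       # same identities as the anchors
negative = torch.randn(32, 128)                       # different identities
loss = criterion(anchor, positive, negative)
loss.backward()                                       # gradients flow back to the embeddings
```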

9.3. Center Loss

$$L_{\text{center}} = \frac{1}{2}\sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert_2^2$$
In this equation, $L_{\text{center}}$ represents the center loss. The variable $N$ denotes the number of samples in the batch, $x_i$ refers to the feature vector of the $i$-th sample, and $c_{y_i}$ is the center of the class $y_i$ to which the $i$-th sample belongs. The notation $\lVert x_i - c_{y_i} \rVert_2^2$ represents the squared Euclidean distance between the feature vector $x_i$ and the corresponding class center $c_{y_i}$. This loss function minimizes the distance between feature representations of the same class by encouraging each sample to be closer to the center of its corresponding class.

9.4. ArcFace Loss

$$L_{\text{arcface}} = -\log \frac{\exp\big(\cos(\theta_y + m)\big)}{\exp\big(\cos(\theta_y + m)\big) + \sum_{j \neq y} \exp\big(\cos(\theta_j)\big)}$$
In this equation, $L_{\text{arcface}}$ represents the ArcFace loss. The term $\theta_y$ is the angle between the feature vector and the weight vector of the true class $y$, and $m$ is a margin added to the true-class angle $\theta_y$ to enforce a larger angular distance for correct classification. The cosine of this modified angle, $\cos(\theta_y + m)$, is used to increase the decision margin for the true class. The denominator is the sum of the exponentials of the cosine values of the angles for all classes, including the true class and all other classes $j \neq y$. This loss function encourages the model to distinguish between classes by maximizing the angular margin between the true class and the other classes, thereby improving classification accuracy.
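A compact PyTorch sketch of this additive angular-margin idea is given below; note that, unlike the formula above, practical implementations usually also multiply the cosine logits by a scale factor s, which is included here as an assumption.

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, weights, labels, margin=0.5, scale=30.0):
    """embeddings: (batch, dim); weights: (num_classes, dim); labels: (batch,)."""
    cosine = F.normalize(embeddings) @ F.normalize(weights).t()    # cos(theta_j)
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(labels, num_classes=weights.shape[0]).float()
    logits = torch.cos(theta + margin * one_hot) * scale           # margin only on the true class
    return F.cross_entropy(logits, labels)                         # softmax over the adjusted logits

emb = torch.randn(8, 128, requires_grad=True)
w = torch.randn(1000, 128, requires_grad=True)        # one weight vector per identity
loss = arcface_loss(emb, w, torch.randint(0, 1000, (8,)))
```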

9.5. Contrastive Loss

$$L_{\text{contrastive}} = \frac{1}{2}\Big[\, y \cdot \lVert a - b \rVert_2^2 + (1 - y) \cdot \max\big(0,\ m - \lVert a - b \rVert_2\big)^2 \Big]$$
In this equation, $L_{\text{contrastive}}$ represents the contrastive loss. The terms $a$ and $b$ are feature vectors of two samples (e.g., two images), and $\lVert a - b \rVert_2$ is the Euclidean distance between these vectors. The label $y$ is a binary indicator, where $y = 1$ indicates that the two samples are from the same class (positive pair) and $y = 0$ indicates that they are from different classes (negative pair).
For positive pairs, the loss is proportional to the squared Euclidean distance between the feature vectors, encouraging the model to bring similar samples closer. For negative pairs, the loss is based on a margin m, encouraging the model to push dissimilar samples apart. If the distance between the negative pair is smaller than the margin m, the loss will be proportional to the square of the difference from the margin. If the distance exceeds the margin, no penalty is applied. This loss function promotes the model to learn embeddings that minimize the distance for similar samples and maximize the distance for dissimilar ones, thus improving the model’s discriminative ability.

9.6. Margin Cosine Loss

L_{\text{cosine}} = 1 - \frac{\cos(\theta_y + m)}{\left\| w_y \right\|_2 \cdot \left\| x_i \right\|_2}
In this equation, L cosine represents the cosine loss. The term cos ( θ y + m ) refers to the cosine similarity between the feature vector x i of the input image and the weight vector w y corresponding to the target class, adjusted by a margin m.
The denominator ‖ w y ‖ 2 · ‖ x i ‖ 2 normalizes the two vectors to unit length. The cosine similarity measures the angle between two vectors in a high-dimensional space: a value close to 1 indicates that the vectors are similar, while values close to 0 or negative values indicate dissimilarity.
The margin m is added to the angle θ y to introduce a separation between the correct class and the other classes, forcing the network to produce more discriminative embeddings. The cosine loss function encourages the model to maximize the cosine similarity for the correct class while minimizing it for the other classes, which makes the model more robust in differentiating between classes by ensuring a clear margin in the feature space (a minimal sketch of this formulation is given at the end of this subsection). The above are the most commonly used loss functions in face recognition systems. The next section covers recent applications of CNN architectures and the areas in which facial recognition systems have been applied, together with the accuracy achieved.
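As referenced above, the following minimal sketch evaluates the margin cosine formulation for a single sample and its target-class weight vector, assuming both vectors are normalized to unit length before the margin-adjusted angle is measured; the margin value and dimensionality are illustrative assumptions.

```python
import numpy as np

def margin_cosine_loss(x, w_y, m=0.35):
    """Margin cosine loss for one sample and its target-class weight vector."""
    x_hat = x / np.linalg.norm(x)                     # normalize the feature vector
    w_hat = w_y / np.linalg.norm(w_y)                 # normalize the class weight vector
    cos_theta = np.clip(np.dot(x_hat, w_hat), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return 1.0 - np.cos(theta + m)                    # larger loss as the margin-adjusted angle widens

# Hypothetical 128-D feature and class weight vectors
rng = np.random.default_rng(3)
print(margin_cosine_loss(rng.normal(size=128), rng.normal(size=128)))
```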

10. Recent Applications of CNN Architectures

In 2016, Sharma et al. [91] proposed FAREC, a face recognition system built on deep learning and a convolutional neural network (CNN) with Dlib face alignment. The system includes four core processes: face detection, alignment, cropping, and feature extraction. The study employed 286 labels on 11,284 cropped grayscale photos with a resolution of 96 × 96. After 20 epochs, FAREC achieved 96% accuracy on FRGC with a false acceptance rate of 0.1% (1 in 1000). However, pose variation and intensity fluctuation remain concerns.
In 2017, Arsenovic et al. [92] proposed a face recognition model that can be incorporated into other systems, with little or no modification, as a supporting or primary monitoring component. The technique uses face augmentation to enlarge the dataset and achieve improved accuracy on smaller datasets; the augmentation procedure was divided into two steps, the first of which applied typical image augmentation techniques such as noise and blurring at various intensities. The model comprises several critical steps: face detection, image pre-processing (landmark identification and alignment), embedding, and classification. With few face photos and the suggested augmentation approach, a high overall accuracy of 95.02% was attained. Despite being largely automated, these methods are nonetheless prone to mistakes.
Experimental findings reported by Al-Azzawi et al. [93] in 2018 indicate that a localized CNN structure improves face recognition and identification performance by leveraging a localized deep feature map. The model addresses large variations in expression, pose, and lighting, as well as poor resolution, and achieved 98.76% facial recognition accuracy.
In 2018, Lu et al. [94] proposed Deep Coupled ResNet (DCR) for low-resolution face recognition. The suggested DCR model consistently outperformed the state of the art, according to a thorough examination on the LFW database. For a probe size of 8 × 8, when using the LFW database, the DCR achieved an accuracy of 93.6%, which was better compared to 67.7%, 72.7%, and 75.0% achieved by the LightCNN, ResNet, and VGGFace, respectively [94].
In 2018, Qu et al. [95] proposed a real-time facial recognition system based on a convolutional neural network (CNN) implemented on a Field Programmable Gate Array (FPGA), which enhances speed and accuracy. FPGA technology allows parallel computing and custom logic circuit design, resulting in faster processing than Central Processing Unit (CPU), Graphics Processing Unit (GPU), and Tensor Processing Unit (TPU) processors. The parallel processing of the FPGA speeds up network computation, enabling real-time face recognition. The network operates at a clock frequency of 50 Megahertz (MHz) and achieves recognition speeds of up to 400 frames per second (FPS), exceeding previous achievements, with a recognition rate of 99.25%, reported to be greater than that of the human eye.
In 2019, Talab et al. [96] developed an efficient sub-pixel convolutional neural network for recognizing faces across low to high resolutions. The network is used in image pre-processing to improve recognition of low-resolution pictures by converting them to high resolution before recognition. Evaluations were conducted on the Yale and ORL face databases, where the technique achieved higher accuracy (95.3% and 93.5%, respectively) than prior methods.
In 2020, Feng et al. [97] proposed a small sample face recognition approach using ensemble deep learning. This approach improves network performance by increasing the number of training samples, optimizing training parameters, and reducing the disadvantage of small sample sizes. Experiments indicate that this technique outperforms convolutional neural networks in recognizing faces in the ORL database, even with few training sets.
In 2020, Lin et al. [98] suggested LWCNN for small sample data and employed k-fold cross-validation to ensure robustness. Lin et al.’s suggested LWCNN approach outperformed other methods in terms of recognition accuracy and avoidance of overfitting in limited sample spaces.
In 2021, Yudita and colleagues [99] investigated the concept of face recognition utilizing VGG16 based on incomplete or insufficient human face data. The study achieved low accuracy; Yudita et al. suggested that the effectiveness of CNN models for human face recognition may be impacted by the many databases that are employed, each having varying quantities and kinds of data.
In 2021, Szmurlo and Osowski [100] suggested a technique that combines feature selection methods with three classifiers: a support vector machine (SVM), a random forest of decision trees, and a Softmax layer incorporated into a CNN. The system uses an AlexNet-based CNN. The results indicate that the SVM outperforms the random forest, but is not as accurate as the standard Softmax classifier (96.3% vs. 97.3%, respectively). The classical classifiers perform poorly because of the limited learning data: the CNN generates a large number of descriptors, which makes traditional classifiers difficult to train and requires a large number of learning samples.
In 2021, Wu and Zhang [101] used multi-task cascaded convolutional neural networks (MTCNNs) for rapid face detection and alignment, while FaceNet with an improved loss function provided high-accuracy face verification and recognition. The study compared the performance of this MTCNN and FaceNet hybrid network for face detection and recognition with other deep learning algorithms and approaches. The testing results show that the upgraded FaceNet can handle real-time recognition needs with an accuracy of 99.85%, compared to 97.83% achieved by MTCNN alone, and the authors concluded that face detection and recognition can be efficiently integrated into an access control system [101].
Studies conducted by Sanchez-Moreno et al. [102] in 2021 show that YOLO-Face achieves precision, recall, and accuracy scores of 99.8%, 72.9%, and 72.8% on the FDDB dataset; the only model that outperforms YOLO-Face there is MTCNN, which achieves an accuracy of 81.5% [102]. The same face detection models were also tested on the CelebA dataset, which consists of 19,962 face images, and all of the models performed well; Face-SSD, with 99.7% accuracy, slightly outperforms YOLO-Face at 99.6%. Lastly, YOLO-Face achieves 95.8%, 94.2%, and 87.4% on the Easy, Medium, and Hard subsets of the WIDER FACE validation set, respectively, with performance dropping on the Hard subset, which contains many small-scale faces [102].
In 2021, Malakar et al. [103] suggested a reconstructive approach to obtain partially restored characteristics of the occluded area of the face, which was subsequently recognized using an existing deep learning method. The proposed technique does not entirely recreate the occluded section, but it does give enough characteristics to enhance recognition accuracy by up to 15%.
In 2022, Marjan et al. [104] provided an improved VGG19 deep model to enhance the accuracy of masked face recognition systems. The experimental findings show that the suggested extended VGG19 method outperforms other techniques, accurately detecting the frontal face with a mask (96%).
With respect to generative adversarial network (GAN)-based techniques, Li et al. [105] provided an algorithm architecture including de-occlusion and distillation modules. The de-occlusion module utilizes GAN for masked face completion, allowing for the recovery of occluded features and eliminating appearance ambiguity. The distillation module employs a pre-trained model for face categorization. The simulated LFW dataset had the greatest recognition accuracy of 95.44%. GANs offer a way to grow datasets without requiring a lot of real-world data collection, which enhances system performance while resolving privacy and data availability issues [106,107,108]. However, it is difficult for GAN-based algorithms to duplicate the characteristics of the face’s key features, especially when there is broad occlusion, as in the case of a facemask [109].
Using a Convolutional Block Attention Module (CBAM) [110], Li et al. [111] presented a novel technique that blends cropping-based and attention-based methodologies. To maximize recognition accuracy, the cropping-based method eliminates the masked face region while experimenting with different cropping proportions, and the attention-based procedure assigns lower weights to the masked facial features and larger weights to the characteristics surrounding the eyes. This method achieved a masked face recognition (MFR) accuracy of 92.61%. In a different work, Deng et al. [112] improved masked facial recognition by applying a large margin cosine loss to create the MFCosface algorithm, which outperformed the attention-based approach. To improve the model’s emphasis on the uncovered face area, they also developed an Attention–Inception module that combines CBAM with Inception–ResNet, which slightly improved verification tasks. Wu [113] presented a local restricted dictionary learning approach for an attention-based MFR algorithm that distinguishes between the face and the mask, using the attention mechanism to lessen information loss and dilated convolution to increase picture resolution.
Table 10 summarizes various face recognition architectures, their corresponding training sets, and performance metrics, highlighting key approaches such as Convolutional Neural Networks (CNNs), VGG16, FaceNet, and others across different datasets and years. The table provides insights into the evolution of face recognition models, including their verification metrics and accuracy rates, showcasing both traditional and contemporary methods.
When comparing various convolutional neural network (CNN) architectures used for face recognition, several differences in accuracy and training datasets become apparent. For instance, FaceNet, a model developed by Parkhi et al. [48] in 2015, demonstrated an impressive accuracy of 98.87% on the LFW dataset, showcasing its strong performance in large-scale face verification tasks. Similarly, VGG16, introduced by Parkhi et al. [48] in the same year, achieved a slightly higher accuracy of 98.95% on the VGGFace dataset, which indicates its effectiveness in handling facial recognition tasks involving variations in face poses, lighting, and identities. The models trained on these benchmark datasets, LFW and VGGFace, highlight the robustness of these architectures in large-scale real-world conditions.
On the other hand, models like SVM (2003) and Gabor Wavelets (2001), while demonstrating high accuracy on smaller datasets like ORL (96% and 95.25%, respectively), show limitations in terms of scalability to larger, more complex datasets. Their reliance on simpler algorithms and smaller training sets places them at a disadvantage when compared to more recent architectures like FaceNet and VGG16, which leverage deep learning techniques and larger, more diverse datasets. For example, FaceNet’s success on LFW (98.87%) and its application to other datasets like Yale Face Database B (70–80%) emphasize its versatility and high generalization capabilities.
Moreover, architectures like MTCNN, which achieved 99.85% accuracy on the WIDER Face dataset and LFW, demonstrate the advantage of combining multiple models—such as face detection and face recognition—into a single framework. This further underscores the importance of using multi-task learning and ensemble methods to achieve state-of-the-art results. While older models like Gabor Wavelets performed reasonably well on smaller datasets, the increasing accuracy and applicability of newer models on larger datasets signal the rapid progress in the field of face recognition systems.
In conclusion, models like FaceNet, VGG16, and MTCNN, when trained on comprehensive datasets such as LFW and VGGFace, show superior performance, with accuracy rates exceeding 98%. In contrast, earlier models such as SVM and Gabor Wavelets perform well on smaller, less complex datasets but fall short in comparison to modern architectures designed for scalability and robustness. The evolution from simpler models to advanced deep learning approaches highlights the growing demand for sophisticated algorithms capable of handling large-scale, real-world recognition challenges.

11. Limitations of Face Recognition Models

Concerns regarding the violation of individual privacy rights are raised by the use of face data without authorization, particularly in public monitoring systems [118]. Facial recognition systems have been demonstrated to exhibit severe biases, notably regarding ethnicity, gender, and age [8]. These biases can lead to increased error rates, such as misidentification or false positives, which are problematic in law enforcement or security applications [8]. Face recognition systems are also subject to spoofing attacks, in which an attacker deceives the system with pictures, videos, or 3D models of a person’s face. Although systems are improving with the use of liveness detection, these attacks remain a risk for highly sensitive applications [119].

12. Challenges, Results, and Discussion

From our review of CNNs for face recognition, we discovered that some researchers face challenges in applying face recognition to low-resolution images. Al-Azzawi et al. [93] suggested that using a localized deep feature map improves facial recognition accuracy when dealing with low-resolution images, while Talab et al. [96] suggested that converting low-resolution images to high-resolution ones is the best way to handle low-resolution images for better accuracy. All the suggested techniques achieved good accuracy; however, they are computationally expensive and likely to struggle in real-time face recognition.
We also discovered that some researchers build face recognition systems with small datasets because of the difficulty of collecting data, the risk of over-fitting, and the need to reduce computational time. Feng et al. [97] proposed a small sample face recognition approach using ensemble deep learning, and Lin et al. [98] suggested LWCNN for small sample data and employed k-fold cross-validation to ensure robustness. Using a small sample dataset raises questions regarding the accuracy of the model in real-life applications, since CNNs generally require large datasets to achieve the best accuracy.
Occlusion, pose, and lighting are also still a concern. Malakar et al. [103] suggested a reconstructive approach to obtain partially restored characteristics of the occluded area of the face, which are subsequently recognized using an existing deep learning method, and Marjan et al. [104] provided an improved VGG19 deep model to enhance the accuracy of masked face recognition systems. Both techniques rely on existing unmasked data to identify an individual; they appear to be ineffective if unmasked data are not available, and they are also unable to fully reconstruct the missing parts.
Most researchers use still images to train and test facial recognition systems [8]. Using image face datasets helps them achieve high accuracy provided that the following conditions are met: 1. the subject is, for the most part, directly facing the camera; and 2. the environment is controlled. Using video for facial recognition presents many challenges but creates an opportunity for research that explores gait recognition, which has the advantages of overcoming camera-angle problems and of identifying a masked person by their movement.

13. Conclusions

In this study, we explored CNNs for face recognition, databases, and their performance metrics. CNNs are very versatile depending on the objective(s), provided that their architecture and layers are not limited. Face recognition systems can identify a person under various conditions; however, most of them struggle with occlusion, camera angle, pose, and light.

13.1. Observations

  • More sophisticated deep learning architectures like FaceNet, VGG16, and MTCNN have clearly replaced earlier, simpler models like SVM and Gabor Wavelets. Newer models increase scalability and accuracy, demonstrating the increasing sophistication of the face recognition field.
  • As we have shown, there are several CNN designs, databases, and performance measures. In the face recognition field, different CNN architectures are applied to different tasks, and certain designs work well in scenarios such as low-quality images or videos and frontal face recognition. Most of the datasets used to train and evaluate face recognition algorithms contained more images than videos, and we also noticed that most researchers employed Softmax as the verification measure.
  • We observed that most of the databases used for training and testing face recognition models contained fewer Black people than people of other races. This lack of balance in the datasets leads to bias. The issue of privacy surrounding cameras is still being debated in the US [8,120]; one aspect of the discussion is the inability of facial recognition software to reliably distinguish between black and white faces, resulting in racial profiling and erroneous arrests [8,120].
  • Occlusion, camera angle, pose, and lighting are all issues that persist [8]. Researchers are attempting to devise solutions to these challenges; however, the varying resolutions of image or video sources, such as CCTV, create additional issues. Intelligent and automated face recognition systems work well in controlled contexts but poorly in uncontrolled ones [121]. The two main causes are facial alignment and the tendency of researchers to train on high- or low-resolution face images taken in controlled settings.
  • Models trained on bigger and more diversified datasets, like LFW and VGGFace, often perform better in real-world, large-scale recognition tasks. The comparison of simpler datasets such as ORL and more complex datasets such as LFW indicates that dataset size and variety are important variables in the performance of face recognition systems.
  • Little research has been conducted using Transformers and Mamba for face recognition.
  • Most researchers use accuracy as a performance metric. The use of accuracy as a performance measurement without comparing it to other performance metrics raises questions about the validity of results.
  • The use of several models for face detection and recognition, as demonstrated in MTCNN, highlights the potential advantages of hybrid and ensemble techniques. Research into more advanced hybrid models might increase the robustness and adaptability of face recognition systems.

13.2. Contribution of Article

This work will contribute to the body of knowledge in the face recognition field by providing an updated overview of facial recognition systems, encompassing past, present, and future challenges. The report also summarizes 266 publications which were reviewed and compared. The CNN architectures and performance measures detailed in this paper demonstrate their applicability and challenges.

13.3. Future Work

In order to overcome present constraints and improve the capabilities of current models, future research in face recognition systems should concentrate on a few crucial areas. The creation of hybrid techniques, which combine the scalability and performance of contemporary deep learning architectures with the simplicity and effectiveness of older models like SVM and Gabor Wavelets, is one encouraging avenue. More reliable training methods and data augmentation procedures are also required, as evidenced by the critical requirement to improve the generalization of models such as FaceNet and VGG16 over a variety of datasets with different lighting, postures, and occlusions. As proven by MTCNN, multi-task learning and ensemble techniques provide chances to enhance the integration of face detection and identification inside a single, cohesive framework. Another crucial topic is improving the quality of large-scale datasets by tackling issues that might impair model performance, such as noise, low resolution, and poor picture quality. Additionally, to enhance face identification in unrestricted, real-world settings, models that are more robust to changes in occlusions, extreme positions, and facial emotions must be developed. The requirement for accurate and resource-efficient models that guarantee that computational constraints do not impair performance is increasing as real-time facial recognition system applications proliferate. Finally, more studies should be conducted on the ethical issues of fairness, prejudice, and privacy in face recognition systems. Future studies should concentrate on reducing bias and making sure privacy laws are followed in real-world applications.

Author Contributions

Conceptualization, A.N., C.C. and S.V.; methods, A.N. and C.C.; investigation, A.N. and C.C.; formal Analysis, A.N.; data curation, A.N.; writing—original draft preparation, A.N.; writing—review and editing, A.N., C.C. and S.V.; supervision, S.V. and C.C.; project administration, S.V. and C.C.; funding acquisition, S.V. and C.C. Authorship has been limited to those who have contributed substantially to the work reported. All authors have read and agreed to the published version of the manuscript.

Funding

This work was undertaken within the context of the Centre for Artificial Intelligence Research, which is supported by the Centre for Scientific and Innovation Research (CSIR) under grant number CSIR/BEI/HNP/CAIR/2020/10, supported by the Government of the Republic of South Africa through its Department of Science and Innovation’s University Capacity Development grants.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analysed in this study can be found in the University of Johannesburg repository using the following link: https://figshare.com/s/dbfbe28773afeb71872f (accessed on 12 June 2024).

Acknowledgments

We acknowledge both the moral and technical support given by the University of Johannesburg, University of KwaZulu-Natal, and Sol Plaatje University.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CNN	Convolutional Neural Networks
VGG	Visual Geometry Group
ORL	Olivetti Research Laboratory
FERET	Facial Recognition Technology
VGG16	Visual Geometry Group 16
LFW	Labeled Faces in the Wild
VGGFace	Visual Geometry Group Face
MTCNN	Multi-Task Cascaded Convolutional Networks
FGRC	Face Recognition Grand Challenge
CASIA	Chinese Academy of Sciences Institute of Automation
IARPA	Intelligence Advanced Research Projects Activity
DMFD	Dynamic Multi-Factor Database
PCA	Principal Component Analysis
SVM	Support Vector Machine
HOG	Histogram of Oriented Gradients
AlexNet	CNN architecture introduced by Krizhevsky et al.
VGGNet	Visual Geometry Group Network (CNN architecture)
ResNet	Residual Network
FaceNet	Face Recognition Network
LBPNet	Local Binary Patterns Network
LWCNN	Lightweight Convolutional Neural Network
YOLO	You Only Look Once
ViT	Vision Transformer
AR	AR face database
CVL	Computer Vision Laboratory
AUC	Area Under the Curve
GAN	Generative Adversarial Networks
QAR	Quality Assessment Rules
CBAM	Convolutional Block Attention Module
DCR	Deep Coupled ResNet
FPGA	Field Programmable Gate Array
CPU	Central Processing Unit
GPU	Graphics Processing Unit
TPU	Tensor Processing Unit
MHz	Megahertz
FPS	Frames Per Second

References

  1. Junayed, M.S.; Sadeghzadeh, A.; Islam, M.B. Deep covariance feature and cnn-based end-to-end masked face recognition. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, India, 15–18 December 2021; IEEE: New York, NY, USA, 2021; pp. 1–8. [Google Scholar]
  2. Chien, J.T.; Wu, C.C. Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1644–1649. [Google Scholar] [CrossRef]
  3. Wan, L.; Liu, N.; Huo, H.; Fang, T. Face recognition with convolutional neural networks and subspace learning. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; IEEE: New York, NY, USA, 2017; pp. 228–233. [Google Scholar]
  4. Jain, A.K.; Ross, A.A.; Nandakumar, K. Introduction to Biometrics; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  5. Taskiran, M.; Kahraman, N.; Erdem, C.E. Face recognition: Past, present and future (a review). Digit. Signal Process. 2020, 106, 102809. [Google Scholar] [CrossRef]
  6. Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face Recognition Systems: A Survey. Sensors 2020, 20, 342. [Google Scholar] [CrossRef]
  7. Saini, R.; Rana, N. Comparison of various biometric methods. Int. J. Adv. Sci. Technol. 2014, 2, 24–30. [Google Scholar]
  8. Nemavhola, A.; Viriri, S.; Chibaya, C. A Scoping Review of Literature on Deep Learning Techniques for Face Recognition. Hum. Behav. Emerg. Technol. 2025, 2025, 5979728. [Google Scholar] [CrossRef]
  9. Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188. [Google Scholar] [CrossRef]
  10. Nilsson, N.J. The Quest for Artificial Intelligence; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  11. de Leeuw, K.M.M.; Bergstra, J. The History of Information Security: A Comprehensive Handbook; Elsevier: Amsterdam, The Netherlands, 2007. [Google Scholar]
  12. Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [Google Scholar] [CrossRef]
  13. Gates, K.A. Our Biometric Future: Facial Recognition Technology and the Culture of Surveillance; NYU Press: New York, NY, USA, 2011; Volume 2. [Google Scholar]
  14. King, I. Neural Information Processing: 13th International Conference, ICONIP 2006, Hong Kong, China, 3–6 October 2006: Proceedings; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  15. Xue-Fang, L.; Tao, P. Realization of face recognition system based on Gabor wavelet and elastic bunch graph matching. In Proceedings of the 2013 25th Chinese Control and Decision Conference (CCDC), Guiyang, China, 25–27 May 2013; IEEE: New York, NY, USA, 2013; pp. 3384–3386. [Google Scholar]
  16. Kundu, M.K.; Mitra, S.; Mazumdar, D.; Pal, S.K. Perception and Machine Intelligence: First Indo-Japan Conference, PerMIn 2012, Kolkata, India, 12–13 January 2011, Proceedings; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 7143. [Google Scholar]
  17. Guo, G.; Zhang, N. A survey on deep learning based face recognition. Comput. Vis. Image Underst. 2019, 189, 102805. [Google Scholar] [CrossRef]
  18. Datta, A.K.; Datta, M.; Banerjee, P.K. Face Detection and Recognition: Theory and Practice; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
  19. Gofman, M.I.; Villa, M. Identity and War: The Role of Biometrics in the Russia-Ukraine Crisis. Int. J. Eng. Sci. Technol. (IJonEST) 2023, 5, 89. [Google Scholar] [CrossRef]
  20. Barnouti, N.H.; Al-Dabbagh, S.S.M.; Matti, W.E. Face recognition: A literature review. Int. J. Appl. Inf. Syst. 2016, 11, 21–31. [Google Scholar] [CrossRef]
  21. Lal, M.; Kumar, K.; Arain, R.H.; Maitlo, A.; Ruk, S.A.; Shaikh, H. Study of face recognition techniques: A survey. Int. J. Adv. Comput. Sci. Appl. 2018, 1–8. [Google Scholar] [CrossRef]
  22. Shamova, U. Face Recognition in Healthcare: General Overview. In Язык в сфере профессиональной коммуникации [Language in the Sphere of Professional Communication]; Yekaterinburg, Russia, 2020; pp. 748–752. Available online: https://elar.urfu.ru/handle/10995/84113 (accessed on 17 March 2024). [Google Scholar]
  23. Elngar, A.A.; Kayed, M. Vehicle security systems using face recognition based on internet of things. Open Comput. Sci. 2020, 10, 17–29. [Google Scholar] [CrossRef]
  24. Xing, J.; Fang, G.; Zhong, J.; Li, J. Application of face recognition based on CNN in fatigue driving detection. In Proceedings of the 2019 International Conference on Artificial Intelligence and Advanced Manufacturing, Dublin, Ireland, 17–19 October 2019; pp. 1–5. [Google Scholar]
  25. Pabiania, M.D.; Santos, K.A.P.; Villa-Real, M.M.; Villareal, J.A.N. Face recognition system for electronic medical record to access out-patient information. J. Teknol. 2016, 78. [Google Scholar] [CrossRef]
  26. Aswis, A.; Morsy, M.; Abo-Elsoud, M. Face Recognition Based on PCA and DCT Combination Technique. Int. J. Eng. Res. Technol. 2018, 4, 1295–1298. [Google Scholar]
  27. Ranjan, R.; Sankaranarayanan, S.; Bansal, A.; Bodla, N.; Chen, J.C.; Patel, V.M.; Castillo, C.D.; Chellappa, R. Deep learning for understanding faces: Machines may be just as good, or better, than humans. IEEE Signal Process. Mag. 2018, 35, 66–83. [Google Scholar] [CrossRef]
  28. Loy, C.C. Face detection. In Computer Vision: A Reference Guide; Springer: Berlin/Heidelberg, Germany, 2021; pp. 429–434. [Google Scholar]
  29. Yang, M.H. Face Detection. In Encyclopedia of Biometrics; Li, S.Z., Jain, A.K., Eds.; Springer: Boston, MA, USA, 2015; pp. 447–452. [Google Scholar] [CrossRef]
  30. Calvo, G.; Baruque, B.; Corchado, E. Study of the pre-processing impact in a facial recognition system. In Proceedings of the Hybrid Artificial Intelligent Systems: 8th International Conference, HAIS 2013, Salamanca, Spain, 11–13 September 2013; Proceedings 8. Springer: Berlin/Heidelberg, Germany, 2013; pp. 334–344. [Google Scholar]
  31. Benedict, S.R.; Kumar, J.S. Geometric shaped facial feature extraction for face recognition. In Proceedings of the 2016 IEEE International Conference on Advances in Computer Applications (ICACA), Coimbatore, India, 24 October 2016; pp. 275–278. [Google Scholar] [CrossRef]
  32. Napoléon, T.; Alfalou, A. Pose invariant face recognition: 3D model from single photo. Opt. Lasers Eng. 2017, 89, 150–161. [Google Scholar] [CrossRef]
  33. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.; Horsley, T.; Weeks, L.; et al. PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef] [PubMed]
  34. Phillips, P.; Moon, H.; Rizvi, S.; Rauss, P. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1090–1104. [Google Scholar] [CrossRef]
  35. Belhumeur, P.; Hespanha, J.; Kriegman, D. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 711–720. [Google Scholar] [CrossRef]
  36. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled faces in the wild: A database forstudying face recognition in unconstrained environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17–20 October 2008. [Google Scholar]
  37. Zou, J.; Ji, Q.; Nagy, G. A comparative study of local matching approach for face recognition. IEEE Trans. Image Process. 2007, 16, 2617–2628. [Google Scholar] [CrossRef]
  38. Peer, P. CVL Face Database; University of Ljubljana: Ljubljana, Slovenia, 2010. [Google Scholar]
  39. Fox, N.; Reilly, R.B. Audio-visual speaker identification based on the use of dynamic audio and visual features. In Proceedings of the International Conference on Audio-and Video-Based Biometric Person Authentication, Guildford, UK, 9–11 June 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 743–751. [Google Scholar]
  40. Bailly-Bailliére, E.; Bengio, S.; Bimbot, F.; Hamouz, M.; Kittler, J.; Mariéthoz, J.; Matas, J.; Messer, K.; Popovici, V.; Porée, F.; et al. The BANCA database and evaluation protocol. In Proceedings of the Audio-and Video-Based Biometric Person Authentication: 4th International Conference, AVBPA 2003, Guildford, UK, 9–11 June 2003; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2003; pp. 625–638. [Google Scholar]
  41. Phillips, P.; Flynn, P.; Scruggs, T.; Bowyer, K.; Chang, J.; Hoffman, K.; Marques, J.; Min, J.; Worek, W. Overview of the face recognition grand challenge. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 947–954. [Google Scholar] [CrossRef]
  42. Milborrow, S.; Morkel, J.; Nicolls, F. The MUCT Landmarked Face Database. In Pattern Recognition Association of South Africa; 2010; Available online: http://www.milbo.org/muct (accessed on 12 April 2024).
  43. Gross, R.; Matthews, I.; Cohn, J.; Kanade, T.; Baker, S. Multi-PIE. Image Vis. Comput. 2013, 28, 807–813. [Google Scholar] [CrossRef]
  44. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923. [Google Scholar]
  45. Klare, B.F.; Klein, B.; Taborsky, E.; Blanton, A.; Cheney, J.; Allen, K.; Grother, P.; Mah, A.; Jain, A.K. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1931–1939. [Google Scholar]
  46. Kemelmacher-Shlizerman, I.; Seitz, S.M.; Miller, D.; Brossard, E. The MegaFace Benchmark: 1 Million Faces for Recognition at Scale. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4873–4882. [Google Scholar] [CrossRef]
  47. Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A.K.; Duncan, J.A.; Allen, K.; et al. Iarpa janus benchmark-b face dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 90–98. [Google Scholar]
  48. Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the BMVC 2015-Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; British Machine Vision Association: Durham, UK, 2015. [Google Scholar]
  49. Sengupta, S.; Chen, J.C.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; IEEE: New York, NY, USA, 2016; pp. 1–9. [Google Scholar]
  50. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 87–102. [Google Scholar]
  51. Al-ghanim, F.; Aljuboori, A. Face Recognition with Disguise and Makeup Variations Using Image Processing and Machine Learning. In Proceedings of the Advances in Computing and Data Sciences: 5th International Conference, ICACDS 2021, Nashik, India, 23–24 April 2021; pp. 386–400. [Google Scholar] [CrossRef]
  52. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; IEEE: New York, NY, USA, 2018; pp. 67–74. [Google Scholar]
  53. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. Iarpa janus benchmark-c: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, Australia, 20–23 February 2018; IEEE: New York, NY, USA, 2018; pp. 158–165. [Google Scholar]
  54. Nech, A.; Kemelmacher-Shlizerman, I. Level Playing Field for Million Scale Face Recognition. arXiv 2017, arXiv:1705.00393. [Google Scholar]
  55. Kushwaha, V.; Singh, M.; Singh, R.; Vatsa, M.; Ratha, N.K.; Chellappa, R. Disguised Faces in the Wild. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1–18. [Google Scholar]
  56. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. LFR face dataset:Left-Front-Right dataset for pose-invariant face recognition in the wild. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2–5 February 2020; pp. 124–130. [Google Scholar] [CrossRef]
  57. Ayad, W.; Qays, S.; Al-Naji, A. Generating and Improving a Dataset of Masked Faces Using Data Augmentation. J. Tech. 2023, 5, 46–51. [Google Scholar] [CrossRef]
  58. Gottumukkal, R.; Asari, V.K. An improved face recognition technique based on modular PCA approach. Pattern Recognit. Lett. 2004, 25, 429–436. [Google Scholar] [CrossRef]
  59. Yang, J.; Liu, C.; Zhang, L. Color space normalization: Enhancing the discriminating power of color spaces for face recognition. Pattern Recognit. 2010, 43, 1454–1466. [Google Scholar] [CrossRef]
  60. Ye, J.; Janardan, R.; Li, Q. Two-dimensional linear discriminant analysis. Adv. Neural Inf. Process. Syst. 2004, 17, 1–8. [Google Scholar]
  61. Rahman, M.T.; Bhuiyan, M.A. Face recognition using Gabor Filters. In Proceedings of the 2008 11th International Conference on Computer and Information Technology, Khulna, Bangladesh, 24–27 December 2008; pp. 510–515. [Google Scholar] [CrossRef]
  62. Olshausen, B.A.; Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 1996, 381, 607–609. [Google Scholar] [CrossRef]
  63. Hammouche, R.; Attia, A.; Akhrouf, S.; Akhtar, Z. Gabor filter bank with deep autoencoder based face recognition system. Expert Syst. Appl. 2022, 197, 116743. [Google Scholar] [CrossRef]
  64. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 1–8. [Google Scholar] [CrossRef]
  65. Ruppert, D. The elements of statistical learning: Data mining, inference, and prediction. J. Am. Stat. Assoc. 2004, 99, 567. [Google Scholar] [CrossRef]
  66. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
  67. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Nice, France, 2012; Volume 25. [Google Scholar]
  68. Levi, G.; Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 34–42. [Google Scholar]
  69. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  70. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  71. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387. [Google Scholar]
  72. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  73. Xi, M.; Chen, L.; Polajnar, D.; Tong, W. Local binary pattern network: A deep learning approach for face recognition. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; pp. 3224–3228. [Google Scholar]
  74. Wu, X.; He, R.; Sun, Z.; Tan, T. A light CNN for deep face representation with noisy labels. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2884–2896. [Google Scholar] [CrossRef]
  75. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  76. Kumar, K.K.; Kasiviswanadham, Y.; Indira, D.; Priyanka palesetti, P.; Bhargavi, C. Criminal face identification system using deep learning algorithm multi-task cascade neural network (MTCNN). Mater. Today Proc. 2023, 80, 2406–2410, SI:5 NANO 2021. [Google Scholar] [CrossRef]
  77. Ullah, N.; Javed, A.; Ali Ghazanfar, M.; Alsufyani, A.; Bourouis, S. A novel DeepMaskNet model for face mask detection and masked facial recognition. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 9905–9914. [Google Scholar] [CrossRef]
  78. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  79. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. arXiv 2019, arXiv:1801.04381. [Google Scholar]
  80. Chen, S.; Liu, Y.; Gao, X.; Han, Z. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In Proceedings of the Biometric Recognition: 13th Chinese Conference, CCBR 2018, Urumqi, China, 11–12 August 2018; Proceedings 13. Springer: Berlin/Heidelberg, Germany, 2018; pp. 428–438. [Google Scholar]
  81. Alexey, D.; Fischer, P.; Tobias, J.; Springenberg, M.R.; Brox, T. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE TPAMI 2016, 38, 1734–1747. [Google Scholar]
  82. Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards Deeper Vision Transformer. arXiv 2021, arXiv:2103.11886. [Google Scholar]
  83. Zhong, Y.; Deng, W. Face transformer for recognition. arXiv 2021, arXiv:2103.14803. [Google Scholar]
  84. Sun, Z.; Tzimiropoulos, G. Part-based face recognition with vision transformers. arXiv 2022, arXiv:2212.00057. [Google Scholar]
  85. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar]
  86. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10076–10085. [Google Scholar]
  87. Zhu, B.; Li, L.; Hu, X.; Wu, F.; Zhang, Z.; Zhu, S.; Wang, Y.; Wu, J.; Song, J.; Li, F.; et al. DEFOG: Deep Learning with Attention Mechanism Enabled Cross-Age Face Recognition. Tsinghua Sci. Technol. 2024, 30, 1342–1358. [Google Scholar] [CrossRef]
  88. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar]
  89. Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17949–17958. [Google Scholar]
  90. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  91. Sharma, S.; Shanmugasundaram, K.; Ramasamy, S.K. FAREC—CNN based efficient face recognition technique using Dlib. In Proceedings of the 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), Ramanathapuram, India, 25–27 May 2016; pp. 192–195. [Google Scholar] [CrossRef]
  92. Arsenovic, M.; Sladojevic, S.; Anderla, A.; Stefanovic, D. FaceTime—Deep learning based face recognition attendance system. In Proceedings of the 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 14–16 September 2017; pp. 53–58. [Google Scholar] [CrossRef]
  93. Al-Azzawi, A.; Hind, J.; Cheng, J. Localized Deep-CNN Structure for Face Recognition. In Proceedings of the 2018 11th International Conference on Developments in eSystems Engineering (DeSE), Cambridge, UK, 2–5 September 2018; pp. 52–57. [Google Scholar] [CrossRef]
  94. Lu, Z.; Jiang, X.; Kot, A. Deep coupled resnet for low-resolution face recognition. IEEE Signal Process. Lett. 2018, 25, 526–530. [Google Scholar] [CrossRef]
  95. Qu, X.; Wei, T.; Peng, C.; Du, P. A Fast Face Recognition System Based on Deep Learning. In Proceedings of the 2018 11th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 8–9 December 2018; Volume 1, pp. 289–292. [Google Scholar] [CrossRef]
  96. Talab, M.A.; Awang, S.; Najim, S.A.d.M. Super-Low Resolution Face Recognition using Integrated Efficient Sub-Pixel Convolutional Neural Network (ESPCN) and Convolutional Neural Network (CNN). In Proceedings of the 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Selangor, Malaysia, 29 June 2019; pp. 331–335. [Google Scholar] [CrossRef]
  97. Feng, Y.; Pang, T.; Li, M.; Guan, Y. Small sample face recognition based on ensemble deep learning. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 4402–4406. [Google Scholar] [CrossRef]
  98. Lin, M.; Zhang, Z.; Zheng, W. A Small Sample Face Recognition Method Based on Deep Learning. In Proceedings of the 2020 IEEE 20th International Conference on Communication Technology (ICCT), Nanning, China, 28–31 October 2020; pp. 1394–1398. [Google Scholar] [CrossRef]
  99. Yudita, S.I.; Mantoro, T.; Ayu, M.A. Deep Face Recognition for Imperfect Human Face Images on Social Media using the CNN Method. In Proceedings of the 2021 4th International Conference of Computer and Informatics Engineering (IC2IE), Depok, Indonesia, 14–15 September 2021; pp. 412–417. [Google Scholar] [CrossRef]
  100. Szmurlo, R.; Osowski, S. Deep CNN ensemble for recognition of face images. In Proceedings of the 2021 22nd International Conference on Computational Problems of Electrical Engineering (CPEE), Hradek u Susice, Czech Republic, 15–17 September 2021; pp. 1–4. [Google Scholar] [CrossRef]
  101. Wu, C.; Zhang, Y. MTCNN and FACENET based access control system for face detection and recognition. Autom. Control. Comput. Sci. 2021, 55, 102–112. [Google Scholar]
  102. Sanchez-Moreno, A.S.; Olivares-Mercado, J.; Hernandez-Suarez, A.; Toscano-Medina, K.; Sanchez-Perez, G.; Benitez-Garcia, G. Efficient face recognition system for operating in unconstrained environments. J. Imaging 2021, 7, 161. [Google Scholar] [CrossRef] [PubMed]
  103. Malakar, S.; Chiracharit, W.; Chamnongthai, K.; Charoenpong, T. Masked Face Recognition Using Principal component analysis and Deep learning. In Proceedings of the 2021 18th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Online, 19–22 May 2021; pp. 785–788. [Google Scholar] [CrossRef]
  104. Marjan, M.A.; Hasan, M.; Islam, M.Z.; Uddin, M.P.; Afjal, M.I. Masked Face Recognition System using Extended VGG-19. In Proceedings of the 2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE), Rajshahi, Bangladesh, 29–31 December 2022; pp. 1–4. [Google Scholar] [CrossRef]
  105. Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3911–3919. [Google Scholar]
  106. Pann, V.; Lee, H.J. Effective attention-based mechanism for masked face recognition. Appl. Sci. 2022, 12, 5590. [Google Scholar] [CrossRef]
  107. Yuan, L.; Li, F. Face recognition with occlusion via support vector discrimination dictionary and occlusion dictionary based sparse representation classification. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, China, 11–13 November 2016; IEEE: New York, NY, USA, 2016; pp. 110–115. [Google Scholar]
  108. Deng, W.; Hu, J.; Guo, J. Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1864–1870. [Google Scholar] [CrossRef]
  109. Alzu’bi, A.; Albalas, F.; Al-Hadhrami, T.; Younis, L.B.; Bashayreh, A. Masked face recognition using deep learning: A review. Electronics 2021, 10, 2666. [Google Scholar] [CrossRef]
  110. Li, S.; Lee, H.J. Effective Attention-Based Feature Decomposition for Cross-Age Face Recognition. Appl. Sci. 2022, 12, 4816. [Google Scholar] [CrossRef]
  111. Boutros, F.; Damer, N.; Kirchbuchner, F.; Kuijper, A. Self-restrained triplet loss for accurate masked face recognition. Pattern Recognit. 2022, 124, 108473. [Google Scholar] [CrossRef]
  112. Deng, H.; Feng, Z.; Qian, G.; Lv, X.; Li, H.; Li, G. MFCosface: A masked-face recognition algorithm based on large margin cosine loss. Appl. Sci. 2021, 11, 7310. [Google Scholar] [CrossRef]
  113. Wu, G. Masked face recognition algorithm for a contactless distribution cabinet. Math. Probl. Eng. 2021, 2021, 5591020. [Google Scholar] [CrossRef]
  114. Yanhun, Z.; Chongqing, L. Face recognition based on support vector machine and nearest neighbor classifier. J. Syst. Eng. Electron. 2003, 14, 73–76. [Google Scholar]
  115. Kepenekci, B. Face Recognition Using Gabor Wavelet Transform. Master’s Thesis, Middle East Technical University, Ankara, Turkey, 2001. [Google Scholar]
  116. Lou, G.; Shi, H. Face image recognition based on convolutional neural network. China Commun. 2020, 17, 117–124. [Google Scholar] [CrossRef]
  117. Mahesh, S.; Ramkumar, G. Smart Face Detection and Recognition in Illumination Invariant Images using AlexNet CNN Compare Accuracy with SVM. In Proceedings of the 2022 3rd International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 27–29 April 2022; pp. 572–575. [Google Scholar] [CrossRef]
  118. Garvie, C.; Bedoya, A.; Frankle, J. Unregulated Police Face Recognition in America. Perpetual Line Up. 2016. Available online: https://www.perpetuallineup.org/ (accessed on 17 March 2024).
  119. Korshunov, P.; Marcel, S. DeepFakes: A New Threat to Face Recognition? Assessment and Detection. arXiv 2018, arXiv:1812.08685. [Google Scholar]
  120. Sikhakhane, N. Joburg hostels and Townships coming under surveillance by facial recognition cameras. Drones 2023. Available online: https://www.dailymaverick.co.za/article/2023-08-13-joburg-hostels-and-townships-coming-under-surveillance-by-facial-recognition-cameras-and-drones/ (accessed on 12 April 2024).
  121. Masud, M.; Muhammad, G.; Alhumyani, H.; Alshamrani, S.S.; Cheikhrouhou, O.; Ibrahim, S.; Hossain, M.S. Deep learning-based intelligent face recognition in IoT-cloud environment. Comput. Commun. 2020, 152, 215–222. [Google Scholar] [CrossRef]
Figure 1. Face recognition steps [26].
Figure 2. PRISMA-ScR diagram showing all the steps taken to filter out articles [8].
Figure 3. Databases used in face recognition systems.
Figure 4. ORL Database samples.
Figure 5. FERET database samples.
Figure 6. AR database samples.
Figure 7. XM2VTS database samples.
Figure 8. FGRC database samples.
Figure 9. LFW database samples.
Figure 10. CMU Multi-PIE database samples.
Figure 11. VGG architecture.
Table 1. Summary of applications of face recognition.
Application Areas	Use
Security	Office access, email authentication on multimedia workstations, flight boarding systems, and building access management [20].
Surveillance	CCTV control, power grid surveillance, portal control, and drug offender monitoring and search [20].
Health	To identify patients and manage patients’ medical records [25].
Cell phones and gaming consoles	Unlocking devices, gaming, and mobile banking [10].
Table 2. Research questions.
Question	Purpose
Q1 What are the most prevalent methods used for face recognition? How do they compare in performance?	To determine which techniques have been applied in the face recognition domain
Q2 What databases are used in face recognition?	To identify which databases of face recognition are most common
Q3 In which areas in the real world have face recognition techniques been applied?	To find out which areas have adopted face recognition
Q4 What are the most common evaluation metrics of face recognition systems?	To assess and identify suitable evaluation metrics to use when comparative studies are carried out in the field of face recognition
Table 3. Inclusion criteria.
Inclusion	Criteria	Explanation
IC1	Publicly available articles.	This criterion guarantees that only publicly available papers or those published in open-access journals are examined. The purpose is to make the articles accessible to all researchers and readers, thus fostering transparency and reproducibility.
IC2	English-language studies.	This promotes uniformity and avoids problems with linguistic hurdles. It also streamlines the review process because all research can be examined in a single language, making it more efficient and useful.
IC3	Papers about face recognition research.	The review’s goal is to give particular insights on facial recognition. This assures that the gathered research immediately adds to the knowledge and improvement of face recognition systems.
IC4	Articles published between 2013 and 2023.	Face recognition has advanced significantly, particularly with the introduction of deep learning and convolutional neural networks (CNNs). Research published during the previous decade is more likely to represent the most recent advances, trends, and breakthroughs in facial recognition. Excluding older papers guarantees that the evaluation focuses on current methodologies and technology.
Table 4. Exclusion criteria.
Exclusion | Criteria | Explanation
EC1 | Articles published in languages other than English. | This exclusion is mostly intended for efficiency. English-language studies are significantly more accessible to a worldwide audience, ensuring uniformity in terminology, research methodologies, and conclusions. Translating non-English papers would take time and might include mistakes or misinterpretations, jeopardizing the review’s credibility.
EC2 | Studies that do not offer the complete article. | The complete paper is required for a comprehensive evaluation because it offers specific information about the approach, findings, and conclusions. Relying just on abstracts or summaries may result in a misleading or partial assessment of the study’s quality and significance. Excluding incomplete papers guarantees that the review includes only fully accessible and transparent research.
EC3 | Studies without an abstract. | An abstract is a short overview of a study’s aims, methodology, findings, and conclusions. It enables reviewers to swiftly assess a study’s relevance to the research subject. Without an abstract, it is impossible to evaluate if the study is aligned with the review’s aims, and such studies may lack clarity or organization, resulting in exclusion.
EC4 | Articles published before 2013. | The removal of studies published before 2013 assures that the study is focused on the most current developments in facial recognition technology. Over the last decade, there have been major advances in deep learning, notably the use of convolutional neural networks (CNNs) for facial recognition. Older articles may not reflect these improvements and may be out of date in comparison to current cutting-edge procedures.
Table 5. Quality assessment questions.
QAR | Quality Question | Explanation
QAR1 | Are the study’s objectives well defined? | The study’s objectives must be well defined in order to focus the research and link it with the research topic. Ambiguity in the study’s aims might result in imprecise or inconclusive findings.
QAR2 | Is the study design appropriate for the research question? | This question asks if the study design is appropriate for solving the research topic. A well-chosen design (e.g., experimental, observational, etc.) assures that the study will provide valid and trustworthy findings.
QAR3 | Is there any comparative study conducted on deep learning methods for video processing? | This determines whether the study compares different deep learning algorithms used for image and video processing. Comparative studies assist in determining which strategies are most effective and give deeper insights into the subject matter.
QAR4 | Are the facial recognition algorithms or methods clearly described? | This question determines if the facial recognition techniques or methodologies utilized in the study are adequately presented. A thorough explanation is essential for understanding the technique used, reproducing the study, and assessing its success.
QAR5 | Does the study have an adequate average citation count per year? | This question gauges the study’s academic impact by examining its average citation count per year. A greater citation count might suggest that the study is well known and significant in the area, but it should also be seen in context (for example, the study’s age and field of research).
QAR6 | Are the outcome measures clearly defined and valid? | This evaluates if the study’s outcome measures (the variables or metrics used to assess success) are properly defined and valid (meaning they accurately assess what they are meant to measure). Valid outcome measurements guarantee that the study’s findings are significant and dependable.
Table 6. Summary of the databases used for training and testing face recognition systems.
Database | Year | Images | Videos | Subjects | Clean/Occlusion | Accessible
ORL [5] | 1994 | 400 | 0 | 40 | Both | Public
FERET [5,34] | 1996 | 14,126 | 0 | 1199 | Clean | Public
Yale [35] | 1997 | 165 | 0 | 15 | Data | Public
AR [20,36,37] | 1998 | >3000 | 0 | 116 | Both | Public
CVL [38] | 1999 | 798 | 0 | 114 | Clean | Public
XM2VTS [39] | 1999 | 2360 | 0 | 295 | Both | Public
BANCA [40] | 2003 | Data | 0 | 208 | Clean | Public
FRGC [41] | 2006 | 50,000 | 0 | 7143 | Clean | Public
LFW [36] | 2007 | 13,233 | 0 | 5749 | Both | Public
MUCT [42] | 2008 | 3755 | - | 0 | Both | Public
CMU Multi-PIE [44] | 2009 | 750,000 | 0 | 337 | Both | Public
CASIA Webface [45] | 2014 | 494,414 | 0 | 10,575 | Both | Public
IARPA Janus Benchmark-A [45] | 2015 | 5712 | 2085 | 500 | Both | Public
MegaFace [46] | 2016 | 1,000,000 | 0 | 690,572 | Both | Public
CFP [49] | 2016 | 7000 | 0 | 500 | Both | Public
Ms-Celeb-1M [50] | 2016 | 10,000,000 | 0 | 100,000 | Both | Public
DMFD [51] | 2016 | 2460 | 0 | 410 | Both | Private
VGGFACE [48] | 2016 | 2,600,000 | 0 | 2600 | Both | Public
VGGFACE 2 [52] | 2017 | 3,310,000 | 0 | 9131 | Both | Public
IARPA Janus Benchmark-B [47] | 2017 | 21,798 | 7011 | 1845 | Both | Public
MF2 [54] | 2017 | 4,700,000 | 0 | 672,000 | Both | Public
DFW [55] | 2018 | 11,157 | 0 | 1000 | Both | Public
IARPA Janus Benchmark-C [53] | 2018 | 31,334 | 11,779 | 3531 | Both | Public
CASIA mask [57] | 2021 | 494,414 | 0 | 10,575 | Occluded | Public
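Several of the public datasets in Table 6 can be obtained programmatically. As a minimal sketch (assuming a Python environment with scikit-learn installed, and using its bundled LFW loader rather than any tooling from the studies cited above), the standard LFW verification pairs can be loaded as follows:

```python
# Minimal sketch: loading the LFW verification pairs with scikit-learn.
# Assumes scikit-learn is installed; the first call downloads the dataset.
from sklearn.datasets import fetch_lfw_pairs

# "10_folds" returns the full set of verification pairs used for benchmarking.
lfw = fetch_lfw_pairs(subset="10_folds", color=False, resize=0.5)

print(lfw.pairs.shape)   # (n_pairs, 2, height, width): two face crops per pair
print(lfw.target[:10])   # 1 = same identity, 0 = different identities
```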
Table 7. Strengths and limitations of face recognition databases used for training and testing face recognition systems.
Database | Strengths | Limitations
ORL [5] | Small dataset, good for controlled experiments | Limited number of subjects (40), low-resolution images, restricted pose variation
FERET [5,34] | Diverse faces, widely used in face recognition research | Limited pose variation, restricted illumination conditions, outdated data
Yale [35] | Good for face recognition in controlled settings | Limited number of images, poses and expressions not varied enough
AR [20,36,37] | Large number of subjects and images, includes both clean and occluded faces | Significant noise due to occlusion, limited ethnic diversity, low-quality images
CVL [38] | Includes a variety of ethnicities, poses, and lighting conditions | Limited number of subjects (114), less variation in environmental conditions
XM2VTS [39] | High-quality and -resolution images, widely used in benchmarking | Small sample of subjects, data are not diverse enough for real-world scenarios
BANCA [40] | Focused on low-impersonation tasks, balanced dataset | Restricted in terms of pose variation, focused mostly on controlled settings
FRGC [41] | Large scale, high-resolution images, variety of facial expressions and lighting | High computational cost due to the large number of images, limited in diversity of subjects
LFW [36] | Large-scale dataset, commonly used for benchmarking face recognition | Limited to frontal face images, performance drops in challenging real-world scenarios
MUCT [42] | Diverse dataset in terms of ethnicity, good for real-world scenarios | Limited to images with visible faces, low variation in poses
CMU Multi-PIE [43] | Includes a variety of poses, lighting, and expressions | Faces with extreme poses or occlusions underrepresented, limited lighting conditions
CASIA Webface [44] | Large dataset with a variety of subjects and images | Imbalanced data distribution, low-quality images, faces mostly frontal with limited lighting
IARPA Janus Benchmark-A [45] | High-quality data, diverse subjects and poses | Relatively limited number of subjects, focused mostly on controlled settings
MegaFace [46] | Extremely large-scale dataset with many identities | Faces may be poorly annotated or of low resolution, data quality varies
CFP [49] | High diversity of subjects and images, challenging tasks | Limited diversity, performance drops in challenging real-world conditions
Ms-Celeb-1M [50] | Very large dataset, includes a wide variety of subjects | Mislabeled or noisy data, limited to celebrity faces, limited diversity in real-world scenarios
DMFD [51] | Focuses on face manipulation detection, high-quality data | Small number of subjects (410), limited ethnic diversity, focuses on facial manipulation detection
VGGFACE [48] | Large-scale dataset with diverse identities, popular for face recognition | Limited variation in lighting conditions, faces mostly frontal with minimal pose changes
VGGFACE 2 [52] | Large dataset with a good variety of subjects and poses | Faces are mostly frontal, variation in poses and lighting conditions not well covered
IARPA Janus Benchmark-B [47] | High-quality images, good variety of facial poses and conditions | Mislabeled data in some cases, faces from controlled settings, limited pose variation
MF2 [54] | Large-scale dataset, useful for testing large-scale face recognition systems | Faces are of low resolution and poor quality in certain instances, not diverse enough
DFW [55] | Large-scale dataset, includes challenging scenarios for face recognition | Faces with extreme poses or occlusions underrepresented, limited facial expressions
IARPA Janus Benchmark-C [53] | Includes high-quality data, diverse set of subjects | Data may be noisy or mislabeled, limited variation in facial expressions
CASIA mask [57] | Focuses on occlusion, valuable for studying occluded faces | Occlusion focus limits the dataset’s applicability to face recognition in unconstrained environments
Table 8. Summary of face recognition methods’ strengths and weaknesses.
Algorithm | Strength | Weakness
PCA [14,16,58,59,60] | Works well in low-dimensional spaces and is easy on resources for small, well-aligned face datasets | Difficulties with big datasets and non-frontal faces; sensitive to changes in illumination, expression, and pose
Gabor filter [61,62,63] | Withstands variations in illumination and captures spatial details and face texture at various scales | Complex and resource-intensive, with challenges managing wide pose variations and non-frontal faces
Viola–Jones [64] | Fast, efficient, real-time face detection that is resistant to lighting fluctuations and effective for frontal faces | Low accuracy with non-frontal faces, sensitive to pose changes, and challenged by occlusions and cluttered backgrounds
SVM [65] | Strong generalization, adapts well to classification problems, handles non-linear data using kernels, and is accurate | Computationally costly, sensitive to feature selection, struggles with big datasets, and needs precise parameter tuning
HOG [66] | Excellent at capturing edge and texture details, and resilient to minor changes in pose and lighting | Demands precise tuning of parameters (e.g., cell size, block normalization); less effective under significant pose variations or occlusions
AlexNet [67,68] | High accuracy, handles enormous datasets, and is resistant to pose, lighting, and expression changes | High computing costs, extensive training data needs, and prone to overfitting on limited datasets
VGGNet [69] | High accuracy, deep architecture, and robust performance on complicated datasets with a variety of faces | Computationally costly, requires huge datasets, and may overfit with insufficient data
ResNet [70,71] | Deep architecture improves accuracy, handles complicated features, and reduces vanishing gradient concerns | Computationally complex, requires massive datasets, and can be slow to train and run at inference
FaceNet [72] | Real-time performance, strong feature extraction, high accuracy, and efficacy in large-scale face recognition | Needs substantial computing power and big datasets, and may struggle with significant occlusions or pose changes
LBPNet [73] | Excels in texture categorization, capturing local patterns with strong performance and economical computation | Can incur high computational cost and may underperform on complicated, highly diverse data
LWCNN [74] | Excels at capturing spatial information, using lightweight, effective convolutional layers to increase classification accuracy | Struggles with highly complicated patterns due to its lightweight design and parameter limitations
YOLO [75] | Excels in real-time object detection, providing fast, precise, and efficient results for a range of tasks | Has trouble detecting small objects, maintaining accuracy in busy scenes, and offers limited precision in complicated situations
MTCNN [76] | Demonstrates exceptional proficiency in multi-task face detection, providing high precision in facial alignment and identification | Performs poorly in complicated or occluded face conditions and can struggle to run in real time
DeepMaskNet [77] | Specializes in precise object segmentation and uses deep learning to produce high-quality, accurate mask predictions | Significant computing needs and may struggle to run in real time in complicated scenes
DenseNet [78] | Excels in feature reuse, increasing efficiency by densely connecting layers for efficient information flow | Significant memory usage and computational complexity, which limits scalability to large-scale models or datasets
MobileNetV2 [79] | Lightweight, efficient architecture, providing fast performance with minimal computational and memory costs | May trade accuracy for efficiency; struggles with complicated tasks that require high precision and detail
MobileFaceNets [80] | Excels at real-time face recognition, combining low computational cost and high accuracy | Decreased accuracy under difficult conditions, such as rapid pose changes and occlusions
ViT [81,82] | Uses the Transformer architecture to achieve high accuracy by capturing global context for image recognition | Struggles with smaller datasets or a lack of training data, and demands huge datasets and computing resources
Face Transformer [83,84] | Excels in face recognition, using a Transformer-based architecture to capture context and fine features | Demands substantial processing power and big datasets; has trouble with efficiency and with smaller datasets
DeepFace [85] | Achieves high accuracy even in difficult conditions such as low-resolution inputs or big datasets | High computing cost and complexity, especially with huge datasets or long sequences, which can be a constraint
Attention [86,87] | Excels in end-to-end learning for effective face recognition, precision, and robustness to variation | Requires huge datasets, has trouble with heavy occlusions, and demands substantial computing power during training
Swin [89] | Excellent at capturing hierarchical features and provides strong scalability and accuracy for image processing | Resource-intensive, with high computational complexity; may struggle with tasks that must run in real time
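To make the classical end of Table 8 concrete, the following is an illustrative sketch (not a pipeline taken from any of the cited studies) that pairs PCA-based eigenfaces with an SVM classifier using scikit-learn on an LFW subset; the deep architectures lower in the table would replace the PCA step with learned embeddings:

```python
# Illustrative sketch only: eigenfaces (PCA) + SVM, the classical baseline
# summarized in Table 8. Assumes scikit-learn is installed.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Keep only identities with enough samples so each class can be learned.
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, stratify=faces.target, random_state=0)

# PCA compresses each face to 150 eigenface coefficients; the RBF-kernel SVM
# then classifies identities in that low-dimensional space.
model = make_pipeline(
    PCA(n_components=150, whiten=True, random_state=0),
    SVC(kernel="rbf", class_weight="balanced"))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Whitened PCA keeps the SVM input low-dimensional, which is why this pairing remains a common small-dataset baseline despite the sensitivity to pose and illumination noted in the table.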
Table 9. Summary of the strengths and weaknesses of the performance metrics for face recognition.
Performance Metric | Strength | Weakness
Accuracy | Simple, intuitive, and easy to compute and comprehend. | Does not offer a complete view, particularly on skewed datasets. High accuracy might be deceptive in situations such as facial occlusion or aging.
Precision and Recall | Useful for unbalanced datasets. Helps quantify false positives (precision) and correct identifications (recall). | Precision and recall often trade off against each other. Neither alone provides a fair perspective when both false positives and false negatives must be reduced.
F1-Score | Balances precision and recall. Effective when both false positives and false negatives are important. | Does not discriminate between the relative relevance of precision and recall. In some circumstances, discrepancies may be masked.
Receiver Operating Characteristic (ROC) Curve | Visualizes the trade-off between FAR and FRR. Aids in comparing models across various thresholds. | Does not offer direct insight into absolute performance. AUC-ROC might be deceptive on unbalanced datasets.
Area Under the Curve (AUC) | Provides a single number that summarizes performance. Resistant to skewed datasets. | Does not consider operational thresholds. Model performance may be overestimated for some subgroups.
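As a worked illustration of the metrics in Table 9 (a sketch that assumes a binary same/different verification setting, with hypothetical score and label arrays standing in for real model output), the quantities can be computed directly with scikit-learn:

```python
# Sketch: computing the Table 9 metrics for a hypothetical face-verification run.
# `y_true` holds ground-truth labels (1 = same identity), `scores` holds the
# model's similarity scores; both arrays are illustrative placeholders.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.91, 0.40, 0.78, 0.65, 0.55, 0.12, 0.83, 0.47])
y_pred = (scores >= 0.5).astype(int)        # threshold chosen for illustration

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

# The ROC curve trades false acceptances against true acceptances across all
# thresholds; AUC summarizes it as a single threshold-independent number.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC      :", roc_auc_score(y_true, scores))
```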
Table 10. Summary of CNN architectures used for face recognition systems.
Architecture | Training Set | Year | Authors | Convolutional Layers | Verif. Metric | Accuracy
SVM | ORL | 2003 | Yanhun and Chongqing [114] | - | - | 96%
Gabor Wavelets | ORL | 2001 | Kepenekci [115] | - | - | 95.25%
ESP-CNN + CNN | ORL | 2019 | Talab et al. [96] | - | - | 93.5%
Ensemble CNN | ORL | 2020 | Feng et al. [97] | 4 | Softmax | 88.5%
VGG16 | ORL | 2020 | Lou and Shi [116] | 16 | Center Loss and Softmax | 99.02%
DeepFace | LFW | 2014 | Taigman et al. [48,85] | - | Softmax | 97.35%
FaceNet | LFW | 2015 | Parkhi et al. [48] | - | Triplet Loss | 98.87%
DCR | LFW | 2018 | Lu et al. [94] | - | CM Loss | 93.6%
ResNet | LFW | 2018 | Lu et al. [94] | - | CM Loss | 72.7%
Localized Deep-CNN | LFW | 2018 | Al-Azzawi et al. [93] | - | Softmax | 97.13%
FaceNet | LFW | 2021 | Malakar et al. [103] | - | - | 70–80%
MTCNN | LFW | 2021 | Wu and Zhang [101] | 9 | Triplet Loss and ArcFace Loss | 97.83%
MTCNN + FaceNet | LFW | 2021 | Wu and Zhang [101] | 9 | Triplet Loss and ArcFace Loss | 99.85%
VGG16 | VGGFace | 2015 | Parkhi et al. [85] | - | Triplet Loss | 98.95%
Light-CNN | MS-Celeb-1M | 2015 | Parkhi et al. [48] | - | Softmax | 98.8%
Traditional CNN | CMU-PIE | 2018 | Qu et al. [95] | 5 | Sigmoid | 99.25%
ESPCN + CNN | Yale | 2019 | Talab et al. [96] | - | - | 95.3%
VGG16 | Yale | 2020 | Lou and Shi [116] | 16 | Center Loss and Softmax | 97.62%
LWCNN | Yale Face Database | 2020 | Lin et al. [98] | 9 | Softmax | 96.19%
VGG16 | CASIA | 2020 | Lou and Shi [116] | 16 | Center Loss + Softmax | 98.65%
CASIA Mask | CASIA | 2021 | - | - | Occluded Faces | -
AlexNet | Own dataset | 2021 | Szmurlo and Osowski [100] | 9 | Softmax | 97.8%
AlexNet | Own dataset | 2022 | Mahesh and Ramkumar [117] | 8 | Softmax | 96%
Face Transformer | - | 2022 | Sun and Tzimiropoulos [84] | - | - | 99.83%
YOLO-Face | FDDB | 2021 | Sarahi et al. [102] | - | - | 72.8%
PCA + FaceNet | Yale Face Database B | 2021 | Malakar et al. [103] | - | - | 85–95%
Extended VGG19 | - | 2022 | Marjan et al. [104] | 19 | Softmax | 96%
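Several of the strongest entries in Table 10 (FaceNet, MTCNN + FaceNet, and VGG16 trained on VGGFace) rely on a triplet loss over face embeddings. The NumPy sketch below shows the standard formulation; the 0.2 margin and the toy embeddings are illustrative assumptions, not values reported by the cited studies:

```python
# Sketch of the standard triplet loss used by FaceNet-style models in Table 10.
# anchor/positive/negative are batches of L2-normalized face embeddings; the
# margin value here is illustrative, not taken from the cited papers.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0), averaged over the batch."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # distance to matching face
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # distance to non-matching face
    return np.maximum(pos_dist - neg_dist + margin, 0.0).mean()

# Toy example with random, L2-normalized 128-D embeddings.
rng = np.random.default_rng(0)
def embed(n=4, d=128):
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(triplet_loss(embed(), embed(), embed()))
```

The loss pushes an anchor face closer to a positive (same identity) than to a negative (different identity) by at least the margin, which is why such models report verification accuracy directly on pair-matching benchmarks such as LFW.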