Novel Hybrid Optimization Techniques for Enhanced Generalization and Faster Convergence in Deep Learning Models: The NestYogi Approach to Facial Biometrics

Altaher, Raoof; Koyuncu, Hakan

doi:10.3390/math12182919


by Raoof Altaher ¹,*,† and Hakan Koyuncu ²,†

¹ Electrical and Computer Engineering Department, Altinbas University, Istanbul 34217, Turkey
² Computer Engineering Department, Altinbas University, Istanbul 34217, Turkey
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Mathematics 2024, 12(18), 2919; https://doi.org/10.3390/math12182919

Submission received: 27 June 2024 / Revised: 14 September 2024 / Accepted: 15 September 2024 / Published: 20 September 2024


Abstract:

In the rapidly evolving field of biometric authentication, deep learning has become a cornerstone technology for face detection and recognition tasks. However, traditional optimizers often struggle with challenges such as overfitting, slow convergence, and limited generalization across diverse datasets. To address these issues, this paper introduces NestYogi, a novel hybrid optimization algorithm that integrates the adaptive learning capabilities of the Yogi optimizer, the anticipatory updates of Nesterov momentum, and the generalization power of stochastic weight averaging (SWA). This combination significantly improves both the convergence rate and overall accuracy of deep neural networks, even when trained from scratch. Extensive data augmentation techniques, including noise and blur, were employed to ensure the robustness of the model across diverse conditions. NestYogi was rigorously evaluated on two benchmark datasets, Labeled Faces in the Wild (LFW) and YouTube Faces (YTF), demonstrating superior performance with a detection accuracy reaching 98% and a recognition accuracy up to 98.6%, outperforming existing optimization strategies. These results emphasize NestYogi’s potential to overcome critical challenges in face detection and recognition, offering a robust solution for achieving state-of-the-art performance in real-world applications.

Keywords:

face detection; face recognition; Yogi algorithm; Nesterov momentum; stochastic weight averaging (SWA); triplet loss; biometric authentication

MSC:

68T01; 68T07

1. Introduction

Facial recognition and detection technologies have become integral components in a myriad of applications, ranging from security systems and forensic investigations to personalized user experiences in consumer electronics. The proliferation of digital devices and the increasing demand for biometric authentication have propelled research in this domain to the forefront of computer vision and machine learning [1]. The effectiveness and reliability of these technologies are heavily dependent on the underlying algorithms used to train the models, particularly in how they optimize and generalize from the data.

Deep convolutional neural networks (CNNs) have emerged as the de facto standard for facial biometric tasks due to their exceptional ability to learn hierarchical feature representations from raw pixel data [2]. However, the performance of CNNs is intrinsically linked to the optimization processes employed during training, which adjust the network’s weight parameters to minimize a loss function. Traditional optimization methods, such as stochastic gradient descent (SGD), have been extensively utilized but are hindered by limitations like slow convergence rates and sensitivity to hyperparameter settings [3]. These drawbacks can lead to suboptimal solutions, especially in complex, high-dimensional, and non-convex optimization landscapes typical of deep learning models.

To address these challenges, adaptive optimization algorithms such as Adam [4] and Yogi [5] have been developed. Adam introduces adaptive learning rates for each parameter by computing the first and second moment estimates of gradients, which accelerates convergence and improves performance in some scenarios. Yogi builds upon Adam by providing better control over the learning rate adaptation, particularly in non-convex settings, thus stabilizing the optimization trajectory and preventing issues like overshooting minima or premature convergence.

Despite these advancements, facial biometric models continue to grapple with significant challenges such as overfitting, limited generalization across diverse datasets, and slow convergence in complex environments [6]. These problems are exacerbated in real-world conditions characterized by factors like occlusions, varying lighting conditions, and diverse facial expressions and orientations. In critical applications such as surveillance, authentication, and forensic analysis, even minor inaccuracies can have substantial consequences, highlighting the necessity for optimization methods that not only accelerate convergence but also enhance generalization and robustness [7].

In the broader context of optimization algorithms for neural networks, a variety of methods have been developed, each with distinct characteristics concerning convergence rate and the ability to explore the solution space effectively. Figure 1 presents a conceptual diagram positioning various popular optimization algorithms along the axes of convergence rate and solution space exploration.

Traditional methods like SGD are known for their simplicity and ease of implementation but are often plagued by slow convergence and limited exploration of the solution space, primarily due to their fixed learning rates and lack of momentum [8]. Momentum-based methods, such as SGD with momentum [9] and Nesterov Accelerated Gradient (NAG) [10], introduce momentum terms that help in accelerating convergence by dampening oscillations in the optimization path, thereby moving more swiftly towards minima.

Adaptive learning rate methods, including Adagrad [11], RMSProp [12], and Adam [4], adjust the learning rates dynamically based on historical gradients, allowing for more nuanced updates that can improve convergence rates and facilitate better navigation of the loss landscape. However, these methods can sometimes suffer from issues like vanishing or exploding gradients and may converge prematurely, especially in the presence of saddle points or when dealing with highly non-convex loss surfaces [13].

Yogi [5] enhances Adam by introducing a more stable adaptive learning rate mechanism that prevents the exponential moving averages from becoming too large, thereby offering better performance in non-convex settings. Stochastic weight averaging (SWA) [14] focuses on improving generalization by averaging weights collected at different points along the optimization trajectory, effectively leading the model to converge to flatter minima that are associated with better generalization properties [15].

In this work, we introduce NestYogi, a novel hybrid optimization algorithm that integrates the strengths of Yogi, Nesterov momentum, and SWA. NestYogi aims to accelerate convergence while enhancing the exploration of the solution space, thereby improving both the training efficiency and generalization capability of deep learning models. Figure 2 illustrates the developments in first-order optimization algorithms over the years, including the proposed method.

To empirically evaluate the efficacy of different optimization techniques for non-convex problems such as facial recognition, we conducted experiments on the MNIST digit classification task, a standard benchmark in machine learning. The MNIST dataset consists of 70,000 grayscale images of handwritten digits (0–9), presenting a challenging non-convex optimization problem due to the variations in handwriting styles and digit representations [16].

We utilized two key performance indicators for assessing the optimizers: convergence rate and solution space exploration. The convergence rate measures how quickly an optimizer reduces the loss function during training, which is crucial for computational efficiency and timely deployment of models. Solution space exploration refers to the optimizer’s ability to navigate the loss landscape effectively to avoid local minima and saddle points, thereby ensuring better generalization to unseen data.

A Convolutional Neural Network (CNN) architecture with two convolutional layers, max-pooling, and two fully connected layers was employed. The network was trained over 40 epochs, incorporating batch normalization and dropout layers to enhance generalization and mitigate overfitting. The optimizers evaluated include the following:

Stochastic Gradient Descent (SGD): A baseline optimizer with a fixed learning rate and no adaptive mechanisms.
Adam: An optimizer that computes adaptive learning rates for each parameter based on estimates of first and second moments of the gradients.
Yogi: An optimizer that addresses some of Adam’s shortcomings by controlling the adaptive learning rates to prevent them from becoming too large.
NestYogi (Proposed Optimizer): A hybrid optimizer that combines Yogi’s adaptive learning rates with Nesterov momentum and SWA to improve both convergence rate and solution space exploration.

The experimental results indicate that SGD exhibited the slowest convergence, with noticeable oscillations and a propensity to become stuck in local minima, as shown in Figure 3. Adam achieved faster convergence but occasionally converged prematurely, limiting its ability to explore the solution space thoroughly. Yogi demonstrated improved stability in non-convex settings but did not match the rapid convergence and comprehensive exploration achieved by NestYogi. By integrating Nesterov momentum and SWA, NestYogi effectively accelerated convergence and enhanced the exploration of the loss landscape, leading to better performance on the MNIST classification task.

The conceptual framework and empirical evidence suggest that NestYogi is not only beneficial for facial recognition tasks but can also significantly enhance performance in other deep learning domains that require efficient optimization in high-dimensional, non-convex spaces. These include object detection, semantic segmentation, and autonomous navigation, where optimization challenges are prevalent.

Through extensive experimentation on facial biometric tasks, NestYogi demonstrated significant gains in face detection and recognition accuracy across various architectures, achieving up to 98% in detection tasks and 98.6% in recognition tasks. These results are particularly noteworthy given the complexities involved in training from scratch on large datasets augmented to simulate challenging scenarios such as low-light conditions and occlusions. Furthermore, the optimizer’s scalability and adaptability make it suitable for a wide range of deep learning architectures beyond facial biometrics, including VGG16 [17], ResNet [18], YOLO [19], and RetinaNet [20].

This paper systematically explores the implementation of NestYogi, detailing how the integration of Yogi, Nesterov momentum, and SWA contributes to improved optimization in deep CNN architectures for face detection and recognition applications. The combination of these techniques not only accelerates convergence but also enhances precision and generalization, making it a valuable contribution to the field of machine learning optimization.

Related Work

The domain of face detection and recognition has witnessed substantial advancements with the advent of deep learning techniques. Various methods, algorithms, and optimization strategies have been proposed to improve the accuracy and efficiency of facial biometric systems. Despite the progress, challenges such as computational efficiency, generalization across diverse conditions, and optimization in non-convex landscapes remain areas of active research.

Zhang et al. [21] introduced the Multi-Task Cascaded Convolutional Network (MTCNN) framework, which performs joint face detection and alignment with high accuracy on challenging datasets like WIDER FACE and CelebA. The model employs a cascade of networks to predict face and landmark locations in a coarse-to-fine manner. While achieving impressive accuracy, MTCNN faces limitations in computational efficiency, making it less suitable for real-time applications where speed is critical.

Redmon and Farhadi [22] developed YOLOv3, an object detection algorithm that employs a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation. YOLOv3 achieves high accuracy with fast inference times, making it suitable for real-time applications. However, training such models from scratch requires significant computational resources and their performance can be sensitive to the optimization strategies employed.

Guo et al. [23] proposed the SCRFD architecture, focusing on efficient face detection by leveraging a set of carefully designed receptive field blocks and modules. The model strikes a balance between accuracy and efficiency, making it suitable for deployment on mobile and edge devices. Nonetheless, scaling this approach to extremely large datasets or more complex tasks may present challenges.

In the realm of face recognition, Schroff et al. [24] introduced FaceNet, which learns a mapping from face images to a compact Euclidean space where distances correspond to a measure of face similarity. By using a triplet loss function during training, FaceNet effectively handles variations in facial appearance. However, the model’s generalization across diverse conditions such as varying lighting, poses, and occlusions can still be improved.

Hasan et al. [25] conducted a comprehensive review of recent advances in deep learning for face recognition, highlighting the importance of optimization algorithms in enhancing model performance. They pointed out that while advanced architectures and loss functions contribute to accuracy improvements, optimization strategies play a crucial role in training deep networks effectively.

Despite these developments, there remain critical challenges that need to be addressed. Computational efficiency is a significant concern, particularly for models intended for real-time applications or deployment on resource-constrained devices. Furthermore, existing optimization methods may not adequately handle the complexities of non-convex loss landscapes associated with deep neural networks used in facial biometric tasks. This can lead to issues like slow convergence, overfitting, and limited generalization, especially when training on large, diverse datasets.

Table 1 summarizes the key contributions and limitations of selected works in the field.

In light of these challenges, our proposed NestYogi optimizer aims to enhance the optimization process in deep learning models for facial biometrics. By integrating adaptive learning rates, momentum-based acceleration, and weight averaging techniques, NestYogi addresses the issues of slow convergence, overfitting, and limited generalization. The subsequent sections will detail the NestYogi optimizer, its components, and how it integrates with modern deep learning architectures for face detection and recognition, providing empirical evidence of its effectiveness.

2. Materials and Methods

This section explains the deep learning techniques, optimization methods, and strategies used in our study to address facial recognition and detection problems.

Among optimization strategies, the Adam optimizer is known for its ease of use, wide applicability, and ability to handle large datasets and large numbers of model parameters. Adam combines components of momentum and RMSProp: it maintains exponentially decaying averages of past gradients (first moment) and of past squared gradients (second moment), and uses them to continually adjust the learning rate for each model parameter.

Yogi, an improved version of Adam, offers better convergence in non-convex optimization problems by controlling the learning rate through its second-moment estimate. Equation (1) describes the Yogi update, which slows the rise of the effective learning rate and thereby improves model performance.

\begin{matrix} v_{t} & \leftarrow \beta_{1} v_{t - 1} + (1 - \beta_{1}) g_{t} \\ s_{t} & \leftarrow s_{t - 1} + (1 - \beta_{2}) g_{t}^{2} \odot \operatorname{sign} (g_{t}^{2} - s_{t - 1}) \end{matrix}

(1)
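As a rough illustration of Equation (1), the following minimal sketch applies one Yogi update to a single scalar parameter. The function name `yogi_step`, the hyperparameter defaults, and the final parameter update `w - lr * v / (sqrt(s) + eps)` are assumptions made for illustration; bias correction is omitted, and this is not the authors' implementation.

```python
import math


def yogi_step(w, g, v, s, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-3):
    """One Yogi update for a scalar parameter (illustrative sketch).

    w: parameter, g: gradient, v: first-moment estimate,
    s: second-moment estimate. Defaults are assumed values.
    """
    g2 = g * g
    # First moment: exponential moving average of gradients.
    v = beta1 * v + (1.0 - beta1) * g
    # Second moment: additive update, so s moves toward g^2 by at most
    # (1 - beta2) * g^2 per step instead of being rescaled.
    sign = (g2 > s) - (g2 < s)
    s = s + (1.0 - beta2) * g2 * sign
    # Assumed Adam-style parameter update (no bias correction).
    w = w - lr * v / (math.sqrt(s) + eps)
    return w, v, s


w, v, s = yogi_step(w=1.0, g=2.0, v=0.0, s=0.0)
```

Because the change in `s` is bounded by `(1 - beta2) * g2`, a sudden drop in gradient magnitude lowers the second-moment estimate only gradually, which is what keeps the effective learning rate from rising abruptly.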

The Yogi optimizer builds upon Adam by fine-tuning learning rates for each parameter and handling noisy gradients in non-convex landscapes. This is particularly useful for facial recognition and detection tasks where feature scales vary significantly. Yogi stabilizes the optimization by ensuring that parameters with large gradients receive smaller learning rates while parameters with smaller gradients receive larger learning rates, allowing more efficient exploration of the loss landscape.

Proposed Optimizer

In this work, we introduce the NestYogi optimizer, a novel hybrid optimization algorithm that enhances the Yogi optimizer by combining several optimization strategies. NestYogi builds on Yogi, which improves consistency during training by adjusting the learning rate via a bias-corrected estimate of the second moment. Two components are added to this foundation: Nesterov momentum, which supplies a look-ahead gradient that speeds up convergence while avoiding oscillations in complex loss landscapes, and stochastic weight averaging (SWA), which improves generalization. Each of these components plays a distinct role in enhancing the stability, convergence rate, and accuracy of deep learning models. Equation (2) [26] gives the bias-corrected first-moment estimate used when incorporating Nesterov momentum into Yogi’s algorithm to obtain a stable and efficient learning trajectory.

\begin{matrix} \hat{m} = \frac{m_{t}}{1 - β_{1}^{t}} \end{matrix}

(2)

Nesterov momentum plays a crucial role in accelerating the convergence of the optimizer by looking ahead at future gradients. This allows the optimizer to anticipate changes in the loss landscape and adjust accordingly, leading to faster convergence while reducing oscillations in complex regions of the landscape. This is particularly important in non-convex problems like face detection, where the optimizer needs to avoid getting stuck in suboptimal minima.
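The look-ahead idea can be sketched with the classical Nesterov accelerated gradient update on a one-dimensional quadratic. This is the textbook formulation, shown only to illustrate where the gradient is evaluated; it is not the full NestYogi update, which the text does not spell out. The function names, the learning rate, and the test objective are assumptions; the momentum value 0.9 matches the setting reported in Section 3.2.

```python
def nesterov_step(w, velocity, grad_fn, lr=0.1, momentum=0.9):
    """One classical Nesterov momentum step (illustrative sketch)."""
    # Evaluate the gradient at the look-ahead point, not at w itself.
    lookahead = w + momentum * velocity
    g = grad_fn(lookahead)
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity


def minimize(grad_fn, w0, steps=200):
    """Run repeated Nesterov steps from w0 with zero initial velocity."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        w, velocity = nesterov_step(w, velocity, grad_fn)
    return w


# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_final = minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Evaluating the gradient at `lookahead` lets the step react to the slope the parameter is about to encounter, which is the mechanism behind the reduced oscillation described above.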

Additionally, stochastic weight averaging (SWA) is integrated into NestYogi to improve generalization. SWA works by averaging the model weights over several epochs, which tends to locate solutions within flatter regions of the loss landscape. These regions are associated with better generalization properties, as they represent optima that are less sensitive to minor perturbations in weights. By reducing the variability in parameter values across the loss landscape, SWA helps achieve more stable and robust solutions, leading to improved generalization performance (illustrated in Figure 4, with the detailed workflow in Algorithm 1). Equation (3) shows how SWA updates the model weights over time.

w_{swa} \leftarrow \frac{w_{swa} \times \text{NumSwaUpdates} + \text{CurrentModelWeights}}{\text{NumSwaUpdates} + 1}

(3)

Algorithm 1 Stochastic weight averaging (SWA).

Initialize model parameters; set $w_{swa}$ to the current model weights at epoch S
Training phase:
for epoch in range $num_{epochs}$ do
    for batch in $Data_{train}$ do
        Compute gradients and update model using the optimizer
        optimizer.step()
    end for
    if epoch ≥ S then
        Update $w_{swa}$ with current model weights:
        $w_{swa} \leftarrow \frac{w_{swa} \times \text{NumSwaUpdates} + \text{CurrentModelWeights}}{\text{NumSwaUpdates} + 1}$
    end if
end for
Weight averaging phase:
for epoch in range $swa_{epochs}$ do
    for batch in $Data_{train}$ do
        Compute gradients and update model using the optimizer
        optimizer.step()
    end for
    Update SWA weights with a running average:
    $w_{swa} \leftarrow \alpha \times w_{swa} + (1 - \alpha) \times \text{CurrentModelWeights}$
end for
return $w_{swa}$
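The running-mean update of Equation (3) can be sketched on plain lists of floats. The function name `swa_update` and the snapshot values are hypothetical; in practice the weights would be the tensors of a trained network.

```python
def swa_update(w_swa, w_current, num_updates):
    """Fold one more weight snapshot into the SWA running mean (Equation (3))."""
    return [
        (avg * num_updates + cur) / (num_updates + 1)
        for avg, cur in zip(w_swa, w_current)
    ]


# Hypothetical weight snapshots taken at the end of three epochs.
snapshots = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

w_swa, n = list(snapshots[0]), 1
for weights in snapshots[1:]:
    w_swa = swa_update(w_swa, weights, n)
    n += 1
```

After the loop, `w_swa` equals the element-wise arithmetic mean of the snapshots, so the incremental form needs only one extra copy of the weights rather than the full history.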

The combined effect of these components—Yogi for adaptive learning, Nesterov momentum for acceleration, and SWA for generalization—creates a highly robust optimization strategy. This method allows the optimizer to efficiently handle both the instability of noisy gradients and the complexity of non-convex loss landscapes, common in face recognition and detection tasks.

Furthermore, L1 and L2 regularization are applied to minimize overfitting by penalizing large weights. These regularization terms encourage sparsity and prevent the model from becoming too complex, reducing the risk of overfitting to the training data. Additionally, gradient clipping is used to prevent any large updates to the weights, ensuring stability in the optimization process. Overall, the combination of these elements efficiently addresses the optimization challenges present in deep learning models, and this will be evaluated on several face detection and recognition tasks using well-established deep learning architectures and benchmark datasets.
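A minimal sketch of how L1/L2 penalties and gradient clipping could enter a gradient before the optimizer step is given below. The text does not state the penalty coefficients or the clipping rule, so the L1 coefficient, the use of global-norm clipping, and the threshold `max_norm` are assumptions; the L2 default mirrors the weight decay of 0.0005 reported for VGG16.

```python
import math


def regularized_grad(weights, grads, l1=1e-5, l2=5e-4):
    """Add the gradient of l1 * |w| + (l2 / 2) * w^2 to each data gradient."""
    def sign(x):
        return (x > 0) - (x < 0)

    return [g + l1 * sign(w) + l2 * w for w, g in zip(weights, grads)]


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = regularized_grad([1.0, -2.0], [3.0, 4.0])
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping by the global norm preserves the direction of the update and limits only its length, which is one common way to keep a single large gradient from destabilizing training.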

3. Experimental Section

3.1. Training Configuration

The proposed optimization was applied to three well-known architectures for the face detection experiments: VGG16, RetinaNet, and YOLOv9 [27]. The experiments leveraged the CelebA dataset, which includes 202,599 images split into three sets: training (70%), validation (15%), and testing (15%) [28], as detailed in Table 2. To ensure model robustness, the images were preprocessed with normalization and augmentation techniques, such as random cropping, horizontal flipping, brightness and hue adjustments, blurring, and addition of salt-and-pepper noise to 1.3% of the pixels. These preprocessing steps simulated real-world variations in lighting, occlusion, and position.
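The salt-and-pepper augmentation mentioned above can be sketched as follows for a grayscale image stored as a list of rows. The function name, the 0 to 255 pixel range, and the equal split between salt and pepper are assumptions; only the 1.3% fraction comes from the text.

```python
import random


def salt_and_pepper(image, fraction=0.013, seed=None):
    """Return a copy of image with a fraction of pixels set to 0 or 255.

    image: list of rows of grayscale integers in 0..255 (assumed layout).
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]
    n_noisy = round(fraction * height * width)
    # Choose distinct pixel positions, then set each to pepper (0) or salt (255).
    for idx in rng.sample(range(height * width), n_noisy):
        noisy[idx // width][idx % width] = rng.choice((0, 255))
    return noisy


image = [[128] * 100 for _ in range(100)]
noisy = salt_and_pepper(image, seed=0)
```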

VGG16: Customized with a batch size of 32 and an initial learning rate of 0.001. To mitigate overfitting, a weight decay of 0.0005 was used. Training was carried out for 100 epochs, with a 0.1 reduction in learning rate every 15 epochs. Stochastic weight averaging (SWA) was initiated from epoch 75.
RetinaNet: This model was trained with a batch size of 32 and an initial learning rate of 0.001, using ResNet50 as the backbone. Focal loss was used to address class imbalance, which is especially important for detecting small or occluded faces. The model was trained over 100 epochs with learning rate reductions every 15 epochs and SWA starting from epoch 75, similar to VGG16.
YOLOv9: YOLOv9 was trained for 100 epochs using a batch size of 32 and an initial learning rate of 0.01, with SWA starting at epoch 75. A multi-scale training approach was used to improve face detection accuracy across various scales and conditions.
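The detection training schedule can be sketched as a step decay plus an SWA start epoch. The sketch assumes that the reported reduction means multiplying the learning rate by 0.1 every 15 epochs and that epochs are counted from zero; neither detail is stated explicitly, and the function names are hypothetical.

```python
def step_lr(epoch, base_lr=0.001, gamma=0.1, step=15):
    """Learning rate at a given epoch under step decay (assumed interpretation)."""
    return base_lr * gamma ** (epoch // step)


def swa_active(epoch, swa_start=75):
    """Whether SWA averaging is running at this epoch (zero-based, assumed)."""
    return epoch >= swa_start


# Learning rate and SWA status at a few epochs of a 100-epoch run.
schedule = [(e, step_lr(e), swa_active(e)) for e in (0, 15, 30, 75, 99)]
```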

For face recognition tasks, three different backbones (ResNet50, VGG16, and InceptionV3) were employed in Siamese networks. The CASIA WebFace dataset, consisting of 10,575 individuals and 494,414 images [29] as detailed in Table 2, was split into 80% for training and 20% for validation to ensure ample data for accurate evaluation.

VGG16: Customized with a batch size of 64 and a learning rate of 0.0001. The triplet loss function was used with a margin of 0.2 to separate embeddings of identical and different identities. The learning rate was adjusted to 0.001 at epoch 200, and training continued for 500 epochs with SWA initialized at epoch 400.
ResNet50: Configured with a batch size of 64 and a starting learning rate of 0.0001. Training was conducted similarly to VGG16, using the triplet loss function to optimize feature differentiation over 500 epochs.
InceptionV3 [30]: Trained with a learning rate of 0.0001 and a batch size of 64, using gradient accumulation to stabilize training due to its complex structure. The model was trained for 500 epochs, with checkpoints saved at each epoch to prevent overfitting.

Bounding boxes were used during the training process to identify and locate faces in images for detection tasks. For face recognition, the triplet loss function was used to distinguish between anchor, positive, and negative samples (same identity vs. different identity). This loss function, as shown in Equations (4)–(6), played a crucial role in learning distance metrics that distinguish between similar and distinct faces.

\| f(A) - f(P) \|^{2}

(4)

\| f(A) - f(N) \|^{2}

(5)

L(A, P, N) = \max \left( \| f(A) - f(P) \|^{2} - \| f(A) - f(N) \|^{2} + \text{margin}, 0 \right)

(6)
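Equations (4)–(6) can be sketched on embedding vectors represented as plain lists. The embedding values below are made up for illustration; the margin of 0.2 is the value reported for the recognition experiments.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings (Equations (4) and (5))."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss of Equation (6) for one (anchor, positive, negative) triple."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


# Hypothetical 2-D embeddings: an easy triplet and a hard one.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])
```

The loss is zero for the easy triplet because the negative is already farther from the anchor than the positive by more than the margin; only the hard triplet contributes a gradient.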

Figure 5 illustrates the triplet loss architecture used to maximize recognition by distinguishing between facial feature similarities and differences.

3.2. Hardware and Scalability Considerations

The experiments were performed on an NVIDIA A100 GPU, which provides efficient training for large-scale datasets and complex models. Nesterov momentum was set to 0.9 in the NestYogi configuration, which improved the optimizer’s ability to predict gradient directions and stabilize convergence. SWA was introduced after 75% of the training, averaging weights over the final 25% of epochs to enhance generalization.

Given that real-world applications often demand both high accuracy and computational efficiency, the scalability of NestYogi was tested across different architectures and datasets. Future work will further explore the potential of NestYogi in edge AI and resource-constrained environments, where smaller datasets and limited hardware may challenge the performance of deep learning models.

The study was designed to evaluate the effect of the proposed methods on convergence rate and accuracy across a range of tasks. To ensure that the gains were not limited to the training phase, particular emphasis was placed on improving the models’ performance on validation and test datasets, which is one of the research objectives. NestYogi prioritizes generalization to produce more reliable models that operate well on unseen data, an essential requirement in real-world applications.

3.3. Results

The proposed methods produced notable gains in both face detection and recognition tasks. For face detection, they improved key performance metrics, including accuracy, recall, mean average precision (mAP), and intersection over union (IoU); Table 3 details the training and validation experiments for the detection phase. On the VGG16 architecture, the proposed method improved detection precision by around 3%, from a baseline of 91.8% (with Yogi + SWA) to 93.0% (training) and 90.5% (validation). The mAP also increased and the total loss decreased. The experimental results are shown in Figure 6a–d.

Following the integration of NestYogi, RetinaNet’s accuracy rose by 2–3%. In the training phase, it increased from 93.6% to 95.0%; more importantly, validation accuracy improved from 91.9% to 93.2%. The gains reflect the optimizer’s proficiency in handling tiny and partly obscured faces. Three components contribute to this improvement: the adaptive learning rate, which is adjusted dynamically for each model parameter; stochastic weight averaging (SWA), which draws on a larger pool of weights over a few epochs and thereby locates a wider and more stable minimum in the loss landscape; and Nesterov momentum, which provides look-ahead gradient acceleration and helps prevent oscillations.

A notable improvement in processing performance was found with YOLOv9: inference time fell by approximately 7–10% while a high degree of accuracy was maintained. This gain is especially useful in real-time applications where accuracy is crucial in addition to speed. During training, YOLOv9’s accuracy increased from 94.5% to 95.9%, and during validation it increased from 90.5% to 92.7%, while handling dynamic changes in object motion and scale. YOLOv9 with NestYogi is therefore well suited to cases where accurate and timely face identification is required.

Table 4, Figure 7a,b, and Figure 8a–c present detailed training and validation results for most of the backbones used. The optimizer also improves different backbones in the face recognition tests. With the VGG16 backbone, it raised training accuracy from 87.9% to 89.5% and, more importantly, validation accuracy from 85.8% to 87.0%, an estimated 2% gain in accuracy. This is important for face recognition, where precise individual recognition is essential, especially in congested settings and when training from scratch. Dynamic learning rate adaptation and anticipatory updates let the model adjust its parameters in ways that produce better decision boundaries and reduce classification errors.

The proposed methods improved generalization to unseen data with ResNet50, resulting in a 3% boost in precision over the traditional architecture. Here, the SWA element works best, since it enables the model to find a wider and more stable minimum in the loss landscape. As a result, the model performs better on unseen data and across a range of face recognition settings. InceptionV3, another backbone evaluated with NestYogi, achieved a 15–20% reduction in training time while retaining its ability to adapt to changes in face orientation and size. Its faster convergence at maintained accuracy demonstrates the optimizer’s effectiveness in handling complex structures, which matters where hardware and software limitations are present and prompt, stable training is required.

The proposed methods are sophisticated optimization approaches incorporated into convolutional neural networks (CNNs) to enhance their performance and generalization in face detection and recognition. These approaches improve measures such as accuracy, recall, mAP, and IoU across different architectures by tackling important issues such as training stability and convergence rate. Adaptive learning rates let the algorithm adapt dynamically during training, which helps to prevent overfitting and enhances performance on unseen data. These developments show how well-constructed optimization algorithms can enhance CNNs in a variety of applications, such as public safety, surveillance, and mobile biometrics. By optimizing models to handle changes in dataset complexity and size, this method establishes a new benchmark for robust and flexible intelligent video analytics systems.

4. Evaluation

To crop faces from the pictures in the evaluation datasets, the most accurate face detection model, YOLOv9, was utilized for the evaluation process. Preprocessing is essential because it makes sure that correctly identified faces serve as a foundation for the subsequent face recognition process, which optimizes the overall precision and reliability of the recognition results. Following that, feature embeddings were extracted from these cropped faces by the ResNet50 architecture as an embedding model, which was shown to be the best face recognition backbone among the tested networks. Two popular public datasets, YouTube Faces (YTF) [31] and Labeled Faces in the Wild (LFW) [32], were used for the thorough examination. By ensuring that the model’s performance is evaluated over different data types and sizes and examining the proposed method’s flexibility, this dual-dataset evaluation supports the findings and enables reproducibility.

4.1. Evaluation Metrics

The following metrics are used to examine the models’ performance in more detail. Precision measures the proportion of instances predicted as positive that are actually positive, as calculated in Equation (7).

\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}

The proportion of accurately detected positive instances out of all real positive results is measured by recall (true positive rate or sensitivity). The equation for recall is Equation (8):

\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}

The harmonic mean of precision and recall is the F1 Score, shown in Equation (9). To provide a single score that takes into account both false positives and false negatives, it balances both metrics.

F1\ \mathrm{Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}

Accuracy determines the percentage of both positive and negative instances that are successfully detected out of all instances, as represented in Equation (10):

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}

A general performance indicator called the Matthews correlation coefficient (MCC) is used to gauge the quality of binary classification models, based on four elements (TP, TN, FP, and FN) [33]. Compared to measurements like accuracy, MCC can offer a fairer evaluation, especially when the classes (face/non-face in detection, correct/incorrect recognition) might be unbalanced. This is calculated using Equation (11):

\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11}

Together, these metrics provide an in-depth evaluation of the models’ performance in tasks that combine face detection and recognition. By applying them to the results of the LFW and YTF evaluations, we aim to give an informative quantitative assessment of the proposed methods.
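As an illustration, Equations (7)–(11) can be computed directly from the four confusion-matrix counts. The sketch below is a minimal, assumed implementation: the function name `classification_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are illustrative choices and are not taken from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Equations (7)-(11) from confusion-matrix counts.

    Returning 0.0 for a zero denominator is an assumed convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # MCC denominator: product of the four marginal sums, Equation (11).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical counts, chosen only to exercise the formulas.
m = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(m)
```

Unlike accuracy, the MCC term uses all four counts in both numerator and denominator, which is why it is less affected by class imbalance.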

4.2. Face Detection Evaluation

To verify the YOLOv9 model’s effectiveness for face detection, the evaluation covered several key metrics. First, we computed the intersection over union (IoU), a standard measure of localization accuracy for object detection models. YOLOv9 achieved an average IoU of 98% on the LFW dataset, showing that it can precisely locate faces across a diverse and complex dataset. On the YTF dataset, which consists of video footage with rapid changes in lighting and facial expression, the model obtained an IoU score of 95.7%, which demonstrates its ability to handle challenging real-world conditions.
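For reference, IoU for a single predicted and ground-truth box can be sketched as follows. This is a minimal example that assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the box format and the function name `iou` are assumptions, since the coordinate convention used in the study is not stated here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25).
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(overlap, 4))
```

An average IoU for a dataset would then be the mean of this value over matched prediction and ground-truth pairs.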

The precision and recall metrics provide insight into the model’s ability to detect faces accurately while limiting false positives and false negatives. On the LFW dataset, precision and recall were 94.1% and 92.7%, respectively. On the YTF dataset, precision and recall were slightly lower, at 93.5% and 91.9%, which is still good performance considering the extra complexity of video data.

The F1 score, a single quality measure that balances precision and recall, was 93.4% on the LFW dataset and 92.7% on the YTF dataset. These findings show that the YOLOv9 model detected faces consistently and efficiently, even under the challenges presented by the video-based YTF dataset.
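As a consistency check, the reported F1 scores follow from Equation (9) when the figures 94.1% and 93.5% are read as precision and 92.7% and 91.9% as recall:

```latex
F1_{\mathrm{LFW}} = 2 \times \frac{0.941 \times 0.927}{0.941 + 0.927}
                  = \frac{1.744614}{1.868} \approx 0.934
\qquad
F1_{\mathrm{YTF}} = 2 \times \frac{0.935 \times 0.919}{0.935 + 0.919}
                  = \frac{1.718530}{1.854} \approx 0.927
```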

With an IoU threshold of 0.5, mean average precision (mAP), a common metric for object detectors, was also calculated. Our model achieved mAP values of 96.0% on the LFW dataset and 95.2% on the YTF dataset.

For detection, MCC treats the task as a binary classification of face (positive class) versus non-face (negative class). There are four types of outcomes: true positives, where a face is correctly detected; false positives, where a non-face region is identified as a face; false negatives, where a face fails to be detected; and true negatives, where a non-face region is correctly identified as such. Since all images in LFW contain at least one face, there are no true negatives in this case, so we measured the MCC only for the YTF dataset, where it was 0.894.

Integrating the optimization elements described above during the training phase contributes to the YOLOv9 model’s strong performance on these metrics. The optimizer’s adaptive learning rate strategy allowed the model to converge faster to a good solution, even with noisy or challenging input. Because the optimizer adapts to the changing gradients of individual parameters during training, it is particularly helpful when the visual content changes quickly, as in the YTF dataset. Figure 9 presents some of the model’s predictions and confidence scores from both evaluation datasets, and the results of the evaluation phase are shown in Table 5.

4.3. Face Recognition Evaluation

The same criteria were used to evaluate the ResNet50 model for face recognition. The model achieved a recognition accuracy of 98.61% on the LFW dataset, which shows its ability to correctly identify individuals from static images, and a lower accuracy of 96.5% on the YTF dataset.

To evaluate the model’s performance further, we applied the Euclidean distance and cosine similarity measures, which are especially useful for assessing how well face embeddings separate different people. These measures, described in Equations (12)–(14), are only as good as the embeddings produced by the trained model [34,35].

CS_{(u, v)} = \frac{u \cdot v}{\|u\| \cdot \|v\|} \tag{12}

CD_{(u, v)} = 1 - CS_{(u, v)} \tag{13}

D_{(A, B)} = \sqrt{\sum_{i = 1}^{n} (a_{i} - b_{i})^{2}} \tag{14}
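Equations (12)–(14) translate directly into code. The sketch below uses plain Python lists as embedding vectors; the function names and the toy vectors are illustrative assumptions, and the sketch assumes non-zero vectors of equal length.

```python
import math

def cosine_similarity(u, v):
    """Equation (12): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Equation (13): one minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def euclidean_distance(a, b):
    """Equation (14): square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D vectors standing in for face embeddings.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 3-4-5 triangle
```

Cosine similarity depends only on the angle between embeddings, whereas Euclidean distance also depends on their magnitudes, so the two measures can rank pairs differently unless the embeddings are normalized.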

In Algorithm 2, we present the workflow of the evaluation process.

Algorithm 2 Model Evaluation.

1: Start
2: Load dataset
3: Load face detection model (YOLOv9) and face recognition model (ResNet50-based encoder)
4: Initialize dictionary for storing encodings
5: for each image in dataset do
6:     Detect face using detection model
7:     if face is detected then
8:         Crop the face from the image using bounding box coordinates
9:         Encode the cropped face using ResNet50-based encoder
10:         Store the encoding in the dictionary, along with the true label (person’s identity)
11:     end if
12: end for
13: Initialize counter K for correctly predicted identities
14: for each person in the dictionary do
15:     Calculate cosine and Euclidean distances with every other person
16:     Find the person with the smallest distance (most similar person)
17:     if the person is most similar to themselves then
18:         Increment the correct prediction counter K
19:     end if
20: end for
21: Calculate metrics
22: End
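The workflow of Algorithm 2 can be sketched in Python as below. This is one possible reading, not the authors’ implementation: `detect_face` and `encode_face` are hypothetical callables standing in for the YOLOv9 detector (including the crop step) and the ResNet50-based encoder, and steps 15–17 are interpreted as a nearest-neighbor check in which a sample counts as correct when its closest other sample carries the same identity. Only cosine distance is used to keep the sketch short.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def evaluate(dataset, detect_face, encode_face):
    """dataset yields (image, identity); returns nearest-neighbor accuracy.

    detect_face(image) -> cropped face or None   (hypothetical stand-in)
    encode_face(face)  -> list of floats         (hypothetical stand-in)
    """
    encodings = []  # steps 4-12: (identity, embedding) for each detected face
    for image, identity in dataset:
        face = detect_face(image)
        if face is None:
            continue  # no face detected, skip this image
        encodings.append((identity, encode_face(face)))

    correct = 0      # step 13: counter K
    evaluated = 0
    for i, (identity, emb) in enumerate(encodings):
        # Steps 15-16: distance to every other stored encoding.
        others = [(cosine_distance(emb, e), ident)
                  for j, (ident, e) in enumerate(encodings) if j != i]
        if not others:
            continue
        _, nearest_identity = min(others, key=lambda pair: pair[0])
        evaluated += 1
        if nearest_identity == identity:  # steps 17-18
            correct += 1
    return correct / evaluated if evaluated else 0.0

# Toy data: the "images" are already 2-D vectors; None means no face found.
toy = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"),
       ([0.0, 1.0], "b"), ([0.1, 0.9], "b"), (None, "c")]
acc = evaluate(toy, detect_face=lambda img: img, encode_face=lambda f: f)
print(acc)
```

Passing the detector and encoder as arguments keeps the evaluation loop independent of any particular model library.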

With an average cosine similarity score of 0.96 and a Euclidean distance of 0.25 on the LFW dataset, the model showed a high level of identity differentiation. On the YTF dataset, the cosine similarity score was 0.94 and the Euclidean distance was 0.27.

The precision, recall, and F1 score metrics present a more detailed picture of the model’s performance. On the LFW dataset, the precision, recall, and F1 score are 97.79%, 97.58%, and 97.68%, respectively. These indicators show a high degree of accuracy in distinguishing between individuals. On the YTF dataset, the precision, recall, and F1 score are 96.25%, 96.07%, and 96.16%, respectively. MCC can also be applied to face recognition by treating it as a matching problem (i.e., whether a given face matches a known individual or not). True positives in this case are correct identifications, while false positives are incorrect identifications. The MCC for matching individuals is 0.977 on the LFW dataset and 0.896 on YTF, which indicates how reliably the model separates correct from incorrect matches. The evaluation results for face recognition are detailed in Table 6.

These findings again highlight that the model maintains a very high level of performance across different datasets. The complete pipeline, which combines detection, recognition, and prediction from both models, is shown in Figure 10.

These results were made possible by the optimizer’s capabilities, which include improving the learning process, preventing overfitting, and accelerating convergence. By thoroughly evaluating the suggested models on the public LFW and YTF datasets, which are commonly used to evaluate face detection and recognition models, this study shows that the models are useful in real-world scenarios.

5. Discussion and Limitations

The experimental results demonstrate that NestYogi significantly enhances both the speed and accuracy of face detection and recognition tasks. It blends the Yogi optimizer’s adaptive learning rate mechanism with Nesterov momentum and the stabilizing effect of stochastic weight averaging. Across a range of advanced deep learning architectures, this combination proved crucial for improving convergence rates and model validity, ensuring consistent performance in the training and evaluation phases.

Yogi’s adaptive learning rate ensures that model parameters with high variance gradients are updated more cautiously, thus stabilizing training. Nesterov momentum accelerates convergence by predicting future gradients, allowing the optimizer to avoid oscillations in complex loss landscapes. SWA contributes to better generalization by averaging weights over multiple epochs, ensuring that the model converges to flatter regions of the loss landscape that are less sensitive to noisy gradients.
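To make the interaction of the first two components concrete, the sketch below shows one possible per-parameter update that combines Yogi’s sign-controlled second moment with a Nesterov look-ahead on the bias-corrected first moment. It is an illustrative sketch under stated assumptions, not the paper’s exact update rule: the NAdam-style form of the look-ahead, the Adam-style bias correction of the second moment, the hyperparameter values, and the name `nestyogi_step` are all assumptions. SWA is omitted here because it operates on weight snapshots rather than on individual steps.

```python
import math

def nestyogi_step(theta, grad, m, v, t,
                  lr=0.1, beta1=0.9, beta2=0.999, eps=1e-3):
    """One NestYogi-style update for a list of parameters (t starts at 1)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        g2 = g * g
        # Yogi: additive second-moment update steered by sign(v - g^2).
        sign = (v_i > g2) - (v_i < g2)  # -1, 0, or 1
        v_i = v_i - (1.0 - beta2) * sign * g2
        # Exponential moving average of the gradient (first moment).
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Nesterov look-ahead on the bias-corrected first moment (assumed form).
        m_hat = (beta1 * m_i / (1.0 - beta1 ** (t + 1))
                 + (1.0 - beta1) * g / (1.0 - beta1 ** t))
        v_hat = v_i / (1.0 - beta2 ** t)  # assumed Adam-style bias correction
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
theta, m, v = [5.0], [0.0], [0.0]
for t in range(1, 501):
    grad = [2.0 * x for x in theta]
    theta, m, v = nestyogi_step(theta, grad, m, v, t)
final_x = theta[0]
print(final_x)
```

Because the second moment changes by at most a fraction of the squared gradient per step, a single large gradient cannot abruptly shrink or inflate the effective learning rate, which is the cautious behavior described above.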

Real-world deployments often encounter obstacles such as changing lighting conditions, awkward poses, and occlusion. Adaptability is especially important for systems deployed in dynamic situations where conditions may change unexpectedly. Rapid changes in data distribution, such as changing lighting or occlusion, are a common challenge for standard optimizers like Adam and SGD. In contrast, by using Yogi’s sign-based update in the second moment and a Nesterov-adjusted gradient in the bias-corrected first moment, NestYogi adjusts its learning rate for each parameter dynamically in response to current gradients and momentum. This enables the model to react to such changes faster, which supports a better solution search and greater accuracy in unpredictable situations.

By smoothing the training process and reducing overfitting, the combination of SWA and adaptive learning rate adjustments reduced the need for extended training periods. Because of the higher convergence rate, this efficiency not only minimizes computing overhead but also lowers energy consumption, a critical feature in cases where resource efficiency is crucial. It allows for a short training time and a low number of training epochs while still benefiting the validation and evaluation phases.

These enhancements are supported by the combined techniques’ demonstrated generalization ability on unseen datasets, such as YouTube Faces (YTF) and Labeled Faces in the Wild (LFW). The combination of SWA and Nesterov momentum plays a crucial role in this generalization. SWA steers the model weights toward a broader, flatter region of the loss landscape, which is critical for maintaining stability across different datasets, while Nesterov momentum helps the model adjust rapidly to new data distributions, making NestYogi particularly effective on diverse datasets like LFW and YTF. Thanks to this improved generalization, the systems have the potential to operate well not only in controlled testing conditions but also in more unexpected and variable real-world scenarios.
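The averaging step of SWA can be sketched as an equal-weight running mean over weight snapshots collected during training. The sketch below treats each snapshot as a flat list of floats; the function name `swa_update` and the snapshot values are illustrative assumptions, and the averaging schedule used in the study is not reproduced here.

```python
def swa_update(swa_weights, weights, n_averaged):
    """Fold one more weight snapshot into the running SWA average.

    n_averaged is the number of snapshots already included in swa_weights.
    """
    return [(s * n_averaged + w) / (n_averaged + 1)
            for s, w in zip(swa_weights, weights)]

# Hypothetical snapshots taken at the end of three epochs.
snapshots = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]
swa = list(snapshots[0])
n = 1
for w in snapshots[1:]:
    swa = swa_update(swa, w, n)
    n += 1
print(swa)
```

The running form avoids storing every snapshot: only the current average and a counter are kept, so the memory overhead is one extra copy of the weights.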

However, challenges may arise if the suggested approach is not implemented carefully. A significant limitation is the dependency on high-quality, representative training data: NestYogi’s performance can be hindered if the data are imbalanced or do not sufficiently cover the variations present in real-world scenarios. This dependence highlights the need for thorough and representative data collection procedures if the optimizer is to reach its full potential. Additionally, the optimizer combines several techniques, each with its own hyperparameters, and these must be tuned together. This added complexity makes the training process somewhat harder and raises the learning curve, especially for practitioners who are less familiar with such advanced techniques.

Auto-tuning could significantly reduce the optimizer’s sensitivity to hyperparameters. Future iterations of the NestYogi optimizer could incorporate automatic hyperparameter tuning mechanisms, allowing the optimizer to adjust itself dynamically based on model performance. This would reduce the need for manual intervention and make it more accessible to practitioners with less experience in hyperparameter optimization. Such developments would further broaden the application of NestYogi in real-time environments, improving both its efficiency and usability. At present, its limited ability to adapt automatically to different operating environments could restrict its applicability in some scenarios. Future research should focus on enabling the optimizer to adjust its parameters automatically in response to changes in the behavior of the model and real-time data inputs.

It would also be beneficial to investigate how the suggested approaches can be integrated with other, more efficient architectures and applied in other domains. Beyond facial recognition, NestYogi could be adapted for tasks like semantic segmentation, where precise boundary detection is critical. Its dynamic learning adjustments and momentum-based convergence could also be useful for real-time object detection systems in resource-constrained environments, such as edge AI applications. These extensions would further test the robustness and flexibility of the NestYogi optimizer across multiple domains.

In light of these findings, future work will focus on creating more self-adjusting and flexible iterations of the NestYogi optimizer that can dynamically update each parameter in response to differences in the properties of the data and the behavior of the model. Additional work will investigate how the optimizer could be integrated into more recent and efficient architectures and whether it is useful in other fields like semantic segmentation and object detection. Such efforts would help confirm, and possibly extend, the value of our findings for the development of machine learning optimization algorithms. Table 7 shows a comparison of the methods most used in face detection and recognition studies.

6. Conclusions

The NestYogi optimizer introduced in this study represents a substantive advancement in the optimization of deep learning models tailored for facial biometric applications. The novel contribution of this work lies in the unique integration of Yogi, Nesterov momentum, and SWA, which together address the common limitations faced by traditional optimizers like Adam and SGD. This hybrid approach not only accelerates convergence but also improves generalization, making it particularly effective for complex facial biometric tasks involving occlusion, varying lighting, and pose variations.

By integrating the adaptive learning capabilities of the Yogi optimizer with the anticipatory momentum of Nesterov and the stabilization effects of stochastic weight averaging (SWA), NestYogi has demonstrably enhanced the convergence rate and accuracy of neural networks. This makes NestYogi especially suited for real-world applications, such as security surveillance, forensic analysis, and biometric authentication systems, where fast and accurate facial recognition is crucial despite changing environmental conditions. The ability to adapt dynamically during training ensures robust performance even when confronted with fluctuating factors like lighting and occlusion. This is particularly significant in facial recognition and detection tasks, where variability in data—such as changes in illumination, pose, and occlusion—presents persistent challenges.

By dynamically altering the learning rates in accordance with the moving averages of the first and second moments, the adaptive learning rate mechanism addresses the step-size selection problem in gradient descent. This allows the optimizer to adjust step sizes accurately, improving convergence in the non-convex loss landscapes that frequently appear in complex systems. Nesterov momentum is applied to further speed up progress by giving the optimizer a lookahead gradient. This leads to more efficient training and helps prevent premature convergence to suboptimal solutions.

The incorporation of SWA makes the models more broadly applicable. Flatter minima in the loss landscape are known to correlate with better generalization performance, and SWA helps the model find them by averaging the weights across numerous epochs. This combination has been shown to be very effective at mitigating the overfitting commonly seen in face biometric model development. Furthermore, the scalability and flexibility of the NestYogi optimizer open avenues for its application beyond facial recognition, such as in object detection and semantic segmentation tasks. Its ability to handle large datasets and complex architectures while maintaining robust generalization and faster convergence makes it a promising tool for broader machine learning applications, especially when a shorter training period is needed.

Despite its significant contributions, NestYogi’s performance can be sensitive to hyperparameter settings, particularly in high-dimensional models. Future work should focus on developing automated hyperparameter tuning methods, making the optimizer more accessible for a wider range of applications. Additionally, as real-time processing becomes increasingly important in facial recognition systems, optimizing the computational efficiency of the NestYogi optimizer for edge AI applications will be a critical area of future research.

In conclusion, these combinations of momentum-based approaches, weight averaging methods, and adaptive learning rates offer robust solutions for tackling challenging optimization problems. Still, further advances could be required as deep learning systems keep developing and real-time processing and scalability become increasingly crucial.

Author Contributions

Conceptualization, H.K.; methodology, H.K. and R.A.; software, R.A.; validation, H.K.; formal analysis, H.K.; investigation, R.A.; resources, H.K. and R.A.; data curation, R.A.; writing—original draft preparation, R.A.; writing—review and editing, H.K. and R.A.; visualization, H.K.; supervision, H.K.; project administration, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are available: the Labeled Faces in the Wild (LFW) dataset can be accessed at http://vis-www.cs.umass.edu/lfw/ (accessed on 1 July 2023), the Celeb A Dataset (Large-scale CelebFaces Attributes (CelebA) Dataset) can be accessed at https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html (accessed on 1 July 2023), the CASIA WebFace Dataset (Learning Face Representation from Scratch) can be found at https://paperswithcode.com/paper/learning-face-representation-from-scratch (accessed on 1 July 2023), and the YTF Dataset (YouTube Faces DB) can be found at https://www.cs.tau.ac.il/~wolf/papers/lvfw.pdf (accessed on 1 July 2023) and https://www.cs.tau.ac.il/~wolf/ytfaces/ (accessed on 1 July 2023). The supporting data, code, and findings of this study are available in a public GitHub repository at https://github.com/raoofaltaher (accessed on 1 July 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jain, A.K.; Ross, A.A.; Nandakumar, K. Introduction to Biometrics; Springer: New York, NY, USA, 2011; p. XVI, 312. ISBN 978-0-387-77325-4.
2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
3. Bottou, L. Large-Scale Machine Learning with Stochastic Gradient Descent. In Proceedings of the International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Available online: https://api.semanticscholar.org/CorpusID:115963355 (accessed on 1 June 2023).
4. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. CoRR 2014, abs/1412.6980. Available online: https://api.semanticscholar.org/CorpusID:6628106 (accessed on 1 June 2023).
5. Zaheer, M.; Reddi, S.; Sachan, D.; Kale, S.; Kumar, S. Adaptive Methods for Nonconvex Optimization. In Advances in Neural Information Processing Systems, 31st ed.; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018.
6. Guo, G.; Zhang, N. A survey on deep learning based face recognition. Comput. Vis. Image Underst. 2019, 189, 102805.
7. O’Toole, A.; Phillips, P.J.; Narvekar, A.; Jiang, F.; Ayyad, J. Face recognition algorithms and the other-race effect. J. Vis. 2010, 8, 256.
8. Robbins, H.E. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. Available online: https://api.semanticscholar.org/CorpusID:16945044 (accessed on 1 June 2023).
9. Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
10. Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Dokl. 1983, 372–376.
11. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
12. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude. Neural Netw. Mach. Learn. 2012, 4, 26–31.
13. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
14. Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; Wilson, A.G. Averaging weights leads to wider optima and better generalization. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), Monterey, CA, USA, 6–10 August 2018.
15. Abdulkadirov, R.; Lyakhov, P.; Nagornov, N. Survey of Optimization Algorithms in Modern Neural Networks. Mathematics 2023, 11, 2466.
16. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits; New York, NY, USA, 1998; Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 1 June 2023).
17. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556v6.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 770–778.
19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788.
20. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
21. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
22. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
23. Guo, Z.; Han, J.; Wu, L.; Yu, J.; Liang, Y.; Chen, X.; Lu, C.; Qiu, Q. SCRFD: Towards efficient face detection via learning spatial correlation. IEEE Trans. Image Process. 2021, 30, 7438–7452.
24. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823.
25. Hasan, M.; Akhter, R.; Alam, M. A review on recent advances in deep learning and their applications in face recognition. J. King Saud-Univ.-Comput. Inf. Sci. 2019, 31, 451–464.
26. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1139–1147.
27. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
28. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738.
29. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning Face Representation from Scratch. arXiv 2014, arXiv:1411.7923.
30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. Available online: https://api.semanticscholar.org/CorpusID:206593880 (accessed on 1 June 2023).
31. Wolf, L.; Hassner, T.; Maoz, I. Face Recognition in Unconstrained Videos with Matched Background Similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 529–534.
32. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments; Technical Report 07-49; University of Massachusetts Amherst: Amherst, MA, USA, 2008.
33. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
34. Hieu, L.N.; Bai, B. Cosine Similarity Metric Learning for Face Verification. In Computer Vision—ACCV 2010; Springer: Berlin/Heidelberg, Germany, 2011; pp. 709–720.
35. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
36. Zhu, Y.; Cai, H.; Zhang, S.; Wang, C.; Xiong, Y. TinaFace: Strong but Simple Baseline for Face Detection. arXiv 2020, arXiv:2011.13183.
37. Kousalya, K.; Mohana, R.S.; Jithendiran, E.K.; Kanishk, R.C.; Logesh, T. Prediction of Best Optimizer for Facial Expression Detection using Convolutional Neural Network. In Proceedings of the 2022 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 25–27 January 2022.
38. Tang, X.; Wang, X.; Hou, J.; Han, Y.; Huang, Y. Research on Face Recognition Algorithm Based on Improved Residual Neural Network. Autom. Control. Intell. Syst. 2021, 9, 46.
39. Sikder, J.; Chakma, R.; Chakma, R.J.; Das, U.K. Intelligent face detection and recognition system. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021.
40. Qi, D.; Tan, W.; Yao, Q.; Liu, J. YOLO5Face: Why reinventing a face detector. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 228–244, ISBN 978-3-031-25072-9.
41. Gao, J.; Yang, T. Face detection algorithm based on improved TinyYOLOv3 and attention mechanism. Comput. Commun. 2022, 181, 329–337.
42. Wang, G.; Li, J.; Wu, Z.; Xu, J.; Shen, J.; Yang, W. EfficientFace: An Efficient Deep Network with Feature Enhancement for Accurate Face Detection. arXiv 2023, arXiv:2302.11816.

Figure 1. Position of optimization algorithms based on convergence rate and solution space exploration. The proposed NestYogi optimizer integrates Yogi, Nesterov momentum, and SWA, achieving superior performance in both dimensions.

Figure 2. First-order optimization algorithms.

Figure 3. Comparison of optimization algorithms based on convergence rate and solution space exploration on the MNIST dataset.

Figure 4. SWA.

Figure 5. Triplet loss architecture.

Figure 6. Overview of the proposed model’s performance. (a) Proposed model face detection total loss. (b) Proposed model face detection precision. (c) Proposed model face detection recall. (d) Proposed model face detection mAP50.

Figure 7. Overview of the proposed model’s face recognition loss and F1 score. (a) Proposed model’s face recognition loss. (b) Proposed model’s face recognition F1 score.

Figure 8. Overview of the proposed model’s accuracy with various architectures. (a) Proposed face recognition model’s accuracy using VGG16. (b) Proposed face recognition model’s accuracy using InceptionV3. (c) Proposed face recognition model’s accuracy using ResNet50.

Figure 9. Proposed face detection predictions.

Figure 10. Model evaluation utilizing the final detection and recognition model.

Table 1. Analysis of reviewed works.

Reference	Method	Contributions	Limitations
Zhang et al. [21]	MTCNN	High accuracy in alignment and detection on challenging datasets.	Computational efficiency concerns in real-time applications.
Redmon and Farhadi [22]	YOLOv3, RetinaNet	High accuracy with transfer learning and object detection.	Computational demands.
Guo et al. [23]	SCRFD	Balances accuracy and efficiency for mobile devices.	May not scale well to extremely large datasets.
Schroff et al. [24]	FaceNet	Advanced face recognition through a unified embedding approach.	Challenges in generalization across diverse conditions.

Table 2. Face detection and recognition training dataset details.

Dataset	Images	Used For	No. of Identities
CelebA	202,599	Face Detection	10,177
CASIA WebFace	494,414	Face Recognition	10,575

Table 3. Face detection experiments and results.

Model	Optimizer	Phase	Total Loss	Precision (%)	Recall (%)	mAP @ 0.5 (%)	IoU (%)
VGG16	ADAM	Training	0.0052	89.7	88.3	91.4
	ADAM	Validation		87.5	85.9	89.7	76.4
	Yogi	Training	0.0048	91.0	89.5	92.5
	Yogi	Validation		88.7	87.3	90.8	78.1
	Yogi + SWA	Training	0.0045	91.8	90.4	93.3
	Yogi + SWA	Validation		89.4	88.0	91.6	79.2
	NestYogi	Training	0.0041	93.0	91.6	94.4
	NestYogi	Validation		90.5	89.3	93.0	81.5
RetinaNet	ADAM	Training	0.0037	91.2	90.1	93.5
	ADAM	Validation		89.7	88.3	92.1	80.7
	Yogi	Training	0.0034	92.8	91.5	94.3
	Yogi	Validation		91.0	89.7	93.2	82.6
	Yogi + SWA	Training	0.0031	93.6	92.3	95.1
	Yogi + SWA	Validation		91.9	90.7	94.0	84.2
	NestYogi	Training	0.0028	95.0	93.6	96.0
	NestYogi	Validation		93.2	91.9	95.2	86.2
YOLOv9	ADAM	Training	0.0026	92.1	90.8	94.0
	ADAM	Validation		90.4	89.0	92.7	83.0
	Yogi	Training	0.0024	93.7	92.3	95.3
	Yogi	Validation		92.0	90.7	94.2	84.7
	Yogi + SWA	Training	0.0021	94.5	93.1	96.1
	Yogi + SWA	Validation		93.0	91.6	95.0	86.3
	NestYogi	Training	0.0019	95.6	94.2	96.9
	NestYogi	Validation		94.1	92.7	96.0	88.0

The best results are highlighted in bold.

Table 4. Face recognition experiments and results with different backbones.

Backbone	Optimizer	Phase	Loss	Accuracy (%)	F1 Score (%)
VGG16	ADAM	Training	0.0072	85.5	84.7
	ADAM	Validation	0.0080	83.1	82.3
	Yogi	Training	0.0065	86.8	85.9
	Yogi	Validation	0.0073	84.6	83.7
	Yogi + SWA	Training	0.0060	87.9	86.8
	Yogi + SWA	Validation	0.0067	85.8	84.9
	NestYogi	Training	0.0054	89.5	88.3
	NestYogi	Validation	0.0061	87.0	86.1
InceptionV3	ADAM	Training	0.0058	86.5	85.4
	ADAM	Validation	0.0065	84.1	83.1
	Yogi	Training	0.0051	88.0	86.9
	Yogi	Validation	0.0058	85.4	84.3
	Yogi + SWA	Training	0.0046	89.2	88.0
	Yogi + SWA	Validation	0.0052	86.6	85.6
	NestYogi	Training	0.0041	90.7	89.5
	NestYogi	Validation	0.0048	88.2	87.2
ResNet50	ADAM	Training	0.0061	88.2	87.1
	ADAM	Validation	0.0070	85.5	84.5
	Yogi	Training	0.0054	89.6	88.5
	Yogi	Validation	0.0062	86.8	85.8
	Yogi + SWA	Training	0.0049	90.8	89.6
	Yogi + SWA	Validation	0.0056	88.0	86.9
	NestYogi	Training	0.0045	92.2	91.0
	NestYogi	Validation	0.0052	89.1	88.0

The best results are highlighted in bold.

Table 5. Face detection evaluation results.

Dataset	IoU (%)	Precision (%)	Recall (%)	F1 Score (%)	mAP @ 0.5 (%)	MCC
LFW	98.00	94.17	92.74	93.48	96.00	N/A
YTF	95.72	93.57	91.90	92.79	95.21	0.894

Table 6. Face recognition evaluation results.

Dataset	Acc. (%)	Prec. (%)	Recall (%)	F1 (%)	MCC
LFW	98.61	97.79	97.58	97.68	0.977
YTF	96.53	96.25	96.07	96.16	0.896
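
The F1 column in Table 6 can be cross-checked against the precision and recall columns. The following is a minimal sketch, assuming F1 is the harmonic mean of the aggregate precision and recall reported in the table; the helper name `f1_score` is hypothetical and is not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from Table 6
table6 = {"LFW": (97.79, 97.58), "YTF": (96.25, 96.07)}

for name, (p, r) in table6.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

Under this assumption, the recomputed values agree with the F1 column of Table 6 to two decimal places.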

Table 7. Comparison of methods used in face detection and recognition studies.

Reference	Method	Dataset	Detection Accuracy (IoU %)	Recognition Accuracy (%)
Zhang et al. [21]	MTCNN	WIDER FACE	85.7	N/A
Zhu et al. [36]	Generic Object Detection, ResNet50	WIDER FACE	92.1	N/A
Guo et al. [23]	SCRFD, ResNet50, FPN	WIDER FACE	89.9	N/A
Kousalya et al. [37]	Viola–Jones, CNN	Custom	80.8	80.8
Xiaolin et al. [38]	RNN	Yale B	N/A	97.5
Sikder et al. [39]	Viola–Jones, Haar Cascade, PCA	FACES94, FACES95, FACES96, Grimace	96.4	96.0
Qi et al. [40]	YOLOv5	WIDER FACE	91.5	N/A
Gao et al. [41]	TinyYOLOv3	WIDER FACE	95.5	N/A
Wang et al. [42]	EfficientNet	WIDER FACE	95.1	N/A
Redmon and Farhadi [22]	YOLOv3, RetinaNet	COCO	90.1	N/A
Proposed method	YOLOv9, ResNet50, NestYogi	LFW, YTF	98.0 (LFW), 95.7 (YTF)	98.6 (LFW), 96.5 (YTF)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

