1. Introduction
Facial recognition and detection technologies have become integral components in a myriad of applications, ranging from security systems and forensic investigations to personalized user experiences in consumer electronics. The proliferation of digital devices and the increasing demand for biometric authentication have propelled research in this domain to the forefront of computer vision and machine learning [
1]. The effectiveness and reliability of these technologies are heavily dependent on the underlying algorithms used to train the models, particularly in how they optimize and generalize from the data.
Deep convolutional neural networks (CNNs) have emerged as the de facto standard for facial biometric tasks due to their exceptional ability to learn hierarchical feature representations from raw pixel data [
2]. However, the performance of CNNs is intrinsically linked to the optimization processes employed during training, which adjust the network’s weight parameters to minimize a loss function. Traditional optimization methods, such as stochastic gradient descent (SGD), have been extensively utilized but are hindered by limitations like slow convergence rates and sensitivity to hyperparameter settings [
3]. These drawbacks can lead to suboptimal solutions, especially in complex, high-dimensional, and non-convex optimization landscapes typical of deep learning models.
To address these challenges, adaptive optimization algorithms such as Adam [4] and Yogi [5] have been developed. Adam introduces adaptive learning rates for each parameter by computing the first and second moment estimates of gradients, which accelerates convergence and improves performance in some scenarios. Yogi builds upon Adam by providing better control over the learning rate adaptation, particularly in non-convex settings, thus stabilizing the optimization trajectory and preventing issues like overshooting minima or premature convergence.
Despite these advancements, facial biometric models continue to grapple with significant challenges such as overfitting, limited generalization across diverse datasets, and slow convergence in complex environments [
6]. These problems are exacerbated in real-world conditions characterized by factors like occlusions, varying lighting conditions, and diverse facial expressions and orientations. In critical applications such as surveillance, authentication, and forensic analysis, even minor inaccuracies can have substantial consequences, highlighting the necessity for optimization methods that not only accelerate convergence but also enhance generalization and robustness [
7].
In the broader context of optimization algorithms for neural networks, a variety of methods have been developed, each with distinct characteristics concerning convergence rate and the ability to explore the solution space effectively.
Figure 1 presents a conceptual diagram positioning various popular optimization algorithms along the axes of convergence rate and solution space exploration.
Traditional methods like SGD are known for their simplicity and ease of implementation but are often plagued by slow convergence and limited exploration of the solution space, primarily due to their fixed learning rates and lack of momentum [
8]. Momentum-based methods, such as SGD with momentum [
9] and Nesterov Accelerated Gradient (NAG) [
10], introduce momentum terms that help in accelerating convergence by dampening oscillations in the optimization path, thereby moving more swiftly towards minima.
Adaptive learning rate methods, including Adagrad [11], RMSProp [12], and Adam [4], adjust the learning rates dynamically based on historical gradients, allowing for more nuanced updates that can improve convergence rates and facilitate better navigation of the loss landscape. However, these methods can sometimes suffer from issues like vanishing or exploding gradients and may converge prematurely, especially in the presence of saddle points or when dealing with highly non-convex loss surfaces [13].
Yogi [5] enhances Adam by introducing a more stable adaptive learning rate mechanism that prevents the exponential moving averages from becoming too large, thereby offering better performance in non-convex settings. Stochastic weight averaging (SWA) [14] focuses on improving generalization by averaging weights collected at different points along the optimization trajectory, effectively leading the model to converge to flatter minima that are associated with better generalization properties [15].
In this work, we introduce NestYogi, a novel hybrid optimization algorithm that integrates the strengths of Yogi, Nesterov momentum, and SWA. NestYogi aims to accelerate convergence while enhancing the exploration of the solution space, thereby improving both the training efficiency and generalization capability of deep learning models.
Figure 2 illustrates the developments in first-order optimization algorithms over the years, including the proposed method.
To empirically evaluate the efficacy of different optimization techniques for non-convex problems such as facial recognition, we conducted experiments on the MNIST digit classification task, a standard benchmark in machine learning. The MNIST dataset consists of 70,000 grayscale images of handwritten digits (0–9), presenting a challenging non-convex optimization problem due to the variations in handwriting styles and digit representations [
16].
We utilized two key performance indicators for assessing the optimizers: convergence rate and solution space exploration. The convergence rate measures how quickly an optimizer reduces the loss function during training, which is crucial for computational efficiency and timely deployment of models. Solution space exploration refers to the optimizer’s ability to navigate the loss landscape effectively to avoid local minima and saddle points, thereby ensuring better generalization to unseen data.
A Convolutional Neural Network (CNN) architecture with two convolutional layers, max-pooling, and two fully connected layers was employed. The network was trained over 40 epochs, incorporating batch normalization and dropout layers to enhance generalization and mitigate overfitting. The optimizers evaluated include the following:
Stochastic Gradient Descent (SGD): A baseline optimizer with a fixed learning rate and no adaptive mechanisms.
Adam: An optimizer that computes adaptive learning rates for each parameter based on estimates of first and second moments of the gradients.
Yogi: An optimizer that addresses some of Adam’s shortcomings by controlling the adaptive learning rates to prevent them from becoming too large.
NestYogi (Proposed Optimizer): A hybrid optimizer that combines Yogi’s adaptive learning rates with Nesterov momentum and SWA to improve both convergence rate and solution space exploration.
The experimental results indicate that SGD exhibited the slowest convergence, with noticeable oscillations and a propensity to become stuck in local minima, as shown in Figure 3. Adam achieved faster convergence but occasionally converged prematurely, limiting its ability to explore the solution space thoroughly. Yogi demonstrated improved stability in non-convex settings but did not match the rapid convergence and comprehensive exploration achieved by NestYogi. By integrating Nesterov momentum and SWA, NestYogi effectively accelerated convergence and enhanced the exploration of the loss landscape, leading to better performance on the MNIST classification task.
The conceptual framework and empirical evidence suggest that NestYogi is not only beneficial for facial recognition tasks but can also significantly enhance performance in other deep learning domains that require efficient optimization in high-dimensional, non-convex spaces. These include object detection, semantic segmentation, and autonomous navigation, where optimization challenges are prevalent.
Through extensive experimentation on facial biometric tasks, NestYogi demonstrated significant gains in face detection and recognition accuracy across various architectures, achieving up to 98% in detection tasks and 98.6% in recognition tasks. These results are particularly noteworthy given the complexities involved in training from scratch on large datasets augmented to simulate challenging scenarios such as low-light conditions and occlusions. Furthermore, the optimizer’s scalability and adaptability make it suitable for a wide range of deep learning architectures beyond facial biometrics, including VGG16 [17], ResNet [18], YOLO [19], and RetinaNet [20].
This paper systematically explores the implementation of NestYogi, detailing how the integration of Yogi, Nesterov momentum, and SWA contributes to improved optimization in deep CNN architectures for face detection and recognition applications. The combination of these techniques not only accelerates convergence but also enhances precision and generalization, making it a valuable contribution to the field of machine learning optimization.
Related Work
The domain of face detection and recognition has witnessed substantial advancements with the advent of deep learning techniques. Various methods, algorithms, and optimization strategies have been proposed to improve the accuracy and efficiency of facial biometric systems. Despite the progress, challenges such as computational efficiency, generalization across diverse conditions, and optimization in non-convex landscapes remain areas of active research.
Zhang et al. [
21] introduced the Multi-Task Cascaded Convolutional Network (MTCNN) framework, which performs joint face detection and alignment with high accuracy on challenging datasets like WIDER FACE and CelebA. The model employs a cascade of networks to predict face and landmark locations in a coarse-to-fine manner. While achieving impressive accuracy, MTCNN faces limitations in computational efficiency, making it less suitable for real-time applications where speed is critical.
Redmon and Farhadi [
22] developed YOLOv3, an object detection algorithm that employs a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation. YOLOv3 achieves high accuracy with fast inference times, making it suitable for real-time applications. However, training such models from scratch requires significant computational resources and their performance can be sensitive to the optimization strategies employed.
Guo et al. [
23] proposed the SCRFD architecture, focusing on efficient face detection by leveraging a set of carefully designed receptive field blocks and modules. The model strikes a balance between accuracy and efficiency, making it suitable for deployment on mobile and edge devices. Nonetheless, scaling this approach to extremely large datasets or more complex tasks may present challenges.
In the realm of face recognition, Schroff et al. [
24] introduced FaceNet, which learns a mapping from face images to a compact Euclidean space where distances correspond to a measure of face similarity. By using a triplet loss function during training, FaceNet effectively handles variations in facial appearance. However, the model’s generalization across diverse conditions such as varying lighting, poses, and occlusions can still be improved.
Hasan et al. [
25] conducted a comprehensive review of recent advances in deep learning for face recognition, highlighting the importance of optimization algorithms in enhancing model performance. They pointed out that while advanced architectures and loss functions contribute to accuracy improvements, optimization strategies play a crucial role in training deep networks effectively.
Despite these developments, there remain critical challenges that need to be addressed. Computational efficiency is a significant concern, particularly for models intended for real-time applications or deployment on resource-constrained devices. Furthermore, existing optimization methods may not adequately handle the complexities of non-convex loss landscapes associated with deep neural networks used in facial biometric tasks. This can lead to issues like slow convergence, overfitting, and limited generalization, especially when training on large, diverse datasets.
Table 1 summarizes the key contributions and limitations of selected works in the field.
In light of these challenges, our proposed NestYogi optimizer aims to enhance the optimization process in deep learning models for facial biometrics. By integrating adaptive learning rates, momentum-based acceleration, and weight averaging techniques, NestYogi addresses the issues of slow convergence, overfitting, and limited generalization. The subsequent sections will detail the NestYogi optimizer, its components, and how it integrates with modern deep learning architectures for face detection and recognition, providing empirical evidence of its effectiveness.
2. Materials and Methods
This section describes the deep learning techniques, optimization methods, and strategies used in our study to improve performance on facial recognition and detection tasks.
Among optimization strategies, the Adam optimizer has become widely adopted for its ease of use, broad applicability, and ability to handle large datasets and large numbers of model parameters. Adam combines momentum and RMSProp components: it maintains exponentially decaying averages of past gradients (first moment) and past squared gradients (second moment), and uses them to continually adjust the learning rate for each model parameter.
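The Adam update described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the experiments; the function name `adam_step` and the default hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are assumptions taken from common practice.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists of floats; t is the 1-based step count.

    Hypothetical helper for illustration; defaults are common choices,
    not values reported in this paper.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # First moment: decaying average of gradients (momentum component).
        m_i = beta1 * m_i + (1.0 - beta1) * g
        # Second moment: decaying average of squared gradients (RMSProp component).
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Per-parameter step: large v_hat means a smaller effective learning rate.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of the gradient's scale.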
Yogi, an improved version of Adam, offers better convergence in non-convex optimization problems by fine-tuning the learning rate using a bias-corrected second-moment estimate. Equation (1) describes how Yogi slows the growth of the effective learning rate, which improves model performance.
The Yogi optimizer builds upon Adam by fine-tuning learning rates for each parameter and handling noisy gradients in non-convex landscapes. This is particularly useful for facial recognition and detection tasks where feature scales vary significantly. Yogi stabilizes the optimization by ensuring that parameters with large gradients receive smaller learning rates while parameters with smaller gradients receive larger learning rates, allowing more efficient exploration of the loss landscape.
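To make the difference from Adam concrete, the sketch below contrasts the two second-moment updates, using the additive sign-based update from the original Yogi formulation [5]. It is illustrative only and may differ in form from Equation (1); the function names and the default `beta2` are assumptions.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def adam_second_moment(v, g, beta2=0.999):
    """Adam: multiplicative decay, so v can drop quickly when gradients shrink."""
    return beta2 * v + (1.0 - beta2) * g * g


def yogi_second_moment(v, g, beta2=0.999):
    """Yogi: additive update whose size is bounded by the current g**2."""
    g2 = g * g
    return v - (1.0 - beta2) * sign(v - g2) * g2


# When the gradient suddenly becomes small, Adam's v decays faster than Yogi's,
# so Adam's effective learning rate (proportional to 1/sqrt(v)) rises faster.
v_prev, small_grad = 1.0, 0.01
v_adam = adam_second_moment(v_prev, small_grad)
v_yogi = yogi_second_moment(v_prev, small_grad)
```

In this example `v_yogi` stays closer to `v_prev` than `v_adam` does, which is the mechanism by which Yogi limits the growth of the effective learning rate.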
Proposed Optimizer
In this work, we introduce the NestYogi optimizer, a novel hybrid optimization algorithm that extends the Yogi optimizer by combining several optimization strategies. NestYogi builds on Yogi, which improves training consistency by adjusting the learning rate via a bias-corrected estimate of the second moment. Two components are added on this foundation: Nesterov momentum, which supplies a look-ahead gradient that accelerates convergence while damping oscillations in complex loss landscapes, and stochastic weight averaging (SWA), which improves generalization. Each of these components plays a distinct role in enhancing the stability, convergence rate, and accuracy of deep learning models. Equation (2) [26] explains how incorporating Nesterov momentum into Yogi’s algorithm yields a stable and efficient learning trajectory.
Nesterov momentum plays a crucial role in accelerating the convergence of the optimizer by looking ahead at future gradients. This allows the optimizer to anticipate changes in the loss landscape and adjust accordingly, leading to faster convergence while reducing oscillations in complex regions of the landscape. This is particularly important in non-convex problems like face detection, where the optimizer needs to avoid getting stuck in suboptimal minima.
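One way to picture how a look-ahead term can sit on top of a Yogi-style second moment is the NAdam-style sketch below. It is an assumption made for illustration and is not a reproduction of Equation (2): the function `nesterov_yogi_step`, the way the look-ahead is formed, the omission of bias correction, and all default hyperparameters are hypothetical.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def nesterov_yogi_step(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-3):
    """One scalar update combining a Yogi second moment with a look-ahead.

    Hypothetical sketch: the look-ahead follows the NAdam pattern of mixing
    the updated momentum with the current gradient.
    """
    g2 = g * g
    # Momentum (first moment) as in Adam and Yogi.
    m = beta1 * m + (1.0 - beta1) * g
    # Yogi-style additive second-moment update.
    v = v - (1.0 - beta2) * sign(v - g2) * g2
    # Nesterov-style look-ahead: anticipate the next momentum step.
    lookahead = beta1 * m + (1.0 - beta1) * g
    p = p - lr * lookahead / (math.sqrt(v) + eps)
    return p, m, v
```

The look-ahead term weights the current gradient more heavily than plain momentum would, which is what lets the update react earlier to changes in the loss landscape.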
Additionally, stochastic weight averaging (SWA) is integrated into NestYogi to improve generalization. SWA averages the model weights over several epochs, which tends to locate solutions within flatter regions of the loss landscape. These regions are associated with better generalization properties, as they represent optima that are less sensitive to minor perturbations in weights. By reducing the variability in parameter values across the loss landscape, SWA helps achieve more stable and robust solutions, leading to improved generalization performance (as illustrated in Figure 4, with the detailed workflow in Algorithm 1). Equation (3) shows how SWA updates the model weights over time.
Algorithm 1 Stochastic weight averaging (SWA).
  Initialize model parameters; initialize the SWA weights as the current model weights at epoch S
  Training phase:
  for each epoch in the training range do
    for each batch in the training data do
      Compute gradients and update model using the optimizer
      optimizer.step()
    end for
    if epoch ≥ S then
      Update the SWA weights with the current model weights
    end if
  end for
  Weight averaging phase:
  for each epoch in the averaging range do
    for each batch in the training data do
      Compute gradients and update model using the optimizer
      optimizer.step()
    end for
    Update the SWA weights with a running average of the model weights
  end for
  return the SWA weights
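The running-average step in Algorithm 1 can be sketched as follows, assuming the standard equal-weight SWA average over the collected snapshots. The helper names `swa_update` and `average_snapshots` and the flat-list weight representation are assumptions made for illustration.

```python
def swa_update(swa_weights, weights, n_averaged):
    """Fold one more weight snapshot into a running average.

    n_averaged is the number of snapshots already included in swa_weights.
    """
    return [
        (s * n_averaged + w) / (n_averaged + 1)
        for s, w in zip(swa_weights, weights)
    ]


def average_snapshots(snapshots):
    """Average a sequence of weight snapshots one at a time (hypothetical helper)."""
    swa_weights = list(snapshots[0])
    for n_averaged, weights in enumerate(snapshots[1:], start=1):
        swa_weights = swa_update(swa_weights, weights, n_averaged)
    return swa_weights


# Three per-epoch snapshots of a two-parameter model.
averaged = average_snapshots([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Updating incrementally avoids storing every snapshot: only the current average and a counter are kept between epochs.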
The combined effect of these components—Yogi for adaptive learning, Nesterov momentum for acceleration, and SWA for generalization—creates a highly robust optimization strategy. This method allows the optimizer to efficiently handle both the instability of noisy gradients and the complexity of non-convex loss landscapes, common in face recognition and detection tasks.
Furthermore, L1 and L2 regularization are applied to minimize overfitting by penalizing large weights. These regularization terms encourage sparsity and prevent the model from becoming too complex, reducing the risk of overfitting to the training data. Additionally, gradient clipping is used to prevent any large updates to the weights, ensuring stability in the optimization process. Overall, the combination of these elements efficiently addresses the optimization challenges present in deep learning models, and this will be evaluated on several face detection and recognition tasks using well-established deep learning architectures and benchmark datasets.
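As a minimal sketch of these two safeguards, the code below adds L1 and L2 penalty gradients to the raw gradients and then clips the result by its global norm. The text does not state whether clipping is by norm or by value, nor the penalty coefficients, so the clipping rule, the function names, and all numeric values here are assumptions.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def regularized_grads(weights, grads, l1=0.0, l2=0.0):
    """Add the gradient of l1*|w| + (l2/2)*w**2 to each raw gradient."""
    return [g + l1 * sign(w) + l2 * w for w, g in zip(weights, grads)]


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Hypothetical coefficients, chosen only to show the call pattern.
grads = regularized_grads([1.0, -2.0], [3.0, 4.0], l1=0.1, l2=0.01)
grads = clip_by_global_norm(grads, max_norm=1.0)
```

The L1 term pushes weights toward exactly zero (sparsity), the L2 term shrinks them proportionally, and clipping bounds the size of any single update.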
3. Experimental Section
3.1. Training Configuration
The proposed optimization was applied to three well-known architectures for the face detection experiments: VGG16, RetinaNet, and YOLOv9 [27]. The experiments leveraged the CelebA dataset, which includes 202,599 images split into three sets: training (70%), validation (15%), and testing (15%) [28], as detailed in Table 2. To ensure model robustness, the images were preprocessed with normalization and augmentation techniques, such as random cropping, horizontal flipping, brightness and hue adjustments, blurring, and addition of salt-and-pepper noise to 1.3% of the pixels. These preprocessing steps simulated real-world variations in lighting, occlusion, and position.
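The salt-and-pepper augmentation can be sketched as below. Only the 1.3% pixel fraction comes from the text; the grayscale list-of-lists image format, the equal split between salt (255) and pepper (0), and the function name `add_salt_and_pepper` are assumptions.

```python
import random


def add_salt_and_pepper(image, fraction=0.013, rng=None):
    """Return a copy of a 2D grayscale image with a fraction of pixels corrupted.

    image is a list of rows of intensities in 0..255. Each selected pixel is
    set to 255 (salt) or 0 (pepper) with equal probability (an assumption).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    n_noisy = round(fraction * height * width)
    noisy = [row[:] for row in image]  # leave the input untouched
    # Sample distinct pixel positions so exactly n_noisy pixels are altered.
    for index in rng.sample(range(height * width), n_noisy):
        row, col = divmod(index, width)
        noisy[row][col] = 255 if rng.random() < 0.5 else 0
    return noisy
```

Passing a seeded `random.Random` instance makes the augmentation reproducible across runs.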
VGG16: Customized with a batch size of 32 and an initial learning rate of 0.001. To mitigate overfitting, a weight decay of 0.0005 was used. Training was carried out for 100 epochs, with a 0.1 reduction in learning rate every 15 epochs. Stochastic weight averaging (SWA) was initiated from epoch 75.
RetinaNet: This model was trained with a batch size of 32 and an initial learning rate of 0.001, using ResNet50 as the backbone. Focal loss was used to address class imbalance, which is especially important for detecting small or occluded faces. The model was trained over 100 epochs with learning rate reductions every 15 epochs and SWA starting from epoch 75, similar to VGG16.
YOLOv9: YOLOv9 was trained for 100 epochs using a batch size of 32 and an initial learning rate of 0.01, with SWA starting at epoch 75. A multi-scale training approach was used to improve face detection accuracy across various scales and conditions.
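The learning rate schedule and SWA start point shared by these configurations can be sketched as follows. The sketch assumes that the "0.1 reduction" means multiplying the learning rate by 0.1 every 15 epochs and that epochs are counted from zero; both are interpretations, and the function names are hypothetical.

```python
def step_lr(initial_lr, epoch, step_size=15, gamma=0.1):
    """Learning rate after multiplying by gamma once every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


def swa_active(epoch, swa_start=75):
    """True once weight averaging should begin (epoch index is an assumption)."""
    return epoch >= swa_start


# Schedule for a 100-epoch run with an initial learning rate of 0.001.
schedule = [(epoch, step_lr(0.001, epoch), swa_active(epoch)) for epoch in range(100)]
```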
For face recognition tasks, three different backbones (ResNet50, VGG16, and InceptionV3) were employed in Siamese networks. The CASIA WebFace dataset, consisting of 10,575 individuals and 494,414 images [29] as detailed in Table 2, was split into 80% for training and 20% for validation to ensure ample data for accurate evaluation.
VGG16: Customized with a batch size of 64 and a learning rate of 0.0001. The triplet loss function was used with a margin of 0.2 to separate embeddings of identical and different identities. The learning rate was adjusted to 0.001 at epoch 200, and training continued for 500 epochs with SWA initialized at epoch 400.
ResNet50: Configured with a batch size of 64 and a starting learning rate of 0.0001. Training was conducted similarly to VGG16, using the triplet loss function to optimize feature differentiation over 500 epochs.
InceptionV3 [30]: Trained with a learning rate of 0.0001 and a batch size of 64, using gradient accumulation to stabilize training due to its complex structure. The model was trained for 500 epochs, with checkpoints saved at each epoch to prevent overfitting.
Bounding boxes were used during the training process to identify and locate faces in images for detection tasks. For face recognition, the triplet loss function was used to distinguish between anchor, positive, and negative samples (same identity vs. different identity). This loss function, as shown in Equations (4)–(6), played a crucial role in learning distance metrics that distinguish between similar and distinct faces.
Figure 5 illustrates the triplet loss architecture used to maximize recognition by distinguishing between facial feature similarities and differences.
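For a single (anchor, positive, negative) triplet, the loss can be sketched as below, assuming the widely used hinge form on squared Euclidean distances introduced with FaceNet [24]. The margin of 0.2 is the value reported above; the function names and the exact form are assumptions and may differ from Equations (4)–(6).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the negative is farther than the positive by margin."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# Hard triplet: the negative is too close, so the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.3, 0.0])
```

Only triplets that violate the margin contribute gradient, which is why triplet selection matters in practice.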
3.2. Hardware and Scalability Considerations
The experiments were performed on an NVIDIA A100 GPU, which provides efficient training for large scale datasets and complex models. Nesterov momentum was set to 0.9 in the NestYogi configuration, which improved the optimizer’s ability to predict gradient directions and stabilize convergence. SWA was introduced after 75% of the training, averaging weights over the final 25% of epochs to enhance generalization.
Given that real-world applications often demand both high accuracy and computational efficiency, the scalability of NestYogi was tested across different architectures and datasets. Future work will further explore the potential of NestYogi in edge AI and resource constrained environments, where smaller datasets and limited hardware may challenge the performance of deep learning models.
The study was carefully designed to evaluate the effect of the proposed methods on convergence rate and accuracy across a range of tasks. To ensure that improvements were not limited to the training phase, particular emphasis was placed on the models’ performance on validation and test datasets, as this is one of the research objectives. NestYogi prioritizes generalization to produce more reliable models that perform well on unseen data, an essential requirement in real-world applications.
3.3. Results
The proposed methods yielded notable gains in model performance in both face detection and recognition tasks. In face detection, they improved key performance metrics, including accuracy, recall, mean average precision (mAP), and intersection over union (IoU); Table 3 details the training and validation experiments conducted for the face detection phase. More specifically, the proposed method improved the detection precision on the VGG16 architecture by around 3%, starting at a baseline of 91.8% (with Yogi + SWA) and rising to 93.0% (training) and 90.5% (validation). The model’s performance also improved in terms of mAP, which increased, and total loss, which decreased. Experiment results are shown in Figure 6a–d.
Following the integration of NestYogi, RetinaNet’s accuracy rose by 2–3%. In the training phase, it increased from 93.6% to 95.0%. More importantly, during the validation phase the accuracy improved from 91.9% to 93.2%. The gains are related to the optimizer’s proficiency in handling tiny and partly occluded faces. This improvement is mostly attributable to the combination of three elements: the optimizer’s ability to adjust the learning rate dynamically for each model parameter; stochastic weight averaging (SWA), which draws on a larger pool of weights over a few epochs and locates a wider and more stable minimum in the loss landscape; and Nesterov momentum, which offers look-ahead gradient acceleration and the capacity to damp oscillations.
A notable improvement in processing performance was found with YOLOv9: inference time was lowered by approximately 7–10% while an excellent degree of accuracy was maintained. This gain is especially useful in real-time applications where accuracy is crucial in addition to speed. During the training of YOLOv9, the accuracy increased from 94.5% to 95.9%, and during validation it increased from 90.5% to 92.7%, while handling dynamic changes in object motion and scale. Because of this, YOLOv9 with NestYogi is very useful in cases where accurate and timely face identification is required.
Detailed training and validation results for most of the backbones used are shown in Table 4, Figure 7a,b, and Figure 8a–c. The optimizer also improves different backbones in face recognition tests. When applied to the VGG16 backbone, it raised training accuracy from 87.9% to 89.5% and, more importantly, raised validation accuracy from 85.8% to 87.0%, an estimated 2% gain in accuracy. This is important for face recognition, where precise individual recognition is essential, especially in crowded settings and when training from scratch. Dynamic learning rate adaptation and anticipatory updates let the model adjust its parameters in a way that produces better decision boundaries and reduces classification errors.
The proposed methods improved generalization to unseen data with ResNet50, resulting in a 3% boost in precision over the traditional architecture. Here, the SWA element is most effective, since it enables the model to find a more stable and wider minimum in the loss landscape. As a result, the model performs better on unseen data across a range of face recognition settings. InceptionV3, another backbone evaluated with NestYogi, achieved a 15–20% reduction in training time while retaining its ability to adapt to changes in face orientation and size. Its ability to converge faster while maintaining accuracy demonstrates the optimizer’s effectiveness in handling complex structures, including settings where hardware and software limitations may be present and prompt, precise, and stable training is required.
The proposed methods are examples of sophisticated optimization approaches incorporated into convolutional neural networks (CNNs) to enhance their performance and generalization in face detection and recognition. These approaches improve measures such as accuracy, recall, mAP, and IoU across different architectures by tackling important issues like training stability and convergence rate. During training, adaptive learning rates let the algorithm adapt dynamically, which helps to prevent overfitting and enhance performance on unseen data. These developments show how well-constructed optimization algorithms can enhance CNNs in a variety of applications, including public safety, surveillance, and mobile biometrics. By optimizing models to handle changes in dataset complexity and size, this method establishes a new benchmark for solid and flexible intelligent video analytics systems.
4. Evaluation
For the evaluation process, the most accurate face detection model, YOLOv9, was used to crop faces from the images in the evaluation datasets. This preprocessing is essential because it ensures that correctly detected faces serve as the foundation for the subsequent face recognition step, which improves the overall precision and reliability of the recognition results. Feature embeddings were then extracted from the cropped faces using the ResNet50 architecture as the embedding model, which was shown to be the best face recognition backbone among the tested networks. Two popular public datasets, YouTube Faces (YTF) [31] and Labeled Faces in the Wild (LFW) [32], were used for the evaluation. By assessing the model’s performance over different data types and sizes and examining the proposed method’s flexibility, this dual-dataset evaluation supports the findings and enables reproducibility.
4.1. Evaluation Metrics
We use the following metrics to examine the models’ performance in more detail. Precision measures the proportion of instances predicted as positive that are truly positive. This is calculated in Equation (7).
Recall (true positive rate or sensitivity) measures the proportion of actual positive instances that are correctly detected. The equation for recall is Equation (8):
The F1 score, shown in Equation (9), is the harmonic mean of precision and recall. It balances both metrics to provide a single score that accounts for both false positives and false negatives.
Accuracy is the percentage of all instances, both positive and negative, that are correctly classified, as represented in Equation (10):
The Matthews correlation coefficient (MCC) is a general performance indicator used to gauge the quality of binary classification models, based on four elements (TP, TN, FP, and FN) [33]. Compared to measures like accuracy, MCC can offer a fairer evaluation, especially when the classes (face/non-face in detection, correct/incorrect recognition) might be unbalanced. This is calculated using Equation (11):
These metrics provide an in-depth evaluation of the models’ performance in tasks integrating both face detection and recognition. By applying them to the outcomes of the LFW and YTF dataset evaluations, our objective is to deliver the most informative numerical evaluation of the value of the proposed methods.
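The five metrics above follow directly from the confusion-matrix counts. The sketch below uses their standard definitions; the function name `classification_metrics`, the zero-division conventions, and the example counts are assumptions made for illustration and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, accuracy, and MCC from confusion-matrix counts.

    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```

Note that MCC uses all four counts, so it is undefined (reported here as 0.0) when a dataset contains no negatives at all.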
4.2. Face Detection Evaluation
To assess the YOLOv9 model’s effectiveness in face detection, the evaluation covered several critical aspects. First, we computed the intersection over union (IoU), an effective statistic for evaluating the localization accuracy of object detection models. YOLOv9 precisely identified and located faces across a diverse and complex dataset, achieving an average IoU score of 98% on the LFW dataset. The model obtained a high IoU score of 95.7% on the YTF dataset, which involves video footage with rapid changes in lighting and facial expressions; this demonstrates its ability to address challenging real-world conditions.
The precision and recall metrics provide insight into the model’s ability to recognize faces accurately while reducing false positives and false negatives. Precision and recall for the LFW dataset were 94.1% and 92.7%, respectively. The YTF dataset showed good performance considering the extra complexity of video data, with a slightly lower precision of 93.5% and recall of 91.9%.
The F1 score, a single quality measure that strikes a balance between precision and recall, was 93.4% on the LFW dataset and 92.7% on the YTF dataset. The findings show that the YOLOv9 model was able to detect faces with consistency and efficiency, even in the face of challenges presented by the video-based YTF dataset.
With an IoU threshold of 0.5, mean average precision (mAP), a common metric for object detectors, was also calculated. Our model achieved mAP values of 96.0% on the LFW dataset and 95.2% on the YTF dataset.
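The IoU between a predicted and a ground-truth box, and the 0.5 threshold used to count a detection as correct, can be sketched as follows. The `(x1, y1, x2, y2)` corner format and the function names are assumptions; the full mAP computation (ranking by confidence and integrating the precision-recall curve) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, true_box, threshold=0.5):
    """A prediction counts as a true positive when IoU reaches the threshold."""
    return iou(pred_box, true_box) >= threshold
```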
For MCC, detection is treated as a binary classification of face (positive class) vs. non-face (negative class). There are four types of outcomes: true positives, where a face is correctly identified; false positives, where a non-face region is identified as a face; false negatives, where a face fails to be identified; and true negatives, where a non-face region is identified as such. Since all images in LFW contain at least one face, meaning there are no true negatives in this case, we only measured the MCC for the YTF dataset, which was 0.894.
The optimization elements integrated during the training phase account for the YOLOv9 model’s strong performance on these metrics. The optimizer’s adaptive learning rate strategy allowed the model to converge faster to an optimal solution, even with noisy or challenging input. Because the optimizer adapts dynamically to the changing gradients of individual parameters during training, it is particularly helpful when visual information changes quickly, as in the YTF dataset. Figure 9 presents some of the model’s predictions and confidence scores from both evaluation datasets, and the results of the evaluation phase are shown in Table 5.
4.3. Face Recognition Evaluation
Similar criteria were adopted to evaluate face recognition with the ResNet50 model. The model achieved a recognition accuracy of 98.61% on the LFW dataset, which shows its ability to correctly identify individuals from static pictures, and a lower accuracy of 96.5% on the YTF dataset.
To evaluate the model’s performance further, we applied the Euclidean distance and cosine similarity measures, which are especially useful for assessing how well face embeddings distinguish between individuals. These metrics, which are described in Equations (12)–(14), are only as good as the embeddings that the trained model produces [34,35].
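The two embedding measures can be sketched as follows. The code uses the standard textbook definitions of cosine similarity and Euclidean distance, which may differ in notation from Equations (12)–(14); the function names and the toy vectors are assumptions for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 3-dimensional "embeddings"; real face embeddings have many more dimensions.
same_person = cosine_similarity([0.9, 0.1, 0.2], [0.8, 0.2, 0.1])
other_person = cosine_similarity([0.9, 0.1, 0.2], [0.1, 0.9, 0.3])
```

Higher cosine similarity and lower Euclidean distance both indicate that two embeddings are more likely to belong to the same person.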
In Algorithm 2, we present the workflow of the evaluation process.
Algorithm 2 Model Evaluation.
1: Start
2: Load dataset
3: Load face detection model (YOLOv9) and face recognition model (ResNet50-based Encoder)
4: Initialize dictionary for storing encodings
5: for each image in dataset do
6:     Detect face using detection model
7:     if face is detected then
8:         Crop the face from the image using bounding box coordinates
9:         Encode the cropped face using ResNet50-based encoder
10:         Store the encoding in the dictionary, along with the true label (person’s identity)
11:     end if
12: end for
13: Initialize counter K for correctly predicted identities
14: for each person in the dictionary do
15:     Calculate cosine and Euclidean distances with every other person
16:     Find the person with the smallest distance (most similar person)
17:     if the person is most similar to themselves then
18:         Increment the correct prediction counter K
19:     end if
20: end for
21: Calculate metrics
22: End
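The workflow of Algorithm 2 can be sketched in Python as shown below. This is one possible reading under stated assumptions, not the authors’ implementation: `detect_face` and `encode_face` are hypothetical callables standing in for the YOLOv9 detector and the ResNet50-based encoder, each identity is assumed to appear in more than one image, and "most similar to themselves" is interpreted as the nearest other encoding carrying the same identity label. Only cosine distance is used for the nearest-neighbor search here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, so smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def evaluate(dataset, detect_face, encode_face):
    """Return the fraction of encodings whose nearest neighbor shares their identity.

    dataset yields (image, identity) pairs; detect_face returns a face crop or
    None; encode_face maps a crop to an embedding. Both callables are
    hypothetical placeholders for the real models.
    """
    encodings = []  # (identity, embedding) pairs, steps 4 to 12
    for image, identity in dataset:
        crop = detect_face(image)
        if crop is None:  # no face detected, skip the image
            continue
        encodings.append((identity, encode_face(crop)))

    if len(encodings) < 2:  # nothing to compare against
        return 0.0

    correct = 0  # counter K, steps 13 to 20
    for i, (identity, embedding) in enumerate(encodings):
        nearest = min(
            (j for j in range(len(encodings)) if j != i),
            key=lambda j: cosine_distance(embedding, encodings[j][1]),
        )
        if encodings[nearest][0] == identity:
            correct += 1
    return correct / len(encodings)
```

Passing the models in as callables keeps the sketch independent of any particular detection or recognition library.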
On the LFW dataset, the model achieved an average cosine similarity score of 0.96 and a Euclidean distance of 0.25, indicating a substantial level of identity differentiation. On the YTF dataset, the cosine similarity score was 0.94 and the Euclidean distance was 0.27.
The precision, recall, and F1 score metrics present a more detailed picture of the model’s performance. On the LFW dataset, they are 97.79%, 97.58%, and 97.68%, respectively, indicating a high degree of accuracy when distinguishing between individuals. On the YTF dataset, they are 96.25%, 96.07%, and 96.16%, respectively. MCC can also be applied to face recognition by framing it as identity matching (i.e., whether a given face matches a known individual or not). True positives in this case are accurate person identifications, while false positives are inaccurate identifications. The MCC for identity matching is 0.977 on the LFW dataset and 0.896 on YTF, which indicates how reliably the model separates correct from incorrect matches. The evaluation results for face recognition are detailed in Table 6.
These findings again highlight that the model maintains a very high level of performance across different datasets. The complete pipeline, which combines detection, recognition, and prediction from both models, is shown in Figure 10.
These strong metrics were made possible by the optimizer’s capabilities, which included enhancing the learning process, preventing overfitting, and improving convergence. By thoroughly evaluating the suggested models on public datasets such as LFW and YTF, which are commonly used for evaluating face detection and recognition models, this study shows that they are useful in real-world scenarios.
5. Discussion and Limitations
The experimental results demonstrate that NestYogi significantly enhances both the speed and accuracy of face detection and recognition tasks. It blends the Yogi optimizer’s adaptive learning rate mechanism with Nesterov momentum and the stabilizing effect of stochastic weight averaging (SWA). Across a range of advanced deep learning architectures, this synergistic approach proved crucial for maximizing convergence rates and model validity, giving consistent performance in the training and evaluation phases.
Yogi’s adaptive learning rate ensures that model parameters with high variance gradients are updated more cautiously, thus stabilizing training. Nesterov momentum accelerates convergence by predicting future gradients, allowing the optimizer to avoid oscillations in complex loss landscapes. SWA contributes to better generalization by averaging weights over multiple epochs, ensuring that the model converges to flatter regions of the loss landscape that are less sensitive to noisy gradients.
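The weight-averaging step of SWA can be illustrated with a minimal sketch. It assumes an equal-weight running average over snapshots taken at the end of selected epochs, with weights represented as a flat list of floats; the function name `swa_update` and the toy snapshots are hypothetical, and the averaging schedule used in this study may differ.

```python
def swa_update(swa_weights, weights, n_averaged):
    """Fold one new weight snapshot into an equal-weight running average.

    n_averaged is the number of snapshots already contained in swa_weights.
    """
    return [
        (avg * n_averaged + w) / (n_averaged + 1)
        for avg, w in zip(swa_weights, weights)
    ]


# Three hypothetical end-of-epoch snapshots of a two-parameter model.
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
swa = snapshots[0]
for n, snapshot in enumerate(snapshots[1:], start=1):
    swa = swa_update(swa, snapshot, n)
# swa now holds the element-wise mean of the three snapshots.
```

Because the averaged weights sit between the individual snapshots, they tend to land in the interior of a wide basin rather than on its edge, which is the intuition behind the flatter-minimum argument above.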
Real-world face recognition systems often encounter obstacles such as changing lighting conditions, awkward body positions, and occlusion. Adaptability is therefore especially important for systems deployed in dynamic situations where conditions may change unexpectedly. Rapid changes in data distribution, such as changing lighting or occlusion, are a common challenge for standard optimizers like Adam and SGD. In contrast, by using Yogi’s sign-based update in the second moment and a Nesterov-adjusted gradient in the bias-corrected first moment, NestYogi adjusts its learning rate dynamically for each parameter in response to the current gradients and momentum. This enables the model to react to such changes faster, supporting a better solution search and greater accuracy in unpredictable situations.
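To make the interaction of these two ingredients concrete, the sketch below shows one update step for a single scalar parameter. It is an illustration under assumptions, not the exact NestYogi update rule: the second moment follows the published Yogi rule, the first moment uses an NAdam-style lookahead with bias correction, and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`) are assumed values. SWA is omitted because it operates on weight snapshots rather than on individual steps.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def nestyogi_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-3):
    """One illustrative update for a scalar parameter; t starts at 1."""
    g2 = grad * grad
    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Yogi second moment: additive update whose direction is set by a sign term.
    v = v - (1 - beta2) * sign(v - g2) * g2
    # NAdam-style Nesterov lookahead on the bias-corrected first moment (assumed form).
    m_hat = beta1 * m / (1 - beta1 ** (t + 1)) + (1 - beta1) * grad / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy usage: 100 steps on f(x) = x**2, whose gradient is 2 * x.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = nestyogi_step(theta, 2 * theta, m, v, t)
```

The sign term means the second moment changes by at most a small multiple of the squared gradient per step, so the effective learning rate reacts to gradient changes without the abrupt jumps that a purely multiplicative average can produce.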
By smoothing the training process and reducing overfitting, the combination of SWA and adaptive learning rate adjustments reduced the need for extended training periods. Because of the higher convergence rate, this efficiency lowers both computing overhead and energy consumption, a critical feature where resource efficiency matters. It also allows training in a short time and with a low number of epochs while still benefiting the validation and evaluation phases.
The combined techniques showed good generalization on unseen datasets, such as YouTubeFacesDB (YTF) and Labeled Faces in the Wild (LFW). The combination of SWA and Nesterov momentum plays a crucial role in this result. SWA drives the model weights toward a broader, flatter region of the loss landscape, which is critical for maintaining stability across different datasets, while Nesterov momentum helps the model adjust rapidly to new data distributions, making NestYogi particularly effective on diverse datasets like LFW and YTF. Thanks to this improved generalization, the systems show promise to operate well not only in controlled testing conditions but also in more unexpected and variable real-world scenarios. Stabilizing the learning process with SWA and adaptive momentum adjustments is what makes this wider and more stable minimum reachable.
However, challenges may arise if the suggested approach is not implemented carefully. A significant limitation is the dependency on high-quality, representative training data: NestYogi’s performance can be hindered if the data are imbalanced or do not sufficiently cover the variations present in real-world scenarios. This dependence highlights the need for thorough and representative data-collection procedures if the optimizer is to reach its full potential. Additionally, combining several techniques produces a set of hyperparameters that must be tuned accurately. This complexity raises the learning curve and makes the training process slightly harder, especially for practitioners who are less familiar with such advanced techniques.
Auto-tuning could significantly reduce the optimizer’s sensitivity to hyperparameters. Future iterations of the NestYogi optimizer could incorporate automatic hyperparameter tuning mechanisms, allowing the optimizer to adjust itself dynamically based on model performance. This would significantly reduce the need for manual intervention and make it more accessible to practitioners with less experience in hyperparameter optimization. Such developments would further broaden the application of NestYogi in real-time environments, improving both its efficiency and usability. At present, its limited ability to adapt easily to various running environments could restrict its application in some scenarios. Future research ought to focus on enabling the optimizer to adjust its parameters automatically in response to changes in the behavior of the model and real-time data inputs.
It could also be beneficial to investigate how the suggested approaches can be integrated with other, more efficient architectures and employed in other domains. For example, NestYogi could be adapted for semantic segmentation, where precise boundary detection is critical. Its dynamic learning adjustments and momentum-based convergence would also be useful for real-time object detection systems in resource-constrained environments, such as edge AI applications. These extensions would further validate the robustness and flexibility of the NestYogi optimizer across multiple domains.
In light of these findings, future work will also focus on improving how the optimization strategy combines its components and on making it more flexible, so that it can dynamically adapt each parameter in response to changes in the properties of the data and the behavior of the model. Further work will investigate whether the optimizer can be integrated into more recent and efficient architectures and whether it is useful in other fields such as semantic segmentation and object detection. Such efforts would confirm, and possibly extend, the value of this study’s findings for the development of machine learning optimization algorithms.
Table 7 shows a comparison of the methods most used in face detection and recognition studies.
6. Conclusions
The NestYogi optimizer introduced in this study represents a substantive advancement in the optimization of deep learning models tailored for facial biometric applications. The novel contribution of this work lies in the unique integration of Yogi, Nesterov momentum, and SWA, which together address the common limitations faced by traditional optimizers like Adam and SGD. This hybrid approach not only accelerates convergence but also improves generalization, making it particularly effective for complex facial biometric tasks involving occlusion, varying lighting, and pose variations.
By integrating the adaptive learning capabilities of the Yogi optimizer with the anticipatory momentum of Nesterov and the stabilization effects of stochastic weight averaging (SWA), NestYogi has demonstrably enhanced the convergence rate and accuracy of neural networks. This makes NestYogi especially suited for real-world applications, such as security surveillance, forensic analysis, and biometric authentication systems, where fast and accurate facial recognition is crucial despite changing environmental conditions. The ability to adapt dynamically during training ensures robust performance even when confronted with fluctuating factors like lighting and occlusion. This is particularly significant in facial recognition and detection tasks, where variability in data—such as changes in illumination, pose, and occlusion—presents persistent challenges.
By dynamically adjusting the learning rates according to the moving averages of the first and second moments of the gradients, the adaptive learning rate mechanism addresses the limitations of fixed-step gradient descent. This allows the optimizer to adjust step sizes accurately, enhancing convergence in the non-convex loss landscapes that frequently appear in complex systems. Nesterov momentum further speeds up progress by giving the optimizer a lookahead gradient. This leads to more efficient training and helps prevent premature convergence to suboptimal solutions.
The incorporation of SWA helps the models generalize more broadly. Flatter minima in the loss landscape are known to correlate with higher generalization performance, and SWA assists the model in finding them by averaging the weights across numerous epochs. This combination has been shown to be very effective at mitigating the overfitting commonly seen in face biometric model development. Furthermore, the scalability and flexibility of the NestYogi optimizer open avenues for its application beyond facial recognition, such as in object detection and semantic segmentation tasks. Its ability to handle large datasets and complex architectures while maintaining robust generalization and faster convergence makes it a promising tool for broader machine learning applications, especially when a shorter training period is needed.
Despite its significant contributions, NestYogi’s performance can be sensitive to hyperparameter settings, particularly in high-dimensional models. Future work should focus on developing automated hyperparameter tuning methods, making the optimizer more accessible for a wider range of applications. Additionally, as real-time processing becomes increasingly important in facial recognition systems, optimizing the computational efficiency of the NestYogi optimizer for edge AI applications will be a critical area of future research.
In conclusion, these combinations of momentum-based approaches, weight averaging methods, and adaptive learning rates offer robust solutions for tackling challenging optimization problems. Still, further advances could be required as deep learning systems keep developing and real-time processing and scalability become increasingly crucial.