1. Introduction
Emotions play a significant role in human interactions, serving as essential mediators in social communication systems [
1]. Humans express emotions through diverse modalities, including facial expressions, speech patterns [
2], and body language [
3]. According to Darwin and Prodger [
4], human facial expressions indicate people's emotional states and intentions. Recently, automatic emotion detection through computer vision techniques has seen growing interest and application across many domains, including hospital patient care [
5], neuroscience research [
6], smart home technologies [
7], and even in cancer treatment [
8,
9]. This diversity has established emotion recognition as a distinct and growing research field, primarily due to its wide range of applications and its profound impact on many aspects of human life.
Emotion recognition from images mainly consists of two steps: feature extraction and classification. Facial images encompass a multitude of features including geometric, texture, color, intensity, landmark, shape, and histogram-based features. Handcrafted techniques for feature extraction in facial images involve the manual identification of landmarks for geometric features, texture analysis using methods like Local Binary Patterns (LBPs) [
10], and color distribution analysis through histograms. To enhance the feature extraction process, dimensionality reduction techniques such as PCA (Principal Component Analysis) [
11] and t-SNE (t-Distributed Stochastic Neighbor Embedding) [
12] have been employed to obtain crucial features for classification. Traditional machine learning algorithms like Support Vector Machine (SVM) [
13] and Random Forest (RF) [
14] have been used to classify emotions from these features. However, handcrafted features often struggle to capture the information required for effective facial expression recognition. Moreover, kernel-based methods frequently produce feature vectors that are excessively large, leading to model overfitting [
15].
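To make this classical pipeline concrete, the sketch below illustrates a typical handcrafted baseline of this kind, assuming scikit-image's LBP implementation and scikit-learn's PCA and SVM; the grid size, number of components, and SVM parameters are illustrative and are not drawn from any cited work.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def lbp_grid_features(gray, points=8, radius=1, grid=(7, 7)):
    """Concatenate uniform-LBP histograms computed over a grid of cells."""
    lbp = local_binary_pattern(gray, points, radius, method="uniform")
    h, w = lbp.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                       j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(cell, bins=points + 2,
                                   range=(0, points + 2), density=True)
            feats.append(hist)
    return np.concatenate(feats)

# X_gray: iterable of grayscale face crops, y: emotion labels (hypothetical data)
# features = np.array([lbp_grid_features(img) for img in X_gray])
classifier = make_pipeline(
    PCA(n_components=50),       # dimensionality reduction before classification
    SVC(kernel="rbf", C=1.0),   # emotion classification from the reduced features
)
# classifier.fit(features, y)
```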
Deep learning models, particularly convolutional neural networks (CNNs) [
16], are renowned for their capability to automatically learn hierarchical features and complex patterns. However, CNNs frequently face challenges such as overfitting, which arises from limited data availability, as well as high computational complexity. Additionally, issues like vanishing or exploding gradients can undermine the stability of the training process [
17].
Transfer learning has gained popularity in machine learning as a way of reusing knowledge learned on one task to accelerate learning on a related one. It provides a framework for leveraging well-known pre-trained models, such as VGG (Visual Geometry Group) [
18], ResNet [
19], and DenseNet121 [
20], which are trained on millions of images; this is particularly relevant for facial expression recognition (FER) applications in the related domain of facial images.
A schematic representation of the proposed framework is presented in
Figure 1, utilizing images sourced from the Karolinska Directed Emotional Faces (KDEF) [
21] dataset for illustrative purposes. As presented in
Figure 1, the first step (1) showcases facial images retrieved from the KDEF, Filtered Facial Expression Recognition 2013 (FER2013) [
22], and Extended Cohn-Kanade (CK+) [
23] datasets. In the subsequent step, step (2), the data undergo preprocessing, wherein data augmentation techniques such as horizontal flipping, zooming, and rotation, as well as histogram equalization [
24], are applied. These techniques augment the dataset and enhance image contrast, thereby facilitating improved feature extraction. Moving on to step (3), the pre-trained VGG19 and VGG16 models are fine-tuned and modified. During this process, the last convolutional block of each model is kept unfrozen, while the remaining layers of the base models (pre-trained VGG16 and VGG19) are frozen. Additionally, fully connected layers are then attached to these models. Moreover, diverse learning rate schedulers, including cosine annealing [
25], are implemented in the models. The evaluation phase encompasses training the models on the KDEF, Filtered FER2013, and CK+ datasets, followed by an assessment using various evaluation metrics. These metrics include accuracy, AUC-ROC [
26], AUC-PRC, and Weighted F1 score [
27].
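Although the exact evaluation code is not reproduced here, these metrics can be computed with scikit-learn roughly as follows; the function name and the assumption that the models output per-class softmax probabilities are ours.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.preprocessing import label_binarize

def evaluate(y_true, y_prob, n_classes=7):
    """Compute accuracy, AUC-ROC, AUC-PRC, and weighted F1 for multi-class FER.
    y_true: integer class labels; y_prob: per-class softmax probabilities."""
    y_pred = np.argmax(y_prob, axis=1)
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_prob, average="weighted", multi_class="ovr"),
        "auc_prc": average_precision_score(y_bin, y_prob, average="weighted"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }
```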
A significant contribution of this study lies in optimizing the performance of a simple pre-trained architecture such as VGG on well-known FER image datasets. Instead of opting for complex deep neural network models, this study demonstrates that careful fine-tuning of simpler architectures can lead to better classification accuracy. It also investigates the efficacy of histogram equalization and data augmentation in improving FER accuracy on three benchmark FER datasets. Additionally, through extensive experiments, this study showcases the effectiveness of different regularization techniques, callbacks, and learning rate schedulers in enhancing model performance for FER.
The subsequent sections of this paper are structured as follows.
Section 2 presents a review of the related literature in the field.
Section 3 introduces histogram equalization and cosine annealing to aid in understanding the proposed models.
Section 4 elaborates on the transfer learning-based model, datasets used, and the experiment pipeline.
Section 5 presents the experimental results, while
Section 6 provides a thorough discussion of these results. Finally,
Section 7 concludes this paper by discussing its significance and outlining potential future work.
2. Related Works
Numerous studies conducted in recent years have focused on FER, employing various techniques. Traditional machine learning approaches have been used alongside CNN models to extract the information needed to classify emotions from visual data.
Xiao-Xu et al. [
28] employed an ensemble approach using Wavelet Energy Features (WEFs) and Fisher’s Linear Discriminants (FLD) for the feature extraction and classification of seven facial expressions (anger, disgust, fear, happiness, neutral, sadness, surprise) within the Japanese Female Facial Expression (JAFFE) dataset [
29]. Dhall et al. utilized the Pyramid of Histogram of Oriented Gradients (PHOG) [
30] and Local Phase Quantization (LPQ) [
31] features to encode shape and appearance information. They selected keyframes through the K-means clustering [
32] of normalized shape vectors from Constrained Local Models (CLMs) based face tracking. Emotion classification on the SSPNET [
33] and GEMEP-FERA [
34] datasets was conducted using an SVM and the Largest Margin Nearest Neighbor (LMNN) algorithm [
35]. Pu et al. proposed a framework employing two-fold RF classifiers to recognize Action Units (AUs) from image sequences. Facial motion measurements involved tracking Active Appearance Model (AAM) [
36] facial feature points with Lucas–Kanade optical flow [
37], using displacement vectors between the neutral and peak expressions as motion features. These features were fed into a first-level RF for AU determination, followed by a second-level RF for facial expression classification [
38]. Golzadeh et al. focused on spatio-temporal feature extraction based on tracked facial landmarks, aiming to develop an automatic emotion recognition system [
39]. They employed the KDEF dataset to identify features that represent different human facial expressions, subsequently evaluating them through various classification methods. Through experimentation and employing K-fold cross-validation, they achieved the precise recognition of facial expressions, attaining up to 87% accuracy with the newly devised features and a multiclass SVM classifier. Liew et al. proposed five feature characteristics for FER and compared their performances using different classifiers and datasets (KDEF, CK+, JAFFE, and MUG [
40]). Among the Gabor (GABOR), Haar [
41], LBP, and histogram of oriented gradients (HOG) [
42] features, the HOG features performed best for FER at higher image resolutions (above 48 × 48 pixels), averaging 80% accuracy across these datasets [
43].
Many researchers have found that the most straightforward approach to classifying emotions is to use CNN models. CNNs are well suited to image tasks because they can efficiently capture features at multiple levels and recognize patterns and objects in images regardless of their position or size. Thakare et al. used several classifiers, such as a ConvNet, an RF, and Extreme Gradient Boosting (XGBoost) [
44], with the CNN model ConvNet consistently yielding the highest accuracy in emotion classification [
45]. In another study, researchers proposed a novel FER approach that integrates a CNN with image edge detection to bypass traditional feature extraction. This method involves normalizing facial images, extracting their edges, and merging the edge information with the extracted features to preserve the structural composition. Subsequently, the implicit features are reduced using maximum pooling, followed by softmax classification for emotion recognition. Testing on the FER2013 [
46] and LFW datasets [
47] resulted in an average emotion detection rate of 88.56% with faster training, approximately 1.5 times quicker than comparative models [
48]. Badrulhisham et al. focused on real-time FER, employing MobileNet [
49] to train their model, achieving an 85% recognition accuracy for four emotions (happy, sad, surprise, disgust) on their custom dataset [
50]. Experimental validation across multiple databases and facial orientations yielded significant findings: an accuracy of 89.58% on the KDEF dataset, 100% on the JAFFE dataset, and 71.975% on the combined dataset (KDEF + JAFFE + SFEW). These results were obtained using cross-validation techniques to minimize bias.
In [
51], researchers explored visual emotion recognition in social media images by employing pre-trained VGG19, ResNet50V2, and DenseNet-121 architectures as their base. Through fine-tuning and regularization, these models demonstrated improved performances on Twitter images from the Crowdflower dataset, achieving accuracies of 73%, 75%, and 89%, respectively, with DenseNet-121 performing best. Furthermore, Subudhiray et al. investigated dual transfer learning for facial emotion classification, experimenting with pre-trained CNN architectures including VGG16, ResNet50, Inception ResNet [
52], Wide ResNet [
53], and AlexNet. By combining extracted feature vectors into various pairs and inputting them into an SVM classifier, this approach showed promising results in terms of accuracy, kappa, and overall accuracy compared to state-of-the-art methods across benchmark datasets such as JAFFE, CK+, KDEF, and FER2013 [
54]. Kaur et al. introduced FERFM, a novel approach using a fine-tuned MobileNetV2 [
55] for FER on mobile devices. A pipeline strategy was introduced, where the pre-trained MobileNetV2 architecture is fine-tuned by eliminating the last six layers and adding a dropout, max pooling, and dense layer. Using transfer learning from ImageNet, the method achieved an accuracy of 85.7% on the RGB-KDEF dataset. It surpasses VGG16 with faster processing at 43 ms per image and fewer trainable parameters, totaling 1,510,599 [
56]. In another study, they proposed a system that employs a CNN framework based on AlexNet features, achieving higher accuracy than other methods across various datasets, such as JAFFE, KDEF, CK+, FER2013, and AffectNet [
57]. Moreover, they showed that it is more efficient and requires fewer device resources than other state-of-the-art deep learning models such as VGG16, GoogleNet [
58], and ResNet [
59]. In another study, Zavarez et al. fine-tuned the VGG-Face Deep CNN model pre-trained for face recognition. The study investigated the impact of a cross-database approach [
60]. The results revealed significant accuracy improvements, with average accuracies of 88.58%, 67.03%, 85.97%, and 72.55% on the CK+, MMI, RaFD, and KDEF databases, respectively.
Puthanidam et al. proposed a hybrid facial expression recognition model combining image pre-processing and convolutional neural network (CNN) structures to enhance accuracy and reduce training time. Across various databases and facial orientations, the model achieved high accuracies, including 100% for the JAFFE dataset and 89.58% for the KDEF dataset [
61]. Chen et al. introduced the Attentive Cascaded Network (ACD) method, which enhances the discriminative power of facial expression recognition models by selectively focusing on important feature elements [
62]. By integrating multiple feature extractors with smooth center loss, ACD achieves intra-class compactness and inter-class separation, improving the generalization ability of the learning algorithm. In their experiment, the proposed method achieved a notable performance on the RAF-DB and KDEF datasets, with accuracies of 86.42% and 99.12%, respectively. In [
63], the researchers introduced a novel approach to facial expression recognition by combining deep metric loss and softmax loss in a unified framework, enhancing the performance by addressing intra- and inter-class variations. Using a generalized adaptive (N+M)-tuplet cluster loss function and identity-aware mining schemes, the proposed method achieved an accuracy of approximately 97.1% on the CK+ dataset and 78.53% on the MMI dataset [
64]. Dar et al. used EfficientNet-b0 for feature extraction and transfer learning due to its balance of accuracy and computational efficiency [
65]. They customized the EfficientNet-b0 architecture by incorporating Swish activation functions after every 2D convolution layer, enhancing performance through Swish's smooth, non-monotonic behavior (unbounded above and bounded below). The researchers assessed the effectiveness of their model across five varied datasets: CK+, JAFFE, FER-2013, KDEF, and FERG, reporting classification accuracies of 100%, 95.02%, 63.4%, 88.3%, and 100%, respectively. Zahara et al. proposed a system design that utilizes convolutional neural networks (CNNs) with the OpenCV library to predict and classify facial emotions in real time [
66]. Implemented on Raspberry Pi, the system comprises three main processes: face detection, facial feature extraction, and emotion classification. The Xception model achieved a prediction accuracy of 65.97% on the FER-2013 dataset for facial expression recognition. Minaee et al. introduced a deep learning approach employing attentional convolutional networks for facial expression recognition, surpassing previous models on various datasets, including FER-2013, CK+, FERG, and JAFFE, achieving accuracies of 70.02%, 98%, 99.3%, and 92.8%, respectively [
67]. Fie et al. introduced a novel deep neural network-based system for the early detection of cognitive impairment by analyzing the evolution of facial emotions in response to video stimuli. The system incorporates a facial expression recognition algorithm using layers from MobileNet and a Support Vector Machine (SVM), demonstrating satisfactory performances across three datasets: the KDEF dataset, the Chinese Adults Dataset, and the Chinese Elderly People Dataset [
68]. A significant amount of work has focused on employing transfer learning techniques with CNN models such as AlexNet [
69], SqueezeNet [
70], and VGG19, evaluating their efficacy on benchmark datasets including FER2013, JAFFE, KDEF, CK+, SFEW [
71], and KMU-FED. VGG19 demonstrated a notable performance, achieving 99.7% accuracy on the KMU-FED database and competitive results across the other benchmark datasets. Specifically, VGG19 attained accuracies of 98.98% for the CK+ dataset, 92.99% for the KDEF dataset with all data variations, 91.5% for the selected KDEF Frontal View dataset, 84.38% for JAFFE, 66.58% for FER2013, and 56.02% for SFEW [
72]. Bialek et al. explored emotion recognition through convolutional neural networks (CNNs), proposing various models, including custom and transfer learning types as well as ensemble approaches, alongside modifications to the FER2013 dataset. Emotion classification was examined using both multi-class and binary approaches, with results and comparative analyses provided for the different methods and models [
22]. In [
73], the authors proposed a method for facial expression recognition that concatenates spatial pyramid Zernike moment-based shape features with Laws' texture features, capturing both macro and micro details of facial expressions. Using multilayer perceptron (MLP) and radial basis function feed-forward artificial neural networks, the method achieved high recognition accuracy, with average rates of 95.86% and 88.87% on the JAFFE and KDEF datasets, respectively.
4. Implementation
In this section, we describe the datasets and the deep CNNs used in our work.
4.1. Datasets and Augmentation Techniques
The KDEF dataset comprises 4900 color images depicting various human facial emotions. Additionally, the Averaged KDEF (AKDEF) dataset consists of averaged images derived from the original KDEF photos. Both the KDEF and AKDEF were created in 1998 and have since been made freely available to the academic community. Over the years, the KDEF has become widely used in research, with over 1500 publications using its data. The KDEF dataset encompasses seven distinct emotion classes: anger, neutral, disgust, fear, happy, sad, and surprise. Each image in the dataset is carefully labeled to denote the specific emotion portrayed by the individual. The images are in RGB format with a resolution of 224 × 224 pixels.
The CK+ dataset [
23] serves as a prominent benchmark dataset in the field of facial expression recognition research. It comprises a total of 981 images collected from 123 subjects, with each sequence depicting one of seven facial expressions:
anger, contempt, disgust, fear, happy, sadness, and surprise. These expressions were elicited using the Facial Action Coding System (FACS), a standardized method for analyzing facial movements.
Each sequence within the CK+ dataset begins with a neutral expression, transitions to the target expression, and concludes with a return to the neutral expression. The images are captured under controlled laboratory conditions and are presented in a grayscale format, with a resolution of 640 by 490 pixels [
77].
The original FER2013 dataset comprises 35,887 grayscale images, each depicting cropped faces with dimensions of 48 × 48 pixels. These images are categorized into seven emotions: angry, disgust, fear, happy, neutral, sad, or surprise.
One notable aspect of the FER2013 dataset is its class imbalance, where the number of images varies significantly across emotion categories. Despite this, the dataset captures a diverse range of facial expressions encountered in real-life scenarios, including variations in lighting conditions, camera distance, and facial poses. The individuals depicted in the images represent diverse demographics, encompassing differences in age, race, and gender. Additionally, the dataset exhibits variations in the intensity of expressed emotions.
To address issues such as non-class-associated photos or non-face images, a filtered version of the FER2013 dataset was created by Bialek et al. [
22]. This involved manual cleaning, removing images that did not correspond to any specific emotion category or were not depicting faces. Furthermore, instances of mislabeling were corrected, ensuring that images were assigned to the appropriate emotion group.
We used various data augmentation techniques to enhance the diversity of the training dataset of the CK+ and KDEF and improve the generalization performances of the models. During the preprocessing phase, we applied augmentation methods such as rotation, horizontal shifting, and vertical shifting to the input images. Specifically, we configured the rotation_range parameter to allow random rotations within a range of −20 to +20 degrees. Additionally, we used width_shift_range and height_shift_range to introduce random horizontal and vertical shifts to the images, respectively, with a maximum displacement of 20% of the total width and height. Furthermore, we enabled horizontal flipping using the horizontal_flip parameter to further increase dataset variability. To ensure seamless augmentation, we used the ’nearest’ fill mode to interpolate pixel values for newly created pixels. Lastly, we applied pixel normalization by rescaling the pixel values of all images to a range between 0 and 1 using the rescale parameter, aiding in the convergence of the model during training.
For the FER2013 dataset, we employed different augmentation techniques compared to the KDEF and CK+ datasets. Specifically, we applied a rotation range of 10 degrees clockwise or counterclockwise, along with horizontal flipping. Additionally, we utilized a zoom range of [1.1, 1.2], allowing for random zooming between 1.1× and 1.2× the original size during training.
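A minimal sketch of the two generator configurations described above is given below, using Keras's ImageDataGenerator; the parameter values follow the text, while the directory path, target size, and batch size in the commented call are hypothetical.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# KDEF / CK+: rotation, shifts, horizontal flip, nearest-neighbour fill, rescaling
kdef_ck_datagen = ImageDataGenerator(
    rotation_range=20,          # random rotations in [-20, +20] degrees
    width_shift_range=0.2,      # horizontal shifts up to 20% of the width
    height_shift_range=0.2,     # vertical shifts up to 20% of the height
    horizontal_flip=True,
    fill_mode="nearest",
    rescale=1.0 / 255,          # normalize pixel values to [0, 1]
)

# FER2013: lighter rotation, horizontal flip, and random zoom in [1.1, 1.2]
fer2013_datagen = ImageDataGenerator(
    rotation_range=10,
    horizontal_flip=True,
    zoom_range=[1.1, 1.2],
    rescale=1.0 / 255,          # assumed; rescaling is not stated for FER2013
)

# train_generator = kdef_ck_datagen.flow_from_directory(
#     "data/kdef/train", target_size=(224, 224), batch_size=16,
#     class_mode="categorical")  # hypothetical path and batch size
```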
Regarding the data split, we adopted an 80% training, 10% validation, and 10% test split for both the CK+ and KDEF datasets. For the KDEF dataset specifically, we used three facial postures instead of the original five, resulting in 420 images per class, and this yielded 2940 images in total. For the filtered FER2013 dataset, we preserved the identical data split as described by Bialek et al. [
22], comprising training (27,310), validation (3410), and test (3420) sets.
Table 1 presents the counts of training, testing, and validation samples for the datasets used in our experiment. All experiments were conducted using subject-independent data across all datasets.
4.2. Experimental Setup
This work was conducted within a Docker environment, using the NVIDIA RTX 2080 GPU on a Windows 10 Education 64-bit system. The system was equipped with 32 GB of RAM and an Intel Core i7-9700k 3.60 GHz CPU. The transfer learning model was developed and executed using the Keras Python library (
https://keras.io/api/ (accessed on 23 January 2024)). Visualizations of the results were generated using the matplotlib library (
https://matplotlib.org/ (accessed on 23 January 2024)) and the seaborn library (
https://seaborn.pydata.org/ (accessed on 23 January 2024)). Additionally, the SciKit-learn library (
https://scikit-learn.org/stable/about.html (accessed on 23 January 2024)) was used to compute the evaluation metrics.
4.3. VGG Architectures
The task of FER was based on fine-tuned transfer learning from pre-trained VGG16 and VGG19 architectures. The VGG architecture, developed by Simonyan and Zisserman, achieved significant success as the runner-up in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2014. This architecture was chosen for several reasons: (i) its demonstrable success on a variety of image classification tasks, (ii) the native support provided by Keras, which offers pre-trained models with publicly available weights, and (iii) its simpler implementation.
In contrast, more complex models such as ResNet50 or DenseNet121 might be prone to overfitting given the smaller dataset sizes. We employed VGG16 and VGG19, pre-trained on the ImageNet dataset, for classification. In our approach, we unfroze the last four layers of VGG16 and the last five layers of VGG19, while freezing the remaining layers. This strategy allowed us to update the pre-trained ImageNet [
78] weights with the weights learned from our specific datasets, enhancing the model’s ability to learn more effectively.
Following the basic VGG architecture, we added a global average pooling layer, a dropout layer, a dense layer with ReLU activation, another dropout layer, and, finally, a dense layer with softmax activation to classify the emotions. The proposed VGG16 and VGG19 model architectures are detailed in
Table 2 and
Table 3, respectively.
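As an illustration, a minimal sketch of the fine-tuned VGG16 variant is shown below, assuming the Keras applications API; the width of the hidden dense layer and the exact dropout and L2 values are placeholders, with the definitive configurations given in Table 2 and Table 4.

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import VGG16

def build_fer_vgg16(input_shape=(224, 224, 3), n_classes=7,
                    dropout=0.5, l2_penalty=0.01, hidden_units=256):
    """VGG16 base with the last four layers unfrozen and a small classification head.
    hidden_units is illustrative; see Table 2 for the exact architecture."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)

    # Freeze everything except the last four layers of the base network.
    for layer in base.layers[:-4]:
        layer.trainable = False

    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dropout(dropout)(x)
    x = layers.Dense(hidden_units, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_penalty))(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs=base.input, outputs=outputs)

model = build_fer_vgg16()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The VGG19 variant differs only in the base network and in unfreezing its last five layers.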
For comparison purposes, our experiment was structured in three distinct setups. Firstly, we maintained all layers of the base VGG19 and VGG16 models frozen, without applying any histogram equalization. Secondly, we fine-tuned the VGG19 and VGG16 architectures by unfreezing the last four layers of VGG16 and the last five layers of VGG19, while keeping the remaining layers frozen. This setup also did not include any histogram equalization. Lastly, we incorporated histogram equalization into the final experimental setup, along with fine-tuning the models as described in the second setup. For the KDEF dataset, our initial step involved optimizing the data for histogram equalization by converting the images to grayscale. Subsequently, to obtain the RGB format suitable for a pre-trained VGG network, we further transformed the grayscale images into RGB format. During this process, each pixel in the grayscale image replicated its intensity value across all three color channels. A general overview of our framework can be found in
Figure 1.
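A minimal sketch of this grayscale equalization and channel replication step is shown below; OpenCV is assumed here, as the text does not name the library used.

```python
import cv2
import numpy as np

def equalize_to_rgb(bgr_image):
    """Convert to grayscale, apply histogram equalization, and replicate the
    equalized channel three times so the result fits a pre-trained VGG input."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    equalized = cv2.equalizeHist(gray)
    return np.stack([equalized] * 3, axis=-1)  # shape (H, W, 3)
```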
4.4. Hyperparameters
For training our models on different datasets, we used various hyperparameters to optimize the performance. These hyperparameters included the Input Size, Batch Size, Epochs, Learning Rate, Early Stopping, Learning Rate Scheduler, Dropout Rate, and L2 Regularization. The specifics of these parameters for different datasets are outlined in
Table 4.
For the CK+ dataset, the original images were resized to (224, 224, 3). We applied a dropout rate of 0.5 for regularization. Additionally, we implemented early stopping, which halts training if the validation accuracy does not improve for five consecutive epochs, to prevent overfitting.
We used the same data augmentation and resizing techniques on the KDEF dataset. We also used early stopping and a learning rate scheduler. If the validation accuracy did not improve for three consecutive epochs, the learning rate was multiplied by 0.5. If the accuracy did not improve for five consecutive epochs, training was stopped. The dropout rate for this dataset was set to 0.1, and L2 kernel regularization with a value of 0.01 was applied for optimal performance.
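The early stopping and learning-rate-reduction behavior described above maps naturally onto Keras callbacks; the sketch below assumes validation accuracy is the monitored quantity, and the epoch count in the commented fit call is illustrative.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop training when validation accuracy has not improved for 5 epochs.
    EarlyStopping(monitor="val_accuracy", patience=5),
    # Halve the learning rate when validation accuracy stalls for 3 epochs (KDEF setup).
    ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=3),
]

# history = model.fit(train_generator, validation_data=val_generator,
#                     epochs=50, callbacks=callbacks)  # epoch count illustrative
```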
For the FER2013 dataset, we conducted extensive experimentation with various learning strategies and optimizers. Through this analysis, we found that cosine annealing combined with the Adam optimizer yielded the most accurate results. During training,
initial_lr set the starting learning rate, while
T_max defined the duration of one cycle of the cosine annealing schedule. The learning rate decreased gradually from the initial value to a minimum over T_max epochs, following a cosine curve. Additionally, we explored different image sizes and were surprised to find that the transfer learning models performed exceptionally well when the images were resized to (144, 144, 3). In terms of batch size, we used a value of 64, which differed from the batch sizes used for the other datasets. Furthermore, we found that the optimal dropout rate and L2 regularization penalty were both 0.1 for this dataset. Regarding the training strategy, we initially employed an approach similar to that used for the CK+ dataset, wherein training would stop if the validation accuracy did not improve for five consecutive epochs. Subsequently, we saved the models and trained them again using cosine annealing for an additional 30 epochs. The initial learning rate was set to 0.0001 and then gradually decreased according to Equation (
3). Afterward, the model was trained using this gradually decreasing learning rate.
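Equation (3) is not reproduced in this section; the sketch below assumes the standard SGDR-style cosine schedule of [25], expressed as a Keras LearningRateScheduler, with the minimum learning rate and cycle length treated as assumptions on our part.

```python
import math
from tensorflow.keras.callbacks import LearningRateScheduler

initial_lr = 1e-4   # starting learning rate used for the additional FER2013 training
min_lr = 0.0        # assumed floor for the schedule; not stated explicitly in the text
T_max = 30          # assumed cycle length in epochs, matching the additional 30 epochs

def cosine_annealing(epoch, lr):
    """Cosine decay from initial_lr towards min_lr over T_max epochs."""
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * epoch / T_max))

cosine_callback = LearningRateScheduler(cosine_annealing)
# model.fit(train_generator, validation_data=val_generator,
#           epochs=T_max, callbacks=[cosine_callback])
```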
6. Discussion
This section of our paper highlights key notations and compares our models with existing works in the field. We delve into specific notations crucial for understanding our approach and provide a thorough comparison of our models with those previously established in the literature.
The comparison between models with all layers frozen and fine-tuned models, both without histogram equalization, reveals significant differences in performance across datasets. For instance, on the KDEF dataset, VGG19 with all layers frozen and no histogram equalization achieved an accuracy of 54.76%, whereas the fine-tuned VGG19 without histogram equalization achieved nearly 94.22%. This variation underscores the limitations of freezing all layers, as it limits the model's adaptability to the new task/domain by preserving fixed feature extraction mechanisms. Conversely, unfreezing the last five layers allows for fine-tuning, enabling the model to learn task-specific representations and enhance performance, using the pre-trained weights while accommodating adjustments to suit the new task.
Furthermore, we compared the models' performances after applying histogram equalization. Notably, all hyper-tuned models exhibited slightly improved performances with histogram equalization. As indicated in
Table 5, hyper-tuned models using histogram-equalized images showed a 0.3–2% improvement on average compared with their counterparts without histogram equalization. However, an exception was observed on the CK+ dataset, where both hyper-tuned models, with and without histogram equalization, exhibited similar performances.
In the case of the Filtered FER2013 dataset, one interesting observation is the performance obtained with the image size set to 144 × 144. Despite the original VGG16 and VGG19 models being trained on 224 × 224 × 3 ImageNet images, the 144 × 144 input size showed promising results across all experiments. This deviation from the conventional input size might be explained by the degradation in image quality that occurs when the 48 × 48 FER2013 images are upsampled to larger dimensions: the existing pixels are stretched to fill the new dimensions, which can lead to blurriness or pixelation. This loss of quality may have been mitigated by selecting an intermediate size of 144 × 144, which better preserved the original image content while maintaining a balance between image quality and resolution.
Moreover, when we applied cosine annealing to the fine-tuned VGG16 model with histogram equalization, the model's accuracy initially plateaued at around 67.57% on the FER2013 dataset. However, after reloading the model and running it for an additional 30 epochs with cosine annealing, the accuracy improved to 69.65%. It is worth examining which specific properties of the cosine annealing schedule differ from those of a normal training cycle and lead to this improvement. Normal training cycles involve updating model parameters over multiple epochs to minimize loss and improve performance, focusing on data-driven optimization. In contrast, cosine annealing specifically regulates the learning rate schedule within each training cycle, dynamically adjusting it over epochs according to a cosine curve. While normal training cycles primarily target parameter updates, cosine annealing aims to optimize the learning rate schedule, potentially enhancing training efficiency and model performance. This adaptive learning rate scheduling allowed the model to escape local minima and explore the parameter space more effectively, ultimately leading to improved performance.
When handling a small dataset like CK+, models like VGG16 and VGG19 can be prone to overfitting due to their extensive parameter counts. However, employing regularization and careful training strategies, coupled with data augmentation, enables the models to be trained effectively with this limited data. This effectiveness can be seen in
Figure 3, where the model effectively identifies crucial features in facial data across various classes, showcasing its robustness even with limited data.