Article

Transfer Learning for Facial Expression Recognition

by Rajesh Kumar 1,*, Giacomo Corvisieri 2, Tullio Flavio Fici 2, Syed Ibrar Hussain 1, Domenico Tegolo 1,* and Cesare Valenti 1,*
1 Dipartimento di Matematica e Informatica, Università degli Studi di Palermo, Via Archirafi 34, 90123 Palermo, Italy
2 Italtel S.p.A., Viale Schiavonetti 270/F, 00173 Rome, Italy
* Authors to whom correspondence should be addressed.
Information 2025, 16(4), 320; https://doi.org/10.3390/info16040320
Submission received: 22 January 2025 / Revised: 25 March 2025 / Accepted: 14 April 2025 / Published: 17 April 2025

Abstract

Facial expressions reflect psychological states and are crucial for understanding human emotions. Traditional facial expression recognition methods face challenges in real-world healthcare applications due to variations in facial structure, lighting conditions and occlusion. We present a methodology based on transfer learning with the pre-trained models VGG-19 and ResNet-152, and we highlight dataset-specific preprocessing techniques that include resizing images to 124 × 124 pixels, augmenting the data and selectively freezing layers to enhance the robustness of the model. This study explores the application of deep learning-based facial expression recognition in healthcare, particularly for remote patient monitoring and telemedicine, where accurate facial expression recognition can enhance patient assessment and early diagnosis of psychological conditions such as depression and anxiety. The proposed method achieved an average accuracy of 0.98 on the CK+ dataset, demonstrating its effectiveness in controlled environments. However, performance varied across datasets, with accuracy rates of 0.44 on FER2013 and 0.89 on JAFFE, reflecting the challenges posed by noisy and diverse data. Our findings emphasize the potential of deep learning-based facial expression recognition in healthcare applications while underscoring the importance of dataset-specific model optimization to improve generalization across different data distributions. This research contributes to the advancement of automated facial expression recognition in telemedicine, supporting enhanced doctor–patient communication and improving patient care.

1. Introduction

In healthcare systems, emotions play a vital role in shaping overall well-being. Optimal emotional health enhances quality of life, while poor emotional health can lead to social or mental health challenges. By analyzing video communication, we can better recognize individuals’ emotional states, improving our ability to address emotional well-being in healthcare settings.
The field of artificial intelligence, particularly cognitive computation, focuses on understanding human mental processes, including emotions, which are key indicators of social behavior and mood [1]. However, terms such as sentiment analysis, image sentiment analysis and visual emotion analysis need clarification: while sentiment analysis typically deals with text, visual emotion analysis focuses on interpreting emotions from visual data, such as facial expressions. This study adopts visual emotion analysis as the primary term for analyzing emotions in video-based data.
In recent years, methods for developing computational models for identifying and classifying human emotions have been investigated in the field of cognitive computation, with many applications [2,3,4]. Facial expression recognition (FER) classifies emotion in images [5] using image sentiment analysis, which recognizes and obtains internal expressions by analyzing the sentiments depicted in a digital image. A visual emotion analysis aims to find the sentiment polarity (i.e., positive, neutral, or negative) in an image. Visual emotions have been classified either according to these polarities (equivalently, the two sensations positive and negative) or into seven discrete categories. As social networks grow quickly, we publish words, images, audio, videos and microblogs to express our emotions. Usually, images with bright colors convey pleasant (positive) feelings, while images with dark colors convey unpleasant (negative) feelings [6,7,8,9]. Because images contain a lot of visual information that can be employed for various functions, they carry an abundance of emotional semantics in addition to words and sounds. Sentiment analysis extracts emotions from photos using text mining and natural language processing [9]. Sentiment analysis assesses whether a visual input produces a positive or negative emotional response and then classifies the responses appropriately [10,11].
Deep learning is a subset of machine learning that utilizes artificial neural networks with multiple layers to learn hierarchical representations from data. These neural networks, particularly deep architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are capable of automatically extracting features from raw input without manual feature engineering. Deep learning models excel in tasks such as image recognition, natural language processing and signal processing by learning complex patterns and relationships through iterative training on large datasets. In the context of image analysis, CNNs apply convolutional operations to detect spatial hierarchies of features, making them effective for tasks such as facial expression recognition and sentiment analysis [12].
Numerous approaches for image sentiment analysis have been proposed, focusing on extracting emotional cues from visual data such as facial expressions, images and videos. These approaches are broadly categorized into machine learning-based methods and lexicon-based methods. Machine learning-based methods, particularly deep learning, have gained significant attention due to their ability to automatically learn hierarchical features from large-scale datasets. Pre-trained CNNs such as VGG-19, ResNet50V2 and DenseNet-121 have demonstrated effectiveness in capturing spatial and contextual information from images for sentiment analysis [12].
Recent studies have demonstrated the effectiveness of fine-tuned deep learning models in image sentiment analysis. For example, ref. [13] employed transfer learning techniques with VGG-19, ResNet50V2 and DenseNet-121, optimizing performance by freezing and unfreezing specific layers and applying regularization techniques to mitigate overfitting. Their approach improved accuracy by 5–10% compared to previous visual sentiment analysis methods. Similarly, ref. [14] introduced a deep learning-based multimodal approach that combines visual and textual features using recurrent neural networks to improve sentiment classification accuracy. This highlights the potential of integrating multiple data sources for better sentiment prediction. These advancements in CNNs and RNNs address challenges such as illumination variations, occlusions and diverse emotional expressions, thereby improving the robustness of image sentiment analysis [15,16].
Deep learning adopts a multilayer strategy with several hidden layers in the network. In conventional machine learning procedures, characteristics must be established and extracted via feature selection methods; deep learning models achieve improved precision and efficiency because features are acquired and extracted automatically [15,16].

Datasets and Model Overview

To address the challenges of FER, this study utilizes three widely used datasets, namely, the Extended Cohn–Kanade (CK+) dataset [17], the Facial Expression Recognition 2013 Dataset (FER2013) [18] and the Japanese Female Facial Expression (JAFFE) dataset [19], for emotion detection and classification. These datasets were chosen because they are widely used benchmarks in state-of-the-art facial expression recognition research and allow a meaningful evaluation against existing models with respect to accuracy, precision, recall and F1-score (see definitions in Section 5). While numerous facial expression datasets are publicly accessible, such as AffectNet [20], PEDFE [21], CEPS [22], EmotioNet [23] and UIBVFED-Mask [24], we focused on these three datasets to ensure a fair comparative analysis. Further details about the datasets and preprocessing steps are discussed in Section 4.2.
Deep learning methods have boosted FER performance in recent years [25]; a deep convolutional neural network with supplemental channels for processing local data was proposed in [26]. Recently, deep learning has been used to predict emotions in photos. A pre-trained convolutional neural network model was utilized by [27] to create features and train classifiers using those features. The connection between emotional traits and visual attention has also been investigated, highlighting the importance of deep learning in understanding human emotions [28]. The well-known VGGNet model was used to assess the effectiveness of the suggested strategy [29].
In this study, we employ two state-of-the-art deep learning models, namely, VGG-19 and ResNet-152, both fine-tuned using transfer learning for facial expression recognition. These models were chosen for their proven effectiveness in image classification tasks and their ability to handle complex visual data. ResNet, whose name is an abbreviation for Residual Network, is used here in its 152-layer variant, ResNet-152. ResNet creates residual connections among several layers during training, thereby helping to decrease errors, retain acquired information and improve efficiency [30]. Using a pre-trained model increases accuracy and significantly reduces computational time and resources. To enhance efficacy in training and learning, transfer learning is often employed in image processing; one fundamental principle of transfer learning involves transferring knowledge from domains closely associated with the task at hand to diverse target domains [31,32]. With three fully connected layers and sixteen convolutional layers, VGG-19 is a 19-layer convolutional neural network. VGG-19 was trained on millions of images from 1000 categories of the ImageNet database. Because each convolutional layer uses several fast 3 × 3 filters, this architecture is frequently used for image classification.
The issue of recognizing facial expressions from photos was tackled in [33]; the accuracy was higher with the CNN approach. A network-based model for recognition of facial emotions was proposed in [34]. It converts an image of a face into a graph by fusing fixed and random points; it has been noted that this model built on static and arbitrary points performs more efficiently than a basic CNN.
Additionally, more datasets can be used to improve performance. Ensemble techniques with rapid images captured by several cameras can further enhance the results. A CNN system that uses a state-of-the-art deep learning model to identify emotions in images has been introduced in [35]. In a supervised learning technique, features are extracted and classified using deep learning models and cutting-edge classifiers. Recently, the VGG-19 image recognition algorithm has demonstrated encouraging outcomes.
Image sentiment assessment has attracted much attention lately as a multidisciplinary tool of cognitive science, pattern recognition and computer vision. Emotional intelligence is critical to one’s growth and development [36]. Sentiment can be predicted using a visual sentiment analysis [37]. At present, users upload millions of photos per day to social media platforms. These pictures are essential for expressing people’s emotions on online social networks. Sentiment analysis has been explored in depth in a seminal study [38]. Meena [39] used pre-trained CNN models for the identification and classification of sentiment. The authors’ most notable results were a recall R = 0.93 and a precision P = 0.94, which were higher than the other reported results. A novel architecture has been created here using transfer learning, where pre-trained CNN models (VGG-19 and ResNet-152) are fine-tuned on facial expression datasets to improve classification performance with limited training data. CK+, JAFFE and FER2013 are the datasets used in this study for evaluating the effectiveness of transfer learning in FER. Indeed, we also conduct a comparative analysis of our methodology against other models developed on the same datasets. The proposed study is noteworthy for the following reasons:
  • It uses a cutting-edge CNN model to identify emotions from facial expressions;
  • It includes a suitable number of layers in the CNN model for effective emotion detection and classification from facial images;
  • It aids in the creation of more sophisticated real-world applications for multimodal (visual) expression-based emotion detection systems;
  • Deep learning models, including VGG-19 and ResNet-152, are used to categorize various facial emotions, including neutral, happy, sad and angry.
This paper is organized as follows: Section 2 discusses literature about the existing studies. Section 3 provides details of key operational principles of the proposed framework. The approaches we developed and our research methodology are covered in Section 4, and the overall results and discussion are presented in Section 5. Our findings and conclusions are in Section 6.

2. Related Work on Facial Expression Recognition

Human faces can express a wide range of emotions that everyone can understand. Facial expression recognition has witnessed broad application across diverse domains, including driver fatigue monitoring, assistive robotics, human–computer interaction, digital entertainment and medical assistance. Facial expressions are the core of affective computing systems, and emotions are intuitively displayed by facial expressions [40]. Some applications use facial recognition technology to add extra security or protect private information [41]; other applications use it in psychology to identify signs of depression or anxiety in patients and in healthcare for understanding patients’ emotions, which can aid in early intervention and treatment so that better care can be provided [42]. To ascertain how someone is feeling, we use facial emotion detection to identify expressions such as sadness, happiness, surprise, anger or fear. This is important for businesses to achieve their marketing goals [43].
Ekundayo and Viriri conducted a review of FER methods and identified three distinct machine learning problem definitions: single-label learning to treat FER as a multiclass problem, multilabel learning to resolve the ambiguity inherent in FER and label distribution learning to recover the distribution of emotion in FER data annotation [44].
The process of traditional expression recognition involves three primary steps: image preprocessing, feature extraction and classification. Among these, feature extraction holds paramount importance, as it directly impacts facial expression recognition accuracy. Traditional expression feature extraction depends on various statistical features of pixel values, including those of facial images. Examples of this include principal component analysis [45], local binary patterns [46] and the Gabor transform [47].
Pose variation and occlusion are the primary problems with facial expression recognition. Specific techniques use the features retrieved from viewable parts to reassemble the occluded facial regions; nevertheless, they primarily rely on precise facial landmark identification, which remains a challenging task [35]. In order to address the problem of pose fluctuations in facial photographs, Zhang et al. used noisy web data to enhance the performance of models in a weakly supervised way [48]. Poux et al. recreated face occluded areas in the optical flow domain by creating an auto-encoder with skip connections [49,50].
Recent research highlights the crucial role of nonverbal communication in human interaction, with facial expressions conveying 55% of emotional information, voice tone 38% and verbal content just 7% [51]. Tang et al. suggested building numerous multiplier stages with a summarizing layer and developing multiplication kernels in order to learn frequency-domain information [52]. Expression intensity is another factor to consider, as subtle expressions may not be easily recognized [53].
A facial emotion identification system based on fuzzy multiclass support vector machines with biorthogonal wavelet entropy was proposed in [54]. Siqueira et al. investigated ensembles with shared representations to significantly minimize computational load redundancy by adjusting the branching level [55]. Xie et al. created triplet loss using numerous phases of anomaly reduction and class-pair margins to identify and exclude expression photographs with occlusion or significant head postures; this allowed them to create innovative loss functions [56]. An extensive responsive multi-path CNN based on the concentration of prominent expression regions and multi-path variation suppression to autonomously identify expression-related regions to learn excellent features was described in [57]. Deep learning has become a research hotspot in expression analysis that can effectively solve the sensitivity problems, that is, posture, illumination and occlusion [58]. The most recent deep learning-based methods are provided in [59].
Different models’ performances on various datasets can vary. A deep multi-task learning model achieved an accuracy of 0.976 on the CK+ dataset, which is an excellent performance in facial expression recognition [60]. This model is able to leverage both important spatial traits (by class label) and local spatial distribution using a Siamese network with shared weights that are adjusted on the fly by an adaptive re-weighting module. This improves performance especially when very few training samples are available. In experiments, this methodology performed better than several state-of-the-art models on various datasets, suggesting that it is robust, though a slight decrease in accuracy upon combining all datasets was observed. The so-called miniXception architecture showed promising results on the FER2013 dataset that may lead to widespread adoption [61]. This architecture is highly time- and space-efficient, especially in scenarios where computing resources for emotion detection are limited. The model achieved top performance across 15 configurations of eight networks, each trained on frontal images, even if the overall accuracy was just 0.636. Dense_FaceLiveNet was introduced as a CNN model for the task of facial expression recognition [62]. Tests on the JAFFE dataset showed an accuracy equal to 0.907, confirming its ability to recognize facial expressions effectively.
The pre-trained VGG model was first submitted by Oxford University to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014) [31]. Kusuma proposed a modified VGG-16 CNN for emotion classification, achieving an accuracy A = 0.69 on FER2013 [63]. VGG-19, used in this investigation, is one of these pre-trained VGG models.

3. DoctorLINK: Integrating Deep Learning for Facial Expression Recognition

DoctorLINK [64] is the telemedicine platform created by Italtel. This system was built starting from the needs of physicians and patients, considering off-the-shelf biomedical sensor technologies and the availability of video communication systems over IP networks (e.g., the Internet). Some features were also designed and tested with the help of national and European projects on telemedicine topics; our research focuses on emotion recognition during video consultations to detect patients’ facial expressions.
In addition to the ‘Telehealth’ and ‘Teleconsulting’ services, the system also provides ‘Telemonitoring’ aimed at the care of de-hospitalized or chronically ill patients, who are not in critical condition but need continuous monitoring.
The ‘Telehealth’ and ‘Teleconsulting’ services were implemented by embracing a proprietary video communication platform to facilitate and strengthen patient–physician interactions and are activated directly from the appropriate graphical interfaces. DoctorLINK also provides an “integrated home-care record” system, the “medicine reminder” service and a questionnaire management tool that physicians use to assess patients’ satisfaction with the care provided by the healthcare facility.
In this research, we focused specifically on the development and integration of a facial expression recognition module within the DoctorLINK platform, rather than working on the broader medical functionalities such as monitoring blood pressure, electrocardiogram readings, pulse oximeter waveforms, temperature, oxygen saturation levels, respiratory rate, breathing patterns and spirometry. This module utilizes deep learning techniques to analyze facial expressions as an indicator of patients’ psychological states, contributing to remote healthcare by providing an additional layer of emotional insight during consultations.
The integration of ResNet-152 and VGG-19 into DoctorLINK was achieved through a structured pipeline, where pre-trained models were fine-tuned on curated facial expression datasets to enhance accuracy in recognizing subtle emotional cues and subsequently embedded into the DoctorLINK video communication system using Python frameworks such as TensorFlow and Keras. The process involved preprocessing video frames to detect and isolate faces with OpenCV, leveraging ResNet-152 and VGG-19 for detailed facial feature extraction and optimizing the models for low-latency emotion detection during video consultations. The novelty of this work lies in the seamless incorporation of a deep learning-based model into DoctorLINK, enabling healthcare providers to automatically assess and interpret a patient’s emotional state.

3.1. Functional Principles of Remote Monitoring

The following measurements can be monitored based on the availability of proper sensors: blood oxygenation, blood glucose, heart rate, blood pressure, lung function, body temperature and body weight. It is also possible to include pulse oximeter waveforms and electrocardiogram examinations, and further clinical parameters may be considered in the future.
Through the site’s web applications, physicians manage and access patients’ home-care records, in which medical home-care data, diseases they suffered and diagnoses they have had over time are recorded and control readings transmitted by biomedical sensors. This information is necessary to define diagnosis and treatment; it also documents the clinical course of the pathology. Figure 1 shows the functional principles of the remote monitoring system.
Remote monitoring of clinical parameters is performed by automatic procedures that check for the presence of atypical values and by graphical interfaces that show the series of values detected by the sensors in tabular and graphic forms. The system is provided with a dashboard that highlights out-of-norm values rapidly and displays patient data by masks. With these features, physicians periodically inspect the transmitted values for abnormalities or events that can be considered related and requiring attention. In such events, the physician, through the built-in video-calling features, can perform additional checks, including scheduling an in-person visit or a teleconference with the patient during which she or he might decide to change the previously assigned diagnosis or therapy.

3.2. Platform Description, Usability, Security and Privacy

The software architecture of DoctorLINK v1.0 is based on microservices, with applications delivered in separate Docker containers that communicate over secure network links. The platform is also equipped with HL7 FHIR interfaces to communicate with hospital information systems and is designed to interface with the regional telemedicine system. DoctorLINK is provided in a high-availability configuration, with features that monitor the status of the platform’s hardware and software resources so that action can be taken in time, thus ensuring high levels of service uptime. The platform consists of a website, which can be installed on the cloud of certified providers, and an app to be installed on hardened tablets. In addition to maintaining the network connection with the site, the app manages the medical Bluetooth low energy sensors (risk class IIa) that the patient wears during the acquisitions.
Given the different types of users to whom the service is targeted, special care was taken in developing the usability requirements of the applications’ graphical user interfaces to be intuitive and simple for the users (patients and clinicians) who have to use them daily. To this end, usability tests were performed before, during and after application development. Because the system handles confidential data, it complies with the general data protection regulation [65] and provides security features that are verified during design, development and integration testing. Data and message transmissions are encrypted; checks are made not only on the source code of programs but also on automatic penetration tests, which are used in proof-of-concept facilities to check for known vulnerabilities.

4. Materials and Methods

This section outlines the methodological approach to detect facial emotions in healthcare using deep learning techniques. We substantially improved image recognition capabilities by leveraging the strengths of ResNet-152 and VGG-19 in extracting features and detecting patterns relevant to facial expression recognition.

4.1. Face Detection

The initial step of the proposed approach involves detecting faces [66]. Numerous techniques can be employed, such as Viola–Jones [11], MTCNN [67], MEDIPILINE [68], YOLO [69] and SSD [70]. For this project, we chose to implement the Viola–Jones algorithm due to several compelling advantages. Firstly, Viola–Jones is highly efficient and reliable, making it particularly well-suited for real-time applications where fast processing is crucial. Its robust performance in detecting frontal faces, even under diverse lighting conditions, ensures accurate results and consistent reliability. Additionally, the algorithm’s inherent simplicity and ease of implementation allowed for smooth integration into the DoctorLINK system, providing powerful and effective face detection capabilities. This choice enhances the overall efficiency and effectiveness of facial expression recognition in the remote setting of video calls. However, while Viola–Jones has distinct advantages, other face detection methods such as MTCNN, YOLO, SSD and MEDIPILINE offer complementary benefits. For instance, MTCNN is based on deep learning, excelling in detecting faces under various orientations and occlusions, but it is more computationally expensive and less suited for real-time applications compared to Viola–Jones. YOLO, known for its speed, performs face detection in a single pass, but may struggle with smaller faces or poor lighting. YOLO also requires significant hardware support, making it less ideal for resource-constrained environments. SSD provides a balance between speed and accuracy, handling faces across multiple scales well, but shares the same computational challenges as YOLO. MEDIPILINE is a consolidated method, although it lacks the proven real-time efficiency and simplicity of Viola–Jones.
Despite these alternatives, the Viola–Jones algorithm’s simplicity, speed and reliability make it an optimal choice for DoctorLINK, where real-time performance is paramount. To understand how Viola–Jones achieves such efficiency, it is important to delve into its underlying methodology. The algorithm operates by leveraging Haar-like features to identify key structural patterns in an image, such as the intensity difference between the eyes and cheeks or between the nose and surrounding areas. These features capture essential characteristics of a face. To calculate these features efficiently, the algorithm uses an integral image representation, which allows the sum of pixel intensities within any rectangular region to be computed in constant time, significantly speeding up the process.
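To make the face detection step concrete, the following minimal sketch (our own illustration, assuming OpenCV’s bundled frontal-face Haar cascade; the scaleFactor and minNeighbors values are typical defaults rather than the settings used in DoctorLINK) shows how Viola–Jones detection can crop face regions from a video frame:

```python
import cv2

# Viola-Jones (Haar cascade) frontal-face detector shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(frame_bgr):
    """Return a list of cropped face images found in a BGR video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor/minNeighbors are illustrative, not the paper's exact settings.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                     minSize=(60, 60))
    return [frame_bgr[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```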
We implemented the CNN architecture by applying different deep-learning tasks:
  • Preprocess the input images (sequential images) from the dataset through filtering and normalization techniques.
  • Separate training and testing phases: the former includes balancing and training the images; the latter converts the images to feature vectors and defines the trained model.
  • Define the CNN architecture by specifying the training schemes and choosing the appropriate activation function; testing and validation produce results in the form of seven facial expression classes.
  • Present the accuracy, precision, recall and F1-score (Section 5) of the overall methodology.
The workflow includes the following steps: data collection, preprocessing, feature extraction, model validation and sentiment categorization.

4.2. Datasets and Settings

The Extended Cohn–Kanade (CK+) dataset is frequently used and consists of 593 sequences of lossless grayscale face images of 640 × 490 pixels from 123 subjects, males and females, ranging from 18 to 50 years of age and with a variety of heritages, representing happiness, sadness, fear, contempt, anger, disgust and surprise. The Japanese Female Facial Expression (JAFFE) dataset contains 213 lossless images from 10 Japanese female subjects, each portraying seven emotions: angry, disgust, fear, happy, sad, surprise and neutral. These grayscale images were captured under controlled conditions with a resolution of 256 × 256 pixels. Lastly, the Facial Expression Recognition 2013 Dataset (FER2013) includes 35,887 lossy grayscale images with a resolution of 48 × 48 pixels, divided into training (28,709) and testing (3589) sets and labeled as anger, disgust, fear, happiness, sadness, surprise or neutral emotion. FER2013 was created by using a crowdsourcing approach and images were collected from the Internet, resulting in a diverse representation of ethnicities and ages.
The image dimensions were resized to 124 × 124 pixels using bicubic interpolation and flipped horizontally so that every picture included in the training set shares the same properties (see Table 1). This was accomplished in the preprocessing stage by flipping and rescaling each image to match the input shape requirements of the proposed model based on VGG-19 and ResNet-152. Figure 2 shows some sample images from the used datasets. For non-square datasets such as CK+, we first resized the shorter side to 124 pixels while maintaining the aspect ratio, then applied center cropping on the longer side to preserve facial features. This ensured minimal distortion while maintaining consistency across datasets.
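As an illustration of this resizing step, a minimal sketch (using OpenCV; the function and variable names are ours) of the shorter-side resize with bicubic interpolation followed by center cropping could look as follows:

```python
import cv2

TARGET = 124  # input size used in this study

def preprocess(image, interpolation=cv2.INTER_CUBIC):
    """Resize the shorter side to 124 px (bicubic), then center-crop to 124x124."""
    h, w = image.shape[:2]
    scale = TARGET / min(h, w)
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=interpolation)
    h2, w2 = resized.shape[:2]
    top = (h2 - TARGET) // 2
    left = (w2 - TARGET) // 2
    return resized[top:top + TARGET, left:left + TARGET]
```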
A pandas dataframe [71] is used to process image data. During training, data augmentation techniques such as horizontal flipping are applied to increase variability and improve model generalization. However, flipping is not applied to the test set to ensure unbiased evaluation. Resizing to 124 × 124 pixels was performed consistently for both training and test images before any augmentation. Rescaling (normalization) was applied to the test set to ensure consistency with the preprocessing used in training. Stratified sampling is used to ensure that the class proportions are preserved in the training and test sets. Dataset preparation requirements differ depending on the adopted technique; as is common practice, we concentrated on optimizing the computational architecture. We randomly (uniformly) allocated 80% of the data for training and 20% for testing, with 10% of the training data further used as a validation set to tune the optimal number of epochs.
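A possible sketch of the stratified 80/20/10 split and the train-only horizontal-flip augmentation is given below; the arrays here are random placeholders, and the names (images, labels, train_gen, test_gen) are our own rather than code from the study:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder data standing in for the preprocessed 124x124 RGB images and labels.
images = np.random.randint(0, 256, size=(200, 124, 124, 3)).astype("float32")
labels = np.random.randint(0, 7, size=200)  # seven emotion classes

# Stratified 80/20 train/test split, then 10% of the training data for validation.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.10, stratify=y_train, random_state=42)

# Flipping (augmentation) only on training data; the test generator only rescales.
train_gen = ImageDataGenerator(rescale=1.0 / 255, horizontal_flip=True)
test_gen = ImageDataGenerator(rescale=1.0 / 255)
```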
Our approach modifies the classifier layer of the VGG-19 architecture in multiple ways. One dense layer with a pair of neurons is used first, followed by a dropout layer to avoid overfitting. The classifying layer’s subsequent function is the Softmax activation function. In general transfer learning, pre-trained models such as VGG-19 and ResNet-152 are fine-tuned on new datasets by replacing the final classification layer and retraining a subset of the network. However, in this study, we introduced additional modifications to enhance performance for facial expression recognition. We applied dataset-specific preprocessing (resizing to 124 × 124 pixels, data augmentation). The initial layers of VGG-19 and ResNet-152, trained on large-scale datasets such as ImageNet, are effective at capturing low-level features such as edges, textures and basic shapes. We selectively froze these layers to preserve their ability to extract transferable general features which can be applied across different tasks. We then fine-tuned the later layers of the models to adapt them to the specific characteristics of the CK+, FER2013 and JAFFE datasets. These layers are responsible for learning high-level, task-specific features, such as facial expressions and emotional cues. To further enhance the model’s robustness and reduce overfitting, we incorporated dropout layers. Flattening layers were also added to ensure efficient feature aggregation. This approach enabled us to leverage the strengths of transfer learning while tailoring the models to the unique requirements of facial expression recognition. To address real-world factors such as variations in facial structures, lighting conditions and occlusions, we employed data augmentation techniques to enhance data diversity, including horizontal flipping, rotation, brightness adjustment and normalization. These techniques simulate real-world scenarios and improve the robustness of the model to diverse and challenging conditions.
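The following sketch illustrates this kind of transfer-learning head on a frozen VGG-19 base in Keras; the layer sizes follow Table 2 but should be read as illustrative assumptions rather than the exact configuration used in the study:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained VGG-19 base without its original classifier; frozen for feature extraction.
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(124, 124, 3))
base.trainable = False

# Custom head: flatten, dense layers, batch normalization, dropout, 7-way Softmax.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),
])
```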
All tests were carried out on a desktop machine with an Intel i7-9700 CPU and 32 GiB RAM, running Windows 10. TensorFlow 2.5 [72], Jupyter 3.0.14 [73] and Python 3.9 [74] were installed via Anaconda v2024.10-1 [75]. Better performance with VGG-19 has been found in [26,27,28,32,76]. Weights from earlier training sessions were used to develop the model. To meet our positive and negative categorization requirements, we fine-tuned the final classification layer of VGG-19 as part of the transfer learning process. The pre-trained convolutional layers were frozen to retain general feature extraction capabilities, while the classifier layer was customized to include a dropout layer and a flattening layer. The final dense layer was connected to three nodes representing the categories of neutral, negative and positive, with Softmax as the activation function for probabilistic outcomes. The model was trained on the FER2013, JAFFE and CK+ datasets. The Adam optimizer [77] was used to minimize the categorical cross-entropy loss function through frequent weight updates, ensuring convergence and improved accuracy over time. Hyperparameters such as the learning rate ($10^{-4}$) and batch size (32) were optimized, and validation measures, including accuracy, precision, recall and F1-score, were monitored during training to evaluate performance. Cross-validation was employed to ensure robustness and the final model was evaluated on an unseen test set to confirm its generalization capability.
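Continuing the sketches above (and reusing the hypothetical model, train_gen, X_train/X_val and y_train/y_val names introduced there), the training configuration described here could be expressed roughly as follows:

```python
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Adam with learning rate 1e-4, categorical cross-entropy and the monitored metrics.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(name="precision"),
                       tf.keras.metrics.Recall(name="recall")])

history = model.fit(
    train_gen.flow(X_train, to_categorical(y_train, 7), batch_size=32),
    validation_data=(X_val / 255.0, to_categorical(y_val, 7)),
    epochs=500)  # up to 500 epochs as in the study; reduce for a quick test
```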
Figure 3 illustrates the CNN architecture for facial expression recognition. It begins with an input layer that processes resized facial images, typically standardized to dimensions such as 48 × 48 pixels and represented as matrices of pixel values. The core of the architecture consists of multiple convolutional layers, each employing 3 × 3 filters to extract spatial features such as edges and textures. The number of filters increases progressively across the layers, starting with 64 in the first layer, followed by 128 and 256 in subsequent layers. After each convolutional layer, a max-pooling layer with a 2 × 2 window is applied to downsample the feature maps, reducing spatial dimensions while retaining essential information and improving computational efficiency. The feature maps from the final convolutional layer are flattened into a one-dimensional vector, which is then passed through fully connected layers to combine the extracted features into a high-level representation. The output layer utilizes a Softmax activation function to classify the facial expressions into seven categories: neutral expressions (neutral), happy and surprised (positive), sad, angry, disgusted and fearful (negative).
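A Keras sketch of the architecture in Figure 3 is shown below; the size of the fully connected head is an assumption for illustration, as it is not specified in the text:

```python
from tensorflow.keras import layers, models

# 48x48 grayscale input, 64/128/256 3x3 convolution blocks with 2x2 max pooling,
# flatten, fully connected head and a 7-way Softmax output.
cnn = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # head size assumed
    layers.Dense(7, activation="softmax"),
])
```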
Transfer learning, as employed in this study, involves fine-tuning pre-trained models such as ResNet-152 and VGG-19 on specific facial expression datasets (CK+, FER2013 and JAFFE). This approach adapts models trained on large-scale datasets (e.g., ImageNet) to our domain by modifying the classification layers and optimizing the parameters on the new datasets. This ensures that the models effectively learn features relevant to facial expression recognition while leveraging prior knowledge from the source domain.
Figure 4 presents the label distribution of the dataset using a bar chart to effectively illustrate the proportional representation of each emotion class. The chart highlights a significant label imbalance, with ’happy’ being the most frequent label and ’disgust’ being the least frequent. This imbalance can impact model performance, so data augmentation was used to mitigate its effects.
We fine-tuned the learning rate, the layer structure and the dropout value. Our objective was to achieve optimal performance while ensuring efficient model convergence. Our training process is the result of meticulous testing and experimentation. It unfolds over 500 epochs with a starting learning rate of $10^{-3}$. We settled on a single flatten layer, followed by a dense layer with a dropout layer and the Softmax function. To achieve state-of-the-art accuracy, we trained the model for 500 epochs and carefully determined the optimal number of epochs by monitoring performance on the validation set (10% of the training data). Through this process, we identified the point at which the model achieved its best generalization, balancing training accuracy and preventing overfitting. This approach ensured that we did not train the model excessively, which could lead to diminishing returns or overfitting, but instead optimized its performance based on the validation data and the model’s ability to generalize well.
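One way to realize this epoch selection in practice (a sketch under our own assumptions, not necessarily the exact mechanism used here) is to checkpoint the weights of the epoch with the best validation accuracy:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the weights from the epoch with the highest validation accuracy, so a
# long (up to 500-epoch) run effectively selects the best-generalizing model.
best_ckpt = ModelCheckpoint("best_weights.h5", monitor="val_accuracy",
                            save_best_only=True, save_weights_only=True)
# Passed as model.fit(..., epochs=500, callbacks=[best_ckpt]) in the training sketch above.
```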
Table 2 summarizes the layer structure and preprocessing steps used in VGG-19, including the number of layers at each stage and the application of batch normalization across multiple layers. We froze the VGG-19 convolutional base to retain its pre-trained feature extraction capabilities. The fully connected layers were replaced with a flatten layer, dense layers (256 and 512 neurons, ReLU), batch normalization and dropout (0.3) for stability and generalization. A final Softmax layer (7 classes) enables classification, allowing the model to efficiently adapt to facial expression recognition.
Table 3 presents the proposed ResNet-152 architecture, a deep CNN optimized for facial expression recognition. The model consists of 152 layers organized into five stages (conv1 to conv5), with increasing filters and decreasing spatial dimensions. To adapt it for our task, we replaced the original fully connected layers with a custom classifier, including dense layers (256 and 512 units), batch normalization and dropout for regularization. The final Softmax layer (7 units) classifies facial expressions into seven emotion categories. We thus followed for ResNet-152 the same fine-tuning approach used for VGG-19.
We selectively froze the initial layers of the pre-trained VGG-19 and ResNet-152 to retain their general feature extraction capabilities. As shown in Table 2 and Table 3, the initial convolutional layers (e.g., block1_conv1 and conv1) and early residual blocks (e.g., conv2_x) are frozen (i.e., non-trainable) to ensure that their weights remain fixed during training. These layers capture low-level features such as edges, textures and basic shapes, which are transferable across different tasks. Subsequent layers, including the fully connected layers (e.g., dense_12 and dense_13), are trainable, thus allowing them to adapt to the specific characteristics of the facial expression recognition task. This selective freezing aligns with the fixed-feature-extraction paradigm, in which the initial layers are kept fixed and only the classifier is replaced and fine-tuned. By leveraging this strategy, the model benefits from the robust feature representations of pre-trained models while adapting to the requirements of facial expression recognition.
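As an illustrative sketch of this selective freezing (the cut-off stage shown here is an assumption for the example; the layer names are those of the stock tf.keras ResNet-152), early stages can be kept fixed while later blocks remain trainable:

```python
import tensorflow as tf

# Freeze conv1, pool1 and the conv2_x residual stage of a stock Keras ResNet-152;
# later stages and the custom classifier head stay trainable.
backbone = tf.keras.applications.ResNet152(weights="imagenet", include_top=False,
                                           input_shape=(124, 124, 3))
frozen_prefixes = ("conv1", "pool1", "conv2")
for layer in backbone.layers:
    layer.trainable = not layer.name.startswith(frozen_prefixes)
```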

5. Results and Discussion

To evaluate the results of the proposed approach, we considered standard performance measures, including precision $P = \frac{TP}{TP+FP}$, recall $R = \frac{TP}{TP+FN}$, accuracy $A = \frac{TP+TN}{TP+TN+FP+FN}$ and F1-score $F_1 = \frac{2 \times P \times R}{P+R}$ (a short computational sketch follows the definitions below), where
  • TP: correctly predicted positive samples (true positives);
  • FP: incorrectly predicted positive samples (false positives);
  • FN: missed positive samples (false negatives);
  • TN: correctly predicted negative samples (true negatives).
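A minimal computational sketch of these four measures from the confusion counts (our own helper, not code from the study) is:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Precision, recall, accuracy and F1-score from the confusion counts defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Example: 90 true positives, 10 false positives, 5 false negatives, 95 true negatives.
print(classification_metrics(90, 10, 5, 95))  # (0.9, ~0.947, 0.925, ~0.923)
```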
Categorical cross-entropy [78] is a widely used loss function for multi-class classification tasks, including image classification tasks such as facial expression recognition, as well as any other machine learning task that involves classifying an example into one of many possible categories. This function computes the error by comparing the predicted probability distribution of the model with the true class labels:
$L = -\sum_{i} c_i \log p_i$
where $c_i$ represents the true class label in a one-hot encoded format (1 for the correct class, 0 for others) and $p_i$ is the corresponding predicted probability. The loss function is computed for each instance by summing over all possible classes, and the final loss is averaged over a batch of instances. A lower loss indicates that the model is making better predictions.
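The following small NumPy sketch (an illustration of the formula above, not code from the study) computes this loss for a batch and shows that a confident correct prediction yields a low value:

```python
import numpy as np

def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Average of L = -sum_i c_i * log(p_i) over a batch of instances."""
    p = np.clip(y_pred_probs, eps, 1.0)
    return float(np.mean(-np.sum(y_true_onehot * np.log(p), axis=1)))

# Example: true class is index 2; a confident correct prediction gives a low loss.
y_true = np.array([[0.0, 0.0, 1.0]])
y_pred = np.array([[0.05, 0.05, 0.90]])
print(categorical_cross_entropy(y_true, y_pred))  # ~0.105
```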
While tuning the trade-off between underfitting and overfitting, we empirically found that 500 epochs proved efficient, because more epochs would cause overfitting and fewer are insufficient for learning. More specifically, we monitored accuracy and loss during training and determined that this fixed epoch setting is particularly effective at stabilizing the model across all datasets and model types. This approach helped to reduce the amount of training time while preserving the performance of the models.
To make the choice of the number of epochs more reliable, the plots in Figure 5, Figure 6 and Figure 7 were obtained as the average of the results of ten experiments each. These plots also show the average and standard deviation; some small fluctuations are still observable due to the few images in the testing sets (especially for JAFFE). The number of epochs was derived by comparing our accuracy against the best results gained by methods already described in the literature. In the case of CK+ the goal A = 0.98 was obtained in [60]: Figure 5 highlights that 350 epochs for ResNet-152 are sufficient to reach that accuracy and that 100 epochs are enough for VGG-19. In the case of FER2013 the goal A = 0.64 was obtained in [61]: unfortunately, Figure 6 shows that ResNet-152 and VGG-19 were unable to achieve this target. In the case of JAFFE, the goal A = 0.91 was obtained in [62]: Figure 7 points out that ResNet-152 is unable to achieve that performance, though 450 epochs are enough for VGG-19. The lower accuracy on FER2013 is likely due to its diverse and noisy nature, as it contains images collected from the Internet with varying lighting conditions, poses and occlusions.
Although this approach may deviate from the more conventional fixed-epoch settings seen in some of the literature, it serves as a heuristic, empirical approach to maximizing model performance based on dataset-specific requirements alongside model capabilities. It is a common practice in machine learning research, allowing us to train models that produce the best results.
This study shows that the suggested VGG-19 model performs better than ResNet-152 on the CK+ dataset. Building a deep learning or transfer learning model requires careful consideration of several hyperparameters, aiming to maximize learning and minimize loss. The learning process can be optimized by adjusting a wide variety of hyperparameters, such as activation function selection, hidden units, learning rate and the number of iterations. Smaller batch sizes, momentum and regularization are also crucial. To improve performance, the suggested model was iterated several times; the only layer that was altered in this process was the classifier layer and VGG-19 was used for the modeling task.
The first thing we found in our experiments was that the model was overfitting: further training during this phase had minimal effect on the validation data, with the validation loss increasing rapidly and the validation accuracy decreasing. To address this, we persisted in refining the hyperparameters (learning rate, number of epochs, batch size, number of layers, optimizer, regularization and activation function) of the model until we found more appropriate values. Hyperparameter tuning was performed using grid search and random search methods. Key parameters such as learning rate, batch size, dropout rate and the number of layers frozen during fine-tuning were optimized. Grid search helped exhaustively evaluate a range of values, while random search efficiently explored larger parameter spaces. Cross-validation was employed to assess model performance and reduce overfitting.
Table 4 shows the statistical robustness of the model’s performance, presenting the mean and standard deviation obtained on each dataset for VGG-19 and ResNet-152, supporting the rationale for VGG-19’s robust performance. Leveraging the capabilities of VGG-19, we developed an efficient facial expression detection system for online video calls, with an average processing time of 0.56 s. This innovation carries significant potential in the healthcare sector, enabling remote diagnosis through facial expression analysis. Our system is a valuable tool for effectively conveying emotions by accurately detecting live facial expressions during video calls.
Facial expressions captured during online video calls are depicted in Figure 8, showcasing the ability to analyze individuals’ emotions remotely. Video calling enables real-time communication over the Internet, allowing users to interact visually and audibly regardless of geographic location. The implemented system holds promise for the healthcare sector, facilitating the diagnosis of patients’ emotional states during virtual consultations. Looking ahead, advancements in video calling technology could revolutionize healthcare delivery by enabling more immersive and empathetic remote interactions between patients and healthcare providers.
Table 5 compares different neural network algorithms on the various facial emotion detection datasets. Our methods performed better than other methodologies described in the literature: VGG-19 and ResNet-152 outperformed other solvers on CK+ and JAFFE.
VGG-19’s robust performance is highlighted by its consistent trend of producing higher accuracy across various datasets, with CK+ serving as a particular highlight. In addition to the experimental evaluation, an outcome of our research was the development of an online system specifically designed for the rapid recognition of facial expressions during video conferences. During a video call, the webcam records a series of photos, which the model instantly analyzes to categorize emotions. This technology is a valuable tool for consumers and healthcare providers since it helps explain people’s emotions during live video chats. This cutting-edge system, allowing for the remote tracking and assessment of facial expressions, has enormous promise, especially in healthcare settings. The conclusion is based on comparisons with multiple models, including those listed in Table 5.

6. Conclusions

This study presents a comprehensive analysis of deep learning models for facial expression recognition, emphasizing their potential to advance remote healthcare monitoring. Our research explored the utilization of ResNet-152 and VGG-19 methodologies across three distinct datasets: CK+, FER2013 and JAFFE. Notably, VGG-19 demonstrated superior performance on the CK+ dataset, achieving an accuracy of A = 0.98, while ResNet-152 outperformed VGG-19 on the FER2013 and JAFFE datasets, with accuracy values of A = 0.44 and A = 0.89, respectively. These results highlight the importance of dataset-specific model selection and optimization.
Unlike generic facial expression recognition studies, our work is specifically tailored to assess patients’ facial expressions remotely. The modularity of the proposed methodology allows its useful integration into a variety of environments, for example, in the context of telemedicine, where remote monitoring tools such as DoctorLINK are becoming increasingly essential.
We aim to enhance the implemented system by incorporating multimodal techniques, thus refining the remote detection of depression within healthcare systems. This approach aligns with the evolving landscape of technology-driven healthcare solutions, promising to revolutionize remote monitoring practices and contribute significantly to the field. Overall, our study not only sheds light on the comparative efficacy of deep learning models but also paves the way for practical applications in real-world scenarios, particularly in the crucial domain of healthcare monitoring. Future directions include integrating advanced deep learning techniques, such as vision transformers, self-supervised learning and reinforcement learning, to enhance facial expression recognition accuracy and expand applications in telemedicine. Additionally, ensuring privacy and security in video calling platforms will be essential for fostering trust and widespread adoption in healthcare settings.

Author Contributions

Conceptualization, R.K., G.C., T.F.F., S.I.H., D.T. and C.V.; software, R.K., G.C., T.F.F., S.I.H., D.T. and C.V.; validation, R.K., G.C., T.F.F., S.I.H., D.T. and C.V.; writing—original draft preparation, R.K., G.C., T.F.F., S.I.H., D.T. and C.V.; writing—review and editing, R.K., G.C., T.F.F., S.I.H., D.T. and C.V. All authors have read and agreed to the published version of the manuscript.

Funding

Cesare Valenti is supported by the research fund of the University of Palermo: FFR 2024 Cesare Valenti. Cesare Valenti is a member of the “Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica (GNCS-INdAM)”. The research leading to these results was supported by the European Union’s NextGenerationEU through the Italian Ministry of Universities and Research under grant PNRR-I-M4C2-I1.3 Project PE_00000019, “HEAL ITALIA”, to Domenico Tegolo, CUP B73C22001250006.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Director of the Department of Mathematics and Computer Science, University of Palermo, Italy.

Informed Consent Statement

Informed consent for publication was obtained from the participants depicted in Figure 8.

Data Availability Statement

Additional material can be downloaded from https://bit.ly/4eyCsN6 (accessed on 13 April 2025). Figure 2 shows examples from downloadable public-domain datasets [17,18,19].

Conflicts of Interest

Authors Giacomo Corvisieri and Tullio Flavio Fici were employed by Italtel S.p.A. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Shan, K.; Guo, J.; You, W.; Lu, D.; Bie, R. Automatic facial expression recognition based on a deep convolutional-neural-network structure. In Proceedings of the 15th International Conference on Software Engineering Research, Management and Applications, London, UK, 7–9 June 2017; pp. 123–128. [Google Scholar]
  2. Ghosh, S.; Priyankar, A.; Ekbal, A.; Bhattacharyya, P. Multitasking of sentiment detection and emotion recognition in code-mixed Hinglish data. Knowl. Based Syst. 2023, 260, 110182. [Google Scholar] [CrossRef]
  3. Karilingappa, K.; Jayadevappa, D.; Ganganna, S. Human emotion detection and classification using modified Viola-Jones and convolution neural network. IAES Int. J. Artif. Intell. 2023, 12, 79. [Google Scholar] [CrossRef]
  4. Banskota, N.; Alsadoon, A.; Prasad, P.; Dawoud, A.; Rashid, T.; Alsadoon, O. A novel enhanced convolution neural network with extreme learning machine: Facial emotional recognition in psychology practices. Multimed. Tools Appl. 2023, 82, 6479–6503. [Google Scholar] [CrossRef]
  5. Li, S.; Deng, W. Deep facial expression recognition: A survey. Trans. Affect. Comput. 2020, 13, 1195–1215. [Google Scholar] [CrossRef]
  6. Ashok Kumar, P.; Maddala, J.; Martin Sagayam, K. Enhanced facial emotion recognition by optimal descriptor selection with the neural network. IETE J. Res. 2023, 69, 2595–2614. [Google Scholar] [CrossRef]
  7. Gupta, S.; Kumar, P.; Tekchandani, R. Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models. Multimed. Tools Appl. 2023, 82, 11365–11394. [Google Scholar] [CrossRef]
  8. Shahzad, T.; Iqbal, K.; Khan, M.; Iqbal, N. Role of zoning in facial expression using deep learning. IEEE Access 2023, 11, 16493–16508. [Google Scholar] [CrossRef]
  9. Meena, G.; Mohbey, K.; Kumar, S. Sentiment analysis on images using convolutional neural networks-based Inception-V3 transfer learning approach. Int. J. Inf. Manag. Data Insights 2023, 3, 100174. [Google Scholar] [CrossRef]
  10. Singh, P.; Pandey, S.; Sharma, A.; Gupta, T. Implemented Model for CNN Facial Expressions: Emotion Recognition. In Proceedings of the International Conference on Sustainable Emerging Innovations in Engineering and Technology, Ghaziabad, India, 14–15 September 2023; pp. 732–737. [Google Scholar]
  11. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; Volume 1. [Google Scholar]
  12. Halder, S.; Afsari, K. Robots in inspection and monitoring of buildings and infrastructure: A systematic review. Appl. Sci. 2023, 13, 2304. [Google Scholar] [CrossRef]
  13. Tembhurne, J.V.; Diwan, T. Sentiment analysis in textual, visual and multimodal inputs using recurrent neural networks. Multimed. Tools Appl. 2021, 80, 6871–6910. [Google Scholar] [CrossRef]
  14. Chandrasekaran, G.; Antoanela, N.; Andrei, G.; Monica, C.; Hemanth, J. Visual sentiment analysis using deep learning models with social media data. Appl. Sci. 2022, 12, 1030. [Google Scholar] [CrossRef]
  15. Swarnkar, M.; Rajput, S. (Eds.) Artificial Intelligence for Intrusion Detection Systems; CRC Press: Boca Raton, FL, USA, 2023. [Google Scholar]
  16. Jang, G.; Kim, D.; Lee, I.; Jung, H. Cooperative Beamforming with Artificial Noise Injection for Physical-Layer Security. IEEE Access 2023, 11, 22553–22573. [Google Scholar] [CrossRef]
  17. Lucey, P.; Cohn, J.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  18. Goodfellow, I.; Erhan, D.; Carrier, P.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the 20th International Conference on Neural Information Processing, Daegu, Republic of Korea, 3–7 November 2013; Springer: Berlin/Heidelberg, Germany, 2013; Volume 20, pp. 117–124. [Google Scholar]
  19. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the 3rd International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
  20. Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the Applications of Computer Vision, Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
  21. Miolla, A.; Cardaioli, M.; Scarpazza, C. Padova Emotional Dataset of Facial Expressions (PEDFE): A unique dataset of genuine and posed emotional facial expressions. Behav. Res. 2023, 55, 2559–2574. [Google Scholar] [CrossRef]
  22. Romani-Sponchiado, A.; Sanvicente-Vieira, B.; Mottin, C.; Hertzog-Fonini, D.; Arteche, A. Child Emotions Picture Set (CEPS): Development of a database of children’s emotional expressions. Psychol. Neurosci. 2015, 8, 467. [Google Scholar] [CrossRef]
  23. Benitez-Quiroz, F.; Srinivasan, R.; Martinez, A.M. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5562–5570. [Google Scholar]
  24. Mascaró-Oliver, M.; Mas-Sansó, R.; Amengual-Alcover, E.; Roig-Maimó, M.F. UIBVFED-mask: A dataset for comparing facial expressions with and without face masks. Data 2023, 8, 17. [Google Scholar] [CrossRef]
  25. Zhang, W.; Song, P.; Zheng, W. Joint local-global discriminative subspace transfer learning for facial expression recognition. Trans. Affect. Comput. 2022, 14, 2484–2495. [Google Scholar] [CrossRef]
  26. Li, X.; Xiao, Z.; Li, C.; Li, C.; Liu, H.; Fan, G. Facial expression recognition network with slow convolution and zero-parameter attention mechanism. Optik 2023, 283, 170892. [Google Scholar] [CrossRef]
  27. Fan, S.; Jiang, M.; Shen, Z.; Koenig, B.; Kankanhalli, M.; Zhao, Q. The role of visual attention in sentiment prediction. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 217–225. [Google Scholar]
  28. Chen, T.; Borth, D.; Darrell, T.; Chang, S. Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. arXiv 2014, arXiv:1410.8586. [Google Scholar]
  29. Marab, S.; Pawar, M. Feature Learning for Effective Content-Based Image Retrieval. In Proceedings of the Computer Vision and Image Processing: 4th International Conference, Bangkok, Thailand, 9–11 December 2020; Springer: Singapore, 2020; Volume 4, pp. 395–404. [Google Scholar]
  30. Ali, L.; Alnajjar, F.; Jassmi, H.; Gocho, M.; Khan, W.; Serhani, M. Performance evaluation of deep CNN-based crack detection and localization techniques for concrete structures. Sensors 2021, 21, 1688. [Google Scholar] [CrossRef]
  31. Ali, M.; Khatun, M.; Turzo, N. Facial emotion detection using neural network. Int. J. Sci. Eng. Res. 2020, 11, 1318–1325. [Google Scholar]
  32. Helaly, R.; Messaoud, S.; Bouaafia, S.; Hajjaji, M.; Mtibaa, A. DTL-I-ResNet18: Facial emotion recognition based on deep transfer learning and improved ResNet18. Signal Image Video Process. 2023, 17, 2731–2744. [Google Scholar] [CrossRef]
  33. Liu, J.; Fu, F. Convolutional neural network model by deep learning and teaching robot in keyboard musical instrument teaching. PLoS ONE 2023, 18, e0293411. [Google Scholar] [CrossRef] [PubMed]
  34. Taha, B.; Hatzinakos, D. Emotion recognition from 2D facial expressions. In Proceedings of the Canadian Conference of Electrical and Computer Engineering, Edmonton, AB, Canada, 5–8 May 2019; pp. 1–4. [Google Scholar]
  35. Wu, C.; Chai, L.; Yang, J.; Sheng, Y. Facial expression recognition using convolutional neural network on graphs. In Proceedings of the Chinese Control Conference, Guangzhou, China, 27–30 July 2019; pp. 7572–7576. [Google Scholar]
  36. Rasamoelina, A.; Adjailia, F.; Sinčák, P. Deep convolutional neural network for robust facial emotion recognition. In Proceedings of the International Symposium on INnovations in Intelligent SysTems and Applications, Sofia, Bulgaria, 3–5 July 2019; pp. 1–6. [Google Scholar]
  37. Cambria, E.; Das, D.; Bandyopadhyay, S.; Feraco, A. A Practical Guide to Sentiment Analysis; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; Volume 5. [Google Scholar]
  38. Islam, J.; Zhang, Y. Visual sentiment analysis for social images using transfer learning approach. In Proceedings of the International Conference on Big Data and Cloud Computing, Macau, China, 16–18 November 2016; pp. 124–130. [Google Scholar]
  39. Meena, G.; Mohbey, K.; Indian, A. Categorizing sentiment polarities in social networks data using convolutional neural network. SN Comput. Sci. 2022, 3, 116. [Google Scholar] [CrossRef]
  40. Ben, X.; Ren, Y.; Zhang, J.; Wang, S.J.; Kpalma, K.; Meng, W.; Liu, Y.J. Video-based facial micro-expression analysis: A survey of datasets, features and algorithms. Trans. Pattern Anal. Mach. Intell. 2021, 44, 5826–5846. [Google Scholar] [CrossRef] [PubMed]
  41. Lei, Y.; Cao, H. Audio-Visual Emotion Recognition with Preference Learning Based on Intended and Multi-Modal Perceived Labels. Trans. Affect. Comput. 2023, 14, 2954–2969. [Google Scholar] [CrossRef]
  42. Karnati, M.; Seal, A.; Bhattacharjee, D.; Yazidi, A.; Krejcar, O. Understanding deep learning techniques for recognition of human emotions using facial expressions: A comprehensive survey. Trans. Instrum. Meas. 2023, 72, 5006631. [Google Scholar] [CrossRef]
  43. Kanna, R.K.; Kripa, N.; Vasuki, R. Systematic Design of Lie Detector System Utilizing EEG Signals Acquisition. Int. J. Sci. Technol. Res. 2023, 9, 610–612. [Google Scholar]
  44. Ekundayo, O.; Viriri, S. Facial expression recognition: A review of trends and techniques. IEEE Access 2021, 9, 136944–136973. [Google Scholar] [CrossRef]
  45. Saurav, S.; Singh, S.; Saini, R.; Yadav, M. Facial expression recognition using improved adaptive local ternary pattern. In Proceedings of the 3rd International Conference on Computer Vision and Image Processing, Prayagraj, India, 4–6 December 2020; pp. 39–52. [Google Scholar]
  46. Niu, B.; Gao, Z.; Guo, B. Facial expression recognition with LBP and ORB features. Comput. Intell. Neurosci. 2021, 2021, 8828245. [Google Scholar] [CrossRef]
  47. Lu, F.; Zhang, L.; Tian, G. User Emotion Recognition Method Based on Facial Expression and Speech Signal Fusion. In Proceedings of the 16th Conference on Industrial Electronics and Applications, Chengdu, China, 1–4 August 2021; pp. 1121–1126. [Google Scholar]
  48. Zhang, J.; Yu, H. Improving the facial expression recognition and its interpretability via generating expression pattern-map. Pattern Recognit. 2022, 129, 108737. [Google Scholar] [CrossRef]
  49. Poux, D.; Allaert, B.; Ihaddadene, N.; Bilasco, I.; Djeraba, C.; Bennamoun, M. Dynamic facial expression recognition under partial occlusion with optical flow reconstruction. Trans. Image Process. 2021, 31, 446–457. [Google Scholar] [CrossRef] [PubMed]
  50. Poux, D.; Allaert, B.; Mennesson, J.; Ihaddadene, N.; Bilasco, I.; Djeraba, C. Facial expressions analysis under occlusions based on specificities of facial motion propagation. Multimed. Tools Appl. 2021, 80, 22405–22427. [Google Scholar] [CrossRef]
  51. Kumar, R.; Hussain, S. A review of the deep convolutional neural networks for the analysis of facial expressions. J. Innov. Technol. 2024, 6, 41–49. [Google Scholar]
  52. Tang, Y.; Zhang, X.; Hu, X.; Wang, S.; Wang, H. Facial expression recognition using frequency neural network. Trans. Image Process. 2020, 30, 444–457. [Google Scholar] [CrossRef] [PubMed]
  53. Patel, K.; Mehta, D.; Mistry, C.; Gupta, R.; Tanwar, S.; Kumar, N.; Alazab, M. Facial sentiment analysis using AI techniques: State-of-the-art, taxonomies, and challenges. IEEE Access 2020, 8, 90495–90519. [Google Scholar] [CrossRef]
  54. Zhang, Y.; Yang, Z.; Lu, H.; Zhou, X.; Phillips, P.; Liu, Q.; Wang, S. Facial emotion recognition based on biorthogonal wavelet entropy, fuzzy support vector machine, and stratified cross validation. IEEE Access 2016, 4, 8375–8385. [Google Scholar] [CrossRef]
  55. Siqueira, H.; Magg, S.; Wermter, S. Efficient facial feature learning with wide ensemble-based convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5800–5809. [Google Scholar]
  56. Fu, X.; Wu, Z.; Wang, W.; Xie, T.; Keten, S.; Gomez-Bombarelli, R.; Jaakkola, T. Forces are not enough: Benchmark and critical evaluation for machine learning force fields with molecular simulations. arXiv 2022, arXiv:2210.07237. [Google Scholar]
  57. Lu, R.; Zhao, X.; Li, J.; Niu, P.; Yang, B.; Wu, H.; Wang, W.; Song, H.; Huang, B.; Zhu, N.; et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding. Lancet 2020, 395, 565–574. [Google Scholar] [CrossRef]
  58. Abdullah, S.M.S.; Abdulazeez, A.M. Facial expression recognition based on deep learning convolution neural network: A review. J. Soft Comput. Data Min. 2021, 2, 53–65. [Google Scholar]
  59. Huang, Y.; Chen, F.; Lv, S.; Wang, X. Facial expression recognition: A survey. Symmetry 2019, 11, 1189. [Google Scholar] [CrossRef]
  60. Zheng, H.; Wang, R.; Ji, W.; Zong, M.; Wong, W.; Lai, Z.; Lv, H. Discriminative deep multi-task learning for facial expression recognition. Inf. Sci. 2020, 533, 60–71. [Google Scholar] [CrossRef]
  61. Poruşniuc, G.; Leon, F.; Timofte, R.; Miron, C. Convolutional neural networks architectures for facial expression recognition. In Proceedings of the E-Health and Bioengineering Conference, Iasi, Romania, 21–23 November 2019; pp. 1–6. [Google Scholar]
  62. Hung, J.C.; Lin, K.C.; Lai, N.X. Recognizing learning emotion based on convolutional neural networks and transfer learning. Appl. Soft Comput. 2019, 84, 105724. [Google Scholar] [CrossRef]
  63. Kusuma, G.P.; Jonathan, J.; Lim, A. Emotion recognition on fer-2013 face images using fine-tuned vgg-16. Adv. Sci. Technol. Eng. Syst. J. 2020, 5, 315–322. [Google Scholar] [CrossRef]
  64. eHealth Products by Italtel. Available online: https://www.italtel.com/doctorlink-digital-health-in-your-hands (accessed on 13 April 2025).
  65. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016. Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj (accessed on 15 April 2025).
  66. Bagherian, E.; Rahmat, R. Facial feature extraction for face recognition: A review. In Proceedings of the International Symposium on Information Technology, Kuala Lumpur, Malaysia, 26–29 August 2008; Volume 2, pp. 1–9. [Google Scholar]
  67. Zhang, N.; Luo, J.; Gao, W. Research on face detection technology based on MTCNN. In Proceedings of the International Conference on Computer Network, Electronic and Automation, Xi’an, China, 25–27 September 2020; pp. 154–158. [Google Scholar]
  68. Sandeep, P.; Kumar, N.S. Pain detection through facial expressions in children with autism using deep learning. Soft Comput. 2024, 28, 4621–4630. [Google Scholar] [CrossRef]
  69. Yang, W.; Zheng, Z. Real-time face detection based on YOLO. In Proceedings of the International Conference on Knowledge Innovation and Invention, Jeju Island, Republic of Korea, 23–27 July 2018; pp. 221–224. [Google Scholar]
  70. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  71. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010. [Google Scholar]
  72. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. Version 2.5. 2021. Available online: https://www.tensorflow.org (accessed on 15 April 2025).
  73. Project Jupyter. Jupyter Notebook: An Open Source Platform for Interactive Computing. Version 3.0.14. 2020. Available online: https://jupyter.org (accessed on 15 April 2025).
  74. Python Software Foundation. Python Language Reference, Version 3.9. 2020. Available online: https://www.python.org (accessed on 15 April 2025).
  75. Anaconda, Inc. Anaconda: The Open Data Science Platform. 2021. Available online: https://www.anaconda.com (accessed on 15 April 2025).
  76. Punuri, S.; Kuanar, S.; Kolhar, M.; Mishra, T.; Alameen, A.; Mohapatra, H.; Mishra, S. Efficient net-XGBoost: An implementation for facial emotion recognition using transfer learning. Mathematics 2023, 11, 776. [Google Scholar] [CrossRef]
  77. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  78. Categorical Cross-Entropy: Unraveling Its Potentials in Multi-Class Classification. 2020. Available online: https://medium.com/@vergotten/categorical-cross-entropy-unraveling-its-potentials-in-multi-class-classification-705129594a01 (accessed on 15 April 2025).
Figure 1. Key operational principles of remote monitoring.
Figure 2. Facial expressions in CK+, JAFFE and FER2013.
Figure 3. Convolutional neural network architecture for facial expression recognition.
Figure 4. Image distribution in CK+, JAFFE and FER2013. The exact number of images is indicated within parentheses. The reader is referred to the electronic version of this article for the interpretation of the colors in this and all subsequent figures.
Figure 5. Accuracy and loss in training and validation on CK+ using ResNet-152 (top) and VGG-19 (bottom). The red line marks the selected number of training epochs.
Figure 6. Accuracy and loss in training and validation on FER2013 using ResNet-152 (top) and VGG-19 (bottom). The red line marks the selected number of training epochs.
Figure 7. Accuracy and loss in training and validation on JAFFE using ResNet-152 (top) and VGG-19 (bottom). The red line marks the selected number of training epochs.
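The red lines in Figures 5–7 mark the epoch at which training is stopped. One common way to select such an epoch is validation-based early stopping, sketched below with TensorFlow/Keras; the monitored quantity, patience value and checkpoint filename are illustrative assumptions, not the settings used in this study.

```python
# Hedged sketch: epoch selection via validation-based early stopping.
# The patience, monitored metrics and filename are illustrative assumptions.
import tensorflow as tf

callbacks = [
    # Stop when the validation loss has not improved for 10 epochs
    # and roll back to the best weights seen so far.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    ),
    # Keep the checkpoint with the highest validation accuracy.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_accuracy", save_best_only=True
    ),
]

# Usage (model, train_data and val_data as in the other sketches):
# history = model.fit(train_data, validation_data=val_data,
#                     epochs=100, callbacks=callbacks)
```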
Figure 8. Examples of typical emotions captured during real video calls by our system.
Table 1. Summary of datasets in the study.
Dataset | Number of Images | Original Dimensions | Resized Dimensions
CK+ | 593 | 640 × 480 | 124 × 124
FER2013 | 35,887 | 48 × 48 | 124 × 124
JAFFE | 213 | 256 × 256 | 124 × 124
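To make the preprocessing summarized in Table 1 concrete, the following is a minimal sketch of an input pipeline that resizes images to 124 × 124 pixels and applies light augmentation with TensorFlow/Keras [72,74]. The directory layout, augmentation settings and validation split are illustrative assumptions rather than the exact configuration used in the study.

```python
# Minimal preprocessing sketch (assumption: images stored in per-class folders).
# Every image is resized to 124x124 and lightly augmented, as in Table 1.
import tensorflow as tf

IMG_SIZE = (124, 124)

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel intensities to [0, 1]
    rotation_range=10,        # mild augmentation; exact values are illustrative
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    validation_split=0.2,     # hypothetical train/validation split
)

train_data = datagen.flow_from_directory(
    "ck_plus/",               # hypothetical dataset folder
    target_size=IMG_SIZE,
    color_mode="rgb",
    class_mode="categorical", # seven expression classes
    subset="training",
)
val_data = datagen.flow_from_directory(
    "ck_plus/", target_size=IMG_SIZE, color_mode="rgb",
    class_mode="categorical", subset="validation",
)
```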
Table 2. Proposed VGG-19 architecture.
Layer (Type) | Output Shape | Parameters
input_5 (inputlayer) | (none, 124, 124, 3) | 0
block1_conv1 (conv2d) | (none, 124, 124, 64) | 1792
block1_conv2 (conv2d) | (none, 124, 124, 64) | 36,928
block1_pool (maxpooling2d) | (none, 62, 62, 64) | 0
block2_conv1 (conv2d) | (none, 62, 62, 128) | 73,856
block2_conv2 (conv2d) | (none, 62, 62, 128) | 147,584
block2_pool (maxpooling2d) | (none, 31, 31, 128) | 0
block3_conv1 (conv2d) | (none, 31, 31, 256) | 295,168
block3_conv2 (conv2d) | (none, 31, 31, 256) | 590,080
block3_conv3 (conv2d) | (none, 31, 31, 256) | 590,080
block3_conv4 (conv2d) | (none, 31, 31, 256) | 590,080
block3_pool (maxpooling2d) | (none, 15, 15, 256) | 0
block4_conv1 (conv2d) | (none, 15, 15, 512) | 1,180,160
block4_conv2 (conv2d) | (none, 15, 15, 512) | 2,359,808
block4_conv3 (conv2d) | (none, 15, 15, 512) | 2,359,808
block4_conv4 (conv2d) | (none, 15, 15, 512) | 2,359,808
block4_pool (maxpooling2d) | (none, 7, 7, 512) | 0
block5_conv1 (conv2d) | (none, 7, 7, 512) | 2,359,808
block5_conv2 (conv2d) | (none, 7, 7, 512) | 2,359,808
block5_conv3 (conv2d) | (none, 7, 7, 512) | 2,359,808
block5_conv4 (conv2d) | (none, 7, 7, 512) | 2,359,808
block5_pool (maxpooling2d) | (none, 3, 3, 512) | 0
flatten_4 (flatten) | (none, 4608) | 0
dense_12 (dense) | (none, 256) | 1,179,904
batch_normalization_8 (batchnormalization) | (none, 256) | 1024
dropout_8 (dropout) | (none, 256) | 0
dense_13 (dense) | (none, 512) | 131,584
batch_normalization_9 (batchnormalization) | (none, 512) | 2048
dropout_9 (dropout) | (none, 512) | 0
dense_14 (dense) | (none, 7) | 3591
total params | | 21,342,535
trainable params | | 8,396,039
frozen (non-trainable) params | | 12,946,496
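The stack in Table 2 corresponds to a standard Keras VGG-19 backbone with a custom classification head. The sketch below reproduces that configuration; the freeze boundary (all layers up to and including block5_conv1 kept frozen) is inferred from the trainable and frozen parameter counts reported above and should be read as our reconstruction rather than the authors' verbatim training script. The learning rate is illustrative, and the 30% dropout rate is borrowed from Table 3 on the assumption that the same rate applies here.

```python
# Sketch of the VGG-19 transfer-learning setup summarized in Table 2.
# Assumptions: ImageNet weights, freezing up to block5_conv1, dropout 0.3,
# Adam learning rate 1e-4.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG19(
    weights="imagenet", include_top=False, input_shape=(124, 124, 3)
)

trainable = False
for layer in base.layers:
    if layer.name == "block5_conv2":   # fine-tune from this layer onwards
        trainable = True
    layer.trainable = trainable

model = models.Sequential([
    base,
    layers.Flatten(),                     # 3 x 3 x 512 = 4608 features
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),  # seven expression classes
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # Adam [77]
    loss="categorical_crossentropy",                         # [78]
    metrics=["accuracy"],
)
```

With a 124 × 124 × 3 input, block5_pool produces a 3 × 3 × 512 tensor, which explains the 4608-dimensional flatten layer in Table 2.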
Table 3. Proposed ResNet-152 architecture.
Layer (Type) | Output Shape | Parameters
input layer | (124, 124, 3) | 0
conv1 (conv2d, 7×7, stride = 2) | (62, 62, 64) | 9472
maxpooling (3×3, stride = 2) | (31, 31, 64) | 0
conv2_x (residual blocks ×3) | (31, 31, 256) | ∼200,000
conv3_x (residual blocks ×8, stride = 2) | (16, 16, 512) | ∼1,200,000
conv4_x (residual blocks ×36, stride = 2) | (8, 8, 1024) | ∼7,000,000
conv5_x (residual blocks ×3, stride = 2) | (4, 4, 2048) | ∼14,000,000
global average pooling | (1, 1, 2048) | 0
flatten layer | (2048) | 0
dense (256 units, relu) | (256) | 524,544
batch normalization | (256) | 1024
dropout (30%) | (256) | 0
dense (512 units, relu) | (512) | 131,584
batch normalization | (512) | 2048
dropout (30%) | (512) | 0
dense (7 units, softmax) | (7) | 3591
total params | | 58,370,944
trainable params | | 58,219,520
frozen (non-trainable) params | | 151,424
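Table 3 mirrors the Keras ResNet152 backbone followed by the same dense, batch-normalization and dropout head. The sketch below assumes the whole backbone is fine-tuned, which is consistent with the small frozen parameter count (essentially the batch-normalization moving statistics); the learning rate is again illustrative. Note that a flatten after global average pooling is a no-op, since pooling already yields a 2048-dimensional vector.

```python
# Sketch of the ResNet-152 configuration summarized in Table 3.
# Assumptions: ImageNet weights, full backbone fine-tuning, dropout 0.3,
# Adam learning rate 1e-5.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.ResNet152(
    weights="imagenet", include_top=False, input_shape=(124, 124, 3)
)
base.trainable = True  # assumption: the whole backbone is fine-tuned

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),       # (4, 4, 2048) -> (2048,)
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```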
Table 4. Mean μ and standard deviation σ at chosen epochs: loss L and accuracy A.
Model & Dataset | L (μ ± σ) | A (μ ± σ) | Validation L (μ ± σ) | Validation A (μ ± σ)
ResNet-152 & CK+ | 0.00 ± 0.02 | 1.00 ± 0.01 | 0.24 ± 0.15 | 0.97 ± 0.03
VGG-19 & CK+ | 0.02 ± 0.01 | 0.99 ± 0.02 | 0.18 ± 0.04 | 0.98 ± 0.01
ResNet-152 & JAFFE | 1.07 ± 0.14 | 0.60 ± 0.06 | 2.94 ± 2.90 | 0.31 ± 0.09
VGG-19 & JAFFE | 0.26 ± 0.17 | 0.91 ± 0.06 | 0.83 ± 0.28 | 0.82 ± 0.05
ResNet-152 & FER2013 | 1.65 ± 0.00 | 0.35 ± 0.00 | 1.67 ± 0.09 | 0.35 ± 0.03
VGG-19 & FER2013 | 1.53 ± 0.00 | 0.40 ± 0.00 | 1.51 ± 0.00 | 0.41 ± 0.00
Table 5. Comparison of model performances: accuracy A, precision P, recall R and F1-score F 1 .
Model | Dataset | A | P | R | F1
DDMTL [60] | CK+ | 0.98 | n/a | n/a | n/a
miniXception ensemble [61] | FER2013 | 0.64 | n/a | n/a | n/a
Dense_FaceLiveNet [62] | JAFFE | 0.91 | n/a | n/a | n/a
Proposed ResNet-152 | CK+ | 0.98 | 0.81 | 0.97 | 0.93
Proposed ResNet-152 | FER2013 | 0.39 | 0.69 | 0.68 | 0.68
Proposed ResNet-152 | JAFFE | 0.64 | 0.68 | 0.59 | 0.65
Proposed VGG-19 | CK+ | 0.98 | 0.98 | 0.96 | 0.97
Proposed VGG-19 | FER2013 | 0.44 | 0.71 | 0.72 | 0.70
Proposed VGG-19 | JAFFE | 0.89 | 0.91 | 0.90 | 0.91
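Scores of the kind reported in Table 5 (accuracy, precision, recall and F1) can be computed from raw model predictions as in the hedged sketch below. It uses scikit-learn with macro averaging over the seven classes; the averaging scheme is an assumption on our part, since the table does not restate it, and scikit-learn is not necessarily the tool the authors used.

```python
# Hedged sketch: summary metrics from softmax outputs.
# Assumption: macro averaging across the seven expression classes.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_prob):
    """y_true: integer class labels; y_prob: (n, 7) array of softmax outputs."""
    y_pred = np.argmax(y_prob, axis=1)
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"A": acc, "P": prec, "R": rec, "F1": f1}
```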
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
