Article

Facial Recognition Using Hidden Markov Model and Convolutional Neural Network

1 Department of Electrical Engineering, Military College of Signals (MCS), National University of Sciences and Technology (NUST), Khadim Hussain Road, Rawalpindi 46000, Punjab, Pakistan
2 Phillip M. Drayer Department of Electrical Engineering, Lamar University, Beaumont, TX 77705, USA
3 Department of Computer Science and Engineering, Miami University, Oxford, OH 45056, USA
4 Department of Telecommunication Engineering, University of Engineering and Technology, Taxila 47050, Punjab, Pakistan
5 Faculty of Arts Computing and Engineering, Wrexham University, Wrexham LL11 2AW, UK
* Authors to whom correspondence should be addressed.
AI 2024, 5(3), 1633-1647; https://doi.org/10.3390/ai5030079
Submission received: 16 July 2024 / Revised: 26 August 2024 / Accepted: 2 September 2024 / Published: 6 September 2024
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Face recognition (FR) offers a passive approach to person authentication that avoids face-to-face contact. Most FR approaches concentrate on increasing recognition rates and place little emphasis on reducing computational cost. In this paper, we propose Hidden Markov Model (HMM) and Convolutional Neural Network (CNN) models for FR using the ORL and Yale datasets. Facial images from the given datasets are divided into 3, 4, 5, and 6 portions, corresponding to the number of hidden states used in the HMM model. Quantized levels of the eigenvalue and eigenvector coefficients of overlapping blocks of facial images define the observation states of the HMM model. For image selection and rejection, a threshold is calculated using singular value decomposition (SVD). After training 3-state, 4-state, 5-state, and 6-state HMMs, the recognition accuracies are 96.5%, 98.5%, 98.5%, and 99.5%, respectively, on the ORL database, and 90.6667%, 94.6667%, 94.6667%, and 94.6667% on the Yale database. The CNN model uses convolutional layers, max-pooling layers, a flattening layer, dense layers, and dropout layers. ReLU is the activation function in all layers except the last, where softmax is used. Cross-entropy is used as the loss function, and the Adam optimizer is used in our proposed algorithm. The proposed CNN model achieved 100% training and testing accuracy on the ORL dataset; on the Yale dataset, it achieved a training accuracy of 100% and a testing accuracy of 85.71%. Overall, the HMM model is cost-effective with lower accuracy, while the CNN model is more accurate than the HMM but has a higher computational cost.

1. Introduction

Face recognition (FR) is a challenging and rapidly expanding research area. FR uses a passive approach to person authentication that avoids face-to-face contact. Various techniques have been developed by investigators in both ideal and cluttered settings. The majority of FR approaches place little emphasis on reducing computational cost and instead concentrate on increasing recognition rates. The five facial areas of the existing five-state hidden Markov model (HMM) [1] are the hair, forehead, eyes, nose, and mouth. Two additional facial regions, the chin and eyebrows, are added in the seven-state HMM. Three discrete facial regions are used as the basis for the three-state HMM.
We reduced the operational burden, which increases linearly with the number of states, by removing the hair section from the five-state model to make it a four-state system. The two essential phases of automatic face recognition (AFR) are image enhancement and categorization [2]. Internal and external features are the two categories of features that can be derived from an object's elements.
The classification of test images from trained databases using minimum-distance or maximum-likelihood estimation parameters is the other major challenge in FR. A robust method is needed for the classification of facial photographs in an unconstrained environment. The high computing cost of these methods, however, poses a significant obstacle to their use in real-time AFR. A dimensionality-reduction scheme is necessary to reduce the image dimensions while keeping the important visual elements. In this paper, the wavelet transform is applied to the subject image preceding FR, although many other adjustments have been proposed for feature extraction in the preprocessing stage. The length of the observation vector and the choice of classification algorithm affect an algorithm's processing cost. As a result, the choice of classifier has a significant influence on how the FR functional requirements are implemented. Other classification models, such as minimum-distance classifiers, neural networks (NN), HMMs, and regression-based encoders, have also been established. Without careful thresholding of these distance measurements, test images from untrained databases can be mistaken for images from trained ones.

2. Literature Review

An ergodic Hidden Markov Model (HMM) is presented in [3] as a classifier for face recognition that exploits the symmetry axis of the facial image. The authors also used the discrete wavelet transform to obtain the observation vectors. The model was tested on the Yale, Faces94, and AR datasets and achieved adequate results. The authors in [4] used an HMM-based technique for face recognition and face detection, with Karhunen-Loeve Transform coefficients used for HMM characterization. The HMM model provided an accuracy of 86%, while the time taken to recognize a face was 250 ms per face, an improvement over the previous recognition time of 750 ms per face on the same dataset. State duration models were used in the Separable Lattice Hidden Markov Model (SL-HMM) for face recognition in [5]. The SL-HMM was designed to overcome the exponential decay of state duration probability with increasing duration; parametric distributions were used to model the state duration probabilities. The proposed model accurately recognizes face images containing invariances. An HMM operating in the transform domain was also put forward for face recognition, with JPEG-style transform coefficients of the training images used as HMM input. The authors in [6] used the output vectors of Discrete Cosine Transform (DCT) images to train an ergodic HMM, which was later used for face recognition. This technique was implemented on the ORL database and provided 99.5% accuracy, a substantial improvement in recognition on the same dataset.
The authors in [7] used the Haar cascade algorithm for the detection and extraction of faces from images, which were later used to train the system. This model was based on diagonally placed cameras: faces are extracted using the Haar cascade algorithm, followed by a convolutional neural network for facial feature extraction. The extracted features were compared with pre-trained features, on the basis of which the model decided whether to authorize entry. The proposed algorithm provided an accuracy of 86.2%. A 2D pseudo-ergodic HMM (EHMM)-based model was proposed in [8]. The EHMM model provides flexibility in switching the states of the face image. The 2D-EHMM is trained by the segmental K-means design, which combines observation densities, density optimization, and state transitions. The model was tested on the ORL database, and accuracies of 95.63% and 92.5% were reported with the DCT-mod2 and DCT feature sets, respectively, for three states.
In [9], the 2D Pseudo Discrete Hidden Markov Model (P2D-DHMM) was proposed for face detection. This algorithm was based on two-way face scanning, top to bottom and left to right, by sliding windows. Features extracted with the 2D-DCT were followed by K-means clustering to generate two codebooks used for vector quantization. These codewords were utilized as observation vectors for the training of the HMM. In parallel, two different discrete HMMs were trained using the Baum-Welch algorithm. This state-of-the-art technique provided an accuracy of 91% with 5.2 s of offline training time.
In [10], novel passivity and dissipativity criteria for discrete-time fractional generalized delayed Cohen-Grossberg neural networks are presented, giving new conditions for network passivity and dissipation in terms of fractional dynamics and time delays. The authors employ fractional difference equations to construct suitable Lyapunov functionals, yielding parameter conditions that ensure the stability and robustness of the neural networks regardless of the delays. A stability analysis confirms the validity of the proposed criteria by simulating various conditions for the stability of the system. These findings benefit future developments and implementations of neural networks in realistic environments involving time delays and fractional dynamics, with possible uses in neural computation, control systems, and signal processing. In [11], neural network-based event-triggered data-driven control of disturbed nonlinear systems with quantized input is presented, combining a neural network with an event-triggered approach to reject disturbances in nonlinear systems. Since control actions are initiated sparingly and inputs are quantized, this approach reduces the number of computations and communications. The neural network mimics the behavior of the system and enters the design of the control law, enabling it to cope with disturbances as well as input restrictions. Computer experiments validate the ability of the method to achieve a stable state while utilizing the available resources uniformly, making the work relevant to resource-constrained environments such as robotics and automation.
Zernike moments were used for feature extraction in [12]. This technique improved the HMM algorithm and consists of two stages: the first stage extracts features from parts of the face such as the mouth or nose, while the second stage classifies facial expressions such as sadness, fear, or a smile. The proposed algorithm was implemented on the JAFFE dataset and provided an accuracy of 87.85%. In [13], an algorithm combining HMM with Singular Value Decomposition (SVD), an Artificial Neural Network (ANN), and Principal Component Analysis (PCA) was proposed. This combination of techniques was implemented on the ORL and Yale datasets and evaluated using measures such as false negatives and true positives. The model provided the highest recognition rate of 97.49% for ORL using the ANN.

3. Methodology

In this paper, we have focused on the FR system using different states’ HMM classification models and the Convolutional Neural Networks (CNN) architecture. The methodology for a proposed HMM and CNN model is presented in Figure 1. The facial images used in this model are taken from [14,15].

3.1. HMM Classification Model

We have proposed an HMM classification model for FR using different states of HMM to classify facial images of a person using ORL and Yale databases. In HMM, the classification model uses three states, four states, five states, and six states of HMM separately to train the HMM model and evaluate the results. The principal component analysis (PCA) used in our algorithm performs the task of feature extraction. The process of noise removal and dimensionality reduction of images is performed by using the Haar wavelet transform at the preprocessing stage. Facial information features are extracted using eigenvalue decomposition. The framework of the model is described in Figure 1. Facial images are divided into three portions, four portions, five portions, and six portions, where the number of portions represents the number of states being used in the HMM model. For the three-state HMM model, each portion is considered an independent HMM hidden state, which is represented in Figure 2. Facial information feature coefficients are the representation of the visible states of the HMM model.

3.1.1. Pre Processing

From the ORL database, out of the 10 images of each person, five images are used to train the HMM model, while the other five are used for testing [16]. A third-order nonlinear minimum static filter is used to suppress distortions and enhance the characteristics of the facial images. Image dimensionality reduction is performed using the Haar wavelet; a minimum-order static filter is used in preprocessing. The excellent localization property of the DWT contributes to computational effectiveness. After the preprocessing steps, facial images are factorized using SVD, as represented in Equation (1).
I = M λ V^T
where I is the facial image, λ is the diagonal matrix containing the singular values, and M and V are orthogonal matrices with orthonormal columns. The principal singular values are calculated for all the training facial images, and the normalized singular values are found. The differences between the normalized singular values are then determined. From these, a distinctive threshold on the singular values is determined for each database, which is used to discard test images from untrained databases before feature extraction. After that, the substantial features of the accepted test images are extracted.
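As an illustration, the SVD-based screening described above can be sketched in Python with NumPy (the paper's implementation is in MATLAB; the normalization and the exact form of the distinctive threshold below are assumptions made for illustration):

```python
import numpy as np

def principal_singular_values(images, k=3):
    """Return the k largest singular values of each facial image, L2-normalized.

    The paper keeps only three SVD coefficients; the normalization used here
    is an illustrative assumption.
    """
    svs = []
    for img in images:
        s = np.linalg.svd(img, compute_uv=False)[:k]
        svs.append(s / np.linalg.norm(s))
    return np.array(svs)

def rejection_threshold(train_svs):
    """A hypothetical distinctive threshold: the largest pairwise distance
    between normalized singular-value vectors of the training images."""
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(train_svs)
             for b in train_svs[i + 1:]]
    return max(dists)

def accept(test_img, train_svs, thresh, k=3):
    """Accept a test image if its normalized singular values fall within the
    threshold distance of some training image; otherwise reject it."""
    s = np.linalg.svd(test_img, compute_uv=False)[:k]
    s = s / np.linalg.norm(s)
    return min(np.linalg.norm(s - t) for t in train_svs) <= thresh
```

A rejected image never reaches the feature-extraction stage, which is what keeps images from untrained databases out of the classifier.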

3.1.2. Feature Extraction

A sampling window of size Lw is used to scan facial images using a top-bottom approach, and information in the overlapping block is retrieved as shown in Figure 3. Equation (2) is used for the calculation of a number of total blocks in an image.
B = (L_i − L_w) / (L_w − O) + 1
where B is the total number of blocks in an image, L_i is the height of the image, L_w is the sampling window size, and O is the overlap size. For L_i = 53, L_w = 4, and O = 3, the total number of blocks in an image is 50. Now, we perform eigen decomposition on the covariance matrix for each block using Equation (3).
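As a quick check of Equation (2), the block count can be computed directly (a Python sketch):

```python
# Number of overlapping blocks for a top-to-bottom scan (Equation (2)).
def num_blocks(L_i, L_w, O):
    # Height L_i, window size L_w, overlap O; the window advances L_w - O
    # rows per step, so (L_i - L_w) // (L_w - O) windows follow the first.
    return (L_i - L_w) // (L_w - O) + 1

print(num_blocks(53, 4, 3))  # 50, matching the paper's example
```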
C v i = λ i v i
where λ i represents an eigenvalue and v i the corresponding eigenvector of the covariance matrix C. Only the eigenvalues corresponding to the principal components are selected, as they carry the maximum information about the facial image. By plotting the eigenvalues of the facial image as a function of their index number, it becomes evident that the most significant information about the facial image is contained in the initial principal components, while the higher principal components carry negligible information. The desired results are obtained by utilizing the two eigenvalues ( λ 1 , λ 2 ) of the first two principal components and the first coefficient of the principal component v 1 at index (1, 1). These three components efficiently describe an image block. To reduce computational costs, these features are quantized to discrete levels using Equations (4) and (5).
f_i^q = (f_i − f_i^min) / Δ_i
Δ_i = (f_i^max − f_i^min) / L_i
where f i min , f i max , and f i q are the minimum, maximum, and quantized values of facial features, respectively. Δ i is the difference between two consecutive quantization levels, and L i is the number of quantization levels being used. Labels are assigned to each quantized feature using Equation (6).
Label = Q L ( 1 ) × 8 + Q L ( 2 ) × 60 + Q L ( 3 ) × 2 + 6
The quantization levels Q L ( 1 ) , Q L ( 2 ) , and Q L ( 3 ) represent the quantization levels of ν 1 ( 1 , 1 ) , λ 1 , and λ 2 , respectively. Through trial and error, the optimal numbers of levels for Q L ( 1 ) , Q L ( 2 ) , and Q L ( 3 ) were determined to be 10, 10, and 7, respectively. Consequently, each block can take one of 700 (10 × 10 × 7) distinct feature labels. An observation sequence, consisting of the feature labels of its 50 blocks, is established for each training image. These observation sequences from all training images are employed to train the HMM.
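The quantization and labeling steps of Equations (4)-(6) can be sketched as follows (a Python illustration; the clamping of values at the top bin edge is an assumption not stated in the paper):

```python
def quantize(f, f_min, f_max, levels):
    """Uniform quantization of a feature value (Equations (4) and (5))."""
    delta = (f_max - f_min) / levels   # width of one quantization level
    q = int((f - f_min) / delta)
    return min(q, levels - 1)          # clamp the upper edge into the last bin

def block_label(q1, q2, q3):
    """Map the three quantized features of a block to a single label.
    The weights follow Equation (6) as printed; q1, q2, q3 are the quantized
    v1(1,1), lambda1, and lambda2 with 10, 10, and 7 levels respectively."""
    return q1 * 8 + q2 * 60 + q3 * 2 + 6
```

Scanning the 50 blocks of one image through `block_label` yields the length-50 observation sequence used for HMM training.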

3.1.3. HMM Training

Facial areas, i.e., the mouth, nose, eyes, and forehead, are modeled by different hidden states of the HMM. The HMM models used in this paper have 3, 4, 5, and 6 states. The detailed HMM training cycle is shown in Figure 1. Quantized facial features are the visible symbols forming the observation sequence of the model. HMM training involves two major steps. The first step is the initialization of the transition matrix A, the emission matrix B, and the initial probability matrix π, where A, B, and π are the model parameters. Adequate results have been achieved in the proposed model using the following values of the initial parameters.
N = 3, O = 700, π = [1, 0, 0], a_i,i = 0.85 for 1 ≤ i < N, a_i,i+1 = 0.15, and a_3,3 = 1 (final state for the 3-state HMM)
Table 1 depicts the initial transition probabilities for the 3-state HMM model for FR, where N is the number of states, V the visible sequence, a_i,i the probability of transition from state w_i(t − 1) to state w_i(t), and a_i,i+1 the probability of transition from state w_i(t − 1) to state w_i+1(t).
To train the HMM, we are required to find a_ij and b_jk, where a_ij is the probability of transition from state w_i(t − 1) to state w_j(t), and b_jk is the emission probability that the model generates the visible symbol v_k when it is in state w_j(t), calculated by Equation (7).
b j k = P ( V k | w j )
First, a_ij and b_jk are randomly initialized. The probability matrices α_j(t) and β_j(t) are then calculated with the forward and backward algorithms from [17]. The forward probability is the probability that the model is in state w_j(t) at time t, having generated the first t visible symbols of the visible sequence V, and is calculated by the following equation:
α_j(t) = 0, if t = 0 and j ≠ initial state
α_j(t) = 1, if t = 0 and j = initial state
α_j(t) = Σ_i α_i(t − 1) · a_ij · b_jk v(t), otherwise
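A minimal sketch of this forward recursion in matrix form (NumPy; the two-symbol emission matrix in the test is only a toy example, not the paper's 700-symbol alphabet):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward probabilities alpha[t, j] for a discrete-emission HMM
    (Equation (8)): A is the N x N transition matrix, B the N x K emission
    matrix, pi the initial state distribution, obs the symbol indices."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # base case, t = 0
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion step
    return alpha  # alpha[-1].sum() is P(V | model)
```

With the left-right initialization of Table 1 (a_i,i = 0.85, a_i,i+1 = 0.15, π = [1, 0, 0]), `alpha[-1].sum()` gives the likelihood of an observation sequence under the model.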
The probability with the backward algorithm is the probability that the model will be in state w i ( t ) and will generate the remaining symbols of the visible sequence V [18], calculated by Equation (9).
β_i(t) = 0, if t = T and w_i ≠ final state
β_i(t) = 1, if t = T and w_i = final state
β_i(t) = Σ_j β_j(t + 1) · a_ij · b_jk v(t + 1), otherwise
Now the probability of transition ( γ i j ( t ) ) from state w i ( t 1 ) to state w j ( t ) for a particular sequence v T is calculated by Equation (10).
γ_ij(t) = [α_i(t − 1) · a_ij · b_jk · β_j(t)] / P(v^T | θ)
After estimating γ_ij(t), refined values for a_ij and b_jk are calculated using Equations (11) and (12).
â_ij = Σ_{t=1}^{T} γ_ij(t) / Σ_{t=1}^{T} Σ_k γ_ik(t)
b̂_jk = Σ_{t=1, v(t)=v_k}^{T} Σ_l γ_lj(t) / Σ_{t=1}^{T} Σ_l γ_lj(t)
where the expected number of transitions from state w_i(t − 1) to w_j(t) is Σ_{t=1}^{T} γ_ij(t), and the total expected number of transitions from state w_i to any state is Σ_{t=1}^{T} Σ_k γ_ik(t). These iterations continue until the model reaches convergence, which is achieved when the probability variation between two consecutive iterations falls below a specific threshold, i.e., 0.09 in our proposed procedure.
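One full re-estimation iteration, combining the forward/backward recursions with Equations (10)-(12), can be sketched as follows (a generic discrete-HMM Baum-Welch step in NumPy; the paper's left-right structure and final-state handling are not reproduced here):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete-emission HMM,
    following the forward/backward recursions and Equations (10)-(12).
    A sketch, not the authors' exact implementation."""
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                     # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):            # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # xi[t, i, j]: probability of the i -> j transition at step t (Eq. (10))
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    gamma = alpha * beta / likelihood         # state occupancy probabilities
    A_new = xi.sum(0) / xi.sum(0).sum(1, keepdims=True)   # Eq. (11)
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):               # Eq. (12)
        B_new[:, k] = gamma[obs == k].sum(0)
    B_new /= B_new.sum(1, keepdims=True)
    return A_new, B_new, likelihood
```

Iterating this step, and stopping once the likelihood change between consecutive iterations falls below the chosen threshold, completes the training loop described above.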

3.1.4. HMM Classification

After completion of the training process, each individual is represented by a separate HMM. Two observation sequences, containing the eigenvalues and eigenvector coefficients representing each image, are defined in Equations (13) and (14).
V 1 T = [ v 1 ( 1 , 1 ) , λ 1 , λ 2 ]
V 2 T = [ v 1 ( 1 , 1 ) × 1.01 , λ 1 , λ 1 × 1.04 ]
Now, the probability of observation sequences related to a test image against all trained HMM (represented by M) is calculated by using Equation (15).
P(V_n^T | M) = ?, n = 1, 2
To find this probability, we consider all possible hidden state sequences that could generate the particular observation sequence. P(V_n^T | M) is computed by summing the probabilities over all hidden sequences using Equation (16).
P(V_n^T | M) = Σ_{i=1}^{i_max} P(V_n^T | W_i^T) P(W_i^T)
where i_max = N^T is the maximum number of possible hidden state paths through which the HMM model can transition while producing the observation sequence V_n^T. The hidden sequence W_1^T, shown in Equation (17), is one of the possible hidden sequences W_i^T of length T that could produce the observation sequence V_n^T.
W_1^T = {w_1, w_2, w_3, w_4, …, w_T}
The probability of this specific sequence is a product of transition probabilities and is calculated by using Equation (18). Observation sequence probability for the identified hidden sequence is a product of emission probabilities and is calculated by using Equation (19).
P(W_1^T) = Π_{t=1}^{T} P(w_t | w_{t−1})
P(V_n^T | W_i^T) = Π_{t=1}^{T} P(v(t) | w_t)
Now by putting Equations (18) and (19) into Equation (16), we get
P(V_n^T | M) = Σ_{i=1}^{i_max} Π_{t=1}^{T} P(v(t) | w_i(t)) · P(w_i(t) | w_i(t − 1))
Classification of test images is performed by using the majority vote rule related to the evaluation probabilities of known observation sequences and is calculated by using Equation (21).
Image_test = Image_k, if P(V_n^T | M_k) = max_m P(V_n^T | M_m)
Image_test = Unknown, otherwise
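The decision rule of Equation (21) amounts to a maximum-likelihood vote over the trained per-person HMMs; a small sketch (the log-domain scores and the rejection threshold are illustrative assumptions):

```python
import math

def classify(test_scores, threshold=-math.inf):
    """Pick the person whose trained HMM gives the highest evaluation
    probability (Equation (21)). `test_scores` maps a person id to the
    log-likelihood log P(V | M) of the test image's observation sequence;
    scores below `threshold` are rejected as 'Unknown'. The threshold
    handling here is an illustrative assumption."""
    best = max(test_scores, key=test_scores.get)
    return best if test_scores[best] >= threshold else "Unknown"
```

For example, scoring a test sequence against the 40 ORL person models yields 40 log-likelihoods, and the identity of the maximizing model is returned.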

3.2. Convolutional Neural Networks (CNN) Architecture

The CNN architecture uses the complete image as input and passes it through different layers to compute a feature vector for each image. The CNN model computes feature vectors for all images during training and stores them for comparison with the feature vectors extracted from test images; on the basis of these feature vectors, images are classified into their classes. Our proposed CNN model is shown in Figure 1. All images are resized to 112 × 92 during preprocessing [19] and are then fed to the model. The first layer of the model is a convolutional layer with 32 filters, a kernel size of 7, and ReLU as its activation function. The second layer is a max-pooling layer with a pool size of 2. The third layer is a convolutional layer with 54 filters, a kernel size of 5, and ReLU activation. The fourth layer is a max-pooling layer with a pool size of 2. The fifth layer is a flattening layer. The next layers are three sets of dense layers, each followed by a dropout layer. The model uses sparse categorical cross-entropy as the loss function and Adam as the optimizer. The model summary is shown in Table 2. The proposed CNN model is trained on the ORL and Yale datasets. The ORL dataset contains 400 images of 40 individuals, with 10 images of each person, and the Yale dataset contains a total of 165 images of 15 individuals, with 11 images of each person [20].
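The feature-map sizes implied by the layer descriptions above can be traced with simple arithmetic (a Python sketch assuming 'valid' convolutions and stride-2 pooling, which the paper does not state explicitly):

```python
def conv_out(h, w, k):
    # 'valid' convolution, stride 1: an assumption, the paper omits padding
    return h - k + 1, w - k + 1

def pool_out(h, w, p):
    # non-overlapping max pooling with pool size p
    return h // p, w // p

h, w = 112, 92             # input size after preprocessing
h, w = conv_out(h, w, 7)   # Conv2D, 32 filters, kernel 7 -> (106, 86)
h, w = pool_out(h, w, 2)   # MaxPooling, pool 2        -> (53, 43)
h, w = conv_out(h, w, 5)   # Conv2D, 54 filters, kernel 5 -> (49, 39)
h, w = pool_out(h, w, 2)   # MaxPooling, pool 2        -> (24, 19)
flat = h * w * 54          # Flatten feeds the dense layers
print(h, w, flat)
```

Under these assumptions the flattened vector has 24,624 elements, which suggests that the bulk of the roughly 52.5 million parameters reported in Section 4.2 sit in the first dense layer.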

4. Results

The proposed algorithm is executed in MATLAB 2021a [21]. The effectiveness of this algorithm is verified by using ORL and Yale databases. ORL database consists of 400 facial images of 40 persons with different facial orientations and expressions [22]. ORL database image size is 112 × 92 with varying illumination. On the other hand, the Yale database consists of 165 facial images of 15 persons with varying illumination conditions, and each image is 231 × 95 in size.

4.1. HMM

All database images are resized to 53 × 44. To decrease computational power and memory requirements, we used only three SVD coefficients. On the ORL database, the three-state HMM gave a classification accuracy of 96.5% with a low computational cost of 0.0391 s to train one image using the proposed algorithm. Increasing the number of states increases the accuracy of the algorithm: the four-state HMM gave a classification accuracy of 98.5%, but with increased computational complexity, taking 0.0490 s to train one image with our proposed settings. Increasing the number of states further to five, the accuracy remains the same as for the four-state HMM, while the computational cost rises to 0.0745 s per training image. Finally, with six states, the accuracy of the proposed model increases to 99.5%, at an increased computational cost of 0.0945 s per training image.
On the Yale database, the three-state HMM gave a classification accuracy of 90.6667% with a low computational cost of 0.0435 s to train one image using the proposed algorithm. The accuracy of the four-, five-, and six-state HMM models remains constant at 94.6667%, but the computational cost increases with each successive increase in the number of states of the HMM classification model. The result comparison is shown in Table 2. The performance of the proposed HMM model is shown in Figure 4, and its computational complexity is shown in Figure 5.
The computational costs of the proposed models differ markedly. For the HMM, complexity grows with the number of hidden states: the forward-backward algorithm used in training is quadratic in the number of states and linear in the sequence length, so larger state counts quickly become expensive. This makes the HMM attractive for small tasks but less practical at scale. The CNN model, by contrast, achieves higher accuracy but demands far more computation, in terms of both training time and high-performance hardware such as GPUs, and its training time grows with the depth of the network and the size of the dataset. The Adam optimizer aids convergence and model accuracy, but the overall computation remains substantial. In summary, the HMM allows more efficient computation for simpler tasks, whereas the CNN, although more accurate, is more computationally intensive and requires more careful resource provisioning, highlighting the trade-off between accuracy and computational complexity.

4.2. CNN

The ORL and Yale datasets are tested on the proposed CNN model, which gave the best results on the datasets used for training and testing. The CNN model for the ORL dataset has a total of 52,500,114 parameters, all of which are trainable. The proposed model is trained for 100 epochs. At the last epoch, the training accuracy is 1.00, the training loss is 0.0015, the testing accuracy is 1.00, and the testing loss is 1.7267 × 10−4. The performance of the model on the ORL dataset is shown in Figure 6, and the confusion matrix of the model for the ORL dataset is shown in Figure 7.
The CNN model for the Yale dataset has a total of 52,497,549 parameters, all of which are trainable. At the last epoch, the training accuracy is 1.00, the training loss is 0.000164, the testing accuracy is 0.8571, and the testing loss is 4.0638. The performance of the model on the Yale dataset is shown in Figure 8, and the confusion matrix of the model for the Yale dataset is shown in Figure 9.
In this paper, we have proposed two models for FR.
  • In the HMM classification model for FR, PCA is used to perform the task of feature extraction. The process of noise removal and dimensionality reduction of images is performed by using the Haar wavelet transform. Facial information features are extracted using eigenvalue decomposition. The model uses three states, four states, five states, and six states of HMM separately to train the HMM model and evaluate the results.
  • The CNN model computes feature vectors for all images during the training of the model and stores them to compare these feature vectors with the feature vectors being extracted from test images. Based on these feature vectors, images are classified into different classes.
  • The HMM classification model is computationally inexpensive and can be used for online FR. The CNN model is more accurate but computationally expensive, and is therefore suited to offline FR analysis.

5. Conclusions

Recognition rate and computational cost are directly proportional to each other. In our proposed HMM algorithm, training is cost-effective, resulting in a lower computational cost than the CNN model. Increasing the number of states in our proposed HMM model, i.e., its computational complexity, also increases the recognition accuracy; conversely, decreasing the complexity of the model decreases its accuracy. In our proposed CNN model, the computational complexity is much higher than that of the HMM, but its recognition accuracy is much better. The trade-off between computational complexity and accuracy must therefore be weighed: how much computational cost can be borne to obtain a given accuracy? Higher accuracy comes at a great cost. The HMM model is beneficial for online facial detection, where time is a precious quantity, while the CNN can be used where accuracy matters more than time, i.e., offline facial analysis, where the database is very large and accurate results are required.
In future work, a better CNN model can be designed with lower computational complexity and higher accuracy.

Author Contributions

Conceptualization, M.B.; Methodology, S.R.; Software, N.B.; Formal analysis, M.Z.; Investigation, A.F.; Writing—review & editing, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be shared upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nefian, A.V.; Hayes, M.H. Hidden Markov models for face recognition. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181), Seattle, WA, USA, 15 May 1998; pp. 2721–2724. [Google Scholar] [CrossRef]
  2. Miarnaeimi, H.; Davari, P. A new fast and efficient HMM based face recognition system using a 7-state HMM along with SVD coefficients. Iranian J. Elect. Electron. Eng. 2008, 4, 46–57. [Google Scholar]
  3. Kiani, K.; Rezaeirad, S. A new ergodic HMM-based face recognition using DWT and half of the face. In Proceedings of the 2019 5th Conference on Knowledge-Based Engineering and Innovation (KBEI), Tehran, Iran, 28 February–1 March 2019; pp. 531–536. [Google Scholar]
  4. Nefian, A.V.; Hayes, M.H. Face detection and recognition using hidden Markov models. In Proceedings of the 1998 International Conference on Image Processing. ICIP98 (Cat. No. 98cb36269), Chicago, IL, USA, 7 October 1998; Volume 1, pp. 141–145. [Google Scholar]
  5. Takahashi, Y.; Tamamori, A.; Nankaku, Y.; Tokuda, K. Face recognition based on separable lattice 2-D HMM with state duration modeling. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 2162–2165. [Google Scholar]
  6. Kohir, V.V.; Desai, U.B. Face recognition using a DCT-HMM approach. In Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision. WACV’98 (Cat. No. 98EX201), Princeton, NJ, USA, 19–21 October 1998; pp. 226–231. [Google Scholar]
  7. Pai, V.K.; Balrai, M.; Mogaveera, S.; Aeloor, D. Face Recognition Using Convolutional Neural Networks. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; pp. 165–170. [Google Scholar]
  8. Kumar, S.A.S.; Deepti, D.R.; Prabhakar, B. Face recognition using pseudo-2D ergodic HMM. In Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Toulouse, France, 14–19 May 2006; Volume 2, p. II. [Google Scholar]
  9. Vaseghi, B.; Hashemi, S. Face verification using D-HMM and adaptive K-means clustering. In Proceedings of the 2011 4th IEEE International Conference on Broadband Network and Multimedia Technology, Shenzhen, China, 28–30 October 2011; pp. 270–275. [Google Scholar]
  10. Wang, C.; Zhang, H.; Wen, D.; Shen, M.; Li, L.; Zhang, Z. Novel passivity and dissipativity criteria for discrete-time fractional generalized delayed Cohen–Grossberg neural networks. Commun. Nonlinear Sci. Numer. Simul. 2024, 133, 107960. [Google Scholar] [CrossRef]
  11. Wang, X.; Karimi, H.R.; Shen, M.; Liu, D.; Li, L.W.; Shi, J. Neural network-based event-triggered data-driven control of disturbed nonlinear systems with quantized input. Neural Netw. 2022, 156, 152–159. [Google Scholar] [CrossRef] [PubMed]
  12. Rahul, M.; Shukla, R.; Yadav, D.K.; Yadav, V. Zernike moment-based facial expression recognition using two-staged hidden markov model. In Advances in Computer Communication and Computational Sciences; Springer: Singapore, 2019; pp. 661–670. [Google Scholar]
  13. Aggarwal, A.; Alshehri, M.; Kumar, M.; Sharma, P.; Alfarraj, O.; Deep, V. Principal component analysis, hidden Markov model, and artificial neural network inspired techniques to recognize faces. Concurr. Comput. Pract. Exp. 2021, 33, e6157. [Google Scholar] [CrossRef]
  14. Yale Face Database. Available online: https://www.kaggle.com/datasets/olgabelitskaya/yale-face-database (accessed on 5 April 2024).
  15. The ORL Database. Available online: https://www.kaggle.com/datasets/tavarez/the-orl-database-for-training-and-testing/data (accessed on 5 April 2024).
  16. Tan, W.; Wu, F.; Rong, G. Pseudo training sample method for face recognition using HMMs. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan, 21–23 March 2004; Volume 2, pp. 801–806. [Google Scholar] [CrossRef]
  17. Du, Q.; Chang, C.I. Hidden Markov model approach to spectral analysis for hyperspectral imagery. Opt. Eng. 2001, 40, 2277–2284. [Google Scholar]
  18. Alghamdi, R. Hidden Markov Models (HMMs) and Security Applications. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2016, 7, 39–47. [Google Scholar] [CrossRef]
  19. Chandra, M.; Kumari, A.; Kumar, S. Isolated text recognition using SVD and HMM. In Proceedings of the 2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies, Ramanathapuram, India, 8–10 May 2014; pp. 1264–1267. [Google Scholar] [CrossRef]
  20. Noushath, S.; Kumar, G.H.; Shivakumara, P. Diagonal Fisher linear discriminant analysis for efficient face recognition. Neurocomputing 2006, 69, 1711–1716. [Google Scholar] [CrossRef]
  21. The MathWorks, Inc. MATLAB Version: 9.13.0 (R2021a). 2021. Available online: https://www.mathworks.com (accessed on 1 January 2023).
  22. Orlov, N.; Shamir, L.; Macura, T.; Johnston, J.; Eckley, D.M.; Goldberg, I.G. WND-CHARM: Multi-purpose classification using compound image transforms. Pattern Recognit. Lett. 2008, 29, 1684–1693. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Methodology for the proposed HMM and CNN models. Image source: [14,15].
Figure 2. Three-state HMM representation on a human facial image. Image source: [14,15].
Figure 3. Sampling window and overlapping block. Image source: [14,15].
Figure 4. Performance of proposed HMM model.
Figure 5. Computational complexity of proposed HMM model.
Figure 6. Model efficiency on ORL data set.
Figure 7. Confusion matrix of the model on ORL dataset.
Figure 8. Model efficiency on Yale data set.
Figure 9. Confusion matrix of the model on Yale dataset.
Table 1. Initial transition probability matrix for the 3-state HMM model.

        S1      S2      S3
S1      0.85    0.15    0
S2      0       0.85    0.15
S3      0       0       1
Table 2. Performance of HMM model on different states using ORL and Yale datasets.

Database                                 HMM Model     Recognition Accuracy   Training Time/Image   Testing Time/Image
ORL Database                             3-state HMM   96.5%                  0.0391 s              0.0188 s
(5|5 split, images of 40 persons)        4-state HMM   98.5%                  0.0490 s              0.0214 s
                                         5-state HMM   98.5%                  0.0745 s              0.0232 s
                                         6-state HMM   99.5%                  0.0945 s              0.0223 s
Yale Database                            3-state HMM   90.6667%               0.0435 s              0.0328 s
(6|5 split, images of 15 persons)        4-state HMM   94.6667%               0.0599 s              0.0354 s
                                         5-state HMM   94.6667%               0.0849 s              0.0325 s
                                         6-state HMM   94.6667%               0.1125 s              0.0365 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Bilal, M.; Razzaq, S.; Bhowmike, N.; Farooq, A.; Zahid, M.; Shoaib, S. Facial Recognition Using Hidden Markov Model and Convolutional Neural Network. AI 2024, 5, 1633-1647. https://doi.org/10.3390/ai5030079
