## *6.1. Measures of Similarity or Distances*


• Euclidean distance: The Euclidean distance between two points *P*1 = (*x*1, *y*1) and *P*2 = (*x*2, *y*2) in the plane is defined as follows:

$$d\_E(P\_1, P\_2) = \sqrt{(x\_2 - x\_1)^2 + (y\_2 - y\_1)^2}.\tag{15}$$

In general, the Euclidean distance between two points *P* = (*p*1, *p*2, ... , *pn*) and *Q* = (*q*1, *q*2, ... , *qn*) in n-dimensional space is defined as follows:

$$d\_E(P, Q) = \sqrt{\sum\_{i=1}^{n} \left( p\_i - q\_i \right)^{2}}.\tag{16}$$
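
As an illustration, the following minimal Python sketch (the function name and test points are our own) computes Equation (16), with Equation (15) as the two-dimensional special case:

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points (Equation (16))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

# Two-dimensional special case of Equation (15): a 3-4-5 right triangle.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```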

• Bhattacharyya distance [104,105]: The Bhattacharyya distance is a statistical measure that quantifies the similarity between two discrete or continuous probability distributions. It is particularly known for its low processing time and its low sensitivity to noise. For two probability distributions *p* and *q* defined on the same domain, the Bhattacharyya distance is defined as follows:

$$D\_B(p,q) = -\ln(BC(p,q)),\tag{17}$$

$$BC(p,q) = \sum\_{\mathbf{x} \in \mathcal{X}} \sqrt{p(\mathbf{x})q(\mathbf{x})} \text{ (a); } BC(p,q) = \int \sqrt{p(\mathbf{x})q(\mathbf{x})}d\mathbf{x} \text{ (b), }\tag{18}$$

where *BC* is the Bhattacharyya coefficient, defined as Equation (18a) for discrete probability distributions and as Equation (18b) for continuous probability distributions. In both cases, 0 ≤ *BC* ≤ 1 and 0 ≤ *D<sub>B</sub>* ≤ ∞. In its simplest formulation, the Bhattacharyya distance between two classes that follow normal distributions can be calculated from their means (μ) and variances (σ<sup>2</sup>):

$$D\_B(p,q) = \frac{1}{4} \ln \left( \frac{1}{4} \left( \frac{\sigma\_p^2}{\sigma\_q^2} + \frac{\sigma\_q^2}{\sigma\_p^2} + 2 \right) \right) + \frac{1}{4} \left( \frac{(\mu\_p - \mu\_q)^2}{\sigma\_p^2 + \sigma\_q^2} \right). \tag{19}$$
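
A minimal Python sketch of both forms (function names and test values are our own): Equations (17) and (18a) for discrete distributions, and the closed form of Equation (19) for two univariate normal distributions:

```python
import numpy as np

def bhattacharyya_discrete(p, q):
    """D_B for discrete distributions (Equations (17) and (18a))."""
    bc = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))  # Bhattacharyya coefficient
    return -np.log(bc)

def bhattacharyya_gaussian(mu_p, var_p, mu_q, var_q):
    """D_B for two univariate normal distributions (Equation (19))."""
    term1 = 0.25 * np.log(0.25 * (var_p / var_q + var_q / var_p + 2.0))
    term2 = 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
    return term1 + term2

# Identical distributions give BC = 1 and hence D_B = 0.
print(bhattacharyya_discrete([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(bhattacharyya_gaussian(0.0, 1.0, 1.0, 1.0))      # 0.125
```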

• Chi-squared distance [106]: The Chi-squared (χ²) distance weights the difference in each bin by the size of the samples, so that differences between bins with few occurrences are given the same relevance as differences between bins with many occurrences. To compare two histograms *S*1 = (*u*1, ... , *um*) and *S*2 = (*w*1, ... , *wm*), the Chi-squared distance can be defined as follows:

$$D\_{\chi^{2}}(S\_{1}, S\_{2}) = \frac{1}{2} \sum\_{i=1}^{m} \frac{\left(u\_{i} - w\_{i}\right)^{2}}{u\_{i} + w\_{i}}.\tag{20}$$
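
A minimal Python sketch of Equation (20); the small `eps` term is our own addition to guard against empty bins:

```python
import numpy as np

def chi_squared_distance(s1, s2, eps=1e-12):
    """Chi-squared distance between two histograms (Equation (20))."""
    u, w = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    return 0.5 * np.sum((u - w) ** 2 / (u + w + eps))  # eps avoids division by zero

# Identical histograms are at distance 0; disjoint ones at half their total mass.
print(chi_squared_distance([1, 2, 3], [1, 2, 3]))  # 0.0
print(chi_squared_distance([2, 0], [0, 2]))        # 2.0
```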

## *6.2. Classifiers*

There are many face classification techniques in the literature that allow one to select, from a few examples, the group or class to which an object belongs. Some of them are based on statistics, such as the Bayesian classifier and correlation [18], while others are based on the regions that the different classes generate in the decision space, such as K-means [9], CNNs [103], artificial neural networks (ANNs) [37], support vector machines (SVMs) [26,107], k-nearest neighbors (K-NNs), and decision trees (DTs).

• Support vector machines (SVMs) [13,26]: The feature vectors extracted by any descriptor can be classified by a linear or nonlinear SVM. The SVM classifier separates the classes with an optimal hyperplane. To determine this hyperplane, only the points of the training set closest to it are needed; these points are called support vectors (Figure 14).

There is an infinite number of hyperplanes capable of perfectly separating two classes, so the SVM selects the hyperplane that maximizes the minimal distance between the training examples and the separating hyperplane (i.e., the distance between the support vectors and the hyperplane). This distance is called the "margin". The SVM classifier computes the optimal hyperplane that categorizes a set of labeled training data into the correct classes. The labeled training set is given as follows:

$$D = \left\{ (\mathbf{x}\_i, y\_i) \middle| \mathbf{x}\_i \in \mathbb{R}^n \; , \; y\_i \in \{-1, 1\}, \; i = 1, \dots, l \right\}. \tag{21}$$

where *xi* are the training feature vectors and *yi* ∈ {−1, 1} are the corresponding labels of the *l* training samples. An SVM tries to find the hyperplane that distinguishes the samples with the smallest error. The classification function is obtained by calculating the distance between the input vector and the hyperplane:

$$\mathbf{w} \cdot \mathbf{x}\_i - b = C\_{f,i},\tag{22}$$

where *w* and *b* are the parameters of the model. Shen et al. [108] proposed Gabor filters to extract the face features and applied an SVM for classification. The FaceNet method achieves record accuracies of 99.63% and 95.12% on the LFW and YouTube Faces DB datasets, respectively.
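
A minimal scikit-learn sketch of this formulation, using a hypothetical toy training set of our own; `SVC` with a linear kernel finds the maximum-margin hyperplane and exposes the support vectors of Figure 14:

```python
import numpy as np
from sklearn.svm import SVC

# Toy labeled training set D = {(x_i, y_i)}, y_i in {-1, 1}, as in Equation (21).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # linear SVM: maximize the margin
clf.fit(X, y)

print(clf.support_vectors_)        # the support vectors (Figure 14)
print(clf.coef_, clf.intercept_)   # hyperplane parameters w and b
print(clf.predict([[4.0, 4.0]]))   # classify by the side of the hyperplane
```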


**Figure 14.** Optimal hyperplane, support vectors, and maximum margin.

**Figure 15.** Artificial neural network.

Various variants of neural networks have been developed in recent years, such as convolutional neural networks (CNNs) [14,110] and recurrent neural networks (RNNs) [111], which are very effective for image detection and recognition tasks. CNNs are a very successful deep model and are used today in many applications [112]. From a structural point of view, CNNs are made up of three different types of layers: convolution layers, pooling layers, and fully-connected layers. The pooling layers reduce the spatial size of the feature maps using one of two common operations (a small sketch follows the list):

	- Average-pooling takes all the elements of the sub-matrix, calculates their average, and stores the value in the output matrix.
	- Max-pooling searches for the highest value found in the sub-matrix and saves it in the output matrix.
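
A minimal NumPy sketch of both pooling operations on a single feature map (the helper name and the example map are our own):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling of a 2D feature map with stride equal to size."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # max-pooling: highest value per sub-matrix
    return blocks.mean(axis=(1, 3))      # average-pooling: mean of each sub-matrix

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)
print(pool2d(fmap, mode="max"))  # [[4. 2.] [2. 8.]]
print(pool2d(fmap, mode="avg"))  # [[2.5  1.  ] [1.25 6.5 ]]
```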

Wen et al. [113] introduce a new supervision signal, called center loss, for the face recognition task in order to improve the discriminative power of the deeply learned features (a minimal sketch of this loss is given at the end of this subsection). Specifically, the proposed center loss function is trainable and easy to optimize in CNNs. Several important face recognition benchmarks are used for evaluation, including LFW, YTF, and the MegaFace Challenge. Passalis and Tefas [114] propose a supervised codebook learning method for the bag-of-features representation that is able to learn face retrieval-oriented codebooks. This allows significantly smaller codebooks to be used, improving both retrieval time and storage requirements. Liu et al. [115] and Amato et al. [116] propose deep face recognition techniques under the open-set protocol based on CNNs. A face dataset composed of 39,037 face images belonging to 42 different identities is used to perform the experiments. Taigman et al. [117] present a system (DeepFace) able to outperform existing systems with only very minimal adaptation; it is trained on a large dataset of faces acquired from a population vastly different from the one used to construct the evaluation benchmarks, and achieves an accuracy of 97.35% on LFW. Ma et al. [118] introduce a robust local binary pattern (LBP) guiding pooling (G-RLBP) mechanism to improve the recognition rates of CNN models, which can successfully lower the impact of noise.

Koo et al. [119] propose a multimodal human recognition method that uses both the face and the body and is based on a deep CNN. Cho et al. [120] propose a nighttime face detection method based on a CNN for visible-light images. Koshy and Mahmood [121] develop deep architectures for face liveness detection that use a combination of texture analysis and a CNN to classify a captured image as real or fake. Elmahmudi and Ugail [122] examine the performance of machine learning for face recognition using partial faces and other manipulations of the face, such as rotation and zooming, as training and recognition cues. Their experimental results on the tasks of face verification and face identification show that the model obtained by the proposed DNN training framework achieves 97.3% accuracy on the LFW database with low training complexity.

Seibold et al. [123] propose a morphing attack detection method based on DNNs. A fully automatic face image morphing pipeline with exchangeable components was used to generate morphing attacks, train neural networks on these data, and analyze their accuracy. Yim et al. [124] propose a new deep architecture based on a novel type of multitask learning, which can achieve superior performance in rotating a face image of arbitrary pose and illumination to a target pose while preserving identity. Nguyen et al. [111] propose a new approach for detecting presentation attack face images to enhance the security of face recognition systems; the objective of this study is to use a very deep stacked CNN–RNN network to learn discriminative features from a sequence of face images. Finally, Bajrami et al. [125] present experimental results with LDA and DNN for face recognition, with efficiency and performance tested on the LFW dataset. Their results show that the DNN method achieves better recognition accuracy, and its recognition time is much faster than that of the LDA method on large-scale datasets.
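
As referenced above, a minimal NumPy sketch of the center loss of Wen et al. [113], which penalizes the distance between each deep feature and the center of its class; the feature values and centers here are hypothetical, and in practice the centers are learned jointly with the network parameters:

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2 (Wen et al. [113])."""
    diffs = features - centers[labels]  # x_i - c_{y_i} for each sample i
    return 0.5 * np.sum(diffs ** 2)

# Hypothetical 2D deep features for two classes and their class centers.
feats = np.array([[1.0, 1.0], [0.8, 1.2], [4.0, 4.1]])
labels = np.array([0, 0, 1])
centers = np.array([[0.9, 1.1], [4.0, 4.0]])
print(center_loss(feats, labels, centers))  # 0.025 -> compact, discriminative classes
```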

## *6.3. Databases Used*

The most commonly used databases for face recognition systems are the Pointing Head Pose Image Database (PHPID) [126], Labeled Faces in the Wild (LFW) [127], FERET [15,16], ORL, and Yale. These databases cover different acquisition conditions and provide data for both supervised and unsupervised learning. Supervised learning is based on two training settings: the image-restricted training setting and the image-unrestricted training setting. In the first setting, only "same" or "not same" binary labels are provided in the training splits; in the second, the identities of the persons in each pair are also provided in the training splits.


a controlled environment. The images were taken frontally, with different facial expressions, under three different lighting conditions, and with several accessories: scarves, glasses, or sunglasses. Two imaging sessions were performed with the same subjects, 14 days apart. The images have a resolution of 576 × 768 pixels and a depth of 24 bits, in RGB RAW format.

