*Article* **A Momentum-Based Local Face Adversarial Example Generation Algorithm**

**Dapeng Lang 1,2, Deyun Chen 1,\*, Jinjie Huang 1 and Sizhao Li 2**


**\*** Correspondence: chendeyun@hrbust.edu.cn (D.C.)

**Abstract:** Small perturbations can make deep models fail. Since deep models are widely used in face recognition systems (FRS) such as surveillance and access control, adversarial examples may introduce more subtle threats to face recognition systems. In this paper, we propose a practical white-box adversarial attack method. The method can automatically form a local area with key semantics on the face. The shape of the local area generated by the algorithm varies according to the environment and the lighting of the subject. Since these regions contain major facial features, we generated patch-like adversarial examples based on this region, which can effectively deceive FRS. The algorithm introduces a momentum parameter to stabilize the optimization direction, and we accelerated the generation process by increasing the learning rate in segments. Compared with traditional adversarial algorithms, our perturbations are very inconspicuous, which makes them well suited to application in real scenes. The attack was verified on the CASIA WebFace and LFW datasets and was also shown to have good transferability.

**Keywords:** adversarial examples; face recognition; mask matrix; targeted attack; non-targeted attack

**Citation:** Lang, D.; Chen, D.; Huang, J.; Li, S. A Momentum-Based Local Face Adversarial Example Generation Algorithm. *Algorithms* **2022**, *15*, 465. https://doi.org/ 10.3390/a15120465

Academic Editors: Francesco Bergadano and Giorgio Giacinto

Received: 24 October 2022 Accepted: 2 December 2022 Published: 8 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

## **1. Introduction**

## *1.1. Introduction*

In the field of computer vision, deep learning has become a major technology for applications such as self-driving cars, surveillance, and security. Face verification [1] and face recognition [2] have outperformed humans. The recently proposed ArcFace [3] is an improvement on the previous face recognition model, which uses the loss function in angle space to replace the one in the CosFace [4] model. Earlier, the loss of the Euclidean distance space was used in the FaceNet [5] model. Furthermore, in some face recognition competitions such as the Megaface competition, ArcFace models are comparable to those of Microsoft and Google, and the accuracy rate reached 99.936%. Moreover, many open-source datasets such as LFW [6], CASIA-WebFace [7], etc. are available to researchers.

Despite the extraordinary success of deep neural networks, adversarial attacks against deep models also pose a huge threat in computer vision tasks such as face recognition [8] and person detection [9]. Szegedy, C. [10] and Goodfellow, I.J. [11] demonstrated, both in principle and through experiments, that adversarial examples are an inherent property of deep models, and proposed a series of classical algorithms. Dong, Y. [12] proposed the momentum algorithm on this basis, which is also one of the research bases of this paper.

However, fine adversarial noise over the whole image is not easy to realize, yet the adversarial patch is an excellent option. Adversarial patches are overlaid on an image and lead to misclassification or undetected recognition by highlighting salient features of the object classification [13]. In detection and classification tasks, adversarial patches can be placed on the target or the background, regardless of the location [14]; a sticker on a traffic sign may cause the misclassification of traffic signs [15]; Refs. [16,17] make it impossible for a detector to detect the wearer by creating wearable adversarial clothing (like a T-shirt or jacket). Ref. [18] is a very powerful attack that uses adversarial glasses to deceive both digital and physical face recognition systems. Based on this idea, researchers turned to the application of adversarial patches in the field of face recognition and achieved a high success rate [19]. Therefore, adversarial examples are a non-negligible threat in the security field and have received a lot of attention.

## *1.2. Motivations*

There are numerous methods for targeting face recognition models, and many of them have been validated in real scenarios. Ref. [11] proposed that the perturbation direction is the direction of the gradient of the predicted target category label; in addition, a GAN-based AGN [1] generates an ordinary eyeglass frame sticker to attack the VGG model. Ref. [3] proposed a new, simple, and replicable method to attack the best public Face ID system, ArcFace. Adversarial patches generally have a fixed position and visible scale, and also need to consider deformation and spatial mapping [7].

The second idea is rooted in the pixel level, which tricks the FRS with subtle perturbations. As previously described, generating adversarial examples against the full image ignores the semantic information within faces [9]. Such algorithms theoretically validate the feasibility of the attack, but are too restrictive in terms of the environmental requirements, making it difficult to realize. Meanwhile, existing algorithms launch undifferentiated attacks on all the targets in the picture. In real scenes, there are multiple objects in the complex background and foreground, and attacking multiple objects at the same time makes it easy to attract the attention of defenders. To address the above problems, we propose an adversarial example generation algorithm that targets local areas with distinctive facial features.

## *1.3. Contributions*

As shown in Figure 1, our algorithm combines the advantages of adversarial patches and perturbations, generating invisible adversarial examples in the form of a patch. We first extracted a face from the image, and then generated the adversarial example based on the local key features of the face. The adversarial example can be targeted or non-targeted, which can effectively mislead FRS.

The work in this paper is as follows. The adversarial perturbation was restricted to a local facial area with key semantics, generated automatically from the detected eye region; the learning rate was increased in segments to accelerate convergence; and the momentum parameter was introduced to avoid the algorithm oscillating near the best point, which improved the attack efficiency.


## **2. Preliminaries**

## *2.1. Deep Model of Face Recognition*

DeepFace [1] is the first near-human accuracy model using Labeled Faces in the Wild (LFW) [20] and applies neural networks to face recognition models with nine layers to extract the face vectors. FaceNet [5] computes the Euclidean distance of the feature vectors of face pairs by mapping the face images into the feature space. In addition, they introduced triplet loss as a loss function so that after training, the distance of matched face pairs with the same identity would be much smaller than the distance of unmatched face pairs with different identities [4]. Sphereface [21] uses angular softmax loss to achieve the requirement of "maximum intra-class distance" to be less than "minimum inter-class distance" in the open-set task of face recognition. ArcFace [3] introduces additive angular margin loss, which can effectively obtain face features with high discrimination. The main approach is to add the angle interval *m* to the *θ* between *x<sup>i</sup>* and *Wij* to penalize the angle between the deep features and their corresponding weights in an additive manner. The equation is as follows:

$$L\_3 = -\frac{1}{N} \sum\_{i=1}^{N} \log \frac{e^{s(\cos(\theta\_{y\_i} + m))}}{e^{s(\cos(\theta\_{y\_i} + m))} + \sum\_{j=1, j \neq y\_i}^{n} e^{s \cdot \cos \theta\_j}} \tag{1}$$
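For reference, the following is a minimal PyTorch sketch of the additive angular margin loss in Equation (1); the scale `s`, margin `m`, and the normalization steps are assumptions based on the equation rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, weights, labels, s=64.0, m=0.5):
    """Additive angular margin loss, following Equation (1).

    embeddings: (N, d) face features x_i; weights: (n_classes, d) class centers W_j.
    The margin m is added only to the angle of the ground-truth class y_i.
    """
    # Normalize features and class weights so their dot product is cos(theta)
    emb = F.normalize(embeddings, dim=1)
    w = F.normalize(weights, dim=1)
    cos_theta = emb @ w.t()                                   # (N, n_classes)
    theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(labels, num_classes=w.size(0)).float()
    logits = s * torch.cos(theta + m * one_hot)               # s * cos(theta_{y_i} + m) for the true class
    return F.cross_entropy(logits, labels)                    # -1/N * sum of log softmax terms
```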

## *2.2. Classic Adversarial Attacks Algorithms*

Adversarial examples are delicately designed, human-imperceptible perturbations of the input that lead to incorrect classifications [9]. The generation principle is shown in the following equation:

$$X' = X + \varepsilon \cdot \operatorname{sign}(\nabla\_X L(f(\mathbf{x}), y)) \tag{2}$$

where *ε* is set empirically and controls the magnitude of the update (the step size). *L*(*f*(*x*), *y*) is the loss function of the model for image *x* with label *y*. The input is updated by back-propagating the gradient ∇*xL*(*f*(*x*), *y*), and the *sign*(·) function is used to calculate the update direction.

The fast gradient sign method (FGSM) is a practical algorithm for the fast generation of adversarial examples proposed by Goodfellow et al. [11]. To improve the transferability of adversarial examples, Dong et al. [12] proposed the momentum iterative fast gradient sign method by adding a momentum term to the BIM, which prevents the optimization from falling into local optima and overfitting. The C&W [13] attack is a popular white-box attack algorithm that generates adversarial examples with high image quality and transferability, and is very difficult to defend against. Lang et al. [22] proposed the use of the attention mechanism to guide the generation of adversarial examples.
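As an illustration of the single-step update in Equation (2), the following is a minimal sketch; the model, loss function, ε value, and pixel range are placeholders.

```python
import torch

def fgsm_step(model, x, y, eps=0.03, loss_fn=torch.nn.CrossEntropyLoss()):
    """One FGSM update X' = X + eps * sign(grad_X L(f(X), y)), as in Equation (2)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()                               # back-propagate to obtain the gradient w.r.t. the input
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()   # step in the sign of the gradient
    return x_adv.clamp(0, 1).detach()             # keep pixels in a valid range (assumed [0, 1] here)
```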

## *2.3. Adversarial Attacks on Face Recognition*

The attack on the face not only needs to deceive the deep model but also requires the semantic expression of the attack method. Ref. [23] studied an off-the-shelf physical attack in which the digital adversarial pattern is projected onto the attacker's face in the physical domain, so as to implement the attack on the system. Komkov et al. [19] attached printed colored stickers on hats, called AdvHat, as shown in Figure 2.


**Figure 2.** AdvHat can launch an attack on facial recognition systems in the form of a hat.

In the context of the COVID-19 epidemic, Zolfi et al. [24] used universal adversarial perturbations to print scrambled patterns on medical masks and deceived face recognition models. Yin et al. [25] proposed the face adversarial attack algorithm of the Adv-Makeup framework, which implemented a black-box attack with imperceptible properties and good mobility. The authors in [26] used a generation model to improve the portability of adversarial patches in face recognition. This method not only realized the digital adversarial example but also achieved success in the physical world. In [27], they generated adversarial patches based on FGSM. The effectiveness of the attack was proven by a series of experiments with different numbers and sizes of patches. However, the patch was still visible and still did not take into account the feature information of the face. The study in [28] introduced adversarial noise in the process of face attribute editing and integrated it into the high-level semantic expression process to make the example more hidden, thus improving the transferability of adversarial attacks.


## **3. Methodology and Evaluations**

## *3.1. Face Recognition and Evaluation Matrix*

We used a uniform evaluation metric to measure whether a face pair matched or not. A positive sample pair is a matched face pair with the same identity; a negative sample pair is a mismatched face pair. To evaluate the performance of the face recognition model, the following concepts are introduced.

The True Positive Rate (TPR) is calculated as follows:

$$TPR = \frac{TP}{TP + FN} \tag{3}$$

where *TP* denotes a matched face pair that is correctly predicted as matched, and *FN* denotes a matched face pair that is incorrectly predicted as mismatched. *TPR* is the proportion of correctly predicted positive samples among all positive samples, which is the proportion of correctly predicted matched face pairs among all matched face pairs.

The False Positive Rate (FPR) is calculated as follows:

$$FPR = \frac{FP}{FP + TN} \tag{4}$$

where *FP* denotes a face pair whose true label is mismatched and is incorrectly predicted as a matched face pair. *TN* denotes a face pair whose true label is mismatched and is correctly predicted as a mismatched face pair. *FPR* is the probability that the incorrectly predicted negative samples account for all negative samples, and in the face recognition scenario is the probability that the incorrectly predicted mismatched face pairs account for all mismatched face pairs.

Therefore, the accuracy rate (Acc) of the face recognition model is calculated as follows:

$$Acc = \frac{TP + TN}{TP + FN + TN + FP} \tag{5}$$

That is, the accuracy of the face recognition model is the ratio of the number of correctly predicted face pairs to the total number of face pairs. Meanwhile, we chose five face recognition models with different network architectures for validation. These networks are described in the following sections.
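For reference, a minimal sketch of Equations (3)–(5), assuming boolean arrays marking the ground-truth matches and the model's predictions for a set of face pairs.

```python
import numpy as np

def verification_metrics(is_match, pred_match):
    """TPR, FPR, and Acc of Equations (3)-(5) for a set of face pairs."""
    is_match = np.asarray(is_match).astype(bool)
    pred_match = np.asarray(pred_match).astype(bool)
    tp = np.sum(is_match & pred_match)        # matched pairs predicted as matched
    fn = np.sum(is_match & ~pred_match)       # matched pairs predicted as mismatched
    fp = np.sum(~is_match & pred_match)       # mismatched pairs predicted as matched
    tn = np.sum(~is_match & ~pred_match)      # mismatched pairs predicted as mismatched
    tpr = tp / (tp + fn)                      # Equation (3)
    fpr = fp / (fp + tn)                      # Equation (4)
    acc = (tp + tn) / (tp + fn + tn + fp)     # Equation (5)
    return tpr, fpr, acc
```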

## *3.2. Adversarial Attacks against Faces*

Adversarial attacks are classified into non-targeted attacks and targeted attacks. An intuitive way to decide whether two faces match is to set a threshold: if the distance between the two face feature vectors is less than the threshold, the two faces are considered to be from the same person, and vice versa. It is obviously more difficult to make the FRS mistake the target face for another designated one [18].

Suppose that for input *x*, the true label *f*(*x*) = *y* is output by the classification model *f*. The purpose of the adversarial attack is to generate an adversarial example *x^adv* by adding a small perturbation, such that $\|x^{adv} - x\|\_p \leq \varepsilon$, where *p* can be 0, 1, 2, or ∞.

For the non-targeted attack, the generated adversarial example makes $f(x^{adv}) \neq y$, so the result of the classifier differs from the original label; for the targeted attack, it makes $f(x^{adv}) = y^{*}$, where $y^{*} \neq y$ is a previously defined specific class.

## *3.3. Evaluation Indices of Attack*

Our goal is to generate adversarial patches to deceive FRS within a small area of the human face. The patch is generated by optimizing the pixels in the area, changing the distance between pairs of faces. The smaller the patch, the less likely it is to be detected by defenders. We explain the process of generating these patches.

1. Cosine similarity is the cosine of the angle between two vectors. Given vector *X* and vector *Y*, their cosine similarity is calculated as follows.

$$\cos(X, Y) = \frac{X \cdot Y}{\|X\| \|Y\|} = \frac{\sum\_{i=1}^{n} X\_i Y\_i}{\sqrt{\sum\_{i=1}^{n} X\_i^2} \sqrt{\sum\_{i=1}^{n} Y\_i^2}} \tag{6}$$

where *X<sup>i</sup>* and *Y<sup>i</sup>* are the individual elements of vector *X* and vector *Y*, respectively. The cosine similarity takes values in the range [–1, 1], and the closer the value is to 1, the closer the orientation of these two vectors (i.e., the more similar the face feature vectors). Cosine similarity can visually measure the similarity between the adversarial example and the clean image.

2. Total variation (*TV*) [19], as a regular term loss function, reduces the variability of neighboring pixels and makes the perturbation smoother. Additionally, since perturbation smoothness is a prerequisite property for physical realizability against attacks, this lays some groundwork for future physical realizability [18]. Given a perturbation noise *r*, *ri*,*<sup>j</sup>* is the pixel where the perturbation *r* is located at coordinate (*i*, *j*). The *TV*(*r*) value is smaller when the neighboring pixels are closer (i.e., the smoother the perturbation, and vice versa). The *TV* is calculated as follows:

$$TV(r) = \sum\_{i,j} \left( (r\_{i,j} - r\_{i+1,j})^2 + (r\_{i,j} - r\_{i,j+1})^2 \right)^{\frac{1}{2}} \tag{7}$$

3. We used the *L*<sup>2</sup> constraints to measure the difference between the original image and the adversarial example. *L*<sup>2</sup> is used as a loss function to control the range of perturbed noise. In the application scenario of attacking, it can be intuitively interpreted as whether the modified pixels will attract human attention.

Given vector X and vector Y, their *L*<sup>2</sup> distances (i.e., Euclidean distances) can be calculated as follows:

$$\|X - Y\|\_2 = \sqrt{\sum\_{i=1}^{n} (x\_i - y\_i)^2} \tag{8}$$

where *x<sup>i</sup>* and *y<sup>i</sup>* are the elements of the input vector X and the output vector Y, respectively. The larger the *L*<sup>2</sup> distance between the two vectors, the greater their difference.
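The three quantities in Equations (6)–(8) can be sketched as follows; the feature vectors are assumed to be 1-D tensors and the perturbation a (C, H, W) tensor.

```python
import torch

def cosine_similarity(x, y):
    """Equation (6): cosine of the angle between two feature vectors."""
    return torch.dot(x, y) / (x.norm() * y.norm())

def total_variation(r):
    """Equation (7): smoothness of a perturbation r with shape (C, H, W)."""
    dh = (r[:, :-1, :] - r[:, 1:, :]) ** 2        # squared vertical neighbor differences
    dw = (r[:, :, :-1] - r[:, :, 1:]) ** 2        # squared horizontal neighbor differences
    return torch.sqrt(dh[:, :, :-1] + dw[:, :-1, :] + 1e-12).sum()

def l2_distance(x, y):
    """Equation (8): Euclidean distance between two vectors."""
    return torch.sqrt(((x - y) ** 2).sum())
```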

## **4. Our Method**

## *4.1. Configurations for Face Adversarial Attack*

After the image preprocessing, we extracted the features from the two face images and calculated their distance. For the input face image *x*, the face recognition model *f* extracted the features. For the input face pairs {*x*1, *x*2}, the face feature vector *f*(*x*1) and *f*(*x*2) were mapped to 512-dimensional feature vectors, respectively.

Therefore, we compared the distance of *f*(*x*1) and *f*(*x*2) with the specified threshold to determine whether the face pair matched or not. We calculated the angular distance by cosine similarity, which is as follows.

$$Similarity = \cos(f(\mathbf{x}\_1), f(\mathbf{x}\_2)) \tag{9}$$

$$D(f(\mathbf{x}\_1), f(\mathbf{x}\_2)) = \frac{\arccos(Similarity)}{\pi} \tag{10}$$

where cos(·, ·) is the cosine similarity of the feature vector of the face pair in the range of [−1, 1]. Therefore, *D*(*f*(*x*1), *f*(*x*2)), based on the cosine similarity, ranged from [0, 1]. The closer the distance is to 0, the more similar the face feature vector and the more likely the face pair is matched, and vice versa. Equation (11) is used to predict the matching result of the face pair.

$$C(\mathbf{x}\_1, \mathbf{x}\_2) = \mathbb{I}(D(f(\mathbf{x}\_1), f(\mathbf{x}\_2)) < threshold) \tag{11}$$

where I(·) is the indicator function; the threshold is the baseline of the detection model, which differs depending on the model used. *C*(·, ·) outputs the matching result: if *C* = 1, the face pair is matched; if *C* = 0, the face pair is not matched. A unified attack model is established based on the I(·) indicator function to implement targeted and non-targeted attacks. The flow of face pair recognition based on the threshold comparison is shown in Figure 3.


**Figure 3.** Schematic diagram of the face feature vector and threshold comparison.
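As a reference for Equations (9)–(11), the matching decision can be sketched as follows; the feature extractor `f` and the threshold value are placeholders.

```python
import math
import torch
import torch.nn.functional as F

def face_pair_match(f, x1, x2, threshold):
    """Equations (9)-(11): angular distance of two face features vs. a model threshold."""
    v1, v2 = f(x1).flatten(), f(x2).flatten()                       # 512-dimensional feature vectors
    similarity = F.cosine_similarity(v1, v2, dim=0)                 # Equation (9)
    distance = torch.acos(similarity.clamp(-1.0, 1.0)) / math.pi    # Equation (10), in [0, 1]
    return int(distance < threshold)                                # Equation (11): 1 = matched, 0 = not matched
```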


## *4.2. Local Area Mask Generation*

The human eye region contains critical semantic information despite its small area [23]. The local region matrix is generated according to the human eye position as the constraint range for the adversarial perturbation. Due to various poses, illumination, and occlusions, we applied a deeply cascaded multitasking framework to integrate face detection and alignment through multitask learning. First, since images have different sizes, the key points of the extracted face were affined to the unit space using an affine transformation to unify the size and coordinate system. The detection and alignment of faces were accomplished by building a multi-level CNN structure containing three stages. Candidate windows were quickly generated by a shallow CNN. Then, the windows were optimized by more complex CNNs to discard a large number of non-facial windows. Finally, the results were refined by a more powerful CNN, which output the facial marker positions.

The algorithm flow is shown in Figure 4. In Figure 4b, for a given image, we first adjusted it to different scales to construct the image pyramid. In Figure 4c, we referred to the method in [29] to obtain the candidate windows and their bounding box regression vectors. The estimated bounding box regression vectors were then used to calibrate the candidate boxes; in Figure 4d, we used non-maximal suppression (NMS) to merge the highly overlapping candidate objects. In Figure 4e, all candidate frames were used as input to the CNN of the optimization network, which further discarded a large number of incorrect candidate frames, calibrated them using bounding box regression, and merged the NMS candidate frames. Finally, Figure 4f shows the CNN-based classification network that generated a more detailed description of the faces and outputs the critical facial positions.

**Figure 4.** The process of face recognition and face alignment. (**a**) Clean image of a child. (**b**) Image pyramid. (**c**) Bounding box regression. (**d**) Merging the candidate objects. (**e**) Face location. (**f**) Key point location.

Pixels are randomly sampled within the range of key points as the corresponding feature pixel [29]. The feature pixels select the closest initial key point as the anchor and calculate the deviation. The coordinate system of the current pixel after rotation, transformation, and scaling should be close to the initial key point. It acts on the deviation and adds its own position information to obtain the feature pixel of the current key point. Then, we constructed the residual tree and calculated the deviation of the current key point from the target key point. We split the sample and updated the current key point position based on the average residual of the sample. Back to the previous step, it reselected the feature key points, fit the next residual tree, and finally combined the results of all residual trees to obtain the key point locations. According to the default settings, the coordinates of the points 0, 28, 16, 26–17 in the image for the human eye area are shown in Figure 5.


**Figure 5.** Schematic diagram of key point detection for human face. (**a**) Sixty-eight key points on a human face. (**b**) Face that needs to be matched. (**c**) Key points and face fitting.

We located the human eye region based on the key points of the detected eye in the image and drew the mask against the attacked region. The range of pixel values in the generated mask image was normalized to [0.0, 1.0] to generate a binary-valued mask matrix. This is shown in Figure 6.

**Figure 6.** Schematic diagram of eye area matrix generation. (**a**) Locating key areas of the human eye. (**b**) Generating a human eye area mask.
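The mask construction described above can be sketched as follows. This sketch uses dlib's 68-point shape predictor and an OpenCV polygon fill as assumed stand-ins for the cascaded detection and regression-tree landmark steps; the predictor file path is hypothetical, and the vertex order 0, 28, 16, 26–17 follows the key-point indices mentioned in the text.

```python
import cv2
import dlib
import numpy as np

# Assumed local copy of the standard 68-point landmark model.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_area_mask(image_bgr):
    """Build a binary {0, 1} mask over the eye area from landmarks 0, 28, 16, 26-17 (Figure 6 style)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    rect = detector(gray, 1)[0]                       # take the first detected face (sketch: no error handling)
    shape = predictor(gray, rect)
    idx = [0, 28, 16] + list(range(26, 16, -1))       # polygon enclosing the brows/eyes band
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in idx], dtype=np.int32)
    mask = np.zeros(gray.shape, dtype=np.uint8)
    cv2.fillPoly(mask, [pts], 1)                      # 1 inside the eye region, 0 elsewhere
    return mask.astype(np.float32)
```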

We generated adversarial examples combining the eye region matrix and full face region, respectively, to test the effect of the attacks. Figure 7a shows the clean image used for testing while Figure 7b shows the visualization of the perturbation based on the eye region and the full-face region. Figure 7c is the adversarial example. After testing, both images could successfully deceive the face detector. The adversarial perturbation generated based on the human eye region accounted for 17.8% of the total pixels, while the number of pixels of the adversarial perturbation generated based on the whole face accounted for 81.3% of the image.

**Figure 7.** Schematic diagram of key point generation matrix based on eye detection. (**a**) Clean image of a child. (**b**) Adversarial perturbation based on the human eye and full face. (**c**) Adversarial examples based on the human eye and full face.


## *4.3. Loss Functions*

As mentioned above, this algorithm optimizes the *C*(*x*, *x^adv*) function in the local region. For the targeted attack and non-targeted attack, the relationship among the three images, namely the clean face image *x*, the target image *x^tar*, and the adversarial example image *x^adv*, was compared.

(1) For the non-targeted attack, an adversarial example *x^adv* was generated for the input image *x* so that the difference between them was as large as possible. When the difference was larger than the threshold value calculated by the deep detection model, the attack was successful; on the other hand, for the targeted attack, the generated adversarial example *x^adv* needed to be as similar as possible to the target image *x^tar*. The loss function L*oss*<sup>1</sup> is shown in Equation (12).

$$\mathcal{L}oss\_1 = \alpha \cdot \cos\left(f(x), f\left(x^{adv}\right)\right) - (1 - \alpha) \cdot \cos\left(f\left(x^{adv}\right), f\left(x^{tar}\right)\right) \tag{12}$$

where *cos*(·, ·) is the cosine similarity of the feature vector calculated by Equation (10); *α* takes the value of 0 or 1, representing the non-targeted attack and targeted attack, respectively.

(2) The perturbation size is constrained by the *L*<sup>2</sup> norm, thus ensuring that the visibility of the perturbation is kept within a certain range when an effective attack is implemented. The loss function in this section constrains the perturbation after the restriction as follows.

$$\mathcal{L}oss\_2 = L\_2(mask \odot r) \tag{13}$$

where *r* is the perturbation. The mask is generated from the first face image of the face pair to restrict the perturbation region. It is a [0, 1] matrix scaled to the same size as the image. The ⊙ symbol indicates the element-wise product.

(3) The *TV* is used to improve the smoothness of the perturbation through Equation (14), and the loss function of this part also deals with the perturbation after restriction, as follows.

$$\mathcal{L}oss\_3 = TV(mask \odot r) \tag{14}$$

In summary, for the above targeted and non-targeted attacks, the loss function is minimized by solving the following optimization problem of Equation (15), which can generate the final adversarial perturbation *r* :

$$\min\_{r} \mathcal{Loss} = \min\_{r} (\lambda\_1 \mathcal{Loss}\_1 + \lambda\_2 \mathcal{Loss}\_2 + \lambda\_3 \mathcal{Loss}\_3) \tag{15}$$

The hyperparameters *λ*1, *λ*2, and *λ*<sup>3</sup> are used to control the relative weights of the perturbation losses. The correlation coefficients of the two regular term loss functions L*oss*<sup>2</sup> and L*oss*<sup>3</sup> are gradually reduced as the number of iterations increases.
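A minimal sketch of the combined objective in Equations (12)–(15); the feature extractor `f`, the weights λ, and the trade-off α are placeholders, and the L2 and TV terms are inlined following Equations (7) and (8).

```python
import torch
import torch.nn.functional as F

def attack_loss(f, x, x_adv, x_tar, mask, r, alpha, lambdas=(1.0, 0.01, 0.01)):
    """Combined objective of Equations (12)-(15) on the masked perturbation r."""
    cos = lambda a, b: F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
    # Eq. (12): trade-off between distance from the clean face and similarity to the target face
    loss1 = alpha * cos(f(x), f(x_adv)) - (1 - alpha) * cos(f(x_adv), f(x_tar))
    mr = mask * r                                           # restrict the perturbation to the eye area
    loss2 = torch.sqrt((mr ** 2).sum())                     # Eq. (13): L2 norm of the masked perturbation
    loss3 = torch.sqrt((mr[:, :-1, :-1] - mr[:, 1:, :-1]) ** 2
                       + (mr[:, :-1, :-1] - mr[:, :-1, 1:]) ** 2 + 1e-12).sum()   # Eq. (14): TV term
    l1, l2, l3 = lambdas
    return l1 * loss1 + l2 * loss2 + l3 * loss3             # Eq. (15)
```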

## *4.4. Momentum-Based Optimization Algorithms*

To solve the optimization problem above, the adversarial perturbation is optimized by using an iterative gradient descent method to minimize the objective function. A momentum term is superimposed on the gradient direction and dynamically stabilizes the update direction in each iteration step [12].

In the updating process, because the number of update iterations differs for different scenes, we divided the updating process into several stages, and the learning rate of the different stages gradually decreased. The gradient is calculated as follows. Meanwhile, the learning rate *α*∆*<sup>t</sup>* is changed according to the number of iterations *it* and stages *st*.

$$\begin{aligned} grad &= \beta \cdot m\_t + \frac{\nabla\_x \mathcal{L}oss\left(\mathbf{x}^{adv}, y\right)}{||\nabla\_x \mathcal{L}oss\left(\mathbf{x}^{adv}, y\right)||\_1} \\ m\_{t+1} &= grad \\ r\_{t+1} &= r\_t - \alpha\_{\Delta t} \cdot grad \\ \alpha\_{\Delta t} &= \left(\frac{it}{st}\right)^{t} \alpha\_{\Delta(t-1)} + \alpha\_{\Delta(t-1)} \end{aligned} \tag{16}$$

where $\nabla\_x \mathcal{L}oss\left(x^{adv}, y\right) / ||\nabla\_x \mathcal{L}oss\left(x^{adv}, y\right)||\_1$ is the normalized representation of the gradient $\nabla\_x \mathcal{L}oss\left(x^{adv}, y\right)$. The parameter *β* is the decay factor, adjusting the influence of momentum on the gradient calculation. *r_t* is the adversarial perturbation generated in the *t*-th iteration. The parameter *α*∆*<sup>t</sup>* is dynamically adjusted and is related to the iterations *it* and the current stage *st*. If *it* is high, then *α*∆*<sup>t</sup>* can be set bigger. As *it* increases, *st* will become smaller.

$$\mathbf{x}\_{t+1}^{adv} = \text{clip}\_{\mathbf{x}, \varepsilon}(\mathbf{x} + \text{mask} \odot \mathbf{r}\_{t+1}) \tag{17}$$

where *clipx*,*ε*(·) serves to restrict the adversarial examples after superimposed perturbation to a reasonable range (after normalization) of [−1, 1] at the end of each iteration.

The final elaborate perturbation is processed and added to the original face image so that the final adversarial example is generated by restricting the perturbation to a reasonable range of [0, 255] using *clipx*,*ε*(·). The process is shown in Figure 8.

**Figure 8.** Flowchart of the adversarial attack based on the eye area.

Figure 8 shows that the feature vectors are first extracted from the aligned faces. The attack region is mapped by keypoint detection and the adversarial perturbation information is generated based on this region. The local aggressive perturbation is obtained through the optimization of the loss functions. This perturbation information can effectively mislead FRS.
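A minimal sketch of the masked, momentum-based update in Equations (16) and (17), reusing the `attack_loss` sketch above; the fixed learning rate `lr` stands in for the staged *α*∆*t* schedule, and all hyperparameter values are placeholders.

```python
import torch

def momentum_eye_attack(f, x, x_tar, mask, steps=100, beta=0.9, lr=0.01, alpha=0):
    """Iterative masked perturbation update following Equations (16)-(17)."""
    r = torch.zeros_like(x, requires_grad=True)
    momentum = torch.zeros_like(x)
    for _ in range(steps):
        x_adv = torch.clamp(x + mask * r, -1.0, 1.0)                 # Eq. (17): clip to the normalized range
        loss = attack_loss(f, x, x_adv, x_tar, mask, r, alpha)        # Eq. (15) objective sketched above
        grad, = torch.autograd.grad(loss, r)
        grad = beta * momentum + grad / (grad.abs().sum() + 1e-12)    # Eq. (16): L1-normalized gradient + momentum
        momentum = grad.detach()
        r = (r - lr * grad).detach().requires_grad_(True)             # descend on the perturbation
    return torch.clamp(x + mask * r.detach(), -1.0, 1.0)
```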

## **5. Experiments**


## *5.1. Datasets*

In this paper, we used CASIA WebFace [7] as the training dataset. All of the pictures are from movie websites and vary in light and angle. In order to verify the effect of the algorithm on different datasets, we chose LFW [22] as our test dataset. It provides 6000 test face pairs, of which 3000 are matched pairs and 3000 are mismatched pairs.

For face detection and alignment, MTCNN [30–34] was used to uniformly crop the images to 112 × 112. Experimentally, there were 500 pairs of matched faces with the same identity for non-targeted attacks and 500 pairs of unmatched faces with different identities for targeted attacks.
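For the cropping and alignment step above, a minimal sketch using the facenet-pytorch MTCNN implementation as an assumed stand-in; the input path is hypothetical and the 112 × 112 crop size matches the setting in the text.

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Assumed third-party MTCNN implementation for detection and aligned cropping.
mtcnn = MTCNN(image_size=112, margin=0)

img = Image.open("face.jpg")           # hypothetical input image path
face = mtcnn(img)                      # (3, 112, 112) face tensor, or None if no face is found
boxes, probs = mtcnn.detect(img)       # bounding boxes and detection confidences
```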

## *5.2. Performance Evaluation for Face Recognition Models*

Five mainstream pre-trained face recognition models were used for comparison, namely ResNet50-IR (IR50) [31], ResNet101-IR (IR101) [32], SEResNet50-IR (IR-SE50) [33], MobileFaceNet [8], and ResNet50 [33]. In order to show the success rate of adversarial attacks more intuitively, the metrics were the True Accept Rate (TAR) and False Accept Rate (FAR) [20]. Given a face pair (*x*1, *x*2), let the matched face pair be *P<sup>s</sup>* and the unmatched face pair be *P<sup>d</sup>*. Given a threshold, the calculation of *TAR* and *FAR* is as follows.

$$TA = \left\{ \begin{array}{l} (\mathbf{x}\_1, \mathbf{x}\_2) \in P\_{s}, \\ \text{with } D(f(\mathbf{x}\_1), f(\mathbf{x}\_2)) < threshold \end{array} \right\} \tag{18}$$

$$FA = \left\{ \begin{array}{l} (\mathbf{x}\_1, \mathbf{x}\_2) \in P\_{d}, \\ \text{with } D(f(\mathbf{x}\_1), f(\mathbf{x}\_2)) < threshold \end{array} \right\} \tag{19}$$

$$TAR = \frac{|TA|}{|P\_s|}\tag{20}$$

$$FAR = \frac{|FA|}{|P\_d|}\tag{21}$$

where |*TA*| is the number of all matched pairs whose distance is less than the threshold; |*Ps*| is the number of all matched pairs; |*FA*| is the number of all unmatched pairs whose distance is less than the threshold; and |*P<sup>d</sup>*| is the number of all unmatched face pairs.

Different models will have different thresholds that can objectively reflect the success rate of the attack. Accordingly, the threshold was determined according to different values of *FAR*, and was chosen when *FAR* = 1 × 10−2 or 1 × 10−3. We traversed the range of thresholds and used the 10-fold cross-validation method to find the threshold closest to the target *FAR*.
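A minimal sketch of the threshold search described above, based on Equations (18)–(21); it sweeps candidate thresholds and keeps the one whose FAR is closest to the target, with the 10-fold cross-validation loop omitted for brevity.

```python
import numpy as np

def find_threshold(distances, is_match, target_far=1e-2, num_steps=1000):
    """Pick the threshold whose FAR (Eq. 21) is closest to target_far, then report TAR (Eq. 20)."""
    distances = np.asarray(distances)
    is_match = np.asarray(is_match).astype(bool)
    best_thr, best_gap = None, np.inf
    for thr in np.linspace(distances.min(), distances.max(), num_steps):
        far = np.mean(distances[~is_match] < thr)        # |FA| / |P_d|
        if abs(far - target_far) < best_gap:
            best_gap, best_thr = abs(far - target_far), thr
    tar = np.mean(distances[is_match] < best_thr)         # |TA| / |P_s|
    return best_thr, tar
```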

As shown in Figure 9, the TPR of each model (i.e., the proportion of correctly predicted matched face pairs to all matched face pairs) was maintained above 96.5% when *FAR* = 1 × 10−6. When *FAR* = 1 × 10−2, the TPR reached more than 98.5%. This indicates that these backbone network models had excellent performance.


**Figure 9.** The ROC curve of FPR in the range of 1 × 10−6 to 1.

The test results of the five models on the LFW at a FAR of about 0.01 are shown in Table 1. The value of TAR@FAR = 0.01 (i.e., the probability of correctly identifying matching face pairs when the FAR is close to 0.01) was maintained at more than 98.9%.

**Table 1.** The TAR and corresponding thresholds for different models.

| Models | TAR (%) | FAR | Threshold (FAR ≈ 0.01) | Threshold (FAR ≈ 0.001) |
|---|---|---|---|---|
| IR50 | 99.596 | 0.00995 | 0.43326 | 0.20814 |
| IR101 | 98.984 | 0.01259 | 0.41311 | 0.26960 |
| IR-SE50 | 99.396 | 0.01025 | 0.42920 | 0.22060 |
| ResNet50 | 99.596 | 0.00836 | 0.45207 | 0.15001 |
| MobileFaceNet | 99.196 | 0.01076 | 0.43763 | 0.19469 |

## *5.3. Attack Method Evaluation Indicators*

The accuracy of a face recognition model intuitively reflects the predictive ability of the model. The attack success rate (ASR) is calculated as follows:

$$ASR = 1 - Acc \tag{22}$$

The higher the ASR, the more vulnerable the model is to adversarial attacks; the lower the ASR, the more robust the model is to adversarial attacks and is able to withstand a certain degree of adversarial attacks.

In order to evaluate the magnitude of the difference between the generated adversarial example and the original face image after the attack, this experiment used the peak signal-to-noise ratio (*PSNR*) and structural similarity (*SSIM*) [34], two metrics that measure the image quality of the adversarial example.

The *PSNR* is defined and calculated by the mean squared error (*MSE*). The following equation calculates the *PSNR* for a given image *I*.

$$PSNR = 20 \cdot \log\_{10}(MAX\_I) - 10 \cdot \log\_{10}(MSE) \tag{23}$$

where *MAX<sup>I</sup>* is the maximum pixel value of the image; *MSE* is the mean square error. The larger the *PSNR*, the less distortion and the better quality of the adversarial example [3].

Considering human intuition, we adopted the evaluation index of structural similarity (*SSIM*), which takes into account the three factors of brightness, contrast, and structure. Given images *x* and *y* with the same dimensions, the structural similarity is calculated as follows.

$$SSIM(\mathbf{x}, \mathbf{y}) = \frac{\left(2\mu\_{\mathbf{x}}\mu\_{\mathbf{y}} + c\_1\right)\left(2\sigma\_{\mathbf{xy}} + c\_2\right)}{\left(\mu\_{\mathbf{x}}^2 + \mu\_{\mathbf{y}}^2 + c\_1\right)\left(\sigma\_{\mathbf{x}}^2 + \sigma\_{\mathbf{y}}^2 + c\_2\right)}\tag{24}$$

Among them, *µx*, *µ<sup>y</sup>* are the mean values of image *x*, *y*; *σ* 2 *x* , *σ* 2 *<sup>y</sup>* are the variance of image *x*, *y*; *σxy* is the covariance, and *c*<sup>1</sup> and *c*<sup>2</sup> are used to maintain stability. *SSIM* takes values in the range of [−1, 1], and the closer the value is to 1, the higher the structural similarity between the adversarial example and the original image. To a certain extent, it can indicate the more imperceptible the perturbation applied to the adversarial example is to humans.
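As a reference for Equations (23) and (24), a minimal sketch computing PSNR and a single global SSIM value; the stabilizing constants c1 and c2 follow the common convention for 8-bit images and are an assumption.

```python
import numpy as np

def psnr(img1, img2, max_val=255.0):
    """Equation (23): peak signal-to-noise ratio from the mean squared error."""
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

def ssim_global(x, y, max_val=255.0):
    """Equation (24), evaluated once over the whole image (no sliding window)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2   # assumed stabilizing constants
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```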

## *5.4. Adversarial Attack within Human Eye Area*

## 5.4.1. Non-Targeted Attacks Based on Eye Area

A schematic diagram of the non-targeted attack is shown in Figure 10. The first column shows the face pair before the attack. To the human eye, there is no difference between the second image in Figure 10a,b, and the second image in Figure 10b is the adversarial example.

**Figure 10.** Before and after eye-based non-targeted attack. (**a**) Visualization of facial features of the same person from different angles. (**b**) Visualization of face features after being attacked.


*weightattention* in the fourth column indicates the attention of the model, where a darker color indicates that the model paid more attention to the area. It can be seen that there was no significant change in the attention hotspots before and after the attack. In the third column, the xCos [35] module visualizes the face pairs before and after the attack and visualizes the changes in the images from the perspective of the neural network parameters. A bluer color in the similarity plot *cospatch* indicates that they are more similar, and a redder color indicates that they are less similar. It can be seen that the face pairs changed dramatically after the attack.

## 5.4.2. Targeted Attacks Based on Eye Area

The purpose of the targeted attack is to deceive the deep detection model into misidentifying another specific face from the original image. As shown in Figure 11a, the similarity graph of the face pair before the attack had a large number of red grids, indicating that this pair was very dissimilar and was a mismatched face pair, while the model's attention focused on the eye area in the middle of the face. The first image in Figure 11b is the generated adversarial example; the second image is the target image. Intuitively, the first images in Figure 11a,b are exactly the same. This is also reflected in the attention map. However, for the face recognition model, the grid of the eye region in *cospatch* mostly changed to blue, and 43% of the regions changed from yellow to blue. This indicates that the image change affected the classification of the deep model.

**Figure 11.** Before and after the eye-based targeted attack. (**a**) Visualization of the features of different faces. (**b**) Feature visualization of different faces after the targeted attack.

## 5.4.3. Quantitative Comparison of Different Attack Models

To verify the effectiveness of the algorithm, we selected 500 faces for targeted and non-targeted attacks. Furthermore, each model has a different threshold for the best performance. The attack success rates of different attack models are shown in Table 2.

**Table 2.** The accuracy and success rates of different models under the specified thresholds.


| Models | ACC (%) | Targeted-ASR (%) | Non-Targeted-ASR (%) | Threshold |
|---|---|---|---|---|
| IR50 | 98.2 | 90.4 | 99.2 | 0.43326 |
| IR101 | 94.7 | 96.2 | 98.6 | 0.41311 |

Our method was compared with the traditional adversarial algorithms. While maintaining a high success rate, we compared the differences between the adversarial examples and the original images; the evaluated metrics were the image quality of the adversarial examples, namely the calculated average PSNR and SSIM. The perturbations of the adversarial examples generated by our algorithm for the different deep detection models were counted. A PSNR greater than 40 indicates that the image distortion was small; the closer the SSIM value is to 1, the closer the adversarial example is to the original image. The comparison results are shown in Table 3.

**Table 3.** Average PSNR, SSIM, and the perturbed parameters for different models.
As shown in Table 3, for different attacks, the PSNR of all models was above 40 dB and the SSIM was above 0.98, indicating that the image distortion was very small and the perturbations were imperceptible to humans; on the other hand, the perturbation generated by the targeted attack was much lower than that of the non-targeted attack. The momentum in this algorithm was updated toward the target image due to the directed output of the image. Therefore, the adversarial example generation algorithm was also guided to optimize in the direction of the specific objects.

## 5.4.4. Comparison of Adversarial Example Algorithms

In this paper, the validation dataset covered 40 different categories. These categories can be correctly classified by the MobileFaceNet model (Top-1 correct); we also selected the ArcFace [3] and SphereFace [21] face recognition models for testing. ArcFace uses IR101 as the network structure and has 99.8% accuracy on the LFW test set; SphereFace's network structure removes the BN module, which differs significantly from the ResNet50 residual network, and reaches 99.5% accuracy on the LFW test set. We selected typical adversarial example algorithms, FGSM [10] and I-FGSM [11], the face-specific attack methods AdvGlasses [18] and AdvHat [19], and our algorithm (AdvLocFace) for cross-testing. The comparison results are shown in Table 4.


**Table 4.** Accuracy and success rates of different algorithms.

In Table 4, the diagonal entries are white-box attack settings. I-FGSM improved the success rate of white-box attacks by increasing the number of iterations, but reduced the transferability of the attack due to the overfitting of the perturbation. The AdvHat algorithm is an advanced physical attack method that performs realistic attacks by pasting stickers on a hat, and it is easy to replicate this attack. Its optimization process, which takes pixel smoothing and color printability into account, limits its transferability in digital attacks. AdvLocFace, with the best threshold for similar models based on the base model used for training, obtained a more stable black-box attack success rate. For network models with different structures and different training data, the attack success rate decreased significantly.

## **6. Conclusions**

This paper proposed a face adversarial example generation algorithm based on local regions. The algorithm uses the principle of a face recognition system to build a local area containing key features and generates momentum-based adversarial examples. This algorithm is a typical white-box attack method but still achieves good results in the black-box attack scenario.

Compared with the traditional adversarial attack method, the adversarial perturbation generated by our method only needs to cover a small part of the original image. Because the region contains the key features of the face, it can successfully mislead the face recognition system. In addition, the generated adversarial example is patch-like, which is highly similar to the original image and therefore more inconspicuous. Our algorithm can selectively attack any target in the image, so it can be extended to attack other types of images. The experiments show that the proposed algorithm can effectively balance the modified pixel area and attack success, achieving good transferability.

**Author Contributions:** Investigation, J.H.; Methodology, D.L.; Software, S.L.; Writing—Original draft, D.C.; Writing—Review & editing, J.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the State Administration of Science, Technology and Industry for National Defense, PRC, Grant No. JCKY2021206B102.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All data and used models during the study are available in a repository or online. The datasets were CASIA WebFace [7], LFW [6], and MTCNN [30].

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


## *Article* **From Iris Image to Embedded Code: System of Methods**

**Ivan Matveev <sup>1</sup> and Ilia Safonov 2,\***


**Abstract:** Passwords are ubiquitous in today's world, as are forgetting and stealing them. Biometric signs are harder to steal and impossible to forget. This paper presents a complete system of methods that takes a secret key and the iris image of the owner as input and generates a public key suitable for insecure storage. It is impossible to obtain the source data (i.e., the secret key or biometric traits) from the public key without the iris image of the owner; the irises of other persons will not help. At the same time, when an iris image of the same person is presented, the secret key is restored. The system has been tested on several iris image databases from public sources. It allows storing 65 bits of the secret key, with zero possibility of unlocking it with an impostor's iris and a 10.4% probability of rejecting the owner in one attempt.

**Keywords:** biometric cryptosystem; iris identification; error-correcting codes

## **1. Introduction**

Nowadays, cryptographic algorithms are widely used for information protection. A large number of them, as well as their applications, have been invented and introduced [1]. These algorithms and systems are mathematically grounded and reliable. The weak link in their implementation and usage, as usual, is the human. Cryptography requires keys, i.e., sequences of digits, which should be reproduced precisely. While a human is able to remember and reproduce a personally invented password (though there are difficulties here already), it is practically impossible to memorize a long sequence of pseudorandom symbols that is created automatically [2]. Meanwhile, humans possess biometric features that are simple to retrieve, difficult to alienate, and contain a significant amount of information. The disadvantage of biometric traits is their variability: it is impossible to exactly replicate the measurement results; we can only say that two sets of traits taken from one person are in some sense closer than the sets obtained from different people. It is of great interest to combine these two approaches, i.e., to develop methods for obtaining unique cryptographic keys from the variable biometry data of a given person.

The eye iris is the most suitable biometric modality among all non-invasive ones due to its highest information capacity. The number of degrees of freedom of the iris template was evaluated as 249 [3]. It promises to be almost as good as a strong symmetric cryptography key length of 256 bits, while the next-best candidate, the fingerprint, is reported to have 80 bits [4]. In order to design a practically usable system, it is advisable to base it on the iris. Up to now, a major focus in developing automated biometrics has been building an identification system, i.e., a system that executes the following scenario: sample biometric features once and record them, take them again sometime later, and decide whether these samples belong to the same person.

The workflow of a biometric identification system can be composed of the following blocks: capture, segmentation, template generation, and template matching; see Figure 1.

**Citation:** Matveev, I.; Safonov, I. From Iris Image to Embedded Code: System of Methods. *Algorithms* **2023**, *16*, 87. https://doi.org/10.3390/ a16020087

Academic Editors: Francesco Bergadano and Giorgio Giacinto

Received: 15 December 2022 Revised: 19 January 2023 Accepted: 3 February 2023 Published: 6 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Figure 1.** Biometric system workflow.

Note that in this scenario the biometric template should be securely stored and exclude the intruder from obtaining it. Here, a different problem is solved, thus only capture, segmentation and partly template generation blocks are inherited, and matching is replaced by embedding/extracting the cryptographic key into/from the biometric features.

The explanation here goes alongside the data processing: from the source iris image to the embedding of the secret key. The capture process, i.e., obtaining eye images with a camera device, is beyond the scope of this paper. We start from the image segmentation task and present a framework for locating the iris in an eye image. In the next section, the key methods of the framework are described. Then feature extraction and matching methods are given. This is followed by a discussion of the application scenario of embedding the secret key into biometric features. The successful extraction of the embedded key depends on the difference between registered and presented biometric features; the value of this difference is determined based on several databases. In the next section, the methods of encoding and decoding the key are presented, and the selection of their optimal parameters is discussed. The contribution of this work comprises the following.


## **2. Eye Segmentation Framework**

Methods, algorithms and applications of iris biometrics have attracted much attention during the last three decades [5–7] and continue developing rapidly in recent years [8]. The main trend of the latest development is switching from heuristic approaches and hand-crafted methods to employing neural networks in various tasks. A wide variety of artificial neural networks has emerged and is applied to iris segmentation, starting from earlier research with fully connected nets [9] to latest applications of attention-driven CNNs [10], U-Nets [11], hybrid deep learning [12]. Another trend comes from the in-born ability of neural networks to *classify* objects (say, pixels, textures, images) rather than *calculate* their positions and other numerical properties. Due to this, most of the works in iris segmentation rely on detecting masks, i.e., pixels belonging to regions of the iris (or pupil, or else) in the image. Positions and sizes of pupil and iris are then derived from these masks. Surely, detecting masks is what one calls segmentation; however, such an approach ignores the clear and simple geometry of the iris and is prone to detecting irises of

unnatural shape as is shown in [12]. Some works [13,14] apply a neural network to detect the position of the iris as a number; however, it seems a strained decision.

Here we adopt a "classical" method. The obvious approach to iris region extraction in eye imaging is a chain of operations that starts with the detection of the pupil (the most distinctive area that is dark and has an expressed circular shape). Then outer iris border is presumably located. Finally, the visible iris part is refined by cutting off the areas distorted by reflections, eyelids and eyelashes. Most researchers and developers follow this method. Detection of each iris feature is usually carried out only one time and it is not recalculated any more even after obtaining other features, which can serve for refinement. For instance, in [15–18] full set of iris features is detected; however, pupil parameters are obtained at the first step and are not revised any more.

Only a few papers develop something different from this sequence "first pupil, then iris, once determined, and never reviewed". In [19,20], the position of the iris center is estimated first which makes pupil detection more robust. In [21], pupil parameters are refined using iris size after the iris is located. In [21,22], detection methods run iteratively several times for refinement. In [20,23], a multi-scale approach is used, and methods run in several scales. However, none of these works use various types of methods for detecting any iris feature.

Here we develop a *system of methods* for segmentation of the iris in an eye image. Each parameter is evaluated in several steps. The main idea of this system is that at first the most general characteristics of the objects are determined, which are then sequentially supplemented by more specific and refined ones. The beginning steps do not need to output precise parameters to be used as final ones. Instead, they should be robust and general and tolerate a wide range of conditions, i.e., detect the object at any quality. Later steps should have the highest possible precision and may reject poor-quality data.

Iris region in frontal images is delimited by two nearly concentric nearly circular contours, called inner and outer borders. Hereinafter the contour separating iris and pupil is referred to as *inner border*, *pupil border* or simply *pupil*, and the contour delimiting iris and sclera is called *outer border* or *iris*. In most cases pupil border is wholly visible in the image, but some part of the iris border is frequently overlapped by eyelids and eyelashes.

Since the pupil and iris are almost concentric, one *eye center* point may serve as an approximate center for both contours. It can be considered the most general geometric property of the iris, and the first step of eye detection should be locating this eye center. Note that only the position of the center is to be found, rather than the size of any contour. Excluding size and allowing approximate detection involves both concentric borders in the process. This is especially significant for eyes with poorly visible inner boundaries, where pupil location alone fails frequently. A modification of Hough method [24] is used.

A natural next step after center detection is to estimate the pupil size. To the best of our knowledge, this is carried out in all works where iris segmentation starts from eye center location, as in [19]. However, this approach is not stable and universal for a wide range of imaging conditions. Detecting the radius may easily mistake the outer border for the inner one, especially for images with poor inner border contrast [25]. Here we decide to employ both correlated contours around the detected center and detect their sizes simultaneously. Hereinafter this detection is referred to as *base radii* detection, meaning that it finds approximate (base) radii of the inner and outer circles around a given center. The method relies on circular projections of the gradient [26]. Base radii detection produces approximate center coordinates and radii of the pupil and iris circles, which satisfy some reasonable limitations. Furthermore, the quality of detection is calculated. The quality should be high enough to pass the image to further processing.

Then both boundaries are re-estimated with better precision (refined). Pupil refinement is carried out by a specially developed version of the shortest path method [27]. Iris is refined by the same method as that of base radius. The difference is that the position of the pupil is now fixed and only the iris center and radius are being searched. Iris segmentation here results in detecting two nearly concentric circles, which are approximating the inner and outer borders of the iris ring. Occlusion detection [28] is carried out to ensure the quality of iris data, i.e., to reject strongly occluded irises from further processing, but apart from this the occlusion mask is not used.

Summing up, the segmentation stage of the system employs five steps: center detection, base radii detection, pupil refinement, iris refinement and occlusion detection, see Figure 2.

**Figure 2.** Workflow of iris segmentation methods.

At each stage of segmentation, quality value is estimated and the process is terminated if the quality is below acceptable.

## **3. Eye Segmentation Methods**

Methods of iris segmentation are briefly presented in this section.

## *3.1. Center Detection*

The algorithm finds the coordinates (*xC*, *yC*) = ~*c* of eye center in the image *b*(~*x*), and does not need to estimate pupil or iris size. There is also no need to find the center position precisely, it is sufficient to locate it somewhere inside the pupil. Thus, pixels of both pupil and iris borders are used in Hough's procedure. Furthermore, the algorithm has low computational complexity since only two parameters are estimated and a two-dimensional Hough accumulator is used.

The algorithm runs the following five steps.

## Step 1. Gradient calculation.

Consider rectilinear coordinate system *Oxy* in the image with the center in the left bottom corner and axes *Ox* and *Oy* directed along its borders. Denote brightness *b*(~*x*) in image point <sup>~</sup>*x*. Brightness gradient <sup>∇</sup><sup>~</sup> *<sup>b</sup>*(~*x*) = <sup>~</sup>*g*(~*x*) is estimated by standard Sobel masks [29].

## Step 2. Outlining voting pixels.

We need edge pixels to vote. These are selected with the help of a gradient value threshold. The cumulative distribution of brightness gradient values in pixels over the image is calculated, and the set Ω<sup>1</sup> of pixels with brightness gradient in the upper 5% of this distribution is selected:

$$H(G) = \left|\left\{ \vec{x} : \|\vec{g}(\vec{x})\| \leqslant G \right\}\right| \,, \qquad \Omega_1 = \left\{ \vec{x} : H(\|\vec{g}(\vec{x})\|) \geqslant (1 - \tau_s)N \right\} \,, \tag{1}$$

where |*S*| is power (count of elements) of set *S*, *N* is total number of image pixels, *τ<sup>s</sup>* = 0.05 is the share of points being selected.

## Step 3. Voting to accumulator.

Hough methods use *accumulator* function, which is defined over a parameter space. We detect the eye center, which is some point in the image, and its parameters are its coordinates in the image. Thus, the parameter space is 2D vector ~*x* and the accumulator is *A*(~*x*) with the same size as the source image.

A ray from some given point ~*x* ∈ Ω<sup>1</sup> in the anti-gradient direction −∇~*b*(~*x*) is the locus of centers of all possible dark circles with border passing through this point. A set of such rays, drawn in the accumulator and traced from each pixel selected at Step 2, will produce a clotting at the center of any roundish dark object. The more circle-like this object is, the more expressed its central clotting will be.

## Step 4. Accumulator blurring.

The accumulator *A*(~*x*) is subject to a low-pass filter to suppress noise such as singular sporadic rays produced by non-circular edges in the image. Denote the result as *AB*(~*x*).

Step 5. Maximum location.

Maximum position

$$\vec{c} = \arg\max_{\vec{x}} A_B(\vec{x}) \tag{2}$$

in the blurred accumulator corresponds to the center of the best round-shaped object in the image. It is the most probable eye center. However, local maxima exist in any image due to noise. In order to decide whether there is a noticeable circular object, one can compare the value of the located maximum against the values produced by noise. Since a *τ<sup>s</sup>* share of the image pixels are voting and for each point the voting procedure draws a segment of approximately 0.5*W* pixels, where *W* is a linear size of the image, the average accumulator level is near 0.5*τsW*. Selecting a desirable signal-to-noise ratio *PSNR*, one can write the condition of accepting the located maximum (2) as the eye center:

$$Q_C = \max_{\vec{x}} A_B(\vec{x}) > \frac{1}{2} P_{SNR}\, \tau_s W \,. \tag{3}$$

If condition (3) does not hold, the decision is made that there is no eye in the image *b*(~*x*).
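The five steps above can be condensed into a short sketch. The ray length, the blur width, and the P_SNR value used below are illustrative assumptions; only the overall scheme (Sobel gradients, top-τs edge pixels, anti-gradient voting, blurring, maximum search, acceptance test (3)) follows the description.

```python
import numpy as np
from scipy import ndimage

def detect_eye_center(img, tau_s=0.05, p_snr=3.0):
    """Eye center detection sketch, Steps 1-5 and acceptance condition (3)."""
    img = img.astype(np.float64)
    gy = ndimage.sobel(img, axis=0)          # Step 1: brightness gradient (Sobel)
    gx = ndimage.sobel(img, axis=1)
    mag = np.hypot(gx, gy)
    thr = np.quantile(mag, 1.0 - tau_s)      # Step 2: upper tau_s share of gradients
    ys, xs = np.nonzero(mag >= max(thr, 1e-12))

    h, w = img.shape
    acc = np.zeros_like(img)                 # Step 3: 2D Hough accumulator
    for y, x in zip(ys, xs):
        dx, dy = -gx[y, x] / mag[y, x], -gy[y, x] / mag[y, x]   # anti-gradient ray
        for t in range(1, int(0.5 * max(h, w))):
            xi, yi = int(round(x + t * dx)), int(round(y + t * dy))
            if 0 <= xi < w and 0 <= yi < h:
                acc[yi, xi] += 1.0
            else:
                break

    acc_b = ndimage.gaussian_filter(acc, sigma=3.0)             # Step 4: blurring
    cy, cx = np.unravel_index(np.argmax(acc_b), acc_b.shape)    # Step 5: maximum
    if acc_b[cy, cx] <= 0.5 * p_snr * tau_s * max(h, w):        # condition (3)
        return None
    return cx, cy
```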

## *3.2. Base Radii Detection*

The algorithm simultaneously locates two iris boundaries as circle approximations: inner (pupil) (*xP*, *yP*,*rP*) and outer (iris) (*x<sup>I</sup>* , *y<sup>I</sup>* ,*rI*) starting from the center ~*c* (2). In this section, we set (*xC*, *yC*) = ~*c* as coordinate origin. Anti-gradient vector at the boundary of the dark circle and the direction to the circle center coincide or form a small angle. As the pupil and iris are both dark circles on the brighter background, one can state the following condition for pixels ~*x* of their boundaries:

$$\phi(\vec{x}) = \arccos \frac{\vec{g}(\vec{x}) \cdot \vec{x}}{||\vec{g}(\vec{x})|| \ ||\vec{x}||} < \tau\_{\Phi} \ . \tag{4}$$

We use a threshold value *τ<sup>φ</sup>* = 45°.

Furthermore, the condition for gradient value (1) is applicable. Pixel ~*x* satisfying the conditions (1), (4) probably belongs to the inner or outer boundary. Call it *candidate*. Define the set of candidate pixels as Ω2:

$$\Omega_2 = \left\{ \vec{x} : \phi(\vec{x}) < \tau_{\phi}\,,\ H(\|\vec{g}(\vec{x})\|) \geqslant (1 - \tau_s)N \right\} . \tag{5}$$

For each radius *r* a ratio of candidate count at this radius to the count of all pixels at this radius is estimated:

$$\Pi(r) = \frac{\left|\left\{\vec{x} : \|\vec{x}\| \in [r - 0.5, r + 0.5),\ \vec{x} \in \Omega_2\right\}\right|}{\left|\left\{\vec{x} : \|\vec{x}\| \in [r - 0.5, r + 0.5)\right\}\right|}\,. \tag{6}$$

If there is a dark circle of some radius *ρ* with the center near the coordinate origin its border pixels are likely to belong to the set Ω2, and are likely to have distance *ρ* to the coordinate origin. Thus, Π(*ρ*) will be big, i.e., have local maximum. Other contours will not vote to the same radius of circular projection and will not form local maxima therein.
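A compact version of the circular projection (6) is shown below; it takes a boolean mask of candidate pixels (conditions (1) and (4)) and a trial center. Restricting the same computation to the four quadrants cut by the lines y = x and y = −x gives the sub-projections used in (7).

```python
import numpy as np

def circular_projection(candidate_mask, center, r_max):
    """Equation (6): for each radius r, the share of candidate pixels among all
    pixels whose distance to `center` lies in [r - 0.5, r + 0.5)."""
    h, w = candidate_mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_bin = np.round(np.hypot(xx - center[0], yy - center[1])).astype(int)
    proj = np.zeros(r_max + 1)
    for r in range(1, r_max + 1):
        ring = r_bin == r
        total = ring.sum()
        if total:
            proj[r] = candidate_mask[ring].sum() / total
    return proj
```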

The image plane is divided into four quadrants, left, right, top and bottom by the lines *y* = *x* and *y* = −*x*. In each quadrant, a *sub-projection* is calculated separately according to (6). Positions of local maxima on the right, left, top, and bottom sub-projections are:

$$\mu_a(n) = \operatorname{arg\,locmax}_{r}\, \Pi_a(r) \;, \quad a \in \{R, L, T, B\} \;, \tag{7}$$

The quality of a maximum is simply the value of the projection at that point:

$$q_a(n) = \Pi_a(\mu_a(n))\,. \tag{8}$$

If not occluded, each of the two circular contours (inner and outer borders) gives a local maximum in each sub-projection. Other maxima may arise due to occlusions such as eyelashes and eyelids or due to other details in eye images, including patterns of the iris itself. Combining local maxima positions (7) gives set of hypothetical pupils:

$$\begin{aligned} x_P^{i,j} &= \frac{1}{2} \left(\mu_R(i) - \mu_L(j)\right) , \; i = \overline{1, n_R} \; , \; j = \overline{1, n_L} \; , \\ y_P^{k,l} &= \frac{1}{2} \left(\mu_T(k) - \mu_B(l)\right) , \; k = \overline{1, n_T} \; , \; l = \overline{1, n_B} \; , \\ r_P^{i,j,k,l} &= \frac{1}{4} \left(\mu_R(i) + \mu_L(j) + \mu_T(k) + \mu_B(l)\right) . \end{aligned} \tag{9}$$

Qualities of combinations are also defined from values (8):

$$q\_P^{i,j,k,l} = \frac{1}{4} (q\_R(i) + q\_L(j) + q\_T(k) + q\_B(l))\,. \tag{10}$$

Irises are estimated by just the same formulas:

$$\begin{aligned} \mathbf{x}\_{I}^{i,j} &= \frac{1}{2} (\mu\_{R}(i) - \mu\_{L}(j)) \,, \ i = \overline{1, n\_{R}} \,, \ j = \overline{1, n\_{L}} \,, \\\ y\_{I}^{k,l} &= \frac{1}{2} (\mu\_{T}(k) - \mu\_{B}(l)) \,, \ k = \overline{1, n\_{T}} \,, \ l = \overline{1, n\_{B}} \,, \\\ r\_{I}^{i,j,k,l} &= \frac{1}{4} (\mu\_{R}(i) + \mu\_{L}(j) + \mu\_{T}(k) + \mu\_{B}(l)) \,, \\\ q\_{I}^{i,j,k,l} &= \frac{1}{4} (q\_{R}(i) + q\_{L}(j) + q\_{T}(k) + q\_{B}(l)) \, . \end{aligned} \tag{11}$$

The nature of the pupil and iris imposes certain limitations on their locations and sizes. We use the following four inequalities: pupil size is not less than 15% of iris size and not

more than 75% of iris size; center of the iris is inside pupil; pupil cannot be displaced too much from iris center. This can be written as:

$$r\_P > 0.15r\_I \quad r\_P < 0.75r\_I \quad d < r\_P \quad 2(r\_I - r\_P - d) > r\_I - r\_P + d \tag{12}$$

where~*c<sup>P</sup>* = (*xP*, *yP*),~*c<sup>I</sup>* = (*x<sup>I</sup>* , *yI*) are centres of pupil and iris, *d* = k~*c<sup>P</sup>* −~*cI*k is a distance between these centres.

From all possible variants of pupil-iris pair given by (9)–(11) we select those satisfying conditions (12). The quality of combination is a sum of pupil and iris qualities (10) and a weighted quality of fitting to conditions (12):

$$\begin{aligned} Q &= q_P + q_I + \gamma\, q_{fit} \,, \\ q_{fit} &= \min\left\{ \frac{r_P - 0.15 r_I}{r_P},\ \frac{0.75 r_I - r_P}{r_P},\ \frac{r_P - d}{r_P},\ \frac{r_I - r_P - 3d}{r_I - r_P} \right\} . \end{aligned} \tag{13}$$

The combination with the best quality is selected. If *Q* is below the given threshold, it is supposed that the eye in the image is squinted and the upper and lower eyelids cover a big share of the iris border. In this case, the variant with absent top and bottom iris local maxima is tested. The formulas (9) and (10) are modified accordingly, and the iris center vertical position is taken equal to that of the pupil: *y<sup>I</sup>* ≡ *yP*. If *Q* is below the threshold again, it is decided that there is no feasible iris ring in the image. Other types of occlusion are not treated; the iris images are considered too bad for processing in this case. Thresholds for *Q* and the value of *γ* in (13) are estimated experimentally so as to reject the biggest share of erroneously detected irises while preserving good outcomes. So, the method runs in six steps:

## Step 1. Gradient calculation.

This step is common with center detection.

## Step 2. Candidates selection.

This step is similar to Step 2 of center detection. In addition to gradient value condition (1) angular condition (4) is imposed.

## Step 3. Circular projecting.

Calculating circular projections (6) in four quadrants.

## Step 4. Enumeration of maxima.

Finding local maxima (7) in projections. Prior to this the projections are smoothed with a Gaussian filter to suppress redundant local maxima originating from noise.

## Step 5. Enumerations of hypothetical irises.

Finding coordinates and radii of inner and outer circles from combinations of maxima (9), which hold conditions (12).

## Step 6. Selecting the best iris.

Pair of circles is selected according to the qualities (8), (10), (13).

If no feasible iris is detected in step 5, the result is "no eye detected".
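The enumeration of Steps 5 and 6 can be sketched as follows. The input dictionaries hold, for each quadrant R, L, T, B, the local maxima positions μ_a(n) from (7) and their qualities q_a(n) from (8); the value of γ below is a placeholder.

```python
import numpy as np
from itertools import product

def best_pupil_iris(mu, q, gamma=1.0):
    """Enumerate circle hypotheses (9)-(11), filter by constraints (12),
    and pick the pupil/iris pair with the best quality (13)."""
    hyps = []
    for i, j, k, l in product(range(len(mu['R'])), range(len(mu['L'])),
                              range(len(mu['T'])), range(len(mu['B']))):
        x = 0.5 * (mu['R'][i] - mu['L'][j])
        y = 0.5 * (mu['T'][k] - mu['B'][l])
        r = 0.25 * (mu['R'][i] + mu['L'][j] + mu['T'][k] + mu['B'][l])
        quality = 0.25 * (q['R'][i] + q['L'][j] + q['T'][k] + q['B'][l])
        hyps.append((x, y, r, quality))

    best, best_q = None, -np.inf
    for (xp, yp, rp, qp), (xi, yi, ri, qi) in product(hyps, hyps):
        d = float(np.hypot(xp - xi, yp - yi))
        if not (0.15 * ri < rp < 0.75 * ri and d < rp and ri - rp > 3 * d):
            continue                                      # constraints (12)
        q_fit = min((rp - 0.15 * ri) / rp, (0.75 * ri - rp) / rp,
                    (rp - d) / rp, (ri - rp - 3 * d) / (ri - rp))
        total = qp + qi + gamma * q_fit                   # quality (13)
        if total > best_q:
            best, best_q = ((xp, yp, rp), (xi, yi, ri)), total
    return best, best_q   # best is None if no pair satisfies (12)
```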

A sample of the projection combination is presented in Figure 3. The real positions of the pupil and iris borders, taken from the expert marking, are depicted by arrows. There is no local maximum corresponding to the iris border in the top projection Π*T*(*r*) since the iris border is occluded there. Such minor obstacles do not prevent choosing the correct combination.

**Figure 3.** Four circular projections, their maxima positions and correct position of borders.

## *3.3. Pupil Refinement*

The *circular shortest path* (CSP) method constructs a closed contour in a circular ring [30]. The ring is centered at a given point and has inner and outer radii. The CSP method is a kind of optimal path method, i.e., it optimizes a functional, which is the cost of the path. We take the ring concentric with the approximate pupil circle, extending 30% of its radius inward and outward.

In order to ease the calculations, a polar transformation is carried out. The ring shape in the source image is unwrapped to a rectilinear raster. Radial and angular coordinates of the ring are mapped to abscissa and ordinate. Thus, the problem of locating the circular shortest path is reduced to the problem of detecting the optimal path from the left to the right side of the rectangle such that the terminal points of the path have the same ordinate. The contour is represented as a function *ρ*(*φ*), *φ* ∈ [0; 2*π*], *ρ*(0) = *ρ*(2*π*), with limited derivative *dρ*/*dφ* < 1. In a discrete rectilinear raster of size *W* × *H* the contour turns into a sequence of points {(*n*, *ρn*)}, *n* ∈ [0; *W* − 1]. The limitation on the derivative transforms to |*ρn*+1 − *ρn*| ≤ 1, and the edge condition is set as |*ρW*−1 − *ρ*0| ≤ 1.

Consider points (*n*, *ρ*′) and (*n* + 1, *ρ*″) from adjacent columns of the raster. Denote the cost of passing between them as

$$C((n,\rho'),(n+1,\rho'')) \equiv C_n(\rho',\rho'') = C_n^{(I)}(\rho',\rho'') + C_n^{(O)}(\rho',\rho'') \,. \tag{14}$$

This cost is a sum of inner and outer parts.

Inner cost is a function of contour shape, designed in a way to promote its smoothness:

$$C_n^{(I)}(\rho',\rho'') = \begin{cases} 0 & \rho' = \rho'' \\ \tau_i & |\rho' - \rho''| = 1 \\ \infty & |\rho' - \rho''| > 1 \;. \end{cases} \tag{15}$$

Value of *τ<sup>i</sup>* > 0 is a parameter defining the magnitude of a "force", which pulls the contour towards a straight line. Optimizing the inner part alone would give horizontal lines in polar raster, i.e., ideal circles with the given center in source image.

Outer cost is designed to make the contour pass through border pixels. So it is low in boundary points (where the gradient vector is big and perpendicular to the local direction of the contour) and is high otherwise. The outer part is the cost of passing the point (*n*, *ρ*′):

$$C^{(O)}(n,\rho) = \begin{cases} 0 & (x,y) \in \Omega_3 \\ \tau_o & (x,y) \notin \Omega_3 \end{cases} \tag{16}$$

where Ω<sup>3</sup> is the set of points defined by (5), *x* and *y* are the coordinates of the source image point, which was mapped to (*n*, *ρ*).

The optimal contour *S* = {*ρn*}, *n* = 1, . . . , *W*, is the one minimizing the total cost:

$$S^* = \arg\min_{S} \sum_{n=1}^{W} C_n(\rho_n, \rho_{n+1})\,. \tag{17}$$

This discrete optimization problem can be solved by various methods. Here the method works in quite a narrow ring and the exhaustive search is faster due to small overhead.
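A direct implementation of that exhaustive search over starting rows, with a column-by-column dynamic program inside, might look as follows. The outer cost array corresponds to (16) on the polar raster (first axis = angular column, second axis = radial row); τ_i is the inner-cost step penalty from (15). This is a sketch of one possible solver, not the authors' code.

```python
import numpy as np

def circular_shortest_path(outer_cost, tau_i=1.0):
    """Closed optimal contour on a W x H polar raster, Equations (14)-(17).
    outer_cost[n, rho] is the cost (16) of passing angular column n at radial row rho;
    every start row is tried to enforce the circular closure condition."""
    W, H = outer_cost.shape
    best_path, best_total = None, np.inf
    for start in range(H):
        dp = np.full(H, np.inf)
        dp[start] = outer_cost[0, start]
        back = np.zeros((W, H), dtype=int)
        for n in range(1, W):
            new = np.full(H, np.inf)
            for rho in range(H):
                for prev in (rho - 1, rho, rho + 1):      # |rho' - rho''| <= 1
                    if 0 <= prev < H and np.isfinite(dp[prev]):
                        c = dp[prev] + outer_cost[n, rho] + (tau_i if prev != rho else 0.0)
                        if c < new[rho]:
                            new[rho], back[n, rho] = c, prev
            dp = new
        for end in (start - 1, start, start + 1):         # closure |rho_{W-1} - rho_0| <= 1
            if 0 <= end < H and dp[end] < best_total:
                best_total = dp[end]
                path = [end]
                for n in range(W - 1, 0, -1):
                    path.append(back[n, path[-1]])
                best_path = np.array(path[::-1])
    return best_path, best_total
```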

Denote sum in (17) as Σ. In the best case Σ = 0, in the worst case Σ = *W*(*τ<sup>i</sup>* + *τo*). Mapping this into the range [0; 1] where value 1 stands for best we obtain the quality

$$Q_{ref} = 1 - \frac{\Sigma}{W(\tau_i + \tau_o)} \;. \tag{18}$$

The contour is considered acceptable if *Qref* > 0.5; otherwise, the decision is made that the pupil border cannot be detected with the required precision and the segmentation is terminated.

The algorithm runs in five steps.

## Step 1. Candidates selection.

The same gradient calculation as in the first step of the previous methods is used. Then the conditions (1), (4) are imposed as in Step 2 of base radii detection. However, a smaller angular threshold *τ<sup>φ</sup>* = 30° is set since the center position is known with better precision.

## Step 2. Polar transform.

The transform creates an image (rectangular raster) *P*(*φ*, *ρ*), *φ* ∈ [0, *W* − 1], *ρ* ∈ [0; *H* − 1] by calculating a brightness value in each of its pixels (*φ*, *ρ*). This brightness is taken from source image *b*(*x*, *y*) where its coordinates are estimated as

$$\begin{aligned} x(\phi,\rho) &= \left(r\_1 + \frac{r\_2 - r\_1}{H}\rho\right) \cos\left(\frac{2\pi\phi}{W}\right) \\ y(\phi,\rho) &= \left(r\_1 + \frac{r\_2 - r\_1}{H}\rho\right) \sin\left(\frac{2\pi\phi}{W}\right) \end{aligned} \tag{19}$$

where *r*<sup>1</sup> and *r*<sup>2</sup> are the inner and outer radii of the ring in the source image, and the coordinate origin of the source image is placed at the center of the ring. The brightness of the point of the polar image is obtained by bilinear interpolation:

$$\begin{aligned} N(\phi,\rho) = \; & (1-\{x\})(1-\{y\})\, b(\lfloor x \rfloor, \lfloor y \rfloor) \; + \\ & \{x\}(1-\{y\})\, b(\lfloor x \rfloor + 1, \lfloor y \rfloor) \; + \\ & (1-\{x\})\{y\}\, b(\lfloor x \rfloor, \lfloor y \rfloor + 1) \; + \\ & \{x\}\{y\}\, b(\lfloor x \rfloor + 1, \lfloor y \rfloor + 1) \end{aligned} \tag{20}$$

where ⌊*a*⌋ and {*a*} denote the integer and fractional parts of *a*.
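A NumPy sketch of the polar transform (19) with bilinear interpolation (20) follows; the layout (angle along columns, radius along rows) and the default raster size are conventions of the sketch, not prescribed values.

```python
import numpy as np

def polar_unwrap(img, center, r1, r2, W=256, H=32):
    """Unwrap the ring between radii r1 and r2 around `center` into an H x W raster,
    Equations (19)-(20): rows index the radius, columns index the angle."""
    phi = 2.0 * np.pi * np.arange(W) / W
    rho = r1 + (r2 - r1) * np.arange(H) / H
    xs = center[0] + np.outer(rho, np.cos(phi))
    ys = center[1] + np.outer(rho, np.sin(phi))

    x0 = np.clip(np.floor(xs).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, img.shape[0] - 2)
    fx, fy = xs - x0, ys - y0

    img = img.astype(np.float64)
    return ((1 - fx) * (1 - fy) * img[y0, x0] + fx * (1 - fy) * img[y0, x0 + 1] +
            (1 - fx) * fy * img[y0 + 1, x0] + fx * fy * img[y0 + 1, x0 + 1])
```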

Step 3. Optimal path tracking.

Finding *S* ∗ according to (14)–(17).

Step 4. Transforming to original coordinates.

Restore the coordinates of the optimal path from *Oρφ* polar system back to the source image *Oxy* system.

Step 5. Estimating equivalent circle.

Pupil border contour is not a circle precisely; however, we can define an *equivalent circle*, with area and center of mass same as those of the figure enclosed into the pupil border contour. The center and radius of the equivalent circle are:

$$\begin{aligned} \mathbf{x}\_{eq} &= \frac{M\_{\mathbf{x}}}{M}, \quad y\_{eq} = \frac{M\_{\mathbf{y}}}{M}, \; r\_{eq} = \sqrt{\frac{M}{\pi}},\\ M &= |\Omega\_4|, \; M\_{\mathbf{x}} = \sum\_{(\mathbf{x}, \mathbf{y}) \in \Omega\_4} \mathbf{x}, \; M\_{\mathbf{y}} = \sum\_{(\mathbf{x}, \mathbf{y}) \in \Omega\_4} \mathbf{y} \; . \end{aligned} \tag{21}$$

where Ω<sup>4</sup> is the area inside contour *S* ∗ in source image. This equivalent circle is further used as the pupil border, and it happens to be a better model due to its stability [31].
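Given a boolean mask of the pixels enclosed by the refined contour S*, the equivalent circle (21) reduces to a few lines:

```python
import numpy as np

def equivalent_circle(inside_mask):
    """Equation (21): circle with the same area and centre of mass as the region."""
    ys, xs = np.nonzero(inside_mask)
    return xs.mean(), ys.mean(), np.sqrt(xs.size / np.pi)
```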

## **4. Experiments with Iris Segmentation**

Iris segmentation here results in detecting two nearly concentric circles, which approximate the inner and outer borders of the iris ring.

Assessment of the iris segmentation quality can be carried out in the following ways:


In order to compare the results of the proposed system with the known analogs, the following publicly available iris image databases were used: CASIA-3-Lamp and CASIA-4-Thousand [32] (54,434 images in total), BATH [33] (31,988 images), NDIRIS [34] (64,980 images), and UBIRIS-1 [35] (1207 images).

## *4.1. Matching against Manual Segmentation*

All images were processed by a human expert, who marked two circles approximating the iris borders in each of them or rejected the image if the iris was not visible or of poor quality. (In fact, there were very few images, less than a hundred altogether, rejected at this stage.) We assume that the expert did this accurately; therefore, this segmentation is taken as ground truth. Denote the manually marked circles as (*x*, *y*, *r*)<sup>∗</sup><sub>*P*</sub> for the pupil and (*x*, *y*, *r*)<sup>∗</sup><sub>*I*</sub> for the iris. Values of absolute and relative errors of eye center detection averaged over the databases

$$\varepsilon\_{\mathsf{C,abs}} = \left< \Delta \right> \,, \ \varepsilon\_{\mathsf{C,rel}} = \left< \frac{\Delta}{r\_P^\*} \right> \,, \ \Delta = \left( (\mathsf{x}\_{\mathsf{C}} - \mathsf{x}\_P^\*)^2 + (y\_{\mathsf{C}} - y\_P^\*)^2 \right)^{1/2} \tag{22}$$

are given in Table 1.

It can be seen that for all databases except for the small bases MMU and UBIRIS, which contain low-resolution images and images with small pupil size, the mean absolute deviation of the detected eye center from the true center of the pupil does not exceed four pixels and the relative deviation does not exceed one-tenth of the radius.


**Table 1.** Errors of eye center detection.

The next method of the system is base radius detection. Table 2 presents the average deviations of the detected centers and radii of the pupil and the iris from those marked by human experts.

$$\begin{aligned} \varepsilon\_{P,\text{abs}} &= \left\langle \left( (\mathbf{x}\_P - \mathbf{x}\_P^\*)^2 + (y\_P - y\_P^\*)^2 \right)^{1/2} \right\rangle, \ \varepsilon\_{rP,\text{abs}} = \langle r\_P - r\_P^\* \rangle \ , \\ \varepsilon\_{I,\text{abs}} &= \left\langle \left( (\mathbf{x}\_I - \mathbf{x}\_I^\*)^2 + (y\_I - y\_I^\*)^2 \right)^{1/2} \right\rangle, \ \varepsilon\_{rI,\text{abs}} = \langle r\_I - r\_I^\* \rangle \ , \end{aligned} \tag{23}$$

**Table 2.** Errors of base radii detection, pixels.


It is seen that the mean error in detecting the pupil center is reduced compared with the first column of Table 1.

Table 3 shows the errors for the final circles of the pupil and the iris obtained by the system, calculated according to (23).


**Table 3.** Errors of final iris parameters detection, pixels.

## *4.2. Matching against Other Methods*

Table 4 compares the computation time and errors in determining the pupil parameters for the presented system and its analogs. The comparison was carried out with the methods described in [3,36–39].

The third way to assess the iris segmentation algorithm, i.e., applying its results to iris recognition, is discussed further below.


**Table 4.** Matching against other methods.

## **5. Feature Extraction and Matching**

We use the standard approach [3] here, which first transforms the iris ring to a so-called *normalized* image. This image is a rectangular raster; it is obtained from the iris ring by the polar transformation, analogous to (19), (20), where *r*<sup>1</sup> and *r*<sup>2</sup> are set to the radius of the pupil and the iris, respectively. In fact, a more elaborate version of (19) is used:

$$\begin{aligned} x(\boldsymbol{\phi},\boldsymbol{\rho}) &= \left(1 - \frac{\rho}{H}\right) \mathbf{x}\_1 \left(\frac{2\pi\boldsymbol{\phi}}{W}\right) + \frac{\rho}{H} \mathbf{x}\_2 \left(\frac{2\pi\boldsymbol{\phi}}{W}\right), \\ y(\boldsymbol{\phi},\boldsymbol{\rho}) &= \left(1 - \frac{\rho}{H}\right) y\_1 \left(\frac{2\pi\boldsymbol{\phi}}{W}\right) + \frac{\rho}{H} y\_2 \left(\frac{2\pi\boldsymbol{\phi}}{W}\right), \\ x\_1(\boldsymbol{\phi}) &= \mathbf{x}\_P + r\_P \cos(\boldsymbol{\phi}), y\_1(\boldsymbol{\phi}) = y\_P + r\_P \sin(\boldsymbol{\phi}), \\ x\_2(\boldsymbol{\phi}) &= \mathbf{x}\_I + r\_I \cos(\boldsymbol{\phi}), y\_2(\boldsymbol{\phi}) = y\_I + r\_I \sin(\boldsymbol{\phi}) \end{aligned} \tag{24}$$

where *xP*, *yP*, *r<sup>P</sup>* are the position and radius of pupil and *x<sup>I</sup>* , *y<sup>I</sup>* , *r<sup>I</sup>* are the position and radius of iris. In comparison to (19) this variant accounts for the difference of pupil and iris centres.

The key idea of standard iris feature extraction is to convolve the normalized iris image with a filter, calculating the most informative features of the texture. Earlier Gabor wavelet was used for feature extraction. In one-dimensional space, it is represented as

$$\mathcal{g}\_{\sigma\lambda}(\mathbf{x}) = \exp\left(-\frac{\mathbf{x}^2}{2\sigma^2}\right) \exp\left(-i\frac{\mathbf{x}}{\lambda}\right), \quad \mathcal{G}\_{\sigma\lambda}(\mathbf{u}) = \exp\left(-\frac{(\mathbf{u}-\lambda^{-1})^2\sigma^2}{2}\right), \tag{25}$$

where *σ* defines the width of the wavelet in the spatial domain, *λ* is the wavelength of modulation of the Gaussian by a harmonic function. By introducing inverse values *S* = 1/*σ* and *W* = 1/*λ*, a simplified representation in the frequency domain can be obtained:

$$G_{SW}(u) = \exp\left(-\frac{(u - W)^2}{2S^2}\right). \tag{26}$$

It turned out that the modification of the Gabor wavelet called Log-Gabor function is better for feature extraction. Log-Gabor is given in the frequency domain as:

$$\mathcal{G}\_{SW}(u) = \exp\left(\frac{-\log^2(u/W)}{2\log^2(\mathcal{S}/W)}\right) = \exp\left(-\frac{(\log u - \log W)^2}{2\log^2 L}\right). \tag{27}$$

This is equivalent to (26), in which each variable is replaced by its logarithmic counterpart. *L* = *S*/*W* = *λ*/*σ* represents the ratio of the modulation wavelength to the width of the Gaussian. Research has shown that Log-Gabor wavelets are most likely optimal for the template generation problem. Therefore, we use this type of filter. The parameter *λ* is essentially the characteristic size of the objects in the image extracted by this filter, and *L*

is the number of periods of the harmonic function in Equation (25) that have sufficient amplitude and influence on the result. Optimal values of *λ* and *L* are selected according to [40].

Iris features *V*(*φ*, *ρ*) are calculated by convolution of the normalized image (20) with a Gabor or Log-Gabor filter, the transformation is performed in the spectral domain:

$$\begin{split} \mathcal{V}(\boldsymbol{\phi}, \boldsymbol{\rho}) &= \mathcal{N}(\boldsymbol{\phi}, \boldsymbol{\rho}) \* \mathcal{g}\_{\sigma\lambda}(\boldsymbol{\phi}) = \\ &= \mathcal{F}^{-1}\{\mathcal{F}\{\boldsymbol{N}(\boldsymbol{\phi}, \boldsymbol{\rho})\} \mathcal{F}\{\mathcal{g}\_{\sigma\lambda}(\boldsymbol{\phi})\}\} = \\ &= \mathcal{F}^{-1}\{\mathcal{F}\{\boldsymbol{N}(\boldsymbol{\phi}, \boldsymbol{\rho})\} \mathcal{G}\_{\lambda L}(\boldsymbol{u})\} \ . \end{split} \tag{28}$$

where *σ* and *λ* define the width of the wavelet along the angular axis and the modulation frequency, *s* is the width along the radial axis, F is the Fourier transform. The features used to form the patterns are computed as binary values of real and imaginary parts of the array *V*(*φ*, *ρ*):

$$\begin{aligned} T_{Re}(\phi,\rho) &= \mathcal{H}[\mathfrak{R}(V(\phi,\rho))]\ , \\ T_{Im}(\phi,\rho) &= \mathcal{H}[\mathfrak{I}(V(\phi,\rho))]\ , \end{aligned} \tag{29}$$

where H[·] is the Heaviside function. So, eye image *b*(*x*, *y*) produces a template *T*(*φ*, *ρ*), and each element of the template contains two bits.

We use a feature raster of 13 pixels in the radial direction *r* and 256 pixels in the tangential direction *φ*. Since each pixel produces two bits in (29), the total size of the template is 6656 bits [40].
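The following sketch turns a normalized iris image into a binary template by row-wise filtering with a 1D Log-Gabor response (27), (28) and binarizing the real and imaginary parts (29). The wavelength, the bandwidth ratio, and the per-row mean removal below are illustrative assumptions, not the parameters selected in [40].

```python
import numpy as np

def log_gabor_template(norm_img, wavelength=18.0, bandwidth_ratio=0.5):
    """Template generation sketch, Equations (27)-(29): filter each row (fixed radius)
    of the normalized image along the angular axis and keep the signs of the real
    and imaginary parts of the result (two bits per pixel)."""
    H, W = norm_img.shape
    u = np.fft.fftfreq(W)                        # frequencies along the angular axis
    w0 = 1.0 / wavelength                        # centre frequency W = 1 / lambda
    G = np.zeros(W)
    pos = u > 0                                  # Log-Gabor lives on positive frequencies
    G[pos] = np.exp(-np.log(u[pos] / w0) ** 2 / (2.0 * np.log(bandwidth_ratio) ** 2))

    rows = norm_img - norm_img.mean(axis=1, keepdims=True)
    V = np.fft.ifft(np.fft.fft(rows, axis=1) * G[None, :], axis=1)   # Equation (28)
    bits_re = (V.real > 0).astype(np.uint8)      # Heaviside of the real part
    bits_im = (V.imag > 0).astype(np.uint8)      # Heaviside of the imaginary part
    return np.stack([bits_re, bits_im], axis=-1) # shape (H, W, 2): 2*H*W bits

# A 13 x 256 normalized image gives 2 * 13 * 256 = 6656 template bits, as in the text.
```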

Although here we do not build a classification system, which calculates a distance between templates and compares it against a classification threshold, template matching is implicitly present, as it will be shown below. Thus, we need to describe the matching method.

In a standard iris recognition approach templates *T*<sup>1</sup> and *T*<sup>2</sup> are matched by normalized Hamming distance:

$$\rho\_0(T\_1, T\_2) = \frac{1}{|\Omega|} |\{T\_1(\phi, \rho) \neq T\_2(\phi, \rho), (\phi, \rho) \in \Omega\}|\,\tag{30}$$

where Ω = *M*<sup>1</sup> ∩ *M*<sup>2</sup> is the intersection of the visible areas (presenting true data) of the two irises. Because of the uncertainty of the iris rotation angle, a more complex distance formula is used. The rotation of the original image of the eye is equivalent to a cyclic shift of the normalized image along the *Oφ* axis. Therefore, one of the templates (together with the mask) is subjected to several shift and compare operations:

$$\begin{aligned} \rho(T\_1, T\_2) &= \min\_{\psi} \rho\_{\psi}(T\_1, T\_2) \\ \rho\_{\psi}(T\_1, T\_2) &= \frac{1}{\Omega(\psi)} |\{T\_1(\phi + \psi, \rho) \neq T\_2(\phi, \rho), (\phi, \rho) \in \Omega(\psi)\}| \\ \Omega(\psi) &= M\_1(\phi + \psi) \cap M\_2(\phi) \ . \end{aligned} \tag{31}$$

where *ψ* ∈ [−*S*; *S*] is the image rotation angle.

Here things may be simplified. For the embedding method, only irises with low occlusion levels are acceptable. Thus, it is supposed that the masks *M*<sup>1</sup> and *M*<sup>2</sup> cover all of the iris area, and the set Ω spans the whole templates. Omitting the mask, rewriting |{*T*<sup>1</sup> ≠ *T*2}| as ∑ *T*<sup>1</sup> ⊕ *T*<sup>2</sup>, and using a single order index *i* instead of the coordinates (*φ*, *ρ*) puts (30) as:

$$\rho\_0(T\_1, T\_2) = \frac{1}{N} \sum\_{i=1}^{N} T\_1(i) \oplus T\_2(i) \tag{32}$$

where *T*(*i*) is the *i*-th bit of the template, operation ⊕ is the sum modulo 2, *N* is the size of the template. Furthermore, (31) transforms to

$$\begin{aligned} \rho(T\_1, T\_2) &= \min\_{\psi} \rho\_{\psi}(T\_1, T\_2) \; , \\ \rho\_{\psi}(T\_1, T\_2) &= \frac{1}{N} \sum\_{i=1}^{N} T\_1(i(\psi)) \oplus T\_2(i) \; , \end{aligned} \tag{33}$$

where *i*(*ψ*) index is recalculated accordingly.
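With the template stored as an (H, W, 2) bit array, Equations (32) and (33) become a few lines of NumPy; the shift range below is an illustrative choice.

```python
import numpy as np

def hamming_distance(t1, t2):
    """Normalized Hamming distance, Equation (32)."""
    return np.count_nonzero(t1 != t2) / t1.size

def match(t1, t2, max_shift=8):
    """Rotation-tolerant distance, Equation (33): cyclic shifts along the angular
    axis model the unknown eye rotation; the smallest distance is kept."""
    return min(hamming_distance(np.roll(t1, s, axis=1), t2)
               for s in range(-max_shift, max_shift + 1))
```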

The recognition system is designed to supply the following conditions with the lowest possible errors:

$$\begin{aligned} T_1, T_2 \text{ taken from the same person} &\implies \rho(T_1, T_2) \leqslant \theta \,, \\ T_1, T_2 \text{ taken from different persons} &\implies \rho(T_1, T_2) > \theta \,. \end{aligned} \tag{34}$$

Violation of the first condition in (34) is called *false reject* and its probability is referred to as *false reject rate* (FRR). FRR of the system is estimated in tests as the ratio of the number of false rejects to the number of all matches of biometric traits of the same persons. Analogously, violation of the second condition in (34) is called *false accept* and its probability is named *false accept rate* (FAR). The threshold *θ* is chosen from a trade-off between FRR and FAR.
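Given arrays of genuine and impostor distances, the empirical error rates for a candidate threshold θ are straightforward to estimate:

```python
import numpy as np

def frr_far(genuine_distances, impostor_distances, theta):
    """Empirical FRR and FAR for threshold theta, following the conditions in (34)."""
    frr = float(np.mean(np.asarray(genuine_distances) > theta))
    far = float(np.mean(np.asarray(impostor_distances) <= theta))
    return frr, far
```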

## **6. Selecting the Embedding Method**

There are many works where biometry is used in combination with other security measures such as usual secured passwords, for instance [41,42]. Here, we intend to develop a system that uses only data transmitted insecurely; the only protection is the iris of the owner.

We also limit ourselves to the case of symmetric encryption. During encoding the *message M* and the secret *key K* are combined into the *code* by the *encoder* function Φ: *C* = Φ(*M*, *K*), and during decoding the message is reconstructed from code and key by *decoder* functions Ψ: *M* = Ψ(*C*, *K*). If key *K* is not present, it is impossible to obtain *M* from *C*, thus the code *C* can be made public. Symmetric encryption requires that *K* is repeated exactly. Not a single bit of it can be changed.

The central problem in automatic biometry systems can be put as developing the optimal classifier. The classifier consists of a distance function between two biometric data samples *ρ*(*D*1, *D*2) and a threshold *θ* (34). The function *ρ* can be treated as a superposition of two sub-functions. The first one is the calculation of the biometric template *T* from the source data, *T* = *T*(*D*); the second sub-function is the calculation of the distance itself, *ρ*(*T*1, *T*2). Features should be selected that are stably close for the same person and stably far for different persons with respect to the function *ρ*. As a rule, the elements of biometric templates are highly correlated. On the contrary, cryptographic keys are deliberately developed so as to have uncorrelated bits. However, the entropy (information amount) of an iris template is comparable to that of currently used cryptographic keys [43]. This suggests that it is possible to implement a cryptographic key in biometrics without reducing its robustness.

It should be noted that most of the works presenting the application of cryptographic methods to biometrics, develop the scenario of *cancelable biometrics* [44]. Its essence is producing such biometric templates that source biometric data cannot be extracted or guessed anyhow from any number of templates. Cancelable biometrics is nothing but a kind of fuzzy hashing [45]. Formally, an additional step is introduced in the calculation of the distance function *ρ*. Distance *ρ*(*S*1, *S*2) is calculated, *S* = *S*(*T*) is the hash function. Obviously, the recognition problem is still being solved here. Thus, cancelable biometrics is just a remake of identification and cannot be used for our purposes.

There are two approaches to how to process volatile biometrics, leading them to an unchanging cryptographic key. The first approach employs already computed biometric features constituting the template *T*, which are supplemented with error correction using different variants of redundant coding. This approach is used here. In the second approach [46] biometric features are not obtained in explicit form. Instead, a neural network is trained, which directly produces a target key from raw biometric data *D*. The advantage of this approach is said to be less coding redundancy by using continuous data at all stages and quantization only at the end. Disadvantages are the unpredictability of neural network training, lack of guaranteed quality of performance, including uncertainty in retaining quality in a wider set of data than that used in training.

The task of reproducing a cryptographic key is accomplished by *biometric cryptosystems* (BC) [45,47], also called *biometric encryption* [48]. There are two classes of BCs, which implement different approaches: *key generation* and *key binding*.

Methods of key generation, i.e., direct production of the key from raw biometry or template without using supplementary code are studied in [49–51]. Biometric template *T* is mapped into the space of cryptographic keys (usually bit strings) by a special function: *K*(*T*) : *T* → {0, 1} *n* , where *n* is the length of the key. One function is used for registration and recognition. The conditions must hold

$$\begin{aligned} T\_1 \; T\_2 \; \text{taken from one person} &\implies \; K\_1 = K\_2 \; . \\ T\_1 \; T\_2 \; \text{taken from different persons} &\implies \; K\_1 \neq K\_2 \; . \end{aligned} \tag{35}$$

These conditions are closely related to (34); however, in (35) the task is to reproduce the sequence of bits. The results of the methods without supplementary data are not very hopeful for practical applications. Error level is generally very high in this approach. In [50] the authors report FRR = 24% at FAR = 0.07% even for homogeneous high-quality images [32]. In [51], the idea is based on assumption that two iris codes can be mapped to some "closest" prime number and this number will be the same for the codes from one person. Considering the variability of iris codes even for ideal conditions this is unlikely to happen. The authors do not report the study of recognition errors.

The scenario with *helper code* demonstrates much better performance. During registration the encoder takes the template *T*1, computes the key *K*<sup>1</sup> = *K*(*T*1), encrypts the message *M* with *K*<sup>1</sup> and additionally outputs some helper code *h* = Φ(*T*1). Immediately after this the original *T*1, *M*, and *K*<sup>1</sup> are destroyed, leaving only the encoded message *M*′ and the helper code *h*. The original template *T*<sup>1</sup> or key *K*<sup>1</sup> cannot be recovered from *M*′ and *h*. During presentation another set of biometric traits *T*<sup>2</sup> is obtained and the key *K*<sup>2</sup> = Ψ(*T*2, *h*) is calculated. Functions Φ and Ψ are designed so as to satisfy (35). Thus, by providing biometrics and the helper code, the registered user can obtain the original key *K*<sup>2</sup> = *K*1, and hence the message *M*. At the same time, the intruder, even knowing *h*, will not be able to obtain *K*<sup>1</sup> [52], so the helper code *h* can be made non-secret.

The biometric data itself may be used as a key: *K* ≡ *T*. In this case, at the stage of presentation, original biometrics *T*<sup>1</sup> is restored from presented *T*2. This scenario is called *secure sketch* [53]. However, the works based on secure sketches and available in the literature show rather modest results. For example, the system [54] is workable under the assumption that intraclass variability of features is below 10%. In practice, the variability is more than 20%. This conditions the inoperability of the proposed method.

The *key binding* scheme in the above terms looks like a pair of encoder function *C* = Φ(*K*1, *T*1) and decoder function *K*<sup>2</sup> = Ψ(*C*, *T*2), which holds the (35) condition. The advantage is that *K*<sup>1</sup> is set externally, rather than created by the developed algorithm. From this point of view, *K*<sup>1</sup> can be perceived as a message *M*, which is external to the encryption system. This immediately simplifies the biometric cryptosystem to a symmetric encryption scenario. The difference is that the secret key *K* must be the same in encoding and decoding in symmetric encryption, whereas the biometric features (also secret) differ: *T*<sup>1</sup> 6= *T*2. This scenario is called *fuzzy extractor* [53].

If Ψ is an inverse of Φ and biometric data are composed of real numbers the scenario is referred to as *shielding functions* [55]. So-called *fuzzy vault* [43] is another popular method of key embedding. It is founded on Shamir's secret separation scheme [56]. Here rather low, practically meaningful error values are obtained: [57] reports FRR = 0.78% and [58] reports FRR = 4.8% at zero FAR. However, both results are shown using a single small image database (less than 1000 samples).

The most promising for use in iris biometry is the *fuzzy commitment* scenario [59]. In [60], a simple algorithm is proposed. The basic idea is to employ *error correcting coding* (ECC) [61]. ECC is widely used in data transmission over noisy channels. Generally, data transmission involves a pair of functions also called encoder and decoder. The encoder *R* = Φ*p*(*K*) maps the transmitted message *K* into a larger redundant code *R*. Then *R* is passed through the transmission channel, which alters each of its symbols independently with probability *q*, and the altered code *R*′ is received at the other side. The decoder *K* = Ψ*p*(*R*′) is able to restore *K* back from *R*′ under the condition that no more than a *p* share of values were altered. Call *p* the *tolerated error probability*. Thus, if *q* < *p* then the message is restored with a probability close to 1. Otherwise, the probability to restore *K* is close to 0. One can design Φ and Ψ for a wide range of transition error probabilities *p* ∈ [0; 0.5). Redundancy grows as *p* approaches 0.5; for *p* = 0.5 it becomes infinite.

Here ECC is used as follows. The encoder and decoder are constructed so as to have a tolerated error probability equal to the classification threshold of the biometric recognition system: *p* = *θ*. Upon registration, a password *K*<sup>1</sup> is constructed and the user's template *T*<sup>1</sup> is obtained. The code *R*<sup>1</sup> = Φ*p*(*K*1) (it generally looks like pseudorandom numbers) is bitwise summed modulo 2 (exclusive or) to the iris template yielding the public code *C* = *R*<sup>1</sup> ⊕ *T*1. After *C* is calculated, template *T*1, message *K*1, and redundant code *R*<sup>1</sup> are destroyed. None of them can be extracted from *C* alone. Thus, it is possible to expose *C* publicly and transmit it through unprotected channels. Upon presentation iris of a person is registered once more and a new template *T*<sup>2</sup> is formed. Of course, it is not equal to the original one. Since *R*<sup>2</sup> = *C* ⊕ *T*<sup>2</sup> = (*R*<sup>1</sup> ⊕ *T*1) ⊕ *T*2, then *R*<sup>1</sup> ⊕ *R*<sup>2</sup> = *T*<sup>1</sup> ⊕ *T*2. If the templates *T*<sup>1</sup> and *T*<sup>2</sup> are taken from one person, the distance is very likely to be less than the classification threshold: *ρ*(*T*1, *T*2) 6 *θ*, so *ρ*(*R*1, *R*2) 6 *θ*. By the nature of (32) it means that less than *p* share of bits differ in *R*<sup>1</sup> and *R*<sup>2</sup> and the original secret key *K*<sup>1</sup> = *K*<sup>2</sup> = Ψ*p*(*R*2) will be recovered. On the other hand, if the templates are classified as belonging to different persons *ρ*(*T*1, *T*2) > *θ*, the probability of restoring the original *K*<sup>1</sup> is close to zero. The scenario of operation is shown in Figure 4.

**Figure 4.** Scenario of the method [60].
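To make the enrollment/recovery flow above concrete, the following Python sketch mirrors the XOR-based fuzzy commitment steps. It is illustrative only: `ecc_encode` and `ecc_decode` stand for an error-correcting encoder/decoder pair Φ<sub>*p*</sub>/Ψ<sub>*p*</sub> such as the cascade described below, and the function names are ours, not part of [60].

```python
import numpy as np

def enroll(key_bits: np.ndarray, template_t1: np.ndarray, ecc_encode) -> np.ndarray:
    """Enrollment: bind key K1 to template T1 and return the public code C."""
    r1 = ecc_encode(key_bits)              # redundant code, same length as the template
    c = np.bitwise_xor(r1, template_t1)    # C = R1 xor T1; T1, K1, R1 are then destroyed
    return c

def recover(c: np.ndarray, template_t2: np.ndarray, ecc_decode) -> np.ndarray:
    """Presentation: recover the key from the public code C and a fresh template T2."""
    r2 = np.bitwise_xor(c, template_t2)    # R2 = C xor T2 = R1 xor (T1 xor T2)
    return ecc_decode(r2)                  # succeeds when rho(T1, T2) <= theta
```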

Work [60] proposes a cascade of two ECC algorithms: Reed–Solomon [62] and Hadamard [61]. Reed–Solomon coding handles an entire block of data of length *L*, processing it as a set of *L*/*s* symbols of *s* bits each. Any symbols (not bits!) may differ, as long as their number does not exceed *pL*. In [60], this coding is aimed at combating group errors caused by various occlusions (eyelashes, eyelids), which cover significant areas of the iris. Hadamard coding processes small chunks of data (a few bits) and corrects no more than 25% of the errors in each chunk. For the Hadamard code to be most successful in error correction, the errors (deviations of *T*′ from *T*) should be evenly scattered across the template with a density of no more than 25%. This coding is designed to deal with single-pixel deviations arising from camera noise. The key *K* is encoded by the Reed–Solomon code, and the result is processed by the Hadamard code.

This cascade performs well if the share of altered bits in one person's templates does not exceed 25%. However, in practical databases and applications this share is bigger, which leads to an unacceptably high (more than 50%) false reject probability. To overcome this difficulty, it was proposed [42] to introduce additional template masking: every fourth bit of the iris templates is set to zero. Due to this, the proportion of altered bits in the templates of one person is reduced below 20%. This easy solution ruins the very idea of security: if some bits of the template are fixed, then the corresponding bits of the redundant code are known to code crackers and can be used to attack the code. A critique of this method in terms of resistance to cracking is given in [46]. The attack is carried out by gradually restoring the original template.

Here we attempt to refine the fuzzy extractor of [60] into a more feasible form and build a practically applicable key embedding method. Based on the iris feature extraction system, experiments on several publicly available iris databases are carried out. Two steps are added to the encoder tail (and hence, the decoder head): majority coding of single bits and pseudorandom bit mixing. Three of these four steps have parameters that affect their properties, including error tolerance and size. Optimal values of these parameters are selected to fit the redundant code size into the iris template size, keep the error tolerance near the target level, and maximize the size of the encoded key.

## **7. Determining the Threshold**

So, if the registration template *T* and the presentation template *T*′ are at a Hamming distance (32) below the threshold *θ*, then the encrypted message *M* is recovered with high confidence; otherwise, the probability of recovering it is extremely low (of the order of a random guess). Thus, the threshold *θ* separates "genuine" and "intruder" templates *T*′ with respect to *T*.

It is necessary to determine the value of the threshold *p*, which will be used for separating "genuines" from "impostors". With this value, the redundant coder Φ<sub>*p*</sub> and decoder Ψ<sub>*p*</sub> will be devised, capable of restoring the message for a "genuine" template and making this impossible for an "impostor" template.

The following publicly available databases were used for the experiments: CASIA-4- Thousand [32], BATH [33], ICE subset of NDIRIS [34], UBIRIS-1 [35].

Table 5 gives a list of databases used with the obtained thresholds.


**Table 5.** Database characteristics and thresholds.

For each database the following numbers are given:


Since *FAR*(*θ*) is a monotonically growing function, we select the minimal *θ*(*FAR* = 10<sup>−4</sup>) from the fourth column of the table. It is *θ* = 0.351 for the CASIA database. Other databases have an even smaller *FAR* at this value of *θ*.

So, the value of *θ* = 0.35 is the tolerated error probability *p* for constructing the ECC. Table 5 shows the values of false accept and false reject rates for this threshold. The maximum false reject rate does not exceed 8%.

## **8. Error Correction Cascade**

We describe the applied methods in the sequence of their execution by the decoder, which also goes "from simple to complex". At the beginning the data unit is a single bit; at the end it is the whole message. The problem is to devise an error correction method that encodes a message into a block of data with redundancy and is then able to reconstruct the message if no more than a share *p* ⩽ 0.35 of these bits is altered. The popular Walsh–Hadamard and Reed–Muller [63] methods can be used only for *p* < 0.25, so they are not directly applicable. Furthermore, the errors of neighboring elements of the template are strongly correlated, whereas almost all ECC methods perform best against uncorrelated errors.

## *8.1. Decorrelation by Shuffling*

It is more difficult to design methods usable against correlated errors, and their performance is worse than in the case of uncorrelated errors. Much of the effort in this case is directed precisely at decorrelation. Luckily, the whole block of data is available in our task (rather than being fed symbol by symbol, as in many transmission channel systems), and a simple method of decorrelation can be applied: quasi-random shuffling of the iris template bits. A bit of the template array *T* is moved from position *i* to position *ij* mod *N*:

$$
\tilde{T}(ij \bmod N) = T(i)\,, \quad i = \overline{0, N-1}\,, \tag{36}
$$

where *j* is a number relatively prime to the total number *N* of bits in the array. The relative primality guarantees that the numbers *ij* mod *N* are unique for *i* = 0, ..., *N* − 1 and that the shuffling *T* → *T̃* is reversible. After shuffling, the neighboring bits of *T̃* are taken from bits that were far apart in *T*, so their errors are uncorrelated. If the same shuffling rule is always applied, the bits of all templates change their positions in the same way, and the calculation (32) involves the same pairs of bits. Thus, the Hamming distance is unchanged, and everything derived from it is preserved.

This method does not change the size of the code.
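A minimal sketch of this decorrelation step is given below, assuming the template is held as a NumPy bit array; the multiplier *j* is arbitrary and only has to be relatively prime to *N*.

```python
import math
import numpy as np

def shuffle_bits(template: np.ndarray, j: int) -> np.ndarray:
    """Quasi-random shuffling (36): the bit at position i moves to position (i*j) mod N."""
    n = template.size
    assert math.gcd(j, n) == 1, "j must be relatively prime to N for the map to be a bijection"
    positions = (np.arange(n) * j) % n
    shuffled = np.empty_like(template)
    shuffled[positions] = template
    return shuffled

def unshuffle_bits(shuffled: np.ndarray, j: int) -> np.ndarray:
    """Inverse permutation: read the bit at position (i*j) mod N back into position i."""
    n = shuffled.size
    positions = (np.arange(n) * j) % n
    return shuffled[positions]
```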

## *8.2. Bit-Majority Coding*

The error rate *p* = 0.35 is too big for most error correction codes. Practically, the only possibility here is the majority coding of single bits, which is applicable for *p* < 0.5. At the coding stage, the bit is repeated *n* times. At the decoding stage, the sum of *n* successive bits is counted: if it is below *n*/2, a zero bit is decoded; otherwise, a one. It is easy to see that odd values of *n* are preferable. If *p* is the error probability of a single bit and the bits are altered independently, the error probability of the decoded bit is

$$p_D(p) = 1 - \sum_{l=0}^{(n-1)/2} \binom{n}{l} p^{l} (1-p)^{n-l} = 1 - (1-p)^n \sum_{l=0}^{(n-1)/2} \binom{n}{l} \left(\frac{p}{1-p}\right)^l. \tag{37}$$

If the error probability of one bit of the code is *p* = 0.35, then bit majority coding with *n* = 7 will transmit bits with error probability *p<sup>D</sup>* = 0.2. This value is below 0.25 and allows the use of Hadamard codes. Majority coding with *n* = 15 will give *p<sup>D</sup>* = 0.12 for *p* = 0.35. A larger duplication is possible, but results in a larger code size.

The parameter of this method, affecting its size and error probability, is the bit repetition count *n*.
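The repetition/majority rule and the error probability (37) can be checked with the short sketch below; the sanity values in the comments correspond to those quoted in the text and are approximate.

```python
from math import comb
import numpy as np

def majority_encode(bits: np.ndarray, n: int) -> np.ndarray:
    """Repeat every bit n times (n odd)."""
    return np.repeat(bits, n)

def majority_decode(code: np.ndarray, n: int) -> np.ndarray:
    """Decode each group of n bits by majority vote."""
    return (code.reshape(-1, n).sum(axis=1) > n // 2).astype(np.uint8)

def p_decoded(p: float, n: int) -> float:
    """Error probability (37) of a majority-decoded bit given single-bit error probability p."""
    p_correct = sum(comb(n, l) * p**l * (1 - p)**(n - l) for l in range((n - 1) // 2 + 1))
    return 1.0 - p_correct

# p_decoded(0.35, 7)  ~= 0.20
# p_decoded(0.35, 15) ~= 0.12
```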

## *8.3. Block Coding*

Denote the set of all bit strings of length *n* as B<sup>*n*</sup>. This set can be viewed as the set of vertices of an *n*-dimensional binary cube. Consider a string of length *k*, called a *message* here: *M* ∈ B<sup>*k*</sup>. There can be 2<sup>*k*</sup> different messages. Consider also a set of 2<sup>*k*</sup> strings of length *n* > *k*, called *codes*. There is a one-to-one correspondence between messages and codes. Since the code length is greater than the message length, the coding is redundant, and it is possible to alter some bits of a code and still be able to restore the corresponding message. The idea of block coding is to select those 2<sup>*k*</sup> codes, out of the total number of 2<sup>*n*</sup>, for which the probability of a restoration error is minimal. The set of selected codes is called the *code table* ℂ. In terms of the *n*-dimensional binary cube, this means selecting 2<sup>*k*</sup> vertices so as to maximize the minimal Hamming distance between the selected vertices:

$$\mathbb{C}^* = \arg\max_{\mathbb{C}} \min_{\substack{u,v \in \mathbb{C} \\ u \neq v}} \rho(u,v)\,, \qquad \rho^* = \min_{\substack{u,v \in \mathbb{C}^* \\ u \neq v}} \rho(u,v)\,, \tag{38}$$

where *ρ* is the distance (32).

Hadamard coding is based on a Hadamard matrix, which is constructed iteratively:

$$H\_0 = \begin{pmatrix} 0 \end{pmatrix},\ H\_{n+1} = \begin{pmatrix} H\_n & H\_n \\ H\_n & \overline{H}\_n \end{pmatrix},\tag{39}$$

where *H̄* is the bitwise inversion of all elements of *H*. The Hadamard matrix *H*<sub>*k*</sub> is a square matrix with 2<sup>*k*</sup> rows. It gives the coding table naturally: each row number is a message and the row contents are the corresponding code. It can be proven that for Hadamard codes *ρ*<sup>∗</sup> = 2<sup>*k*−1</sup>. We use so-called augmented Hadamard codes, where another 2<sup>*k*</sup> strings are added to the code table. These strings are the bitwise inverted versions of the strings obtained from (39). For this code table, *ρ*<sup>∗</sup> = 2<sup>*k*−2</sup>.
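The construction (39) and the resulting code table are easy to reproduce; the sketch below builds the 0/1 Hadamard matrix of a given order, augments it with the inverted rows, and decodes by nearest-neighbour search. It works with the full code length of 2<sup>order</sup> bits, whereas the code actually used later in the text is a punctured variant of length *n* = 31, so this is an illustration of the idea rather than the exact code.

```python
import numpy as np

def hadamard_bits(order: int) -> np.ndarray:
    """Build the 0/1 Hadamard matrix by the recursion (39); it has 2**order rows."""
    h = np.zeros((1, 1), dtype=np.uint8)
    for _ in range(order):
        h = np.block([[h, h], [h, 1 - h]])
    return h

def augmented_hadamard_table(order: int) -> np.ndarray:
    """Code table of the augmented Hadamard code: the rows of H plus their bitwise inverses."""
    h = hadamard_bits(order)
    return np.vstack([h, 1 - h])

def hadamard_decode(word: np.ndarray, table: np.ndarray) -> int:
    """Nearest-neighbour decoding: the index of the closest codeword is the decoded message."""
    distances = (table != word).sum(axis=1)
    return int(distances.argmin())
```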

There is a simple and well-known estimate (the *Hamming bound*) of the probability of a block coding error, which for augmented Hadamard codes is:

$$p\_H \lesssim 1 - P\_{corr} = 1 - \left(1 - p\right)^n \sum\_{l=0}^{(n-1)/4} \binom{n}{l} \left(\frac{p}{1-p}\right)^l,\tag{40}$$

where *p* is the probability of bit inversion. Since this stage takes the output of the bit majority decoding as its input, the value of *p* here is the value of *p*<sub>*D*</sub> from (37). Let us redefine *p*<sub>*D*</sub> → *p* in this section for simplicity. Furthermore, one can note that (40) is the same as (37) except for the upper summation limit.

However, this is a rather rough estimate, which grows worse as *n* increases. For small *n*, exact calculations can be performed by a simple exhaustive search; the results are given below. The message is decoded under the assumption that the original code was distorted minimally, i.e., for a received code *C* we look for the closest code *C*<sup>∗</sup> ∈ ℂ. Let us call it the *attractor*. There can be several attractors (several codes can have the same minimal distance to *C*); if so, a random one is chosen. Let us denote the set of attractors for *C* as *A*(*C*). The probability of decoding the correct message is the probability of choosing the correct attractor

$$P_{corr} = \sum_{M} P(M) \sum_{C} p(C^*|C)\, p(C|C^*)\,, \tag{41}$$

where *P*(*M*) is the probability of obtaining message *M* as input. Message *M* is encoded by the code *C*<sup>∗</sup> ∈ ℂ. Then *p*(*C*|*C*<sup>∗</sup>) is the probability of obtaining the distorted code *C* while transmitting *C*<sup>∗</sup>, and *p*(*C*<sup>∗</sup>|*C*) is the probability of recovering *C*<sup>∗</sup> (hence, the correct *M*) from the distorted code *C*. Suppose all messages are equally probable. Then the sum over *M* is reduced and

$$P_{corr} = \sum_{C} p(C^*|C)\, p(C|C^*)\,. \tag{42}$$

Without loss of generality, due to the symmetry of Hadamard codes [61], we can assume *C*<sup>∗</sup> to be the zero code, i.e., a string of zero bits (as it is in the standard code). Then the probability of obtaining a certain code *C* from the zero code is *p*<sup>*β*(*C*)</sup>(1 − *p*)<sup>*n*−*β*(*C*)</sup>, where *β*(*C*) ≡ *ρ*(0, *C*) is the number of unit bits in *C*, and

$$P_{corr} = \sum_{C} p(0|C)\, p^{\beta(C)} (1 - p)^{n - \beta(C)}\,, \tag{43}$$

where *p*(0|*C*) is the probability of obtaining the zero code from *C*. Define the set of attractors for a string *C* as the code table entries with minimal distance to it:

$$A(C) = \left\{ C' \in \mathbb{C} : \rho(C', C) = \rho_{\min}(C) \right\}, \qquad \rho_{\min}(C) = \min_{C' \in \mathbb{C}} \rho(C, C')\,. \tag{44}$$

Define the cardinality of this set as *α* = |*A*(*C*)| and set *α* = 0 if there is another code table entry closer to *C* than the correct code: ∃ *C*′ ∈ ℂ, *C*′ ≠ *C*<sup>∗</sup> : *ρ*(*C*′, *C*) < *ρ*(*C*<sup>∗</sup>, *C*). Then we can write

$$p(0|C) = \begin{cases} 0, & \alpha = 0,\\ 1/\alpha, & \alpha \neq 0. \end{cases} \tag{45}$$

For small values of *n*, all points of the code space B<sup>*n*</sup> can be enumerated and their distribution by the distance to zero *β* and the number of attractors *α* can be computed:

$$H(\beta, \alpha) = \left| \left\{ C : 0 \in A(C),\ |A(C)| = \alpha,\ \beta(C) = \beta \right\} \right|, \qquad \beta \in [0; n]\,,\ \alpha \in [1; 2^{k}]\,. \tag{46}$$

Substituting this into (43), we get:

$$P_{corr} = \sum_{\alpha \neq 0} \sum_{\beta} \frac{H(\beta, \alpha)}{\alpha} p^{\beta} (1 - p)^{n - \beta} = \sum_{\beta} p^{\beta} (1 - p)^{n - \beta} \sum_{\alpha \neq 0} \frac{H(\beta, \alpha)}{\alpha} \tag{47}$$

and decoding error

$$p\_H = 1 - P\_{corr} = 1 - (1 - p)^n \sum\_{\beta} h(\beta) \left(\frac{p}{1 - p}\right)^{\beta}, \quad h(\beta) = \sum\_{\alpha \neq 0} \frac{H(\beta, \alpha)}{\alpha} \,. \tag{48}$$

The formula is the same as (40) except for the coefficients and the summation limits. The values of *h*(*β*) for the augmented Hadamard code of order 5 (*n* = 2<sup>5</sup> − 1 = 31, *k* = 5 + 1 = 6) are given in Table 6. All meaningful values are given; the values for other *α* and *β* are zero. For instance, if the code is distorted in 12 bits or more, it will never be recovered to the correct value, since there will be another valid code from the code table closer to the distorted value.


**Table 6.** Values of *h*(*β*) and *p*<sub>*H*</sub>(*β*) for the Hadamard code with *k* = 6, *n* = 31.

Up to *β* = 7 there is only one attractor, i.e., at this or a smaller divergence the message is always recovered. This corresponds to the Hamming bound (40). However, even with larger divergences, up to *β* = 11, there is a significant probability of correct recovery. This plays a big role, since the majority of distorted codes fall outside the Hamming bound but still have a significant probability of restoring the message correctly. Thus, the error probability (40) is an overestimate. For example, for the considered code and the error *p* = 0.250, formula (40) gives *p*<sub>*H*</sub> = 0.527, which would seem to rule out using such a code. However, the calculation using formula (48) gives *p*<sub>*H*</sub> = 0.261, which is quite suitable for use in the next step of the coder.
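The exhaustive calculation behind (46)–(48) can be reproduced directly for small code lengths; the sketch below enumerates all strings of B<sup>*n*</sup>, accumulates *h*(*β*) for the all-zero codeword, and evaluates *p*<sub>*H*</sub>. It is only practical for small orders (the order-5 code of the text would require enumerating 2<sup>31</sup> strings) and relies on the `augmented_hadamard_table` helper sketched in Section 8.3.

```python
from itertools import product
import numpy as np

def exact_block_error(table: np.ndarray, p: float) -> float:
    """Exact decoding error (48) by exhaustive enumeration of B^n (small n only).
    Assumes the transmitted codeword is row 0 of the table (the all-zero string)."""
    n = table.shape[1]
    h = np.zeros(n + 1)                              # accumulators for h(beta)
    for bits in product((0, 1), repeat=n):
        c = np.array(bits, dtype=np.uint8)
        dists = (table != c).sum(axis=1)
        attractors = np.flatnonzero(dists == dists.min())
        if 0 in attractors:                          # zero code recovered with probability 1/alpha
            h[int(c.sum())] += 1.0 / attractors.size
    q = p / (1 - p)
    return 1.0 - (1 - p) ** n * sum(h[b] * q ** b for b in range(n + 1))
```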

The only Hadamard coding parameter is the word length *k*. The codeword size *n* is dependent on it: *n* = 2<sup>*k*−1</sup> − 1 for the augmented variant used here (e.g., *n* = 31 for *k* = 6).

## *8.4. Reed–Solomon Message Coding*

The unit of encoding for the Reed–Solomon algorithm is the entire message, which is divided into codewords of a fixed size of *s* bits each. A stream of *L* bits is cut into *k* = ⌈*L*/*s*⌉ words. Then additional words can be added by the coding algorithm up to a total count of *n*. It turns out that if no more than *t* = ⌊(*n* − *k*)/2⌋ codewords are altered, it is still possible to recover the message. So, the Reed–Solomon code corrects no more than *t* errors, where *t* is half the number of redundant words. Denoting *p* = *t*/*n*, we get

$$p \leqslant \frac{n-k}{2n} \,. \tag{49}$$

This number is an estimate of the tolerated error probability of a codeword. The Reed–Solomon method also imposes a limitation on the total codeword count *n* in the whole message:

$$n \leqslant 2^{s} - 1\,. \tag{50}$$

The error probability in (49) is determined by the previous step: *p* = *p*<sub>*H*</sub>. Hence, the possible Reed–Solomon codes here are determined by the codeword length *s* and the message length *L*.
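The two constraints above are simple enough to check programmatically. The sketch below evaluates (49) and (50) for a given choice of *L*, *s*, and *n*; the sample values are hypothetical, chosen only to illustrate the call, and are not the parameters reported later in the paper.

```python
from math import ceil, floor

def reed_solomon_check(L: int, s: int, n: int) -> dict:
    """Evaluate the Reed-Solomon constraints (49) and (50) for message length L bits,
    symbol size s bits, and total codeword count n."""
    k = ceil(L / s)                                  # number of message words
    t = floor((n - k) / 2)                           # correctable word errors
    return {
        "message_words": k,
        "correctable_words": t,
        "tolerated_error_share": (n - k) / (2 * n),  # bound (49) on p
        "length_ok": n <= 2 ** s - 1,                # constraint (50)
    }

# Hypothetical example: a 65-bit message with 5-bit symbols and 31 codewords in total.
print(reed_solomon_check(L=65, s=5, n=31))
```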

## **9. Selection of Code Parameters**

Four ECC methods are organized in a chain. The encoder runs Reed–Solomon, Hadamard, bit majority, and shuffling to obtain the redundant code. The decoder executes this chain in reverse order. The encoder should produce a code no larger than the iris template, i.e., *N* = 6656 bits for the presented system. The code size cannot be larger; duplication and masking are unacceptable, as they would make it trivial to break such a code. Naturally, it is also desirable to embed a message of reasonable size. This is a discrete constrained optimization problem.

The ECC methods used here have the following parameters affecting their characteristics: (1) decorrelation has no parameters; (2) majority coding is governed by the bit duplication count *n*; (3) Hadamard coding depends on the word size *k*; (4) Reed–Solomon coding is parameterized by the word size *s* and the message length *L*. Combinations of (*n*, *k*, *s*, *L*) values yield different encodings with a specific code length *C*(*n*, *k*, *s*, *L*) and error probabilities. The errors are the aforementioned FRR and FAR. A false rejection is a failure to recover the embedded key after presenting the same person's biometrics. A false acceptance is recovering the person's key with another person's biometrics. The errors depend on the ECC parameters: *FRR*(*n*, *k*, *s*, *L*) and *FAR*(*n*, *k*, *s*, *L*). The formal statement of the problem is

$$\begin{aligned} \text{FRR}(n, k, s, L) &\to \min \text{ }, \\ \text{s.t.} \quad \text{FAR}(n, k, s, L) &\leqslant 10^{-4} \text{ }, \text{ } \text{C}(n, k, s, L) \leqslant \text{N}. \end{aligned} \tag{51}$$
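Since the search space of (51) is small and discrete, it can be solved by a direct grid search, as illustrated in the sketch below; `frr`, `far`, and `code_size` are placeholders for the empirical error estimates and the size calculation of the cascade, and the parameter ranges are illustrative assumptions, not the ones used in the paper.

```python
from itertools import product

def select_parameters(frr, far, code_size, template_bits: int = 6656, far_limit: float = 1e-4):
    """Grid search for (51): minimize FRR subject to FAR <= far_limit and code size <= N."""
    best = None
    best_frr = float("inf")
    for n, k, s, L in product(range(3, 21, 2),    # odd bit repetition counts
                              range(3, 9),        # Hadamard word sizes
                              range(3, 9),        # Reed-Solomon symbol sizes
                              range(16, 129)):    # candidate key lengths in bits
        if code_size(n, k, s, L) > template_bits or far(n, k, s, L) > far_limit:
            continue                              # violates a constraint of (51)
        current = frr(n, k, s, L)
        if current < best_frr:
            best, best_frr = (n, k, s, L), current
    return best, best_frr
```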

## **10. Results and Comparison with Literature**

The solution of (51) was found: *L* = 65, *n* = 13, *k* = 5, *s* = 5, *FRR* = 10.4%. The message size *L* = 65 bits is considered satisfactory for "common" user keys. For larger message sizes, no reasonable solution was found. It should be noted that without bit shuffling (which does not appear explicitly in (51)) the problem cannot be solved even for *L* = 65.

Table 7 contains the results reported in the literature in contrast with those presented in this work.


**Table 7.** The results of iris biometric cryptosystems.

The presented system may seem less successful than its rivals with respect to error level and key length. However, one should note that each of these systems was tested with a single database. Both CASIA databases contain images of one eye taken from adjacent video frames, which results in an extremely high similarity of iris codes, unattainable in practice. The same issue concerns [60]: they use a small laboratory database, and their results cannot be extended to real applications. The ICE 2005 database is much closer to the real world; it contains images of varying quality, acquisition time, and registration conditions. However, both works [42,65] based on it use bit interleaving. If the bit sequence is fixed and known, this ruins the cryptographic strength. If it is made secret, then it becomes just another secret key that must be passed securely: the very thing we try to avoid. Although the presented system has the highest FRR, it is practically applicable and has no obvious security holes.

## **11. Conclusions**

A set of methods for building a biometric cryptosystem based on iris images has been presented. It contains three main parts: iris segmentation, biometric template generation, and the method of embedding/extracting the cryptographic key to/from biometric features. An original system of iris segmentation methods is described. Its distinctive feature is that the iris parameters are estimated in several steps (an initial rough calculation followed by refinement) by algorithms of different kinds. The sequence in which the iris parameters are detected also differs from the commonly employed one. The template creation method is a de facto standard Daugman-style convolution. The method for introducing a cryptographic key into iris biometrics is constructed using the fuzzy extractor paradigm. A key of size up to 65 bits can be embedded; for larger sizes no solution has been obtained. The challenge of the high variance of biometric features has been overcome by introducing bit majority coding. The high local correlation of errors was removed by quasi-random shuffling. The system was tested on several databases of iris images. A study of its cryptographic strength is still required.

**Author Contributions:** Conceptualization, I.M.; methodology, I.M.; software, I.M.; validation, I.M. and I.S.; investigation, I.M.; data curation, I.M.; writing—original draft preparation, I.M.; writing—review and editing, I.S.; supervision, I.M.; project administration, I.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**




**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
