*4.3. Improved Face Recognition Using Face Tracking*

For the face tracking task, we build our face tracking algorithm on the simple online and real-time tracking algorithm (SORT) [46]. SORT uses a Kalman filter to estimate the location of the face in the current frame given its location in the previous frame. It starts by detecting the target face in an initial frame *i* and then predicts the face's location in frame *i* + 1 using the Kalman filter. Note that the Kalman filter only approximates the face's new location, which still needs to be refined. Finally, the Hungarian algorithm is used to associate the predicted locations with the detections.
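For illustration, the association step can be sketched as follows. This is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples and abstracting the Kalman prediction away; `scipy.optimize.linear_sum_assignment` plays the role of the Hungarian algorithm, and the IoU threshold of 0.3 is illustrative, not a value from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Match Kalman-predicted face boxes to fresh detections.

    Returns (matches, unmatched_detections), where matches is a list of
    (tracker_index, detection_index) pairs.
    """
    if not predicted_boxes or not detected_boxes:
        return [], list(range(len(detected_boxes)))
    # The Hungarian algorithm minimises cost, so negate the IoU.
    cost = np.array([[-iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if -cost[r, c] >= iou_threshold]
    matched = {c for _, c in matches}
    unmatched = [c for c in range(len(detected_boxes)) if c not in matched]
    return matches, unmatched
```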

The main problem we target is the speed/accuracy trade-off. Running face detection and face recognition continuously is time-consuming. Moreover, the quality of the extracted face features depends on the face pose: a frontal pose yields the best facial features, and quality degrades as the pose departs from frontal. Therefore, instead of detecting faces in every input video frame, we assign each newly detected face a tracker and track it rather than re-detect it. Furthermore, only for each new tracker is the face embedding inferred and compared against the stored embeddings via cosine similarity to obtain the user identity (ID); the ID is then added to the tracker metadata for fast recognition, i.e., in successive frames the ID is retrieved from the tracker without running recognition again. This improves the processing time and recognition rate and reduces the recognition errors caused by deviations from the frontal face pose.
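A minimal sketch of this one-time identity lookup is shown below, assuming a `gallery` dictionary of L2-normalised reference embeddings; the similarity threshold of 0.5 is illustrative, not a value from the paper.

```python
import numpy as np

def identify(embedding, gallery, threshold=0.5):
    """Return the gallery ID with the highest cosine similarity, or None."""
    embedding = embedding / np.linalg.norm(embedding)
    best_id, best_sim = None, threshold
    for user_id, ref in gallery.items():
        sim = float(np.dot(embedding, ref))  # cosine similarity of unit vectors
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    return best_id

class FaceTracker:
    """Tracker metadata: the identity is inferred once, then reused."""
    def __init__(self, box, embedding, gallery):
        self.box = box
        self.user_id = identify(embedding, gallery)  # recognition runs once

    def recognise(self):
        # Successive frames simply read the cached ID; no embedding needed.
        return self.user_id
```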

In the proposed tracking Algorithm 1, for each input frame we detect faces using the face detection and alignment of Section 4.1. Initially, a new tracker is created for each detection box by applying SORT [46], which analyzes previous and current frames and predicts face locations on the fly using the Kalman filter and the Hungarian algorithm. Then, the user ID is obtained using the face recognition of Section 4.2 and attached to the face tracker for fast recognition in subsequent frames. Finally, the trackers are associated with the detected faces and maintained throughout tracking, with a user ID assigned to each face tracker. We update each tracker in every frame and validate that a face is still present inside its box to improve the tracking quality; if not, we delete the tracker to prevent unbounded growth in the number of trackers. Moreover, the actual user identity is attached to the face tracker instead of a generic tracker ID to speed up face recognition. A condensed sketch of this loop follows.
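The sketch below builds on the `associate` and `FaceTracker` helpers above; `detect_faces` and `embed` stand in for the models of Sections 4.1 and 4.2, `contains_face` is a hypothetical validity check, and the tracker is assumed to additionally expose Kalman `predict`/`update` methods. It is an illustration of the loop structure, not the paper's exact implementation.

```python
def process_frame(frame, trackers, gallery):
    detections = detect_faces(frame)               # Section 4.1
    predicted = [t.predict() for t in trackers]    # Kalman prediction (SORT)
    matches, unmatched = associate(predicted, detections)

    for t_idx, d_idx in matches:                   # refine matched trackers
        trackers[t_idx].update(detections[d_idx])

    for d_idx in unmatched:                        # new face -> new tracker
        emb = embed(frame, detections[d_idx])      # Section 4.2, run once
        trackers.append(FaceTracker(detections[d_idx], emb, gallery))

    # Delete trackers whose box no longer contains a face, so the number
    # of trackers cannot grow without bound.
    trackers[:] = [t for t in trackers if contains_face(frame, t.box)]
    return [(t.box, t.recognise()) for t in trackers]
```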

To improve the proposed face tracking algorithm and minimize the tracking error, we obtain the head joint from the tracked skeleton provided by the WS1 Kinect V2 camera and try to assign it to the face center. If the assignment succeeds, we update the face tracker with the refined location; otherwise, the tracker is deleted. This also reduces the number of identity switches across longer periods of occlusion.
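A minimal sketch of this refinement step, assuming `head_xy` is the Kinect V2 head joint projected into image coordinates and `max_dist` is an illustrative pixel gating threshold:

```python
import math

def refine_with_head_joint(tracker, head_xy, max_dist=50.0):
    """Snap the tracker to the head joint if close enough; otherwise drop it."""
    cx = (tracker.box[0] + tracker.box[2]) / 2.0
    cy = (tracker.box[1] + tracker.box[3]) / 2.0
    if math.hypot(head_xy[0] - cx, head_xy[1] - cy) <= max_dist:
        w = tracker.box[2] - tracker.box[0]
        h = tracker.box[3] - tracker.box[1]
        tracker.box = (head_xy[0] - w / 2, head_xy[1] - h / 2,
                       head_xy[0] + w / 2, head_xy[1] + h / 2)
        return tracker   # keep the refined tracker
    return None          # assignment failed: the tracker is deleted
```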

#### **5. Experiments and Analysis**

The most critical parts of the face recognition and tracking framework are the face detection and face recognition models. To properly evaluate the effectiveness of the introduced tracking approach, we trained and evaluated the two models separately.

#### *5.1. Face Detection*

For face detection, RetinaFace is trained on the WIDER FACE dataset [52], which contains 32,203 images and 393,703 face bounding boxes with a high degree of variability in scale, pose, expression, occlusion, and illumination. The evaluation is performed on the WIDER FACE validation set, achieving an Average Precision (AP) of 0.83 on the hard subset.

#### *5.2. Face Recognition*

For the face recognition network, ArcFace is trained on the MS1MV2 dataset [4,53] for 30 epochs with a batch size of 512, a feature scale *s* of 64, and an angular margin *m* of 0.5. MS1MV2 is a semi-automatically refined version of the MS-Celeb-1M dataset [53], which contains about 100k identities and 10 million images. Evaluation on the large-pose CPLFW and large-age CALFW datasets achieves accuracies of 95.45% and 92.07%, respectively.
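For reference, the core of the ArcFace loss with these hyperparameters can be sketched as follows; this is a minimal numpy illustration of the angular margin applied to the target-class logit, not the full training code.

```python
import numpy as np

def arcface_logit(cos_theta, s=64.0, m=0.5):
    """Add the angular margin m to the target-class angle, then rescale by s."""
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return s * np.cos(theta + m)
```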

#### *5.3. Results*

The metrics used to measure the overall system performance are precision, recall, F-score, and recognition rate. We classify the predictions into True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). A *True Positive* is obtained when the model correctly predicts the subject class (i.e., the subject ID), meaning that the prediction matches the ground truth. Otherwise, the prediction is considered a *False Positive*.

A *True Negative* is obtained when the model correctly makes no prediction for a subject that is not in the database. Conversely, a *False Negative* occurs when the model fails to recognize a subject that is in the database.

*Precision* is the proportion of predicted subject identities that match the ground truth, i.e., the fraction of recognitions that are correct. It can be calculated as follows:

$$Precision = \frac{TP}{TP + FP}.\tag{3}$$

*Recall* measures the proportion of ground-truth subjects that were correctly recognized, i.e., the number of true positives relative to the sum of true positives and false negatives:

$$Recall = \frac{TP}{TP + FN}.\tag{4}$$

The *F-score* is the harmonic mean of precision and recall and summarizes both in a single number. It can be calculated as follows:

$$F\text{-score} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}.\tag{5}$$

The overall recognition performance is measured by the face recognition rate *FR_R*, defined as the ratio between the total number of correctly recognized faces and the total number of detected/tracked faces. It can be calculated as follows:

$$FR_R = \frac{TP}{Total\,faces} \times 100.\tag{6}$$
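For convenience, Equations (3)-(6) can be computed from raw counts as in the following sketch; the counts in the usage comment are made up for illustration.

```python
def metrics(tp, fp, fn, total_faces):
    precision = tp / (tp + fp)                                # Eq. (3)
    recall = tp / (tp + fn)                                   # Eq. (4)
    f_score = 2 * precision * recall / (precision + recall)   # Eq. (5)
    fr_rate = tp / total_faces * 100                          # Eq. (6)
    return precision, recall, f_score, fr_rate

# e.g. metrics(94, 3, 6, 100) -> approx. (0.969, 0.940, 0.954, 94.0)
```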

To evaluate the proposed framework, we tested it in two different settings: a dataset evaluation and an online evaluation.

#### 5.3.1. Dataset Evaluation

We use the ChokePoint dataset [54] to evaluate the proposed framework. ChokePoint is a video dataset collected and designed for experiments on person identification/verification under real-world surveillance conditions. It contains videos of 25 subjects (6 female and 19 male). In total, the dataset consists of 48 video sequences and 64,204 face images with variations in illumination, pose, and sharpness, as well as misalignment due to automatic face localization/detection.

The experimental results show the tracking performance for the 25 subjects of the ChokePoint dataset. To quantify the recognition refinements, we tested the proposed face recognition framework with and without tracker assistance. The average results are shown in Table 1. Furthermore, the Receiver Operating Characteristic (ROC) curve in Figure 5 shows that the tracking approach improves the recognition rate at high false positive rates and reduces the false classification rate.

**Table 1.** The average results of precision, recall, and F-score on the ChokePoint dataset.


**Figure 5.** ROC Curve of ChokePoint Dataset for the Proposed Framework.

#### 5.3.2. Online Evaluation

We deployed the proposed framework in a real HRI study [6] to further evaluate it in real-time HRI and demonstrate its robustness. During the experiments, evaluation data were collected from 11 subjects (2 female and 9 male) aged between 20 and 34 years.

The experimental results show the tracking performance and recognition rate for the 11 subjects during their interactions with RoSA [6]. To quantify the recognition refinements, we tested the proposed face recognition framework with and without tracker assistance. The proposed framework achieved a face recognition rate of 94% with tracking and 76% without tracking. Figure 6 shows the impact of tracking on the *Precision* of the proposed framework, and Figure 7 shows its impact on *Recall*. Furthermore, Figure 8 shows the *F-score* of the proposed framework with and without tracker assistance.

**Figure 6.** Impact of Tracking on Precision of Face Recognition.

**Figure 7.** Impact of Tracking on Recall of Face Recognition.

**Figure 8.** F-score results of the proposed framework with and without tracker assistance.

Compared to the standard face recognition pipeline, the proposed framework has a lower processing time, achieving frame rates of 25–40 fps. Some results of the proposed framework during real HRI in our RoSA system [6] are shown in Figure 9.

**Figure 9.** Experimental results of the proposed framework showing its robustness against various head postures and illumination conditions.

To confirm the obtained results, we ran the experiments again on the recorded videos from the Wizard-of-Oz study [5] and obtained the same results. This dataset contains videos of 36 subjects performing the same tasks as in the RoSA study, collected on different days under different lighting conditions. For every subject (video), we selected three exemplar face images with different poses and added the extracted embeddings to the database to match against the video faces. Table 2 shows the precision and recall results for 37 subjects, with the top ten results listed first.

**Table 2.** Result of precision and recall for the proposed framework.



#### *5.4. Computational Efficiency Assessment*

In general, lightweight face networks provide promising results for face recognition and perform comparably to state-of-the-art very deep face models in most face recognition scenarios. In particular, ResNet100-ArcFace by Deng et al. [4] is one of the best performing state-of-the-art models across the evaluated scenarios; however, it demands high computational resources. For example, the biggest accuracy gap between ResNet100-ArcFace and MobileFaceNet (the network we use) is 8% on the very large-scale DeepGlint-Image dataset (one of the most challenging databases), while on the remaining databases it is less than 3%. Regarding computational complexity, however, ResNet100-ArcFace requires 19× more storage space, involves 26× more FLOPs, and has 32× more parameters than MobileFaceNet.

Applying face tracking gives us the advantage of not needing to run face detection and recognition on every input frame. However, to increase the accuracy of our framework and minimize the tracking error, we run the full recognition pipeline on every fifth frame, as sketched below.
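The following sketch of this scheduling reuses the hypothetical `process_frame` helper from Section 4.3; only the every-fifth-frame interval comes from the text, the helper names are assumptions.

```python
DETECT_EVERY = 5  # full detection + recognition every fifth frame

def run(video_frames, trackers, gallery):
    for i, frame in enumerate(video_frames):
        if i % DETECT_EVERY == 0:
            # Full pipeline: detection, association, recognition of new faces.
            results = process_frame(frame, trackers, gallery)
        else:
            # Cheap path: Kalman prediction plus cached identities only.
            results = [(t.predict(), t.recognise()) for t in trackers]
        yield results
```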

To assess the computational efficiency of the proposed framework, we tested it on the videos collected during the RoSA study [6] and the Wizard-of-Oz study [5] (47 videos in total) and measured the average processing time of each face recognition module. The hardware used was an NVIDIA GeForce GTX 1080 Ti desktop GPU (11 GB GDDR5X, 3584 CUDA cores). Table 3 shows the average execution time of the individual methods used in the proposed framework. In summary, the average execution time of the whole process is about 6.7 ms per frame, and the framework processes ∼35 frames per second on average.


**Table 3.** Average execution time of individual methods used in the proposed framework.

#### **6. Limitations and Future Work**

The conducted study has a complex setup comprising two workstations (WS1 and WS2) synchronized via the Robot Operating System (ROS). In addition, extracting face features during the experiments is challenging due to illumination conditions, extreme deviations in head pose angles, and occlusion. Nevertheless, the performance evaluation above showed the effectiveness of the proposed framework in recognizing subjects' identities in a multi-person environment.

A few subjects were misidentified during the experiments due to a flawed registration process and poor face feature embeddings, which led to the re-registration of those subjects.

The advantage of our framework is that it relies on lightweight CNNs for all face recognition stages, including face detection, alignment, and feature extraction, to meet the real-time requirements of HRI systems. Furthermore, the developed framework can simultaneously recognize the faces of the cooperating subjects across various poses, facial expressions, illumination conditions, and other outdoor-related factors. Although two of the subjects wore face masks for the whole experiment, our model succeeded in recognizing their identities with reasonable confidence.

Future work will involve a new study with a larger number of subjects and different human–robot interaction scenarios to assess the framework's performance more effectively and overcome the limited number of subjects in the RoSA study. In addition, future work will involve designing an end-to-end trainable convolutional network framework covering all face recognition stages.

#### **7. Conclusions**

We propose a face recognition system for human–robot interaction (HRI) boosted by face tracking and based on deep convolutional neural networks (CNNs). To ensure that our framework can work in real-time HRI systems, we built it on lightweight CNNs for all face recognition stages, including face detection, alignment, tracking, and feature extraction. Furthermore, we implemented our approach as a modular ROS package that makes it straightforward to integrate into different HRI systems. Our results suggest that the use of face tracking alongside face recognition increases the recognition rate.

We utilize the state-of-the-art *ArcFace* loss function for the face recognition task and the *RetinaFace* method for face detection, combined with a simple online and real-time face tracker. Furthermore, we propose a face tracker that tackles the challenges faced by existing face recognition methods, including varying illumination conditions, continuous changes in head posture, and occlusion.

The face tracker is designed to fuse the tracking information with the recognized identity, associating the identity with each face when it is first detected. For updated trackers, the last recognized identity is kept alongside the tracker; a new identity prediction is required only for new trackers. This method improved the overall processing time as well as the face recognition accuracy and precision for unconstrained faces.

The proposed framework was tested in real-time experiments within our real HRI system "RoSA", with 11 participants interacting with the robot to accomplish different tasks. Furthermore, to confirm the obtained results, we tested it on the recorded videos from the Wizard-of-Oz study, which contains videos of 36 subjects performing the same tasks as in "RoSA", and obtained the same results. The results showed that the framework effectively improves the robustness of face recognition and boosts the overall accuracy by an average of 25% in real-time. It achieves an average precision, recall, and F-score of 99%, 95%, and 97%, respectively.

**Author Contributions:** Conceptualization, A.K. and A.A.-H.; methodology, A.K., A.A.A., D.S., J.H. and T.H.; software, A.K., D.S., J.H. and T.H.; validation, A.K., D.S., J.H. and A.A.A.; investigation, A.K., D.S., J.H. and A.A.A.; resources, A.A.-H.; writing—original draft preparation, A.K., A.A.A., D.S., J.H. and T.H.; writing—review and editing, A.K., A.A.A., D.S., J.H., T.H. and A.A.-H.; visualization, A.K., A.A.A., D.S., J.H. and T.H.; supervision, A.A.-H.; project administration, A.K. and A.A.-H.; funding acquisition, A.A.-H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is funded by the Federal Ministry of Education and Research of Germany (BMBF) RoboAssist No. 03ZZ0448L and Robo-Lab No. 03ZZ04X02B within the Zwanzig20 Alliance 3Dsensation.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki. Ethical approval was granted by the Ethik-Kommission der Otto-von-Guericke-Universität (IRB00006099, Office for Human Research) 157/20 on 23 October 2020.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
