1. Introduction
Tracking and locating a person automatically in an unconstrained large crowd gathering through face detection and recognition is still challenging. Face detection and recognition is challenging due to various dynamic factors such as low resolution, variable crowd distance from installed cameras, mobility of cameras and the crowd. In this paper, we propose an automatic tracking of the registered missing persons in a large Al-Nabawi mosque gathering scenario where millions of pilgrims gather to perform religious rituals in a covered area of approximately 73 acres.
The contribution in this paper is novel in two aspects. First, to the best of our knowledge, this is one of the few proposals for automatic tracking of the missing persons in a large gathering with low-resolution images. There are various proposals in the literature, which apply face recognition algorithms to large crowd images such as [
1,
2,
3,
4]; however, tracking a person in a large crowd with low-resolution images is rare. Several state-of-the-art deep learning algorithms are used for face recognition but with high-quality images such as [
5]; however, our study of the related work suggested that they do not show good performance on low-resolution images of the unconstrained environment. To the best of our knowledge, dividing the surveillance premises into geofences and estimating set of geofences for probable presence of missing person is used for the first time in such works.Therefore, we believe our proposal in this paper contributes to the existing knowledge.
Second, our proposed mechanism is unique in terms of its methodology. In which we first efficiently reduce the search space to locate the missing person by applying the proposed geofence set estimation algorithm and then employ our face detector algorithm which fuses the three cascades of Viola Jones to optimize the number of localized face regions from each frame. Finally, the proposed mechanism applies the face recognition algorithm using the registered profile images of the missing persons to track them in the crowd.
We consider the unconstrained large gathering scenario of Al-Nabawi mosque, Madinah, Saudi Arabia. We first divided the total area of the mosque in 25 geofences and 20 cameras installed in this covered areas as our source of input data in the form of video sequences. Then, the system extracts frames of low quality images. We developed a mobile and web based system through which a head of the pilgrim’s family or the head of the pilgrims group can report his missing companion with time and location. Then, proposed Geo fence set estimation algorithm will result in suggested set of geofences where the missing person could be found. This will significantly reduce the search space and the system will start tracking from the videos of cameras installed in the suggested geo fences premises. Then, we apply our face detector algorithm which is a fusion of three Viola Jones cascades which produces high number of localized face regions with accuracy which is much higher than applying the viola Jones cascades individually. Then, face recognition algorithms is employed to find the missing person from the detected face regions based on the profile image of reported missing person. The prediction of proposed system in this paper has improved significantly from our previous work [
6] where we just employed a single face detector algorithm. Secondly, it is an automated system as it first reduces the search space to increase the efficiency in terms of time and then employ our proposed face detector and recognition algorithm to track missing person in challenging unconstrained large gathering scenario with low resolution data.
The paper is structured as follows.
Section 2 presents our brief related work of recent studies including gap analysis. Our proposed methodology is presented in
Section 3 including algorithms and technical details. Then, we elaborate the implementation results of training and testing of our proposed methodology on the dataset in
Section 4. Finally, the conclusion and future work is presented in
Section 5.
2. Related Work
Facial recognition is critical for real-time applications of locating missing persons. Therefore, for our presented scenario it is the matter of an immense importance to identify and recognize human faces in a large crowd having unconstrained environment. Therefore, missing person identification could be attained to find vulnerable group of persons including elderly, children and people with disease (i.e., Dementia, Alzheimer, etc.). We now briefly present our review of recent literature related to the tracking of missing persons in the large crowd scenarios using face detection and recognition on video sequences.
According to Sanchez-Moreno, A.S. et al. [
7], some deep neural networks techniques have recently been created to attain state-of-the-art performance on tracking of missing person through face detection and recognition problem. Their work is not for a densely populated environment. They employed the YOLO Face approach for face identification because of its high speed real time detector based on YOLO version 3. Secondly, for classification to recognize faces, a combination of FaceNet and a supervised learning technique, such as the support vector machine (SVM), is proposed. Their experiments are based on unconstrained datasets and show that YOLO-Face performs better when the face under analysis has partial occlusion and position fluctuations. Nonetheless, it can recognise little faces. The Honda/UCSD dataset is used to obtain an accuracy of more than 89.6 percent for face identification. Furthermore, the testing findings showed that the FaceNet+SVM model achieved an accuracy of 99.7 percent when utilizing the LFW dataset. FaceNet+KNN and FaceNet+RF score 99.5 percent and 85.1 percent, respectively, on the identical dataset, while FaceNet achieves 99.6 percent. Finally, when both the face detection and classification phases are active, the suggested system has a recognition accuracy of 99.1 percent and a run-time of 49 ms.
The early work published by Nadeem A. et al. [
6] and Nadeem A. et al. [
8] using an unique integration of face-recognition algorithms, which employs many recognition algorithms in parallel and combines their predictions using a soft voting mechanism, shown improved accuracy. Based on spatio-temporal constructs, this delivers a more sophisticated forecast. However, the technique was used on low-resolution cropped photos of recognized faces in order to discover missing people in the previously described difficult large crowd gathering. That was explored for scenarios involving enormous crowds at the Al-Nabawi mosque in Madinah. It is a highly unregulated environment with a data collection of low-resolution pictures collected from publicly recorded moving CCTV cameras. The proposed model first detects faces in real time from camera-captured photos, then applies face recognition algorithms with soft voting to get better prediction for identifying the missing persons. A tiny series of consecutive frames reveals the presence of a missing individual.
The method suggested by Li, W., and Siddique, A. A., [
9] used the notion of face recognition by utilizing a pre-trained LBPH Face Recognizer to identify the individual in the acquired frame in conjunction with a drone mounted camera to capture the live video stream. An inbuilt Raspberry Pi module analyses the obtained video data, detecting the intended individual with an accuracy of 89.1%.
The authors Ullah, R. et al. [
10] proposed a real-time framework for human face detection and recognition in CCTV images over a 40 K images with different environmental condition, background and occlusions. In addition, they performed a comparison analysis between different machine / deep learning algorithms such as decision trees, random forest, K-NN and CNN. The authors claimed that they have achieved 90% overall accuracy with minimum computing time using CNN.
As we noticed in
Table 1, the authors in [
7] applied state-of-art deep learning techniques for face detection and recognition using conversion of low resolution images to high-quality images, but the technique is not tested in low-resolution images from large gatherings. Moreover, literature in [
1,
2,
3,
4,
5,
11] shows work on recognizing people based on large crowd and low resolution image data, whereas the literature presented in [
12] only depicts exploitation of large crowd data and in [
13] research carried out only on low resolution data. However, emotional expression of human face have been found in [
4] crowded environment showed happy faces are easily by identified. Finally, we found research in [
14] that carried out identifying and tracking of pilgrims using CNN and Yolo v4 in unconstrained environment but used high resolution images data to identify smaller sized faces in the crowd.
By analyzing the state-of-the-art, we therefore state that no significant work found with human facial recognition and tracking of missing person based work on low resolution dataset in unconstrained and large crowded environments with above mentioned constraints. However, an ample amount of literature was found in the large crowd domain based on re-identification, tracking and crowd count, etc. Therefore, in this study, we presented our proposed mechanism considering the gap to find missing person by identification and recognition as well in a large crowded gathering of people with diverse age groups having fully unconstrained environment. In this regard, we used dataset built on the pre-processed frames extracted from publicly filmed CCTV videos in Al-Nabawi Mosque, Al Madinah, KSA.
3. Proposed Methodology
This research work is proposed for the automated tracking of reported missing person from live videos of unconstrained large gathering. The proposed mechanism is general but to prove the concept we consider large gathering scenarios of Al-Nabawi mosque (Madinah, KSA) where thousands of pilgrims daily visit. The probability of losing vulnerable companions such as a child or an older person in such large gatherings is high and their automated tracking, using intelligent video surveillance, is an extremely challenging task. The proposed work tries to mitigate this challenging task by dividing the experimented premises into geofences where each geofence is installed with particular cameras.
The proposed tracking mechanism is efficient as it reduces the search space by estimating the probabilistic region of a missing person through a novel geofence set estimation algorithm. This algorithm uses spatio-temporal information of a missing person reported by his/her group head through a mobile application. The query face image of a missing person is fetched from database and, afterwards, it is recognized in videos by cameras that are installed within output set of estimated geofences. This task is accomplished by applying face recognition algorithm on all detected faces which were detected in earlier stage by applying face detection algorithm on video frames. The main tasks of proposed methodology are depicted in
Figure 1 and the details about all these tasks are given in following subsections.
3.1. Reporting Missing Person
It is a general practice that pilgrims in Al-Nabawi mosque move in the form of groups. Sometimes each family has a separate group headed by a family member, as shown in
Figure 2a, and sometimes it consists of people from different families and headed by a group leader as shown in
Figure 2b.
The missing person reporting is conducted by group leader through a mobile application. The reporting includes selecting the missing person from the list of his/her group members. This information is accompanied by the approximate missing time and location (in terms of geofence). This information is passed on to geofence set estimation algorithm given in next section.
3.2. Geofence Set Estimation
The perimeter of Al-Nabawi mosque (including the courtyard) is calculated as 2.166 km. We further divided this premises into a 5 × 5 matrix of square sized geofences. The partitions are shown in
Figure 3a and the exact dimensions are shown in
Figure 3b.
The whole premise is covered by 20 surveillance cameras that are installed inside the boundary of Al-Nabawi mosque. Each surveillance camera covers a particular set of geofences. The missing person reporting includes spatio-temporal features of missing event. It includes geo-location of missing person that is approximated by mobile based location of group head and it also includes the estimated time laps (in minutes) since the person is missed. Therefore, based upon this information, a set of geofences is obtained by applying geofence set estimation algorithm. This algorithm defines several crowd levels based upon the automated counting score of people. Then, based upon that crowd level score, the maximum possible distance, covered by missing person, is calculated around all four possible directions and finally a set of geofences is calculated where that person can be found. The algorithm’s output reduces the search space and hence the missing person is tracked only in videos of those cameras that are installed within the output set of geofences. The geofence set estimation is given in Algorithm 1.
Algorithm 1 Geo-Fence Set Estimation |
- 1:
Begin input:t (estimated time laps, in minutes, since person missed), l (mobile based location of group head) output:G (set of geo-fences / Range for probability of presence of the missing person) - 2:
Derive geo-fence of group head (i.e., reporting person) based upon his reporting mobile’s location l - 3:
Calculate crowd level (i.e., ) in all geo-fences based upon total sum of automated counting score of persons in placed n camera images in that geo-fence premises. categorized crowd score levels as per following rule: - 4:
- 5:
- 6:
ifthen - 7:
meters outside mosque premises - 8:
end if// (i.e., calculating vertically down from current geo-fence) - 9:
if
then - 10:
meters outside mosque premises - 11:
end if// (i.e., calculating vertically up from current geo-fence) - 12:
ifthen - 13:
meters outside mosque premises - 14:
end if// (i.e., calculating horizontally right from current geo-fence) - 15:
if
then - 16:
meters outside mosque premises - 17:
end if// (i.e., calculating horizontally left from current geo-fence) - 18:
- 19:
if
then - 20:
exit - 21:
else - 22:
i. ii. - 23:
end if - 24:
end
|
In Algorithm 1 is monotonically increasing function and directly proportional to value. This rule is based upon geo-fence length and width (which is nearly equal to 110 m as geo-fence is nearly square shaped) and 0.5 m/s is the observed average walking speed of person when the premises is nearly vacant (which is estimated as under 60–70 persons) and then walking speed reduced as geo-fence premises gets populated. As per the observation, the average speed of walking person reduced to nearly half as it gets double populated (i.e., 90–120 persons in a geo-fence) then further reduced to one-third when it gets populated nearly 150–200 persons and so on.
3.3. Faces Detection in Video Frames
The faces are detected at frame level, whereas the video streams of only those cameras are examined which are installed within the geofences mentioned in output set of estimation algorithm described earlier. A tracking workflow that examines the video streams is presented in Algorithm 2, which is the improved version of our previously proposed tracking workflow in [
6]. It samples every 10th frame and detects the face regions on that frame. There exists several face detectors, but no one is capable of detecting all the faces in given image correctly. Therefore, a sampled frame is simultaneously fed to three established face detectors called the Cascaded CART, the Cascaded Haar and the Cascaded LBP face detector, then output from these detectors is merged to improve the face detection process. A new face fusion technique is proposed in Algorithm 3, which not only controls the merger of detected overlapping faces, but also maintain the bounding box for updated face region. The fusion strategy increases the face count at frames level, which may also increase a person face count in temporal domain. Therefore, it will enhance the missing person tracking by reducing the negative errors.
Algorithm 2 Tracking workflow |
- 1:
// create face detector objects - 2:
- 3:
- 4:
- 5:
while
do - 6:
// get video frame t… - 7:
// face detection by CART…Where BB defines set of bounding boxes - 8:
// face detection by Haar… - 9:
// face detection by LBP… - 10:
- 11:
for do - 12:
// crop the face region - 13:
- 14:
for do - 15:
// for jth algorithm - 16:
end for - 17:
- 18:
end for - 19:
- 20:
// go for next frame - 21:
end while
|
Algorithm 3 Proposed Face fusion |
- 1:
for
do - 2:
for do - 3:
- 4:
if then - 5:
// over write the box c in - 6:
// cut the box h from - 7:
end if - 8:
end for - 9:
end for - 10:
- 11:
for
do - 12:
for do - 13:
- 14:
if then - 15:
// over write the box f in - 16:
// cut the box l from - 17:
end if - 18:
end for - 19:
end for - 20:
|
The face regions detected on a sampled frame are shown in
Figure 4, where face regions detected by individual face detectors can be observed clearly.
Figure 5 shows the comparative analysis of detected faces, where overlapping and non-overlapping face regions can be observed easily. The great extent of overlap recommend fusing these regions to a single face region, where additional efforts may be required to adjust the bounding box over updated face region. The process not only adjust the bounding boxes but also increases the face count on sampled frame.
3.4. Face Recognition
All the face regions detected on a sampled frame are cropped, enhanced and resized to a size of 50 × 50. Then, every face image is simultaneously fed to five recognition algorithms, where each algorithm provides an (ID, Score) pair for that face image. All the five algorithms may or may not predict the same identification result for input face, therefore obtained (ID, Score) pairs are fed to a soft-voting algorithm that produces a mature identification result for input face. The details about the voting scheme can be seen in our previously published paper [
6]. An example of recognizing the detected face regions is presented in
Figure 6a, where predicted identity of every face region is labeled on box and the associated score is illustrated by bounding box color. The score-color scheme is completely in accordance with [
6], where NM-2 stands for no match recommended and NM-1 indicates no match suggested due to the confusion. Face regions with white boxes show no identity, and a tag of NM-1 or NM-2 is mentioned on every white box, which means this face region does not match to any personnel stored in database. The actual identification of faces on sampled frame is presented in
Figure 6b. Only 15 faces were detected on sampled frame, where 9 faces find a proper match in database, while 6 faces did not find any valid match. The predicted identity of 9 faces on sampled frame can be confirmed from
Figure 6b.
3.5. Missing Person Tracking
The missing persons are tracked in all cameras installed in recommended geofences. For example the tracking of subjects ID-47, 51 and 53 in three different camera views are shown in
Figure 7a–c given below.
The presence of every identified personnel in video sequence is recorded temporally. The tracking of above 12 personnel is presented in
Figure 8, where actual and predicted presence of 12 personnel is presented with different color. As the subject in video sequence is free to move his face leftward, rightward or downward, the presence may not be recorded at every frame correctly, and the ID track looks rough in temporal domain. To resolve this issue the IDs tracks are smoothed along time domain, which improves the tracking of missing personnel significantly. The roughness of an ID track is minimized by holding the presence record of that ID over multiple consecutive past frames.
The proposed technique of smoothing the presence track over temporal domain is presented in
Figure 9.
The smoothed presence tracks of 12 personnel are presented in
Figure 10, where false positive and the false negative presence of some of the 12 personnel can be seen easily.