1. Introduction
Calculating the number of people is a common need in daily life. For example, a tour guide must confirm whether all tour members have arrived at each checkpoint, a check that is repeated again and again during a trip. Other examples include counting the number of people entering a recreation area, a museum, etc., and teachers repeatedly counting the number of students attending class. If the number of people is large, the counting process is time-consuming. Accordingly, an efficient method for counting people brings great benefits to many applications. Face detection and recognition can serve as such a method.
There are two main methods for face detection. The first type is pattern comparison, which is also the most commonly used. Many standard face images must be provided and stored in a database. When a test image is submitted, it is compared with the database images for similarity to decide whether it contains an actual human face. Nevertheless, this method requires an extensive database and much comparison time. The second type is rule-based detection. The rules must be rich enough to obtain good performance. However, with this method it is hard to define a robust threshold for each rule. If a rule's threshold is too strict, the method fails to recognize many human faces. In contrast, if the threshold is too loose, segmented blocks without a human face cause false alarms in face area detection.
Many face detection applications have been developed [1,2,3,4,5,6,7,8,9,10,11,12,13]. In the cases of statistical models, Tuncer et al. [1] proposed using the support vector machine and K-nearest neighbors methods for face recognition, where a perceptual hash was used for feature extraction. Lin et al. [2] proposed applying nonlinear least squares to facial landmarks for pose estimation; face recognition accuracy is improved by recognizing each image set's frontal image instead of using a fixed 3D model. Chakraborty et al. [3] proposed a hand-crafted cascaded asymmetric local pattern to retrieve and recognize facial images. The method encodes the relationship among neighboring pixels in the horizontal and vertical directions, enabling the encoding to obtain an optimal feature length and thus improving face recognition performance. Hssayni and Ettaouil [4] proposed using a co-occurrence matrix to extract the face information in an image, where several coefficients represent the face information; a Bayesian neural network is then employed to recognize the face. In the cases of machine learning [5,10,14,15,16], Viola and Jones [5] used three critical techniques for face detection. Their study proposed a machine learning approach to visual object detection in which an integral image is utilized as a feature. Because the Viola and Jones (VJ) algorithm brings many advantages to visual object detection, many studies have been proposed to improve its accuracy. However, these methods are sensitive to parameter settings, so the detection performance can vary significantly. Artificial intelligence (AI) technologies are progressively being applied in various fields. Most face detection methods work well for a few face images, but performance may be greatly degraded for a large number of face images. Sapijaszko and Mikhael [16] proposed using the 2D discrete wavelet transform and 2D discrete cosine transform to obtain texture features; a multilayer sigmoid neural network then recognizes faces according to these features.
Recently, convolutional neural networks (CNNs) have been widely used in image recognition [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. Wu et al. [7] proposed a multi-task CNN for face detection and head pose estimation by extracting more representative features. Zhang et al. [8] proposed deep cascaded multi-task convolutional neural networks that exploit the inherent correlation between detection and alignment to boost face detection performance; an online hard sample mining strategy was also applied. Li et al. [9] combined a CNN, a blur-aware bi-channel network, and a self-learning mechanism to exploit video contexts continuously for face detection. Yang et al. [11] used a CNN for face detection by scoring facial part responses according to their spatial structure and arrangement. The scoring mechanism is data-driven, and their method can detect faces under severe occlusion and unconstrained pose variations. Ranjan et al. [13] proposed an algorithm for simultaneous face detection, landmark localization, pose estimation, and gender recognition using a CNN. They fused the intermediate layers of a deep CNN using a separate CNN followed by a multi-task learning algorithm operating on the fused features, enabling the performance to be boosted. Ramya et al. [17] proposed using canonical correlation analysis to fuse the local face patterns extracted by AlexNet and a shallow CNN; a multi-support vector machine then classifies the emotion category, where the accuracy rate reaches 87.69%.
Based on the above discussions, the methods in [8,13,17,24] use deep cascaded multi-task frameworks and complex networks for multi-function target recognition. Their computational cost is high, and the size of the trained network is large, so loading the network for recognition is time-consuming. In this study, we attempt to develop a lightweight app for quickly calculating the number of people, for which loading the trained network requires little time. Therefore, the app is lightweight and practical. The VJ algorithm can effectively segment the faces in a photo using machine learning, and a CNN can effectively recognize those faces. In this paper, we propose an automatic and simple app to calculate the number of people in a photo by integrating the VJ algorithm and a face CNN. Initially, the VJ algorithm is employed to segment the faces in a photo as blocks. In this stage, all face-like blocks should be segmented out by the VJ algorithm, no matter how high the false detection rate is, so that the miss rate for face blocks is very low. These segmented blocks are utilized to train a CNN for face confirmation. The face blocks detected by the VJ algorithm are then refined by the face CNN, enabling the segmented blocks without a human face to be excluded while the actual face blocks are preserved. Accordingly, the accuracy of the detected face blocks is much increased. Finally, the detected face blocks are summed up to obtain the number of humans in the photo. The experimental results reveal that the proposed method can effectively calculate the number of people in a photo, with the average recall rate reaching 97% in most conditions. It should be mentioned that this work aims to propose a practical app for quickly calculating the number of people in a photo rather than to develop a new approach for face recognition. The proposed app is practical and can be applied in real environments.
In response to the current expansion of the COVID-19 epidemic, for tour groups, out-of-school teaching activities, etc., the proposed app can help quickly calculate the number of people, avoiding crowd gathering and reducing the risk of group infections. In particular, the help in counting a large number of people is significant in real environments.
The rest of this paper is organized as follows: Section 2 describes the proposed app for automatically calculating the number of people. Section 3 shows the experimental results. Finally, Section 4 concludes this study.
2. Proposed App for Automatically Calculating the Number of People
The proposed app first utilizes the VJ algorithm [5] to segment the face areas as blocks. A face CNN then confirms whether each segmented block contains a human face. Only the human face blocks are retained, while the segmented blocks without a human face are removed. The number of people is obtained by summing up the retained human face blocks.
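The three-stage flow above can be sketched as follows. This is a minimal sketch: `detect_face_like_blocks` and `face_cnn_confirms` are hypothetical placeholders standing in for the VJ detector and the trained face CNN, which are described in the following subsections.

```python
# Minimal sketch of the counting pipeline. The two callables are
# hypothetical placeholders: in practice, detect_face_like_blocks would
# wrap the VJ detector and face_cnn_confirms the trained face CNN.

def count_people(photo, detect_face_like_blocks, face_cnn_confirms):
    """Return the number of people in a photo."""
    # Stage 1: the VJ algorithm segments face-like blocks (loose threshold).
    candidates = detect_face_like_blocks(photo)
    # Stage 2: the face CNN keeps only blocks that contain a human face.
    faces = [b for b in candidates if face_cnn_confirms(photo, b)]
    # Stage 3: the number of retained face blocks is the people count.
    return len(faces)
```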
The graphical user interface (GUI) of the proposed app for calculating the number of people is shown in Figure 1. First, the user presses the “Photo” button to take a snapshot from a webcam. If the user confirms that the photo includes all members, the “Calculate” button is pressed to perform face detection. The detected faces are then marked with rectangular blocks, and the number of people in the photo is displayed.
2.1. Human Face Detection
The block diagram of the proposed human face detection method is shown in Figure 2. Initially, we utilize the VJ algorithm [5] to segment candidate face blocks. The segmented blocks are classified into two categories, i.e., face blocks and segmented blocks without a human face. Each segmented block is manually labeled as containing a human face or not. The labeled segmented blocks are then fed into a CNN to learn the face features; the trained CNN is referred to as the face CNN.
In the application phase, the face-like areas are detected and marked as face-like blocks using the VJ algorithm. The face-like blocks are fed into the face CNN to determine whether they belong to the face category. The detected blocks without a human face are excluded, while the blocks with a face inside are preserved.
Multiple detection results in various sizes may exist for the same face. These results are called repetition errors. Because the repeated detections are all real face regions, they cannot be removed by the face CNN. Detections of the same face at the same location but with different sizes should therefore be refined to reduce the repetition error: only the smallest block at the same center location is retained, while the others are removed. Finally, the number of detected faces is summed up to obtain the number of humans in the photo.
The VJ algorithm [5] can effectively detect faces and provides flexible tuning parameters. The algorithm can also detect the upper body and includes functions for detecting the eyes, eyebrows, nose, and mouth. It is robust and is used to perform the initial face-range segmentation in this study.
Figure 3 shows the detected face blocks under various parameters, with the minimum and maximum sizes of detectable objects set to 20 × 20 and 85 × 85, respectively. As observed in Figure 3a, many segmented blocks without a human face are classified as face blocks when a loose condition is used in the VJ algorithm [5], i.e., when the merging threshold is set to unity. The merging threshold plays a major role in face detection. If the VJ algorithm's segmentation condition is strict, i.e., the merging threshold is set to three, all of the segmented blocks without a human face can be excluded, as shown in Figure 3c; however, many face blocks are not detected either. After careful tuning, the merging threshold is best set to two. The segmented result is shown in Figure 3b: only two face regions are not correctly detected, and only one segmented block without a human face is falsely detected.
Figure 3 thus illustrates the face detection results of the VJ algorithm [5]. As shown in Figure 3a, all human faces can be detected if the VJ algorithm's merging threshold is set to a low value (merging threshold = 1). However, many segmented blocks without a human face are also marked as face blocks, increasing the false alarm rate. The merging threshold should be increased to reduce the false alarm rate; the corresponding results are shown in Figure 3b,c. Most segmented blocks without a human face disappear, but some face areas are no longer detected. Accordingly, the merging threshold must be carefully selected to obtain an acceptable result. Here, we instead employ a face CNN to exclude the segmented blocks without a human face, enabling the VJ algorithm's false alarm rate to be significantly reduced. Thanks to the face CNN, the merging threshold of the VJ algorithm only needs a loose setting, i.e., it can be fixed to unity, enabling all face areas to be successfully detected.
2.2. Face CNN Training and Confirmation
As shown in Figure 3, most face blocks can be captured using a loose segmentation parameter in the VJ algorithm, i.e., a merging threshold of unity. Although many segmented blocks without a human face are then classified as face blocks, they can be effectively excluded by the face CNN. Thus, the detection accuracy for human faces is significantly improved.
The face-like blocks captured by the VJ algorithm are fed into a CNN to train the face CNN. The block size is 75 × 75. Eight 2D convolutional filters of size 3 × 3 are employed to capture facial features. Batch normalization is performed to ensure training accuracy. Max pooling is then performed with a 2 × 2 window and a stride of two, so the row and column resolutions are down-sampled by a factor of two. The convolutional layer, batch normalization, and max pooling layers are repeated five times to capture facial features effectively, improving the accuracy of face detection. Finally, the 2D features are flattened and fed into a fully connected layer with two outputs, which correspond to the human face and non-human categories, respectively.
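Assuming the 3 × 3 convolutions use 'same' padding (an assumption on our part; the paper does not state it), only the 2 × 2 max-pooling layers shrink the spatial size, and the feature-map side length through the five stages can be traced as follows:

```python
# Trace the spatial size of the feature maps through five
# conv (3x3, 'same' padding) + max-pool (2x2, stride 2) stages.
# The 'same'-padding assumption is ours; the paper does not state it.
size = 75                 # input block is 75 x 75
sizes = []
for stage in range(5):
    size = size // 2      # 2x2 max pooling, stride 2: floor division by two
    sizes.append(size)
# sizes traces 75 -> 37 -> 18 -> 9 -> 4 -> 2
```

So the flattened input to the fully connected layer would be a 2 × 2 feature map per channel under this assumption.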
A softmax function is used as the activation function at the output of the fully connected layer. Finally, the output with the larger score, corresponding to either a human face or a non-human block, is selected as the face CNN's decision. The recognized result of the fully connected layer, $i^*$, can be expressed by

$$i^* = \arg\max_{i \in \{0,1\}} p_i, \qquad (1)$$

where $p_1$ and $p_0$ represent the probabilities of the human face and the non-human block, respectively.

In (1), only the segmented blocks with the larger value of $p_1$ are preserved, i.e., $i^* = 1$. These blocks correspond to faces confirmed by the face CNN. The other segmented blocks, which are recognized as $i^* = 0$ by the face CNN, are removed, i.e., the VJ algorithm's detected results corresponding to segmented blocks without a human face are discarded. Therefore, the face detection accuracy of the VJ algorithm is significantly improved by the face CNN. The detailed structure of the face CNN is shown in Table 1.
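The decision rule in (1) amounts to a softmax over the two output scores followed by an argmax. A minimal sketch (the function name and logit arguments are ours, for illustration):

```python
import math

def face_cnn_decision(score_nonface, score_face):
    """Apply softmax to the two output scores and return (i*, p1)."""
    exp0 = math.exp(score_nonface)
    exp1 = math.exp(score_face)
    p0 = exp0 / (exp0 + exp1)   # probability of a non-human block
    p1 = exp1 / (exp0 + exp1)   # probability of a human face
    i_star = 1 if p1 > p0 else 0  # argmax over the two classes, Eq. (1)
    return i_star, p1
```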
The face CNN can effectively extract face features and correctly classify the face and non-face categories. Figure 4 shows the relationship between the accuracy rate and the number of convolutional layers in the face CNN. As the number of convolutional layers increases, the accuracy rate increases. If the number of convolutional layers reaches five or more, the accuracy rate can reach 100%. Accordingly, the number of convolutional layers is set to five in the experiments.
2.3. Repetition Removal
The face CNN can effectively exclude the segmented blocks without a human face. However, it cannot cope with repetition errors: the repeatedly detected face blocks are real human faces corresponding to identical faces, so they cannot be removed by the face CNN. Two examples are shown in Figure 5a,c. Some human faces suffer from repeated segmentation; that is, the same face is marked twice at neighboring locations. To remove the repeated segmented blocks of various sizes for the same face, we preserve only the block with the smallest size, while the others are removed.
The coordinate position of each detected block's center point is calculated first. Then, the distance between the center points of each pair of neighboring blocks is calculated by

$$d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \qquad (2)$$

where $(x_i, y_i)$ represents the center coordinate of the segmented block $B_i$, and $i$ and $j$ denote the indices of the detected blocks.
If the center points of several detected blocks are too close, these blocks correspond to the same face. Only the block with the smallest size is preserved, while the others are removed, given as

$$R(B_i) \ \text{if} \ d_{i,j} < T_d \ \text{and} \ A_i > A_j, \qquad (3)$$

where $T_d$ ($T_d = 5$) represents the distance threshold for repeated blocks of a face in an analysis photo, $A_i$ denotes the area of the segmented block $B_i$, and $R(B_i)$ represents the removal of the segmented block $B_i$, i.e., the block is regarded as a repeated block and is removed.
In (3), among blocks whose center points are close, only the segmented block with the smallest area is retained, while the other segmented blocks in the neighborhood are removed. This enables the repeated segmented blocks for the same face to be removed. Two examples of repetition removal for face detection are shown in Figure 5. The detection results using the VJ algorithm and the face CNN that suffer from repetition errors for some faces are shown in Figure 5a,c. The refinement using (3) effectively removes each detected face's repeated blocks, enabling the detection accuracy to be much improved.
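The refinement in (3) can be sketched as follows. The block representation is our assumption: each block is `(x, y, w, h)` with center `(x + w/2, y + h/2)`, and a block is removed when a neighbor closer than the distance threshold has a smaller area, so the smallest block per face survives.

```python
import math

def remove_repetitions(blocks, t_d=5.0):
    """Among blocks whose centers are closer than t_d, keep only the
    one with the smallest area (Eq. (3)). Blocks are (x, y, w, h)."""
    def center(b):
        x, y, w, h = b
        return (x + w / 2.0, y + h / 2.0)

    def area(b):
        return b[2] * b[3]

    kept = []
    for i, bi in enumerate(blocks):
        removed = False
        for j, bj in enumerate(blocks):
            if i == j:
                continue
            ci, cj = center(bi), center(bj)
            d = math.hypot(ci[0] - cj[0], ci[1] - cj[1])  # Eq. (2)
            # Remove bi if a close neighbor has a smaller area.
            if d < t_d and area(bi) > area(bj):
                removed = True
                break
        if not removed:
            kept.append(bi)
    return kept
```

For example, two concentric blocks of different sizes collapse to the smaller one, while distant blocks are untouched.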
4. Conclusions
This paper proposes a useful app for calculating the number of people in a photo. The app applies a face CNN together with the VJ algorithm to segment face regions in an image. First, the VJ algorithm with a loose condition is utilized to detect as many face-like blocks as possible, even though the false detection rate is high. In turn, the face blocks and non-face blocks are utilized to train a face CNN, which determines whether a detected block is an actual face. The trained face CNN removes the segmented blocks without a human face. The number of remaining face blocks is the number of people in the photo. Experimental results show that the proposed app obtains an average accuracy rate, recall rate, and F-measure reaching 95%, 97%, and 96%, respectively. The proposed app is efficient for calculating the number of humans in a photo and can thus be used in practical work; for example, a tour guide can calculate how many tour members have gathered at a viewpoint. The limitation of this app is that the human face should face the camera, although slight angle deviations are also tolerated. In the future, we will collect photos in which the human faces do not face the camera as training data, enabling the proposed app to accurately count human faces not facing the camera.