Article

Small-Scale Face Detection Based on Improved R-FCN

1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2020, 10(12), 4177; https://doi.org/10.3390/app10124177
Submission received: 10 May 2020 / Revised: 12 June 2020 / Accepted: 15 June 2020 / Published: 18 June 2020

Abstract
Face detection is an important basic technique for face-related applications, such as face analysis, recognition, and reconstruction. Images in unconstrained scenes may contain many small-scale faces. The features that a detector can extract from small-scale faces are limited, which causes missed detections and greatly reduces the precision of face detection. Therefore, this study proposes a novel method to detect small-scale faces based on the region-based fully convolutional network (R-FCN). First, we propose a novel R-FCN framework with the ability of feature fusion and receptive field adaptation. Second, a bottom-up feature fusion branch is established to enrich the local information of high-layer features. Third, a receptive field adaptation block (RFAB) is proposed so that the receptive field can be adaptively selected to strengthen the expression ability of features. Finally, we improve the anchor setting method and adopt soft non-maximum suppression (SoftNMS) to select candidate boxes. Experimental results show that the average precision of R-FCN with the feature fusion branch and RFAB (RFAB-f-R-FCN) for small-scale face detection improves on that of R-FCN by 0.8%, 2.9%, and 11% on the three subsets of Wider Face.

1. Introduction

Many achievements have been made in small-scale face detection, and they fall into two directions. The first is to preprocess the input image. For example, in [1], a scale proposal network was designed to estimate the scale of a face; the input image was then resized and sent to the network. In [2], a generative adversarial network was used to reconstruct small-scale faces. Both methods increase the computational cost. The second is to improve the ability to detect small-scale faces from the network itself, for example, by enhancing the expression ability of features. Li J et al. [3] adopted a dual shot face detector that merges a feature pyramid network (FPN) [4] and receptive field block (RFB) [5] to enhance the shared feature map. Although the dual-channel detection method improves detection precision, its detection speed needs improvement. Chi C et al. [6] proposed a receptive field enhancement module (RFEM) to generate receptive fields of various shapes and capture faces with extreme poses, but the improvement from RFEM is limited. Considering that image preprocessing is time consuming, we improve the precision of face detection from the network itself.
In recent years, face-detection algorithms based on the faster region-based convolutional neural network (faster R-CNN) [7] have made great progress. However, because faster R-CNN uses fully connected layers for classification and detection, it destroys the spatial structure of features and makes the network insensitive to the position of faces. Dai et al. [8] therefore proposed region-based fully convolutional networks (R-FCN), which adopt a fully convolutional network (FCN) to encode the position information of the target in the region of interest (RoI), addressing the tension between the classification invariance and position sensitivity of faster R-CNN. R-FCN performs well on the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) benchmark, and its detection speed is higher than that of faster R-CNN. Therefore, we adopt R-FCN to detect small-scale faces. The framework of R-FCN is shown in Figure 1a. We propose a novel framework based on R-FCN, shown in Figure 1b. We redesign the feature extraction network, which includes the original feature extraction branch of R-FCN and a new feature fusion branch. The new branch fully fuses the local information of features extracted from the low and middle layers of the original branch with the semantic information of features extracted from its top layer. The improved R-FCN is called R-FCN with a feature fusion branch (f-R-FCN). In addition, we add a receptive field adaptation block (RFAB) to f-R-FCN to enhance the discrimination of small-scale face features. The improved f-R-FCN is called R-FCN with a feature fusion branch and RFAB (RFAB-f-R-FCN). In Figure 1, the region proposal network (RPN) is a fully convolutional neural network (CNN) that generates k boxes of different scales and aspect ratios, called anchors, centered on each pixel of the feature map. After the anchors are classified and regressed, the corresponding region in the original image is called a proposal.
The main contributions of this study are as follows:
(1) We propose a novel R-FCN framework, that is, we add a feature fusion branch and RFAB to the original R-FCN.
(2) A new feature fusion method is proposed to alleviate the low detection rate of the original R-FCN.
(3) RFAB is proposed to enhance the expression ability of small-scale face features.
(4) We improve the anchor setting method and adopt soft non-maximum suppression (SoftNMS) to select candidate boxes.

2. Related Works

According to the CNN-based face detection process, face detection based on deep learning can be divided into three categories: cascaded CNNs, two-stage algorithms, and single-stage algorithms. The cascade-based method trains multiple cascaded CNNs for face detection, such as multi-task cascade convolutional networks (MTCNN) [9] and inside contextual CNN [10]. MTCNN and inside contextual CNN cascade three CNNs: the first CNN quickly generates candidate boxes, the second CNN refines them, and the third CNN examines them in detail. The two-stage algorithm divides detection into two steps: the first step generates candidate boxes for the input image, and the second step determines whether the candidate boxes contain faces. For example, Jiang H et al. [11] improved faster R-CNN by incorporating the specific attributes of face detection; center loss [12] and online hard example mining (OHEM) [13] were adopted, but the scale of faces was ignored. In [14], face R-FCN was proposed, and position-sensitive average pooling was designed to generate hidden features that enhance the discriminability of the representation. Zhang C et al. [15] adopted a deformable layer based on light-head R-CNN [16] to reduce the number of channels for face detection. The single-stage face detection algorithm directly classifies and regresses on the image, which greatly improves detection speed. For example, Wang Y et al. [17] proposed real-time face detection based on YOLOv3, in which intersection over union (IoU) is used to cluster the sizes of the initial candidate boxes, maintaining accuracy and speed in complex environments. Zhang S et al. [18] proposed the single shot scale-invariant face detector (S3FD) to address the sharp performance drop of anchor-based detectors as the target becomes smaller. Wang J et al. [19] provided different anchor attention mechanisms for different layers to highlight the face area and address occlusion.
Many face detection applications involve low-constraint scenarios; for example, police need to track suspects in surveillance video, and an attendance system may need to count people by their faces in dense crowds. Faces in low-constraint scenarios are often very small, sometimes below 10 × 10 pixels. Thus, small-scale face detection has become a major research field in recent years, and many methods have been proposed for it. In 2017, Hu P et al. [20] proposed Tiny Face, which integrated multilayer features and effectively used context information to detect faces on down-sampled images, but the image pyramid reduced detection speed. Bai Y et al. [21] proposed a multi-scale FCN and trained separate FCNs on feature layers of different scales, each responsible for detecting faces of the corresponding scale; however, training a detector for each scale is time consuming. Zhu C et al. [22] proposed an expected max overlapping (EMO) score to explain the low intersection ratio between anchors and faces and improved performance by reducing the anchor stride. Zhang F et al. [23] adopted a selective refinement network (SRN) to improve detection recall, used IoU loss [24] to make the regressed positions accurate, and utilized the max-out label to reduce simple negative samples.
Current detection algorithms for small-scale faces are still at an early stage. On the one hand, effectively extracting or enhancing the features of small-scale faces remains an open problem. On the other hand, different faces in the same image often have distinct scales, and an excellent face detector must be scale friendly. Thus, a detector must consider the various scales of faces when detecting small-scale faces.

3. Selection Method of Candidate Boxes and Anchor Setting of Region-Based Fully Convolutional Networks (R-FCN)

In this section, we change the non-maximum suppression (NMS) of R-FCN to SoftNMS and reset the anchor.
R-FCN is a universal object detector, which differs from a face detector in several ways. First, general objects have large scales and varied shapes, while most faces are roughly rectangular with a nearly fixed aspect ratio. Second, face detection is affected by various interference factors, such as illumination, expression, and occlusion, whereas these factors only slightly affect the detection of general objects. Therefore, R-FCN should be modified to make the framework suitable for small-scale face detection.
The original R-FCN uses NMS to filter candidate boxes, but the threshold setting in NMS affects detection precision. If the threshold is too low, true positive samples will be suppressed; if it is too high, false positive samples will increase. In low-constraint environments, face occlusion is common, and NMS causes missed detections. In this study, SoftNMS [25] is used to improve face detection under occlusion. SoftNMS is defined as follows:
$$s_i = \begin{cases} s_i \left( 1 - \mathrm{IoU}(M, t_i) \right), & \mathrm{IoU}(M, t_i) \ge N_t \\ s_i, & \mathrm{IoU}(M, t_i) < N_t \end{cases} \tag{1}$$
s_i denotes the score of the i-th candidate box; M and t_i are the coordinates of the candidate box with the highest score and of the i-th candidate box, respectively; IoU(·) is the ratio of the intersection of candidate box i with M to their union; and N_t is a preset threshold.
Formula (1) shows that SoftNMS attenuates the score of a candidate box when its IoU with M is greater than the threshold N_t. Candidate boxes far from M are therefore unaffected, while boxes that heavily overlap M are penalized. In contrast, NMS directly sets the score of a candidate box to 0 when its IoU with M exceeds N_t. Compared with NMS, SoftNMS retains candidate boxes that NMS would easily delete by mistake.
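To make the difference concrete, the following is a minimal sketch of linear SoftNMS as in Formula (1); the box format, the threshold values, and the small score cutoff used to terminate the loop are illustrative assumptions rather than the exact values of our implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms_linear(boxes, scores, nt=0.3, score_floor=0.001):
    """Linear SoftNMS per Formula (1): decay the scores of boxes whose IoU
    with the current top-scoring box M exceeds nt, instead of removing them."""
    scores = scores.copy()
    keep, idxs = [], np.arange(len(scores))
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(top)
        idxs = idxs[idxs != top]
        overlaps = iou(boxes[top], boxes[idxs])
        decay = np.where(overlaps >= nt, 1.0 - overlaps, 1.0)
        scores[idxs] *= decay
        idxs = idxs[scores[idxs] > score_floor]  # drop boxes whose score decayed to near zero
    return keep
```

Unlike hard NMS, no box is removed outright by an overlap test alone; a heavily overlapping box survives if its decayed score remains competitive, which is exactly what preserves occluded faces.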
In the original R-FCN, the base size of the anchor is 16, following general object detection, and the anchor scales are set to (8, 16, 32) according to the distribution of scales on the PASCAL VOC dataset. The aspect ratios are (0.5, 1, 2); thus, each pixel corresponds to 9 anchors. However, face detection involves many small-scale faces, so we modify the anchors according to the specific properties of faces to prevent small-scale faces from being missed because of improper anchor settings. Considering that most faces are rectangles whose height is greater than or equal to their width, we set the anchor aspect ratios to (1, 1.3, 1.5). Wang J et al. [19] reported that 80% of the face scales in the Wider Face training set are between 16 and 406 pixels. Accordingly, we set the anchor base size to 8 and the scales to (1, 2, 4, 8, 16, 32, 64). The modified R-FCN has 21 anchors per pixel, which cover most of the face samples in Wider Face.
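As an illustration, the sketch below enumerates the 21 per-pixel anchors implied by this setting; treating the aspect ratio as height/width and preserving anchor area across ratios are assumptions about the parameterization, not details stated above.

```python
import numpy as np

def generate_anchors(base_size=8, scales=(1, 2, 4, 8, 16, 32, 64),
                     ratios=(1.0, 1.3, 1.5)):
    """Enumerate anchors centered at the origin as (x1, y1, x2, y2).
    Each scale gives a square of side base_size * scale, reshaped so that
    height / width = ratio while the area is preserved."""
    anchors = []
    for s in scales:
        side = base_size * s
        for r in ratios:
            w = side / np.sqrt(r)   # shrink the width ...
            h = side * np.sqrt(r)   # ... and grow the height to reach ratio r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

print(generate_anchors().shape)     # (21, 4): 7 scales x 3 aspect ratios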

4. Feature Fusion Branch

When R-FCN is used for face detection, it is ineffective on small-scale faces. The main reason is that the position-sensitive score map used for position-sensitive pooling is convolved from the highest-level feature map of the backbone network, whose stride relative to the original image is 16. When an input face is smaller than 16 × 16, only 1 pixel is available for detection, which makes classification and regression difficult. Second, as the number of convolution layers increases, more semantic information is integrated into the feature map, while the proportion of local information from small-scale faces decreases. In a CNN, lower-level features carry more local information, while higher-level features carry rich semantic information. Therefore, we establish a feature fusion branch that uses the context information of the CNN to improve the robustness of small-scale face detection.
FPN constructs a top-down feature pyramid that contains information at various scales, but it still has deficiencies. First, FPN recursively adds high-level features to low-level features, and too much semantic information may damage the details of low-level features, leading to insufficient detection accuracy for small targets. Second, FPN manually assigns anchors of different scales to distinct layers, which may assign targets to feature layers that are not conducive to detection. Our feature fusion branch instead adopts a bottom-up fusion method, integrating low-level information into high-level features. The scale of each layer's features in the branch is the same as that of the corresponding features in the backbone, as shown in Figure 2.
As in FPN, C_i represents the last feature map of each residual group of ResNet, with i ∈ [2, 5]. F_i is the feature obtained by fusing C_i and F_{i−1}. To start the iteration, C_2 is directly taken as the fused feature F_2. Features from lower layers have a large scale and a small number of channels; thus, a 3 × 3 convolution is used to reduce the feature scale so that elements can be added to the corresponding upper-layer features. The feature fusion operation can be expressed by Formula (2):
$$F_i = C_i + L(\mathrm{Conv}(F_{i-1})) \tag{2}$$
Conv represents the convolution operation, and L denotes L2 normalization. When i = 2, F_i = C_i.
Feature maps from different layers have distinct properties in terms of the number of channels, the scale of values, and the norm of feature-map pixels. The norm of shallow-layer features is generally large, while the norm of deep-layer features is usually small. If two features are combined by simple element-wise addition, the shallow-layer features will dominate the deep-layer features. Therefore, the L2 normalization proposed by ParseNet [26] is introduced to normalize the feature pixels before fusion. The formula is as follows:
$$\hat{X} = \frac{X}{\lVert X \rVert_2} \tag{3}$$
$$\lVert X \rVert_2 = \left( \sum_{i=1}^{d} \lvert x_i \rvert^2 \right)^{1/2} \tag{4}$$
X = (x_1, x_2, …, x_d) is the feature before normalization, d is the number of channels, X̂ = (x̂_1, x̂_2, …, x̂_d) is the normalized feature, and |x_i| is the absolute value of x_i. Normalization changes the values of the feature and increases the difficulty of training. Therefore, the normalized pixel values should be rescaled along the channel dimension using Formula (5):
$$y_i = \gamma_i \hat{x}_i \tag{5}$$
y_i is the value of the feature after scaling, and γ_i is a scale factor learned during training.
In summary, we add a feature fusion branch to the feature extraction network of R-FCN and perform face detection on F5. The structure of the feature fusion branch in f-R-FCN is shown in Figure 3.
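The following is a minimal PyTorch sketch of one fusion step, combining Formulas (2)-(5); the stride-2 downsampling, channel counts, and the initial value of the learnable scale γ are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """ParseNet-style L2 normalization with a learnable per-channel scale
    (Formulas (3)-(5))."""
    def __init__(self, channels, init_scale=10.0):
        super().__init__()
        self.gamma = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):
        x = F.normalize(x, p=2, dim=1)            # x_hat = x / ||x||_2, per pixel
        return x * self.gamma.view(1, -1, 1, 1)   # y_i = gamma_i * x_hat_i

class FusionStep(nn.Module):
    """One bottom-up fusion step, F_i = C_i + L2Norm(Conv(F_{i-1})), per
    Formula (2). The 3x3 stride-2 convolution halves the spatial scale of
    F_{i-1} and matches the channel count of C_i before the element-wise add."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)
        self.norm = L2Norm(out_channels)

    def forward(self, f_prev, c_i):
        return c_i + self.norm(self.conv(f_prev))
```

With F_2 = C_2 to start, chaining such steps over C_3, C_4 (and C_5 in Figure 3) yields the fused map on which detection is performed.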

5. Receptive Field Adaptation Block (RFAB)

A traditional CNN uses fixed-size convolution kernels for feature extraction, so the receptive field of each layer of neurons is fixed, which may damage the discrimination of features. In the human visual cortex, the size of the receptive field is affected by many factors. For example, the size of the population receptive field (pRF) is a function of the eccentricity of retinal imaging: as the eccentricity increases, the receptive field also increases [27]. Intuitively, the brain obtains more information from the area closer to the visual center and less from areas farther away. RFB was proposed based on [27]; it highlights the importance of the sampling-center area, improves the insensitivity of CNNs to small spatial changes, and achieves good results in object detection.
However, in the human visual system, the size of the receptive field is related not only to the eccentricity of retinal imaging but also to the stimulation of the visual nerves: when the stimulation of a given neuron differs, the corresponding receptive field size is not fixed. RFB uses an inception structure and dilated convolutions to simulate the receptive-field mechanism of human retinal imaging, and linear superposition to integrate features with different receptive fields in the spatial dimension. Although dilated convolution increases the weight of features closer to the sampling center and thereby enhances feature discrimination, the effect of neuronal stimulation on the receptive field is ignored. Therefore, we add an RFAB to f-R-FCN to enhance small-scale face features by combining RFB and the selective kernel (SK) [28] module. In this way, the influences of both eccentricity and neuronal stimulation on the receptive field are considered. The structure of RFAB is shown in Figure 4.
The former part of RFAB is the same as that of RFB. The input is divided into three branches by convolutions of different sizes, so the receptive field of each branch differs; two 3 × 3 convolutions replace the 5 × 5 convolution to reduce computation, and "rate" in Figure 4 denotes the dilation rate of the dilated convolutions. The features of the three branches are then fused; after global average pooling, a 1 × 1 convolution, and a softmax operation, a probability value is obtained for each channel, and the network selects the receptive field size for each channel through an element-wise product. RFAB also uses a shortcut to retain the original features, which are added to the processed features to obtain the final output. RFAB is applied to the last layer of the feature fusion branch to improve the discrimination of the shared feature map.
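A compact PyTorch sketch of such a block is given below; the branch widths, dilation rates, and the reduction factor in the selection path are illustrative assumptions, since Figure 4 fixes the topology but we do not restate every layer size here.

```python
import torch
import torch.nn as nn

class RFAB(nn.Module):
    """Sketch of a receptive field adaptation block: three RFB-style
    branches with growing dilation rates, fused by an SK-style softmax
    selection over branches for every channel, plus a shortcut."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        # Branch 1: 1x1 conv, then 3x3 conv with dilation rate 1.
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, channels, 3, padding=1, dilation=1))
        # Branch 2: 1x1 and 3x3 convs, then 3x3 conv with dilation rate 3.
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, channels, 3, padding=3, dilation=3))
        # Branch 3: two 3x3 convs stand in for a 5x5, then dilation rate 5.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, channels, 3, padding=5, dilation=5))
        # Selection path: global average pooling -> 1x1 conv -> softmax
        # over the three branches, independently for each channel.
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(channels, 3 * channels, 1)

    def forward(self, x):
        feats = torch.stack(
            [self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        fused = feats.sum(dim=1)                      # fuse the branches
        logits = self.fc(self.gap(fused))             # (N, 3C, 1, 1)
        n = logits.shape[0]
        weights = torch.softmax(logits.view(n, 3, -1, 1, 1), dim=1)
        out = (weights * feats).sum(dim=1)            # per-channel selection
        return x + out                                # shortcut
```

Because of the softmax, each channel's output is a convex combination of the three branch responses, so training decides per channel which effective receptive field dominates.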
The structure of RFAB-f-R-FCN is shown in Figure 5. Compared with Figure 3, C5 is removed, and the feature fusion branch is constructed from C2, C3, and C4, because we find that the average precision (AP) of detection on F4 is higher than that on F5; see Section 6.2.1 for details.

6. Experiments

The datasets are Wider Face [29] and the face detection dataset and benchmark (FDDB) [30]. Wider Face has 32,203 images with 393,703 faces, and small-scale faces account for a large proportion of them. FDDB contains 2845 images and 5171 faces and is usually used to evaluate model performance. We train the model on the Wider Face training set and test on its validation set. We also evaluate our method on FDDB.
The backbone network used in the experiments is ResNet50 pretrained on ImageNet, and during training, stochastic gradient descent (SGD) is used to update the parameters. We set the training hyperparameters according to [8]: the weight decay is 0.0005, the momentum is 0.9, and the initial learning rate is 0.001. The shortest side of the input image is 600, and the longest side is 1000. The network is trained for 80,000 iterations, and the learning rate is reduced to 0.0001 after 60,000 iterations. We adopt the multitask loss function of R-FCN, which is shown in Formula (6):
$$L(p, u, t^u, v) = \frac{1}{N_{cls}} \sum_i L_{cls}(p, u) + \lambda \frac{1}{N_{loc}} \mathbb{1}(u > 0)\, L_{loc}(t^u, v) \tag{6}$$
$$L_{cls}(p, u) = -\left[ u \log(p) + (1 - u) \log(1 - p) \right] \tag{7}$$
$$L_{loc}(t^u, v) = \begin{cases} 0.5\,(v - t^u)^2, & \text{if } \lvert v - t^u \rvert < 1 \\ \lvert v - t^u \rvert - 0.5, & \text{otherwise} \end{cases} \tag{8}$$
N_cls is the number of classified samples, and N_loc is the total number of candidate boxes. L_cls and L_loc denote the classification loss and the smooth L1 regression loss, respectively. p is the classification score, and u is the true label, u ∈ {0, 1}: u = 0 means background, and u = 1 means the sample is a face. t^u denotes the coordinates of the prediction box, and v denotes the true coordinates of the face. λ is a balancing factor, usually set to 1.
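As a sketch, Formulas (6)-(8) can be written as follows in PyTorch; folding the 1/N_cls and 1/N_loc normalizations into mean reductions is a simplifying assumption made for brevity.

```python
import torch
import torch.nn.functional as F

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Multitask loss of Formulas (6)-(8): binary cross-entropy over
    face/background scores plus smooth L1 on box coordinates, with the
    regression term applied only to positive samples (u = 1)."""
    l_cls = F.binary_cross_entropy(p, u.float())       # Formula (7), averaged
    pos = u > 0
    if pos.any():
        l_loc = F.smooth_l1_loss(t_u[pos], v[pos])     # Formula (8), averaged
    else:
        l_loc = p.new_zeros(())                        # no positives in this batch
    return l_cls + lam * l_loc                         # Formula (6)
```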

6.1. Results of Modified R-FCN

6.1.1. Comparison between Non-Maximum Suppression (NMS) and Soft Non-Maximum Suppression (SoftNMS)

Table 1 shows AP on the three validation subsets of Wider Face when R-FCN adopts NMS and SoftNMS. After changing NMS to SoftNMS, AP increases by 1%, 3.3%, and 4.1% on the three subsets, respectively, which verifies the effectiveness of SoftNMS. Notably, the improvement on the Easy subset is small: SoftNMS targets missed detections under occlusion, while the Easy subset is less difficult and occlusion is relatively rare there, so SoftNMS performs about the same as NMS. More occlusions and small-scale faces appear on the Medium and Hard subsets; thus, the improvement from SoftNMS is more obvious.

6.1.2. Comparison of Different Anchor Settings

Table 2 shows AP on the three subsets of Wider Face with different anchor settings. As shown in the table, the anchor scale setting largely influences AP. When the scale range of the anchors is sufficiently large, AP on the Hard subset improves significantly; however, as AP on the Hard subset increases, AP on the Easy subset decreases slightly. To balance the loss of AP on the Easy subset against the gain on the Hard subset, we finally set the anchors according to the parameters in the last row of Table 2, and this result is taken as the benchmark for subsequent experiments. Compared with R-FCN, AP decreases by 2.5% on the Easy subset and 3.1% on the Medium subset, but increases by 11.8% on the Hard subset.

6.2. Results of Feature Fusion Branch

6.2.1. Comparison of Different Fusion Feature Layers

We experimentally compare prediction on different fusion layers. The first scheme attaches the RPN at the F4 layer, constructs the position-sensitive score map at F5, and performs classification and regression there. The second scheme discards the fifth residual group of ResNet, uses only C2, C3, and C4 to construct the fusion branch, attaches the RPN at the final fusion layer F4, and constructs the position-sensitive score maps for detection, which means that in the second scheme the RPN shares F4 with the detection network. The comparison between the two schemes and R-FCN is shown in Table 3.
Table 3 shows that whether f-R-FCN predicts on F4 or F5, AP on the three subsets of Wider Face improves to different degrees compared with R-FCN, and AP for small-scale face detection on the F4 layer is higher than on the F5 layer. The reason is that the RPN and the classification network share the fusion feature map on F4, as shown in Figure 3 and Figure 5. Compared with R-FCN, AP improves by 0.9% on the Easy subset, 1.8% on the Medium subset, and 7.5% on the Hard subset.

6.2.2. Comparison with R-FCN

Figure 6 shows the precision–recall (PR) curves of f-R-FCN on Wider Face; f-R-FCN is marked as fusion R-FCN in the figure. At the same recall, the precision of f-R-FCN on all three subsets is higher than that of R-FCN, and its PR curve lies closer to the upper-right corner and is steeper, which verifies that f-R-FCN performs better than R-FCN.
Figure 7 intuitively shows the detection results of R-FCN and f-R-FCN on small-scale faces in the Wider Face dataset. The number of faces detected by f-R-FCN in the same picture is significantly higher than that detected by R-FCN, which verifies that f-R-FCN is more effective for small-scale face detection.

6.3. Effects of RFAB

In the experiment, multi-scale training is adopted, and the input image is rescaled to {600, 1200}. In the training stage, the original R-FCN selected 6000 anchors and kept 300 after NMS; in our experiments, because the number of anchors increased after resetting, the 10,000 highest-scoring anchors are selected and 1000 are kept after SoftNMS. In the test stage, the 2000 highest-scoring anchors are selected, and 600 RoIs are retained after SoftNMS. In addition, the OHEM strategy is adopted during training, and considering the characteristics of small-scale faces, the size of position-sensitive pooling is changed from 7 × 7 to 5 × 5. PR curves of the various methods are compared on Wider Face to evaluate the proposed method comprehensively, and discrete receiver operating characteristic (discROC) and continuous receiver operating characteristic (contROC) curves of our method and other methods are drawn on FDDB.

6.3.1. Comparison with and without RFAB

Table 4 shows AP on the three subsets of Wider Face for R-FCN, f-R-FCN, and RFAB-f-R-FCN. Under the same conditions, AP of RFAB-f-R-FCN is 0.1% lower than that of f-R-FCN on the Easy subset, 1.1% higher on the Medium subset, and 3.5% higher on the Hard subset; it is 0.8%, 2.9%, and 11% higher than that of R-FCN. The reason is that RFAB enhances the discrimination of the F4 layer, which shows its effectiveness. In Table 4, F4+RFAB denotes R-FCN with both the feature fusion branch and the RFAB module, using fusion layer F4 for detection.
Figure 8 shows the results of the proposed method and R-FCN on the Wider Face dataset. The green boxes are the detections of R-FCN, and the red boxes are those of the proposed method. R-FCN has a high rate of missed detections on dense small-scale faces, whereas the proposed method detects them effectively, which shows its effectiveness for small-scale face detection.

6.3.2. Comparison of AP between the Method Proposed and the Classical Methods

Table 5 shows AP of the proposed method and several typical methods on the Wider Face validation set; AP values for the typical methods are taken from the official website of the Wider Face dataset. The table shows that AP of the proposed method is higher than that of the typical methods. Although the Hard subset of Wider Face contains many small-scale faces, the proposed method still outperforms the comparison methods on it. Therefore, the proposed method is more robust to small-scale faces than the other methods.

6.3.3. Comparison of Precision–Recall (PR) Curves between the Proposed Method and the Classical Methods

Figure 9 shows the PR curves of the proposed method and other classical methods on Wider Face. As shown in the figure, the PR curve of the proposed method lies closer to the upper-right corner than those of the other methods. At the same recall, the precision of the proposed method is the highest, which shows that it is superior to the classical comparison methods. The reason is that the feature fusion branch and RFAB improve feature discrimination.

6.3.4. Comparison of the Proposed Method and Classical Methods on Face Detection Dataset and Benchmark (FDDB)

Figure 10 shows the receiver operating characteristic (ROC) curves of the proposed method on the FDDB dataset. Two FDDB evaluation criteria are considered: the discROC and contROC curves. For discROC, a prediction box whose IoU with the ground truth is greater than 0.5 is judged a true positive, while contROC is computed by weighting with the IoU. We follow the unrestricted training protocol: the model is first trained on the Wider Face dataset and then tested on FDDB. Because the detector marks faces with rectangles while FDDB annotates them with ellipses, the true positive rate of the proposed method in contROC is lower than in discROC.
Figure 10 shows that, with discROC as the evaluation criterion, the proposed method performs better than multitask cascade CNN [9] and LDCF+ [31] but slightly worse than ScaleFace [32]. This is because the proposed method maps all RoIs to a feature layer of the same depth, while ScaleFace uses different networks to detect RoIs of specific scales. With contROC as the criterion, the proposed method performs better than ScaleFace and LDCF+ but worse than multitask cascade CNN, because multitask cascade CNN uses three cascaded networks to continuously fine-tune the position of the prediction box, so its final predictions are relatively more precise.
In conclusion, the results of different methods vary under distinct evaluation standards. Compared with Wider Face, FDDB has few small-scale face samples. Table 5 and Figures 9 and 10 show that our method has better overall performance for small-scale face detection than the other methods.

6.4. Inference Time

We trained R-FCN, f-R-FCN, and RFAB-f-R-FCN on a single GeForce GTX 1080 graphics processing unit (GPU); their inference times are given in Table 6. Although the detection time increases after adding the feature fusion branch and RFAB, it still meets real-time detection requirements.

7. Conclusions

Face detection in real life, such as detecting small-scale and occluded faces in extreme scenes, is still very challenging, and many small-scale face detection algorithms suffer from either low precision or long detection time. In this study, small-scale face detection based on R-FCN is explored. First, we propose a novel R-FCN framework, adding a feature fusion branch and RFAB to R-FCN to address small-scale face detection. Second, a bottom-up feature fusion method is proposed to enrich the local information of high-layer features. Third, RFAB is proposed, which enables the network to adaptively select the receptive field, enhances the expression ability of face features, and improves the detection rate of small-scale faces. Finally, we improve the anchor setting method and adopt SoftNMS to select candidate boxes. The experimental results show that the proposed method has better overall performance for small-scale face detection than other methods.

Author Contributions

C.T. is the leader of this research. He proposed the basic idea and participated in the discussion. S.C. developed the algorithm and wrote and revised the manuscript. X.Z. conducted the experiments for the revision. He participated in the discussion and development associated with this research. S.R. participated in research discussions and experimental design. H.W. is responsible for data curation. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by a research Grant from Chongqing Science and Technology Commission (Project Code: cstc2016shmszx40005) and the Ministry of Industry and Information Technology 2018 Industrial Internet Innovation and Development Project (Project Code: Z20180898).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hao, Z.; Liu, Y.; Qin, H.; Yan, J.; Li, X. Scale-Aware Face Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 1913–1922.
2. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. Finding Tiny Faces in the Wild with Generative Adversarial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 21–30.
3. Li, J.; Wang, Y.; Wang, C.; Tai, Y.; Qian, J. DSFD: Dual shot face detector. arXiv 2019, arXiv:1810.10220.
4. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
5. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 385–400.
6. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z. Selective Refinement Network for High Performance Face Detection. In Proceedings of the AAAI Conference on Artificial Intelligence 2019, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8231–8238.
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
8. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Proceedings of Advances in Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Volume 29.
9. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
10. Zhang, K.; Zhang, Z.; Wang, H.; Li, Z.; Qiao, Y. Detecting Faces Using Inside Cascaded Contextual CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 3190–3198.
11. Jiang, H.; Learned-Miller, E. Face Detection with the Faster R-CNN. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017; pp. 650–657.
12. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition; Springer International Publishing: Cham, Switzerland, 2016; pp. 499–515.
13. Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 761–769.
14. Wang, Y.; Ji, X.; Zhou, Z.; Wang, H.; Li, Z. Detecting faces using region-based fully convolutional networks. arXiv 2017, arXiv:1709.05256.
15. Zhang, C.; Xu, X.; Tu, D. Face detection using improved faster RCNN. arXiv 2018, arXiv:1802.02142.
16. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y. Light-head R-CNN: In defense of two-stage object detector. arXiv 2017, arXiv:1711.07264.
17. Wang, Y.; Zheng, J. Real-time face detection based on YOLO. In Proceedings of the IEEE International Conference on Knowledge Innovation and Invention (ICKII 2018), Jeju Island, Korea, 23–27 July 2018; pp. 221–224.
18. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X. S3FD: Single Shot Scale-Invariant Face Detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 192–201.
19. Wang, J.; Yuan, Y.; Yu, G. Face attention network: An effective face detector for the occluded faces. arXiv 2017, arXiv:1711.07246.
20. Hu, P.; Ramanan, D. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 1522–1530.
21. Bai, Y.; Ghanem, B. Multi-scale Fully Convolutional Network for Face Detection in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 2078–2087.
22. Zhu, C.; Tao, R.; Luu, K.; Savvides, M. Seeing Small Faces from Robust Anchor's Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5127–5136.
23. Zhang, F.; Fan, X.; Ai, G.; Song, J.; Qin, Y. Accurate face detection for high performance. arXiv 2019, arXiv:1905.01585.
24. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T.S. UnitBox: An advanced object detection network. arXiv 2016, arXiv:1608.01471.
25. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 5562–5570.
26. Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking wider to see better. arXiv 2015, arXiv:1506.04579.
27. Wandell, B.A.; Winawer, J. Computational neuroimaging and population receptive fields. Trends Cogn. Sci. 2015, 19, 349–357.
28. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019.
29. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. WIDER FACE: A Face Detection Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533.
30. Jain, V.; Learned-Miller, E. FDDB: A Benchmark for Face Detection in Unconstrained Settings; UMass Amherst Technical Report; University of Massachusetts: Amherst, MA, USA, 2010.
31. Ohn-Bar, E.; Trivedi, M.M. To Boost or Not to Boost? On the Limits of Boosted Trees for Object Detection. In Proceedings of the 23rd International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016; pp. 3350–3355.
32. Yang, S.; Xiong, Y.; Loy, C.C.; Tang, X. Face detection through scale-friendly deep convolutional networks. arXiv 2017, arXiv:1706.02863.
33. Zhu, C.; Zheng, Y.; Luu, K.; Savvides, M. CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection. In Deep Learning for Biometrics; Springer: Cham, Switzerland, 2017; pp. 57–79.
Figure 1. The framework of region-based fully convolutional networks (R-FCN) and small-scale face detection based on R-FCN.
Figure 2. Architecture of the feature fusion branch.
Figure 3. Structure of the feature fusion branch in R-FCN with a feature fusion branch (f-R-FCN).
Figure 4. Structure of the receptive field adaptation block (RFAB).
Figure 5. Structure of RFAB-f-R-FCN for small-scale face detection.
Figure 6. The precision–recall (PR) curves of f-R-FCN and R-FCN on Wider Face.
Figure 7. Detection results of R-FCN and f-R-FCN. (a) The faces detected by R-FCN. (b) The faces detected by f-R-FCN.
Figure 8. Face detection results of the proposed method. We test R-FCN and the proposed method on the same images; the green boxes are the faces detected by R-FCN, and the red boxes are the faces detected by the proposed method.
Figure 9. PR curves of the proposed method and typical face detection methods.
Figure 10. Comparison of the proposed method and typical face detection methods.
Table 1. Average precision comparison of non-maximum suppression (NMS) and soft non-maximum suppression (SoftNMS).

Method     Easy    Medium   Hard
NMS        0.920   0.865    0.525
SoftNMS    0.930   0.898    0.566
Table 2. Average precision (AP) comparison of different anchor sizes.

Base Size   Scale                      Aspect Ratio    Easy    Medium   Hard
16          (8, 16, 32)                (0.5, 1, 2)     0.930   0.898    0.566
16          (8, 16, 32)                (1, 1.3, 1.5)   0.927   0.898    0.573
16          (2, 4, 6, 16, 32)          (1, 1.3, 1.5)   0.926   0.898    0.641
16          (1, 2, 4, 6, 16, 32)       (1, 1.3, 1.5)   0.927   0.900    0.646
8           (1, 2, 4, 6, 16, 32)       (1, 1.3, 1.5)   0.897   0.866    0.695
8           (1, 2, 4, 6, 16, 32, 64)   (1, 1.3, 1.5)   0.905   0.867    0.684
Table 3. Average precision comparison using different fusion layers.

Method          Easy    Medium   Hard
R-FCN           0.905   0.867    0.684
f-R-FCN (F5)    0.917   0.868    0.728
f-R-FCN (F4)    0.914   0.885    0.759
Table 4. AP comparison of different improvement schemes.

Method                     Easy    Medium   Hard
R-FCN                      0.905   0.867    0.684
f-R-FCN (F4)               0.914   0.885    0.759
RFAB-f-R-FCN (F4+RFAB)     0.913   0.896    0.794
Table 5. AP comparison between the proposed method and typical face detection methods.

Method                                                    Easy    Medium   Hard
Multi-scale Cascade CNN [29]                              0.691   0.664    0.424
Locally decorrelated channel features (LDCF+) [31]        0.790   0.769    0.522
Multitask Cascade CNN [9]                                 0.848   0.825    0.598
ScaleFace [32]                                            0.868   0.867    0.772
Contextual Multi-Scale Region-based CNN (CMS-RCNN) [33]   0.899   0.874    0.624
The proposed method                                       0.913   0.896    0.794
Table 6. Comparison of inference time between different improvement schemes.

Method                     R-FCN   f-R-FCN   RFAB-f-R-FCN
Inference time (ms/img)    69      83        109
