**1. Introduction**

One of the most fundamental yet difficult problems in computer vision and human–computer interaction is face detection, the object of which is to detect and locate all faces within a given image or video clip. Face detection is fundamental in that it serves as the basis for many applications [1] that involve the human face, such as face alignment [2,3], face recognition/authentication [4–7], face tracking and tagging [8], etc. Face detection is a hard problem because, unlike face localization, no assumptions can be made regarding whether any faces are located within an image [9,10]. Moreover, faces vary widely based on gender, age, facial expression, and race, and can dramatically change in appearance depending on such environmental conditions as illumination, pose (out-of-plane rotation), orientation (in-plane rotation), scale, degree of occlusion, and background complexity. Not only must a capable and robust face detection system overcome these difficulties, but for many of today's applications, it must also do so in real time.

These challenges have resulted in a large body of literature reporting different methods for tackling the problem of face detection [11]. Yang et al. [12], who published a survey of face detection algorithms developed in the last century, divided these earlier algorithms into four categories: knowledge-based methods, feature-invariant approaches, template-matching methods, and appearance-based methods, the last of which demonstrated some superiority over the others thanks to the rise in computing power. In general, these methods formulate face detection as a two-class pattern recognition problem that divides a 2D image into subwindows that are then classified as either containing a face or not [13]. Moreover, these approaches take a monocular perspective in the sense that they forgo any additional sensor or contextual information that might be available.

Around the turn of the century, Viola and Jones [14] presented a 2D detection method that has since become a major source of inspiration for many subsequent face detectors. The famous Viola–Jones (VJ) algorithm achieved real-time object detection using three key techniques: an integral image representation for efficient Haar-like feature extraction, a boosting algorithm (AdaBoost) for building an ensemble of weak classifiers, and an attentional cascade structure for fast rejection of negative windows. However, the VJ algorithm has some significant limitations: its cascades are suboptimal, the considerable pool size of the Haar-like features makes training extremely slow, and Haar features have a restricted capacity to represent, for instance, variations in pose, illumination, facial expression, occlusion, makeup, and age [15]. These problems are widespread in unconstrained environments, such as those represented in the Face Detection Dataset and Benchmark (FDDB) [16], where the VJ method fails to detect most faces [17].
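
The integral image idea can be sketched as follows (the function names here are illustrative, not part of any VJ implementation). Once the summed-area table is built, the sum of any rectangle costs four lookups, so a two-rectangle Haar-like feature costs a constant number of operations regardless of its size:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum over the inclusive rectangle via four table lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def haar_two_rect(ii, top, left, height, width):
    """A two-rectangle (horizontal) Haar-like feature: the difference
    between the sums of two adjacent, equal-size rectangles."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, top + height - 1, left + half - 1)
    right_sum = rect_sum(ii, top, left + half, top + height - 1, left + width - 1)
    return left_sum - right_sum
```

This constant-time evaluation is what makes scanning many thousands of feature/window combinations per frame feasible in the VJ cascade.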

Some early Haar-like extensions and enhancements intended to overcome some of these shortcomings include rotated Haar-like features [18], sparse features [19], and polygon features [20]. Haar-like features have also been replaced by more powerful image descriptors, such as local binary patterns (LBP) [21], spatial histogram features [22], histograms of oriented gradients (HoG) [23], multidimensional local Speeded-Up Robust Features (SURF) patches [24], and, more recently, by normalized pixel difference (NPD) [17] and aggregate channel features [25], to name but a few.

Some older feature selection and filtering techniques for reducing the pool size, speeding up training, and improving the underlying boosting algorithm of the cascade paradigm include the works of Brubaker et al. [26] and Pham et al. [27]. Küblbeck et al. [28] improved illumination invariance and speed by combining boosting with the modified census transform (MCT), and Huang et al. [29] proposed a method for detecting faces with arbitrary in-plane and off-plane rotation angles in both still images and videos. For an excellent survey of face detection methods prior to 2010, see [11].

Some noteworthy 2D approaches produced in the last decade include the work of Li et al. [15] at Intel Labs, who introduced a two-pronged strategy for faster convergence of the SURF cascade: first, adopting, as in [24], multidimensional SURF features rather than single-dimensional Haar features to describe local patches, and second, replacing decision trees with logistic regression. Also of note are two simple approaches proposed by Mathias et al. [30] that obtained top performance compared with such commercial face detectors as Google Picasa, Face.com, Intel Olaworks, and Face++. One is based on rigid templates and is similar in structure to the VJ algorithm; the other uses a simple deformable part model (DPM), which, in brief, is a generalizable object detection approach that combines the estimation of latent variables for alignment and clustering at training time with multiple components and deformable parts to manage intra-class variance.

Four 2D models of interest in this study are the face detectors proposed by Nilsson et al. [31], Asthana et al. [32], Liao et al. [33], and Markuš et al. [34]. Nilsson et al. [31] used successive mean quantization transform (SMQT) features applied to a split-up Sparse Network of Winnows (SNoW) classifier. Asthana et al. [32] employed face fitting, i.e., a method that models a face shape with a set of parameters for controlling a facial deformable model. Markuš et al. [34] combined a modified VJ method with an algorithm for localizing salient facial landmark points. Liao et al. [33], in addition to proposing the aforementioned scale-invariant NPD features, expanded the original VJ tree classifier with two leaves to a deeper quadratic tree structure.

Another powerful approach for handling the complexities of 2D face detection is deep learning [35–41]. For instance, Girshick et al. [36] were among the first to use Convolutional Neural Networks (CNN) in combination with regions for object detection. Their model, appropriately named Region-CNN (R-CNN), consists of three modules. In the testing phase, R-CNN generates approximately 2000 category-independent region proposals (module 1), extracts a fixed-length deep feature vector from each proposal using a CNN (module 2), and then classifies them with Support Vector Machines (SVMs) (module 3). In contrast, the deep dense face detector (DDFD) proposed by Farfade et al. [37] requires no pose/landmark annotations and can detect faces in many orientations using a single deep learning model. Zhang et al. [39] proposed a deep learning method capable of detecting tiny faces, also using a single deep neural network.

The development of affordable depth cameras has motivated another way to enhance the accuracy of face detection: going beyond the limitations imposed by the monocular 2D approach to include additional 3D information, such as that afforded by the Minolta Vivid 910 range scanner [42], the MU-2 stereo imaging system [43], the VicoVR sensor, the Orbbec Astra, and Microsoft's Kinect [44], the latter of which is arguably the most popular 3D consumer-grade device on the market. Kinect combines a 2D RGB image with a depth map (RGB-D). Initially (Kinect 1), the depth map was computed on the structured light principle, projecting a pattern onto the scene to determine the depth of every object; later (Kinect 2), it exploited the time-of-flight principle, determining depth by measuring the changes that an emitted light signal undergoes when it bounces back from objects.
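
At its core, the time-of-flight principle reduces to halving the round-trip path of the emitted light. The toy snippet below illustrates only this geometric relationship; it ignores the modulated-phase measurement that a device like Kinect 2 actually performs to estimate the round-trip time:

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def tof_depth(round_trip_seconds):
    """Depth from time of flight: the light travels to the object and
    back, so the object's distance is half the round-trip path length."""
    return C * round_trip_seconds / 2.0
```

For example, a round trip of roughly 13 nanoseconds corresponds to an object about 2 m from the sensor, which is why time-of-flight hardware must resolve timing differences at the sub-nanosecond scale.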

Since depth information is insensitive to pose and changes in illumination [45], many researchers have explored depth maps and other kinds of 3D information [46]; furthermore, several benchmark datasets using Kinect have been developed for both face recognition [44] and face detection [47]. The classic VJ algorithm was adapted to consider depth and color information a few years after Viola and Jones published their groundbreaking work [48,49]. To improve detection rates, most 3D face detection methods combine depth images with 2D gray-scale images. For instance, in Shieh et al. [50], the VJ algorithm is applied to images to detect a face, and then its position is refined via structured light analysis.

Expanding on the work of Shotton et al. [51], who used pair-wise pixel comparisons in depth images to quickly and accurately classify body joints and parts from single depth images for pose recognition, Mattheij et al. [52] compared square regions in a pair-wise fashion for face detection. Taking cues from biology, Jiang et al. [53] integrated texture and stereo disparity information to filter out locations unlikely to contain a face. Anisetti et al. [54] located faces by applying a coarse detection method followed by a technique based on a 3D morphable face model that improves accuracy by reducing the number of false positives, and Taigman et al. [6] found that combining a 3D model-based alignment with DeepFace trained on the Labeled Faces in the Wild (LFW) dataset [55] generalized well in the detection of faces in an unconstrained environment. Nanni et al. [9] overcame the problem of increased false positives when combining different face detectors in an ensemble by applying different filtering steps based on information in the Kinect depth map.
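
A minimal sketch of the kind of pair-wise depth comparison used in [51,52] follows. The two probe offsets are scaled by the inverse of the depth at the reference pixel so that the probe pattern covers roughly the same physical extent whether the subject is near or far; off-image probes return a large constant, as in [51]. All names and the exact rounding are illustrative choices, not the cited papers' code:

```python
def pairwise_depth_feature(depth, x, y, u, v, large=1e6):
    """Depth-normalized pair-wise comparison at pixel (x, y).

    depth: row-major list of lists of depth values (e.g., metres).
    u, v:  2D offsets, divided by the reference depth so the feature is
           approximately invariant to the subject's distance.
    """
    d0 = depth[y][x]

    def probe(offset):
        px = x + int(round(offset[0] / d0))
        py = y + int(round(offset[1] / d0))
        if 0 <= py < len(depth) and 0 <= px < len(depth[0]):
            return depth[py][px]
        return large  # off-image probes get a large sentinel value

    return probe(u) - probe(v)
```

Features of this form are extremely cheap (two lookups and a subtraction), which is what allows a forest of such tests to run per pixel in real time.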

The face detection system proposed in this paper is composed of an ensemble of face detectors that exploits information extracted from both the 2D images and the depth maps obtained by Microsoft's Kinect 1 and Kinect 2 devices. This paper improves on the method presented in [9] by testing a set of filters, including a new wave-based filter proposed here, on a new collection of face detectors; the objective is to find those filters that preserve the ensemble's increased rate of true positives while simultaneously reducing the number of false positives. Creating an ensemble of classifiers is a feasible method for improving performance in face detection (see [9]), as in many other classification problems. The main reason that ensembles improve face detection performance is that combining different methods increases the number of candidate windows and thus the probability of recovering a true positive that a single detector would miss. The main drawback, however, is the increased generation of false positives, and the rationale behind the proposed approach is to apply filtering steps that reduce them. The present work extends [9] by adding further face detectors to the proposed ensemble.
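
The ensemble-plus-filtering strategy can be sketched as follows. This is a hypothetical Python outline of the control flow only (the paper's actual implementation is in MATLAB, and the detector/filter interfaces shown are illustrative): candidate windows from every detector are pooled, and a window survives only if every filtering step accepts it:

```python
def ensemble_detect(image, depth_map, detectors, filters):
    """Pool candidate face windows from all detectors, then apply each
    filtering step in turn to discard likely false positives.

    detectors: callables image -> list of (x, y, w, h) candidate boxes.
    filters:   callables (box, image, depth_map) -> bool (True = keep).
    """
    candidates = []
    for detect in detectors:
        candidates.extend(detect(image))  # union raises the true-positive rate

    # Filtering counteracts the ensemble's main drawback: extra false positives.
    kept = [box for box in candidates
            if all(f(box, image, depth_map) for f in filters)]
    return kept
```

A depth-based filter in this scheme could, for instance, reject windows whose depth profile is implausible for a face at that distance; the pooling step explains both the gain in recall and the need for such filters.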

The best-performing system developed experimentally in this work is validated on the challenging dataset presented in [9], which contains 549 samples with 614 upright frontal faces and includes depth images as well as 2D images. The results in the experimental section demonstrate that the filtering steps succeed in significantly decreasing the number of false positives without significantly affecting the detection rate of the best-performing ensemble of face detectors. To further validate the strength of the proposed ensemble system, we test it on the widely used BioID dataset [56], where it obtains a 100% detection rate with a limited number of false positives. Our best ensemble/filter combination outperforms the method proposed by Markuš et al. [34], which has been shown to surpass the performance of well-known state-of-the-art commercial face detection systems such as Google Picasa, Face++, and Intel Olaworks.

The organization of this paper is as follows. In Section 2, the strategy taken in this work for face detection is described along with the face detectors tested in the ensembles and the different filtering steps. In Section 3, the experiments on the two above-mentioned datasets are presented, along with a description of the datasets, definition of the testing protocols, and a discussion of the experimental results. The paper concludes, in Section 4, by providing a summary with some notes regarding future directions. The MATLAB code developed for this paper, along with the dataset, is freely available at https://github.com/LorisNanni.
