2.3.1. Image Size Filter (SIZE)

SIZE [10] rejects candidate faces based on the size of the face region extracted from the depth map. First, the face detector identifies the 2D position and dimensions ($W\_{2D}$, $h\_{2D}$), in pixels, of a candidate face region. Second, these dimensions are used to estimate the corresponding 3D physical dimensions ($W\_{3D}$, $h\_{3D}$), in mm, as follows:

$$W\_{3D} = W\_{2D} \frac{\overline{d}}{f\_{x}} \text{ and } h\_{3D} = h\_{2D} \frac{\overline{d}}{f\_{y}} \tag{6}$$

where $f\_x$ and $f\_y$ are the Kinect camera focal lengths computed by the calibration algorithm in [57], and $\overline{d}$ is the depth of the candidate bounding box, defined as the median of its depth samples. The median is used in place of the mean to reduce the impact of noisy samples. Face candidate regions are rejected when their estimated physical dimensions lie outside the fixed range [0.075, 0.35] m.
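The filter can be illustrated with a minimal Python sketch. It is not the implementation of [10] and rests on several assumptions: depth samples are given in mm with 0 marking a missing reading, the acceptance range is interpreted in metres, both the width and the height must fall inside the range, and the focal length of 525 pixels in the usage note is a placeholder rather than a calibrated value. The function names `physical_size` and `passes_size_filter` are hypothetical.

```python
from statistics import median

# Assumed acceptance range for the physical face size, in metres.
MIN_SIZE_M = 0.075
MAX_SIZE_M = 0.35


def physical_size(w_2d, h_2d, depths_mm, fx, fy):
    """Estimate the physical width and height (mm) of a candidate box.

    w_2d, h_2d : box dimensions in pixels
    depths_mm  : depth samples inside the box, in mm (assumed: 0 = missing)
    fx, fy     : focal lengths in pixels
    Returns None when the box contains no valid depth sample.
    """
    valid = [d for d in depths_mm if d > 0]  # drop missing readings
    if not valid:
        return None
    d_bar = median(valid)  # median limits the influence of noisy samples
    return w_2d * d_bar / fx, h_2d * d_bar / fy


def passes_size_filter(w_2d, h_2d, depths_mm, fx, fy):
    """Return True when both physical dimensions lie inside the range."""
    size = physical_size(w_2d, h_2d, depths_mm, fx, fy)
    if size is None:
        return False  # assumed policy: no depth information, reject
    w_m, h_m = size[0] / 1000.0, size[1] / 1000.0  # mm to m
    return (MIN_SIZE_M <= w_m <= MAX_SIZE_M
            and MIN_SIZE_M <= h_m <= MAX_SIZE_M)
```

With these placeholder values, a 100 × 120 pixel box whose median depth is 800 mm corresponds to roughly 152 mm × 183 mm and is kept, whereas the same box at 4000 mm corresponds to roughly 762 mm × 914 mm and is rejected.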
