### *3.2. Solution*

As previously mentioned, the WSPD contains the largest number of images and bounding boxes among the available person detection datasets. It also covers a wide variety of person images collected from locations around the world. Because it was collected semi-automatically, the dataset has millions of bounding boxes, which may be useful for pre-training. We used a WSPD pre-trained model to apply self-training to another dataset, both to collect high-quality bounding boxes and to investigate the impact of each age attribute on the miss rate. Our self-training pipeline is shown in Figure 1. First, we input images from the Places365 dataset [13] to the SSD, a detector pre-trained with the WSPD, to estimate the locations of bounding boxes. We then assigned a pseudo-label of "person" to each predicted bounding box. The generation of a pseudo-label and its bounding box location is expressed by the following equation:

$$(y', b'_{\text{box}}) = D(x; \theta), \tag{1}$$

where *y'* and $b'_{\text{box}}$ represent the predicted object category and bounding box, respectively, *x* is the input image, *D* is the detector, and *θ* represents the learned parameters of the detector. This self-training approach allows us to extend the dataset automatically. We refer to the WSPD and the generated pseudo-labeled Places365 dataset together as the Self-Trained Person Dataset (STPD).
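The pseudo-labeling step of Equation (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `detector` callable, the `score_threshold` filter and its value of 0.5, and the output record layout are all assumptions introduced here, since the text does not say how predictions were filtered.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (predicted box, confidence score)


def generate_pseudo_labels(
    images: Dict[str, object],
    detector: Callable[[object], List[Detection]],
    score_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Dict[str, List[dict]]:
    """Run D(x; theta) on each unlabeled image and keep confident boxes."""
    pseudo_labels: Dict[str, List[dict]] = {}
    for image_id, image in images.items():
        kept = [
            {"label": "person", "box": box, "score": score}
            for box, score in detector(image)
            if score >= score_threshold
        ]
        if kept:  # images without a confident detection are left out
            pseudo_labels[image_id] = kept
    return pseudo_labels


# Hypothetical stand-in for the WSPD pre-trained SSD.
def stub_detector(image: object) -> List[Detection]:
    return [((10.0, 20.0, 50.0, 120.0), 0.9), ((0.0, 0.0, 5.0, 5.0), 0.2)]


labels = generate_pseudo_labels({"img_001": None}, stub_detector)
```

In this sketch, the kept records would be merged with the WSPD annotations to form the combined training set.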

**Figure 1.** The self-training approach. We used the SSD pre-trained on the WSPD to infer the locations of bounding boxes in unlabeled images from the Places365 dataset. We then gave each predicted bounding box a "person" attribute label. By combining these pseudo-labels with the WSPD labels and pre-training the SSD on the combined data, we obtained a model trained on a larger dataset with which to verify miss rates.

Furthermore, we pre-trained the SSD on the constructed STPD and compared its detection performance with that of the model pre-trained on the WSPD. To examine the disparity in the miss rate among age attributes, each bounding box must carry an age attribute. Therefore, to evaluate the miss rate for each age attribute with the models pre-trained on the WSPD and the STPD, we assigned "adult" and "children" labels to the INRIA Person Dataset, which is commonly used for person detection. We also re-annotated the locations of the bounding boxes. The two age attributes follow the age categories defined by the Statistics Bureau of the Ministry of Internal Affairs and Communications in Japan: (i) children (0–14 years) and (ii) adults (15 years and older). As a result, we constructed a pedestrian detection dataset consisting of 902 images and 2993 bounding boxes for training and evaluation. We named this dataset the Fairness-Aware INRIA Person Dataset (FA-INRIA). An example of the annotations and the breakdown of the dataset attributes are shown in Figure 2 and Table 3, respectively.

**Figure 2.** Examples of age attribute annotation in Fairness-Aware INRIA Person Dataset (FA-INRIA).


**Table 3.** The age attributes in the Fairness-Aware INRIA Person Dataset (FA-INRIA).

### *3.3. Experimental Settings*

In this paper, we compared the results under the same pre-training conditions. The batch sizes for pre-training the SSD were set to 64, 128 and 256, the number of epochs was set to 10, and the learning rate was set to 0.0005. When we conducted fine-tuning with the FA-INRIA using the pre-trained models on each dataset, the batch size was set to 4, the number of iterations was set to 12,000, and the learning rate was set to 0.0005. Furthermore, the training and test datasets were used with the same configuration as the original INRIA Person Dataset. The experimental settings described below also conform to these conditions.
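For reference, the hyperparameters stated above can be collected into a small configuration sketch. The class and field names are hypothetical and introduced only for illustration; the values are those given in the text. The optimizer, input resolution, and other settings are not specified here and are therefore omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PretrainConfig:
    """SSD pre-training settings (WSPD or STPD)."""
    batch_sizes: Tuple[int, ...] = (64, 128, 256)  # each value is a separate run
    epochs: int = 10
    learning_rate: float = 0.0005


@dataclass(frozen=True)
class FinetuneConfig:
    """Fine-tuning settings on FA-INRIA."""
    batch_size: int = 4
    iterations: int = 12_000
    learning_rate: float = 0.0005


pretrain = PretrainConfig()
finetune = FinetuneConfig()
```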

### *3.4. Evaluation Metric*

We used only the miss rate as the evaluation metric to assess the detection performance for adults and children. In person detection, the relationship between the miss rate and false positives per image is often evaluated. However, our goal is to detect all ground truth bounding boxes. Therefore, we calculated the miss rate by examining the breakdown of the age attributes of the bounding boxes that could not be detected. The miss rate *MR* is given by the following equation:

$$MR = 1 - Recall \tag{2}$$
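A per-attribute miss rate in the sense of Equation (2) could be computed along the following lines for a single image. This is a sketch under stated assumptions: the IoU threshold of 0.5, the greedy one-to-one matching, and the function names are not taken from the paper, which does not describe its matching criterion in this section.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def miss_rate_by_attribute(
    ground_truth: List[Tuple[Box, str]],
    detections: List[Box],
    iou_threshold: float = 0.5,  # assumed matching criterion
) -> Dict[str, float]:
    """Return MR = 1 - Recall for each age attribute in one image."""
    total: Dict[str, int] = defaultdict(int)
    missed: Dict[str, int] = defaultdict(int)
    unused = list(detections)
    for gt_box, attribute in ground_truth:
        total[attribute] += 1
        # Greedy matching: take the best-overlapping detection still unused.
        best = max(unused, key=lambda d: iou(gt_box, d), default=None)
        if best is not None and iou(gt_box, best) >= iou_threshold:
            unused.remove(best)
        else:
            missed[attribute] += 1
    return {attr: missed[attr] / total[attr] for attr in total}


# Made-up boxes for illustration only.
gt = [
    ((0.0, 0.0, 10.0, 10.0), "adult"),
    ((20.0, 20.0, 30.0, 30.0), "adult"),
    ((50.0, 50.0, 60.0, 70.0), "children"),
]
dets = [(0.0, 0.0, 10.0, 11.0), (51.0, 50.0, 60.0, 70.0)]
rates = miss_rate_by_attribute(gt, dets)  # {'adult': 0.5, 'children': 0.0}
```

For a whole test set, the missed and total counts would be accumulated over all images before dividing.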

In this paper, we calculated the standard deviation to represent the miss rate disparity among age attributes:

$$MR_{\mathrm{std}} = \frac{1}{n} \sum_{i=1}^{n} (MR_{i} - MR)^{2}, \tag{3}$$

where *n* is the number of attribute classes, which in this study was two ("Adult" and "Children"), $MR_{i}$ is the miss rate of the *i*-th attribute class, and *MR* is the mean miss rate over the classes.
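The disparity measure of Equation (3) can be sketched as below. Note that, as written, the expression is the mean squared deviation of the per-class miss rates; taking its square root would give the standard deviation in the usual sense. The function name, the `take_sqrt` flag, and the example values are assumptions made here for illustration and are not results from the paper.

```python
import math
from typing import Dict


def miss_rate_disparity(miss_rates: Dict[str, float], take_sqrt: bool = False) -> float:
    """Evaluate Equation (3) over per-attribute miss rates.

    With take_sqrt=False the value follows the equation as written;
    with take_sqrt=True the square root of that value is returned.
    """
    n = len(miss_rates)
    mean = sum(miss_rates.values()) / n
    squared_dev = sum((mr - mean) ** 2 for mr in miss_rates.values()) / n
    return math.sqrt(squared_dev) if take_sqrt else squared_dev


# Hypothetical miss rates, chosen only to illustrate the arithmetic.
example = {"adult": 0.10, "children": 0.20}
as_written = miss_rate_disparity(example)
with_sqrt = miss_rate_disparity(example, take_sqrt=True)
```

With two classes, the square-rooted value is half the absolute difference between the two miss rates, so a larger gap between adults and children yields a larger disparity.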
