*3.5. Results*

Table 4 shows the miss rate on the FA-INRIA Person Dataset for each of the pre-trained models. Compared with the model pre-trained on the WSPD, the model pre-trained on the STPD reduced the miss rate by up to 0.4% for adults and up to 3.9% for children. For the WSPD pre-trained model, the disparity between the miss rates of adults and children, measured as a standard deviation, ranged from a minimum of 3.1% to a maximum of 4.6%. In contrast, for the STPD pre-trained model it ranged from a minimum of 2.1% to a maximum of 2.9%.
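As an illustration of how such a disparity figure can be computed, the following minimal sketch derives a miss rate per attribute and then takes the standard deviation across attributes. It assumes that the miss rate is the percentage of ground-truth boxes left undetected and that the population standard deviation is used; neither convention is confirmed here, and the function names and example numbers are hypothetical.

```python
from statistics import pstdev
from typing import Mapping


def miss_rate(num_missed: int, num_ground_truth: int) -> float:
    """Miss rate in percent: share of ground-truth boxes that were not detected.

    Assumption: a plain ratio, not a log-average miss rate.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    return 100.0 * num_missed / num_ground_truth


def disparity(miss_rates: Mapping[str, float]) -> float:
    """Population standard deviation of the per-attribute miss rates.

    Assumption: population (not sample) standard deviation.
    """
    return pstdev(list(miss_rates.values()))


if __name__ == "__main__":
    # Hypothetical counts, not taken from Table 4.
    rates = {
        "adult": miss_rate(num_missed=40, num_ground_truth=200),
        "child": miss_rate(num_missed=26, num_ground_truth=100),
    }
    print(rates, disparity(rates))
```

With only two attributes, the population standard deviation equals half of the absolute difference between the two miss rates, so a smaller value means that adults and children are detected at more similar rates.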

**Table 4.** Detection performance comparisons for our FA-INRIA. We use standard deviation to describe the disparity in detection rates between attributes. It is clear that our approach reduces the miss rate for all attributes.


Figure 3 shows the detection results obtained after fine-tuning on the FA-INRIA with the detectors pre-trained on each dataset. It illustrates that the STPD pre-trained model is able to detect people that the WSPD pre-trained model misses.

**Figure 3.** Comparison of detection results of WSPD and STPD.

### **4. Analysis and Discussion**

#### *4.1. The Relationship between the Bias in the Quantity of Data and the Miss Rate*

In the aforementioned results, we successfully generated pseudo bounding boxes containing a person from the Places365 dataset. In Figure 1, we present a visualization of the location of a person's bounding box that was predicted during the process of self-training. This method was implemented based on the success of self-training in object detection [14] and was found to reduce the miss rate for both adults and children. Moreover, it is effective in collecting data on pedestrians regardless of their age attributes, and not only on children, for whom the amount of data is small. If the bias in the quantity of data between age attributes is the primary cause of the disparity in detection performance, then only the bounding boxes for children need to be collected more efficiently. However, manual annotation is very costly and impractical. Therefore, we applied data augmentation to the children's bounding boxes in the FA-INRIA training data to investigate the effect on the miss rate for adults and children. In our work, we augmented the children's bounding boxes by applying a horizontal flip.
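A horizontal flip has to be applied to the pixels and to the box coordinates together. The sketch below shows one way to do this, assuming boxes are stored as `(xmin, ymin, xmax, ymax)` in pixels with an exclusive `xmax` and that an image is a list of pixel rows; these conventions and the function names are illustrative assumptions. How images containing both adults and children were handled is not specified, so the sketch covers only the geometric operation.

```python
from typing import List, Tuple

# Assumed box convention: (xmin, ymin, xmax, ymax) in pixels, xmax/ymax exclusive.
Box = Tuple[int, int, int, int]
Image = List[List[int]]


def hflip_image(image: Image) -> Image:
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def hflip_box(box: Box, image_width: int) -> Box:
    """Map a box to its position in the horizontally flipped image.

    The vertical extent is unchanged; the horizontal extent is mirrored
    around the image's vertical centre line.
    """
    xmin, ymin, xmax, ymax = box
    return (image_width - xmax, ymin, image_width - xmin, ymax)


def crop(image: Image, box: Box) -> Image:
    """Extract the pixels covered by a box."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in image[ymin:ymax]]


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
    child_box = (0, 0, 2, 2)  # hypothetical annotation
    flipped_image = hflip_image(image)
    flipped_box = hflip_box(child_box, image_width=4)
    print(flipped_box, crop(flipped_image, flipped_box))
```

The crop taken from the flipped image at the flipped box is the mirror image of the original crop, which is a quick consistency check for the coordinate transform.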

Table 5 shows the detection performance when data augmentation is applied to the children's bounding boxes. When the batch size is 256, the miss rate decreases for both attributes. However, when the batch size is 64 or 128, the miss rate for children does not change, while the miss rate for adults decreases. These results indicate that applying data augmentation is effective in improving the overall detection performance. On the other hand, the standard deviation shows that the disparity in detection performance between age attributes widens. First and foremost, a "person" can be an adult or a child. Since increasing only the data for children improved the detection performance for adults, we consider that the bias in the quantity of data between classes is not directly relevant to the disparity.

**Table 5.** The impact of applying data augmentation (horizontal flip) only to the bounding boxes of the children in the training data. The results show that applying data augmentation is effective in improving the overall detection performance. On the other hand, it may increase the disparity in detection performance among age attributes.


#### *4.2. The Relationship between the Size of a Person's Bounding Box and the Miss Rate*

Detecting small objects is a difficult task in object and person detection research because of the limited information that can be obtained from a bounding box with a small image size. It is clear that children have smaller bodies than adults. Therefore, the bounding boxes of children tend to be smaller than those of adults. Thus, we thought it would be important to investigate the size of bounding boxes in the FA-INRIA.

Figure 4 presents the distribution of the size of the bounding boxes for adults and children. Adults are shown in red and children are shown in blue. This distribution indicates that most of the bounding boxes that exceed 600 pixels × 300 pixels in height and width, respectively, belong to adults. In other words, the difference in the size distribution of the bounding boxes may be one of the factors affecting the disparity in the miss rate. Figure 5 shows the distribution of the size of the bounding boxes in the FA-INRIA test set: the bounding boxes that could be detected are shown in red and the missed bounding boxes are shown in blue. As these figures show, most of the missed bounding boxes are concentrated at the smaller sizes. Therefore, in order to further mitigate the disparity in the miss rate, it is necessary to use detectors that can detect small persons.
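The relationship between box size and missed detections can also be summarized numerically. The following is a minimal sketch, assuming each test box is recorded as a pair of its height in pixels and a flag indicating whether it was detected; the record format, the height threshold, and the example data are hypothetical and are not drawn from the FA-INRIA.

```python
from typing import Dict, Iterable, Tuple


def miss_rate_by_size(
    records: Iterable[Tuple[int, bool]], height_threshold: int
) -> Dict[str, float]:
    """Miss rate in percent for boxes below and at/above a height threshold.

    Each record is (box_height_in_pixels, was_detected).
    A group with no boxes yields NaN.
    """
    counts = {"small": [0, 0], "large": [0, 0]}  # [missed, total]
    for height, detected in records:
        key = "small" if height < height_threshold else "large"
        counts[key][1] += 1
        if not detected:
            counts[key][0] += 1
    return {
        key: (100.0 * missed / total if total else float("nan"))
        for key, (missed, total) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    records = [(50, False), (80, False), (90, True), (120, True),
               (300, True), (400, True), (650, True), (700, False)]
    print(miss_rate_by_size(records, height_threshold=200))
```

Running the same summary separately for adults and children would indicate whether a size effect remains within each attribute.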

In this paper, we investigated the effect of changing the input image size on the miss rate of each attribute. The SSD resizes the input image to a fixed size regardless of the size of the original image. This process is likely to discard details of the image. We therefore expected that increasing the size of the input image would reduce this loss of information and make small bounding boxes easier to detect. We examined three input image sizes: (i) 150 pixels × 150 pixels; (ii) 300 pixels × 300 pixels; and (iii) 600 pixels × 600 pixels. The default size for the SSD is 300 pixels × 300 pixels. For more accurate validation, we also used a sub-dataset with the same number of bounding boxes for adults and children in the training data.
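To make the effect of the fixed input size concrete, the sketch below computes how many pixels a bounding box occupies after the whole image is resized to a square input. It assumes a plain resize that stretches the image to `input_size × input_size` without preserving the aspect ratio; the original image size and box size in the example are hypothetical.

```python
from typing import Tuple


def resized_box_size(
    box_w: float, box_h: float, image_w: int, image_h: int, input_size: int
) -> Tuple[float, float]:
    """Width and height of a box after the image is stretched to a square input.

    Assumption: each axis is scaled independently by input_size / original_size.
    """
    if image_w <= 0 or image_h <= 0:
        raise ValueError("image dimensions must be positive")
    return (box_w * input_size / image_w, box_h * input_size / image_h)


if __name__ == "__main__":
    # Hypothetical example: a 60 x 120 pixel child box in a 640 x 480 image.
    for input_size in (150, 300, 600):
        w, h = resized_box_size(60, 120, 640, 480, input_size)
        print(f"input {input_size}: box is {w:.1f} x {h:.1f} pixels")
```

Under these assumptions, halving the input size halves each side of the box and so quarters its pixel area, which is why boxes that are already small lose the most usable detail at 150 pixels × 150 pixels.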

**Figure 4.** Distribution of bounding boxes for adults and children in the FA-INRIA Person Dataset. Children's bounding boxes tend to be relatively smaller than those of adults.

**Figure 5.** Whether bounding boxes can be detected in test data (red: detected, blue: missed).

Table 6 shows the miss rate when the input image size of the SSD is changed. Increasing the size of the input image is a major factor in reducing the miss rate. On the other hand, when the input size is small (150 pixels × 150 pixels), the miss rate for children is very high. We consider that this is because resizing removes even more image information from bounding boxes that are already relatively small. As shown in Figure 4, children have a higher proportion of small bounding boxes than adults, which makes their bounding boxes more difficult to detect when the input size is small. Based on this result and Figures 4 and 5, we conclude that the unbalanced distribution of the bounding box sizes is one of the main reasons for the disparity in detection performance between adults and children.

**Table 6.** The effect of changing the input size of the image to the SSD on the detection performance for each age attribute. The results show that increasing the input size decreases the miss rate. In addition, children are more strongly affected by changes in the size of the input. We conclude that the bias in the size of the bounding box is a major factor in the disparity in detection performance.


#### *4.3. Appearance Difference*

We considered two aspects: the bias in the quantity of data between classes and the size of the bounding boxes. However, as shown in Figure 5, we can see that some people are not detected even though the bounding box is relatively large. Moreover, as mentioned in Section 4.1, we found that the bias in the quantity of data between classes is most likely not relevant. These results suggest that there might be other factors that generate disparities in detection performance between age attributes. Subsequently, we hypothesized that there would be apparent differences between the distributions of bounding box sizes of adults and children as they differ significantly in size.

Figure 6 shows the distribution of the image features after they were compressed and visualized using t-SNE. The distribution is not clearly divided by age attribute and is spread evenly, so it is difficult to attribute the disparity in detection performance to differences in appearance. This result is consistent with the observation that applying data augmentation to the children's bounding boxes was more effective in improving the detection rate for adults than for children. Since there is no apparent difference between adults and children, we reiterate that we do not need to consider the bias in the quantity of data between classes to reduce the miss rate for children.

**Figure 6.** Data visualization of bounding boxes using t-SNE (blue: adults, red: children). There is no apparent significant difference between the bounding boxes of children and adults. As mentioned in Section 4.1, when data augmentation was applied to children's bounding boxes, the miss rate was strongly affected for adults but not for children. This data visualization supports the consideration that the bias in the quantity of data between classes has little to do with the disparity in detection performance.
