Article

Improved YOLOv3 Integrating SENet and Optimized GIoU Loss for Occluded Pedestrian Detection

School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2023, 23(22), 9089; https://doi.org/10.3390/s23229089
Submission received: 26 September 2023 / Revised: 8 November 2023 / Accepted: 8 November 2023 / Published: 10 November 2023
(This article belongs to the Section Vehicular Sensing)

Abstract

Occluded pedestrian detection faces considerable challenges: false positives and false negatives in crowd-occlusion scenes reduce detection accuracy. To address this problem, we propose an improved you-only-look-once version 3 (YOLOv3) based on squeeze-and-excitation networks (SENet) and an optimized generalized intersection over union (GIoU) loss for occluded pedestrian detection, named YOLOv3-Occlusion (YOLOv3-Occ). The proposed network incorporates SENet into YOLOv3, assigning greater weights to the features of unobstructed parts of pedestrians and thereby addressing feature extraction for the visible parts. For the loss function, a new generalized intersection over union–intersection over ground truth (GIoUIoG) loss is developed on the basis of the GIoU loss to keep the areas of predicted pedestrian boxes invariant, which tackles the problem of inaccurate pedestrian localization. The proposed method, YOLOv3-Occ, was validated on the CityPersons and COCO2014 datasets. Experimental results show that it obtains a 1.2% MR−2 gain on the CityPersons dataset and a 0.7% mAP@50 improvement on the COCO2014 dataset.

1. Introduction

Pedestrian detection is widely used in many fields to complete perception tasks [1,2,3,4]. However, pedestrian detection faces considerable difficulties: (i) significant differences in pedestrian appearance; (ii) occlusion; and (iii) complex backgrounds. Among them, the occlusion problem has attracted much attention, and research on occluded pedestrian detection has produced substantial results in recent years [5,6,7,8]. However, occlusion is still regarded as an urgent problem in the field of pedestrian detection. In automatic driving scenarios, crowd occlusion is the main aspect of the occlusion problem; it leads to false positives and false negatives, which weaken the performance of pedestrian detectors. Occluded pedestrian detection methods have undergone a transition from manual feature extraction-based methods to deep learning-based methods.
Manual feature extraction-based methods feed manually extracted features into a classifier. Dalal et al. proposed the histogram of oriented gradients (HOG) feature [9], which used the distribution of gradient directions to represent the local shape of a target, but the feature descriptor had a high dimension and the acquisition process was complicated. Dollar et al. introduced the integral channel feature (ICF) [10], using the integral map to sum up the local rectangular area of the channel image. The detection effect was outstanding when there was not much occlusion, but the adaptability to occluded environments was poor. Felzenszwalb et al. proposed the deformable part model (DPM) [11], which modeled the relationship between parts of a human body. Although it could handle changing pedestrian postures, it was slow. Although manual feature extraction-based methods have high accuracy, they require manually extracting high-dimensional features, which is time-consuming, and their detection performance in crowd-occlusion scenes is poor.
Deep learning-based methods use deep neural networks to automatically extract features useful for the task; they are faster and have become the mainstream approach in occluded pedestrian detection. Zhang et al. proposed a part model based on a faster region-based convolutional neural network (Faster R-CNN) [12], which combined two-stage object detection with the part model and added an attention mechanism [13,14,15] to guide the model to focus on the visible parts of the body. However, the features extracted by the convolutional part of the model were not aimed at the unobstructed parts. Huang et al. proposed a post-processing method, R2 non-maximum suppression (R2NMS) [16], which used the visible part of the pedestrian to reduce false positives but introduced false negatives. Ref. [17] proposed a novel NMS algorithm to refine the predicted boxes. However, the algorithm generated a small number of false negatives in crowd scenes. Ref. [18] proposed a model that estimates a set of highly overlapped pedestrians for each proposal, which could be applied to numerous proposal-based detectors. However, false negatives were generated in extremely crowded scenes. Ref. [19] applied the transformer model to crowd detection, focusing on the unobstructed parts of pedestrians. However, the localization accuracy for pedestrians was poor. Chi et al. proposed a joint head and human detection network to detect the head and the human body simultaneously in crowd scenes [20], but false negatives remained in heavily overlapped scenes. Although deep learning-based methods are faster, their classification efficiency and localization accuracy for occluded pedestrians remain low, and the number of false negatives is high.
To address the problem of false positives and false negatives caused by crowd occlusion, an improved you-only-look-once version 3 (YOLOv3) based on squeeze-and-excitation networks (SENet) and optimized generalized intersection over union (GIoU) loss, called YOLOv3-Occlusion (YOLOv3-Occ), is proposed. It uses uncovered parts to accurately classify and locate pedestrians in crowd occlusion scenes. The contributions of the proposed YOLOv3-Occ are summarized as follows:
  • The channel attention mechanism of squeeze-and-excitation networks (SENet) is incorporated between the feature extraction layers of YOLOv3, giving larger weights to the features of the non-overlapping parts of pedestrians to address the problem of feature extraction for uncovered parts.
  • The positional loss function Generalized Intersection over Union–Intersection over Ground truth (GIoUIoG) is proposed by replacing IoU in the GIoU loss function with IoG, which keeps the areas of predicted pedestrian frames constant and solves the issue of inaccurate localization of pedestrians.
The rest of this paper is arranged as follows: Section 2 introduces the related works, including loss function works and network model works. Section 3 presents the proposed YOLOv3-Occ method, including preliminary works, the architecture of YOLOv3-Occ, and the optimized loss function. Section 4 describes the experimental results and analysis. Section 5 concludes this paper.

2. Related Works

2.1. Loss Function Works for Pedestrian Location

The accuracy of pedestrian localization can be improved by optimizing the positional loss function. Mean square error loss (MSE Loss) [21] calculated the mean of the squared Euclidean distance between the predicted tensor and the target tensor in n-dimensional space, but the loss value changed drastically in the early stage of training and was sensitive to outliers. Wang et al. [22] proposed smoothL1 loss, which used the l1 and l2 norms of the distance vector between the predicted tensor and the target tensor. However, this loss function was not equivalent to IoU and did not take the relevance of the coordinates of bounding boxes into account. The IoU positioning loss function [23] was proposed by Yu et al., where the coordinates of bounding boxes were treated as a whole to construct the loss function, but when the prediction frame and the target frame were disjoint, the loss function could not be optimized. The GIoU loss function [24] was proposed by Rezatofighi et al., which introduced the normalized area between the prediction box and the target box. However, when the prediction frame and the target frame intersected, the loss could be reduced by shrinking the prediction frame, driving the prediction frame away from the target frame. Ref. [25] proposed the measurement standard of IoG, changing the denominator of IoU to the area of the target box to make the prediction box approach the target box. Ref. [26] proposed two repulsion losses, which introduced penalties for predictions that overlapped with other ground truths and predictions. However, the weights of the two losses were not evaluated by experiments. Ref. [27] proposed NMS loss, which added a penalty on false positives and false negatives to the loss to reduce them during training. However, it was only suitable for binary classification tasks. Previous loss functions did not prevent the prediction area from shrinking while the loss is optimized. To overcome this problem, GIoUIoG loss is proposed in this paper. The loss function works are summarized in Table 1.

2.2. Network Model Works for Occluded Pedestrian Detection

It is imperative to design a robust network model to handle crowd occlusion. The occlusion-aware region-based convolutional neural network (OR-CNN) [28] used part-based models, dividing pedestrians into several parts and merging the results of the part detectors into the final result. Since every part was represented as a rectangle, the model produced noise. Ref. [29] proposed a set of decision trees capturing the overall distribution of all parts, which were shared by the part detectors. Ref. [30] added a channel attention network to a CNN-based method, using channel-wise attention to focus on the unobstructed parts of the occludee. Ref. [31] proposed an attention-guided neural network model (AGNN), which selected features representing the body parts of pedestrians. In multi-scale object detection, previous network models did not transfer visible-part features to the detection branches at different scales. To address this drawback, SENet is incorporated into YOLOv3 in the proposed method. The network model works are summarized in Table 2.

3. Proposed Method: YOLOv3-Occ

In this part, YOLOv3-Occ is proposed to address the problem of crowd occlusion. SENet is integrated into the feature extraction layer of the YOLOv3 network model, giving the unobstructed parts larger attention, which addresses the problem of key body parts being occluded. GIoUIoG loss is proposed as the positioning loss function, making the prediction frame approach the target frame quickly, which solves the problem of the prediction frame moving away from the target frame. Soft-NMS is adopted as the post-processing method. In summary, SENet makes the category and bounding box parameters more accurate, which reduces the initial value of the GIoUIoG loss; GIoUIoG loss makes the positioning of the model more accurate, which in turn reduces the workload of Soft-NMS and improves inference speed. The three modules work together to reduce false negatives and false positives in crowd-occlusion scenes.

3.1. Preliminary Work

3.1.1. SENet

SENet is a channel attention network, which introduces attention scores in the channel dimension. The architecture of SENet is shown in Figure 1 and is mainly divided into three modules: (1) a Global Average Pooling (GAP) layer for compressing the shape of the input feature map to (1, 1, C); (2) a Multi-Layer Perceptron (MLP) for obtaining the attention scores of all channels, as shown in the feature map marked by diverse colors in Figure 1; and (3) a scale operation for obtaining the original feature map injected with the attention scores. The ReLU activation function in the MLP is shown in Formula (1):
c_i^ReLU = max(0, c_i^1),
where max is the maximum function, c_i^1 is one of the values output by the nodes of the first fully connected layer, and c_i^ReLU is c_i^1 activated by the ReLU function.
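The three modules above can be sketched in PyTorch as follows. This is a minimal illustration of the GAP–MLP–scale pipeline rather than the exact module used in the paper; the reduction ratio of 16 is an assumption taken from the original SENet design.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: GAP -> MLP -> channel-wise scale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # GAP: (N, C, H, W) -> (N, C, 1, 1)
        self.excite = nn.Sequential(                     # the two fully connected layers (MLP)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                       # Formula (1)
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel attention scores in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        scores = self.excite(self.squeeze(x).view(n, c))  # (N, C) attention scores
        return x * scores.view(n, c, 1, 1)                # scale: inject scores into the input
```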

3.1.2. GIoU Loss

GIoU loss is a positioning loss function based on IoU loss, which introduces the smallest enclosing rectangle to optimize the situation where the prediction and the target do not overlap. GIoU loss is shown in Formula (2):
GIoU loss = 1 − [IoU − (S_S − S_U)/S_S],
where S_U is the union area of the prediction box and the target box, and S_S is the area of the smallest external rectangle.

3.1.3. Soft-NMS

Soft-NMS is a post-processing method based on NMS, which introduces a one-dimensional Gaussian kernel function to reduce the confidences of significantly overlapped predictions rather than discarding the predictions. The kernel function is shown in Formula (3):
k(IoU(M, d_i)) = e^(−IoU(M, d_i)²/σ),
where M is the prediction frame with the highest confidence at present, d_i is one of the remaining prediction frames after removing M, IoU(M, d_i) is the IoU between M and d_i, and σ is the hyperparameter which needs to be adjusted. Equation (3) is the Gaussian kernel function with a mean of 0 and a variance of σ.
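A minimal sketch of Gaussian Soft-NMS using Formula (3) is given below; σ = 0.5 matches Table 5, while the small score threshold used to drop heavily decayed boxes is an assumption, not a value from the paper.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping confidences instead of discarding boxes."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = int(np.argmax(scores))                  # index of the highest-confidence box M
        M, s_m = boxes.pop(m), scores.pop(m)
        keep.append((M, s_m))
        # Formula (3): multiply each remaining confidence by exp(-IoU(M, d_i)^2 / sigma)
        scores = [s * np.exp(-iou(M, d) ** 2 / sigma) for s, d in zip(scores, boxes)]
        # drop boxes whose decayed confidence falls below a small threshold
        boxes = [d for s, d in zip(scores, boxes) if s > score_thresh]
        scores = [s for s in scores if s > score_thresh]
    return keep
```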

3.2. The Architecture of the Proposed YOLOv3-Occ

Figure 2 shows the architecture of YOLOv3-Occ. For an input image, the output is the image with detection boxes. Note that three SENets [32] are incorporated into the prediction branches of the three scales, respectively, so that the feature channels of the visible parts of pedestrians can be given larger weights. In addition, the outputs of the first two SENets are fused with the outputs of two basic residual blocks in the CNN layers, respectively, so the visible-part features of the previous scale branch are used as the input of the SENet in the subsequent scale branch to refine the features. Therefore, the accuracy of pedestrian classification and positioning at the three scales can be improved by harnessing the refined features of the unobstructed parts. The proposed GIoUIoG loss is used to compute the positioning loss and optimize the model; the details of the loss are described in Section 3.3.
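As a rough illustration of this wiring, the sketch below places an SE block (reusing the SEBlock sketch from Section 3.1.1) in each scale branch and passes the refined features of one branch, after a lateral convolution and upsampling, to the next. The channel counts, CBL stacks, and exact fusion layers of the real model are omitted or assumed, so this is a schematic rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SEHead(nn.Module):
    """Schematic scale branch: SE attention on the branch input, then a 1x1 detection conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.se = SEBlock(in_ch)                 # SEBlock from the sketch in Section 3.1.1
        self.detect = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        refined = self.se(x)                     # visible-part channels receive larger weights
        return self.detect(refined), refined     # refined features also feed the next scale

def forward_three_scales(c3, c4, c5, heads, laterals):
    """c3, c4, c5: backbone features from shallow to deep; heads: three SEHead modules;
    laterals: two modules (conv + 2x upsample, shapes assumed) that fuse a refined deep
    feature with the next backbone feature via concatenation, as in Figure 2."""
    p5, r5 = heads[0](c5)                                        # deepest scale first
    p4, r4 = heads[1](torch.cat([laterals[0](r5), c4], dim=1))   # fuse refined features with c4
    p3, _ = heads[2](torch.cat([laterals[1](r4), c3], dim=1))    # fuse refined features with c3
    return p3, p4, p5
```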

3.3. The Proposed Loss Function: GIoUIoG Loss

GIoUIoG loss is the positioning loss function of YOLOv3-Occ. It is obtained by replacing IoU in GIoU loss with IoG. The computation process of GIoUIoG loss is shown in Algorithm 1. The input is the coordinates of the upper left and lower right corners of the prediction box and the target box, and the output is the value of GIoUIoG loss. The time complexity of Algorithm 1 is O(1) since there are no loops or recursion, and the space complexity is O(1) since the required memory does not change with the problem size. GIoUIoG loss is shown in Formula (4):
GIoUIoG loss = 1 − [IoG − (S_S − S_U)/S_S],
where S_U is the union area of the prediction box and the target box, and S_S is the area of the smallest external rectangle. When there is no intersection between the prediction box and the target box, IoG is equal to 0. When (4) is optimized in this situation, the vertices of the prediction frame move toward the region where the prediction frame and the target frame overlap. When there is an intersection between the prediction frame and the target frame, IoG is greater than 0. When (4) is optimized in this case, the intersection area between the prediction box and the target box increases, instead of the prediction area being reduced as with GIoU loss. Therefore, localization accuracy can be improved by harnessing the loss function in both overlapped and non-overlapped scenes.
Algorithm 1: The computation process of GIoUIoG loss
input: the coordinates of the upper left corner and the lower right corner of
      prediction(Bp) and ground truth(Bg):
      Bp = (x1p, y1p, x2p, y2p), Bg = (x1g, y1g, x2g, y2g)
output: GIoUIoG loss
Step 1 Calculating the coordinates of the common area(I) of Bp and Bg:
      x1I = maximum(x1p, x1g), y1I = maximum(y1p, y1g), x2I = minimum(x2p, x2g),
      y2I = minimum(y2p, y2g)
      where (x1I, y1I) is the coordinates of the upper left corner, and (x2I, y2I) is the
      coordinates of the lower right corner.
Step 2 Calculating the area of I:
      SI = (x2I − x1I) × (y2I − y1I)
      where x2I − x1I is the width of I, and y2I − y1I is the height of I.
Step 3 Calculating the area of Bp:
      Sp = (x2p − x1p) × (y2p − y1p)
      where x2p − x1p is the width of Bp, and y2p − y1p is the height of Bp.
Step 4 Calculating the area of Bg:
      Sg = (x2g − x1g) × (y2g − y1g)
      where x2g − x1g is the width of Bg, and y2g − y1g is the height of Bg.
Step 5 Calculating the area of the union between Bp and Bg:
      SU = Sp + Sg − SI
      where the reason for the operation, minus SI, is that SI is calculated twice in the
      calculation process of Sp + Sg.
Step 6 Calculating the IoG:
      IoG = SI/Sg
      where IoG is the ratio of the intersection area to the target area.
Step 7 Calculating the coordinates of the smallest external rectangle(Bs) surrounding Bp
      and Bg:
      x1S = minimum(x1p, x1g), y1S = minimum(y1p, y1g), x2S = maximum(x2p, x2g),
      y2S = maximum(y2p, y2g)
      where (x1S, y1S) is the coordinates of the upper left corner, and (x2S, y2S) is the
      coordinates of the lower right corner.
Step 8 Calculating the area of Bs:
      Ss = (x2S − x1S) × (y2S − y1S)
      where x2S − x1S is the width of Bs, and y2S − y1S is the height of Bs.
Step 9 Calculating the GIoUIoG:
      GIoUIoG = IoG − (Ss − SU)/Ss
      where GIoUIoG is generated by replacing IoU in GIoU with IoG.
Step 10  Calculating the GIoUIoG loss:
      GIoUIoG loss = 1 − GIoUIoG
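Algorithm 1 maps directly to code. The sketch below follows the ten steps for a single box pair; the only addition is clamping the intersection width and height at zero so that S_I (and hence IoG) is 0 for disjoint boxes, as assumed in the discussion of Formula (4).

```python
def giou_iog_loss(bp, bg):
    """GIoUIoG loss of Algorithm 1; bp and bg are boxes given as (x1, y1, x2, y2)."""
    x1p, y1p, x2p, y2p = bp
    x1g, y1g, x2g, y2g = bg
    # Steps 1-2: common area I (width/height clamped at zero so disjoint boxes give S_I = 0)
    x1i, y1i = max(x1p, x1g), max(y1p, y1g)
    x2i, y2i = min(x2p, x2g), min(y2p, y2g)
    s_i = max(0.0, x2i - x1i) * max(0.0, y2i - y1i)
    # Steps 3-5: areas of the prediction, the ground truth, and their union
    s_p = (x2p - x1p) * (y2p - y1p)
    s_g = (x2g - x1g) * (y2g - y1g)
    s_u = s_p + s_g - s_i
    # Step 6: IoG, the ratio of the intersection area to the ground-truth area
    iog = s_i / s_g
    # Steps 7-8: smallest external rectangle Bs and its area
    x1s, y1s = min(x1p, x1g), min(y1p, y1g)
    x2s, y2s = max(x2p, x2g), max(y2p, y2g)
    s_s = (x2s - x1s) * (y2s - y1s)
    # Steps 9-10: GIoUIoG and the loss
    giou_iog = iog - (s_s - s_u) / s_s
    return 1.0 - giou_iog

# example with an overlapping prediction/target pair
print(giou_iog_loss((0, 0, 4, 4), (2, 2, 6, 6)))   # ~0.97
```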

4. Experimental Results and Analyses

4.1. Experiment Settings

4.1.1. Datasets

An ideal method of pedestrian detection for crowd occlusion scenes should be robust to instance distributions, i.e., not only effective for crowded detections but also stable for detecting a single person. Two datasets, CityPersons [33] and COCO2014 [34], are adopted for comprehensive evaluations on moderately and slightly occluded scenes, respectively. Table 3 lists the sizes and overlaps of the datasets. Table 4 shows different annotation types for the datasets. The categories of CityPersons are divided into six classes: fake humans, pedestrians, riders, sitting persons, other persons with unusual postures, and groups of people; and COCO2014 uses 80 classes: person, bicycle, car, and other common categories in life. The size of an image in CityPersons is 1024 × 2048 pixels and in COCO2014 480 × 640 pixels. Since the proposed approach aims to improve the performance of crowded detections, numerous experiments are performed on CityPersons. In addition, experiments on COCO2014 are performed to verify whether the proposed method undermines uncrowded detections.

4.1.2. Evaluation Metrics

Precision (P), Recall (R), Average Precision50 (AP50), mean AP@50 (mAP@50), and the log-average Miss Rate on False Positives Per Image (FPPI) in [10^−2, 10^0] (MR−2) are used as the evaluation metrics of the model (a small computation sketch in code follows this list):
  • Both P and R are for a single category of a single picture. Larger P and R indicate better performance. The formula of P and R are shown in (5) and (6), respectively:
    P = (true positives)/(true positives + false positives),
    R = (true positives)/(true positives + false negatives),
  • AP50 is aimed at a single category of all pictures, which is the area enclosed by the P–R curve and the R axis when the iou-threshold is 0.5. It is used to measure the performance of the model in a given category. The larger the AP50 is, the better the performance is.
  • mAP@50 is the mAP when the iou-threshold is 0.5, which is used to measure the performance of the model in all categories. The larger the mAP@50 is, the better the performance is. The formula of mAP@50 is shown in (7):
    mAP@50 = (1/C) Σ_{c=1}^{C} AP50_c,
    where c is one category, C is the number of classes, and AP50_c is the AP50 of the class represented by c.
  • MR−2, the area enclosed by the MR-FPPI curve and the FPPI axis, is commonly used in pedestrian detection. A smaller MR−2 suggests better performance.
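The following is a minimal sketch of Formulas (5)–(7); the integration of AP50 over the P–R curve is omitted, and the counts and per-class AP50 values in the usage lines are made up purely for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Formula (5): true positives / (true positives + false positives)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Formula (6): true positives / (true positives + false negatives)."""
    return tp / (tp + fn) if tp + fn else 0.0

def map_at_50(ap50_per_class: dict) -> float:
    """Formula (7): mean of the per-class AP50 values."""
    return sum(ap50_per_class.values()) / len(ap50_per_class)

# illustrative values only
print(precision(tp=42, fp=8), recall(tp=42, fn=14))    # 0.84 0.75
print(map_at_50({"person": 0.51, "rider": 0.47}))      # 0.49
```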

4.1.3. Detailed Settings

For the anchor setting, the same anchor scales and shapes as in [35] are used. Mini-batch Gradient Descent (MGD) is used as the optimization algorithm. We set the batch size to 32 images since this batch size reaches the fastest training speed within the GPU memory. As seen in Table 5, the initial learning rate is set to 10^−3 and is used for the first 65 epochs; the learning rate is then reduced by a factor of 10, to 10^−4, for the last 20 epochs. At the same time, gradual warmup is used in the first 1000 steps of training: as the number of steps increases, the learning rate increases slowly. From the 1001st step onward, a constant learning rate is used for training.
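A minimal sketch of this schedule is shown below; the linear warmup shape and the steps_per_epoch value are assumptions, since the paper only states that the learning rate rises slowly over the first 1000 steps.

```python
def learning_rate(step: int, steps_per_epoch: int, base_lr: float = 1e-3,
                  warmup_steps: int = 1000, decay_epoch: int = 65) -> float:
    """Sketch of the schedule in Table 5: gradual warmup over the first 1000 steps,
    10^-3 up to epoch 65, then 10^-4 for the remaining 20 epochs.
    The linear warmup shape is an assumption, not stated in the paper."""
    if step <= warmup_steps:
        return base_lr * step / warmup_steps              # learning rate rises slowly during warmup
    epoch = step // steps_per_epoch + 1                   # 1-indexed epoch containing this step
    return base_lr if epoch <= decay_epoch else base_lr * 0.1

# with 2975 CityPersons training images and batch size 32, steps_per_epoch is roughly 93
print(learning_rate(step=500, steps_per_epoch=93))    # 5e-4, still warming up
print(learning_rate(step=6200, steps_per_epoch=93))   # 1e-4, past epoch 65
```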

4.2. Experiments on CityPersons

All the pedestrian detectors are trained on the training set of CityPersons and evaluated on the validation set. The settings of the training and validation process are shown in Table 5.

4.2.1. Ablation Study

Table 6 shows the ablation experiments for the proposed method in Section 3, covering the SENet module and the GIoUIoG loss. The performance on all criteria improves when either contribution is added individually. Specifically, adding the GIoUIoG loss gives P a 1.1% improvement, the larger of the two improvements in P, indicating that the loss improves the accuracy of the match between prediction frames and target frames as expected. In addition, integrating the SENet module gives a 0.8% improvement in R, the larger of the two improvements in R, suggesting that more true positives can be found. More importantly, adding both contributions simultaneously boosts all the evaluation metrics, indicating that the two contributions are compatible with each other.

4.2.2. Comparisons with Previous Works

Table 7 lists other state-of-the-art methods on the CityPersons validation set. Although the mAP@50 of our approach is 45.6% lower than that of EMD-RCNN, its MR−2 is better than that of most of the listed methods. In particular, it improves MR−2 by 1.2% and 0.8% over Adaptive-NMS and MGAN, respectively. SENet is incorporated into the network model to make the class and positional parameters more accurate, and GIoUIoG loss is proposed to make the positioning of the predictions more accurate; these two modules help reduce the number of false positives. Soft-NMS is used to retain predictions of other targets, which reduces the number of false negatives. As a result, the MR−2 of the model is reduced.

4.2.3. The Impact of the Hyperparameters on YOLOv3-Occ

Figure 3 shows the change in mAP@50 under different batch sizes. All curves present a similar trend, rising rapidly and then remaining stable. Specifically, when the batch size is set to 32, mAP@50 reaches its peak the fastest, taking 20 epochs, whereas it reaches its peak the slowest with a batch size of 8. This indicates that the batch size influences the training speed of the proposed method and that, within the GPU memory limit, a larger batch size produces faster training.
Figure 4 shows the change in mAP under diverse iou-thresholds. The iou-threshold is used to judge whether a prediction is a true positive. All curves reach their peaks at almost the same speed. Note that the peak mAP is largest under an iou-threshold of 0.5 and smallest under an iou-threshold of 0.9. This suggests that the mAP performance of the proposed method is affected by the iou-threshold and that performance is better with a smaller iou-threshold.

4.2.4. Visual Comparison

The visualized results of our method on the CityPersons validation set are shown in Figure 5. The visualization threshold is set to 0.5 to remove redundant boxes from the results. For example, the third column shows the outputs of one image from the three models. As seen in the last output, the proposed method detects the four pedestrians accurately without false positives or false negatives, whereas there is one false positive in the output of the baseline model and there are two false negatives in the output of EMD R-CNN. Therefore, the proposed method reduces the number of false positives and false negatives in crowd-occlusion scenes, as expected.

4.3. Experiments on COCO2014

According to Table 3, the crowdedness of the COCO2014 dataset is relatively low, which is not the scenario the proposed method is designed for. Therefore, a significant performance gain on this dataset is not expected. The dataset is introduced to validate whether the proposed method is robust to different crowdedness levels. All the object detectors are trained on the training set of COCO2014 and evaluated on the validation set. For a fair comparison, most of the involved methods are retrained under the same settings as in Table 5.

4.3.1. Ablation Study

Table 8 shows the ablation experiments of the proposed method on the COCO2014 dataset. As seen in the second and third rows, the performance on all indexes is promoted by adding each contribution individually. SENet again plays the larger role in the increase in R, and GIoUIoG loss in the increase in P. In addition, adding the two contributions simultaneously improves all the evaluation metrics. Therefore, the SENet and the GIoUIoG loss work not only on the CityPersons dataset but also on the COCO2014 dataset.

4.3.2. Robustness Experiments

Figure 6 shows the P–R curve plotted based on the validated results of the proposed method. The areas enclosed by many curves and the R-axis are greater than 0, indicating that the AP50s of the classes represented by these curves are relatively high. Figure 7 compares the mAP@50 of other methods and YOLOv3-Occ. All curves exhibit similar trends, with mAP@50 plateauing after around the 20th epoch. However, compared to the baseline, YOLOv2, and YOLOv4, the performance of YOLOv3-Occ continues to be better, and finally, it is about 0.7%, 4.7%, and 15.4% higher than the three methods, respectively. The experiments suggest our method is also able to deal with relatively uncrowded scenes without a significant drop in performance.

4.4. Computation Cost and Limitation

Compared with YOLOv3, a limitation of the proposed method is that it is more time-consuming. Table 9 shows the time-related indexes of the two methods. After adding the proposed contributions to YOLOv3, the parameter quantity and the average training time per epoch increase slightly. The incorporation of SENet produces the increase in the parameter quantity, shown in column 2 of Table 9, while the integration of SENet and the modification of the GIoU loss cause the rise in training time, shown in column 3 of Table 9. Thus, although the proposed method improves MR−2, its training cost rises slightly. Nevertheless, although the training time of the proposed method is longer than that of YOLOv3, the inference time of the proposed method is almost the same as that of YOLOv3.
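For reference, the quantities reported in columns 2 and 4 of Table 9 can be estimated with generic PyTorch utilities such as the sketch below; this is not the authors' measurement script, and the input size and number of timing runs are assumptions.

```python
import time
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Number of trainable parameters in millions (column 2 of Table 9)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 416, 416), runs: int = 50) -> float:
    """Rough images-per-second estimate on random input (column 4 of Table 9).
    The 416x416 input size is an assumption, not a value stated in the paper."""
    model.eval()
    x = torch.randn(*input_shape)
    start = time.time()
    for _ in range(runs):
        model(x)
    return runs / (time.time() - start)
```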

5. Conclusions and Future Directions

This paper proposes a novel method for occluded pedestrian detection in crowd scenes: YOLOv3-Occ. Based on YOLOv3, SENet is integrated into the feature extraction layer, and GIoUIoG loss is proposed as the positioning loss function. The ablation results on the CityPersons dataset show that P, R, and mAP@50 improve by 2.0%, 1.8%, and 2.4%, respectively, after adding the two contributions. Experimental results also show that the MR−2 of YOLOv3-Occ compares favorably with a series of state-of-the-art methods, reaching an advanced level. Meanwhile, experiments on the COCO2014 dataset test the generality of the two contributions and validate the robustness of YOLOv3-Occ.
In summary, YOLOv3-Occ reduces the false positives and false negatives of pedestrians in crowd-occlusion scenes and is robust to various degrees of occlusion. However, YOLOv3-Occ faces new challenges, particularly poor performance in scenes of severe pedestrian occlusion and in multi-class object detection. Therefore, the next step is to analyze the reasons for this poor performance and, based on them, to find solutions, such as integrating suitable attention mechanisms at suitable positions in the network model, developing a new loss function, and improving the post-processing method.

Author Contributions

Methodology, Q.Z.; Supervision, Y.L., Y.Z., M.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bakalos, N.; Voulodimos, A.; Doulamis, N.; Doulamis, A.; Ostfeld, A.; Salomons, E.; Caubet, J.; Jimenez, V.; Li, P. Protecting water infrastructure from cyber and physical threats: Using multimodal data fusion and adaptive deep learning to monitor critical systems. IEEE Signal Process. Mag. 2019, 36, 36–48. [Google Scholar] [CrossRef]
  2. Othman, N.A.; Aydin, I. A new IoT combined body detection of people by using computer vision for security application. In Proceedings of the 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Northern Cyprus, 16–17 September 2017. [Google Scholar]
  3. Makantasis, K.; Doulamis, A.; Doulamis, N.; Psychas, K. Deep learning based human behavior recognition in industrial workflows. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016. [Google Scholar]
  4. Veres, G.; Grabner, H.; Middleton, L.; Van Gool, L. Automatic workflow monitoring in industrial environments. In Proceedings of the Computer Vision—ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010. [Google Scholar]
  5. Gao, H.; Cheng, B.; Wang, J.; Li, K.; Zhao, J.; Li, D. Object classification using CNN-based fusion of vision and lidar in autonomous vehicle environment. IEEE Trans. Ind. Inform. 2018, 14, 4224–4231. [Google Scholar] [CrossRef]
  6. Zhao, Y.; Hu, C.; Zhu, Z.; Qiu, S.; Chen, B.; Jiao, P.; Wang, F.Y. Crowd sensing intelligence for ITS: Participants, methods, and stages. IEEE Trans. Intell. Veh. 2023, 8, 3541–3546. [Google Scholar] [CrossRef]
  7. Gao, H.; Lv, C.; Zhang, T.; Zhao, H.; Jiang, L.; Zhou, J.; Liu, Y.; Huang, Y.; Han, C. A structure constraint matrix factorization framework for human behavior segmentation. IEEE Trans. Cybern. 2022, 52, 12978–12988. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, Y.; Chan, P.H.; Donzella, V. Semantic-aware video compression for automotive cameras. IEEE Trans. Intell. Veh. 2023, 8, 3712–3722. [Google Scholar] [CrossRef]
  9. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  10. Dollár, P.; Tu, Z.; Perona, P.; Belongie, S. Integral channel features. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 7–10 September 2009. [Google Scholar]
  11. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  12. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Pedhunter: Occlusion robust pedestrian detector in crowded scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 3 April 2020. [Google Scholar]
  13. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  14. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  15. Gao, H.; Su, H.; Cai, Y.; Wu, R.; Hao, Z.; Xu, Y.; Wu, W.; Wang, J.; Li, Z.; Kan, Z. Trajectory prediction of cyclist based on dynamic Bayesian network and long short-term memory model at unsignalized intersections. Sci. China Inf. Sci. 2021, 64, 172207. [Google Scholar] [CrossRef]
  16. Huang, X.; Ge, Z.; Jie, Z.; Yoshie, O. NMS by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  17. Liu, S.; Huang, D.; Wang, Y. Adaptive NMS: Refining pedestrian detection in a crowd. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  18. Chu, X.; Zheng, A.; Zhang, X.; Sun, J. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  19. Lin, M.; Li, C.; Bu, X.; Sun, M.; Lin, C.; Yan, J.; Ouyang, W.; Deng, Z. DETR for crowd pedestrian detection. arXiv 2020, arXiv:2012.06785. [Google Scholar]
  20. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Relational learning for joint head and human detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 3 April 2020. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016. [Google Scholar]
  24. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  25. Gao, H.; Zhu, J.; Zhang, T.; Xie, G.; Kan, Z.; Hao, Z.; Liu, K. Situational assessment for intelligent vehicles based on stochastic model and gaussian distributions in typical traffic scenarios. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 1426–1436. [Google Scholar] [CrossRef]
  26. Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  27. Luo, Z.; Fang, Z.; Zheng, S.; Wang, Y.; Fu, Y. NMS-loss: Learning with non-maximum suppression for crowded pedestrian detection. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021. [Google Scholar]
  28. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  29. Zhou, C.; Yuan, J. Multi-label learning of part detectors for heavily occluded pedestrian detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  30. Zhang, S.; Chen, D.; Yang, J.; Schiele, B. Guided attention in CNNs for occluded pedestrian detection and re-identification. Int. J. Comput. Vis. 2021, 129, 1875–1892. [Google Scholar] [CrossRef]
  31. Zou, T.; Yang, S.; Zhang, Y.; Ye, M. Attention guided neural network models for occluded pedestrian detection. Pattern Recognit. Lett. 2020, 131, 91–97. [Google Scholar] [CrossRef]
  32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  34. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  35. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  36. Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  37. Pang, Y.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-guided attention network for occluded pedestrian detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  38. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  40. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  41. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  42. The Official Code of Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 2 June 2020).
Figure 1. The architecture of SENet. The input is a feature map. The output is the feature map injected with the attention scores. Firstly, the GAP layer compresses the shape of the input feature map to (1, 1, C). Secondly, two fully connected layers and activation functions obtain the attention scores of all channels, as shown in the feature map marked by diverse colors. Finally, the operation of multiplying channel by channel is implemented between the input feature map and the attention scores to generate the output.
Figure 2. The architecture of the proposed YOLOv3-Occ. For an input image, the output is the image with detection boxes. The black arrow represents the data flow. CNN layers consist of a set of CBLs and basic residual blocks, which are used to extract fine-grained features of the input image. The outputs of two basic residual blocks in CNN layers are used as the inputs of two Concat layers in the model.
Figure 3. Comparison of mAP@50s among three batch sizes on the CityPersons training set. The curves show how the mAP@50s of these batch sizes change with the number of epochs. The three batch sizes and their corresponding curve colors are represented in the upper left corner of the figure.
Figure 4. Comparison of mAPs among three iou-thresholds on the CityPersons training set. The curves show how the mAPs of these iou-thresholds change with the number of epochs. The three iou-thresholds and their corresponding curve colors are represented in the upper left corner of the figure.
Figure 5. Visual comparison of the baseline, EMD-RCNN, and our approach. The first row is the results of the baseline. The second row is the results generated by EMD-RCNN. The third row is the results of YOLOv3-Occ. The blue boxes are the detection results, the white boxes are false negatives, and the yellow boxes are false positives.
Figure 6. P–R curve on one batch of the COCO validation set. There are eight classes of P–R curves and four of them coincide with the rest of the curves. The eight classes and their corresponding curve colors are represented in the lower right corner of the figure.
Figure 7. Comparison of mAP@50s among YOLOv3-Occ, Faster R-CNN with FPN [38], our baseline, RetinaNet [39], SSD523 [40], YOLOv2, YOLOv4 [41], YOLOv5 [42] on the COCO validation set. The curves show how the mAP@50s of these methods change with the number of epochs. The eight methods and their corresponding curve colors are represented in the lower right corner of the figure.
Table 1. The loss function works for pedestrian locations.
Achievements | Effect | Disadvantage
MSE Loss [21] | Euclidean distance between a prediction and a target | Drastic change in the loss
SmoothL1 Loss [22] | The l1 and l2 norms of the distance vector between a prediction and a target | Inequivalent to IoU
IoU Loss [23] | Coordinates of bounding boxes regarded as a whole | Unoptimizable when a prediction and a target are disjoint
GIoU loss [24] | The normalized area between a prediction and a target supplementing the IoU loss | Changing areas of prediction frames during the optimization of the loss
Repulsion Loss [26] | Loss of predictions overlapped with other ground truths and predictions | The unevaluated weights of two losses
NMS Loss [27] | The penalty of false positives and false negatives supplementing the loss | Only suitable for binary classification tasks
Table 2. The network model works for occluded pedestrian detection.
Achievements | Effect | Disadvantage
OR-CNN [28] | Divide pedestrians into several parts | Noise production
Multi-label Learning [29] | A set of decision trees shared by the part detectors | /
Guided Attention [30] | Channel-wise attention to pay attention to the unobstructed parts of the occludee | /
AGNN [31] | Select features representing the body parts of pedestrians | /
Table 3. Volume and overlapped extent of each dataset. The overlap of an image is the average of the overlaps of all people in the image. The overlap of a person = 1 − (the area of the visible box)/(the area of the full box).
Dataset | Size of Training Set/Imgs | Size of Validation Set/Imgs | Size of Test Set/Imgs | Overlaps per Img
CityPersons | 2975 | 500 | 1575 | 0.32
COCO2014 | 117,264 | 5000 | – | 0.015
Table 4. Annotation types of each dataset. Full bbox denotes the box of the full body of a pedestrian, visible bbox the box of visible parts of a pedestrian, and head bbox the box of a pedestrian’s head; Bbox: Bounding box. The symbol ✓ denotes that the box exists in the annotation of the dataset and the symbol ✗ denotes the non-existence of the box.
Dataset | Full Bbox | Visible Bbox | Head Bbox
CityPersons | ✓ | ✓ | ✗
COCO2014 | ✓ | ✗ | ✗
Table 5. Parameter settings: σ is the variance of the Gaussian kernel function used in the Soft-NMS; iou-threshold is the standard to judge as a true positive.
Name of Parameters | Value of Parameters
device | NVIDIA GeForce RTX 3090 (USA)
GPU memory | 24 GB
batch size | 32
epoch | 85
learning rate | 10^−3 (epoch ≤ 65); 10^−4 (epoch > 65)
momentum | 0.9
σ | 0.5
iou-threshold | 0.5
Table 6. Ablation experiments evaluated on the CityPersons validation set. The baseline model (the first line) is YOLOv3. SE—SENet. GL—GIoUIoG Loss.
SE | GL | P/% | R/% | mAP@50/%
 | | 49.7 | 50.6 | 48.1
✓ | | 50.5 | 51.4 | 49.6
 | ✓ | 50.8 | 51.2 | 49.4
✓ | ✓ | 51.7 | 52.4 | 50.5
Table 7. Comparisons of different methods on the CityPersons validation set.
Method | Backbone | MR−2/% | mAP@50/%
EMD-RCNN [18] | ResNet-50 | 10.7 | 96.1
NMS-Ped [27] | ResNet-50 | 10.1 | –
CSP [36] | ResNet-50 | 11.0 | –
Adaptive-NMS [17] | VGG-16 | 11.9 | –
MGAN [37] | VGG-16 | 11.5 | –
Ours | Darknet-53 | 10.7 | 50.5
Table 8. Ablation experiments evaluated on the COCO2014 validation set. The baseline model is YOLOv3 (the first line). SE—SENet. GL—GIoUIoG Loss.
SE | GL | P/% | R/% | mAP@50/%
 | | 48.3 | 49.6 | 47.5
✓ | | 48.8 | 50.3 | 48.1
 | ✓ | 49.3 | 50.1 | 48.6
✓ | ✓ | 50.5 | 51.9 | 49.7
Table 9. Comparison of time-related indexes in two methods on the CityPersons training and validation set. YOLOv3 is the original method without adding the proposed contributions. YOLOv3-Occ is the proposed method. M: million. S: seconds. FPS: Frame Per Second. #: The Number of.
Method | # Parameters/M | Average Training Time per Epoch/S | FPS in the Inference Process/Imgs
YOLOv3 | 61 | 1.33 | 3
YOLOv3-Occ | 62 | 2.40 | 3