Article

Residual Transformer YOLO for Detecting Multi-Scale Crowded Pedestrian

School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 12032; https://doi.org/10.3390/app132112032
Submission received: 8 October 2023 / Revised: 25 October 2023 / Accepted: 2 November 2023 / Published: 4 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Crowding and occlusion pose significant challenges for pedestrian detection and can easily lead to missed and false detections of small-scale and occluded pedestrians in dense scenes. To enhance dense pedestrian detection accuracy, we propose the Residual Transformer YOLO (RT-YOLO) algorithm in this paper. RT-YOLO enhances the multi-scale fusion strategy of YOLOv7 and introduces a dedicated detection layer for small-scale, occluded targets. It also integrates ResNet and Transformer structures to improve the small-scale feature layer and detection head, enhancing feature extraction capabilities. Additionally, RT-YOLO incorporates the Normalization-based Attention Module (NAM) into the backbone and neck networks to identify the regions of interest. Experiments demonstrate that on the CrowdHuman and WiderPerson datasets, at IOU (Intersection over Union) = 0.5, the overall improvement in mAP50 is 3.8% and 3.4%, respectively. Over the IOU range from 0.5 to 0.95, the improvement in mAP50:95 is 5.1% and 4%. RT-YOLO achieves 67 FPS, maintaining real-time performance. On the VOC2007 dataset, mAP50 is enhanced by 5.1%, indicating high effectiveness and robustness.

1. Introduction

Object detection is one of the main research directions in computer vision, where the main task is to recognize specific classes and precise coordinates of objects in images on demand [1]. With the rapid development of autonomous driving, video surveillance, and other fields, human-centered object detection has received a great deal of attention. In pedestrian detection tasks, crowded and occluded scenarios pose significant challenges. Due to factors such as the angle and distance captured by the detection lens, pedestrian targets vary in scale, and they can overlap with each other, resulting in a limited number of effective feature pixels within the detection area. This makes it difficult to distinguish boundaries, and in such cases, the detector may mistakenly identify multiple pedestrian targets as a single target or experience detection box drift. The challenges mentioned above in crowded pedestrian detection are also commonly encountered in numerous contemporary object detection tasks [2].
Pedestrian detection algorithms can be categorized into two types: traditional manual feature detection and deep learning detection algorithms. Traditional detection algorithms depend on manually designed features for object characterization. These features include Haar wavelet features [3], which combine human movement and appearance. They also encompass HOG features that describe pedestrian contours using edge direction information [4], LBP features with grayscale and rotation invariance [5], and structural SIFT features with scale invariance [5]. Traditional manual feature detection methods have greatly contributed to the development of pedestrian detection research. However, such algorithms exhibit limited robustness in complex scenarios. In recent years, due to the rapid development of deep learning, the research focus on pedestrian detection has shifted from manual features to deep learning detection algorithms.
Among deep learning-based pedestrian detection algorithms, X. Liang et al. improved Fast R-CNN [6] for pedestrian detection by designing two sub-networks for large-scale and small-scale pedestrians [7], and enhanced detection efficiency by using candidate regions extracted by the ACF detector [8]. L. Zhang et al. analyzed the performance of Faster R-CNN [9] in pedestrian detection and used a hybrid strategy that classifies the candidate regions extracted by the RPN with a random forest [10]. Y. Tian et al. addressed the occlusion issues in crowded pedestrian detection with a part-based detection scheme named DeepParts [11]. M. Hong et al. proposed SSPNet [12], which suppresses interference from complex backgrounds using features at different scales and preserves valuable pedestrian information by sharing features. S. Huang et al. [13] proposed a feature-aligned pyramid network that improves the efficiency of dense pedestrian detection by aligning up-sampled features with local features via pixel offsets. These algorithms perform well in pedestrian detection, but their feature extraction capability for small-scale and occluded pedestrians still needs to be enhanced, which makes them prone to missed and false detections.
In order to solve the false and missed detection in dense pedestrian detection, we propose Residual Transformer YOLO (RT-YOLO) based on YOLOv7 [14], and verify its performance through ablation, comparison, and generalization experiments. The main contributions of this paper can be summarized as follows:
  • We propose a new detection head for small-scale, occluded objects, named the Bottleneck Transformer Detect Head (BTDetect). It receives the two largest-scale feature maps and fully extracts detail information to achieve accurate localization of small-scale, occluded objects;
  • Based on the ResNet Bottleneck structure, we propose the Bottleneck Transformer Encoder Layer (BOTrans), which uses the self-attention of the Transformer to enhance the predictive potential of the model;
  • We combine convolutional neural networks with the Transformer structure to enhance detection performance while reducing the gradient dispersion and feature loss caused by deeper network hierarchies;
  • We propose a new multi-scale fusion strategy with dedicated feature layers for occluded and small-scale objects. It enlarges the receptive field of the network, efficiently fuses low-level detail information with high-level semantic information, and improves the generalization ability of the model;
  • By incorporating the Normalization-based Attention Module (NAM), the model focuses on regions of interest from both spatial and channel perspectives, making effective use of network parameters and reducing interference from background noise.

2. Related Work

YOLOv7 [14] is the most representative model in the YOLO series [14,15,16,17,18,19] at present, and many subsequent versions of YOLO have also drawn inspiration from its structure. Within the range of 5 FPS to 160 FPS, YOLOv7 outperforms most known object detectors in terms of both speed and accuracy. Its main structure is divided into Backbone, Neck network, and YOLO detection head, as shown in Figure 1.
The Backbone is the initial feature extraction network of YOLOv7. When a 640 × 640 image is input into the Backbone, a set of extracted feature points is obtained in the form of feature maps. The Efficient Aggregation Network (ELAN) [14], shown in Figure 2a, allows the deep network to extract features efficiently and converge reliably by controlling the shortest and longest gradient paths. Finally, the Backbone outputs feature maps at three scales: 80 × 80, 40 × 40, and 20 × 20.
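As a minimal illustration of how these scales arise (not the actual YOLOv7 backbone), the sketch below downsamples a 640 × 640 input by overall strides of 8, 16, and 32, which yields exactly the 80 × 80, 40 × 40, and 20 × 20 maps; the channel widths are placeholder values.

```python
import torch
import torch.nn as nn

# Stand-in for a backbone that halves resolution five times in total.
# Channel widths below are illustrative placeholders, not YOLOv7's real values.
def down(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.SiLU())

stem   = down(3, 32)      # 640 -> 320
stage1 = down(32, 64)     # 320 -> 160
stage2 = down(64, 128)    # 160 -> 80   (overall stride 8)
stage3 = down(128, 256)   # 80  -> 40   (overall stride 16)
stage4 = down(256, 512)   # 40  -> 20   (overall stride 32)

x  = torch.randn(1, 3, 640, 640)
p3 = stage2(stage1(stem(x)))   # torch.Size([1, 128, 80, 80])
p4 = stage3(p3)                # torch.Size([1, 256, 40, 40])
p5 = stage4(p4)                # torch.Size([1, 512, 20, 20])
print(p3.shape, p4.shape, p5.shape)
```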
The Neck is the enhanced feature extraction network of YOLOv7. The three feature maps obtained from the Backbone are fused by up-sampling in the Neck, and the fused features are then down-sampled. In this stage, the Extended Efficient Aggregation Network (E-ELAN), shown in Figure 2b, extends the channel and cardinality by introducing group convolution on top of ELAN; it applies the same parameter group and channel multiplier to all blocks of a computational layer while keeping the number of channels the same as in ELAN, further improving parameter utilization and computational efficiency. The Neck finally outputs three enhanced effective feature maps at sizes of 80 × 80, 40 × 40, and 20 × 20.
The YOLO Head is the classifier and regressor of YOLOv7. It uses RepConv [20] re-parameterization to fuse the multi-branch weights learned during training and thereby enhance detection speed [21]. The three feature maps obtained from the Neck have a width, a height, and a number of channels, effectively forming a set of feature points. The YOLO Head assesses whether an object is present at the feature points corresponding to each anchor box and then outputs the class and location information of these objects.
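The decoupling of feature points into location, objectness, and class predictions can be sketched as follows; this is a generic anchor-based YOLO-style head rather than the exact YOLOv7 implementation, and the anchor count and the two-class setting (e.g., head and person) are assumptions.

```python
import torch
import torch.nn as nn

num_classes, num_anchors = 2, 3          # assumed values (e.g., head and person classes)
pred = nn.Conv2d(512, num_anchors * (5 + num_classes), kernel_size=1)

feat = torch.randn(1, 512, 20, 20)       # one Neck output at the stride-32 scale
out = pred(feat)                         # (1, 3*(5+2), 20, 20)
# Reshape so every feature point carries, per anchor: x, y, w, h, objectness, class scores.
out = out.view(1, num_anchors, 5 + num_classes, 20, 20).permute(0, 1, 3, 4, 2)
box, obj, cls = out[..., :4], out[..., 4:5], out[..., 5:]
print(box.shape, obj.shape, cls.shape)   # (1,3,20,20,4) (1,3,20,20,1) (1,3,20,20,2)
```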
YOLOv7 possesses a deep network architecture, and it performs excellently in most application scenarios. However, as the network depth increases, it may not extract features from occluded or densely packed objects adequately, potentially leading to the loss of fine-grained details. These issues are also reflected in the context of crowded pedestrian detection.

3. The Proposed Method

In this paper, we propose the RT-YOLO algorithm, which includes a prediction head for small-scale, occluded objects, integrates the Transformer, and utilizes NAM attention to improve detection performance for objects of varying scales and degrees of occlusion in crowded pedestrian scenarios.

3.1. Residual Transformer YOLO

The structure of the RT-YOLO algorithm is shown in Figure 3 and consists of three components. The first part is the backbone feature extraction network. After the initial feature extraction from the input image, it generates four feature maps at layers 4, 6, 9, and 12. These feature maps then undergo channel adjustment through convolutional groups (Conv) and are connected to the Neck's main path for down-sampling, thereby achieving feature fusion.
The Neck enhances feature extraction by merging shallow and deep features, with the aim of combining information from different scales. The feature maps from the Backbone are stacked with the up-sampled features on the Neck's main path to achieve feature fusion, and the resulting feature maps are then down-sampled after BOTrans and ELAN to achieve feature fusion once more. Finally, the obtained feature maps are sent to the YOLO Head for decoupling.
The third part is the YOLO Head, which has four detection heads, including two prediction heads for small-scale, occluded objects, named BTDetect. These heads receive the two feature maps containing the richest detailed information. Afterward, a 1 × 1 Conv layer is applied to decouple the output into object location and category information. In addition, introducing NAM attention before the residual branches reduces the interference of irrelevant features and improves convergence efficiency without increasing computational or spatial complexity.

3.2. Bottleneck Transformer Encoder Layer and Detect Head

Transformer [22] signifies a pivotal advancement in deep learning, originally designed to enhance NLP performance. Its remarkable capabilities inspired DETR [23] to integrate it into object detection. The Transformer exhibits superior performance and potential in various vision tasks compared to traditional convolutional networks. However, it comes with high computational and spatial complexity because its self-attention computes over global features, so the cost scales directly with the number of feature pixels. To address this issue, combining the Transformer with ResNet [24] reduces the computational and spatial complexity, resulting in a more potent visual model than conventional convolutional network structures. BOTrans uses the global Multi-Head Self-Attention (MHSA) of the Transformer Encoder to replace a Conv layer of ResNet, reducing parameters and lowering latency. The structure of BOTrans is shown in Figure 4b.
ResNet typically has 4 stages with strides (4, 8, 16, 32), and BOTrans is derived by replacing the 3 groups of 3 × 3 Conv layers in stage 4 with MHSA, as detailed in Table 1. This substitution reduces the BOTrans parameter count by 18% compared to ResNet at a 1024 × 1024 input size. BOTrans maintains a structure identical to ResNet, allowing seamless integration into object detection models and enhancing their feature extraction capabilities. In RT-YOLO, BOTrans attends to various regions within the image, acquiring the relationships and context between these regions. This mechanism helps capture the global context and semantic information of objects, leading to more accurate differentiation of occluded object boundaries and precise localization of small-scale objects.
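A minimal sketch of this substitution, assuming the Table 1 channel sizes: a ResNet-style bottleneck in which the 3 × 3 convolution is replaced by multi-head self-attention. PyTorch's nn.MultiheadAttention is used here as a stand-in for the relative-position MHSA described in the next paragraphs.

```python
import torch
import torch.nn as nn

class BoTBlock(nn.Module):
    """ResNet bottleneck with the 3x3 conv replaced by self-attention (sketch)."""
    def __init__(self, c_in=2048, c_mid=512, heads=4):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(c_in, c_mid, 1, bias=False),
                                    nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        # Stand-in for the BOTrans MHSA; the paper's version adds 2D relative position encoding.
        self.mhsa = nn.MultiheadAttention(embed_dim=c_mid, num_heads=heads, batch_first=True)
        self.expand = nn.Sequential(nn.Conv2d(c_mid, c_in, 1, bias=False),
                                    nn.BatchNorm2d(c_in))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        y = self.reduce(x)                          # (B, c_mid, H, W)
        b, c, h, w = y.shape
        seq = y.flatten(2).transpose(1, 2)          # (B, H*W, c_mid) token sequence
        attn, _ = self.mhsa(seq, seq, seq)          # global self-attention over all pixels
        y = attn.transpose(1, 2).reshape(b, c, h, w)
        return self.act(self.expand(y) + identity)  # residual connection, as in ResNet

block = BoTBlock()
print(block(torch.randn(1, 2048, 32, 32)).shape)    # torch.Size([1, 2048, 32, 32])
```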
Figure 5 shows the structure of MHSA, which differs from the canonical MHSA in three main aspects. Firstly, the number of heads is reduced from 8 to 4. Secondly, the content-position encoding r is two-dimensional instead of one-dimensional, with R_h and R_w representing relative information in the vertical and horizontal directions. Thirdly, the position encoding is embedded after the MHSA layer. MHSA multiplies the input image x with the learnable parameter matrices W^Q, W^K, and W^V to obtain the corresponding query matrix q, matching matrix k, and information matrix v. In addition, relative position encoding r is used in the self-attention. By multiplying r^T with q, position-related information is obtained, thereby integrating a sense of position into the attention calculation. This enables the model to consider the spatial relationships between features and their positions, ultimately enhancing feature extraction efficiency and robustness. Subsequently, q is multiplied with k^T and summed with the position term; after softmax, the result provides weights for v, which are used to compute weighted relevance scores forming the output of the self-attention computation. MHSA computes the heads in parallel, exchanges information by sharing the parameters W, and combines the features learned by the different attention heads to obtain the final result. The multi-head computation is as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\,W$ (1)
$\mathrm{where}\ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$ (2)
Here, Equation (1) represents the result of the multi-head self-attention computation, while Equation (2) represents the computation of a single head, where i is the head index. Q, K, and V are the query, matching, and information matrices generated from the input image features, i.e., the sets of q, k, and v mentioned above. d_k is the length of the vector k; dividing by √d_k prevents the values from becoming too large after multiplying the Q and K matrices, which could otherwise result in small gradients after softmax. T denotes the matrix transpose, W is the learnable shared parameter, and n is the number of attention heads; in BOTrans, n = 4.
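Equations (1) and (2), extended with the q·r^T relative-position term described above, can be sketched as follows. This is an illustrative implementation under stated assumptions (4 heads, learnable per-axis position embeddings for a fixed feature-map size, final shared projection omitted), not the authors' code.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Multi-head self-attention with 2D relative position encoding (sketch)."""
    def __init__(self, dim, height, width, heads=4):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.scale = self.dh ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        # Learnable relative information along the vertical (R_h) and horizontal (R_w) axes.
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.dh, height, 1))
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.dh, 1, width))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = q.view(b, self.heads, self.dh, h * w)
        k = k.view(b, self.heads, self.dh, h * w)
        v = v.view(b, self.heads, self.dh, h * w)
        content = torch.einsum('bndi,bndj->bnij', q, k)                   # q k^T content term
        r = (self.rel_h + self.rel_w).view(1, self.heads, self.dh, h * w)
        position = torch.einsum('bndi,bndj->bnij', q, r.expand(b, -1, -1, -1))  # q r^T position term
        attn = torch.softmax((content + position) * self.scale, dim=-1)   # softmax(QK^T/sqrt(d_k)), Eq. (2)
        out = torch.einsum('bnij,bndj->bndi', attn, v)                    # weighted sum of v
        return out.reshape(b, c, h, w)  # heads concatenated; the shared projection W of Eq. (1) is omitted

m = MHSA2d(dim=512, height=32, width=32)
print(m(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 512, 32, 32])
```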
As shown in Figure 6, RT-YOLO replaces the IDetect head in the original structure with the Bottleneck Transformer Detect Head (BTDetect). BTDetect can globally localize small-scale and occluded objects, enhancing the predictive potential of the neural network model. As shown in Figure 3, RT-YOLO has four prediction heads at different scales. To limit algorithmic complexity and reduce memory cost, BTDetect is applied only after Feature Maps 3 and 4, which have the largest scale and the most detailed information. This selective approach ensures real-time performance for RT-YOLO while achieving precise localization of small-scale and occluded objects.

3.3. Multi-Scale Fusion Strategy

RT-YOLO combines BOTrans and employs a multi-scale early fusion strategy during training. This strategy starts with feature fusion, followed by sending the fused feature maps to the prediction head. As the depth of the network progresses, the feature map scale changes with the receptive field, which is the size of the area mapped on the original image by the pixel points on the feature map. High-level feature maps with low resolution, rich semantic information and a global receptive field are advantageous for detecting medium and large objects. Conversely, low-level feature maps with high resolution, detailed information, and a small receptive field are better suited for detecting small-scale objects. Therefore, by fusing feature maps at different scales and combining high-dimensional semantic information with low-dimensional detail information, the network characterization ability is improved. This reduces the instances of missing or falsely detecting small-scale objects.
The multi-scale fusion training in the RT-YOLO algorithm involves three steps. First, the Backbone extracts image features to create four feature maps at different scales. Next, these four feature maps are fused with the main path of the Neck and up-sampled to produce new feature maps. In the third step, E-ELAN and BOTrans enhance the new feature maps for feature extraction, and then they are down-sampled to achieve feature fusion, resulting in four feature maps of various scales. In YOLOv7, three feature maps are sent to the detection head for prediction, and one of them is dedicated to detecting small objects. In RT-YOLO, there are four feature maps in total, with two of them specifically used for detecting small objects.
As shown in Figure 3 and Figure 7, the two low-level feature maps from layers 4 and 6 of the Backbone have high resolution and rich detailed information, making them suitable for small-scale object detection. These two feature maps first have their channel numbers adjusted by a 1 × 1 convolution with a stride of 1. The output features from layer 4 are then combined with the up-sampled primary features of Neck layer 19; after enhanced feature extraction by BOTrans, Feature Map 4 is obtained. The output features from layer 6 are fused with the output features from layer 18 and then combined with Feature Map 4 to yield Feature Map 3. These feature maps combine detailed information from the low-level network with rich semantic information from the deep network and are designed specifically for detecting small-scale and occluded objects. Feature Map 2 and Feature Map 1 are obtained in the same way as in YOLOv7. To observe which object information each feature-map scale attends to, object localization information is visualized through gradient heat maps.
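A minimal sketch of one such fusion step, with illustrative tensor sizes and channel widths rather than the exact RT-YOLO layer shapes: the channel count of a low-level backbone feature is adjusted by a 1 × 1 convolution and concatenated with an up-sampled deeper Neck feature before further enhancement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lateral   = nn.Conv2d(256, 128, kernel_size=1, stride=1)  # 1x1 channel adjustment of the low-level feature
low_level = torch.randn(1, 256, 160, 160)                  # high-resolution backbone feature (layer-4 style)
deep      = torch.randn(1, 128, 80, 80)                    # deeper Neck feature (layer-19 style)

fused = torch.cat([lateral(low_level),
                   F.interpolate(deep, scale_factor=2, mode='nearest')], dim=1)
print(fused.shape)  # torch.Size([1, 256, 160, 160]); this map would then pass through BOTrans
```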
The GradCam [25] tool with a confidence threshold (conf) of 0.5 is used to visualize the gradient heat maps, as shown in Figure 8. The objects in Figure 8a include heads and human bodies, which are small-scale and partially obscured. Feature map 1 in Figure 8c lacks detailed information and primarily extracts large-scale object information in the near field. Feature map 2 in Figure 8d extracts global large-scale object information on top of Figure 8c but struggles to distinguish edge features. Feature map 3 in Figure 8e becomes more sensitive to small-scale human features after the initial fusion with low-level features, but it may not accurately locate smaller-scale head features. Feature map 4 in Figure 8f is rich in detail and semantic information, making it sensitive to small-scale objects and able to accurately locate head features in the image. In summary, the multi-scale fusion training enhances RT-YOLO's ability to detect small-scale and occluded objects while retaining its capability to detect large-scale objects.
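The idea behind such gradient heat maps can be sketched with plain PyTorch hooks; torchvision's resnet18, the chosen target layer, and the class index below are stand-in assumptions, not the RT-YOLO setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4  # assumed target layer for visualization

layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0, 0]            # score of an assumed target class
score.backward()

weights = grads['v'].mean(dim=(2, 3), keepdim=True)        # channel-wise gradient averages
cam = F.relu((weights * acts['v']).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode='bilinear', align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalized heat map in [0, 1]
print(cam.shape)                                           # torch.Size([1, 1, 224, 224])
```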

3.4. Normalization-Based Attention Module

The attention mechanism allows a neural network to concentrate on essential image areas and suppress interference from irrelevant regions; its lightweight, plug-and-play nature enhances performance at low cost. Attention mechanisms are generally categorized into spatial, channel, and hybrid domains, and CBAM [26], GAM [27], and other attention mechanisms that excel in neural networks are designed from these perspectives.
In RT-YOLO, Normalization-based Attention (NAM) considers the contribution of weight factors to attention. It employs the scaling factor of Batch Normalization to calculate these weights, eliminating the need for repeated stacking of fully connected and convolutional layers. NAM enhances RT-YOLO's focus on specific features or channels, automatically adjusting weights based on input data characteristics and network requirements, reducing interference, and enabling the model to concentrate on meaningful features. NAM adopts the modular integration approach of CBAM and introduces weight contribution factors to reconfigure the spatial and channel attention modules, so it can be seamlessly integrated directly after a network layer or residual structure. The Batch Normalization (BN) weight contribution factor is calculated as follows:
$B_{out} = \mathrm{BN}(B_{in}) = \gamma \frac{B_{in} - \mu_\beta}{\sqrt{\sigma_\beta^2 + \epsilon}} + \beta$ (3)
where μ_β and σ_β are the mean and standard deviation of the feature map B over the mini-batch, and γ and β are the trainable scale and shift parameters of BN. During model training, a larger scaling factor indicates larger variance and thus richer feature information in the corresponding channel, which is a more important region to attend to.
In RT-YOLO, NAM is added before feature fusion to improve feature extraction efficiency. Specifically, channel attention and spatial attention are computed for the input features as follows:
$y = M_s(M_c(X))$ (4)
Here, X represents the input and y represents the output. The structure of the channel attention sub-module M_c, redesigned with the weight contribution factor, is shown in Figure 9 and Equation (5).
$M_c = \mathrm{sigmoid}\left(W_\gamma\left(\mathrm{BN}(F_1)\right)\right)$ (5)
Here, γ is the scaling factor of each channel, and the weight is $W_i = \gamma_i / \sum_j \gamma_j$. After the feature map passes through the channel attention sub-module, in a process called pixel normalization, the resulting feature map is fed into the spatial attention sub-module M_s, which also incorporates weight contribution factors; its structure is shown in Figure 10 and Equation (6).
$M_s = \mathrm{sigmoid}\left(W_\lambda\left(\mathrm{BN}(F_2)\right)\right)$ (6)
Here, λ is the spatial scaling factor, and the weight is $W_i = \lambda_i / \sum_j \lambda_j$. In order to eliminate the effects of irrelevant weights, NAM adds a regularization term to its loss function, given in Equation (7).
$\mathrm{Loss} = \sum_{(x, y)} l\left(f(x, W), y\right) + p\, g(\gamma) + p\, g(\lambda)$ (7)
Here, W is the network weight, l(·) is the loss function, g(·) is the l1-norm penalty function, and p is the penalty coefficient balancing g(γ) and g(λ).
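A minimal sketch of the channel attention branch in Equations (3)–(5), assuming the publicly described NAM formulation rather than the authors' exact implementation: the absolute BN scaling factors γ are normalized into weights W_i = γ_i / Σ_j γ_j and used to re-weight the normalized feature map before the sigmoid gate; the spatial branch applies the same idea with the scaling factors λ.

```python
import torch
import torch.nn as nn

class NAMChannelAtt(nn.Module):
    """Normalization-based channel attention (sketch of Eqs. (3)-(5))."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=True)  # gamma = self.bn.weight, beta = self.bn.bias

    def forward(self, x):
        residual = x
        x = self.bn(x)                                   # Eq. (3): standard batch normalization
        gamma = self.bn.weight.abs()
        w = gamma / gamma.sum()                          # W_i = gamma_i / sum_j gamma_j
        x = x * w.view(1, -1, 1, 1)                      # per-channel re-weighting
        return torch.sigmoid(x) * residual               # Eq. (5)-style gating of the input

att = NAMChannelAtt(64)
print(att(torch.randn(2, 64, 40, 40)).shape)             # torch.Size([2, 64, 40, 40])
```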

4. Experiments and Results

4.1. Dataset and Experimental Environment

In this paper, the experimental platform is based on an Intel(R) Xeon(R) E5-1650 v3 CPU and an NVIDIA GeForce RTX 2080Ti GPU, with PyTorch 1.12.1 and CUDA 11.6. The training data is the public crowded-pedestrian dataset CrowdHuman [28], and the WiderPerson [29] dataset is used to verify the robustness of the algorithm. The CrowdHuman dataset comprises a training set of 15,000 images, a test set of 5000 images, and a validation set of 4370 images. Due to variations in the scenes, the image dimensions range from 1000 × 600 to 2000 × 1500 pixels. The training and validation sets together contain around 470,000 instances, with approximately 23 people per image. It provides three types of annotations, head, full body, and visible body, and features various occlusions. As shown in Figure 11, the "full body" annotation interferes with the precise localization of the targets, so it has been removed. The WiderPerson dataset consists of 13,382 images with approximately 400,000 annotations covering various occlusion scenarios. Image dimensions vary between 1000 × 600 and 2000 × 1500 pixels. As no official test set is provided, we randomly divided the dataset into 8000 training images, 4382 test images, and 1000 validation images. WiderPerson has five annotation types: pedestrians, riders, partially visible persons, ignore regions, and crowd. Riders and partially visible persons are less frequent, so they are grouped together with pedestrians for a more comprehensive analysis. As shown in Figure 12, the ignore-region and crowd labels are removed from the dataset because they do not match the experimental requirements.
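The random 8000/4382/1000 split of WiderPerson described above can be reproduced along the following lines; the directory layout, file extension, and output list files are assumptions.

```python
import random
from pathlib import Path

random.seed(0)                                              # fix the shuffle so the split is reproducible
images = sorted(Path("WiderPerson/Images").glob("*.jpg"))   # assumed directory layout
random.shuffle(images)

train, test, val = images[:8000], images[8000:12382], images[12382:13382]
for name, subset in [("train", train), ("test", test), ("val", val)]:
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in subset))
print(len(train), len(test), len(val))                      # 8000 4382 1000 for the full dataset
```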

4.2. Evaluation Metrics and Experimental Details

In this paper, the experimental evaluation metrics include FPS (frames/s, representing the detection speed of the algorithm), Precision, Recall, and mAP (mean average precision). The F1 score is the harmonic mean of precision and recall and is used to assess a model's ability to classify positive and negative samples. GFLOPS (giga floating-point operations) measures the computational complexity of a neural network model and is typically considered together with hardware performance to evaluate computational requirements and speed. IOU (Intersection over Union) represents the overlap between the ground-truth box and the prediction box. The formulas are as follows:
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (8)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (9)
$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (10)
$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i$ (11)
where TP denotes positive samples that are correctly detected, FP denotes negative samples that are incorrectly classified as positive, and FN denotes positive samples that are incorrectly classified as negative (missed detections). k is the number of categories, and AP_i is the average precision of category i.
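As a quick numerical illustration of Equations (8)–(11), with hypothetical detection counts rather than results from the paper:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for one class at IOU = 0.5
p, r = precision(tp=80, fp=10), recall(tp=80, fn=20)
print(round(p, 3), round(r, 3), round(f1(p, r), 3))     # 0.889 0.8 0.842
print(round(mean_ap([0.877, 0.860]), 4))                # mAP over two classes, e.g. head and person
```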
In all experiments in this section, the initial learning rate is lr = 0.01, lr is adjusted by cosine annealing during training, and the model weights are optimized with SGD. mAP is reported as mAP50 (mean average precision at 0.5 IOU) and mAP50:95 (mean average precision averaged over IOU thresholds from 0.5 to 0.95). mAP50 uses the common IOU threshold of 0.5 to decide whether a detection counts, while mAP50:95 provides a more comprehensive performance assessment because it considers multiple IOU thresholds.
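The optimizer setup described above can be sketched in PyTorch as follows; the placeholder model, momentum, weight decay, and the 200-epoch horizon from Section 4.3 are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)                      # placeholder for the RT-YOLO network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=5e-4)   # assumed hyper-parameters
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    # ... one training epoch over CrowdHuman would go here ...
    optimizer.step()                             # placeholder step so the sketch runs
    scheduler.step()                             # cosine-annealed learning rate per epoch
print(scheduler.get_last_lr())
```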

4.3. Comparison Experiment

In order to verify the detection accuracy and real-time performance of the RT-YOLO algorithm, we compare it with classical and state-of-the-art algorithms on CrowdHuman in terms of FPS, mAP, GFLOPS, and number of parameters. Since the CrowdHuman training set is substantial, none of the algorithms use pre-trained weights. Each algorithm is trained for 200 epochs with an input size of 640 × 640, and the best training weights are used to compute results on the CrowdHuman test set. The results are shown in Table 2 and Figure 13.
In Table 2, RT-YOLO includes extra computation modules, resulting in slightly lower speed than YOLOv7 and similar speed to the state-of-the-art YOLOv8-l. However, it still reaches 67 FPS, ensuring real-time detection, with a 3.8% higher mAP50 and a 4.9% higher mAP50:95. In terms of model complexity, RT-YOLO has 46.1 fewer GFLOPS and 4.1 million fewer parameters than the latest YOLOv8-l. Compared to YOLOv7, both the parameter count and complexity increase, but accuracy improves significantly. Figure 13 shows the PR curves enclosed by Precision and Recall. RT-YOLO attains the highest Recall, indicating the lowest rate of missed pedestrian detections, surpassing the other algorithms. In addition, RT-YOLO is deployed on the RTX 2080Ti with TensorRT FP32 precision acceleration. With a slight reduction in mAP, the FPS increases to 84, so the model retains its practical value.
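As a hedged illustration of this deployment path, a common first step toward a TensorRT engine is exporting the trained model to ONNX; the model object and file names below are placeholders, while the 640 × 640 input follows the experimental setting.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3, padding=1).eval()     # placeholder for the trained RT-YOLO model
dummy = torch.randn(1, 3, 640, 640)               # 640 x 640 input, as in the experiments

torch.onnx.export(model, dummy, "rt_yolo.onnx",
                  opset_version=12,
                  input_names=["images"], output_names=["predictions"])
# The resulting ONNX file can then be converted to a TensorRT FP32 engine,
# e.g. with NVIDIA's trtexec tool: trtexec --onnx=rt_yolo.onnx
```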
YOLOv7, YOLOv8, and RT-YOLO, which perform well in the experiments, are applied to real-life scenes and compared in terms of detection effect. The detection results are shown in Figure 14. The results show that RT-YOLO excels in practical applications, particularly in multi-scale feature fusion, and enhances the detection of small-scale and occluded objects. In Figure 14I, RT-YOLO successfully detects distant pedestrians that are missed by YOLOv7 and YOLOv8-l; in Figure 14II, RT-YOLO successfully detects crowded pedestrians in the upper part of the image that are missed by YOLOv7 and YOLOv8-l; and in Figure 14III, where the objects in the image are blurred and difficult to identify, RT-YOLO successfully detects a greater number of distant and near-field pedestrians.
In summary, RT-YOLO effectively distinguishes occluded pedestrian boundaries, preserves features of small-scale pedestrians through multi-scale fusion, significantly enhances the detection accuracy of crowded pedestrians, and maintains real-time performance.

4.4. Ablation Experiment

In order to verify the contribution of each improved component, combinations of the components are trained in ablation experiments and then tested on the CrowdHuman and WiderPerson datasets. In the tables, higher GFLOPS indicates greater model complexity, FPS is the detection speed of the algorithm, Params represents the model size, and Layers refers to the network depth. The F1 score ranges from 0 to 1 and measures classification ability.
Analyzing Table 3, Table 4 and Table 5, RT-YOLO, compared to the original algorithm, exhibits an increase in model complexity of 12.6 GFLOPS and an additional 2 million parameters, which reduces speed by 14 FPS. Notably, on CrowdHuman the overall mAP50 improves by 3.8% and mAP50:95 improves by 5.1%; on WiderPerson it achieves a 3.4% overall mAP50 improvement and a 4% improvement in mAP50:95. The F1 score also improves on both datasets. In summary, RT-YOLO increases model complexity but significantly improves detection and classification performance.
In the CrowdHuman dataset, most person labels have dimensions ranging from approximately 50 × 100 to 150 × 400, while head labels have dimensions between 20 × 20 and 150 × 200. These represent two distinct object scales, with person labels posing challenges related to occlusion and head labels posing challenges related to smaller-scale objects. According to the results in Table 6, RT-YOLO exhibits significant improvements for both label categories. In the head category, the overall mAP50 improves by 3.5% and mAP50:95 improves by 4.5%. In the person category, there is an overall 4.1% improvement in mAP50 and a 5.3% improvement in mAP50:95.
Considering the above tables, we analyze the effects of different components on the performance of RT-YOLO from three aspects:
  • Effect of the multi-scale fusion strategy with BOTrans: The addition of a BOTrans object detection layer increases the number of network layers from 415 to 509 and GFLOPS from 106.5 to 119.5. The FPS decreases from 81 to 67 but still meets real-time requirements. On CrowdHuman, the overall mAP50 improves by 2.7%, and mAP50:95, which better reflects overall performance, improves by 3.5%; on WiderPerson, the improvements are 2.2% and 2.5%, respectively. In the BOTrans detection layer, self-attention efficiently distinguishes occluded pedestrian boundaries, reducing missed and false detections. During model training, shallow-level details are combined with deep semantic information, alleviating the loss of effective features caused by increasing network depth. The resulting feature maps contain high-level semantic information and shallow-level details, improving the efficiency of detecting small-scale and occluded pedestrians. The additional detection layer increases the overall depth and complexity of the network, but it brings a significant mAP improvement and provides the basis for BTDetect, so the cost is acceptable.
  • Effect of the BTDetect prediction head: The two large-scale feature maps output by the Neck are passed into BTDetect. The Bottleneck residual structure of BTDetect alleviates gradient vanishing and the loss of feature information, and the Transformer efficiently captures detailed information from the larger-scale feature maps, focusing on small-scale and obscured object features, which effectively enhances detection performance. Finally, BTDetect divides the feature map into grids and decouples each cell to predict object positions and categories. The overall mAP50 improvement is 0.8% on CrowdHuman and 0.7% on WiderPerson; the mAP50:95 improvement is 1.2% on CrowdHuman and 1.3% on WiderPerson.
  • Effect of NAM attention: NAM attention is a lightweight, plug-and-play module that adds no appreciable complexity. In RT-YOLO, NAM computes weighted attention over the channel and spatial dimensions, which reduces the influence of irrelevant regions in the feature map. The overall mAP50 improvement is 0.3% on CrowdHuman and 0.5% on WiderPerson; the mAP50:95 improvement is 0.4% on CrowdHuman and 0.2% on WiderPerson.
As shown in Figure 15, YOLOv7 has some degree of missed detection in various complex scenes. RT-YOLO is able to fully extract object feature information, achieving accurate localization and classification of small-scale, obscured objects.
In summary, RT-YOLO increases the depth of the network within the acceptable range of computational and complexity increments. This enhances the high-level features with richer semantic information, while the multi-scale fusion training effectively fuses rich details from low-level features into feature maps, compensating for the loss of detailed features caused by the deep network layers and thereby improving the algorithm’s feature extraction capability.

4.5. Validity Verification

To verify the effectiveness and robustness of the RT-YOLO algorithm through generalization experiments, we use Pascal VOC2007, which contains 20 categories and can be used to assess the performance of each algorithm. RT-YOLO is compared with other algorithms, and the experimental results are presented in Figure 16 and Table 7. According to the results, RT-YOLO performs excellently on Pascal VOC2007, surpassing the classical object detection algorithms in all categories, and achieves a 5.1% improvement in mAP50 compared to YOLOv7.
In RT-YOLO, BOTrans performs multi-head self-attention computations, handling different features and channels in parallel. This enables the identification of features for small-scale and occluded objects, as well as the extraction of features for large-scale objects. RT-YOLO integrates scale-diverse features across different network layers. It directs the network’s focus to effective features with NAM. These components notably enhance the robustness and generalization performance of RT-YOLO.
The generalization and robustness of RT-YOLO give it practical application value in currently popular fields. In video surveillance [33], detecting objects at various scales and under occlusion is a significant challenge, and all subsequent region or behavior analysis relies on successful object detection. In the field of autonomous driving [34], detectors must be efficient and fast, demanding high-performance, lightweight algorithms. RT-YOLO's performance serves as a reference for further research and for lightweight performance improvements.

5. Conclusions

This paper introduces the RT-YOLO algorithm, aimed at enhancing the detection accuracy of small-scale and obscured objects in crowded pedestrian scenarios. The BOTrans module combines a convolutional network with the Transformer structure and replaces the E-ELAN structure in the original network; it extracts features globally and integrates contextual information to distinguish crowded and occluded objects. Building on this, the multi-scale fusion strategy is redesigned with dedicated network layers for small-scale and occluded objects, giving them feature maps with receptive fields of different scales. This design ensures that the feature maps carry the semantic information of high-level features while retaining the detailed information of low-level features. RT-YOLO also introduces the NAM attention mechanism, allowing the network to focus efficiently on the object regions of interest and thereby improving network parameter utilization. Finally, BTDetect decouples the largest-scale feature maps to obtain classification and location information, resulting in precise object localization.
On the filtered CrowdHuman and WiderPerson datasets, experimental results demonstrate that RT-YOLO improves mAP50 by 3.8% and 3.4% and mAP50:95 by 5.1% and 4% compared to YOLOv7. Generalization experiments on Pascal VOC2007 validate its robustness and effectiveness, showing a 5.1% improvement in mAP over YOLOv7. Although the additional modules slightly reduce speed and increase complexity, RT-YOLO still meets real-time requirements. The experimental results also show that RT-YOLO has room for improvement in lightweighting and loss functions. For example, using dilated convolutions instead of regular convolutions could reduce model complexity, although it may cause a performance drop, and the loss allocation strategy could be adjusted with Focal Loss or by pairing GIOU and CIOU loss functions according to the object type. Subsequent work will continue to pursue better algorithmic performance.

Author Contributions

Conceptualization, H.Y. and Y.W.; methodology, H.Y.; software, H.Y.; validation, H.Y. and Y.W.; writing—review and editing, H.Y. and Y.W.; visualization, H.Y.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shaanxi Province, China (Nos. 2020JM-499 and 2020JQ-684) and the National Natural Science Foundation of China (No. 61803294).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

The source of all experimental data in the manuscript is open-access datasets publicly available for pedestrian-related research, with corresponding references to the datasets: https://arxiv.org/abs/1805.00123 and https://paperswithcode.com/dataset/widerperson, accessed on 25 October 2023.

Data Availability Statement

Data available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2019, 111, 257–276. [Google Scholar] [CrossRef]
  2. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C.H. Deep Learning for Person Re-Identification: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2872–2893. [Google Scholar] [CrossRef] [PubMed]
  3. Viola, P.A.; Jones, M.J. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001. [Google Scholar]
  4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  5. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  6. Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  7. Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-Aware Fast R-CNN for Pedestrian Detection. IEEE Trans. Multimed. 2015, 20, 985–996. [Google Scholar]
  8. Yang, B.; Yan, J.; Lei, Z.; Li, S. Aggregate Channel Features for Multi-View Face Detection. In Proceedings of the 2022 56th Asilomar Conference on Signals, Systems, and Computers 2014, Pacific Grove, CA, USA, 31 October–2 November 2022; pp. 1–8. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, L.; Lin, L.; Liang, X.; He, K. Is Faster R-CNN Doing Well for Pedestrian Detection? In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume 2, pp. 443–457. [Google Scholar]
  11. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Deep Learning Strong Parts for Pedestrian Detection. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1904–1912. [Google Scholar]
  12. Hong, M.; Li, S.; Yang, Y.; Zhu, F.; Zhao, Q.; Lu, L. Sspnet: Scale selection pyramid network for tiny person detection from uav images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  13. Huang, S.; Lu, Z.; Cheng, R.; He, C. FaPN: Feature-Aligned Pyramid Network for Dense Image Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 864–873. [Google Scholar]
  14. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  15. Althoff, L.; Farias, M.C.; Weigang, L. Once Learning for Looking and Identifying Based on YOLO-v5 Object Detection. In Proceedings of the Brazilian Symposium on Multimedia and the Web, Curitiba, Brazil, 7–11 November 2022; pp. 298–304. [Google Scholar]
  16. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. Comput. Vis. Pattern Recognit. 2018, 1804, 1–6. [Google Scholar]
  17. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934v1. [Google Scholar]
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430v2. [Google Scholar]
  19. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 779–788. [Google Scholar]
  20. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13728–13737. [Google Scholar]
  21. Bello, I.; Fedus, W.; Du, X.; Cubuk, E.D.; Srinivas, A.; Lin, T.-Y.; Shlens, J.; Zoph, B. Revisiting resnets: Improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2021, 34, 22614–22627. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  23. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2016, 128, 336–359. [Google Scholar] [CrossRef]
  26. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  27. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  28. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar]
  29. Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimed. 2019, 22, 380–393. [Google Scholar] [CrossRef]
  30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume 1, pp. 21–37. [Google Scholar]
  31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. Chen, W.; Xu, X.; Jia, J.; Luo, H.; Wang, Y.; Wang, F.; Jin, R.; Sun, X. Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15050–15061. [Google Scholar]
  33. Berroukham, A.; Housni, K.; Lahraichi, M.; Boulfrifi, I. Deep learning-based methods for anomaly detection in video surveillance: A review. Bull. Electr. Eng. Inform. 2023, 12, 314–327. [Google Scholar] [CrossRef]
  34. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W. Planning-Oriented Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17853–17862. [Google Scholar]
Figure 1. YOLOv7 network structure.
Figure 2. The ELAN and E-ELAN structures.
Figure 3. The complete structure of RT-YOLO.
Figure 4. The structure of the Bottleneck layer.
Figure 5. Multi-head Self-Attention in BOTrans.
Figure 6. The structure of Detect.
Figure 7. Multi-scale fusion process.
Figure 8. Gradient heat maps of the feature maps.
Figure 9. Channel attention sub-module.
Figure 10. Spatial attention sub-module.
Figure 11. CrowdHuman annotation comparison.
Figure 12. Deleted annotations in WiderPerson.
Figure 13. The comparison of algorithm parameters.
Figure 14. RT-YOLO and classical algorithm detection effect comparison.
Figure 15. The detection results.
Figure 16. mAP of the various categories.
Table 1. The parameters of ResNet and BOTrans.
Stage | Output | ResNet | BOTrans
0 | 512 × 512 | 7 × 7 Conv, 64, s = 2; 3 × 3 max pool, s = 2 | 7 × 7 Conv, 64, s = 2; 3 × 3 max pool, s = 2
1 | 256 × 256 | {[1 × 1 Conv, 64] [3 × 3 Conv, 64] [1 × 1 Conv, 256]} × 3 | {[1 × 1 Conv, 64] [3 × 3 Conv, 64] [1 × 1 Conv, 256]} × 3
2 | 128 × 128 | {[1 × 1 Conv, 128] [3 × 3 Conv, 128] [1 × 1 Conv, 512]} × 3 | {[1 × 1 Conv, 128] [3 × 3 Conv, 128] [1 × 1 Conv, 512]} × 3
3 | 64 × 64 | {[1 × 1 Conv, 256] [3 × 3 Conv, 256] [1 × 1 Conv, 1024]} × 3 | {[1 × 1 Conv, 256] [3 × 3 Conv, 256] [1 × 1 Conv, 1024]} × 3
4 | 32 × 32 | {[1 × 1 Conv, 512] [3 × 3 Conv, 512] [1 × 1 Conv, 2048]} × 3 | {[1 × 1 Conv, 512] [MHSA, 512] [1 × 1 Conv, 2048]} × 3
Table 2. Algorithm Comparison Results.
Methods | Params | GFLOPS | mAP50 (%) | mAP50:95 (%) | FPS
SSD [30] | 34.3 M | 386.2 | 69.6 | 35.5 | 41
RetinaNet [31] | 45.7 M | 218.3 | 76.1 | 45.7 | 50
Faster-RCNN [9] | 41.5 M | 207.1 | 79.3 | 46.6 | 52
SOLIDER [32] | 43.9 M | 259.5 | 82.2 | - | 19
YOLOv5-l | 46.5 M | 109.1 | 80.2 | 50.4 | 71
YOLOX-l [18] | 54.2 M | 155.6 | 81.0 | 49.0 | 45
YOLOv7 | 37.6 M | 106.5 | 83.1 | 52.3 | 81
YOLOv8-l | 43.7 M | 165.2 | 86.3 | 56.5 | 68
RT-YOLO (ours) | 39.6 M | 119.1 | 86.9 | 57.2 | 67
RT-YOLO (TensorRT) | - | - | 86.2 | 55.9 | 84
Table 3. RT-YOLO network parameters.
Methods | Layers | Params | GFLOPS | FPS
YOLOv7 | 415 | 37.6 M | 106.5 | 81
YOLOv7 + multi-scale | 509 | 37.8 M | 119.5 | 65
Previous + 2 × BTDetect | 535 | 38.0 M | 119.0 | 68
RT-YOLO (previous + NAM) | 550 | 39.6 M | 119.1 | 67
Table 4. CrowdHuman experimental results.
Methods | F1 | mAP50 (%) | mAP50:95 (%)
YOLOv7 | 0.80 | 83.1 | 52.3
YOLOv7 + multi-scale | 0.83 | 85.8 (+2.7) | 55.8 (+3.5)
Previous + 2 × BTDetect | 0.85 | 86.6 (+0.8) | 57.0 (+1.2)
RT-YOLO (previous + NAM) | 0.85 | 86.9 (+0.3) | 57.4 (+0.4)
Table 5. WiderPerson experimental results.
Methods | F1 | mAP50 (%) | mAP50:95 (%)
YOLOv7 | 0.72 | 76.7 | 45.6
YOLOv7 + multi-scale | 0.74 | 78.9 (+2.2) | 48.1 (+2.5)
Previous + 2 × BTDetect | 0.76 | 79.6 (+0.7) | 49.4 (+1.3)
RT-YOLO (previous + NAM) | 0.78 | 80.1 (+0.5) | 49.6 (+0.2)
Table 6. CrowdHuman experimental results for different scale objects.
Methods | mAP50 (%) Head | mAP50 (%) Person | mAP50:95 (%) Head | mAP50:95 (%) Person
YOLOv7 | 84.2 | 81.9 | 51.7 | 52.6
YOLOv7 + multi-scale | 86.7 (+2.5) | 84.8 (+2.9) | 54.8 (+3.1) | 56.0 (+3.4)
Previous + 2 × BTDetect | 87.5 (+0.8) | 85.8 (+1.0) | 56.0 (+1.2) | 57.6 (+1.6)
RT-YOLO (previous + NAM) | 87.7 (+0.2) | 86.0 (+0.2) | 56.2 (+0.2) | 57.9 (+0.3)
Table 7. Generalization experiment results.
Methods | mAP50 (%) | Aero | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow
Faster-RCNN [9] | 73.3 | 76.5 | 79.1 | 70.1 | 65.5 | 52.2 | 83.2 | 84.7 | 86.5 | 52.1 | 81.9
SSD [30] | 74.3 | 75.5 | 80.2 | 72.3 | 66.3 | 47.6 | 83.1 | 84.2 | 86.1 | 54.7 | 78.3
YOLOv5-l | 82.2 | 91.9 | 95.5 | 80.2 | 84.2 | 79.5 | 99.7 | 91.6 | 95.8 | 65.3 | 85.4
YOLOX-l [18] | 84.4 | 90.3 | 94.1 | 86.1 | 81.4 | 76.4 | 99.5 | 94.4 | 95.1 | 74.4 | 92.5
YOLOV7 [14] | 89.2 | 93.8 | 98.5 | 82.5 | 90.6 | 78.9 | 100 | 94.4 | 98.6 | 76.5 | 97.5
RT-YOLO | 94.3 | 97.9 | 99.1 | 96.9 | 98.1 | 89.2 | 100 | 98.5 | 99.6 | 90.1 | 97.5
Methods | mAP50 (%) | Table | Dog | Horse | Moto | Person | Plant | Sheep | Sofa | Train | Tv
Faster-RCNN | 73.3 | 65.7 | 84.8 | 84.6 | 77.5 | 76.7 | 38.9 | 73.7 | 73.9 | 83.1 | 72.6
SSD | 74.3 | 73.9 | 84.5 | 85.3 | 82.6 | 76.2 | 48.6 | 73.9 | 76 | 83.4 | 74
YOLOv5-l | 82.2 | 59.7 | 91.4 | 83.1 | 90 | 85.8 | 54.4 | 68.8 | 79.1 | 91.2 | 71.1
YOLOX-l | 84.4 | 64.0 | 88.4 | 86.2 | 90.9 | 88.2 | 63.1 | 69.2 | 77.95 | 95.5 | 80.3
YOLOV7 | 89.2 | 75.2 | 91.8 | 91.4 | 93.3 | 89.4 | 57.6 | 82.5 | 88.9 | 98.2 | 88.1
RT-YOLO | 94.3 | 88 | 96.2 | 93.9 | 98.4 | 96.2 | 78 | 89.5 | 92.7 | 97.8 | 94.5
