The network structure of the robust lightweight pedestrian detection algorithm, RSTDet-Lite, is depicted in
Figure 2, utilizing a modified GhostNet network as the backbone network and a Simple-BiFPN network as the feature fusion network. Additionally, an attention mechanism, the CBAM module, is inserted between the backbone network and the feature fusion network to enhance feature selection. Finally, a REP module is introduced in front of the YOLO detection head to increase network complexity during training and improve detection performance without slowing inference.
3.1. Design of Backbone Network
The efficient and lightweight GhostNet is chosen as the base structure for the backbone network in this paper. The design philosophy behind the YOLOv5 backbone shares similarities with YOLOv4, as both draw inspiration from the CSP network architecture. While the CSP network yields commendable performance, its intricate and redundant parameters lead to sluggish detection speeds, underutilized hardware capabilities, and demanding hardware requirements. We therefore chose the more efficient GhostNet. Many efficient neural network models generate a large number of feature maps, yet many of these features are highly similar. The Ghost module exploits this observation: it uses a simple convolution followed by cheap linear operations to obtain more features at a smaller computational cost, reducing the number of convolution filters needed to generate the feature maps and thereby improving efficiency.
Generally speaking, for input data $X \in \mathbb{R}^{c \times h \times w}$, where $c$ represents the number of input channels and $h$ and $w$ denote the height and width of the input data, respectively, the convolution operation at any layer can be described as Equation (1):

$$Y = X * f + b \tag{1}$$

where $*$ represents the normal convolution operation, $b$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map, $n$ is the number of channels of the output feature, $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution filter of this layer, and $k \times k$ is the size of the convolution kernel. The output feature map usually contains a large number of repeated features, and some of these features are highly similar to others, so a simpler convolution can be used to generate the intrinsic features. Cheap operations can then be applied to the intrinsic features to obtain more features; this convolution can be described as Equation (2):

$$Y' = X * f' \tag{2}$$

where $f' \in \mathbb{R}^{c \times k \times k \times m}$ in Equation (2). Compared with Equation (1), Equation (2) removes the bias term in order to reduce the computational effort, and the number of output channels becomes $m$ ($m \leq n$). To bring the number of channels back to $n$, a series of $s$ linear operations with a small number of parameters is applied to each intrinsic feature in $Y'$ (so that $n = m \cdot s$). The expression is shown in Equation (3):

$$y_{ij} = \Phi_{i,j}(y'_i), \quad \forall i = 1, \dots, m, \quad j = 1, \dots, s \tag{3}$$

where $y'_i$ denotes the $i$-th intrinsic feature in $Y'$ and $\Phi_{i,j}$ is the $j$-th linear operation (except the last one), which generates the $j$-th feature map $y_{ij}$. The last linear operation $\Phi_{i,s}$ is the identity mapping, which preserves the intrinsic feature map alongside its linear transformations, as depicted in Figure 3. The final feature maps obtained from Equation (3) are used as the output data of the Ghost module.
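To make this concrete, the following is a minimal PyTorch sketch of a Ghost module implementing Equations (2) and (3); the hyperparameters (`ratio` playing the role of $s$, `dw_size` for the cheap depthwise kernel) follow the public GhostNet reference implementation and are assumptions rather than RSTDet-Lite's exact settings.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: a primary convolution produces m intrinsic
    features (Equation (2)); cheap depthwise "linear" operations expand them
    toward n = m * s channels (Equation (3))."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = -(-out_ch // ratio)            # m = ceil(n / s): intrinsic channels
        cheap_ch = init_ch * (ratio - 1)         # channels produced by the cheap ops
        self.out_ch = out_ch
        # Primary (ordinary) convolution without bias, cf. Equation (2)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True))
        # Cheap linear operations: depthwise convolutions, cf. Equation (3)
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                      # intrinsic features Y'
        z = self.cheap(y)                        # ghost features from cheap ops
        # Identity mapping keeps Y' itself; truncate to exactly n channels
        return torch.cat([y, z], dim=1)[:, :self.out_ch]
```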
The Ghost bottleneck layer is a critical component for enhancing network performance, as illustrated in
Figure 4. Structurally, it resembles the basic residual module in ResNet [23], featuring a residual connection and two Ghost modules: the first expands the number of channels and the second reduces them to match the residual connection. This structure is suitable when the stride is set to 1. When the stride is set to 2, inserting a depthwise convolution with a stride of 2 between the two Ghost modules effectively reduces the impact of geometric feature variations, thereby improving the results.
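Reusing the `GhostModule` sketch above, a Ghost bottleneck might be assembled as follows; the shortcut layout for stride 2 is an assumption modeled on the GhostNet paper rather than this paper's exact design.

```python
class GhostBottleneck(nn.Module):
    """Sketch of the Ghost bottleneck: the first Ghost module expands the
    channels, the second reduces them to match the residual connection; with
    stride 2, a stride-2 depthwise convolution is inserted between them."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        layers = [GhostModule(in_ch, mid_ch)]                 # channel expansion
        if stride == 2:
            layers.append(nn.Sequential(                      # depthwise conv, stride 2
                nn.Conv2d(mid_ch, mid_ch, 3, 2, 1, groups=mid_ch, bias=False),
                nn.BatchNorm2d(mid_ch)))
        layers.append(GhostModule(mid_ch, out_ch))            # channel reduction
        self.main = nn.Sequential(*layers)
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:                                                 # downsample the residual edge
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```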
Despite performing well in standard detection tasks, GhostNet faces challenges in complex detection scenarios, particularly with regard to overlapping pedestrians and object occlusion. To address these challenges, we incorporated the compact bilinear pooling (CBP) algorithm [
24] to enhance network feature fusion and employed the ELU activation function to accelerate network convergence and improve feature expression capabilities. Additionally, we redesigned the bottleneck layer structure of the network.
The CBP algorithm is an extension of the bilinear pooling method [25]. Bilinear pooling captures pairwise feature interactions but produces a very high-dimensional output; the compact bilinear pooling algorithm addresses this challenge through three main steps: feature mapping, bilinear pooling, and compression. In the feature mapping step, two input vectors A and B, with dimensions $d_1$ and $d_2$, are mapped. Bilinear pooling, the core concept of CBP, computes the product of each feature in A with each feature in B and aggregates these products, which results in a high-dimensional feature vector of size $d_1 \times d_2$. To manage the potentially large number of products and reduce computational complexity, CBP employs a compression technique that compresses the high-dimensional feature vector into a lower-dimensional representation of fixed size, suitable for subsequent tasks. This compression step improves computational efficiency.
CBP excels at capturing high-order feature interactions between input vectors, making it highly valuable in various computer vision and natural language processing tasks. It can significantly enhance performance, especially in tasks that require high-dimensional feature interactions, while reducing computational costs. The compact bilinear pooling algorithm approximates the bilinear pooling method by using a low-dimensional polynomial kernel mapping and extends the bilinear pooling method by employing tensor sketch algorithms for model compression. This approach effectively reduces feature dimensions and computational expenses without compromising the effectiveness of feature fusion.
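To illustrate the three steps, here is a minimal sketch of compact bilinear pooling using the Count Sketch/Tensor Sketch construction of Gao et al. [24]; the output dimension `d` and the seeding are illustrative choices, and a practical implementation would store `h` and `s` as fixed buffers of a module.

```python
import torch

def count_sketch(x, h, s, d):
    """Count Sketch (feature mapping step): scatter signed features of
    x (batch, c) into d bins given bin indices h (c,) and signs s (c,)."""
    out = torch.zeros(x.size(0), d, device=x.device)
    out.index_add_(1, h, x * s)
    return out

def compact_bilinear_pool(a, b, d=1024, seed=0):
    """Approximate the d1*d2 outer product of a and b (bilinear pooling step)
    by circular convolution of their Count Sketches, computed in the Fourier
    domain (compression step), yielding a d-dimensional descriptor."""
    g = torch.Generator().manual_seed(seed)
    h1 = torch.randint(0, d, (a.size(1),), generator=g)
    h2 = torch.randint(0, d, (b.size(1),), generator=g)
    s1 = torch.randint(0, 2, (a.size(1),), generator=g).float() * 2 - 1
    s2 = torch.randint(0, 2, (b.size(1),), generator=g).float() * 2 - 1
    fa = torch.fft.rfft(count_sketch(a, h1, s1, d))
    fb = torch.fft.rfft(count_sketch(b, h2, s2, d))
    return torch.fft.irfft(fa * fb, n=d)    # ~ sketch of the outer product, d << d1*d2

# Example: fuse two 512-dimensional feature vectors into a 1024-d descriptor
y = compact_bilinear_pool(torch.randn(4, 512), torch.randn(4, 512))
```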
To minimize computational costs, the proposed CBP-G network features two structures: CBP-G with stride 1 simply modifies the activation function of G-bottleneck, whereas CBP-G with stride 2 employs the compact bilinear pooling algorithm, which more effectively integrates residual edges and Ghost modules and uses the more efficient ELU activation function. The architecture of CBP-G with stride 2 is illustrated in
Figure 5, with CBP denoting the compact bilinear pooling structure.
The design of CBP-GNet maintains the lightweight characteristics of GhostNet while enhancing the model’s ability to capture and integrate features from various network layers. This contributes to a more comprehensive understanding of pedestrian appearances under adverse weather conditions. This improved feature fusion effectively combines contextual information, spatial relationships, and scale-related details, which is crucial for accurate pedestrian detection in adverse weather conditions.
3.2. Proposed Simple-BiFPN Structure
The BiFPN structure, proposed by the Google team in EfficientDet [
26] and illustrated in
Figure 6c, is an efficient weighted bidirectional feature pyramid network. When examining the PANet utilized in YOLOv5 (
Figure 6a), it becomes apparent that node A and node B contribute little to feature fusion because each receives only a single input edge. Consequently, BiFPN removes these nodes to minimize redundant parameters. Furthermore, by introducing an extra skip connection within the same feature dimension, more features can be fused without additional computational overhead. Lastly, improving on the single-path structure of NAS-FPN [27], BiFPN integrates the top-down and bottom-up paths into a single feature network layer.
The solution proposed by BiFPN is to add a learnable weight to each input and let the network evaluate the importance of each input and assign weights accordingly. For weight fusion, BiFPN uses a fast normalized fusion strategy to ensure that the output feature representation is of high quality, as described in Equation (4):

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \tag{4}$$

where $I_i$ is the $i$-th input feature, each weight $w_i \geq 0$ is enforced by a ReLU applied to the learned weight, and $\epsilon$ is a small constant that avoids numerical instability.
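A minimal PyTorch sketch of this fusion rule (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Sketch of BiFPN's fast normalized fusion, Equation (4): each input is
    given a learnable weight, clamped non-negative by ReLU and normalized by
    the sum of all weights plus a small epsilon."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):              # inputs: list of same-shape tensors
        w = torch.relu(self.w)              # enforce w_i >= 0
        w = w / (w.sum() + self.eps)        # fast normalization (no softmax)
        return sum(wi * xi for wi, xi in zip(w, inputs))
```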
The training process of Simple-BiFPN involves the following steps: with an input image size of $640 \times 640$, the feature map $P_3$ has dimensions of $80 \times 80$, $P_4$ has dimensions of $40 \times 40$, and $P_5$ has dimensions of $20 \times 20$. Since the YOLO detector only utilizes three scales of information, an additional feature layer is obtained through upsampling to further improve feature fusion; it participates only in the feature fusion process, not in the final output. The Simple-BiFPN module assigns a weight to each input layer; taking the $P_4$ layer as an example, the fusion is given by Equation (5):

$$P_4^{td} = \mathrm{Conv}\!\left(\frac{w_1 \cdot P_4^{in} + w_2 \cdot \mathrm{Resize}(P_5^{in})}{w_1 + w_2 + \epsilon}\right), \qquad P_4^{out} = \mathrm{Conv}\!\left(\frac{w'_1 \cdot P_4^{in} + w'_2 \cdot P_4^{td} + w'_3 \cdot \mathrm{Resize}(P_3^{out})}{w'_1 + w'_2 + w'_3 + \epsilon}\right) \tag{5}$$

The Resize operation in Equation (5) is generally an upsampling or downsampling used to achieve feature scale uniformity; $P_4^{td}$ refers to the intermediate features of the layer and $P_4^{out}$ refers to the output features of that layer. To further improve feature fusion efficiency, depthwise separable convolution is also incorporated in this process. The other layers are constructed in the same way, likewise using depthwise separable convolutions to enhance feature fusion efficiency. The network architecture of Simple-BiFPN is illustrated in Figure 7.
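Building on the `FastNormalizedFusion` sketch above, one fusion node of the kind described by Equation (5) could be written as follows; the nearest-neighbor resizing, `SiLU` activation, and channel counts are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise convolution, used to
    refine each fused feature map cheaply."""
    def __init__(self, ch):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 3, 1, 1, groups=ch, bias=False)
        self.pw = nn.Conv2d(ch, ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class SimpleBiFPNNode(nn.Module):
    """One Simple-BiFPN fusion node, cf. Equation (5): resize the incoming
    features to a common scale, fuse them with fast normalized weights, then
    refine with a depthwise separable convolution."""
    def __init__(self, ch, num_inputs):
        super().__init__()
        self.fuse = FastNormalizedFusion(num_inputs)
        self.conv = DepthwiseSeparableConv(ch)

    def forward(self, inputs, size):
        # Resize: upsample or downsample each input to the target spatial size
        resized = [F.interpolate(x, size=size, mode="nearest") for x in inputs]
        return self.conv(self.fuse(resized))
```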
An efficient feature fusion network plays a crucial role in addressing the challenges of pedestrian detection in rainy conditions. The advantage of Simple-BiFPN lies in its seamless integration of multiscale features from different network layers, enhancing the model’s ability to capture pedestrian details under adverse weather. The fusion network combines contextual information, such as object relationships and spatial dependencies, contributing to improved detection accuracy, while maintaining the computational efficiency needed for real-time or near real-time performance in rainy conditions. This allows the model to remain effective while making good use of hardware resources, making it a strong choice for rainy-day pedestrian detection.
3.3. Incorporating the Spatial Attention Mechanism CBAM
CBAM (convolutional block attention module) is a more comprehensive approach to feature attention, as illustrated in
Figure 8. Unlike SENet, which only considers channel attention, CBAM applies both channel and spatial attention. The channel attention mechanism assigns appropriate weights to different channels so that the network focuses more effectively on key information. It consists of two pooling branches: global maximum pooling and global average pooling. The two pooled descriptors are each processed through a shared fully connected layer, the two outputs are combined through summation and passed through a sigmoid function to obtain values between 0 and 1, and these values are used to weight the original features to obtain the new feature F1.
The spatial attention mechanism of the CBAM module focuses on the most salient regions of the feature map, assigning different weights to different regions to reduce the influence of irrelevant ones. When the feature map F1 passes through the spatial attention mechanism, it undergoes channel-wise global maximum pooling and global average pooling, and the two results are concatenated along the channel dimension. The resulting feature map is then processed by a 1 × 1 convolution to adjust the number of channels, and a sigmoid function produces weights between 0 and 1 that are multiplied element-wise with the input features. The resulting weighted feature map F2 is used as the input to the next layer.
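The two attention steps can be summarized in a short PyTorch sketch. Note that the original CBAM paper uses a 7 × 7 convolution to produce the spatial map, while the description above mentions a 1 × 1 convolution, so the kernel size is left as a parameter here; the reduction ratio of 16 is likewise an assumption.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of CBAM: channel attention (shared MLP over globally max- and
    average-pooled descriptors), then spatial attention (a convolution over
    the channel-wise max and average maps)."""
    def __init__(self, ch, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared fully connected layers
            nn.Conv2d(ch, ch // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention -> F1
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        av = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        f1 = torch.sigmoid(mx + av) * x
        # Spatial attention -> F2: concat channel-wise max/avg maps, conv, sigmoid
        sp = torch.cat([f1.amax(dim=1, keepdim=True),
                        f1.mean(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial(sp)) * f1
```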
As illustrated in
Figure 2, the CBAM module is inserted between the backbone network and the feature network, adding three attention mechanism modules in total. By incorporating the attention mechanism between the backbone network and the feature network, the network can leverage the rich feature representations generated by the backbone network while selectively focusing on the most relevant regions of the image. This includes key contextual information such as the outline and posture of pedestrians, as well as the rainy background and environmental factors. This allows the network to generate more accurate predictions by considering only the most pertinent information in the image.
The integration of the CBAM module between the backbone and feature fusion networks proves highly effective. CBAM enhances feature quality, refines spatial context, improves information flow, and notably contributes to the network’s improved performance, especially in challenging conditions such as rainy weather.
3.4. REP Structures Combining Structural Reparameterization Ideas
The design of the REP (reparameterization) structure is inspired by RepVGG [
28], a concept introduced by the Tsinghua University team, which proposes to use a complex structure in the training phase and a simple structure in the prediction phase to improve detection performance without increasing the complexity of the prediction network. Based on this concept, we propose the REP structure (
Figure 9).
The proposed method consists of three processes: (a) constructing the REP structure used in the training phase, (b) generating the intermediate states during the conversion, and (c) deriving the simplified structure used in the prediction phase. As shown in
Figure 9, the structure of the prediction network is very simple, consisting of only a 3 × 3 convolutional block. The key to this process is to combine the convolution and BN operations into one operation and to unify all convolution operations with a 3 × 3 convolution kernel.
The first process fuses the BN layer with the convolutional layer, as shown in Equation (6):

$$\mathrm{BN}(W * x) = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\,(W * x - \mu) + \beta = \left(\frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\,W\right) * x + \left(\beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}}\right) \tag{6}$$

The derivation of Equation (6) reveals that the computation still follows the convolutional format, with the weights becoming $W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} W$ and the bias term becoming $b' = \beta - \frac{\gamma\mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\gamma$, $\beta$, $\mu$, and $\sigma^2$ are the BN scale, shift, running mean, and running variance, respectively. All of these parameters are fixed at the end of training, so this constant transformation can be performed in the prediction phase to fuse the BN layer into the convolutional layer. In the second process, all convolutions are replaced with 3 × 3 convolutions. The idea is straightforward: a 1 × 1 convolution kernel can be zero-padded to 3 × 3 without affecting the calculation results, and the identity branch, where no convolution is performed, can be expressed as a 3 × 3 convolution whose kernel has a weight of 1 at the center of its own channel and 0 elsewhere.
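The two constant transformations can be sketched in PyTorch as follows; the function names are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BN layer into the preceding convolution, cf. Equation (6):
    W' = gamma / sqrt(var + eps) * W,  b' = beta - gamma * mean / sqrt(var + eps)."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    if conv.bias is not None:                 # fold an existing conv bias too
        b = b + conv.bias * scale
    return w, b

def pad_1x1_to_3x3(w1x1: torch.Tensor):
    """Zero-pad a 1x1 kernel to 3x3; the padded kernel computes the same result."""
    return F.pad(w1x1, [1, 1, 1, 1])

# The reparameterized 3x3 convolution sums the aligned branch kernels/biases,
# e.g. w = w3x3 + pad_1x1_to_3x3(w1x1) + w_id, where w_id is the 3x3 kernel
# with a 1 at the center of each channel's own position (the identity branch).
```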
Based on the above analysis, we can conclude that the REP module has a high time cost during the training phase, but only requires one 3 × 3 convolution operation during the prediction phase, indicating that the REP module is an efficient module. As illustrated in
Figure 2, the REP structure is applied between the YOLO detection head and the feature fusion network to enhance the network’s complexity, enabling it to capture more complex features and improve the network’s detection performance.
Incorporating the REP module enhances network complexity during training, which, in turn, improves the network’s ability to extract and recognize pedestrian features in challenging rainy weather conditions. Importantly, this enhancement in complexity does not significantly impact network inference speed during the prediction phase, thus maintaining a high-speed detection capability.