3.2.1. Backbone: Feature Extraction Network (CSPDarknet53)
Regardless of the subdataset in Figure 5, the YOLOv4 model’s backbone module is still used to extract image features. CSPDarknet53 extracts important semantic features, so it is employed as the backbone network of the YOLOv4 model. As shown in Figure 5, CSPDarknet53 is a CNN that serves as the foundation for object detection and is built on DarkNet-53. In a CSPNet network, the feature map of the base layer is divided into two parts that are then merged through a CSP-stage hierarchy. By addressing the problem of redundant gradient information, this network considerably reduces the complexity of optimization while maintaining accuracy. It is compatible with a variety of networks, such as DenseNet, ResNeXt, and ResNet.
In CSPDarknet53, there are five residual blocks, labeled CSP 1-5. Each residual block contains several residual units (Resunits), and each Resunit applies a convolution operation in two steps, a 1 × 1 convolution followed by a 3 × 3 convolution, with a shortcut connection. A second convolution path (a 3 × 3 conv and a 1 × 1 conv) is applied across each residual block, and the overall output combines the results of these two paths.
Figure 6 depicts this operation in detail. As a result, each residual block can mitigate the vanishing gradient problem.
Therefore, the backbone network can extract image features effectively. In our model, the two subdatasets automatically classified after the preattention stage still use CSPDarknet53 to extract target features. Because it has a large receptive field and a small number of parameters, CSPDarknet53 enters the feature fusion stage after performing feature extraction.
The network in Figure 6 contains 29 convolutional layers of size 3 × 3 and 5 residual blocks; * represents the multiplication sign. The numbers of residual units in the five blocks are 1, 2, 8, 8, and 4. With the input image resolution set to 608 × 608, the output resolution of the third residual block is 76 × 76, that of the fourth is 38 × 38, and that of the fifth is 19 × 19.
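To make the structure of a residual block concrete, the following is a minimal PyTorch-style sketch of a Resunit and a CSP stage. The layer names, channel split, and LeakyReLU activation are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class Resunit(nn.Module):
    """Residual unit of a (CSP)Darknet-style backbone: 1x1 conv, 3x3 conv, shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # The shortcut connection keeps gradients flowing and counters vanishing gradients.
        return x + self.conv2(self.conv1(x))

class CSPStage(nn.Module):
    """CSP stage: the feature map is split into two paths, one passing through the
    residual units and one bypassing them; the two paths are then merged."""
    def __init__(self, channels, num_units):
        super().__init__()
        self.split_a = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.split_b = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.units = nn.Sequential(*[Resunit(channels // 2) for _ in range(num_units)])
        self.merge = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):
        a = self.units(self.split_a(x))   # path with num_units residual units (1, 2, 8, 8, 4)
        b = self.split_b(x)               # bypass path
        return self.merge(torch.cat([a, b], dim=1))
```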
3.2.2. Neck: SPP-MHSA, FPN, and PANet
The extraction of fusion features for target identification is often improved by adding layers between the backbone network and the output layer. YOLOv4’s Neck module includes FPN, PANet, and SPP. The SPP pooling layer frees the network from the fixed-size constraint, so the CNN no longer requires a fixed-size input. In particular, this study constructs an SPP layer above the final convolution layer. A top-down architecture with lateral connections is developed to generate a high-level map of semantic features on a global scale, and the FPN design significantly outperforms other generic feature extractors in various applications. Further, PANet is fast, simple, and very efficient. It comprises components that improve information transmission across the pipeline: it aggregates features from all levels, shortens the path between the bottom and top layers, and enhances the features of each level through augmented paths. This study introduces the transformer’s MHSA module after the SPP module, replacing the convolution block, to obtain important and prominent features from diverse locations in images.
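As a brief illustration of the SPP idea, the sketch below pools the same feature map with several kernel sizes and concatenates the results; the kernel sizes 5, 9, and 13 are assumed from the common YOLOv4 configuration and are not taken from this study.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pooling with different kernel sizes,
    concatenated with the input, so multiple receptive fields are combined
    without changing the spatial resolution of the feature map."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # Output has (len(kernel_sizes) + 1) * C channels; a 1x1 conv usually follows.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```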
Figure 7 shows the precise operational procedure. Modeling can be performed from global to local elements in the images. This procedure can be integrated more simply than before and extracts useful semantic and spatial information.
Transformer’s MHSA. The transformer model mainly introduces MHSA, which is composed of multiple self-attention mechanisms. For this self-attention model, the calculation process can be summarized as a function: the input is x, a matrix in image processing whose elements are called tokens. Linear operations are performed on x to obtain Q, K, and V. The similarity between Q and K is computed to obtain the correlation between each element of Q and each element of K, and according to this correlation, the corresponding elements of V are weighted to construct the output, i.e., Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of the keys. Similar to the parallel branches in a CNN, self-attention can also be designed as a parallel structure, called multihead self-attention (MHSA). In Auto-T-YOLO, this study uses four heads.
The specific self-attention structure diagram is shown below.
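The self-attention computation described above can be sketched as follows: a generic multihead self-attention over flattened feature-map tokens with four heads. This is an illustrative example, not the exact module used in Auto-T-YOLO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Multihead self-attention over a flattened 2D feature map (tokens = pixels)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.head_dim = dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # linear maps producing Q, K, V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, tokens, dim), e.g. tokens = H * W of a feature map
        b, n, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        # content-content similarity between Q and K, scaled, then softmax
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v                                # weight V by the similarities
        out = out.transpose(1, 2).reshape(b, n, d)    # merge heads
        return self.proj(out)
```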
Relative Position Encodings: To perceive position when performing attention operations, position coding is used in the transformer architecture [10]. Recently, however, relative-distance-aware position encodings have been found to be more effective. Relative position encodings are a type of position embedding for Transformer-based models that exploit pairwise, relative positional information. The model receives relative positioning information at two levels: values and keys.
The coding approach proposed by Shaw et al. [44] is more suitable for visual processing tasks [45,46]. This can be attributed to the attention mechanism considering both the content information and the relative distance between features in different positions, which links the perception of content and location more effectively. In SPP-MHSA, this study uses relative position coding to perform the self-attention operations [44,47].
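The following sketch shows one way to form such relative position encodings for a 2D feature map, with one learned table along the height (R_h) and one along the width (R_w); the parameter shapes and initialization are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RelPos2D(nn.Module):
    """Learned 2D relative position encodings: a table along the height (R_h) and
    a table along the width (R_w); their broadcast sum gives a per-position encoding r."""
    def __init__(self, height, width, head_dim):
        super().__init__()
        self.rel_h = nn.Parameter(torch.randn(height, 1, head_dim) * 0.02)  # R_h
        self.rel_w = nn.Parameter(torch.randn(1, width, head_dim) * 0.02)   # R_w

    def forward(self, q):
        # q: (batch, heads, H*W, head_dim)
        r = (self.rel_h + self.rel_w).reshape(-1, q.shape[-1])  # r: (H*W, head_dim)
        # content-position logits q r^T: (batch, heads, H*W, H*W)
        return q @ r.transpose(0, 1)
```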
SPP-MHSA. According to the previous analysis, deep CNNs can easily fall into locally optimal solutions and tend to ignore the salient global features of some targets. The MHSA mechanism aims to model global features. Based on the precedence of global features in visual perception, the fusion order of semantic features for each target is from global to local. Therefore, to detect images with large targets, this study proposes replacing the convolutional block in the feature fusion stage with the MHSA module, which attends to various positions in the images. After replacing the convolution block following the SPP module, this study renames the transformer’s MHSA as the SPP-MHSA module.
Figure 8 depicts the particular operation. The model’s first spatial (3 × 3) convolutions are replaced by an MHSA layer that employs global self-attention over a 2D feature map. This replacement is placed at the beginning, as shown in Figure 8. The beginning of the FPN coincides with the end of CSPDarknet53 because every residual block typically exhibits its strongest characteristics at its deepest layer, and the strongest extracted feature is found in the output of the deepest block at the end of the network. CSPDarknet-53, the YOLOv4 backbone, is responsible for extracting numerous features from images via 5 Resblocks (C1–C5). The network has 53 convolution layers ranging in size from 1 × 1 to 3 × 3. The feature map generated by the MHSA operation can then be used by each succeeding layer for multiscale feature fusion and extraction based on the FPN and PANet network design. This is analogous to transferring the global features perceived by the MHSA layer to the succeeding deep CNN, so semantic information is modeled from global to local by applying the MHSA layer first and the convolutional layers afterward. After replacing the block with the MHSA layer, no further operations are necessary because the stride of the convolution block is 1. This enhancement (SPP-MHSA) permits global information to be acquired from feature maps with varying receptive field sizes, and a multiscale fusion of global information can be conducted in subsequent operations, which improves the model’s detection accuracy for large-size targets.
As shown in Figure 7, the present study used four heads. * represents the multiplication sign. The feature map passed to the module is a 3D tensor with height H, width W, and channel depth d; pointwise (1 × 1) convolutions project it into the feature maps q, k, and v. ⊕ is the element-wise sum, and ⊗ represents matrix multiplication. With H representing height and W representing width, R_h and R_w are the relative position encodings along the two axes. R_h and R_w are added together and output as r, representing the position distance coding. Matrix multiplication of q and r obtains the content-position output q r^T, and matrix multiplication of q and k obtains the content-content output q k^T. After q r^T ⊕ q k^T, softmax normalization is applied to the obtained matrix, where q, k, and r represent the query, key, and position encodings, respectively (this study uses relative distance encodings [38,41,42,43]). After processing, the matrix is softmax(q r^T ⊕ q k^T), and this matrix is then multiplied by the value map v. The final output is z = softmax(q r^T ⊕ q k^T) ⊗ v. In addition to using multiple heads, we introduce a two-dimensional position encoding in the content-position section to accommodate the matrix form in image processing. This is the largest difference from the original Transformer mechanism [38,48,49].
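Putting the above together, the sketch below is a minimal, illustrative implementation of the stride-1 MHSA layer with 2D relative position encodings (content-content plus content-position terms); the fixed feature-map size and parameter shapes are assumptions, not the authors’ exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHSA2d(nn.Module):
    """Stride-1 multihead self-attention over a 2D feature map with 2D relative
    position encodings, usable in place of a 3x3 convolution block."""
    def __init__(self, channels, height, width, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.head_dim = channels // heads
        # pointwise (1x1) convolutions produce q, k, v
        self.q = nn.Conv2d(channels, channels, 1, bias=False)
        self.k = nn.Conv2d(channels, channels, 1, bias=False)
        self.v = nn.Conv2d(channels, channels, 1, bias=False)
        # R_h and R_w, broadcast-summed into the per-position encoding r (fixed H, W)
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.head_dim, height, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.head_dim, 1, width) * 0.02)

    def forward(self, x):
        # x: (batch, channels, H, W); H and W must match the stored tables
        b, c, h, w = x.shape
        q = self.q(x).view(b, self.heads, self.head_dim, h * w)
        k = self.k(x).view(b, self.heads, self.head_dim, h * w)
        v = self.v(x).view(b, self.heads, self.head_dim, h * w)
        r = (self.rel_h + self.rel_w).view(1, self.heads, self.head_dim, h * w)
        content_content = q.transpose(-2, -1) @ k   # q k^T
        content_position = q.transpose(-2, -1) @ r  # q r^T
        attn = F.softmax(content_content + content_position, dim=-1)
        out = v @ attn.transpose(-2, -1)            # weight v by the attention matrix
        return out.view(b, c, h, w)                 # same resolution and channels: stride 1
```

Because the output has the same resolution and channel count as the input, such a layer can take the place of a stride-1 3 × 3 convolution block in the neck without further changes, as described above.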
MHSA focuses more on global information. In Figure 9, the important semantic features of small-target images lie in the surroundings of the targets and are transferred between contexts. Global information can sometimes introduce redundancy when transferring the essential features of small-size targets. Therefore, in the attention stage for detecting small-size target images, this study uses convolution operations and attention mechanisms, rather than the self-attention mechanism, to isolate the most significant context features.
In Figure 9, the part in bold red indicates the position where the 3 × 3 convolution block is replaced with the MHSA layer.