This paper combines the hybrid idea of ConvMixer [24] with the advantages of the multi-branch architecture of EPSANet [25]. First, the input feature map is processed by a multi-branch architecture, and each branch uses depthwise convolution to mix spatial locations; afterward, pointwise convolution is used to mix channel locations. A large kernel is used in the depthwise convolution to mix remote spatial location information, thereby constructing long-range dependence while obtaining a larger receptive field. Finally, a mixed attention (MA) module is proposed, which is composed of four parts, as shown in Figure 2. First, the Mixer and Concat (MC) module is executed to obtain the multi-scale mixed feature map. Second, the SEW module is applied to the multi-scale mixed feature map to obtain the channel weight vectors. Third, the Softmax function recalibrates the channel weight vectors to obtain the calibrated multi-scale channel weight vector. Fourth, the calibrated weight vector is multiplied by the corresponding channels of the multi-scale mixed feature map. Finally, a refined feature map that is richer in multi-scale feature information is obtained and used as the output.
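The four-stage pipeline above can be sketched as a toy, dependency-free implementation. The learned mixer is reduced to an identity stand-in and the SEW step to plain global average pooling; both are our illustrative simplifications, not the paper's actual modules:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scalars.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def ma_module(branches):
    # branches: list of S branches; each branch is a list of channels,
    # each channel a flat list of spatial values (toy representation).
    # 1) MC: mix each branch (identity stand-in for the learned mixer).
    mixed = [[ch[:] for ch in br] for br in branches]
    # 2) SEW stand-in: per-channel weight via global average pooling.
    weights = [[sum(ch) / len(ch) for ch in br] for br in mixed]
    # 3) Softmax recalibration across branches, per channel index.
    S, C = len(weights), len(weights[0])
    recal = [[0.0] * C for _ in range(S)]
    for c in range(C):
        col = softmax([weights[i][c] for i in range(S)])
        for i in range(S):
            recal[i][c] = col[i]
    # 4) Weight each channel, then concatenate all branches.
    out = []
    for i in range(S):
        for c in range(C):
            out.append([v * recal[i][c] for v in mixed[i][c]])
    return out

# Two branches, two channels of two spatial values each.
out = ma_module([[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0], [4.0, 4.0]]])
```

Note that the output keeps all S × C channels (concatenation), and for each channel index the recalibrated weights across branches sum to one.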
As shown in Figure 2, in the MA module, the main operation for multi-scale mixed feature extraction is the MC module, whose overall structure is shown in Figure 3. In order to extract multi-scale spatial information, the input feature map is processed in a multi-branch way; the channel dimension of the input tensor of each branch is $C$, and the output channel dimension is $C' = C/S$, where $S$ represents the number of branches. By doing this, more abundant spatial location information can be obtained. Different spatial resolutions and depths can be generated by using multi-scale convolutional kernels in a pyramid structure, and the spatial information at different scales on each channel-wise feature map can be effectively extracted by squeezing the channel dimension of the input tensor. Each branch learns multi-scale mixed spatial information independently and establishes cross-dimensional interaction over a wide range. However, as the size of the convolution kernel increases, the number of parameters also gradually increases. Therefore, in order to perform multi-scale convolution on the input tensors without increasing the computational cost, grouped convolutions are applied heavily in the convolutional layers. At the same time, to select different group sizes without increasing the number of parameters, and referring to the EPSANet architecture design rules, the correlation between the multi-scale kernel size and the group size can be defined as:
$$G = 2^{\frac{K-1}{2}}$$
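Under this rule, the group size follows directly from the kernel size. A one-line check, assuming the kernel sizes 3, 5, 7, and 9 of a four-branch pyramid (the specific kernel sizes are our assumption, following EPSANet):

```python
def group_size(k):
    # Group size grows with kernel size: G = 2^((K - 1) / 2),
    # so larger kernels use more groups and keep parameters flat.
    return 2 ** ((k - 1) // 2)

# Kernel sizes of a 4-branch pyramid and their group sizes.
print([(k, group_size(k)) for k in (3, 5, 7, 9)])
# → [(3, 2), (5, 4), (7, 8), (9, 16)]
```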
where $K$ represents the size of the convolution kernel and $G$ is the size of the group; the effectiveness of this formula is demonstrated in the ablation study. For each branch, the spatial dimension of the input tensor is first compressed to extract local information, and the feature map generation function is defined as:
$$F_i = \mathrm{BN}\big(\sigma\big(\mathrm{Conv}(k_i \times k_i, G_i)(X)\big)\big), \quad i = 0, 1, \dots, S-1$$
where the size of the $i$-th convolution kernel is $k_i$, the size of the $i$-th group is $G_i$, $\sigma$ represents the GELU activation function, and BN is BatchNorm [26], which normalizes the tensors after activation to speed up the training of the model; $F_i$ represents the feature maps at different scales, which are then fed to the hybrid module. In order to mix remote spatial location information, we increase the size of the convolution kernel to 9. Meanwhile, to prevent the larger convolution kernel from causing more computational overhead and a larger number of parameters, we use depthwise convolution in this paper. According to research in the literature [27], a large-kernel depthwise convolution is difficult to make work without an identity shortcut; therefore, a parallel shortcut branch is added in this paper. Referring to the Feed-Forward Network (FFN) design of the ViT architecture, we use a similar CNN-style block composed of a shortcut, SoftBAN, a 1 × 1 convolution layer, and GELU to mix channel location information. Hence, each branch in the MC module is very similar to the Transformer structure. By doing this, a larger combined receptive field can be obtained, and cross-dimensional interaction across channels is established. The mixing operation does not change the spatial or channel dimensions of the tensor. The mixing operation function is defined as:
$$U_i = F_i + \mathrm{DWConv}(9 \times 9)(F_i), \qquad F_i' = U_i + \sigma\big(\mathrm{Conv}(1 \times 1)(\mathrm{SoftBAN}(U_i))\big)$$
where SoftBAN is an improvement on IEBN [28]; see Appendix A for the detailed derivation.
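A minimal, dependency-free sketch of the spatial-mixing step of one branch on a single channel: the learned depthwise weights are replaced by a uniform averaging kernel, and the SoftBAN/1 × 1 channel-mixing stage is omitted. Both simplifications are ours, for illustration only:

```python
def dw_conv_same(x, k):
    # Depthwise conv on one channel (H×W list of lists) with 'same'
    # zero padding; uniform 1/k^2 weights stand in for learned ones.
    h, w, p = len(x), len(x[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-p, p + 1):
                for dj in range(-p, p + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        s += x[ii][jj]
            out[i][j] = s / (k * k)
    return out

def mixer_branch(x, k=9):
    # Large-kernel depthwise conv plus the parallel identity shortcut.
    u = dw_conv_same(x, k)
    return [[x[i][j] + u[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]  # spatial dimensions unchanged
```

As the text states, the spatial (and, per channel, the channel) dimensions are preserved, which the shortcut addition requires.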
By extracting channel attention weight information from the multi-scale preprocessed feature maps, channel weight vectors at different scales are obtained. The channel attention weight vector can be expressed as:
$$Z_i = \mathrm{SEW}(F_i'), \quad i = 0, 1, \dots, S-1$$
where $Z_i$ is the attention weight of the $i$-th mixed feature map $F_i'$, and the $\mathrm{SEW}(\cdot)$ function obtains the attention weight from the input feature map at each scale. Owing to the multi-branch architecture and the different convolution kernel sizes allocated to each branch, the MA module can fuse context information at different scales, and, supported by the large-kernel residual convolution, it can generate better pixel-level attention for high-level semantic feature maps. In addition, in order to achieve the interaction of attention information and the fusion of cross-dimensional vectors without destroying the original channel attention weight vectors, the whole channel attention weight vector is obtained by concatenation, as shown in Equation (8):
$$Z = Z_0 \oplus Z_1 \oplus \cdots \oplus Z_{S-1}$$
where $\oplus$ denotes concatenation and $Z$ is the multi-scale attention weight vector.
Softmax is used to obtain the multi-scale channel recalibration weights, which contain all the local spatial information and the channel attention weights; by doing this, the interaction between local and global attention is realized. Next, the recalibrated channel attention vectors are fused by concatenation, so the entire channel attention vector can be expressed as:
$$\mathrm{att}_i = \mathrm{Softmax}(Z_i) = \frac{\exp(Z_i)}{\sum_{j=0}^{S-1}\exp(Z_j)}, \qquad \mathrm{att} = \mathrm{att}_0 \oplus \mathrm{att}_1 \oplus \cdots \oplus \mathrm{att}_{S-1}$$
where $\mathrm{att}$ represents the attention weight vector of the multi-scale channels after attention interaction. We multiply the recalibrated multi-scale channel attention weight $\mathrm{att}_i$ with the mixed feature map $F_i'$ of the corresponding scale as:
$$Y_i = F_i' \odot \mathrm{att}_i, \quad i = 0, 1, \dots, S-1$$
where $\odot$ denotes channel-wise multiplication, and $Y_i$ refers to the feature map weighted by the multi-scale channel attention weight vector, which has stronger feature representation and modeling capability. The concatenation operator is more effective than the summation operator because it keeps the feature representation intact without destroying the information of the original feature maps. In summary, the procedure for obtaining the optimized output can be written as:
$$\mathrm{Out} = \mathrm{Cat}\big([Y_0, Y_1, \dots, Y_{S-1}]\big)$$
From the above analysis, the MA module proposed in this paper can integrate multi-scale spatial information and cross-channel attention into the blocks of each feature group. Therefore, the MA module can obtain better information interaction between local and global channel attention.
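The final weighting-and-concatenation step can be illustrated with two hypothetical branches of two channels each; the per-channel weights below are made-up numbers standing in for softmax outputs (they sum to one across branches for each channel index):

```python
def weight_and_concat(feats, atts):
    # feats: per-branch list of channels (each channel a flat list);
    # atts:  per-branch list of scalar channel attention weights.
    out = []
    for br_feats, br_atts in zip(feats, atts):
        for ch, a in zip(br_feats, br_atts):
            out.append([v * a for v in ch])  # channel-wise multiply
    return out  # concatenation along the channel dimension

feats = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
atts = [[0.25, 0.75], [0.75, 0.25]]
result = weight_and_concat(feats, atts)
print(result)
# → [[0.25, 0.5], [2.25, 3.0], [3.75, 4.5], [1.75, 2.0]]
```

Because the branches are concatenated rather than summed, every weighted channel survives into the output, which is the "information kept intact" property argued for above.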