Article

A Symmetric Efficient Spatial and Channel Attention (ESCA) Module Based on Convolutional Neural Networks

School of Mechanical and Electrical Engineering, Soochow University, Suzhou 215137, China
*
Authors to whom correspondence should be addressed.
Symmetry 2024, 16(8), 952; https://doi.org/10.3390/sym16080952
Submission received: 28 June 2024 / Revised: 18 July 2024 / Accepted: 22 July 2024 / Published: 25 July 2024
(This article belongs to the Special Issue Symmetry/Asymmetry in Neural Networks)

Abstract

In recent years, attention mechanisms have shown great potential in various computer vision tasks. However, most existing methods focus on developing more complex attention modules for better performance, which inevitably increases model complexity. To overcome the tradeoff between performance and complexity, this paper proposes efficient spatial and channel attention (ESCA), a symmetric, comprehensive, and efficient attention module. By analyzing the squeeze-and-excitation (SE), convolutional block attention module (CBAM), coordinate attention (CA), and efficient channel attention (ECA) modules, we abandon the dimension-reduction operation of the SE module, verify the negative impact of global max pooling (GMP) on the model, and apply a local cross-channel interaction strategy without dimension reduction to learn attention. We attend not only to the channel features of the image but also to the spatial location of the target, and we account for the effectiveness of channel attention, which motivates the symmetric design of the ESCA module. The ESCA module is effective, as demonstrated by its application in the ResNet-50 classification benchmark. With 26.26 M parameters and 8.545 G FLOPs, it introduces a mere 0.14% increment in FLOPs while achieving a 6.33% improvement in Top-1 accuracy and a 3.25% gain in Top-5 accuracy. We perform image classification and object detection tasks on ResNet, MobileNet, YOLO, and other architectures on popular datasets such as Mini ImageNet, CIFAR-10, and VOC 2007. Experiments show that ESCA achieves a large improvement in model accuracy at a very small cost and performs well among similar modules.

1. Introduction

As deep convolutional neural networks (CNNs) are applied ever more widely in computer vision, for tasks such as image classification, target recognition, and semantic segmentation [1,2], current research increasingly tends toward deeper CNN architectures so that the network can learn stronger feature representations. However, stacking additional convolutional layers often increases the generalization ability of the model only slightly while consuming large amounts of computing resources and memory [3]. We believe this cost is not worthwhile, and it is the main disadvantage of deep CNN architectures. Research then shifted toward lightweight networks such as MobileNet [4], which maintain high model performance with fewer computing resources, but developing a new neural network is difficult and highly uncertain. As an alternative, the attention mechanism not only enhances the generalization ability of the network [5,6,7] but can also be easily embedded into CNN architectures by virtue of its flexible structure. Consequently, the development of attention mechanisms has become a new hot spot in the field of computer vision [8,9].
Currently, it is widely accepted that attention mechanisms fall into three main research directions: spatial attention, channel attention, and their combination. Among these, SE stands out as a representative method for channel attention, as it explicitly models the interaction among channels to extract channel attention [10]. Another notable approach is CBAM, which leverages semantic dependencies between the channel and spatial dimensions of the feature map to facilitate cross-space and cross-channel information transfer [11]. Consequently, CBAM shows significant potential in integrating cross-dimensional attention weights into input images. Nonetheless, the hand-designed pooling layers within CBAM entail complex processing and may escalate computational costs. To address this issue, one line of work utilizes feature grouping to partition features from different resources into multiple groups, ensuring an even distribution of features in space. In this context, spatial group-wise enhancement (SGE) attention [12] further subdivides the channel dimension into many subfeatures and enhances the spatial distribution of the various semantic subfeature representations, achieving excellent performance.
Using convolution with channel dimensionality reduction is an efficient strategy for simplifying the model structure [13]. Unlike the SE attention mechanism, CA [14] incorporates spatial direction information into channel attention and optimizes the channel reduction ratio to achieve comparable performance. This approach differs from coarse-grained computer vision (CV) tasks and can alleviate the computational overhead associated with dimensionality reduction in per-pixel regression tasks. Inspired by the idea of estimating complex pixel semantics, polarized self-attention (PSA) [15] compresses the input only along the corresponding channel dimension, preserving high resolution while fully collapsing the input feature map in the channel dimension for effective feature representation. Although an appropriate channel reduction ratio can lead to good performance, it may inadvertently harm the extraction of deep visual representations, which motivates the non-dimensionality-reduction design explored in ECA [16].
Increasing network depth is a crucial means of enhancing the representation power of convolutional neural networks; nonetheless, this enhancement costs a lot of computational resources and memory [17,18]. Unlike attention mechanisms arranged as a linear sequence in deep networks, triplet attention (TA) [19] introduces a three-branch parallel architecture to capture the inter-dimensional interaction between the parallel branches. Shuffle attention (SA) [20] utilizes parallel substructures to organize channel dimensions into discrete subfeatures and process them in parallel and efficiently on multiple processors. Parallel network (ParNet) [21] utilizes parallel sub-networks to enhance feature extraction efficiency, keeping the network shallow and minimizing latency.
As the above analysis shows, cross-dimensional interactions contribute to the prediction of channel or spatial attention, and avoiding channel dimensionality reduction helps to enhance the model's generalization capability. We therefore present a multi-dimensional ESCA (efficient spatial and channel attention) method; the model architecture is shown in Figure 1. We use 1D convolution for multi-dimensional attention feature extraction, and a symmetrical design of channel attention is introduced so that the model fully focuses on channel attention. To verify the rationality of our ESCA module design, we describe the extensive experiments we conducted in Section 4, where we first verify the negative impact of GMP on the model and the correctness of the adaptive formula for the 1D convolution kernel size k. We then use ResNet [22], MobileNet [4], YOLO [23], and other architectures to evaluate the classification task on the Mini ImageNet and CIFAR-10 datasets and the object detection task on the VOC 2007 dataset. Our main contributions are as follows:
  • We propose an optimized, effective attention module that enhances the model’s generalization ability.
  • We verify the negative effects of GMP and the correctness of our modular symmetrization design through extensive ablation studies.
  • We validate that both classification and recognition tasks on multiple benchmarks (Mini ImageNet, CIFAR-10 and VOC 2007) are greatly improved using different network architectures (ResNet, MobileNet, YOLO) by embedding ESCA.

2. Related Work

In this section, we provide a brief literature review of the network architectures and attention mechanisms discussed in this paper.
Research on neural network structure has always been a focus in the field of computer vision [22,24,25,26,27,28]. ResNet addresses the problem that accuracy decreases as the network deepens through a simple identity skip connection, ensuring that the network remains theoretically optimal and preventing performance degradation as depth increases. Many models have been developed on top of the ResNet architecture, such as WideResNet [29], ResNeXt, and Inception-ResNet [30]. Later, the MobileNet architecture was proposed to build a lightweight network, and MobileNetV3 [31] then combined it with a neural architecture search algorithm to find the most suitable activation function and the expansion ratio of inverted residual blocks at different depth levels. After that, the YOLO architecture was used to achieve real-time object detection, providing higher accuracy and speed.
The SE module was the first to show that attention is an effective channel attention learning mechanism; it learns features through channel dimensionality reduction and achieves good results. In subsequent research, CBAM attends not only to channel attention but also to spatial information, realizing multi-dimensional feature aggregation through global average pooling (GAP) and GMP. Global second-order pooling (GSoP) [32] uses second-order pooling to achieve more efficient feature aggregation than CBAM. Gather–excite (GE) [33] uses depth-wise convolutions [34] to explore spatial extents for aggregating features. The CA module locates the “where” of the image by encoding spatial location. CBAM and spatial and channel SE (scSE) [35] use two-dimensional convolution to compute two-dimensional spatial attention and then apply the channel and spatial attention jointly to the input image. The ECA module demonstrates the effectiveness of channel attention without dimensionality reduction and provides a method for extracting attention with 1D convolution to achieve efficient feature extraction. GCNet [36] and non-local (NL) [24] neural networks model long-range dependencies by developing a simplified NL network integrated with the SE module. Double attention networks (A2-Nets) [37] introduce a new relation function for NL blocks in image and video recognition. The dual attention network (DAN) [38] accounts for both channel and spatial attention for scene segmentation. Efficient multi-scale attention (EMA) [39] proposes a new cross-spatial learning method and designs a multi-scale parallel sub-network structure that aims to establish short- and long-range dependencies to better capture global and local information in the image. Our ESCA module fully considers the advantages and disadvantages of the above models and achieves efficient aggregation and extraction of spatial attention and channel attention while maintaining very low model complexity.

3. Efficient Spatial and Channel Attention

In this section, we review well-known modules such as SE, CBAM, CA, and ECA. By analyzing the advantages and drawbacks of each module, we design the ESCA module, which fully considers both channel attention and spatial attention and eliminates max pooling, which has a negative effect. It realizes fast and effective extraction of features along the C, H, and W directions, and we show how to apply it to ResNet and YOLOv8.

3.1. Review SE, CBAM, CA, and ECA Modules

Let the output of the previous layer be $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the number of channels (i.e., the number of filters), the height of the feature map, and the width of the feature map, respectively. Thus, the SE module can be described as

$F' = \delta(W_2 \, \varepsilon(W_1 y))$ (1)

where $y = \frac{1}{H \times W} \sum_{i=1,j=1}^{H,W} F_{ij}$ is channel GAP, $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$ and $W_2 \in \mathbb{R}^{\frac{C}{r} \times C}$ are the parameters that need to be learned by the SE channel attention module, $\delta$ is the sigmoid activation function, and $\varepsilon$ stands for the rectified linear unit (ReLU) activation function. Although the SE module has been proven to be an effective attention module, the design of $W_1$ and $W_2$ only plays the role of reducing the calculation parameters and does not bring beneficial effects to the learning of the model, and the SE module ignores the spatial information of the feature map.
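For concreteness, here is a minimal PyTorch sketch of an SE-style block matching Equation (1); the class name and the default reduction ratio are illustrative choices, not taken from the original SE paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE-style channel attention: GAP -> W1 (reduce) -> ReLU -> W2 (restore) -> sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # W1
            nn.ReLU(inplace=True),                                   # epsilon
            nn.Linear(channels // reduction, channels, bias=False),  # W2
            nn.Sigmoid(),                                            # delta
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))             # channel-wise GAP, y in Eq. (1)
        w = self.fc(y).view(b, c, 1, 1)    # per-channel weights
        return x * w                       # rescale the input feature map
```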
The CBAM module can be described as
$F' = F_2 \otimes F_1 \otimes F$ (2)

$F_1 = \delta(W_2(W_1 y) + W_2(W_1 z))$ (3)

$F_2 = \delta(C2D_7([y(F_1), z(F_1)]))$ (4)

where $\otimes$ stands for element-wise multiplication, $z$ stands for GMP, the weight parameters of the mean branch and the max branch are shared, and $C2D_7$ stands for concatenating the mean map and the max map and applying a 7 × 7 two-dimensional convolution kernel to obtain two-dimensional spatial attention. The CBAM module focuses on spatial and channel attention and addresses the “what” and “where” of the target. However, the model is relatively complex, and the max pooling step in particular is shown to bring negative effects in our subsequent experiments.
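The following is a rough PyTorch sketch of the CBAM computation in Equations (2)–(4), under the usual reading that the channel MLP is shared between the GAP and GMP branches; it is a simplified illustration, not the official CBAM implementation.

```python
import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    """Channel attention (shared MLP over GAP and GMP), then 7x7 spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                                    # shared W1, W2
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)  # C2D_7

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                                       # GAP, y
        z = x.amax(dim=(2, 3))                                       # GMP, z
        f1 = torch.sigmoid(self.mlp(y) + self.mlp(z)).view(b, c, 1, 1)  # Eq. (3)
        x = x * f1                                                   # apply channel weights
        maps = torch.cat([x.mean(dim=1, keepdim=True),               # spatial mean map
                          x.amax(dim=1, keepdim=True)], dim=1)       # spatial max map
        f2 = torch.sigmoid(self.spatial(maps))                       # Eq. (4)
        return x * f2
```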
The CA module can be described as
$F' = F \times \delta(W_2 \, \varepsilon(W_1 y_c^h)) \times \delta(W_2 \, \varepsilon(W_1 y_c^w))$ (5)

where $y_c^h = \frac{1}{W} \sum_{j=1}^{W} F_{hj}$ and $y_c^w = \frac{1}{H} \sum_{i=1}^{H} F_{iw}$ represent GAP over the two dimensions H and W, respectively. The CA module is different from the SE module, which only focuses on the channel information. It realizes the encoding of channel, H, and W spatial information through the mean pooling of two dimensions. Attention can accurately locate the object of interest through the horizontal position and vertical position without abandoning the channel information, which can help us to achieve a more accurate positioning of the target of interest and improve the generalization ability of the model. Of course, the operation of $W_1$ and $W_2$ is worth improving.
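A rough sketch of the simplified CA formula in Equation (5) follows; it implements the equation as written in the text (shared $W_1$/$W_2$ applied to the H-pooled and W-pooled descriptors) rather than the original CA implementation, and the reduction ratio is an assumed value.

```python
import torch
import torch.nn as nn

class SimplifiedCA(nn.Module):
    """CA as in Eq. (5): pool along W and along H, share W1/W2, rescale the input twice."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.w1 = nn.Conv1d(channels, mid, kernel_size=1, bias=False)   # W1
        self.w2 = nn.Conv1d(mid, channels, kernel_size=1, bias=False)   # W2
        self.relu = nn.ReLU(inplace=True)                               # epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y_h = x.mean(dim=3)                                      # y_c^h: GAP over W -> (B, C, H)
        y_w = x.mean(dim=2)                                      # y_c^w: GAP over H -> (B, C, W)
        a_h = torch.sigmoid(self.w2(self.relu(self.w1(y_h))))    # (B, C, H)
        a_w = torch.sigmoid(self.w2(self.relu(self.w1(y_w))))    # (B, C, W)
        return x * a_h.view(b, c, h, 1) * a_w.view(b, c, 1, w)   # Eq. (5)
```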
The ECA module can be described as
$F' = \delta(C1D_k(y))$ (6)

where $C1D_k$ represents a 1D convolution with kernel size $k$. The ECA module proves that the design of $W_1$ and $W_2$ in SE is unnecessary and provides a 1D convolution kernel to achieve the same purpose efficiently. Based on the above modules, we designed the ESCA attention module.
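For comparison with the SE sketch above, here is a minimal ECA-style block matching Equation (6): GAP followed by a k-sized 1D convolution across channels with no dimensionality reduction. The default kernel size is illustrative.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA-style channel attention: GAP, then C1D_k across channels, then sigmoid."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # C1D_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3)).view(b, 1, c)              # GAP -> (B, 1, C)
        w = torch.sigmoid(self.conv(y)).view(b, c, 1, 1)  # per-channel weights
        return x * w
```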

3.2. Efficient Spatial and Channel Attention (ESCA) Module

After fully analyzing the SE, CBAM, CA, and ECA modules, and drawing on the experiments of previous work, we conclude that channel dimension reduction does not bring better results, whereas a 1D convolution kernel of size k can focus on the strong connections between channels within a k-sized neighborhood to effectively enhance channel attention while adding only a small number of extra parameters. This makes it an excellent method for feature extraction along a single dimension. At the same time, considering the extraction and positioning of spatial information, we extract weights along the H and W dimensions as location information, and we do not use GMP, which would burden the model without benefit. Finally, we consider channel attention to have higher priority than location attention, so we perform two layers of channel extraction to strengthen the channel information. In this way, we can still improve the generalization ability of the model with very few modules inserted, and we avoid stacking attention modules. Based on the above analysis, we propose our efficient spatial and channel attention module, which we describe in detail below.
Firstly, given $F \in \mathbb{R}^{C \times H \times W}$ as input, ESCA extracts feature weights in the order channel–height–width–channel. $f_c \in \mathbb{R}^{C \times 1 \times 1}$ represents 1D channel attention extraction, and $f_h \in \mathbb{R}^{1 \times H \times 1}$ and $f_w \in \mathbb{R}^{1 \times 1 \times W}$ represent 1D spatial attention extraction along the H and W dimensions, respectively, as shown in Figure 2. The design of the whole module can be summarized in the following steps:
$F_1 = f_c[F_{in}] \otimes F_{in}$ (7)

$F_2 = f_h[F_1] \otimes F_1$ (8)

$F_3 = f_w[F_2] \otimes F_2$ (9)

$F_{out} = f_c[F_3] \otimes F_3$ (10)
In the channel attention module, we leverage the relationships between feature map channels to generate our attention weights. Since every channel feature is usually considered as a feature detector, channel attention provides great help in solving the problem of the “what” of the image. We followed the practice of Zhou et al. and perform GAP on the feature map to obtain the mapping of spatial information onto the channels. We did not adopt the practice of Woo et al. of also performing GMP on the input image; our experiments show that this does not increase the generalization ability of the model but rather reduces it and brings additional parameter calculations. After that, we used the method of assigning weights to 1D features designed by Wang et al., which we consider an effective method because the design of the 1D convolution kernel is based on the idea of $F' = \delta(\varphi y)$. This method fully considers the feature connection and fusion between adjacent channels. Finally, the channel attention weights are output by a sigmoid activation function, where $\varphi$ is a band matrix:
$$\varphi = \begin{bmatrix} \varphi_{1,1} & \cdots & \varphi_{1,k} & 0 & 0 & \cdots & 0 \\ 0 & \varphi_{2,2} & \cdots & \varphi_{2,k+1} & 0 & \cdots & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & \varphi_{C,C-k+1} & \varphi_{C,C} \end{bmatrix}$$
In the first step, we perform GAP on the input tensor to obtain a one-dimensional channel information map that aggregates the spatial information. Average pooled features have repeatedly been proven beneficial for feature extraction, whereas max pooled features do not have the desired effect and even accelerate overfitting of the model; therefore, based on our experiments, we do not extract max pooled features. In the second step, we obtain the target channel feature and the aggregated features of its neighboring channels using a simple 1D convolution kernel. This avoids the information loss caused by channel compression in the SE module and adds only a very small parameter overhead. After that, the channel attention weights are obtained by normalizing the tensor through the sigmoid activation function, and the weights are then applied to the input feature map. The computation of the channel attention module is as follows:
$f_c[F] = \delta(C1D_{k_1}(avgpool(F))) = \delta(C1D_{k_1}(y))$ (11)
Since the scale of the 1D kernel is correlated with the dimension of channel C and is not linear, the value of k in the above formula can be obtained using the following formula:
$k = \Phi(C) = \left[ \frac{\log_2(C)}{2} + \frac{1}{2} + \frac{1}{10} \right]_{odd}$ (12)
The function $[\,\cdot\,]_{odd}$ denotes the odd number closest to its argument, and the purpose of the second bias term, 1/10, is to make the function round up to the nearer odd number when the argument lies exactly between two odd numbers.
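As a concrete illustration, the following is a minimal Python sketch of Equation (12), assuming the tie-breaking behavior described above; the function name esca_kernel_size is ours, not from the paper.

```python
import math

def esca_kernel_size(channels: int) -> int:
    """Adaptive 1D kernel size per Equation (12): k = [log2(C)/2 + 1/2 + 1/10]_odd."""
    t = math.log2(channels) / 2 + 0.5 + 0.1  # the 1/10 bias breaks ties upward
    k = round(t)
    if k % 2 == 0:                           # snap to the nearest odd integer
        k = k + 1 if t >= k else k - 1
    return k

# Example: a 256-channel feature map yields k = 5, matching the best setting in Table 3.
print(esca_kernel_size(256))  # 5
```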
The spatial attention module is subdivided into height and width attention components. We apply an approach similar to that used in the channel attention module: we obtain a one-dimensional height information map that aggregates the channel and width information by GAP and then obtain the target height feature and the aggregated features of neighboring heights with a 1D convolution kernel. Finally, the height weights are applied to produce the output feature map. The steps for width attention are the same. This design fully extracts the coordinate information while extracting the H and W coordinate positions, and it does not ignore the implicit relationship between coordinate information and channel information. Finally, to prevent the channel features from weakening after the height and width information has been extracted, we perform channel feature extraction once more; the correctness of this choice is also verified in the subsequent experiments. The spatial attention is calculated as follows:
$f_h[F] = \delta(C1D_{k_2}(avgpool(F))) = \delta(C1D_{k_2}(y^h))$ (13)

$f_w[F] = \delta(C1D_{k_2}(avgpool(F))) = \delta(C1D_{k_2}(y^w))$ (14)

where $y^h$ and $y^w$ denote GAP over the (C, W) and (C, H) dimensions, respectively.
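To make the composition in Equations (7)–(14) concrete, here is a minimal PyTorch sketch of the ESCA module under our reading of the text: each branch applies GAP over the other two dimensions, a 1D convolution, and a sigmoid, in the symmetric channel–height–width–channel order. Whether the two channel branches share weights, and the exact handling of $k_1$ and $k_2$, are implementation details not fully specified here, so treat this as an illustrative sketch rather than the authors' reference code.

```python
import torch
import torch.nn as nn

class ESCA(nn.Module):
    """Sketch of ESCA: 1D attention along C, H, W, then C again (Eqs. (7)-(10))."""
    def __init__(self, k1: int = 5, k2: int = 5):
        super().__init__()
        self.conv_c1 = nn.Conv1d(1, 1, k1, padding=k1 // 2, bias=False)  # f_c (first)
        self.conv_h = nn.Conv1d(1, 1, k2, padding=k2 // 2, bias=False)   # f_h
        self.conv_w = nn.Conv1d(1, 1, k2, padding=k2 // 2, bias=False)   # f_w
        self.conv_c2 = nn.Conv1d(1, 1, k1, padding=k1 // 2, bias=False)  # f_c (second)

    @staticmethod
    def _attend(x: torch.Tensor, conv: nn.Conv1d, dim: int) -> torch.Tensor:
        # GAP over the two dimensions other than `dim`, 1D conv along `dim`,
        # sigmoid, then broadcast the weights back onto x (Eqs. (11), (13), (14)).
        other = [d for d in (1, 2, 3) if d != dim]
        y = x.mean(dim=other)                                   # (B, size of dim)
        w = torch.sigmoid(conv(y.unsqueeze(1))).squeeze(1)      # (B, size of dim)
        shape = [x.shape[0], 1, 1, 1]
        shape[dim] = x.shape[dim]
        return x * w.view(shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self._attend(x, self.conv_c1, dim=1)     # F1: channel attention
        x = self._attend(x, self.conv_h, dim=2)      # F2: height attention
        x = self._attend(x, self.conv_w, dim=3)      # F3: width attention
        return self._attend(x, self.conv_c2, dim=1)  # F_out: channel attention again

# Quick shape check on a dummy feature map.
if __name__ == "__main__":
    out = ESCA()(torch.randn(2, 256, 14, 14))
    print(out.shape)  # torch.Size([2, 256, 14, 14])
```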

3.3. Discussion

For the input feature map, we perform feature extraction of spatial location information and channel information in a complementary order and fuse them to improve attention extraction, thereby addressing the “what” and “where” problems. Our experiments show that channel attention contributes more than spatial attention, so we set the first operation as the channel attention module and add another channel attention layer at the end as a symmetric design, ensuring that channel attention is not diluted by the spatial attention. We discuss the experimental results of this module under various conditions in detail in the next section.

4. Experiments

In this section, we evaluate ESCA on the Mini ImageNet and CIFAR-10 datasets for classification tasks and on the VOC 2007 dataset for object detection tasks. We re-implemented all evaluated neural networks using the PyTorch framework and present our reproducible experimental results. To comprehensively assess the effectiveness of the ESCA module, we conducted an extensive set of ablation experiments. Our findings confirm that the generalization capability of the ESCA module surpasses all baselines without additional attention mechanisms, as well as neural networks enhanced with attention modules including SE, CBAM, CA, ECA, and EMA, showcasing the universal applicability of the ESCA module across diverse architectures. We then verify the rationality of several design decisions in the ESCA module, such as the negative effect of max pooling and the selection of the 1D convolution kernel size k. Next, we verify the generality of the module on a different dataset, CIFAR-10. Finally, we conduct object detection on the VOC 2007 dataset to verify that the ESCA module improves the network for different tasks and that it can be seamlessly inserted at any position of the architecture within a certain range.

4.1. Experiments Preparation

The first experiment we conducted was a classification task on the Mini ImageNet dataset, which consists of 100 different classes of images. The training set consists of 50,000 images, the validation set comprises 10,000 images, and the test set comprises 10,000 images. To accurately and comprehensively evaluate the influence of the ESCA attention module on this task, we select four well-known CNNs as the backbone models: ResNet-50, ResNet-101, MobileNetV3, and YOLOv8. For training with the ESCA module added to the ResNets and YOLOv8, we refer to the settings of the related hyperparameters and data augmentation in [10,11,14,16]. The specific hyperparameters are as follows: we randomly crop the input image to 224 × 224 and then flip it horizontally; we train the model for 24 epochs with an initial learning rate of 0.01, decreasing to 0.001; the batch size is 256, the weight decay is 1 × 10−4, and the momentum is 0.9. Detailed parameters are shown in Table 1. To add the ESCA module to MobileNetV3, we followed the hyperparameters set up in [24] and trained the network with stochastic gradient descent (SGD) for 400 epochs, using a weight decay of 4 × 10−5, a momentum of 0.9, and a batch size of 96. The starting learning rate was set to 0.045, and the learning rate decreased with a linear decay rate of 0.98. After training, we assessed the network’s generalization performance using Top-1 and Top-5 accuracy metrics. We then used ResNet-50 and YOLOv8n as backbones with the YOLOv8 detector as the detection head to form the detection network and validated the effectiveness of our model on object detection using the VOC 2007 dataset. Our model code implementation is based on the Ultralytics (YOLOv8) repository. All experiments were conducted on a PC with an Nvidia GeForce RTX 2060 GPU and an Intel processor.
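For reference, the following is a minimal sketch of this classification training setup in PyTorch, assuming a plain ResNet-50 stand-in for the ESCA-augmented model and a simple linear learning-rate decay from 0.01 to 0.001 over the 24 epochs; the dataset path and the exact schedule shape are our assumptions, not specified by the paper.

```python
import torch
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.models import resnet50

# Data augmentation described above: random 224x224 crop and horizontal flip.
transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_loader = DataLoader(ImageFolder("mini-imagenet/train", transform),  # placeholder path
                          batch_size=256, shuffle=True, num_workers=8)

model = resnet50(num_classes=100)  # stand-in; the paper inserts ESCA modules into this backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate from 0.01 to 0.001 over 24 epochs (assumed linear schedule).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: 1.0 - 0.9 * min(epoch, 23) / 23)

for epoch in range(24):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```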

4.2. Image Classification on Mini ImageNet

In this short section, we first verify the influence of GAP and GMP on the ESCA module and then verify the effectiveness of the 1D convolution kernel size k and of Equation (12). Finally, we embed our ESCA module, along with attention modules of the same type for comparison, into the ResNet-50, ResNet-101, MobileNetV3, and YOLOv8-cls networks.

4.2.1. Effect of GAP and GMP on the ESCA Module

As shown in Equation (11), our ESCA module uses GAP alone in each attention sub-module to map the feature values of the auxiliary dimensions onto the attended dimension. In this part, we evaluate the impact of this choice on our ESCA module and confirm the efficacy of using GAP alone. In the experiment, we use ResNet-50 and YOLOv8-cls as the backbone models and train on the Mini ImageNet dataset with GAP alone, GMP alone, and a fused GAP-and-GMP design. The results are shown in Table 2.
From the table, we can draw the following conclusions. Firstly, using GAP alone for the integrated mapping of channel features achieves very good TOP-1 and TOP-5 accuracy for both ResNet-50 and YOLOv8-cls. However, GMP alone and GAP plus GMP both perform poorly, even worse than the baseline model without an attention module. We believe that GMP pays more attention to the most salient local features and ignores the overall information; it suits situations that emphasize specific local details but not classification tasks that require more global perception. Therefore, adding GMP leads to overly concentrated attention and poor generalization. GAP, by contrast, distributes attention more evenly and is better suited to classification tasks. These results demonstrate that using GAP alone in our modules yields better and more stable results and confirm the rationality of the model design.

4.2.2. Impact of 1D Kernel Size on ESCA

As depicted in Equation (12), the ESCA module employs a parameter governing the size of the 1D convolutional kernel. In this brief section, we evaluate the impact of this kernel size on our ESCA module and confirm the effectiveness and rationale of the adaptive kernel size selection method described in Equation (12). In the experiment, we use YOLOv8 as the backbone model. Since the channel dimension C usually differs from the spatial dimensions H and W, we use two 1D convolution kernels of different sizes, $k_1$ and $k_2$, each ranging from 3 to 7, and then train the ESCA module. The results are shown in Table 3.
From the table, we can draw the following conclusions. Firstly, regardless of the values of $k_1$ and $k_2$, the ESCA module always outperforms the baseline model, which verifies the positive effect of the 1D convolution kernel on feature extraction. Secondly, the model achieves the best results when $k_1$ and $k_2$ are both equal to 5. We believe that when the number of channel feature values is large, only a relatively large 1D convolution kernel can fully express the feature values of a channel and its related channels. Finally, calculation shows that these values of $k_1$ and $k_2$ are in line with Equation (12), which verifies that the adaptively determined kernel size matches the best fixed kernel size while avoiding manual tuning of k by cross-validation. These experimental results illustrate the efficacy of adaptive kernel size selection in achieving improved and consistent outcomes.

4.2.3. Contrasts Using Different Networks

We train ESCA on the Mini ImageNet dataset using ResNet-50 and conduct comparative experiments with a wide range of advanced attention methods, including SE, CBAM, CA, ECA, and EMA. The comparison covers, first, the efficiency of the model (i.e., the number of network parameters, inference time (Time), floating-point operations (FLOPs), and the percentage increase in FLOPs over the baseline model) and, second, the generalization ability of the model (TOP-1 and TOP-5 accuracy). To ensure that each attention module plays its role in the same position, we plug each module into the network after the same layers. When the insertion is too shallow, the spatial feature map is too large and the number of channels too small, so the extracted channel weights are too broad and do not fall on specific features, and the extracted spatial attention is more likely to have negative effects because it is sensitive and difficult to learn with so few channels. Inserting the module too deep, where there are too many channels, can easily cause overfitting. To reduce the additional parameter burden and keep the cost of adding modules acceptable, we only add ESCA modules after layers 11, 23, and 41 of the ResNet-50 network; similarly, ESCA modules are added after layers 11, 23, and 92 of the ResNet-101 network. See Figure 3 for details. To train the individual models, we reproduced the individual modules based on the Ultralytics (YOLOv8) repository and executed them on identical computing infrastructure. The outcomes are presented in Table 4. Our ESCA module adds only a few parameters, increases FLOPs by only 0.14% and inference time by 0.55 ms, yet improves TOP-1 accuracy by 6.33% and TOP-5 accuracy by 3.25%. We believe this small parameter overhead is well worth the large improvement in accuracy. At the same time, compared with SE, CBAM, CA, ECA, and EMA, the ESCA module achieves the second-best timing performance and complexity and obtains better, more competitive accuracy.
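As an illustration of how such insertion can be done in code, the sketch below appends the ESCA module (from the sketch in Section 3.2) to selected stages of a torchvision ResNet-50. Mapping the paper's "after layers 11, 23, and 41" onto the ends of stages layer2–layer4 is our approximation of the insertion points, not the authors' exact placement, and the import path for ESCA is hypothetical.

```python
import torch.nn as nn
from torchvision.models import resnet50
from esca import ESCA  # hypothetical import of the ESCA class sketched in Section 3.2

def add_esca_to_resnet50(num_classes: int = 100, k1: int = 5, k2: int = 5) -> nn.Module:
    """Wrap selected ResNet-50 stages so each is followed by an ESCA module."""
    model = resnet50(num_classes=num_classes)
    for stage_name in ("layer2", "layer3", "layer4"):  # approximate insertion depths
        stage = getattr(model, stage_name)
        setattr(model, stage_name, nn.Sequential(stage, ESCA(k1, k2)))
    return model
```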
ResNet-101 is used as the backbone model, and we compared SE, CBAM, CA, ECA and EMA with ESCA. From Table 4, we can find that ESCA outperforms the original ResNet-101 by 4.16% in TOP-1 accuracy and 2.47% in TOP-5 accuracy and only increases FLOPs by 0.07% and inference time by 0.84 ms with almost the same model complexity. At the same time, the ESCA module is better than SE, CBAM, CA, and EMA in terms of timing performance and complexity and better than all experimental models in terms of accuracy. Focusing on ResNet-50 and ResNet-101 results showcases the efficacy of our ESCA module with the popular ResNet architecture.
In addition to the ResNet architecture, we also validate the efficacy of the ESCA module when integrated into a lightweight CNN architecture. For this purpose, we utilize MobileNetV3 as the backbone model and compare SE, CBAM, ECA, and EMA with ESCA. The insertion of the attention module follows the same practice as in the ResNet architecture: after layers 7 and 10 of the MobileNetV3 network. All the models were then trained under the same hyperparameter configuration. The results in Table 4 show that, after embedding, our ESCA module improves TOP-1 and TOP-5 accuracy by 2.09% and 1.53%, respectively, compared with the MobileNetV3 baseline, and it outperforms all experimental models in terms of accuracy. These results validate the effectiveness of our ESCA module on a lightweight CNN architecture.
Finally, we also verified the effectiveness of the ESCA module on the latest YOLOv8 architecture, adopting YOLOv8-cls as the backbone model and inserting ESCA after Stage Layer 2 and Stage Layer 3 of the YOLOv8-cls network. We compare SE, CBAM, CA, ECA, and EMA with ESCA. As Table 4 shows, all experimental modules improve the basic performance of the network, but our ESCA module improves TOP-1 and TOP-5 accuracy by 2.45% and 1.84% with only a 0.19% increase in FLOPs and a 0.85 ms increase in inference time. We also use heat map visualization to further illustrate our results (see Figure 4). These results further confirm the efficiency and effectiveness of our ESCA module across different architectures and verify that ESCA is an effective plug-and-play module.

4.2.4. Image Classification on CIFAR-10

The second experiment we conducted was a classification task on the CIFAR-10 dataset, which consists of 32 × 32-pixel images from 10 different classes. The training set comprises 50,000 images, while the validation set and the test set each include 10,000 images. To accurately and comprehensively evaluate the influence of the ESCA attention module on this task, we choose ResNet-50 and YOLOv8-cls as the backbone models. For training with the ESCA module added to ResNet-50 and YOLOv8-cls, the specific hyperparameters are as follows: we train the model for 24 epochs with an initial learning rate of 0.01 and a final learning rate of 0.001; the batch size is 256, with a weight decay of 4 × 10−5 and a momentum of 0.9. After training, we assess the network’s generalization using Top-1 and Top-5 accuracy metrics.
Applying ResNet-50 and YOLOv8-cls as backbone models, we compared SE, CBAM, and ECA with ESCA. As shown in Table 5, with ResNet-50 as the backbone, ESCA outperforms the original ResNet-50 by 0.94% in TOP-1 accuracy and 0.15% in TOP-5 accuracy, and FLOPs increase by only 0.012 G with almost the same model complexity. With YOLOv8-cls as the backbone, ESCA is 0.59% higher than the original YOLOv8-cls in TOP-1 accuracy and 0.22% higher in TOP-5 accuracy. The ESCA module thus achieves considerable gains in efficiency and accuracy compared with the other experimental modules. The results on ResNet-50 and YOLOv8-cls show that our ESCA module not only generalizes well across architectures but also works well across different datasets.

4.2.5. Object Detection on VOC 2007

Our final experiment involved an object-detection task on the VOC 2007 dataset, which contains 20 different object categories; images are resized to 640 × 640 pixels. The train set contains 5000 images, the val set contains 5000 images, and the test set also contains 5000 images. To accurately and comprehensively evaluate the influence of the ESCA attention module on this task, we select ResNet-50 and YOLOv8n as the backbones. For training with the ESCA module added to ResNet-50 and YOLOv8n, the specific hyperparameters are as follows: we train the model for 300 epochs with an initial learning rate of 0.1 and a final learning rate of 0.01; the batch size is 256, with a weight decay of 4 × 10−5 and a momentum of 0.9. After training, we assess the network’s generalization ability using the mAP(0.5) and mAP(0.5:0.95) accuracy metrics.
Applying ResNet-50 and YOLOv8n as backbone models, we compare SE, CBAM, and ECA with ESCA. As shown in Table 6, for the ResNet-50 backbone, ESCA is 1.96% higher than the original ResNet-50 in mAP(0.5) and 1.21% higher in mAP(0.5:0.95), and FLOPs increase by only 0.01 G with almost the same model complexity. For the YOLOv8n backbone, ESCA is 1.67% higher than the original YOLOv8n in mAP(0.5) and 0.72% higher in mAP(0.5:0.95). The ESCA module thus achieves considerable results in terms of efficiency and accuracy compared with the other experimental modules. The results on ResNet-50 and YOLOv8n show that our ESCA module achieves quite good results even on a recognition task with different networks, verifying the effectiveness of the attention module once again.

5. Conclusions

In this paper, we propose the ESCA module after systematically comparing attention mechanisms such as SE, CBAM, CA, ECA, and EMA. ESCA comprehensively addresses both channel and spatial attention in images, reinforcing channel attention through a symmetric model design. It utilizes flexible and lightweight 1D convolutions and avoids the detrimental GMP in feature mapping, resulting in good generalization and timing performance. Experimental results show that ESCA, as a lightweight, plug-and-play module, enhances the performance of various CNN architectures across different datasets, including the widely used ResNet, the lightweight MobileNetV3, and the latest YOLOv8-cls. Additionally, ESCA exhibits outstanding performance in object detection tasks. Our future work will further explore the application of ESCA in tasks such as semantic segmentation.

Author Contributions

Conceptualization, H.L.; Methodology, H.L.; Software, H.L.; Validation, H.L.; Investigation, H.L.; Resources, Y.Z. and Y.C.; Writing—original draft, H.L.; Writing—review and editing, Y.Z. and Y.C.; Visualization, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  3. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306. [Google Scholar] [CrossRef]
  4. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  5. Guo, Y.; Cao, X.; Liu, B.; Gao, M. Cloud Detection for Satellite Imagery Using Attention-Based U-Net Convolutional Neural Network. Symmetry 2020, 12, 1056. [Google Scholar] [CrossRef]
  6. Ayoub, S.; Gulzar, Y.; Reegu, F.A.; Turaev, S. Generating Image Captions Using Bahdanau Attention Mechanism and Transfer Learning. Symmetry 2022, 14, 2681. [Google Scholar] [CrossRef]
  7. Yang, W.; Yuan, Y.; Zhang, D.; Zheng, L.; Nie, F. An Effective Image Classification Method for Plant Diseases with Improved Channel Attention Mechanism aECAnet Based on Deep Learning. Symmetry 2024, 16, 451. [Google Scholar] [CrossRef]
  8. Wang, H.; Liu, J.; Tan, H.; Lou, J.; Liu, X.; Zhou, W.; Liu, H. Blind Image Quality Assessment via Adaptive Graph Attention. IEEE Trans. Circuits Syst. Video Technol. 2024. [Google Scholar] [CrossRef]
  9. Li, Y.; Yang, X.; Fu, J.; Yue, G.; Zhou, W. Deep Bi-directional Attention Network for Image Super-Resolution Quality Assessment. arXiv 2024, arXiv:2403.10406. [Google Scholar] [CrossRef]
  10. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  11. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII. Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19. [Google Scholar] [CrossRef]
  12. Li, Y.; Li, X.; Yang, J. Spatial Group-Wise Enhance: Enhancing Semantic Feature Learning in CNN. In Proceedings of the Computer Vision—ACCV 2022: 16th Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; Proceedings, Part V. Springer: Berlin/Heidelberg, Germany, 2023; pp. 316–332. [Google Scholar] [CrossRef]
  13. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIV. Springer: Berlin/Heidelberg, Germany, 2018; pp. 122–138. [Google Scholar] [CrossRef]
  14. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  15. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167. [Google Scholar] [CrossRef]
  16. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  17. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar] [CrossRef]
  18. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
  19. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3138–3147. [Google Scholar] [CrossRef]
  20. Zhang, Q.L.; Yang, Y.B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; pp. 2235–2239. [Google Scholar] [CrossRef]
  21. Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-deep Networks. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 6789–6801. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  23. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  26. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  27. Li, P.; Xie, J.; Wang, Q.; Zuo, W. Is Second-Order Information Helpful for Large-Scale Visual Recognition? In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2089–2097. [Google Scholar] [CrossRef]
  28. Li, Y.; Wang, N.; Liu, J.; Hou, X. Factorized Bilinear Models for Image Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2098–2106. [Google Scholar] [CrossRef]
  29. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  30. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning; AAAI Press: Washington, DC, USA, 2017; pp. 4278–4284. [Google Scholar]
  31. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  32. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global Second-Order Pooling Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3019–3028. [Google Scholar] [CrossRef]
  33. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 9423–9433. [Google Scholar]
  34. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  35. Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks With Spatial and Channel “Squeeze and Excitation” Blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549. [Google Scholar] [CrossRef]
  36. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1971–1980. [Google Scholar]
  37. Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double attention networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 350–359. [Google Scholar]
  38. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149. [Google Scholar] [CrossRef]
  39. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. Symmetric structure of an ESCA module. This module has four attention sub-modules, and the input image will be attention-extracted in the order of Channel-Spatial (Height-Width)-Channel attention.
Figure 2. A detailed diagram of the channel attention module. As illustrated in the figure, the channel attention module applies GAP to the input image, performs 1D convolution with a kernel k to obtain the weight map, and then multiplies it with the input features to produce the weighted feature.
Figure 3. Detailed structure of ResNet-50, ESCA-ResNet-50, and ESCA-ResNet-101.
Figure 4. Visualization of Grad-CAM. In the last layer of the YOLOv8-cls network with different attention, we employ Grad-CAM to visualize the input images, and it can be found that our ESCA module pays more attention to the key features of the target.
Table 1. Hyperparameter data for image classification experiments and object detection experiments.
| Hyperparameter | Image Classification | Object Detection |
| --- | --- | --- |
| Dataset | Mini ImageNet | VOC 2007 |
| Epochs | 24 | 300 |
| Batch size | 256 | 256 |
| Image size | 224 × 224 | 640 × 640 |
| Optimizer | SGD | SGD |
| Initial learning rate | 0.01 | 0.1 |
| Final learning rate | 0.001 | 0.01 |
| Momentum | 0.9 | 0.9 |
| Weight decay | 1 × 10−4 | 4 × 10−5 |
Table 2. Using ResNet-50 and YOLOv8-cls as the backbone model, our ESCA module verifies the generalization ability of the model after adding GAP alone, GMP alone and GAP and GMP together, and compares it with the base model.
| Method | Backbone | #.Param. | Top-1 | Top-5 | # FLOPs |
| --- | --- | --- | --- | --- | --- |
| Baseline | ResNet-50 | 26.26 M | 54.81% | 80.79% | 8.533 G |
| +GAP | ResNet-50 | 26.26 M | 61.14% | 84.04% | 8.545 G |
| +GMP | ResNet-50 | 26.26 M | 54.10% | 80.05% | 8.534 G |
| +GAP, GMP | ResNet-50 | 26.26 M | 54.62% | 80.12% | 8.538 G |
| Baseline | YOLOv8-cls | 45.25 M | 46.12% | 75.72% | 0.4241 G |
| +GAP | YOLOv8-cls | 45.25 M | 48.57% | 77.56% | 0.4249 G |
| +GMP | YOLOv8-cls | 45.25 M | 44.73% | 74.56% | 0.4242 G |
| +GAP, GMP | YOLOv8-cls | 45.25 M | 46.02% | 75.58% | 0.4245 G |
Table 3. Results of our ESCA module under different k settings using YOLOv8-cls as the backbone model and compared with YOLOv8-cls with no attention module as the baseline.
| Kernel Size | Backbone | #.Param. | Top-1 | Top-5 | # FLOPs |
| --- | --- | --- | --- | --- | --- |
| Baseline | YOLOv8-cls | 1.566 M | 46.12% | 75.72% | 0.4241 G |
| K = (3, 3) | YOLOv8-cls | 1.566 M | 47.59% | 76.57% | 0.4248 G |
| K = (3, 5) | YOLOv8-cls | 1.566 M | 47.60% | 76.86% | 0.4249 G |
| K = (3, 7) | YOLOv8-cls | 1.566 M | 47.08% | 76.16% | 0.4250 G |
| K = (5, 3) | YOLOv8-cls | 1.566 M | 46.80% | 76.08% | 0.4248 G |
| K = (5, 5) | YOLOv8-cls | 1.566 M | 48.57% | 77.56% | 0.4249 G |
| K = (5, 7) | YOLOv8-cls | 1.566 M | 48.53% | 77.03% | 0.4250 G |
| K = (7, 3) | YOLOv8-cls | 1.566 M | 47.91% | 76.81% | 0.4248 G |
| K = (7, 5) | YOLOv8-cls | 1.566 M | 46.99% | 76.49% | 0.4249 G |
| K = (7, 7) | YOLOv8-cls | 1.566 M | 46.93% | 75.93% | 0.4250 G |
Table 4. Training on the Mini ImageNet dataset for classification comparing model performance by parameters such as Top-1 and Top-5 accuracy, #.Param., #FLOPs and Time. Our ESCA module achieves very competitive results.
| Method | Backbone | #.Param. | Top-1 | Top-5 | # FLOPs | + FLOPs | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | ResNet-50 | 26.26 M | 54.81% | 80.79% | 8.533 G | 0 | 12.55 ms |
| +SE | ResNet-50 | 26.43 M | 58.57% | 82.81% | 8.553 G | +0.23% | 13.26 ms |
| +CBAM | ResNet-50 | 27.63 M | 59.38% | 83.33% | 8.672 G | +1.63% | 15.43 ms |
| +CA | ResNet-50 | 26.39 M | 57.07% | 81.83% | 8.585 G | +0.61% | 14.57 ms |
| +ECA | ResNet-50 | 26.26 M | 59.74% | 83.69% | 8.537 G | +0.05% | 12.76 ms |
| +EMA | ResNet-50 | 26.27 M | 54.58% | 80.14% | 8.917 G | +4.50% | 15.08 ms |
| +ESCA | ResNet-50 | 26.26 M | 61.14% | 84.04% | 8.545 G | +0.14% | 13.10 ms |
| Baseline | ResNet-101 | 45.25 M | 57.32% | 81.56% | 15.999 G | 0 | 21.53 ms |
| +SE | ResNet-101 | 45.42 M | 59.73% | 83.40% | 16.019 G | +0.13% | 22.56 ms |
| +CBAM | ResNet-101 | 46.63 M | 59.93% | 83.42% | 16.137 G | +0.86% | 23.16 ms |
| +CA | ResNet-101 | 45.39 M | 58.15% | 82.18% | 16.051 G | +0.33% | 22.89 ms |
| +ECA | ResNet-101 | 45.25 M | 60.65% | 83.69% | 16.002 G | +0.02% | 21.96 ms |
| +EMA | ResNet-101 | 45.27 M | 56.91% | 80.87% | 16.381 G | +2.51% | 22.97 ms |
| +ESCA | ResNet-101 | 45.25 M | 61.48% | 84.03% | 16.01 G | +0.07% | 22.37 ms |
| Baseline | MobileNetV3 | 0.782 M | 39.24% | 68.75% | 122.811 M | 0 | 10.33 ms |
| +SE | MobileNetV3 | 0.782 M | 40.36% | 70.04% | 122.832 M | +0.02% | 10.79 ms |
| +CBAM | MobileNetV3 | 0.783 M | 40.49% | 70.05% | 122.953 M | +0.12% | 11.08 ms |
| +ECA | MobileNetV3 | 0.782 M | 40.48% | 70.08% | 122.836 M | +0.02% | 10.74 ms |
| +EMA | MobileNetV3 | 0.782 M | 39.14% | 68.37% | 123.094 M | +0.23% | 11.01 ms |
| +ESCA | MobileNetV3 | 0.782 M | 41.33% | 70.28% | 122.909 M | +0.08% | 10.76 ms |
| Baseline | YOLOv8-cls | 1.566 M | 46.12% | 75.72% | 0.4241 G | 0 | 6.38 ms |
| +SE | YOLOv8-cls | 1.569 M | 46.99% | 76.20% | 0.4246 G | +0.12% | 6.67 ms |
| +CBAM | YOLOv8-cls | 1.587 M | 46.64% | 76.38% | 0.4266 G | +0.59% | 7.38 ms |
| +CA | YOLOv8-cls | 1.571 M | 47.45% | 76.37% | 0.4263 G | +0.52% | 8.65 ms |
| +ECA | YOLOv8-cls | 1.566 M | 46.73% | 76.69% | 0.4243 G | +0.05% | 7.01 ms |
| +EMA | YOLOv8-cls | 1.567 M | 45.81% | 75.16% | 0.4289 G | +1.13% | 7.90 ms |
| +ESCA | YOLOv8-cls | 1.566 M | 48.57% | 77.56% | 0.4249 G | +0.19% | 7.23 ms |
Table 5. Experimental results of our ESCA module on CIFAR-10 dataset using ResNet-50 and YOLOv8-cls as the backbone models. This experiment verifies the effectiveness of the ESCA module on datasets of different sizes by comparing the models with SE, CBAM, ECA, and ESCA and the baseline model without the attention module.
| Method | Backbone | #.Param. | Top-1 | Top-5 | # FLOPs |
| --- | --- | --- | --- | --- | --- |
| Baseline | ResNet-50 | 26.14 M | 85.49% | 99.32% | 8.522 G |
| +SE | ResNet-50 | 26.32 M | 86.40% | 99.45% | 8.542 G |
| +CBAM | ResNet-50 | 27.52 M | 86.13% | 99.36% | 8.661 G |
| +ECA | ResNet-50 | 26.14 M | 86.39% | 99.44% | 8.526 G |
| +ESCA | ResNet-50 | 26.14 M | 86.43% | 99.47% | 8.534 G |
| Baseline | YOLOv8-cls | 1.451 M | 81.93% | 99.00% | 0.413 G |
| +SE | YOLOv8-cls | 1.454 M | 82.21% | 98.99% | 0.413 G |
| +CBAM | YOLOv8-cls | 1.472 M | 82.27% | 99.22% | 0.415 G |
| +ECA | YOLOv8-cls | 1.451 M | 82.45% | 99.11% | 0.413 G |
| +ESCA | YOLOv8-cls | 1.451 M | 82.52% | 99.22% | 0.414 G |
Table 6. Experimental results of our ESCA module on the VOC 2007 dataset using ResNet-50 and YOLOv8n as backbone models. This experiment verifies the effectiveness of the ESCA module on object detection tasks by comparing the model with SE, CBAM, ECA, and ESCA and the baseline model without the attention module.
| Method | Backbone | #.Param. | # FLOPs | mAP(0.5) | mAP(0.5:0.95) |
| --- | --- | --- | --- | --- | --- |
| Baseline | ResNet-50 | 49.29 M | 18.62 G | 57.30% | 34.82% |
| +SE | ResNet-50 | 49.38 M | 18.64 G | 57.91% | 35.14% |
| +CBAM | ResNet-50 | 50.29 M | 18.69 G | 57.85% | 35.11% |
| +ECA | ResNet-50 | 49.29 M | 18.62 G | 58.87% | 35.41% |
| +ESCA | ResNet-50 | 49.29 M | 18.63 G | 59.24% | 36.03% |
| Baseline | YOLOv8n | 3.010 M | 0.993 G | 48.60% | 29.32% |
| +SE | YOLOv8n | 3.023 M | 1.007 G | 49.12% | 29.55% |
| +CBAM | YOLOv8n | 3.223 M | 1.092 G | 49.01% | 29.51% |
| +ECA | YOLOv8n | 3.015 M | 1.006 G | 49.56% | 29.91% |
| +ESCA | YOLOv8n | 3.157 M | 1.086 G | 50.27% | 30.04% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
