Article

A Lightweight Network for Human Pose Estimation Based on ECA Attention Mechanism

College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(1), 150; https://doi.org/10.3390/electronics13010150
Submission received: 4 October 2023 / Revised: 21 November 2023 / Accepted: 6 December 2023 / Published: 29 December 2023

Abstract: This paper addresses the problem that existing human pose estimation network models tend to grow in parameter count as they pursue higher prediction accuracy. We propose an optimized human pose estimation network model, called BDENet, which is based on the high-resolution network (HRNet). BDENet incorporates a bottleneck structure and depthwise separable convolution to reduce the number of parameters, and introduces the efficient channel attention (ECA) lightweight attention mechanism to enhance accuracy. We evaluate the proposed model on the MSCOCO dataset and compare it with HRNet. The experimental results demonstrate that BDENet reduces the number of parameters by 41.4% compared to HRNet while improving accuracy by 0.6%. These findings confirm that the proposed network model effectively improves accuracy while reducing the number of parameters.

1. Introduction

Human pose estimation plays a crucial role in the field of computer vision. Its primary objective is to detect the positions of major skeletal points in the human body from images or videos using pose estimation algorithms. This helps in understanding human actions and behaviors. With the introduction of advanced pose estimation methods, this technology has found applications in various domains including action recognition [1], augmented reality [2], and the gaming industry [3]. The pose estimation algorithm [4] aims to establish the correspondence between the coordinates of key points in the human body and the skeletal structure. Human pose estimation can be classified into two broad categories: traditional methods based on graph structure models [5], and more recent approaches utilizing deep learning methods [6]. The graph structure model algorithm aims to design a human body part detector using a graph structure and a deformation part model. It establishes connectivity between the various components through a graph model and continuously optimizes the graph structure model based on human kinematics. This approach offers advantages such as simplicity in implementation and fast matching speed. However, its accuracy tends to significantly decrease in complex background environments.
Deep learning-based methods for human pose estimation typically involve using convolutional neural networks (CNNs) [7] to extract features from human postures. These features are then used in regression or classification networks to predict the positions or postures of human joint points.
Deep learning-based methods for human pose estimation offer several advantages. Firstly, they can handle complex background environments and situations involving occlusion, making them more robust. Secondly, these methods can learn higher-level feature representations, allowing them to extract richer posture information [8]. Additionally, deep learning models have strong fitting abilities, thereby resulting in higher accuracy.
Due to the rapid advancements in deep learning, deep learning-based methods for human pose estimation have gradually replaced traditional methods and have become the mainstream approach in the field. DeepPose, introduced by Toshev et al. [9], addresses the challenge of human pose estimation by directly outputting the coordinates of each joint point of the human body, resulting in significant improvements in accuracy. However, it relies solely on coordinate regression for key point detection, and further enhancements are needed to improve detection effectiveness. The Convolutional Pose Machine (CPM) method proposed by Wei et al. [10] tackles the problem of feature gradient disappearance during training by incorporating intermediate supervision [11]. By providing a natural learning objective function, the CPM approach supplements the gradient of backpropagation and adjusts the learning process, leading to improved accuracy in human body key point detection. Another notable method, HRNet, proposed by Sun et al. [12], introduces a high-resolution network that connects parallel networks of different resolutions. This approach ensures that high-resolution feature maps are maintained throughout the entire network model, effectively enhancing the accuracy of predicting human body key points. While these methods have successfully improved accuracy in human pose estimation, they often do so by increasing the number of parameters. Consequently, there is a need to explore techniques that can enhance performance while reducing the number of parameters—a challenge worth considering in the ongoing advancement of human pose estimation networks.
In this paper, we introduce BDENet, a lightweight network model designed for human pose estimation. The primary objective of BDENet is to reduce the number of model parameters while maintaining high accuracy. Unlike prior lightweight models, our approach combines a high-resolution network (HRNet) with bottleneck modules [13] and depthwise separable convolution modules [14], striking a balance between high accuracy and a small model size. In the design of the BDENet basic module, the standard convolution in the bottleneck structure is replaced with a depthwise separable convolution module, which effectively reduces the number of parameters while preserving the model's high-resolution behavior. Additionally, this study introduces a lightweight and efficient channel attention mechanism, Efficient Channel Attention (ECA) [15], to enhance the accuracy of the pose estimation task. Comprehensive experiments show that BDENet achieves a significant reduction in model size while improving accuracy compared to previous lightweight models. In summary, this research proposes BDENet based on HRNet, and its contributions include the following:
1. By introducing ECA and embedding it into the backbone network, we enhance the focus on the people in the picture and suppress interference from non-target background information.
2. To meet the real-time and lightweight requirements of keypoint detection, we use depthwise separable convolution to optimize the convolution operations in the backbone network, which significantly reduces the model's computation and parameter counts.
3. We designed the BDENet module to significantly reduce the number of parameters while maintaining accuracy, processing image features more effectively and improving overall detection efficiency.
The subsequent sections of this article are organized as follows: Section 2 briefly introduces the relevant background. Section 3 introduces the BDENet model. In Section 4, we confirm the effectiveness of the BDENet model through comprehensive experiments and analysis. Finally, Section 5 is dedicated to discussing and summarizing our work.

2. Related Work

2.1. High-Resolution Network

The main goal of using cross-branch connections in this network architecture is to facilitate the exchange of information between different branches. These connections enable the high-resolution branch to obtain global pose structure information from the low-resolution branch. Simultaneously, the low-resolution branch can also benefit from the detailed feature information and edge information in the high-resolution branch through these connections. The utilization of a multi-branch parallel approach in this network architecture allows for the maintenance of high resolution while capitalizing on feature information from various resolutions. This greatly enhances the accuracy of pose estimation compared to traditional networks. Moreover, this architecture minimizes the loss of feature information [16], leading to more precise and reliable pose estimation results. The network should be optimized to ensure efficient information flow and collaboration between branches. Additionally, careful selection of appropriate loss functions [17] and optimization algorithms [18] is necessary to train and optimize the network effectively.
HRNet is a network architecture that focuses on multi-scale feature fusion. It consists of four stages, each with specific functionalities. The first stage is a high-resolution subnetwork, designed to process detailed information within the input image. From the second stage onward, a new branch subnet is added in each stage. The resolution of this new branch subnet is set to half of the lowest resolution in the previous stage. This ensures that the network maintains a multi-scale feature representation. The structure is shown in Figure 1.
Branched subnetworks with different resolutions in HRNet play a crucial role in accurately detecting features of various scales in images. By fusing features from these branch subnetworks, the network ensures a more comprehensive representation of human body features. To further process the semantic information expressed by images of different resolutions, HRNet incorporates exchange units into the multi-resolution subnet. The purpose of these exchange units is to facilitate the exchange and integration of information between different resolutions. This allows the network to effectively capture the semantic context [19] and improve the overall performance of human body feature representation. For a visual representation of the exchange units in HRNet, refer to Figure 2.
HRNet utilizes multi-scale feature fusion to capture features at various scales in images. It achieves this by employing branch subnetworks with different resolutions, allowing for the extraction of multi-scale features and enabling repeated interaction of information between feature maps of different scales. When fusing feature maps of different resolutions, HRNet uses bilinear interpolation to perform upsampling, ensuring that the feature maps match in size for subsequent operations, and 1 × 1 convolutions to unify the number of channels across feature maps. For downsampling, HRNet uses stride-2 convolutions, which reduce the resolution of the feature maps while maintaining their overall quality. If two branch subnets have the same resolution, no transformation is applied and they are merged directly. Starting from the second stage of the multi-resolution subnetwork, all subnetworks, including the backbone network, undergo multiple rounds of information fusion. This fusion process continues until the final feature map is produced as the output of the network.
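The fusion rules just described translate naturally into code. Below is a minimal PyTorch sketch of a single exchange-unit step under the conventions above; the module name FuseBranch and its interface are our own illustration, not HRNet's released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseBranch(nn.Module):
    """One exchange-unit step: bring a source branch (c_in channels) to the
    resolution and channel count (c_out) of a target branch one octave away.
    direction="up" means the source is lower resolution than the target;
    direction="down" means it is one octave higher."""

    def __init__(self, c_in, c_out, direction):
        super().__init__()
        self.direction = direction
        if direction == "up":
            # lower-resolution source: a 1x1 convolution unifies the channel
            # count; bilinear upsampling (in forward) restores the resolution
            self.proj = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        elif direction == "down":
            # higher-resolution source: a stride-2 3x3 convolution halves
            # the resolution while producing c_out channels
            self.proj = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2,
                                  padding=1, bias=False)
        else:
            # same resolution: no transform, branches are merged directly
            self.proj = nn.Identity()

    def forward(self, x, target_hw):
        x = self.proj(x)
        if self.direction == "up":
            x = F.interpolate(x, size=target_hw, mode="bilinear",
                              align_corners=False)
        return x  # the caller sums the aligned maps to complete the fusion
```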

2.2. Attention Mechanism

The introduction of the attention mechanism [20] into models allows for more accurate focus on crucial information, thus enhancing overall performance. In computer vision, attention mechanisms are extensively used in tasks such as image classification [21], target detection [22], and image generation [23]. Through attention mechanisms, models can learn to prioritize important image areas for classification or detection purposes, resulting in enhanced accuracy. Jaderberg et al. introduced the Spatial Transformer Network [24] (STN), which examines spatial attention. The main objective of this network is to extract valuable information from feature maps while reducing the impact of background information, thereby enhancing the overall performance of the network. Another approach, proposed by Hu et al., is the Squeeze-and-Excitation Network [25] (SENet), which focuses on channel attention. This network establishes connections between channels through compression and excitation; the resulting connections are weighted onto the channel feature map, improving the network's ability to represent information. Lastly, Wang et al. proposed the ECA module as an alternative attention mechanism that improves performance while adding only a small number of parameters compared to SENet. Despite its simplicity, the ECA module effectively emphasizes the important channel information, enhancing the overall performance of the network.

2.3. Lightweight Network

A lightweight network is a type of network architecture that aims to reduce the number and complexity of model parameters while still maintaining high accuracy. These networks explore various approaches to achieve this goal, including exploring different network structures and utilizing model compression techniques such as knowledge distillation [26] and pruning. The purpose of developing lightweight networks is to enable the application of deep learning technology in mobile and embedded devices like smart homes, security systems, autonomous driving, and video surveillance. Reducing the number of parameters and calculations in a network model is essential for enabling efficient deployment on resource-limited devices. This enables the model to be lightweight and ensures smooth execution on mobile terminals and embedded devices.
Two quantities are commonly used to measure a model's size and computational cost. Params is the number of network parameters; for a convolution with a k × k kernel, Cin input channels, and Cout output channels, it is given by Formula (1):
Params = (k × k × Cin + 1) × Cout  (1)
FLOPs is the number of floating-point operations. For an input feature map of size W × H × Cin, a convolution kernel of size k × k, and an output feature map of size W × H × Cout, the floating-point operation count of the convolution is as shown in Formula (2):
FLOPs = W × H × (k × k × Cin + 1) × Cout  (2)
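As a quick sanity check of Formulas (1) and (2), the short Python sketch below evaluates them for an example layer; the layer dimensions are our own illustration.

```python
def conv_params(k, c_in, c_out):
    # Formula (1): each of the c_out filters has k*k*c_in weights plus a bias
    return (k * k * c_in + 1) * c_out

def conv_flops(w, h, k, c_in, c_out):
    # Formula (2): each of the w*h*c_out output values costs
    # k*k*c_in multiply-accumulates plus the bias addition
    return w * h * (k * k * c_in + 1) * c_out

# Example: a 3x3 convolution from 64 to 128 channels on a 64x48 feature map
print(conv_params(3, 64, 128))         # 73856 parameters
print(conv_flops(64, 48, 3, 64, 128))  # 226885632 (~2.27e8 FLOPs)
```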
Indeed, lightweight networks [27] focus on reducing the number of parameters and calculations while ensuring accuracy comparable to the original backbone network. One common approach is to adjust the depth or width of the convolutional neural network (CNN).
Grouped convolution is a popular technique used in lightweight networks. It divides the feature map channels into multiple groups that are convolved independently (historically, each group was processed on a separate GPU). Finally, the results from the individual groups are combined to obtain the final output.
By employing this grouped convolution approach, lightweight networks can effectively distribute the computational workload, reducing the overall amount of calculations required and enabling parallel processing. This technique helps to improve the efficiency and speed of the network, making it suitable for resource-constrained devices.
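In PyTorch, for example, grouped convolution is exposed through the groups argument of nn.Conv2d; the sketch below, with illustrative channel counts of our own choosing, shows the parameter saving.

```python
import torch.nn as nn

c_in, c_out, g = 64, 128, 4  # illustrative sizes

# Standard convolution: every filter sees all 64 input channels
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Grouped convolution: channels are split into g independent groups,
# cutting the number of weights by roughly a factor of g
grouped = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, groups=g)

n = lambda m: sum(p.numel() for p in m.parameters())
print(n(standard), n(grouped))  # 73856 vs. 18560
```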
Depthwise separable convolution comprises two steps: depthwise convolution and pointwise convolution. In the depthwise convolution step, each input channel is convolved separately with its own filter, which reduces the number of parameters and the computational complexity. In the pointwise convolution step, a 1 × 1 convolution linearly combines the output channels obtained from the depthwise convolution. This helps capture relationships between channels and enables the model to achieve effects similar to regular convolutions.

3. The Proposed Model

This paper proposes the BDENet network model with the HRNet network as a reference. The overall structure of the network model is shown in Figure 3.
The BDENet network consists of four stages. First, a 256 × 256 input image is passed through two standard 3 × 3 convolutions, which reduce the resolution to a quarter of the original; this reduced resolution is set as the resolution of the first stage. Starting from the second stage, each stage adds a new branch to the existing branches. The resolution of the new branch is half of the lowest resolution in the previous stage, and its number of channels is twice that of the lowest-resolution branch in the previous stage.
The BDENet network adopts a parallel connection scheme and reduces the resolution gradually at each stage, which effectively prevents the large loss of feature information caused by excessive resolution reduction. The preprocessing stage of the network converts the original three-channel image into a 64-channel representation. The first stage uses four bottleneck modules to extract features and increases the number of channels from 64 to 256. From the second stage to the fourth stage, each stage uses the BDENet module to extract image feature information; the numbers of channels of the new branches become 64, 128, and 256, and their resolutions become 1/8, 1/16, and 1/32 of the original image, respectively.
In BDENet, a feature fusion module is applied after each stage to interactively fuse the feature information of the branches. This interaction lets each branch receive information from the other branches, reducing the loss of feature information.
Finally, the network converts the highest-resolution feature map into the final heatmap output. Through the feature fusion modules, BDENet realizes information interaction between branches with little information loss, enabling the network to extract useful features from the input image and generate heatmaps suited to the keypoint detection task.
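The stage-to-stage branch creation described above (halving the resolution, doubling the channels) is conventionally realized with a stride-2 convolution. A minimal sketch under that assumption, with a function name of our own choosing:

```python
import torch.nn as nn

def new_branch(c):
    """Transition that spawns a new branch from the lowest-resolution
    branch of the previous stage: the stride-2 convolution halves the
    spatial resolution while the channel count is doubled to 2c."""
    return nn.Sequential(
        nn.Conv2d(c, 2 * c, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(2 * c),
        nn.ReLU(inplace=True),
    )
```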

3.1. Depthwise Separable Convolution

Depthwise separable convolution is a commonly used operation in convolutional neural networks. Its basic principle is to perform feature extraction and feature combination through two consecutive convolutional layers. The structure is shown in Figure 4.
The main advantage of depthwise separable convolution is its ability to reduce the number of parameters while maintaining reasonable accuracy compared to traditional convolution operations. This reduction in parameters is achieved by performing separate convolutions for each input channel in the depthwise convolution step. By convolving each channel separately, the number of convolution kernels required is significantly reduced.
Additionally, the pointwise convolution step, which uses 1 × 1 convolution kernels, helps combine the feature maps obtained from the depthwise convolution. This combination of feature maps increases the expressive ability of the network by introducing cross-channel interactions. The pointwise convolution aggregates information across channels without affecting the spatial dimensions, thereby reducing the computational cost.
Since depthwise separable convolution is an effective convolution operation that can reduce the number of parameters while maintaining good accuracy, it is widely used in convolutional neural networks.
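The paper does not include code, but the two-step operation described in this subsection has a standard PyTorch form. The class below is a minimal sketch, and the parameter comparison at the end echoes the savings discussed above.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution (one filter per channel) followed by a
    1x1 pointwise convolution that mixes information across channels."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # depthwise: groups == c_in, so each input channel is convolved
        # with its own 3x3 filter
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1,
                                   groups=c_in, bias=False)
        # pointwise: a 1x1 convolution linearly combines the channels
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison with a standard 3x3 convolution, 64 -> 128 channels
n = lambda m: sum(p.numel() for p in m.parameters())
print(n(nn.Conv2d(64, 128, 3, padding=1, bias=False)))  # 73728
print(n(DepthwiseSeparableConv(64, 128)))               # 576 + 8192 = 8768
```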

3.2. ECA Attention

Although lightweight attention mechanisms have made progress in reducing computation and improving performance, further optimization and improvements are still necessary, especially for small devices.
ECA is one such technique that addresses this challenge. It employs one-dimensional convolutions with a size parameter k to capture relationships between local adjacent channels. The value of k is adaptively adjusted based on the number of input channels. Notably, the number of parameters brought by the one-dimensional convolution is negligible compared to the fully connected layer used in the SE (Squeeze-and-Excitation) structure.
The implementation process of ECA attention involves the following steps:
1. Perform a global average pooling operation on the input feature map, producing a 1 × 1 × C feature vector, where C represents the number of channels.
2. Apply a one-dimensional convolution of size k to the feature vector.
3. Apply the Sigmoid activation function to obtain a weight vector ω of size 1 × 1 × C, representing the weight of each channel.
4. Element-wise multiply the weight vector ω with the original input feature map to obtain the final output feature map.
Compared to other attention mechanisms, ECA attention offers simplicity and efficiency. It has minimal impact on the computational speed of the model. This is because ECA attention does not require complex element-wise multiplication and weighted sum operations. Instead, it calculates channel weights through global average pooling and one-dimensional convolution, resulting in a concise operation. The ECA attention structure diagram is shown in Figure 5.
In Figure 5, the diagram represents the ECA attention mechanism. The input feature is denoted by ‘x’, and the output feature is denoted by ‘y’.
To compute the attention weights for each channel, the input feature is first reduced by a global average pooling operation, denoted by ‘G’. This pooling operation shrinks the spatial dimensions of the feature map to 1 × 1 while preserving the channel information.
The pooled feature vector is then passed through a one-dimensional convolution with a kernel size of ‘k’; all channels share the weights of this convolution.
Next, the result of the convolution is passed through a Sigmoid activation function, which maps the values to a range between 0 and 1, yielding the attention weights (ω) of each channel. These weights signify the importance or relevance of each channel in the input feature.
Finally, the attention weights are element-wise multiplied (weighted) with the original input feature ‘x’. The result is the final output feature ‘y’, where the attention mechanism enriches the representation of the input feature by assigning different weights to each channel, and the formula is as shown in Equation (3):
ω = σ(Dk(y))  (3)
Here, Dk denotes a one-dimensional convolution whose kernel size k is also its number of parameters; σ represents the Sigmoid activation function; and y represents the feature vector obtained by global average pooling.
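The four steps above map directly onto a few lines of PyTorch. Below is a minimal sketch following the ECA-Net reference design; the adaptive choice of k from the channel count (with γ = 2 and b = 1) comes from the ECA paper [15], and the class name is our own.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1-D conv of size k -> Sigmoid ->
    channel-wise reweighting, with k chosen adaptively from C."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                    # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)          # step 1: GAP -> 1 x 1 x C
        self.conv = nn.Conv1d(1, 1, kernel_size=k,   # step 2: 1-D conv, shared
                              padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()                  # step 3: weights in (0, 1)

    def forward(self, x):
        y = self.pool(x)                             # (N, C, 1, 1)
        # treat the channel axis as a 1-D sequence for the convolution
        y = self.conv(y.squeeze(-1).transpose(-1, -2))
        w = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * w.expand_as(x)                    # step 4: reweight channels
```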

3.3. BDENet Module

From the second stage to the fourth stage, features are extracted by the BDENet module designed in this paper. The BDENet structure is shown in Figure 6.
The input feature map is first reduced in dimension by a 1 × 1 convolution. BDEBlock is derived from BasicBlock: the 3 × 3 standard convolution in the bottleneck structure is replaced with a 3 × 3 depthwise separable convolution to reduce the number of parameters. A 3 × 3 standard convolution then follows the depthwise separable convolution to further extract features, and another 3 × 3 standard convolution is placed on the residual connection. In image detection, different regions of an image contribute differently to the task; in human pose estimation, only the areas related to the human body, and in particular the key joint points, need the most attention. To preserve the prediction accuracy of the network model, the lightweight ECA attention is therefore introduced. In short, the depthwise separable convolution reduces the amount of computation by mapping feature channels, the BN operation introduces additional nonlinearity and makes the model more efficient during training, and the ADD operation quickly integrates feature map information. Finally, the ECA module both sharpens the focus on the target area and improves the generalization ability of the model. Through this overall design, the number of parameters is reduced while accuracy is ensured.
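Figure 6 cannot be reproduced in text, but the description above admits a direct reading in code. The sketch below is our interpretation of the BDEBlock, not the authors' released implementation; it reuses the DepthwiseSeparableConv and ECA classes from the earlier sketches, and the channel arguments are hypothetical.

```python
import torch.nn as nn

class BDEBlock(nn.Module):
    """Sketch of the block described above: 1x1 reduction, a 3x3 depthwise
    separable convolution in place of the bottleneck's standard 3x3, an
    extra standard 3x3 for further feature extraction, ECA attention, and
    a 3x3-convolved residual added back in (the ADD operation). The exact
    layer ordering is our reading of Figure 6."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.reduce = nn.Sequential(                     # 1x1 reduction
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.dsconv = nn.Sequential(                     # depthwise separable 3x3
            DepthwiseSeparableConv(c_mid, c_mid),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.extract = nn.Sequential(                    # extra standard 3x3
            nn.Conv2d(c_mid, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out))
        self.shortcut = nn.Sequential(                   # 3x3 on the residual path
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out))
        self.eca = ECA(c_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.extract(self.dsconv(self.reduce(x)))
        out = self.eca(out)              # reweight channels toward the person
        out = out + self.shortcut(x)     # ADD: fuse with the residual path
        return self.relu(out)
```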

4. Experimental Results and Analysis

4.1. Dataset

The MSCOCO dataset is a publicly available dataset extensively utilized in research areas including target detection, human keypoint detection, and semantic segmentation. It consists of more than 200,000 labeled images that are employed for training and validating diverse computer vision algorithms. For human body keypoint detection, the COCO dataset provides annotations for 17 key points on each human body instance. Such key points facilitate the analysis and comprehension of human posture, movement, and related factors. The dataset encompasses a diverse range of scenes and human poses, facilitating more comprehensive and accurate analysis. In this article, 118,287 human pose images from COCO2017 are used as the training set, 5000 images for validation, and 20,000 images for testing.

4.2. Evaluation Indicators

This article uses the Object Keypoint Similarity (OKS) metric officially provided with the MSCOCO dataset as the evaluation indicator.
Formula (4) is as follows:
OKS = [Σi exp(−di² / (2s²ki²)) · δ(vi > 0)] / [Σi δ(vi > 0)]  (4)
Here, di represents the Euclidean distance between the predicted key point and the corresponding actual key point; vi is the visibility flag of the actual key point; s is the object scale; and ki is a per-keypoint constant that controls falloff.
The value of OKS is in the range [0, 1]. When the prediction is absolutely accurate, OKS = 1. When the detection is too inaccurate, OKS = 0.
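Formula (4) translates directly into a few lines of NumPy. A minimal sketch follows; the array shapes assume the 17 COCO keypoints, and the function name and signature are our own.

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """Object Keypoint Similarity, Formula (4).
    pred, gt: (17, 2) predicted / ground-truth keypoint coordinates
    vis:      (17,) visibility flags v_i (> 0 means labeled)
    s:        object scale; k: (17,) per-keypoint constants k_i
    Assumes at least one labeled keypoint."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared distances d_i^2
    labeled = vis > 0                             # delta(v_i > 0)
    sim = np.exp(-d2 / (2 * s ** 2 * k ** 2))     # per-keypoint similarity
    return sim[labeled].sum() / labeled.sum()     # average over labeled points
```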
The standard OKS-based metrics are used in the experiments, including the following indicators: AP50, the prediction accuracy at an OKS threshold of 0.5; AP75, the prediction accuracy at an OKS threshold of 0.75; mAP, the mean of the prediction accuracies at OKS thresholds from 0.50 to 0.95 in steps of 0.05; APM, the keypoint detection accuracy for medium-sized objects; and APL, the keypoint detection accuracy for large-sized objects.

4.3. Experimental Environment and Settings

This article uses the PyTorch 1.10.0 deep learning framework; the operating system is Windows 11, the GPU is an RTX A4000, the system memory size is 16 GB, and the programming language is Python 3.9. The Adam optimizer is used to optimize the network model during training, with an initial learning rate of 0.0001. The network is trained for a total of 170 epochs, and the batch size on each GPU is 64.
Since the images in the dataset have different sizes, image preprocessing is used to standardize them. In this experiment, the images in the MSCOCO dataset were cropped to 256 × 192 pixels. To mitigate the problem of partially visible human figures in the MSCOCO dataset, data augmentation was applied to the training images, including random rotation in [−45°, 45°], random scaling in [0.65, 1.35], and random flipping.
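For illustration, the augmentation settings above can be expressed with torchvision transforms. This is only a sketch, not the authors' training code: a real pose estimation pipeline must apply the same geometric transforms to the keypoint coordinates, which is omitted here.

```python
import torchvision.transforms as T

# Illustrative image-only preprocessing matching the stated settings
train_transform = T.Compose([
    T.RandomRotation(degrees=45),                   # rotation in [-45, 45] degrees
    T.RandomAffine(degrees=0, scale=(0.65, 1.35)),  # random scaling
    T.RandomHorizontalFlip(p=0.5),                  # random flip
    T.Resize((256, 192)),                           # height x width = 256 x 192
    T.ToTensor(),
])
```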

4.4. Experimental Verification and Analysis

The proposed method is verified on the COCO dataset. The experimental results of different methods on the COCO validation set are shown in Table 1 below.
As can be seen from Table 1, compared with several currently popular network models, the BDENet proposed in this article achieves higher accuracy with fewer parameters. Compared with CPN, Simple Baseline, and other non-lightweight human pose estimation models, our model has fewer parameters and its accuracy increases by 6.4% and 4.8%, respectively. Compared with HRNet, the number of parameters of the proposed BDENet model is reduced by 41.4%, while the prediction accuracy increases by 0.6%; accuracy is thus effectively improved while the parameter count is reduced. Compared with the lightweight model Lite-HRNet, the accuracy increases by 10.6%, a significant performance gain.

4.5. Ablation Experiment

The ablation experiment in this article was trained and verified on the MSCOCO dataset. Since this article introduces both the bottleneck module incorporating depthwise separable convolution and the ECA attention mechanism in the BDEBlock, the ablation experiment compares the full model against a BDEBlock variant without the ECA attention mechanism and a variant with only the ECA attention. The ablation results are shown in Table 2 below.
As can be seen from Table 2, compared with HRNet, the number of BDENet parameters drops by 41.4%, the computational complexity drops by 28.2%, and mAP increases by 0.6%. Compared with the variant without ECA attention and the variant with only ECA attention, the full BDENet achieves the highest accuracy while its parameter count and computational complexity are nearly unchanged. Therefore, incorporating both the bottleneck structure with depthwise separable convolution and the ECA attention mechanism into BDENet is effective in reducing the number of parameters and improving model accuracy.

5. Visual Analysis

This article performs visual verification on the COCO2017 validation set. Figure 7 shows the visualization results of the BDENet human pose estimation method on the MSCOCO dataset.
The dots in Figure 7 mark the locations of the key points, and the red lines are the connections between key points.
The visualization results show that the BDENet human pose estimation method performs well under different backgrounds, different shooting angles, and in the presence of occlusions.

6. Conclusions

This paper presents the BDENet network, which aims to address the issue of excessive parameter counts in neural networks. The main contribution of this study is the design of the BDENet module. Within this module, depthwise separable convolutions are introduced to effectively reduce the number of parameters in the overall model. Additionally, the ECA attention mechanism is incorporated to enhance the network's focus on the characteristic information of the human body, thereby improving accuracy. Comparative experiments were carried out on the COCO dataset to evaluate the performance of BDENet. The results indicate that BDENet achieves a 0.6% increase in accuracy compared to the original HRNet while reducing the number of parameters by 41.4%. This notable improvement serves as evidence for the effectiveness of the proposed model. Moreover, the ablation experiments highlight the significance of integrating the ECA attention mechanism and the depthwise-separable-convolution bottleneck structure into BDENet, emphasizing the necessity of these design choices for optimizing performance. Future research will concentrate on further reducing the number of parameters in the model to enhance the efficiency and effectiveness of the network.

Author Contributions

X.J. designed and performed the experiments; X.J. provided analysis software; X.J. analyzed the data; X.J. organized the data and wrote the paper. Y.N. perfected the details of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Chongqing Normal University (Grant No.: 20XLB035).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The code in this article cannot be published due to privacy and can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar]
  2. Azuma, R.T.; Liu, Y.; Wang, Y.; Li, Y.; Lei, J.; Lin, L.; Wang, H.L.; Sengupta, K.; Kumar, P.; Sharma, R.; et al. A survey of augmented reality. Presence Virtual Augment. Real. 1997, 6, 355–385. [Google Scholar] [CrossRef]
  3. Feijoo, C.; Gómez-Barroso, J.-L.; Aguado, J.-M.; Ramos, S.J.T.P. Mobile gaming: Industry challenges and policy implications. Telecommun. Policy 2012, 36, 212–221. [Google Scholar] [CrossRef]
  4. Schweighofer, G.; Pinz, A. Robust pose estimation from a planar target. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2024–2030. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, R.; Mou, S.; Wang, X.; Xiao, W.; Ju, Q.; Shi, C.; Xie, X. Graph structure estimation neural networks. In Proceedings of the Web Conference 2021, Online, 12–23 April 2021; pp. 342–353. [Google Scholar]
  6. Nishani, E.; Çiço, B. Computer vision approaches based on deep learning and neural networks: Deep neural networks for video analysis of human pose estimation. In Proceedings of the 2017 6th Mediterranean Conference on Embedded Computing (MECO), Bar, Montenegro, 11–15 June 2017; pp. 1–4. [Google Scholar]
  7. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377. [Google Scholar] [CrossRef]
  8. Mittelstaedt, H. Origin and processing of postural information. Neurosci. Biobehav. Rev. 1998, 22, 473–478. [Google Scholar] [CrossRef] [PubMed]
  9. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  10. Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4724–4732. [Google Scholar]
  11. He, G.; Lan, Y.; Jiang, J.; Zhao, W.X.; Wen, J.-R. Improving multi-hop knowledge base question answering by learning intermediate supervision signals. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, 8–12 March 2021; pp. 553–561. [Google Scholar]
  12. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  13. Baldwin, C.Y. Bottlenecks, modules and dynamic architectural capabilities. In Harvard Business School Finance Working Paper; No. 15-028; Elsevier: Amsterdam, The Netherlands, 2015. [Google Scholar]
  14. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  15. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  16. Bidarra, R.; De Kraker, K.J.; Bronsvoort, W.F. Representation and management of feature information in a cellular model. Comput. Des. 1998, 30, 301–313. [Google Scholar] [CrossRef]
  17. Janocha, K.; Czarnecki, W.M. On loss functions for deep neural networks in classification. arXiv 2017, arXiv:1702.05659. [Google Scholar] [CrossRef]
  18. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  19. Becker, C.A. Semantic context effects in visual word recognition: An analysis of semantic strategies. Mem. Cogn. 1980, 8, 493–512. [Google Scholar] [CrossRef] [PubMed]
  20. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  21. Lu, D.; Weng, Q. A survey of image classification methods and techniques for improving classification performance. Int. J. Remote. Sens. 2007, 28, 823–870. [Google Scholar] [CrossRef]
  22. Bekkerman, I.; Tabrikian, J. Target detection and localization using MIMO radars and sonars. IEEE Trans. Signal Process. 2006, 54, 3873–3883. [Google Scholar] [CrossRef]
  23. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.; Wierstra, D. Draw: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1462–1471. [Google Scholar]
  24. Mower, E.; Matarić, M.J.; Narayanan, S. A framework for automatic human emotion classification using emotion profiles. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 1057–1070. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  26. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  27. Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of research on lightweight convolutional neural networks. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1713–1720. [Google Scholar]
  28. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
  29. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  30. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
Figure 1. HRNet framework map.
Figure 2. Illustrating how the exchange unit aggregates the information for high, medium and low resolutions.
Figure 3. BDENet framework map.
Figure 4. Depthwise separable convolution structure diagram.
Figure 5. ECA framework map.
Figure 6. BDEblock framework map.
Figure 7. Visualization result.
Table 1. Comparison of the outcome of different methods.

| Network | Input Image Size/Pixels | Params/10⁶ | FLOPs/10⁹ | mAP/% | AP50/% | APM/% | APL/% | AR/% |
|---|---|---|---|---|---|---|---|---|
| CPN [28] | 256 × 192 | 27.0 | 6.2 | 68.6 | – | – | – | – |
| Simple Baseline [29] | 256 × 192 | 34.0 | 8.9 | 70.4 | 88.6 | 67.1 | 77.2 | 76.3 |
| Lite-HRNet [30] | 256 × 192 | 1.1 | 0.2 | 64.8 | 86.7 | 62.1 | 70.5 | 71.2 |
| HRNet [12] | 256 × 192 | 28.5 | 7.1 | 74.4 | 90.5 | 70.8 | 80.1 | 79.8 |
| BDENet | 256 × 192 | 16.7 | 5.1 | 75.0 | 93.6 | 82.6 | 78.1 | 79.6 |

– indicates a value not reported.
Table 2. Ablation experiment results.

| Measure | Input Image Size/Pixels | Params/10⁶ | FLOPs/10⁹ | mAP/% |
|---|---|---|---|---|
| HRNet | 256 × 192 | 28.5 | 7.1 | 74.4 |
| BDENet (no ECA attention) | 256 × 192 | 16.9 | 5.5 | 74.6 |
| BDENet (only ECA attention) | 256 × 192 | 16.7 | 5.4 | 74.7 |
| BDENet (this paper's model) | 256 × 192 | 16.7 | 5.1 | 75.0 |