Article

Accurately Estimate and Analyze Human Postures in Classroom Environments

by Zhaoyu Shou 1,2, Yongbo Yu 1, Dongxu Li 1, Jianwen Mo 1, Huibing Zhang 3, Jingwei Zhang 3 and Ziyong Wu 4,*
1 School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China
2 Guangxi Wireless Broadband Communication and Signal Processing Key Laboratory, Guilin University of Electronic Technology, Guilin 541004, China
3 School of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
4 School of Information Engineering, Nanning University, Nanning 541699, China
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 313; https://doi.org/10.3390/info16040313
Submission received: 14 March 2025 / Revised: 12 April 2025 / Accepted: 14 April 2025 / Published: 15 April 2025

Abstract:
Estimating human posture in crowded smart teaching environments is a fundamental technical challenge for measuring learners’ engagement levels. This work presents a model for detecting critical points of human posture in crowded situations using ECAv2-HRNet. The paper introduces ECAv2Net, which combines a channel feature reinforcement method with the ECANet attention mechanism; this innovation improves the performance of the network. ECAv2Net is then integrated into the high-resolution network HRNet to create ECAv2-HRNet. This fusion incorporates more useful feature information without increasing the number of model parameters. The paper also presents a human posture dataset called GUET CLASS PICTURE, which is designed for dense scenes. Experimental results on this dataset, as well as on a public dataset, demonstrate the superior performance of the proposed human posture estimation model based on ECAv2-HRNet.

1. Introduction

Learning engagement is a crucial measure for assessing learners’ involvement in acquiring knowledge through various body positions and is a fundamental aspect of intelligent education. Nevertheless, the high concentration of individuals in the classroom, together with the complexity and limited precision of the keypoint detection task, significantly intensifies the difficulty of posture estimation. Conventional manual image annotation techniques use kinematic analysis software to label each important point of human posture in sequential images, and the annotated data are then used for posture analysis; however, this process is both time-consuming and labor-intensive. The use of wearable electronic devices for posture analysis [1] is hindered by the devices being expensive and inconvenient to wear. The emergence of artificial intelligence has sparked significant interest in human posture estimation through computer vision technology [2,3], an approach that is particularly popular due to its cost-effectiveness, simplicity, and high accuracy.
With the rapid development of artificial intelligence in recent years [4,5], human pose estimation has become a very important research direction in computer vision, including 2D human pose estimation [6,7,8] and 3D human pose estimation [9,10,11]. Typically, 2D pose estimation uses images from a single orientation, while 3D pose estimation combines images from multiple orientations. In smart classroom scenarios, however, pose estimation must rely on images from a single orientation to detect numerous, densely packed human posture keypoints, which makes multi-person posture estimation in this scenario particularly challenging.
This study employs the top-down approach to human pose estimation [12]. First, it performs target detection to locate and isolate human targets [13]. Second, it performs human pose estimation to determine the individual keypoints of each human pose. Current training datasets for pose estimation contain at most 20 human poses per image. However, in a densely populated smart classroom, where the number of people can reach 60 or more, dense multi-person pose estimation with such data is not effective enough to achieve the desired goal. This paper addresses the problem by labeling the 11 unoccluded keypoints in multi-person dense images of a smart classroom scenario. These labeled keypoints are integrated into the GUET CLASS PICTURE dataset, which enhances the pose estimation model’s ability to accurately recognize dense human body poses in a classroom scenario. Current large-scale models for human posture keypoint estimation achieve the highest mean average precision (mAP) [14], but their large number of parameters leads to significant resource wastage. Small-scale models have fewer parameters [8] but struggle to reach the desired mAP. Small- and medium-sized models attempt to strike a balance between the two [15]. A recent study introduced ECA-HRNet (Efficient Channel Attention-High Resolution Network) [16], which achieves a high mAP by incorporating ECANet into the HRNet backbone network. ECANet [17] enables localized cross-channel interaction without dimensionality reduction and requires only a small number of parameters, resulting in efficient performance improvements. Nevertheless, the approach is limited in its ability to accurately extract and reinforce feature information. This study introduces ECAv2Net, a novel approach that incorporates a two-layer squeeze to enrich feature information. Building on the advantages of ECANet, ECAv2Net is integrated into the HRNet backbone network to improve several performance metrics, including mAP. YOLOv8 is used to detect human targets in images of the smart classroom [18], and the human posture estimation model is then employed to accurately locate keypoints such as the nose and the left and right shoulders. The obtained coordinate information is used to assess whether a learner is raising or lowering their head, turning their head, or slouching. This analysis provides valuable technical support for the human posture analysis component of the smart education system. This paper’s primary contributions are as follows:
  • A method for enhancing features is proposed and integrated with ECANet to improve the efficiency of the network during feature extraction. This method enhances the local cross-channel interaction strength of the module, resulting in significant performance improvement for ECAv2Net while using a minimal number of parameters.
  • By integrating the ECAv2Net module into the HRNet backbone network, we introduce the ECAv2-HRNet model for human posture assessment. This model enhances the speed of model convergence without adding more model parameters, and it achieves higher accuracy compared to the ECA-HRNet model.
  • To address the task of estimating the posture of a large group of people in a smart classroom, we created a dataset called GUET CLASS PICTURE. This dataset consists of nearly 10,000 annotated human postures, specifically focusing on identifying whether the learner is in a head-up or head-down position, as well as detecting turning and slouching movements. The dataset includes keypoints such as the nose, left and right eyes, left and right ears, left and right shoulders, left and right elbows, and left and right wrists, totaling 11 points.
The remainder of this paper is structured as follows: Section 2 provides a concise overview of research related to this study. Section 3 explains the proposed method in detail, including a description of ECAv2Net and how it is integrated into HRNet. Section 4 presents the datasets. Section 5 presents the experimental results and the application of the model. Section 6 discusses the limitations of the proposed model, the challenges encountered, and future research directions. Section 7 summarizes the contributions of this work.

2. Related Work

Convolutional neural networks (CNNs) [19,20,21] have outperformed previous algorithms in human posture estimation in recent years. Recent research categorizes these approaches into two groups: bottom-up pose estimation methods [22], which directly identify all keypoints and then assemble them into individual human poses, and which are faster but slightly less precise; and top-down pose estimation methods [23], which first detect the human body with a target detection model and then perform pose estimation for each body separately, offering high accuracy at a slightly slower speed.
In the study of human pose estimation, Xiao et al. [24] incorporated multiple deconvolutional layers into a ResNet backbone network to create a straightforward yet powerful architecture for representing high-resolution heatmaps. Cai et al. [25] proposed a multi-stage network that incorporates a residual step network (RSN) module. This module enables the network to acquire detailed localized information by utilizing an efficient intra-layer feature fusion technique, while the pose refinement machine (PRM) module reveals a trade-off in feature information between local and global aspects. Wang et al. [15] introduced a high-resolution network (HRNet) that enhances semantic richness and spatial accuracy by connecting high-resolution and low-resolution convolutional streams in parallel and repeatedly exchanging information across resolutions. HRNet demonstrates excellent performance in applications such as human pose estimation, semantic segmentation, and target detection. Long et al. [26] contended that low-resolution feature maps possess the most potent semantic information and must traverse more layers to be merged with high-resolution features, and that convolutional layers are computationally expensive for high-resolution features. Consequently, a U-shaped high-resolution network (U-HRNet) was developed to enhance the performance of the backbone network by incorporating additional residual modules after the feature map with the most significant semantic information. The network computes features at different resolutions in parallel and feeds them into the residual modules; notably, the low-resolution features receive more computation, resulting in improved performance. Artacho et al. [27] enhanced the multiscale feature representations obtained by the waterfall module by employing progressive filtering in a cascade architecture, achieving the same multiscale field of view as the spatial pyramid pooling module. They then combined cross-scale contextual information and joint localization information from a multiscale feature extractor and used Gaussian heatmap modulation to achieve highly accurate human pose estimation. Yu et al. [28] proposed the Lite-HRNet human pose estimation model by applying the efficient shuffle block of ShuffleNet to the HRNet backbone network, achieving high accuracy with very few parameters.
In the study of attention mechanisms, Woo et al. [29] introduced a convolutional block attention (CBAM) module to leverage attention features in both channel and spatial dimensions. This module enhances the input feature maps by multiplying them with the attention features, resulting in adaptive feature refinement. The CBAM module can be seamlessly integrated into any CNN architecture and leads to improved overall performance. Hu et al. [30] introduced a squeeze excitation module called SENet, which effectively adjusts the channel feature responses by explicitly modeling the relationships between the channels. When integrated into a backbone network, this module demonstrates notable performance enhancements. Feng et al. [31] enhanced the squeeze excitation module by transforming it into a dual squeeze excitation module (S2ENet) and integrated it into a ResNet backbone network. This modification resulted in a superior performance compared to the single squeeze excitation module. Jin et al. [32] introduced a two-stage spatial pooling design to enhance descriptor extraction and information fusion in SENet. This architecture allows the excitation module to provide more precise reweighted scores based on data and improves the squeeze excitation module. Wang et al. [17] introduced an efficient channel attention module (ECANet) to address the challenge of balancing performance and complexity. This module effectively incorporates a local cross-channel interaction strategy without reducing dimensionality through one-dimensional convolution. As a result, the module achieves significant performance improvements with minimal parameter involvement. Bao et al. [16] integrated ECANet into the HRNet backbone network to enhance the precision of pose detection and to assess and analyze the poses of a rapidly moving human body in a skiing scenario.
To summarize, current models improve accuracy by increasing the complexity of model parameters. However, achieving high accuracy with a moderate number of parameters is challenging. Additionally, existing models struggle to maintain performance in dense human posture scenarios. To address these issues, this paper proposes a feature enhancement approach based on ECANet and HRNet. The resulting ECAv2-HRNet outperforms existing models in terms of accuracy and feasibility.

3. ECAv2-HRNet Architecture

3.1. The ECANet Attention Mechanism

To avoid degrading channel features, ECANet employs a channel feature representation without dimensionality reduction. It achieves local cross-channel interaction through one-dimensional convolution and determines the interaction range by adaptively selecting the kernel size of that convolution. The module relies on only a small number of parameters to process the interaction of channel feature information effectively. However, it has limitations in accurately extracting and reinforcing channel features. This paper therefore introduces a feature enhancement method that gives ECAv2Net both a small parameter count and a strong feature extraction capability.

3.2. ECAv2Net Attention Mechanism

Assuming an aggregated feature parameter matrix without dimensionality reduction $y \in \mathbb{R}^C$, the channel attention is shown in Equation (1):
$$\omega = \sigma(\mathbf{W} y)$$
First, ECANet learns channel attention with a band matrix $\mathbf{W}_k$ in order to realize local cross-channel interaction, as shown in Equation (2):
$$\mathbf{W}_k = \begin{bmatrix} w^{1,1} & \cdots & w^{1,k} & 0 & \cdots & 0 \\ 0 & w^{2,2} & \cdots & w^{2,k+1} & \cdots & 0 \\ \vdots & & \ddots & & \ddots & \vdots \\ 0 & \cdots & 0 & w^{C,C-k+1} & \cdots & w^{C,C} \end{bmatrix}$$
$\mathbf{W}_k$ contains $k \times C$ parameters, fewer than a general attention mechanism requires, and avoids complete independence between different channel groups, as shown in Equation (3):
$$\omega_i = \sigma\left( \sum_{j=1}^{k} w_i^j y_i^j \right), \quad y_i^j \in \Omega_i^k$$
where $\Omega_i^k$ denotes the set of $k$ neighboring channels of $y_i$. A more efficient approach is to make all channels share the same learning parameters, as shown in Equation (4):
$$\omega_i = \sigma\left( \sum_{j=1}^{k} w^j y_i^j \right), \quad y_i^j \in \Omega_i^k$$
This approach can be realized by a fast 1D convolution with kernel size $k$, as shown in Equation (5):
$$\omega = \sigma\left( \mathrm{C1D}_k(y) \right)$$
where $\mathrm{C1D}$ denotes one-dimensional convolution; the approach involves only $k$ parameters. When $k = 3$, the ECA module achieves results similar to the SE module but with lower model complexity, which ensures both efficiency and effectiveness through appropriate local cross-channel interaction. Local cross-channel interaction requires a constraint on its range, i.e., on the size of the one-dimensional convolution kernel. In different convolutional neural network architectures, the appropriate interaction range varies with the number of channels, and manual tuning by cross-validation is computationally expensive. Extensive experiments with group convolutions in convolutional neural networks have demonstrated that the coverage of interaction is proportional to the channel dimension, so there is a mapping $\Phi$ between $k$ and $C$, as shown in Equation (6):
$$C = \Phi(k)$$
Since linear functions characterize only a limited number of mapping relations, such as $\Phi(k) = \gamma \times k - b$, the mapping is extended to a nonlinear function, as shown in Equation (7):
$$C = \Phi(k) = 2^{\gamma \times k - b}$$
That is, given the number of channels $C$, the size of the convolution kernel is determined adaptively, as shown in Equation (8):
$$k = \Phi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$$
where $|\cdot|_{\mathrm{odd}}$ denotes taking the odd number nearest to the enclosed value. The experimental study of Wang et al. [17] found that setting $\gamma$ and $b$ to 2 and 1 is most appropriate. Through the nonlinear mapping $\Phi$, high-dimensional channels obtain longer-range interactions while low-dimensional channels obtain shorter-range interactions. As a result, ECANet has the property of local cross-channel interaction, and the interaction range $k$ can be determined adaptively from the channel number $C$.
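To make the adaptive kernel-size rule of Equation (8) and the single-squeeze channel attention concrete, the following is a minimal PyTorch sketch (PyTorch being the framework used later in the experiments). It is an illustrative reconstruction of an ECA-style layer rather than the authors' released code; the class name ECALayer and the defaults γ = 2, b = 1 follow Wang et al. [17].

```python
import math
import torch
import torch.nn as nn


class ECALayer(nn.Module):
    """Efficient Channel Attention: 1-D convolution over channel descriptors,
    with the kernel size chosen adaptively from the channel count (Equation (8))."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # k = |log2(C)/gamma + b/gamma|_odd  (nearest odd number)
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)           # single (full) squeeze
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b_, c, _, _ = x.shape
        y = self.avg_pool(x).view(b_, 1, c)               # (B, 1, C) channel descriptor
        w = self.sigmoid(self.conv(y)).view(b_, c, 1, 1)  # local cross-channel interaction
        return x * w                                      # reweight the channels


if __name__ == "__main__":
    feat = torch.randn(2, 32, 64, 48)      # e.g., HRNet base channel number C = 32
    print(ECALayer(32)(feat).shape)        # torch.Size([2, 32, 64, 48]); here k = 3
```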
ECANet uses a single squeeze to extract feature information from the entire channel. However, this approach has limitations in accurately extracting features, particularly in the deep regions of the network: a single squeeze weakens the expression of certain features, reducing the efficiency of channel information interaction. This paper proposes a double-squeeze method to enhance the expressive ability of features, applying one half squeeze followed by one full squeeze, which removes the factors that degrade the efficiency of channel information interaction. The improved module enhances the extraction accuracy of keypoint features while retaining the excellent characteristics of ECANet. The underlying concept of ECAv2Net can be summarized as follows: let the feature parameter matrix $M_{W,H,C}$ be as shown in Equation (9):
$$M_{W,H,C} = \left\{ M_1 = \begin{bmatrix} m_1^{1,1} & \cdots & m_1^{1,W} \\ \vdots & \ddots & \vdots \\ m_1^{H,1} & \cdots & m_1^{H,W} \end{bmatrix},\ M_2 = \begin{bmatrix} m_2^{1,1} & \cdots & m_2^{1,W} \\ \vdots & \ddots & \vdots \\ m_2^{H,1} & \cdots & m_2^{H,W} \end{bmatrix},\ \ldots,\ M_C = \begin{bmatrix} m_C^{1,1} & \cdots & m_C^{1,W} \\ \vdots & \ddots & \vdots \\ m_C^{H,1} & \cdots & m_C^{H,W} \end{bmatrix} \right\}$$
In the formula, $W$ and $H$ are the width and height of the feature parameter matrix, and $C$ indexes the different channels of the model. According to the rules of ECANet, after one squeeze, the eigenvalue $\alpha_N$ of the whole feature parameter matrix is obtained; its expression is shown in Equation (10):
$$\alpha_N = \frac{\sum_{i=1,\,j=1}^{W,\,H} m_N^{i,j}}{W \times H}, \quad N \in [1, C]$$
This study presents a method that first obtains a feature parameter matrix of shape $W_P \times H_P \times C$ after one half squeeze and then obtains the eigenvalue $\beta_N$ of shape $1 \times 1 \times C$ after one full squeeze. The expression for $\beta_N$ is shown in Equation (11):
$$\beta_N = \mathop{\mathrm{MAX}}_{i \to \bar{i},\; j \to \bar{j}} \left( m_N^{i,j} \right), \quad i = (\varepsilon - 1)\frac{W}{W_P} + 1, \; j = (\delta - 1)\frac{H}{H_P} + 1, \; \varepsilon \in [1, W_P], \; \delta \in [1, H_P], \; \bar{i} = \varepsilon \frac{W}{W_P}, \; \bar{j} = \delta \frac{H}{H_P}, \; N \in [1, C]$$
In the formula, $\varepsilon$ and $\delta$ denote natural numbers. The eigenvalues $\alpha_N$ and $\beta_N$ obtained from the two squeeze methods differ: when the values of the feature parameter matrix $M_{W,H,C}$ are uniformly distributed, $\alpha_N$ and $\beta_N$ are similar; when the values of $M_{W,H,C}$ are not uniformly distributed, $\beta_N > \alpha_N$. A representation of the $\alpha_N$ and $\beta_N$ feature extraction process is shown in Figure 1.
As shown in Figure 1, the double-squeeze feature extraction method first divides the feature information into four parts through an average squeeze and then applies a maximum squeeze, which extracts the strongest of the four partial features as $\beta_N$. The single-squeeze method directly averages the feature information, and the resulting mean is used as $\alpha_N$. Comparing the two methods, the double squeeze extracts more useful features during feature extraction, and its advantage over the single squeeze is clearest when the features are unevenly distributed.
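The double squeeze can be sketched as follows, again as an illustrative reconstruction rather than the authors' implementation: the half squeeze average-pools each channel to a small $W_P \times H_P$ grid, and the full squeeze then takes the maximum cell as $\beta_N$, which feeds the same one-dimensional convolution as in ECANet. The class name ECAv2Layer and the 2 × 2 grid (four parts, as in Figure 1) are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ECAv2Layer(nn.Module):
    """ECA-style attention whose channel descriptor comes from a double squeeze:
    average-pool to a small grid (half squeeze), then max over the grid (full squeeze)."""

    def __init__(self, channels: int, grid: int = 2, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.grid = grid                                   # W_P = H_P = grid (assumed 2)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b_, c, _, _ = x.shape
        half = F.adaptive_avg_pool2d(x, self.grid)         # (B, C, W_P, H_P): half squeeze
        beta = half.amax(dim=(2, 3)).view(b_, 1, c)        # full squeeze by max -> (B, 1, C)
        w = self.sigmoid(self.conv(beta)).view(b_, c, 1, 1)
        return x * w


if __name__ == "__main__":
    x = torch.randn(2, 256, 16, 12)                        # a deep-stage feature map
    print(ECAv2Layer(256)(x).shape)                        # torch.Size([2, 256, 16, 12])
```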

3.3. Feature Representation for Varying Depths of Convolutional Layers

Feature extraction is the central focus of the keypoint detection task. With the advancement of deep learning, the model’s depth increases, resulting in different feature distributions in the convolutional layers at various depths. Shallow convolutional layers typically yield uniformly distributed features as they are in the early stage of feature extraction. On the other hand, deep convolutional layers tend to produce non-uniformly distributed features as they are in the later stage of feature extraction, where certain keypoint features become more prominent. A uniform distribution refers to a probability distribution where all outcomes are equally likely to occur, as shown in Figure 2.
Two cases arise for the feature map. In case 1, the feature map parameter values are uniformly distributed, i.e., the variance $D(M_{W,H,C})$ is relatively small; in case 2, the feature map parameter values are not uniformly distributed and some features are prominent, i.e., the variance $D(M_{W,H,C})$ is large. This is shown in Equation (12):
$$\begin{cases} D\left(M_{W,H,C}\right) > \sigma \\ D\left(M_{W,H,C}\right) \le \sigma \end{cases}$$
where $\sigma$ denotes the threshold. When $D(M_{W,H,C}) \le \sigma$, $\beta_N \approx \alpha_N$, and the feature extraction effect of the two methods is essentially the same; when $D(M_{W,H,C}) > \sigma$, $\beta_N > \alpha_N$, and the proposed method enhances feature extraction and strengthens the eigenvalues.
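A small numerical check (with purely illustrative values) reproduces this behaviour: for a nearly uniform 4 × 4 map the two descriptors coincide, while for a map with one prominent response the double-squeeze value clearly exceeds the plain average.

```python
import torch
import torch.nn.functional as F


def alpha_beta(m: torch.Tensor, grid: int = 2):
    """alpha: single average squeeze; beta: average to a grid x grid map, then max."""
    alpha = m.mean()
    beta = F.adaptive_avg_pool2d(m[None, None], grid).max()
    return alpha.item(), beta.item()


uniform = torch.full((4, 4), 0.5)        # evenly distributed features
peaked = torch.full((4, 4), 0.1)
peaked[0, 0] = 4.0                       # one prominent keypoint response

print(alpha_beta(uniform))               # alpha == beta == 0.5
print(alpha_beta(peaked))                # beta (1.075) > alpha (0.344)
```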

3.4. ECAv2Net Embedding in HRNet

To validate the above analysis of Equation (11), ECAv2Net is embedded into the HRNet backbone network and compared with ECA-HRNet. The network structure of ECAv2Net is shown in Figure 3: the module obtains the channel features through two squeezes, then performs local channel feature interaction, and finally connects them through a one-dimensional convolution.
The model structure of ECAv2-HRNet is shown in Figure 4. Since the features in the shallow areas of HRNet (Layer 1, Stage 2) are not distinct, i.e., $D(M_{W,H,C}) \le \sigma$ and $\beta_N \approx \alpha_N$, the effect of ECAv2Net there is limited; the features in the deep areas (Stage 3, Stage 4) are more distinct, i.e., $D(M_{W,H,C}) > \sigma$ and $\beta_N > \alpha_N$, so the effect of ECAv2Net is clear.
In the final ECAv2-HRNet model, the shallow layers therefore adopt the ECANet attention mechanism, while the deep layers adopt the ECAv2Net attention mechanism.
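A minimal sketch of this placement rule is given below; it assumes the ECALayer and ECAv2Layer sketches above, and the per-stage branch widths are typical HRNet-W32 values rather than figures taken from the paper's configuration.

```python
import torch.nn as nn


def attention_for_stage(stage: int, channels: int) -> nn.Module:
    """Layer 1 / Stage 2 (shallow): ECANet attention; Stage 3 / Stage 4 (deep): ECAv2Net."""
    return ECAv2Layer(channels) if stage >= 3 else ECALayer(channels)


# Illustrative HRNet-W32 branch widths per stage (high to low resolution).
branch_channels = {2: [32, 64], 3: [32, 64, 128], 4: [32, 64, 128, 256]}
attention = {stage: nn.ModuleList([attention_for_stage(stage, c) for c in chans])
             for stage, chans in branch_channels.items()}
```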

4. Experimental Datasets

This work conducts tests on two datasets. The COCO dataset is a comprehensive dataset suitable for tasks such as image recognition, semantic segmentation, and image caption generation. GUET CLASS PICTURE is a dataset obtained from a university smart classroom; it was collected with a SONY AX60 HD camera recording the smart teaching process, and it consists of time-series images generated according to the design of the knowledge-point teaching process. The dataset is specifically used for the task of detecting keypoints of human posture in dense scenes.

4.1. COCO Dataset

In the field of human posture keypoints, the COCO dataset is labeled with a total of 17 keypoints and covers a wide range of image sizes, making it a very challenging dataset. The image size analysis is shown in Figure 5, and the distribution of all keypoints is shown in Figure 6.

4.2. GUET CLASS PICTURE Dataset

The images in this dataset were captured from the teaching process in a smart classroom scenario, and the distribution of human postures in the images is very dense: more than 50 human postures are labeled per image, with 11 keypoints per posture, whereas the maximum number of human postures labeled per image in COCO is 20. This is therefore a dense human posture dataset. The size distribution of the images is shown in Figure 7, the distribution of all keypoints is shown in Figure 8, and other detailed information is provided in Appendix A.
Human posture images in a classroom environment have many occluded keypoints, the distance between people is relatively close, and the recognition results of many models are not satisfactory. The GUET CLASS PICTURE dataset enables the model to perform better human pose detection in dense environments like classrooms. In this study, HRNet and ECAv2-HRNet were trained on two datasets, and a comparative experiment was conducted in a classroom setting. The comparison results are shown in Figure 9. As can be seen in the figure, the model trained on the GUET CLASS PICTURE dataset is able to detect more occluded keypoints, yielding significantly better results than the model trained on the COCO dataset.

5. Experiments and Results

The ECAv2-HRNet model described in this research improved performance on both the COCO and GUET CLASS PICTURE datasets, confirming its state-of-the-art capabilities.

5.1. Validation on COCO Dataset

5.1.1. Experimental Environment and Results on COCO Dataset

Experimental environment: the experiments were conducted on an Ubuntu 18.04 LTS 64-bit system with an Intel Xeon Gold 6330H CPU (manufactured in Hillsboro, OR, USA) and an NVIDIA GeForce RTX 3090 24 GB graphics card (manufactured in Shenzhen, China), with PyTorch 2.0.1 as the running environment; all models converged within 210 Epochs. The Adam optimizer was used with weight decay set to 1 × 10−4. The initial learning rate was 0.001, and the learning rate was multiplied by a decay factor of 0.1 at the 170th and 200th Epochs, respectively. Data augmentation was performed using randomized half-body training, affine transformations, and randomized horizontal flips. The input size was 256 × 192 × 3, and the base channel number of the HRNet backbone network was 32. The mAP convergence plots over the whole training process for each model are shown in Figure 10, and the mAP convergence plots from the 51st to the 210th Epoch are shown in Figure 11. The metrics for each model on the COCO val dataset are shown in Table 1, and the metrics on the COCO test-dev dataset are shown in Table 2.
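The COCO training schedule described above can be sketched as follows; the model variable is only a placeholder for ECAv2-HRNet, and the data loading, loss, and augmentation pipeline are omitted.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 32, 3)   # placeholder for ECAv2-HRNet (256 x 192 x 3 inputs)
optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[170, 200], gamma=0.1)  # x0.1 at Epochs 170, 200

for epoch in range(210):
    # one epoch of heatmap-regression training (forward, loss, optimizer.step()) would go here
    scheduler.step()
```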
From Table 1, it can be seen that the mAP of ResNet50 and ResNet101 is higher than that of ResNet152; repeatedly increasing the number of layers of the residual network does not continue to improve the model’s performance, so the ResNet backbone has a performance bottleneck. The accuracy of the HRNet backbone network is generally higher than that of the ResNet backbone network. ECAv2-HRNet’s use of the double squeeze improves channel feature information interaction, while the use of average-pooling and maximum-pooling squeezes does not increase the model’s parameters. The mAP, AP50, AP75, APM, APL, and mAR of ECAv2-HRNet with the double squeeze are better than those of ECA-HRNet with the single squeeze on the COCO dataset, because the feature extraction accuracy of the double squeeze is higher than that of the single squeeze. With the double-squeeze improvement integrated into ECAv2-HRNet, the overall recognition performance improves: the mAP of the proposed ECAv2-HRNet rises from 74.37% for ECA-HRNet to 75.7% while the GFLOPs and Params remain unchanged.

5.1.2. Ablation Experiment on COCO Dataset

ECAv2-HRNet embeds feature enhancement modules in the deeper regions (Stage 3, Stage 4), which makes the convergence smoother and more accurate. Figure 12 shows the all-process mAP convergence plot for the comparison experiment, Figure 13 shows the mAP convergence plot from the 51st Epoch to the 210th Epoch for the comparison experiment, and Figure 14 and Figure 15 show the all-process mAR convergence plot and the mAR convergence plot from the 51st Epoch to the 210th Epoch for the comparison experiment.
ECAv2-HRNet enhances the interaction and sharing of channel feature information by efficiently extracting features. This, in turn, improves the stability of expressing keypoint features in the high-resolution backbone network when combining features of different resolutions. As a result, the convergence curves of mAP and mAR in ECAv2-HRNet become more stable, and ECAv2-HRNet has a stronger feature extraction ability than ECA-HRNet. Figure 16 presents a feature extraction comparison graph of the two models. From Figure 16, it can be seen that ECAv2-HRNet has a clearer feature extraction ability, which makes the model more accurate in recognizing the keypoints of the human body posture.
This study conducted a comparative experiment by adding ECAv2Net at different stages. The results are shown in Table 3 and validate Equation (11): the model with ECAv2Net added in stages 3 and 4 achieves the highest mAP, AP50, AP75, APM, APL, and mAR compared with models where ECAv2Net is inserted at other stages. The model with ECAv2Net added only in stage 4 has the second-highest mAP, AP50, AP75, APL, and mAR; this is because stage 4 belongs to the deep stage, where the feature distribution is uneven and the double-squeeze method can extract better feature information, improving the overall performance of the model, although its effect is slightly worse than that of the model with ECAv2Net added in stages 3 and 4. The per-second throughput (SPS) measured for ECAv2-HRNet and ECA-HRNet on the RTX 3090 device is 4.67 and 4.65, respectively; ECAv2-HRNet improves accuracy while introducing only a very slight running latency.

5.2. Validation on GUET CLASS PICTURE Dataset

5.2.1. Experimental Environment and Results

This part of the experiment was conducted on an Ubuntu 18.04 LTS 64-bit system with an Intel Core i5 CPU, an NVIDIA GeForce RTX 2080 Ti 11 GB graphics card, and PyTorch as the running environment; all models converged within 110 Epochs. The Adam optimizer was used with weight decay set to 1 × 10−4 and an initial learning rate of 0.0002, which was multiplied by a learning rate decay factor of 0.1 at the 50th and 80th Epochs, respectively. Data augmentation was performed using randomized half-body training, affine transformations, and randomized horizontal flips, with an input size of 256 × 192 × 3 and a base channel number of 32 for the HRNet backbone network. Figure 17 shows the full-process mAP convergence plot of the ResNet series models, Figure 18 the detailed mAP convergence plot of the ResNet series models, Figure 19 the full-process mAP convergence plot of the HRNet series models, and Figure 20 the detailed mAP convergence plot of the HRNet series models; the model-specific metrics are shown in Table 4.
The ResNet networks converged faster on the GUET CLASS PICTURE dataset and achieved a higher mAP. The HRNet network outperforms ResNet on the COCO dataset because of its constant fusion of features at different resolutions, but the GUET CLASS PICTURE dataset contains few low-resolution images, so HRNet’s low-resolution features did not come into play. The Params of the ResNet50 series models are concentrated around 34 M and their GFLOPs around 4 G, while the Params of the ResNet101 series models are concentrated around 55 M and their GFLOPs around 7.7 G. Both of these metrics for the HRNet series are slightly better than those of the ResNet50 series, and the highest mAP, 96.71%, was achieved by ECAv2-HRNet.

5.2.2. Ablation Experiment on GUET CLASS PICTURE Dataset

On the GUET CLASS PICTURE dataset, the performance of the HRNet backbone network was slightly worse than that of the ResNet backbone network, but with the embedding of the attention mechanism and the improvements proposed in this paper, the mAP of ECAv2-HRNet reached 96.71%, which is 1.09% higher than that of ECA-HRNet. Figure 21 shows the full-process mAP convergence plot of the two-model comparison experiment, Figure 22 the detailed mAP convergence plot, Figure 23 the full-process mAR convergence plot, and Figure 24 the detailed mAR convergence plot.
ECAv2-HRNet possesses the capability of enhancing features, resulting in a higher rate of convergence. The model’s stability experiences minor fluctuations during the middle stage of convergence. This is attributed to the model’s heightened sensitivity to dataset noise caused by feature reinforcement. However, as the learning rate decreases, the model gradually stabilizes. Additionally, both the mAP and mAR metrics are higher compared to ECA-HRNet.

5.3. Modeling Applications and Data Analysis

After using YOLOv8 for human target recognition, the proposed model in this paper was applied to analyze the posture of the learner in the smart teaching scenario, and a posture image was obtained, as shown in Figure 25.
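The detection-then-pose pipeline can be sketched as below. It uses the public ultralytics YOLOv8 API for person detection, while pose_model stands in for the trained ECAv2-HRNet and is assumed to take 256 × 192 crops; the weight file, image path, and confidence threshold are illustrative.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                     # illustrative YOLOv8 weights
frame = cv2.imread("classroom_frame.jpg")         # illustrative classroom image

results = detector(frame, classes=[0], conf=0.4)  # class 0 = person
for x1, y1, x2, y2 in results[0].boxes.xyxy.cpu().numpy().astype(int):
    crop = frame[y1:y2, x1:x2]
    crop = cv2.resize(crop, (192, 256))           # model input size 256 x 192
    # heatmaps = pose_model(preprocess(crop))     # ECAv2-HRNet inference (placeholder)
    # keypoints = decode_heatmaps(heatmaps)       # 11 keypoints per learner
```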
The method proposed in this paper could extract 11 body posture keypoints for most learners in a smart teaching intensive scenario, though some could not be recognized because they were heavily occluded. A separate posture extraction was performed for one of the learners, as shown in Figure 26 below.
In Figure 26, $\angle AQG = 90°$; the length of the line segment $AQ$ can distinguish well between the head-up and head-down states, and $AQ$ is calculated as shown in Equation (13):
$$AQ = \sqrt{(x_A - x_Q)^2 + (y_A - y_Q)^2}$$
where $(x_A, y_A)$ denotes the coordinates of point A, and $(x_Q, y_Q)$ denotes the coordinates of point Q. When $AQ > \theta_1$ ($\theta_1$ is the critical value), it denotes the head-up state; when $AQ \le \theta_1$, it denotes the head-down state.
Let the angle between $FG$ and the horizontal axis be $\varphi_{FG}$; the slope of $FG$ is then $K_{FG} = \tan\varphi_{FG}$, which is calculated as shown in Equation (14):
$$\tan\varphi_{FG} = \frac{y_F - y_G}{x_F - x_G}$$
where $(x_F, y_F)$ denotes the coordinates of point F and $(x_G, y_G)$ denotes the coordinates of point G. When $\tan\varphi_{FG} > \theta_3$ ($\theta_3$ is the threshold value and $\theta_3 > \theta_2$), it indicates a right-sloping body state; when $\tan\varphi_{FG} < \theta_2$, it indicates a left-sloping body state; and when $\theta_2 \le \tan\varphi_{FG} \le \theta_3$, it indicates an upright sitting state.
The length ratio $\frac{FQ}{QG}$ can determine whether the head is turned, and it is calculated as shown in Equation (15):
$$\frac{FQ}{QG} = \frac{\sqrt{(x_F - x_Q)^2 + (y_F - y_Q)^2}}{\sqrt{(x_Q - x_G)^2 + (y_Q - y_G)^2}}$$
In the equation, when $\frac{FQ}{QG} < \theta_4$ ($\theta_4$ is the threshold value, and $\theta_5 > \theta_4$), it indicates the right head-turning state; when $\frac{FQ}{QG} > \theta_5$, it indicates the left head-turning state. The angles $\angle FHK$ and $\angle GIJ$ reflect the elbow state well, and $\angle FHK$ is calculated as shown in Equation (16):
$$\angle FHK = \arccos\frac{HK^2 + HF^2 - FK^2}{2 \times HK \times HF}$$
In the equation, when $\angle FHK < \theta_6$ ($\theta_6$ is the threshold value), the elbow is bent; when $\angle FHK \ge \theta_6$, the elbow is straight.
Point K is the keypoint of the wrist and can be used to determine whether the learner is raising a hand: when the horizontal offset between the wrist and the shoulder satisfies $x_F - x_K \ge \theta_7$ ($\theta_7$ is the threshold value), the learner is in the hand-raising state; otherwise, the learner is in the normal state. The posture diagram for each state is shown in Figure 27.
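The rules of Equations (13)–(16) can be gathered into one classification routine, sketched below. Keypoints are assumed to be given as (x, y) pixel coordinates keyed by the letters of Figure 26, Q is assumed to be the foot of the perpendicular from the nose onto the shoulder line (consistent with $\angle AQG = 90°$), and all thresholds θ1–θ7 are placeholders to be calibrated on the GUET CLASS PICTURE data.

```python
import math


def analyze_posture(kp: dict, th: dict) -> dict:
    """kp maps the letters of Figure 26 to (x, y) pixel coordinates
    (A = nose, F = right shoulder, G = left shoulder, H = right elbow, K = right wrist);
    th holds the illustrative thresholds theta1..theta7."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    A, F, G, H, K = kp['A'], kp['F'], kp['G'], kp['H'], kp['K']

    # Q: assumed foot of the perpendicular from the nose A onto the shoulder line FG,
    # which makes angle AQG = 90 degrees as stated in the text.
    dx, dy = G[0] - F[0], G[1] - F[1]
    t = ((A[0] - F[0]) * dx + (A[1] - F[1]) * dy) / (dx * dx + dy * dy + 1e-6)
    Q = (F[0] + t * dx, F[1] + t * dy)

    head = 'up' if dist(A, Q) > th['theta1'] else 'down'            # Equation (13)

    slope = (F[1] - G[1]) / (F[0] - G[0] + 1e-6)                    # Equation (14)
    trunk = ('right-sloping' if slope > th['theta3']
             else 'left-sloping' if slope < th['theta2'] else 'upright')

    ratio = dist(F, Q) / (dist(Q, G) + 1e-6)                        # Equation (15)
    turn = ('right' if ratio < th['theta4']
            else 'left' if ratio > th['theta5'] else 'none')

    cos_fhk = (dist(H, K) ** 2 + dist(H, F) ** 2 - dist(F, K) ** 2) / (
        2 * dist(H, K) * dist(H, F) + 1e-6)                         # Equation (16)
    elbow_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_fhk))))
    elbow = 'bent' if elbow_deg < th['theta6'] else 'straight'

    hand_raised = abs(F[0] - K[0]) >= th['theta7']                  # wrist offset from shoulder
    return {'head': head, 'trunk': trunk, 'turn': turn,
            'elbow': elbow, 'hand_raised': hand_raised}
```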
The temporal information generated by the coordinates of the various keypoints reflects changes in posture well, and combining it with the knowledge points being explained provides a comprehensive interpretation of the learners’ postures. After class, time-series posture data are generated for each learner following the sequence of knowledge points; feedback on learners with abnormal learning postures is provided to the teacher, who can then help them resolve their learning difficulties, and these data also support more accurate calculation of subsequent learning engagement.

6. Discussion

ECAv2-HRNet is slightly inferior to current models with large numbers of parameters, such as ViTPose and Omnipose, in human pose recognition accuracy and in generalization across application scenarios (e.g., sports or medical scenarios), and it is also slightly inferior to lightweight models such as Lite-HRNet in inference speed. However, by striking a balance among accuracy, parameter count, and inference speed, ECAv2-HRNet has demonstrated exceptional performance in smart classroom scenarios: it has far fewer parameters than the large-scale models and higher accuracy than the lightweight models. Moreover, ECAv2-HRNet uses the double squeeze to extract feature information from the channels, making the model more accurate when keypoint features are unclear or occluded.
At present, real-time pose recognition of all learners in a smart classroom scene remains a great challenge. The SPS of ECAv2-HRNet is 4.65: when there are fewer than 10 learners, the model can perform pose recognition in real time; when there are about 30 learners, pose recognition incurs a certain delay; and in a classroom with about 90 learners, real-time pose recognition cannot be completed, although ECAv2-HRNet still maintains high accuracy and reasonable speed when recognizing the poses of all learners offline. Data management and model training also present significant challenges that cannot be overlooked, such as dataset limitations, constraints in data collection, and the lack of verification of annotation accuracy; these issues can further lead to overfitting during model training.
Future research should not limit the application scenarios to classrooms but should extend them to other similarly crowded settings, such as lecture halls and report halls, to broaden the acquisition scenarios of the GUET CLASS PICTURE dataset, increase image annotation in those settings, and enhance the generalization ability of the ECAv2-HRNet model across different crowded scenarios in intelligent education.

7. Conclusions

Smart education is an emerging trend that focuses on utilizing learners’ body postures to gather valuable information about their level of engagement in the learning process. By effectively capturing and analyzing these postures, it is possible to enhance the quality of teaching and learning. The paper introduces ECAv2Net, which incorporates adaptive local cross-channel interaction and enhances network features through two squeeze operations. These operations are embedded in the deeper regions (Stage 3, Stage 4) of HRNet, resulting in the proposed ECAv2-HRNet model. This study presents the creation of the GUET CLASS PICTURE dataset, which captures human postures for smart teaching in intensive scenarios. Experimental findings using a public dataset demonstrate that the suggested human posture estimation model, based on ECAv2-HRNet, outperforms other models. The model presented in this work is utilized in the context of a smart classroom scenario to extract the postures of learners. The collected data can then be utilized to assess the posture information of learners and offer guidance for the development of smart education.

Author Contributions

Z.S. offered comprehensive direction for the experiments. Y.Y. implemented the experiments and carried out their improvement and the data analysis. D.L. participated in the tests and conducted the analysis of the results. J.M. completed data collation and analysis. H.Z. and J.Z. completed the derivation of the theory. Z.W. wrote this article. All authors reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62177012, 62267003) and Guangxi Natural Science Foundation under Grant No. 2024GXNSFDA010048 and was supported by a Project of the Guangxi Wireless Broadband Communication and Signal Processing Key Laboratory (GXKL06240107), an Innovation Project of Guangxi Graduate Education (YCBZ2024160), and the Project for Improving the Basic Scientific Research Abilities of Young and Middle-aged Teachers in Guangxi Colleges and Universities under Grant 2023KY1870. Thanks go to all the individuals who actively took part in this research project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data included in this study are available upon request by contacting the corresponding author.

Acknowledgments

We are very grateful to the anonymous reviewers for their valuable and insightful suggestions on the original manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Ethics statement. The present study was approved by the China Association of Research Ethics Committees. The experimental procedures were conducted in conformity with the Declaration of Helsinki. Informed consent for participation in this study and publication in an open-access format was obtained from the participants, with regard to their photographs and personal information. The procedures of this study were fully explained to the participants, and they provided their informed written consent before testing.
GUET CLASS PICTURE Dataset Details. The GUET CLASS PICTURE dataset is taken from video of a university smart classroom, and each frame contains 50 to 90 people. A total of 12,426 postures were labeled, and each unoccluded posture was labeled with 11 keypoints: the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, and right wrist. The labeling follows the JSON rules of the COCO dataset, and Figure A1 shows a high-resolution labeled image from this dataset.
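Since the labels follow the COCO JSON conventions, a single annotation entry for one learner can be sketched as below; the keypoint order matches the list above, the triplets are (x, y, v) as in COCO, and all concrete numbers and IDs are purely illustrative.

```python
KEYPOINT_NAMES = ["nose", "left_eye", "right_eye", "left_ear", "right_ear",
                  "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
                  "left_wrist", "right_wrist"]            # the 11 labeled keypoints

# One COCO-style annotation entry for a single learner (all values illustrative).
annotation = {
    "image_id": 1,
    "category_id": 1,                                     # person
    "bbox": [412.0, 230.0, 96.0, 180.0],                  # [x, y, width, height]
    "keypoints": [430, 250, 2] * 11,                      # flattened (x, y, v) per keypoint
    "num_keypoints": 11,
}
assert len(annotation["keypoints"]) == 3 * len(KEYPOINT_NAMES)
```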
Figure A1. High-resolution labeled image of the GUET CLASS PICTURE dataset.

References

  1. Mascret, Q.; Gagnon-Turcotte, G.; Bielmann, M.; Fall, C.L.; Bouyer, L.J.; Gosselin, B. A wearable sensor network with embedded machine learning for real-time motion analysis and complex posture detection. IEEE Sens. J. 2021, 22, 7868–7876. [Google Scholar] [CrossRef]
  2. Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
  3. Chen, H.; Feng, R.; Wu, S.; Xu, H.; Zhou, F.; Liu, Z. 2D Human pose estimation: A survey. Multimed. Syst. 2023, 29, 3115–3138. [Google Scholar] [CrossRef]
  4. Zhang, C.; Chen, J.; Li, J.; Peng, Y.; Mao, Z. Large language models for human–robot interaction: A review. Biomim. Intell. Robot. 2023, 3, 100131. [Google Scholar] [CrossRef]
  5. Peng, Y.; Yang, X.; Li, D.; Ma, Z.; Liu, Z.; Bai, X.; Mao, Z. Predicting flow status of a flexible rectifier using cognitive computing. Expert Syst. Appl. 2025, 264, 125878. [Google Scholar] [CrossRef]
  6. Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
  7. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
  8. Zhang, S.; Qiang, B.; Yang, X.; Zhou, M.; Chen, R.; Chen, L. Efficient pose estimation via a lightweight single-branch pose distillation network. IEEE Sens. J. 2023, 23, 27709–27719. [Google Scholar] [CrossRef]
  9. Holmquist, K.; Wandt, B. Diffpose: Multi-hypothesis human pose estimation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 15977–15987. [Google Scholar]
  10. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D human pose estimation with spatio-temporal criss-cross attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar]
  11. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8877–8886. [Google Scholar]
  12. Zhang, T.; Lian, J.; Wen, J.; Chen, C.P. Multi-Person Pose Estimation in the Wild: Using Adversarial Method to Train a Top-Down Pose Estimation Network. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 3919–3929. [Google Scholar] [CrossRef]
  13. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  14. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
  15. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  16. Bao, W.; Niu, T.; Wang, N.; Yang, X. Pose estimation and motion analysis of ski jumpers based on ECA-HRNet. Sci. Rep. 2023, 13, 6132. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  18. Dong, C.; Du, G. An enhanced real-time human pose estimation method based on modified YOLOv8 framework. Sci. Rep. 2024, 14, 8012. [Google Scholar] [CrossRef]
  19. Li, Q.; Zhang, Z.; Xiao, F.; Zhang, F.; Bhanu, B. Dite-HRNet: Dynamic lightweight high-resolution network for human pose estimation. arXiv 2022, arXiv:2204.10762. [Google Scholar]
  20. Zhang, Q.; Xu, Y.; Zhang, J.; Tao, D. Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. Int. J. Comput. Vis. 2023, 131, 1141–1162. [Google Scholar] [CrossRef]
  21. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar] [CrossRef]
  22. Li, J.; Su, W.; Wang, Z. Simple pose: Rethinking and improving a bottom-up approach for multi-person pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11354–11361. [Google Scholar]
  23. Zhang, H.; Dun, Y.; Pei, Y.; Lai, S.; Liu, C.; Zhang, K.; Qian, X. HF-HRNet: A Simple Hardware Friendly High-Resolution Network. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7699–7711. [Google Scholar] [CrossRef]
  24. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  25. Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer International Publishing: Glasgow, UK, 2020; pp. 455–472. [Google Scholar]
  26. Wang, J.; Long, X.; Chen, G.; Wu, Z.; Chen, Z.; Ding, E. U-HRnet: Delving into improving semantic representation of high resolution network for dense prediction. arXiv 2022, arXiv:2210.07140. [Google Scholar]
  27. Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
  28. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
  29. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  31. Feng, Y.; Liu, P.; Lu, Z.M. Optimized S2E Attention Block based Convolutional Network for Human Pose Estimation. IEEE Access 2022, 10, 111759–111771. [Google Scholar] [CrossRef]
  32. Jin, X.; Xie, Y.; Wei, X.-S.; Zhao, B.-R.; Chen, Z.-M.; Tan, X. Delving deep into spatial pooling for squeeze-and-excitation networks. Pattern Recognit. 2022, 121, 108159. [Google Scholar] [CrossRef]
Figure 1. Diagram of α N and β N feature extraction process.
Figure 2. Distribution of eigenmaps for different channels at different depths ((a) shows the original input image, colors in figures (b–h) indicate eigenvalues corresponding to the values of the labeled axes on the right).
Figure 3. ECAv2Net architecture diagram.
Figure 4. Structure of ECAv2-HRNet.
Figure 5. Distribution of image sizes for the COCO dataset.
Figure 6. Distribution statistics of various types of keypoints in the COCO dataset.
Figure 7. Distribution of image sizes for the GUET CLASS PICTURE dataset.
Figure 8. Distribution statistics of various types of keypoints in the GUET CLASS PICTURE dataset.
Figure 9. (A) Recognition results of the HRNet model trained on the GUET CLASS PICTURE dataset, (B) HRNet model trained on the COCO dataset, (C) ECAv2-HRNet model trained on the GUET CLASS PICTURE dataset, and (D) ECAv2-HRNet model trained on the COCO dataset.
Figure 10. Whole-process mAP convergence plot on COCO val dataset.
Figure 11. Detailed mAP convergence plot on COCO val dataset.
Figure 12. ECAv2-HRNet and ECA-HRNet full-process mAP convergence plot on COCO val dataset.
Figure 13. ECAv2-HRNet and ECA-HRNet detailed mAP convergence plot on COCO val dataset.
Figure 14. ECAv2-HRNet and ECA-HRNet full-process mAR convergence plot on COCO val dataset.
Figure 15. ECAv2-HRNet and ECA-HRNet detailed mAR convergence plot on COCO val dataset.
Figure 16. (A) Nose keypoint feature extraction maps for ECAv2-HRNet model and (B) ECA-HRNet model, and (C,D) right shoulder keypoint feature extraction maps for ECAv2-HRNet and ECA-HRNet models, respectively.
Figure 17. Full-process mAP convergence plot for the ResNet family of models on the GUET CLASS PICTURE dataset.
Figure 18. Detailed mAP convergence plot for the ResNet family of models on the GUET CLASS PICTURE dataset.
Figure 19. Full-process mAP convergence plot for the HRNet family of models on the GUET CLASS PICTURE dataset.
Figure 20. Detailed mAP convergence plot for the HRNet family of models on the GUET CLASS PICTURE dataset.
Figure 21. ECAv2-HRNet and ECA-HRNet full-process mAP convergence plot on GUET CLASS PICTURE dataset.
Figure 22. ECAv2-HRNet and ECA-HRNet detailed mAP convergence plot on GUET CLASS PICTURE dataset.
Figure 23. ECAv2-HRNet and ECA-HRNet full-process mAR convergence plot on GUET CLASS PICTURE dataset.
Figure 24. ECAv2-HRNet and ECA-HRNet detailed mAR convergence plot on GUET CLASS PICTURE dataset.
Figure 25. Multi-person dense posture estimation maps for smart teaching scenarios.
Figure 26. Estimated human postures of a single student in a classroom, with A–K denoting nose, left eye, left ear, right eye, right ear, right shoulder, left shoulder, right elbow, left elbow, left wrist, and right wrist, respectively.
Figure 27. (a) Head-up state, (b) head-down state, (c) right-slanting state, (d) sitting upright state, (e) left-slanting state, (f) right-turning state, (g) left-turning state, (h) bent-elbow state, (i) straight-elbow state, and (j) raised-hand state.
Table 1. Performance metrics for each model on COCO val.

| Method | Params (M) | GFLOPs | mAP/% | AP50/% | AP75/% | APM/% | APL/% | mAR/% |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 34 | 4.04 | 72.23 | 92.44 | 80.37 | 69.28 | 76.69 | 75.4 |
| ResNet101 | 53 | 7.69 | 72.96 | 92.48 | 81.31 | 70.17 | 77.1 | 76.1 |
| ResNet152 | 67.51 | 10.49 | 72.21 | 92.49 | 80.31 | 69.24 | 76.63 | 75.35 |
| HRNet | 28.54 | 7.69 | 73.43 | 92.27 | 80.9 | 71.25 | 77.04 | 77.01 |
| ECA-HRNet | 28.54 | 7.69 | 74.37 | 92.17 | 81.89 | 72.29 | 78.02 | 77.93 |
| ECAv2-HRNet | 28.54 | 7.69 | 75.7 | 93.43 | 83.35 | 73.27 | 79.47 | 78.56 |
Table 2. Performance metrics for each model on COCO test-dev.

| Method | mAP/% | mAR/% |
|---|---|---|
| ResNet50 | 71.54 | 77.28 |
| ResNet101 | 72.5 | 78.47 |
| ResNet152 | 70.2 | 75.5 |
| HRNet | 72.8 | 78.3 |
| ECA-HRNet | 73.4 | 78.9 |
| ECAv2-HRNet | 74.42 | 79.21 |
Table 3. Experimental results of adding ECAv2Net with different stages.

| Stage N with ECAv2Net | mAP/% | AP50/% | AP75/% | APM/% | APL/% | mAR/% |
|---|---|---|---|---|---|---|
| 1 2 | 73.93 | 92.54 | 81.49 | 71.25 | 78.01 | 76.85 |
| 1 2 3 | 74.02 | 92.54 | 81.45 | 71.47 | 78.2 | 77.1 |
| 1 2 3 4 | 73.74 | 92.54 | 81.32 | 70.79 | 78.39 | 76.77 |
| 2 3 4 | 74.25 | 92.53 | 81.6 | 71.75 | 78.45 | 77.17 |
| 4 | 74.28 | 92.55 | 81.61 | 71.58 | 78.5 | 77.23 |
| 3 4 | 75.7 | 93.43 | 83.35 | 73.27 | 79.47 | 78.56 |
Table 4. Performance metrics for each model on GUET CLASS PICTURE.

| Method | Params (M) | GFLOPs | mAP/% | AP50/% | AP75/% | APM/% | APL/% | mAR/% |
|---|---|---|---|---|---|---|---|---|
| ResNet50 | 34 | 4.04 | 95.78 | 98.95 | 98.85 | 95.29 | 96.1 | 97.48 |
| ResNet101 | 53 | 7.69 | 96.13 | 98.84 | 98.83 | 95.35 | 96.47 | 97.80 |
| SE-ResNet50 | 36.53 | 4.05 | 96.35 | 98.93 | 98.92 | 95.88 | 96.17 | 98.07 |
| SE-ResNet101 | 57.77 | 7.7 | 95.5 | 98.94 | 98.94 | 94.72 | 95.82 | 97.36 |
| S2E-ResNet50 | 36.53 | 4.05 | 95.57 | 98.94 | 98.93 | 95.14 | 95.88 | 97.36 |
| S2E-ResNet101 | 57.77 | 7.7 | 96.04 | 98.95 | 98.94 | 94.67 | 96.46 | 97.52 |
| ECA-ResNet50 | 34 | 4.04 | 96.35 | 98.94 | 98.93 | 96.03 | 96.66 | 98.04 |
| ECA-ResNet101 | 52.99 | 7.69 | 96.27 | 98.94 | 98.92 | 95.47 | 96.66 | 97.89 |
| CBAM-ResNet50 | 34.52 | 4.05 | 96.18 | 98.94 | 98.94 | 95.45 | 96.62 | 97.90 |
| CBAM-ResNet101 | 53.52 | 7.7 | 96.14 | 98.93 | 98.93 | 95.79 | 96.35 | 97.87 |
| Omnipose | 30.56 | 7.91 | 96.52 | 98.97 | 98.96 | 96.15 | 96.7 | 98.06 |
| UHRNet | 28.57 | 7.69 | 91.44 | 98.98 | 97.65 | 90.96 | 91.95 | 94.16 |
| HRNet | 28.54 | 7.69 | 94.89 | 98.95 | 98.92 | 94.14 | 95.39 | 96.91 |
| S2E-HRNet | 28.76 | 7.69 | 95.66 | 98.93 | 98.93 | 95.39 | 95.98 | 97.46 |
| SE-HRNet | 28.76 | 7.69 | 93.89 | 98.95 | 98.89 | 92.63 | 94.64 | 96.08 |
| ECA-HRNet | 28.54 | 7.69 | 95.62 | 98.95 | 98.93 | 94.87 | 95.97 | 97.42 |
| ECAv2-HRNet | 28.54 | 7.69 | 96.71 | 98.93 | 98.92 | 96.44 | 96.87 | 98.27 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
