Article

Three-Dimensional Action Recognition for Basketball Teaching Coupled with Deep Neural Network

1
Sports Department, Shanghai Polytechnic University, Shanghai 201209, China
2
Department of Physical Education, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
*
Author to whom correspondence should be addressed.
Electronics 2022, 11(22), 3797; https://doi.org/10.3390/electronics11223797
Submission received: 14 October 2022 / Revised: 12 November 2022 / Accepted: 15 November 2022 / Published: 18 November 2022
(This article belongs to the Section Artificial Intelligence)

Abstract

This study proposes a 3D pose estimation algorithm that couples the RMPE algorithm with a deep neural network, combining human pose estimation and action recognition to provide a new approach to assisted basketball training. Compared with traditional single-action recognition methods, the proposed method achieves better recognition accuracy and a more intuitive display. A flipped classroom teaching mode based on this algorithm is applied to a college basketball optional course to explore its influence on classroom teaching effectiveness. Using standard action recognition evaluation metrics, experimental results across several recognition methods and datasets are compared and analyzed, verifying that the method recognizes actions well. The Top1 and Top5 values of the proposed method are 42.21% and 88.77%, respectively, which are 10.61% and 35.09% higher than those obtained on the Kinetics-skeleton dataset. However, compared with the NTU RGB+D dataset, the Top1 recognition rate is significantly lower. The fusion of human pose estimation and action recognition provides a new approach for assisted basketball training.

1. Introduction

In basketball, basic actions include dribbling, shooting, and layups; dribbling is the most fundamental action, and shooting is the key to scoring [1,2,3,4]. The accuracy of these basic actions has a great impact on the game's score. With the development of basketball, the fusion of human pose estimation and action recognition algorithms plays a crucial role in helping to improve the scoring rate [5]. Human pose estimation detects and estimates the position, orientation, and scale of each part of the target human body from an image; this information must be converted into a digital form that the computer can interpret in order to output the current human posture and action. Action recognition, in turn, takes the pose estimation result as its input and judges whether a person's actions conform to the standard and how their form can be improved [6].
Two-dimensional pose estimation algorithms include OpenPose, AlphaPose, RMPE, and others. The RMPE algorithm used in this paper improves on single-person pose estimation and at the same time effectively avoids inaccurate and redundant detection frame positions [7]. The VideoPose3D algorithm applies a fully convolutional model with dilated temporal convolutions over 2D keypoints to effectively estimate 3D poses in video. Because the architecture is fully convolutional, it supports parallel processing across both the batch and time dimensions, which recurrent networks do not. Moreover, in a convolutional model the length of the gradient path between the output and the input is constant regardless of the sequence length, which mitigates the vanishing and exploding gradients that affect RNNs [8]. The convolutional model also gives precise control over the temporal receptive field, which benefits modeling the temporal dependencies of 3D pose estimation. VideoPose3D uses a semi-supervised training approach to improve accuracy in settings with limited labeled 3D ground-truth pose data [9]. Given unlabeled video and a 2D keypoint detector, supervision can be extended with a reprojection loss. This works like an autoencoder on unlabeled data: the encoder (pose estimation layer) predicts the 3D pose from the 2D joint points, while the decoder (projection layer) projects that 3D pose back to 2D coordinates [10,11,12].
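To make the dilated temporal convolution idea concrete, the following is a minimal sketch of lifting a 2D keypoint sequence to 3D with residual dilated temporal blocks. The layer sizes, joint count, and number of blocks are illustrative assumptions, not the VideoPose3D reference implementation.

```python
# Minimal sketch of a dilated temporal-convolution lifter in the spirit of
# VideoPose3D: 2D keypoint sequences in, 3D keypoint sequences out.
# Channel widths and block count are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Residual block whose dilation grows the temporal receptive field."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection keeps gradients stable

class Lifter2Dto3D(nn.Module):
    """Maps (batch, 2*J, frames) 2D keypoints to (batch, 3*J, frames) 3D."""
    def __init__(self, joints: int = 17, channels: int = 256, blocks: int = 4):
        super().__init__()
        self.inp = nn.Conv1d(2 * joints, channels, kernel_size=1)
        # Dilations 1, 2, 4, 8 rapidly expand the view over the sequence.
        self.blocks = nn.Sequential(
            *[TemporalBlock(channels, 2 ** i) for i in range(blocks)])
        self.out = nn.Conv1d(channels, 3 * joints, kernel_size=1)

    def forward(self, kp2d):
        return self.out(self.blocks(self.inp(kp2d)))

x = torch.randn(1, 2 * 17, 243)   # one clip: 17 joints, 243 frames
print(Lifter2Dto3D()(x).shape)    # torch.Size([1, 51, 243])
```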
For human action recognition and classification based on video data, this paper proposes a method combining human pose estimation and human action recognition [13,14,15]. The method consists of three steps: first, 2D skeleton coordinate information is extracted from the video data; then, the 2D skeleton coordinates are converted into 3D skeleton coordinates by the VideoPose3D method; finally, the 3D skeleton coordinates are input into the spatiotemporal graph convolutional network (ST-GCN) dynamic skeleton model for classification and recognition to improve the accuracy of action recognition (Figure 1). This algorithm is applied to the flipped classroom teaching of college basketball optional courses, and the teaching effects of the flipped classroom teaching mode and the traditional teaching mode are compared [16].
The biggest problem in 3D action recognition is that environmental interference lowers recognition efficiency. For 2D image recognition, methods must process hundreds of pixels per image in real time to extract features, which is computationally expensive, and scene noise degrades recognition accuracy [17]. The input source of 3D action recognition is RGBD video, whose depth data carry no color information, so the color of the subject's clothes and a cluttered scene have no effect on the segmentation process. This allows researchers to focus on obtaining robust feature descriptors to describe actions, rather than on low-level segmentation. Therefore, this paper introduces deep learning methods into basketball training, which can assist training and improve the scoring rate.
Early action recognition aimed to perceive and reconstruct 3D objects from 2D images and to classify and recognize them through edge extraction on single frames. Compared with video-based methods, single-frame action recognition data are easier to obtain, but temporal information is lost, which leads to misjudgments. Video-based action recognition can obtain both temporal and spatial information, which greatly improves the recognition rate, and it offers strong scalability and high flexibility; it has therefore become the main research direction in recent years. Traditional motion recognition extracts features from video data manually. Recognition methods began with template-based approaches, which can recognize simple actions but have limited accuracy. The Markov model and the probabilistic latent semantic analysis model among probabilistic statistical methods use spatiotemporal correlation graphs for coding, enriching the recognition description of human actions [16]. For different types of actions, unsupervised generative models have been used for learning, such as the conditional random field model and the infinite-state hidden conditional random field model. Syntactic analysis techniques followed, which treated descriptions of human actions as atomic units for action recognition. The improved dense trajectories (IDT) algorithm realizes human behavior recognition by matching feature points with SURF descriptors and dense optical flow to obtain local trajectories [14].
Action recognition methods based on deep learning use deep networks to learn features automatically from the original video and output classification results. Image-based action recognition has been studied for video classification, action localization in video, real-time performance, and so on [15]. Although image-based recognition methods perform well, the problems of limited data information, complex spatiotemporal correlations, and diverse intraclass variation have become harder to solve in human motion recognition. A 3D convolutional neural network is a single-stream network model. The 3DCNN model gathers the video target [17], action, and other related information and is applicable to multiple tasks, but videos need to be segmented during training. To solve this problem, Convolutional 3D (C3D), which can process entire video frames [18], is used. In order to improve the generalization ability of 3D convolutional networks, 3D residual convolutional networks and pseudo-3D residual networks were proposed in succession. Three-dimensional convolution has high computing and memory costs; hybrid convolution tubes and generative adversarial networks can be used to train on videos, which improves recognition efficiency and performance.
Generally, video-based recognition methods perform worse than those based on human key points. However, traditional keypoint methods need motion-capture equipment, which is difficult to apply in basketball games and training. In this regard, combined with basketball game data, we mainly use a spatiotemporal two-stream network structure together with a human pose estimation algorithm and train the network model on a basketball movement dataset, so that the model can be used for action recognition; it is compared with other recognition algorithms through experiments. The skeleton information extracted by 2D human pose estimation alone cannot capture each player's position and the players' relationships in basketball games.
In our study, a multiperson pose estimation method is used to extract 2D skeleton information, and 3D human skeleton information is then extracted from a player's 2D video through the video-based 3D human pose estimation method with temporal convolutions and semi-supervised training, which provides a basis for later model training. Motion recognition technology from deep learning is introduced into basketball. By fusing the human pose estimation method with the motion recognition method and training on a self-built basketball dataset, the actions in basketball game videos are recognized and classified to improve recognition efficiency.
We aimed to answer three questions: (i) Does the skeleton information extracted by the improved action recognition method based on the deep neural network meet the needs of each player’s position and relationship in the basketball game? (ii) Can the fusion of the human posture estimation method based on deep learning and the motion recognition method improve the efficiency of model recognition? (iii) What is the visual representation of the model output results proposed in this paper?
Our research is organized as follows: first, we introduce human pose estimation methods and the theoretical basis of motion recognition, including 2D and 3D pose estimation methods, graph convolutional networks, and temporal convolutional networks. We then introduce the deep learning framework used in this paper and the system development technology stack. Finally, the 3D coordinates of the human body are extracted and preprocessed by the specific human pose estimation method, the results are used as the input of the ST-GCN network model for training and verification, and the experimental results are compared and analyzed.

2. Materials and Methods

2.1. Data Collection

In this paper, 2D skeleton coordinates are extracted from motion video by the RMPE algorithm, implementing 2D human pose estimation, and the 2D skeleton information is processed by VideoPose3D to obtain 3D skeletons [18]. The spatiotemporal graph convolutional network (ST-GCN) gradually generates higher-level feature maps on the graph and then classifies the features through average pooling and fully connected layers. The overall architecture of the action recognition pipeline is shown in Figure 2 below.
With the advent of the big data era and the increasing availability of cameras, video-based datasets can be used to evaluate algorithm performance. In recent years, some studies have attempted to extend human action recognition to physical activity and have collected and established sports-related datasets [19]. Examples include UCF Sports, Sports-1M, the NCAA Basketball dataset, and the Volleyball dataset. However, the NCAA Basketball dataset contains only in-game data, and there is a certain gap between these data and players' movements in actual training. Therefore, the basketball motion dataset in this paper includes 600 basketball motion videos in the training set and 600 in the test set. The resolution of the videos is 1920 × 1080, and the frame rate is 15 frames per second. The dataset includes basic single-person actions, classified as shooting, layup, dribbling, running dribbling, dunking, and running without the ball [20]. It also includes basic multiperson actions, such as dribbling breakthrough and defense. For comparison, this paper uses the NTU RGB+D 120 dataset and the Kinetics-skeleton dataset. The NTU RGB+D 120 dataset was captured simultaneously by three Microsoft Kinect v2 cameras and includes 114,480 videos and 120 action categories; the Kinetics-skeleton dataset is a large-scale, high-quality dataset containing up to 650,000 video clips covering 400/600/700 human action classes, depending on the version.

2.2. Methods

In the video, 2D skeleton information is first extracted, then converted into 3D skeleton information, and the 3D information is then used as input to the action recognition model. The 2D pose estimation algorithm is RMPE, a top-down pose estimation method that improves on the SPPE algorithm to solve the problem of inaccurate and redundant detection frame positions [21]. It is divided into three steps: human frame detection, human pose estimation, and non-maximum suppression (Figure 3). The first step normalizes the input image using the YOLOv3 algorithm, divides the image into several grids, and has each grid predict multiple prior boxes and the multidimensional attributes of each box. During this process, the detected human frames may be inaccurate [22]. The second step addresses inaccurate detection frames. The results of the previous step are fed into two parallel pose estimation branches. The upper branch corrects the original detection frame through a spatial transformer network (STN) to maximize the quality of the SPPE estimate; since errors remain at this step, the inverse transform parameters are computed to map the predicted human pose back into the original detection frame. During training, the lower branch, Parallel SPPE, acts as a regularizing corrector to avoid local optima: its pose estimates are compared with the annotated ground-truth pose, and the center position error is sent back to the STN module to improve the accuracy of the STN's region selection. The third step solves the detection frame redundancy problem. The pose estimation results are input into the PoseNMS module, which compares the similarity of human poses and outputs the 2D coordinates of the human key points. Finally, the output 2D coordinates are converted into 3D coordinates using the VideoPose3D algorithm mentioned in Section 2.1 [23]. That algorithm applies temporal convolutions over the input 2D keypoint sequence with a fully convolutional network to obtain the 3D keypoint sequence; with multiple residual modules and rapidly increasing dilation factors, the receptive field extends over the entire input sequence and ultimately yields the full 3D pose sequence as output.
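The following is a minimal, self-contained sketch of the pose-level non-maximum suppression idea behind the PoseNMS step: keep the most confident pose and discard poses too similar to it. The similarity measure here (mean per-joint distance) is a simplification; RMPE's actual criterion combines keypoint distance and confidence.

```python
# Sketch of pose NMS: greedy suppression of near-duplicate pose candidates.
import numpy as np

def pose_nms(poses: np.ndarray, scores: np.ndarray, thresh: float = 20.0):
    """poses: (N, J, 2) keypoint coords; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # most confident first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # mean per-joint distance between the best pose and the rest
        dists = np.linalg.norm(poses[order[1:]] - poses[best],
                               axis=-1).mean(axis=-1)
        order = order[1:][dists > thresh]     # drop near-duplicates
    return keep

poses = np.random.rand(5, 17, 2) * 100
poses[1] = poses[0] + 1.0                     # a near-duplicate of pose 0
print(pose_nms(poses, np.array([0.9, 0.8, 0.7, 0.6, 0.5])))
```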
The key points of the human body are acquired from video or collected by a camera. Because shooting angles and background environments differ, the data vary considerably, so the keypoint sequence data must be preprocessed with coordinate transformation and normalization operations. To conform to the logic of human body movement, this paper establishes a coordinate system aligned with the body's movement direction, standing direction, and ground direction and maps all coordinates into this system [24]. In computer graphics, a coordinate-system transformation is composed of rotation and translation operations about the axes. A point $(x_s, y_s, z_s)$ in the source frame maps to a point $(x_t, y_t, z_t)$ in the target frame according to the relationship shown in Equation (1).
$$\begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix} + \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}$$
where $r_{ij}$ are the entries of the rotation matrix and $t_i$ are the components of the translation vector. First, the neck joint is taken as the origin and the line through the two shoulders as the x-axis direction, and the y-axis is established by the right-hand rule; the y-axis is then the vertical direction of the body (Figure 4). The coordinates are translated along the y-axis so that the body lies on its positive semi-axis, and the z-axis is taken as the movement direction of the body's center point starting from the first frame, with the coordinates of subsequent frames mapped accordingly. Normalization scales the data proportionally into a given interval. When normalizing the keypoint data, this paper takes the maximum positive value on the y-axis as the person's height and normalizes it to 1, then scales the other data in equal proportion so that all coordinates lie in the range [0, 1].
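A minimal sketch of this preprocessing follows, under the assumption that the rotation is given as a 3 × 3 matrix R and the translation as a vector t; the neck joint index used in the example is also an assumption.

```python
# Sketch of the coordinate transform (Equation (1)) and height normalization.
import numpy as np

def to_body_frame(joints: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Apply Equation (1): rotate then translate every joint (J, 3)."""
    return joints @ R.T + t

def normalize_height(joints: np.ndarray) -> np.ndarray:
    """Scale so that the maximum positive y coordinate becomes 1."""
    return joints / joints[:, 1].max()

skeleton = np.random.rand(18, 3) * 180.0   # 18 joints in camera coordinates
t = -skeleton[1]                           # translate the neck to the origin
R = np.eye(3)                              # identity rotation for brevity
print(normalize_height(to_body_frame(skeleton, R, t)).max())
```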
This paper studies the spatiotemporal graph convolutional network model (ST-GCN), which provides a general representation of skeleton sequences for action recognition. The model is built on a sequence of skeleton graphs, where each node corresponds to a joint of the human body. There are two types of edges: spatial edges that connect naturally adjacent joints within a frame and temporal edges that connect the same joint across consecutive frames. On this basis, a multilayer spatiotemporal graph convolution is constructed to integrate information along both the spatial and temporal dimensions.
The spatiotemporal graph convolution is obtained by connecting the same key points in consecutive frames along temporal edges, so the graph has a fixed structure over which spatiotemporal convolutions can be applied. For an input with V key points per frame and C channels, each row records the location of a key point at different times. If the temporal stride is 1, the convolution centered on a key point in a given frame covers a total of K frames before and after it, and the corresponding kernel size is K × 1. The structure of the algorithm is shown in Figure 5, and the network is divided into three parts. The first part adjusts the input matrix, i.e., the coordinates of the joints in different frames. The second part is the stacked ST-GCN structure, in which learnable edge weights measure the importance of different parts of the human body; graph convolutions capture local spatial features within each frame, while temporal convolutions capture features across frames. The third part uses the average pooling layer and the fully connected layer to classify the features and output the classification results [25].
Traditional skeleton-based recognition concatenates the coordinate vectors of all joints into a single feature vector per frame and applies convolutions to this vector to identify bone motion. We instead use spatiotemporal graphs to represent skeleton sequences. In particular, an undirected spatiotemporal graph G = (V, E) is constructed on a skeleton sequence with N joints and T frames. The node set $V = \{v_{ti} \mid t = 1, 2, \ldots, T;\ i = 1, \ldots, N\}$ contains all joints in the skeleton sequence. In the input to ST-GCN, the feature vector $F(v_{ti})$ on a node consists of the coordinate vector of the i-th joint in frame t together with its estimated confidence. This paper creates the spatiotemporal graph of the skeleton sequence in two steps. First, within each frame, joints are connected with edges according to the connectivity of the human body structure. Second, across consecutive frames, each joint is connected to the same joint in the next frame. The connections in this graph are therefore naturally defined without manually excluding any parts [26]. This enables the network to handle datasets with varying numbers of nodes or connections.
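A small sketch of constructing the edge set of G = (V, E) follows: spatial edges follow the body structure within each frame, and temporal edges connect the same joint across consecutive frames. The bone list is a common OpenPose-style 18-joint layout and is an assumption, not the paper's exact definition.

```python
# Sketch of building spatiotemporal graph edges for a skeleton sequence.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (1, 8),
         (8, 9), (9, 10), (1, 11), (11, 12), (12, 13), (0, 14), (14, 16),
         (0, 15), (15, 17)]

def spatiotemporal_edges(num_joints: int = 18, num_frames: int = 300):
    edges = []
    for t in range(num_frames):
        base = t * num_joints
        edges += [(base + i, base + j) for i, j in BONES]       # spatial
        if t + 1 < num_frames:                                   # temporal
            edges += [(base + i, base + num_joints + i)
                      for i in range(num_joints)]
    return edges

print(len(spatiotemporal_edges()))  # |E| for T = 300 frames, N = 18 joints
```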
Convolution on a 2D image or feature map can be defined on a 2D grid, and the output feature map is also a 2D grid. Given a convolution operator with kernel size K × K, the output value of a single channel at spatial location x can be written as follows:
$$f_{out}(x) = \sum_{h=1}^{K} \sum_{w=1}^{K} f_{in}(p(x, h, w)) \cdot W(h, w)$$
where $p(\cdot)$ is a sampling function centered at x that enumerates the neighborhood of position x, and $W: \mathbb{Z}^2 \to \mathbb{R}^C$ is a weight function whose inner product with the input is computed in C-dimensional space. In graph convolution, the sampling function and weight function must be redefined. The sampling function is defined on the neighbor set $B(v_{ti}) = \{v_{tj} \mid d(v_{tj}, v_{ti}) \le D\}$ of a node $v_{ti}$, where $d(v_{tj}, v_{ti})$ is the minimum length of any path from $v_{tj}$ to $v_{ti}$. A mapping $l_{ti}$ assigns each node in the neighbor set a label, and the weight function can then be expressed as follows:
$$W(v_{ti}, v_{tj}) = W'(l_{ti}(v_{tj}))$$
Using the redefined sampling and weight functions, Formula (2) can now be rewritten in graph convolution form as follows:
$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(p(v_{ti}, v_{tj})) \cdot W(v_{ti}, v_{tj})$$
The normalizing term $Z_{ti}(v_{tj}) = |\{v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj})\}|$ equals the cardinality of the corresponding subset and balances the contributions of different subsets to the output. Finally, we obtain the following:
$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot W(l_{ti}(v_{tj}))$$
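In practice, this normalized neighborhood aggregation is often written in matrix form as $f_{out} = \Lambda^{-1} A f_{in} W$ with adjacency A and degree normalization $\Lambda$. The following is a minimal sketch of that equivalent matrix restatement, commonly used in ST-GCN-style implementations; it is not the authors' code.

```python
# Sketch of a normalized spatial graph convolution in matrix form.
import numpy as np

def graph_conv(f_in: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """f_in: (N, C_in) node features; A: (N, N) adjacency with self-loops;
    W: (C_in, C_out) shared weights. Z normalizes by neighborhood size."""
    Z = A.sum(axis=1, keepdims=True)          # per-node neighbor count
    return (A / Z) @ f_in @ W                 # aggregate, normalize, project

N, C_in, C_out = 18, 3, 64
A = np.eye(N)                                 # self-loops
for i, j in [(0, 1), (1, 2), (1, 5)]:         # a few illustrative bones
    A[i, j] = A[j, i] = 1
print(graph_conv(np.random.rand(N, C_in), A,
                 np.random.rand(C_in, C_out)).shape)   # (18, 64)
```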
With the spatial graph CNN established, the task of modeling the temporal dynamics within the skeleton sequence begins. Recall that in constructing the graph, its temporal aspect is formed by connecting the same joints across consecutive frames. This allows a simple strategy for extending the spatial graph CNN to the spatiotemporal domain: the concept of the neighborhood is extended to include temporally connected joints:
$$B(v_{ti}) = \left\{ v_{qj} \,\middle|\, d(v_{tj}, v_{ti}) \le K,\ |q - t| \le \left\lfloor \tfrac{\tau}{2} \right\rfloor \right\}$$
where the parameter τ controls the temporal range included in the neighbor graph and can therefore be called the temporal kernel size. To complete the convolution on the spatiotemporal graph, the label map (the counterpart of the spatial case) defining the weight function is also required:
$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + \left( q - t + \left\lfloor \tfrac{\tau}{2} \right\rfloor \right) \times K$$
Three label-partitioning strategies are considered. (1) Uni-labeling: the feature vector of every neighboring node takes an inner product with the same weight vector. (2) Distance partitioning: the neighbor set is divided according to each node's distance from the root node. (3) Spatial configuration partitioning: since the human skeleton is spatially structured, the neighbor set is divided into the root node itself, a centripetal subset, and a centrifugal subset. Uni-labeling amounts to simple feature averaging before the convolution operation, so its performance is poor; multi-subset partitioning strategies solve this problem, and among them, spatial configuration partitioning performs best. Therefore, this paper uses spatial configuration partitioning, which accounts for concentric and eccentric motion patterns.
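A sketch of spatial configuration partitioning follows the rule described above: each neighbor is assigned to a subset by comparing its distance to the skeleton's center of gravity with the root joint's distance. The joint layout and indices are assumptions for illustration.

```python
# Sketch of spatial configuration partitioning of a joint's neighbor set.
import numpy as np

def partition(joints: np.ndarray, root: int, neighbors: list[int]):
    """joints: (J, 3). Returns subset label per neighbor:
    0 = root itself, 1 = centripetal (closer to center), 2 = centrifugal."""
    center = joints.mean(axis=0)
    r_root = np.linalg.norm(joints[root] - center)
    labels = {}
    for n in neighbors:
        if n == root:
            labels[n] = 0
        elif np.linalg.norm(joints[n] - center) < r_root:
            labels[n] = 1                      # closer to the body's center
        else:
            labels[n] = 2                      # farther from the center
    return labels

skel = np.random.rand(18, 3)
print(partition(skel, root=2, neighbors=[1, 2, 3]))
```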
In action recognition, the evaluation metrics used by 2D-based pose methods are the percentage of correctly estimated parts (PCP), the percentage of correct keypoints (PCK), and mAP. PCP uses limb length as the benchmark to evaluate the detection accuracy of the head, torso, upper arm, lower arm, thigh, and calf. PCK uses a normalized distance as the benchmark to evaluate the detection accuracy of seven joints, namely the head, shoulder, elbow, wrist, hip, knee, and ankle. The mAP index reflects the average PCKh detection rate across all joints. The human pose estimation experiments in this paper use mAP: $mAP = \mathrm{mean}(AP@s)$, where AP is the average precision, s is a threshold, and mAP is the mean of AP over different values of s. In the formulas below, P is the number of people in the video, $OKS_p$ evaluates the similarity between the predicted joint positions of the p-th person and the labeled positions, and $\delta(\cdot)$ is a Kronecker delta function.
$$AP@s = \frac{\sum_{p} \delta(OKS_p > s)}{\sum_{p} 1}$$
$$OKS_p = \frac{\sum_{i} \exp\!\left(-d_{pi}^2 / 2 s_p^2 \sigma_i^2\right) \, \delta(v_{pi} = 1)}{\sum_{i} \delta(v_{pi} = 1)}$$
where i indexes the joint type, $d_{pi}$ is the Euclidean distance between the predicted position of joint i of person p and its labeled position, $s_p$ is the scale of person p, and $\sigma_i$ is a per-joint normalization constant reflecting the importance of joint i; when $v_{pi} = 1$, the joint is visible. The 3D-based recognition methods use Top1 and Top5 benchmarks: a prediction counts as correct under Top1 when the highest-probability label matches the ground truth, and under Top5 when the ground truth appears among the five highest-probability labels. In addition, recognition accuracy is evaluated under the cross-view (CV) and cross-subject (CS) benchmarks.
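Minimal sketches of these metrics follow: OKS, the AP at one threshold s, and Top-k accuracy. The per-joint constants $\sigma_i$ are assumed uniform here for brevity; COCO-style evaluation uses joint-specific values.

```python
# Sketches of OKS, AP@s, and Top-k accuracy.
import numpy as np

def oks(pred, gt, visible, scale, sigma=0.05):
    """pred, gt: (J, 2); visible: (J,) booleans; scale: person scale s_p."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)
    e = np.exp(-d2 / (2 * scale**2 * sigma**2))
    return e[visible].mean()

def ap_at_s(oks_values, s):
    """Fraction of people whose OKS exceeds the threshold s."""
    return (np.asarray(oks_values) > s).mean()

def topk_accuracy(probs, labels, k=5):
    """probs: (B, classes); labels: (B,). Top1 uses k=1."""
    topk = np.argsort(probs, axis=1)[:, -k:]
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])

print(topk_accuracy(np.random.rand(8, 6), np.random.randint(0, 6, 8), k=5))
```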

2.3. Guideline for Basketball Teaching

Through the collection and comparative analysis of team and player schedule information, basic data, and personal training data, the system supports two use cases. A user uploads a single player's sports video, and the extracted three-dimensional skeleton information is compared against the skeleton information of standard actions to help the player train toward standard form. For a team game, the user uploads the complete game video, which is analyzed by the recognition algorithm introduced above; the movement and position of each player are fed back, providing tactical guidance to the coach. Before developing the system, a feasibility analysis is needed, covering economic, technical, and operational feasibility. The specific analysis follows [27]:
In terms of economic feasibility, the mainstream commercial systems for basketball game analysis at home and abroad are foreign products such as Synergy Sports, ShotTracker, and Coach's Eye, whose cost is high; for the training needs of an ordinary college basketball team, the economic pressure is too great. The development of this system, however, was completed entirely on a personal computer with free software, the development data come from daily collection, and the development cost is low. The fee for the basketball auxiliary training system after commercialization is low, which solves the problem of high investment costs for colleges and universities [28].
In terms of technical feasibility, domestic game analysis systems, such as the Tongdao intelligent cloud platform, intelligent teaching assistant event systems, and SportsDT, mainly focus on football, golf, and other sports and require substantial manual assistance, so their intelligence and agility are slightly lower. Using the 3D-based motion recognition algorithm evaluated in Section 3, this system can improve the recognition rate of athletes, extract the 3D skeleton corresponding to each motion, and compare skeletons longitudinally to effectively assist coaches in training [29].
In terms of operational feasibility, the system has a simple interface. Users do not need to understand the background processing; the background data processing and its computational difficulty can be ignored by the user. At the same time, the system is deployed on Alibaba Cloud and requires no special computer configuration, so it is operationally feasible.

3. Results and Discussion

3.1. Three-Dimensional Recognition Algorithm Results

Table 1 compares the performance of OpenPose and RMPE on the MPII dataset, including the mAP for each joint and the running speed of the networks. Table 2 compares the performance of the two algorithms on the MSCOCO dataset, where AP50 denotes the average precision at an OKS threshold of 0.50 and APM denotes the average precision at medium scale. The experimental results show that, in multiperson pose estimation, the detection speed of OpenPose is 120 times that of RMPE. This is because OpenPose consists of a network forward pass for estimating joint points followed by joint-point assignment; the forward pass dominates, being about two orders of magnitude larger than the joint-point assignment, and its time consumption is not affected by the number of people. RMPE, by contrast, must compute the offset between the ground truth and the detection bounding box for each detection and then normalize it, which is very time-consuming. However, the detection accuracy of RMPE on the various human joint points and on people of various scales is generally higher than that of OpenPose. Because of its more accurate joint-point detection, this paper uses the RMPE method for human pose estimation, and the estimated bone sequences are input into the action recognition model.
In this experiment, each video ranges from 0.2 s to 10 min in length and contains one basic action. The resolution of the videos is 1920 × 1080, and the data are recorded at 15 frames per second. The joint points of the human body in the videos can be obtained as shown in Table 3 below.
For the basketball basic action database, the video is first converted into 3D skeleton data using the preprocessing method given in Section 2; the 3D information of 18 local joint points is output through human action recognition, and the results are then arranged in sequence. For any initial skeleton data input to the network, the dimensions (18, 3, 300) represent the 18 human joint points, the 3 channels per joint (the three-dimensional coordinates), and the 300 frames of the input video; this sequence is used as the input to the human action recognition network. The nine-layer GCN-TCN module is divided into three parts: the first three layers use 64 feature channels per node, the middle three layers use 128, and the last three layers use 256. To verify the effectiveness of the proposed method for basic basketball action recognition, the training set of the basketball basic action dataset is used to train the network model, and the test set is used to test the model's accuracy. Compared with action recognition based on raw video data, recognition based on skeleton coordinate information has higher recognition efficiency. Therefore, in the comparative experiments, this paper first compares different 3D action recognition methods on the NTU RGB+D dataset, as shown in Table 4. The results show that 3D-based action recognition captures richer joint relationships and helps discover more useful patterns, and on the NTU RGB+D dataset, additional motion prediction and completion based on skeleton features improve recognition efficiency. To deal with noise and occlusion in 3D skeleton data, some scholars have introduced a gating mechanism into the LSTM to learn the reliability of the sequential input data and adjust its effect on the long-term context stored in the memory cells accordingly; the recognition rate reached 77.7%. With ST-GCN, the recognition efficiency reaches 88.3%. This method combines the GCN and TCN models into a spatiotemporal two-stream dynamic skeleton model; equipped with three-dimensional convolution filters, its accuracy is better than that of other network structures. In the GCN model, the hierarchical structure and the data in the action recognition task are diverse, and the graph topology is heuristically set and fixed across all model layers and input data for processing graph-structured data with different rules.
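The following sketch shows this nine-layer 64/128/256 channel plan ending in global average pooling and a fully connected classifier. The block internals are simplified stand-ins (plain convolutions replace the graph operation); this illustrates the channel configuration, not the exact ST-GCN implementation.

```python
# Sketch of the nine-layer GCN-TCN channel plan for (18, 3, 300) skeletons.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # Placeholder for graph conv (spatial) + temporal conv over frames.
        self.spatial = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.temporal = nn.Conv2d(c_out, c_out, kernel_size=(9, 1),
                                  padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (B, C, T=300, V=18)
        return self.relu(self.temporal(self.spatial(x)))

channels = [(3, 64), (64, 64), (64, 64),
            (64, 128), (128, 128), (128, 128),
            (128, 256), (256, 256), (256, 256)]
net = nn.Sequential(*[STGCNBlock(a, b) for a, b in channels],
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 6))
print(net(torch.randn(2, 3, 300, 18)).shape)   # (2, 6) class scores
```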
Then, the coordinate data of the 3D skeleton key points are input into the ST-GCN network for a comparative test. Specifically, the 2D keypoint coordinates and the 3D keypoint coordinates are each used as input, and the model is trained and tested on the self-built basketball dataset (Figure 6). The experimental results show that the recognition rates with 2D and 3D coordinates are 66.64% and 87.69%, respectively. In network training, because the 2D joint-point information lacks depth information, the recognition rate with it as input is significantly lower than with the 3D joint-point information as input. This is the reason for converting the 2D coordinates to 3D coordinates.
The ST-GCN recognition algorithm is compared across different datasets. The standard cross-entropy loss function is used throughout, the batch size is 16 during initial training, and the initial learning rate is 0.01, reduced to 10% of its value every 5 epochs; a total of 65 epochs are trained, and the validation recognition rate is recorded on the self-built basketball dataset. The model in this paper is experimentally compared on the NTU RGB+D dataset, the Kinetics-skeleton dataset, and the self-built basketball dataset. The experimental results are shown in Figure 7 below.
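A sketch of this stated training configuration follows: cross-entropy loss, batch size 16, initial learning rate 0.01 decayed to 10% every 5 epochs, and 65 epochs. The model and dataset here are toy stand-ins for illustration, not the paper's network or data.

```python
# Sketch of the training configuration described above.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real skeleton model and dataset, for illustration only.
model = nn.Sequential(nn.Flatten(), nn.Linear(18 * 3 * 300, 6))
data = TensorDataset(torch.randn(64, 18, 3, 300), torch.randint(0, 6, (64,)))
train_loader = DataLoader(data, batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# lr is multiplied by 0.1 every 5 epochs, i.e., reduced to 10% of its value.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(65):
    for skeletons, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(skeletons), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```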
Since this experiment uses a self-built basketball dataset, the Top1 recognition rate in the initial action recognition results is too low, and the parameters need to be adjusted to improve it, as shown in Figure 8. The key parameter is the learning rate lr, which scales the update of each parameter along the gradient of the loss: if the gradient with respect to a parameter is positive, increasing the parameter increases the loss, so a positive step is subtracted to reduce the error; if the gradient is negative, a negative step is subtracted. After adjusting the parameters many times, the final configuration is an lr value of 0.30, a batch size of 8, and 55 epochs, at which the recognition accuracy is highest and most stable. After tuning, the Top1 value of this method is 42.21% and the Top5 value is 88.77%, which are 10.61% and 35.09% higher, respectively, than those obtained on the Kinetics-skeleton dataset. Compared with the NTU RGB+D dataset, the Top1 recognition rate is significantly lower; this is because the self-built basketball dataset relies on manual segmentation of the basic actions during collection and processing, which introduces errors, and the collected videos contain occlusions. These problems directly affect the recognition rate.
In order to examine the classification results of ST-GCN on the self-built basketball video dataset in detail, this paper uses a confusion matrix to evaluate its performance, as shown in Figure 9. The diagonal elements give the percentage of samples correctly classified as their true class. The confusion matrix shows that ST-GCN can effectively handle the shape variation and skeleton noise present in large-scale data.
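A minimal sketch of building such a row-normalized confusion matrix follows; the labels below are illustrative, not the study's results.

```python
# Sketch of a row-normalized confusion matrix: rows are true classes,
# columns are predicted classes, diagonal entries are per-class accuracies.
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # normalize each row so diagonal entries become per-class accuracies
    return cm / cm.sum(axis=1, keepdims=True)

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]
print(np.round(confusion_matrix(y_true, y_pred, 3), 2))
```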
Since basketball is a multiplayer sport, the movement of players and the referee on the court inevitably causes occlusion, and the inconsistent standardization of each player's movements also affects the experimental results. The shooting action is divided into shooting, hook layup, and dunk. As the figure shows, “shoot”, “layup”, and “slam dunk” are easily confused: among the “shoot” actions, 1% were mistaken for “layup” and 3% for “slam dunk”. The dribbling action is divided into in situ dribbling and running dribbling, which are likewise easily confused. The main reason is that, during a game, each athlete's action extraction suffers from varying degrees of active and passive interaction, and the relational information between skeleton joints is lacking, making these actions difficult to capture and distinguish.

3.2. The Effect of Basketball Teaching Coupled with 3D Algorithms

Taking the “basketball in situ one-handed over-the-shoulder shooting” class in Theme 2 as an example, the strategy of integrating deep learning into basketball optional courses is analyzed, and the teaching effects of the flipped classroom mode and the traditional mode are compared. “Shooting with one hand over the shoulder” is one of the most important movements in the whole basketball course; its basic concepts are quite abstract and difficult to accept and understand, and it is the key point of this class. The students' learning goal is to explore and construct the “shooting with one hand over the shoulder” action through videos or self-practice and to use this as a guide for asking deep learning questions [30]. Students' physical problems are evaluated in a timely manner, the questions are continuously extended through further questioning so that students' thinking stays active, and open-ended questions are assigned after class. The former enables students to summarize their own knowledge and thoughts, while the latter paves the way for subsequent learning, achieves the effect of reconstruction, and forms a tight, compact networked thinking structure. The following is the classroom process of one-handed over-the-shoulder basketball shooting in the flipped classroom model based on deep learning [31]:
By means of testing, the physical quality of the students in each class was measured before and after the experiment to test whether there was a significant difference in the physical quality of the students in experimental class 1, experimental class 2, and the control class before and after the experiment. The specific statistical results are shown in Table 5.
The physical fitness scores of the students in experimental class 1, experimental class 2, and the control class before and after the experiment were subjected to a paired-sample t-test. For experimental class 1, the p values for the 50 m run, sit-ups/pull-ups, the 800/1000 m run, seated forward bending, and cross-direction running were all less than 0.05; that is, there were significant differences in all five physical qualities before and after the experiment, and the average scores improved. Students in experimental class 1 took the physical fitness training at the end of each class seriously, clearly recognized its benefits, and warmed up before the test, so experimental class 1 showed the best overall physical fitness. For experimental class 2, the p values for the 50 m run, sit-ups/pull-ups, the 800/1000 m run, and seated forward bending were all less than 0.05, while the p value for cross-direction running was greater than 0.05. Therefore, the students in experimental class 2 showed significant differences in those four physical qualities before and after the experiment, with improved average scores. The lack of a significant difference in cross-direction running is because some students were still unfamiliar with its rules, made mistakes, and thus lowered the class average. For the control class, the p values for all five aspects were greater than 0.05: although the average scores improved, none reached significance. The reason is that some students seldom exercise outside of physical education class, and some did not warm up fully before the test, so they did not perform at their best. Thus, there were no significant differences in any of the five physical fitness tests before and after the experiment.
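For reference, a paired-sample t-test as used here can be run in a few lines; the score arrays below are illustrative numbers, not the study's data.

```python
# Sketch of the paired-sample t-test comparing scores before and after.
import numpy as np
from scipy import stats

before = np.array([7.9, 8.1, 7.6, 8.4, 7.8, 8.0])   # e.g., 50 m times (s)
after = np.array([7.6, 7.9, 7.4, 8.1, 7.5, 7.8])
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")        # p < 0.05: significant
```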
The students' scores in various basketball techniques and in technical and tactical application ability were measured in each class before and after the experiment to test whether the students in experimental class 1, experimental class 2, and the control class differed significantly before and after the experiment. The specific statistical results are shown in Table 6.
A paired-sample t-test was conducted on the students' basketball skill and technical/tactical application scores in experimental class 1, experimental class 2, and the control class before and after the experiment. The p values of experimental class 1 for full-court dribbling and passing, 60 s free throws, the V-shaped layup, and the teaching competition were all less than 0.05; that is, there were significant differences in all four basketball skills and technical/tactical application abilities before and after the experiment. After a semester of deep learning flipped classroom teaching, the students' average cognitive level reached the parallel structure level, and they could clearly understand their own strengths and weaknesses. Interviews showed that most students exercised after class to address their weaknesses, so the final test results show that the average scores of experimental class 1 in the various basketball techniques and technical/tactical abilities improved over those before the experiment. For experimental class 2, the p values for 60 s free throws and the V-shaped layup were both less than 0.05, while the p values for dribbling and passing and for the teaching competition were greater than 0.05.
Therefore, for the students in experimental class 2, there were significant differences before and after the experiment in the two skills of 60 s free throws and the V-shaped layup, with improved average performance, but no significant differences in dribbling and passing or in the teaching competition. The reason is that, after class, some students practice basketball in their spare time, but most of that practice is shooting, with little practice of dribbling, passing, or game play, so shooting results improved significantly while the other skills did not differ from before the experiment. The p values of the control class for dribbling and passing, 60 s free throws, the V-shaped layup, and the teaching competition were all greater than 0.05. The reason is that most of these students seldom practiced basketball skills or technical/tactical application outside of physical education class.
By means of testing, the deep learning ability of students in each class before and after the experiment is measured, and it is tested whether there is a significant difference in the deep learning ability of students in experimental class 1, experimental class 2, and the control class before and after the experiment. The specific statistical results are shown in Figure 10.
A paired-sample t-test was conducted on the deep learning ability scores of the students in experimental class 1, experimental class 2, and the control class before and after the experiment. For experimental class 1, the p values at all five levels (the pre-structure, single-point structure, multipoint structure, parallel structure, and abstract extended structure levels) were less than 0.05; that is, there were significant differences at all five deep learning levels before and after the experiment. After one semester, flipped classroom deep learning strategies such as instructional design and question-based teaching enabled students not only to learn about basketball superficially but also to understand the content deeply, thereby developing higher-order thinking; accordingly, the final test results show that the cognitive level of the students in experimental class 1 improved. For experimental class 2, the p values at the pre-structure, single-point structure, and multipoint structure levels were all less than 0.05, while the p values at the parallel structure and abstract extended structure levels were greater than 0.05; that is, the students in experimental class 2 showed significant differences at the first three levels of deep learning but not at the parallel structure or abstract extension levels. The reason is that these students practiced in a flipped classroom without deep learning: although they formed some ideas about techniques and tactics, they could not yet combine what they had learned with prior knowledge or express it in their own words, so they did not reach a deeper level of learning. For the control class, the p values at the pre-structure and single-point structure levels were less than 0.05, while the p values at the multipoint structure, parallel structure, and abstract extended structure levels were all greater than 0.05; that is, the control class showed significant differences only at the first two levels. The reason is that, after traditional teaching, most students' learning remained shallow and their understanding of the content superficial; traditional teaching therefore cannot lead students to deep learning.

3.3. A Comparison of Results with Others

On the MPII and MSCOCO datasets, the ARSRNet network, the ARSRNet-A network with the attention module, and the ARSRNet-MA model with the additional time-domain multiscale module are compared experimentally. The experimental results are shown in Table 7. As Table 7 shows, the model with the highest accuracy is ARSRNet-MA. In addition, the accuracy of ARSRNet-A is 0.62% higher than that of the base ARSRNet model, and the accuracy of ARSRNet-MA is 0.95% higher than that of ARSRNet-A. The experimental data show that adding the channel attention module significantly improves accuracy on both datasets, which verifies the effectiveness of the model improvement.
The recognition accuracy of the ARSRNet-MA network trained from scratch on the UCF101 dataset is low because UCF101 is too small relative to the network model, the model parameters cannot be fully trained, and overfitting occurs. After pretraining on the Kinetics-400 dataset, ARSRNet-MA was fine-tuned and tested on UCF101, reaching an accuracy of 91.7%; the comparison of ARSRNet-MA with other mainstream behavior recognition networks on UCF101 is shown in Table 8. As Table 8 shows, the recognition accuracy of the ARSRNet-MA network on UCF101 increases by 0.9%, which verifies the effectiveness of the network model improvement. At the same time, the recognition accuracy of ARSRNet-MA is comparable to that of other mainstream networks. Therefore, ARSRNet-MA performs well on the UCF101 dataset.

3.4. The Visual Results

The experimental design involves two aspects: the choice of preprocessing for converting the video's 2D coordinates to 3D coordinates, and the use of the action recognition algorithm. The 3D skeleton coordinate information of the basketball movement is matched in real time and input to VideoPose3D, the player's 3D skeleton video is output in real time, and the video is then processed by the experimental algorithm to obtain a video labeled with its classification. The specific process and results are shown in Figure 11, which includes (a) the preprocessing stage of the human motion video and (b) the result of motion recognition.
In the analysis of a whole game video, RMPE's human pose estimation method extracts the posture skeletons of multiple athletes, which are saved and output to a JSON file. At the same time, the recognition method recognizes the actions of the multiple athletes and displays the scores. Figure 12 includes (a) the input competition video and (b) the result after recognition.

4. Discussion

By applying the flipped classroom teaching model based on the deep learning concept in a college basketball class, it was found that the physical quality of the students in all three classes improved after the experiment, with the most significant teaching effect in experimental class 1, which used the deep learning flipped classroom model. The quality that most requires students to truly use their brains is agility, tested by cross-direction running. The flipped classroom model based on deep learning better enables students to understand the essentials of cross-direction running, whereas in the flipped classroom without deep learning and in the traditional classroom, students' performance remained the same as before the experiment. After the experiment, the basketball technique and technical/tactical application ability of experimental class 1 improved the most. The average scores of experimental class 2 in the 60 s free throw and the V-shaped layup increased, but the improvement was not significant. After one semester of traditional teaching, the basketball technique and technical/tactical ability of the control class improved somewhat, but the overall level changed little compared with before the experiment. After the experiment, most students in experimental class 1 reached the parallel structure and extended abstract structure levels of deep learning, most students in experimental class 2 were at the multipoint structure level, and most students in the control class stayed at the single-point structure level. This shows that the flipped classroom teaching mode based on deep learning is more conducive to students' in-depth learning: the flipped classroom alone allows some in-depth learning but cannot reach a deeper learning state, while the traditional mode only lets students learn the superficial content of knowledge, leaving their understanding at the level of shallow learning.

5. Conclusions

With the deepening of action recognition research, recognition technology has made great progress, but it still faces huge challenges. For example, there is still a large gap between the images collected by a computer and those processed by biological vision, and the biological vision system's sensitivity to motion information is much higher than that of computers. Although similar to still image analysis, video data analysis is much more complex. A successful video analysis solution must not only overcome variations such as scale, intraclass disparity, and noise, but also analyze motion cues in the video. Human action recognition can be considered an important problem in video analysis due to its wide range of applications and the complexity of motion patterns generated by joint actions. For vision-based human motion analysis to go beyond isolated actions and poses, contextual information about the environment or objects should be integrated. Such environmental context provides a strong indication of the type of action, improving recognition accuracy and action prediction.
Through research and analysis of the actual needs of basketball team training, current basketball training relies only on the coach's subjective judgment and analysis to formulate a training plan, without using sports analysis technology to intelligently analyze each player's physical characteristics and degree of movement standardization. The relevant literature introduces action recognition algorithms into basketball auxiliary training systems to provide a reference for the training of coaches and athletes.
Human pose estimation detects and estimates the position, orientation, and scale of each part of the target human body from the image; this information must be converted into a digital form that the computer can interpret in order to output the current human pose and action. Action recognition, in turn, uses the pose estimation result as its input to judge whether a person's action is standardized and how the form can be improved. Using a pose estimation algorithm and a skeleton-based action recognition algorithm, the basic actions of players are classified and compared, helping coaches and players intuitively analyze the actions and tactics of their own team and the opposing team, giving tactical suggestions from the action analysis and improvement suggestions for the shortcomings of their own actions. The system helps coaches and athletes observe the details of movements through data visualization, compares the computed technical action indicators with standard data, and provides personalized improvement plans to raise the level of performance.
In video-based human action recognition, the interaction between people is usually a defining feature of an action, but it is also one of the practical challenges, owing to the large variation within similar actions, the ease of confusing different action types, and the self-occlusion of action targets. Future work should focus on methods of encoding context information so that it can be effectively integrated into the coupled action recognition and pose estimation system, improving recognition efficiency and accuracy. In the deep architecture for action recognition, the key technologies involved will be 3D convolutional networks, temporal pooling, optical flow frames, and so on; although these elements were developed separately, combining them can improve performance. Further gains also require carefully designed algorithms, such as data augmentation techniques, funnel-shaped (hourglass) structures, and dedicated frame sampling strategies; a sampling sketch is given after this paragraph. When an athlete's action is too fast or too complex, the recognition result is not ideal, so the multi-person pose estimation algorithm needs to be improved to extract more accurate human pose information and thus raise recognition efficiency. At present, there is no public dataset of professional basketball videos. The dataset constructed in this paper covers only six categories, which is still far from the full range of basic basketball actions in a game; more basic basketball action videos should be collected in later work, and action classification should be discussed with professional coaches for further research on basketball action recognition.
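As a minimal sketch of one frame sampling strategy, the snippet below implements segment-based sampling in the style of temporal segment networks: the clip is split into equal segments and one frame index is taken from each, so a fixed-length input covers the whole video. The segment count and the clip length (borrowed from Table 3's shooting class) are assumptions for illustration.

```python
import numpy as np

def segment_sample(num_frames, num_segments, training=True, rng=None):
    """Pick one frame index per equal-length segment of a video.

    During training the index is drawn randomly inside each segment
    (a light temporal augmentation); at test time the segment center
    is used so the result is deterministic.
    """
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1)
    indices = []
    for i in range(num_segments):
        lo, hi = int(edges[i]), max(int(edges[i + 1]), int(edges[i]) + 1)
        indices.append(int(rng.integers(lo, hi)) if training else (lo + hi) // 2)
    return np.asarray(indices, dtype=int)

# Example: draw 8 frame indices from a 3730-frame shooting clip (cf. Table 3).
print(segment_sample(3730, 8, training=False))
```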

Author Contributions

Conceptualization, K.Z.; methodology, X.S.; software, K.Z.; validation, X.S. and K.Z.; formal analysis, K.Z.; investigation, K.Z.; resources, K.Z.; data curation, K.Z.; writing—original draft preparation, K.Z.; writing—review and editing, X.S.; visualization, X.S.; supervision, K.Z.; project administration, K.Z.; funding acquisition, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

  1. Ning, X.; Tian, W.; Yu, Z.; Li, W.; Bai, X.; Wang, Y. HCFNN: High-order Coverage Function Neural Network for Image Classification. Pattern Recognit. 2022, 131, 108873. [Google Scholar] [CrossRef]
  2. Ning, X.; Xu, S.; Nan, F.; Zeng, Q.; Wang, C.; Cai, W.; Jiang, Y. Face editing based on facial recognition features. IEEE Trans. Cogn. Dev. Syst. 2022. [Google Scholar] [CrossRef]
  3. Qi, S.; Zou, J.; Yang, S.; Jin, Y.; Zheng, J.; Yang, X. A self-exploratory competitive swarm optimization algorithm for large-scale multiobjective optimization. Inf. Sci. 2022, 609, 1601–1620. [Google Scholar] [CrossRef]
  4. Wang, C.; Ning, X.; Sun, L.; Zhang, L.; Li, W.; Bai, X. Learning Discriminative Features by Covering Local Geometric Space for Point Cloud Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  5. Cai, W.; Ning, X.; Zhou, G.; Bai, X.; Jiang, Y.; Li, W.; Qian, P. A Novel Hyperspectral Image Classification Model Using Bole Convolution with Three-Directions Attention Mechanism: Small sample and Unbalanced Learning. IEEE Trans. Geosci. Remote Sens. 2022. [Google Scholar] [CrossRef]
  6. You, B.; Qi, H.; Ding, L.; Li, S.; Huang, L.; Tian, L.; Gao, H. Fast neural network control of a pseudo-driven wheel on deformable terrain. Mech. Syst. Signal Process. 2021, 152, 107478. [Google Scholar] [CrossRef]
  7. Ding, Y.; Qu, Y.; Sun, J.; Du, D.; Jiang, Y.; Zhang, H. Long-Distance Multi-Vehicle Detection at Night Based on Gm-APD Lidar. Remote Sens. 2022, 14, 3553. [Google Scholar] [CrossRef]
  8. Jin, X.; Li, D. Rotation Prediction Based Representative View Locating Framework for 3D Object Recognition. Comput.-Aided Des. 2022, 150, 103279. [Google Scholar] [CrossRef]
  9. Kumar, M.N.; Kumar, S.S. Face Recognition Using 3D CNN and Hardmining Loss Function. SN Comput. Sci. 2022, 3, 155. [Google Scholar]
  10. Lokesh, B.; Hweiyan, T.; Bor, F.C. Functional Nanoparticles with Magnetic 3D Covalent Organic Framework for the Specific Recognition and Separation of Bovine Serum Albumin. Nanomaterials 2022, 12, 411. [Google Scholar]
  11. Pranav, W.; Kashif, U.; Gokul, K.; Timothy, O.; Bahram, J. Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths. Opt. Express 2022, 30, 1205–1218. [Google Scholar]
  12. Yasir, S.M.; Sadiq, A.M.; Ahn, H. 3D Instance Segmentation Using Deep Learning on RGB-D Indoor Data. Comput. Mater. Contin. 2022, 72, 15. [Google Scholar]
  13. Shuaifei, M.; Qianru, Z.; Tengfei, L.; Huaibo, S. Basic motion behavior recognition of single dairy cow based on improved Rexnet 3D network. Comput. Electron. Agric. 2022, 194, 106772. [Google Scholar]
  14. Mourad, C.; Zahid, A.; Abdehai, L. Contactless person recognition using 2D and 3D finger knuckle patterns. Multimed. Tools Appl. 2022, 81, 8671–8689. [Google Scholar]
  15. Muzahid, A.A.M.; Wan, W.; Ferdous, S.; Mohammed, B.; Li, H.; Hidayat, U. Erratum to “Progressive conditional GAN-based augmentation for 3D object recognition”. Neurocomputing 2022, 473, 20–30. [Google Scholar] [CrossRef]
  16. Sumaira, M.; Hyeon, J.S.; Jin, K.E.; Hyeon, B.S.; Gyo, I.G.; Won, P.J.; Yong, K.T. 3D Recognition Based on Sensor Modalities for Robotic Systems: A Survey. Sensors 2021, 21, 7120. [Google Scholar]
  17. Cheng, J.; Bie, L.; Zhao, X.; Gao, Y. Visual information quantification for object recognition and retrieval. Sci. China Technol. Sci. 2021, 64, 2618–2626. [Google Scholar] [CrossRef]
  18. Wang, W.; Cai, Y.; Wang, T. Multi-view dual attention network for 3D object recognition. Neural Comput. Appl. 2021, 34, 3201–3212. [Google Scholar] [CrossRef]
  19. Guillem, V.; Khadidja, H.; Pere, R.; Nuno, G. Semantic Mapping for Autonomous Subsea Intervention. Sensors 2021, 21, 6740. [Google Scholar]
  20. Han, T.T.; Ru, T.Y. Architecture Design and VLSI Implementation of 3D Hand Gesture Recognition System. Sensors 2021, 21, 6724. [Google Scholar]
  21. Wang, Y.; Shen, X.J.; Chen, H.P.; Sun, J.X. Action Recognition in Videos with Spatio-Temporal Fusion 3D Convolutional Neural Networks. Pattern Recognit. Image Anal. 2021, 31, 580–587. [Google Scholar] [CrossRef]
  22. Wang, L.; Dong, X.; Guo, S. Sand-bed defect recognition for 3D sand printing based on deep residual network. China Foundry 2021, 18, 344–350. [Google Scholar] [CrossRef]
  23. Li, W.; Cheng, H.; Zhang, X. Efficient 3D Object Recognition from Cluttered Point Cloud. Sensors 2021, 21, 5850. [Google Scholar] [CrossRef]
  24. Liu, Y.; Jiang, D.; Duan, H.; Sun, Y.; Li, G.; Tao, B.; Yun, J.; Liu, Y.; Chen, B. Dynamic Gesture Recognition Algorithm Based on 3D Convolutional Neural Network. Comput. Intell. Neurosci. 2021, 2021, 1–12. [Google Scholar] [CrossRef]
  25. Liang, Q.; Li, Q.; Zhang, L.; Mi, H.; Nie, W.; Li, X. MHFP: Multi-view based hierarchical fusion pooling method for 3D shape recognition. Pattern Recognit. Lett. 2021, 150, 214–240. [Google Scholar] [CrossRef]
  26. Nie, J.; Wei, Z.; Nie, W.; Liu, A. PGNet: Progressive Feature Guide Learning Network for Three-dimensional Shape Recognition. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–17. [Google Scholar] [CrossRef]
  27. Hughes, K.N.; Berndt, A.; Gill, S. Application of the Flipped Classroom Approach in an Undergraduate Maternal-Newborn Nursing Course to Improve Clinical Reasoning. Creat. Nurs. 2022, 28, 48–53. [Google Scholar] [CrossRef]
  28. Yassin, K.; Danilo, M.; Ervin, S. A review of Hidden Markov models and Recurrent Neural Networks for event detection and localization in biomedical signals. Inf. Fusion 2021, 69, 52–72. [Google Scholar]
  29. Paul, T.; Pynadathu, R.N.; Wei, L.C.; Rafie, B.J.M. EDTA functionalised cocoa pod carbon encapsulated SPIONs via green synthesis route to ameliorate textile dyes—Kinetics, isotherms, central composite design and artificial neural network. Sustain. Chem. Pharm. 2021, 19, 100349. [Google Scholar]
  30. Jeongsu, L.; Chul, L.Y.; Tae, K.J. Migration from the traditional to the smart factory in the die-casting industry: Novel process data acquisition and fault detection based on artificial neural network. J. Mater. Process. Tech. 2021, 290, 116972. [Google Scholar]
  31. Li, B.; Pi, D.; Lin, Y. Learning ladder neural networks for semi-supervised node classification in social network. Expert Syst. Appl. 2021, 165, 113957. [Google Scholar] [CrossRef]
Figure 1. Pytorch model building flow chart (a); LAMP working principle diagram (b).
Figure 2. Action recognition overall architecture diagram.
Figure 3. Fitting 3D coordinates of human joints with 2D coordinates of human joints.
Figure 4. Establishing spatial coordinates.
Figure 5. Overall structure diagram of ST-GCN algorithm (a); human joint connectivity (b).
Figure 6. Recognition results of different methods on the self-built basketball dataset.
Figure 7. Recognition results of the same method on different datasets.
Figure 8. Comparison of the initial and modified Top1 and Top5 recognition rates.
Figure 9. Ball-based action recognition confusion matrix.
Figure 10. Comparison and analysis of the deep learning ability of students in the experimental class and the control class before and after the experiment.
Figure 11. Shooting action video. (a) Preprocessing stage of human motion video; (b) action recognition results.
Figure 12. Basketball game video. (a) Input competition video; (b) result after recognition.
Table 1. MPII dataset experimental results.

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total | s/mAP |
|---|---|---|---|---|---|---|---|---|---|
| OpenPose | 91.2 | 87.6 | 77.4 | 66.9 | 75.6 | 68.9 | 62.1 | 76.8 | 0.0016 |
| RMPE | 90.8 | 89.9 | 84.6 | 75.6 | 80.6 | 75.4 | 67.2 | 81.3 | 1.9 |
Table 2. MS COCO dataset experimental results.

| Method | mAP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|
| OpenPose | 61.8 | 84.7 | 67.4 | 57.1 | 68.1 |
| RMPE | 61.9 | 83.6 | 69.7 | 58.6 | 67.5 |
Table 3. Action classification and video frame numbers.

| Type of Movement | 1 Shoot | 2 Layup | 3 In Situ Dribbling | 4 Running Dribbling | 5 Off Ball | 6 Slam Dunk |
|---|---|---|---|---|---|---|
| Video frames | 3730 | 3100 | 70,300 | 2790 | 382 | 1200 |
Table 4. Comparison of 3D-based recognition methods on NTU RGB-D.

| Method | CV (%) | CS (%) |
|---|---|---|
| ST-LSTM | 77.7 | 69.2 |
| TSRJI | 80.3 | 73.5 |
| Clips+CNN+MTLN | 84.83 | 79.57 |
| ST-GCN | 88.3 | 81.5 |
| DPRL | 89.9 | 83.6 |
| SGN | 93.6 | 86.9 |
| 2S-AGCN | 95.1 | 88.5 |
| 2S-NLGCN | 95.1 | 88.5 |
| MS-AAGCN | 96.2 | 90.0 |
| Sym-GNN | 96.5 | 91.0 |
| MS-AAGCN-TEM | 96.5 | 91.2 |
| RNX3D101+MS-AAGCN-C | 99.1 | 96.1 |
Table 5. Comparison and analysis of the physical qualities of students in the experimental classes and the control class before and after the experiment.

| Indicator | Group | Before Experiment (Mean ± SD) | After Experiment (Mean ± SD) | T | P |
|---|---|---|---|---|---|
| 50 m | Class 1 | 72.88 ± 8.02 | 80.06 ± 6.01 | −8.889 | 0.002 |
| 50 m | Class 2 | 74.37 ± 9.38 | 78.7 ± 6.52 | −4.377 | 0.015 |
| 50 m | Control class | 73.32 ± 8.78 | 74.13 ± 8.5 | −1.099 | 0.281 |
| Crunches/Pull-ups | Class 1 | 75.06 ± 6.77 | 82.22 ± 7.78 | −7.25 | 0.005 |
| Crunches/Pull-ups | Class 2 | 74.23 ± 7.96 | 79.33 ± 8.51 | −4.348 | 0.003 |
| Crunches/Pull-ups | Control class | 74.06 ± 6.53 | 75.4 ± 7.31 | −2.019 | 0.052 |
| 800/1000 m | Class 1 | 71.78 ± 7.53 | 80.78 ± 7.11 | −9.621 | 0.002 |
| 800/1000 m | Class 2 | 75.1 ± 8.16 | 78.83 ± 5.47 | −4.886 | 0.012 |
| 800/1000 m | Control class | 73.81 ± 7.83 | 74.22 ± 6.92 | −1.223 | 0.021 |
| Sitting forward bend | Class 1 | 72.71 ± 7.07 | 79.19 ± 8.48 | −10.739 | 0.003 |
| Sitting forward bend | Class 2 | 72.87 ± 8.89 | 77.43 ± 7.38 | −4.745 | 0.002 |
| Sitting forward bend | Control class | 73.75 ± 7.63 | 74.06 ± 6.46 | −0.539 | 0.593 |
| Cross run | Class 1 | 72.41 ± 9.02 | 78.91 ± 8.25 | −9.463 | 0.001 |
| Cross run | Class 2 | 75.63 ± 8.57 | 75.8 ± 7.71 | −0.313 | 0.756 |
| Cross run | Control class | 73.59 ± 8.02 | 74.06 ± 7.47 | −1.160 | 0.255 |
Table 6. Comparison and analysis of the students' basketball skills and technical and tactical application abilities before and after the experiment.

| Indicator | Group | Before Experiment (Mean ± SD) | After Experiment (Mean ± SD) | T | P |
|---|---|---|---|---|---|
| Dribble | Class 1 | 63.75 ± 11.52 | 74.37 ± 10.02 | −6.849 | 0.001 |
| Dribble | Class 2 | 64.67 ± 10.74 | 66.67 ± 7.58 | −4.562 | 0.003 |
| Dribble | Control class | 63.44 ± 7.04 | 66.94 ± 6.65 | −1.598 | 0.073 |
| 60 s free throw | Class 1 | 65.94 ± 12.66 | 75.31 ± 12.95 | −6.325 | 0.002 |
| 60 s free throw | Class 2 | 60.67 ± 10.48 | 67.65 ± 8.15 | −3.593 | 0.001 |
| 60 s free throw | Control class | 61.25 ± 10.7 | 62.19 ± 7.95 | −1.408 | 0.001 |
| V-layup | Class 1 | 60.23 ± 12.7 | 69.23 ± 3.26 | −6.593 | 0.023 |
| V-layup | Class 2 | 60.59 ± 10.7 | 65.23 ± 8.256 | −1.265 | 0.001 |
| V-layup | Control class | 63.56 ± 8.27 | 69.236 ± 6.459 | −2.365 | 0.001 |
| Teaching competition | Class 1 | 62.03 ± 2.36 | 69.65 ± 11.26 | −6.232 | 0.001 |
| Teaching competition | Class 2 | 61.56 ± 1.25 | 68.48 ± 1.20 | −1.029 | 0.003 |
| Teaching competition | Control class | 62.69 ± 9.29 | 67.65 ± 6.98 | −1.682 | 0.012 |
Table 7. Comparison of attention modules on the MS COCO dataset.

| Model | Accuracy (%) |
|---|---|
| ARSRNet | 62.82 |
| ARSRNet_A | 61.26 |
| ARSRNet_M | 62.16 |
| ARSRNet_MA | 62.90 |
Table 8. Recognition accuracy of behavior recognition networks on the UCF101 dataset.

| Model | Input | Training Data | Accuracy (%) |
|---|---|---|---|
| C3D [20] | RGB | Sports-1M | 82.1 |
| TSN [21] | RGB | ImageNet | 86.3 |
| Res3D [22] | RGB | Kinetics-400 | 84.3 |
| P3D [23] | RGB | Sports-1M | 89.0 |
| T3D [24] | RGB | Kinetics-400 | 91.2 |
| ECO [25] | RGB | Kinetics-400 | 88.9 |
| MiCT-Net [26] | RGB | Kinetics-400 | 86.6 |
| ARSRNet | RGB | Kinetics-400 | 91.2 |
| ARSRNet-MA | RGB | Kinetics-400 | 91.3 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

