1. Introduction
Human Pose Estimation (HPE) consists of estimating the position of different parts of the body, such as the joints in a 2D or 3D space depending on the estimation type, normally from visual information, such as images, and sometimes through other additional data obtained by different types of sensors, such as inertial sensors or depth sensors. This field of research can be considered a combination of Data Processing and Artificial Intelligence, more specifically, Computer Vision.
Since 2014, and mainly in the past 5 years, the use of and interest in HPE have increased, mainly due to the introduction of Deep Learning to the field [
1]. The methodology has evolved from the first simple neural networks to the complex Convolutional Neural Networks (CNNs) of today. The use of filters to extract lines, edges, silhouettes, and other salient features of the elements contained in images, together with the capability of a system to learn those features and then detect them when a similar situation arises, has marked an inflection point.
There are some available surveys that give an overall view on the papers as well as the State Of The Art (SOTA) systems, such as [
2,
3]. The first focuses on monocular approaches, while the second gives an overall view of the different types of HPE systems, such as 2D and 3D, single-view and multi-view, single-person and multi-person, and so on. Depending on the characteristics of the problem, different types of systems can be found. An overview of the publicly available datasets, as well as of the metrics used, is also presented.
Both surveys and preliminary analysis of the available papers about HPE show how the applications of HPE have increased. Different uses of these types of systems can be found, such as in the field of health [
4], Human Computer Interaction (HCI) [
5], Motion Capture (MoCap) systems [
6], Virtual Reality (VR) [
7], Augmented Reality [
8], exergames [
9], and so on. For some applications, the systems are based on general-purpose systems that have shown very good performance in benchmarks. In the recent literature, we can find some examples of general-purpose HPE systems that implement innovative methods and on which other systems will probably be based, such as [
10], which additionally includes publicly available code. This system could be a very good starting point for developing an HPE system applied to Sport and Physical Exercise (SPE): it has obtained very good results in a benchmark with images in the wild, and could therefore be a very good option for in-the-wild predictions. Another good starting point for applying HPE in SPE is the system developed in [
11], which is publicly available as well. This system is specialized in situations of self-contact, so it could be a very good base for developing an HPE system applied to yoga, for instance.
This paper is a systematic review based on the PRISMA guidelines. Its objective is to provide an analysis of the literature similar to that of other HPE surveys and literature reviews, but focused on the application of HPE to the field of SPE, highlighting some aspects related to these systems and applying an analysis and review that follows the criteria specified throughout the paper. The importance of the evaluation is highlighted, taking into account the metrics and data used as well as the information and level of detail provided about the process, but other aspects related to the quality of the work and the paper are considered too. The innovations and evolution of this specific field, as well as its problems and opportunities, are presented.
As can be seen in the literature reviews on general-purpose HPE systems, those systems are trained on a variety of contexts and actions, but they are not specifically focused on SPE. Movements in sport and during physical exercise tend to differ from “standard” movements: some are very explosive, others include occlusions caused by other players or tools, and others involve more challenging body positions, such as in gymnastics or yoga. So, even if a general-purpose system can be applied in those contexts, depending on the sport, exercise, or specific needs, it will not perform as well as a more specialized system that is adapted to each context and trained with specific data. This is why it is important to analyze whether a general-purpose system can be used in SPE, in which sports it performs better, and which adaptations are needed to improve the performance, even if the evaluation metrics and base architectures are the same.
Several research questions are presented in
Table 1 and are addressed through the literature review. The discussion section will try to give answers to these questions, as well as reach some conclusions.
The structure of this paper is as follows.
Section 2 presents the evaluation methods used by the authors. Evaluation is considered one of the most important aspects of any system, as it serves as a tool to measure its performance and to compare it with the work of other authors. The most used metrics and datasets are analyzed, highlighting the fact that there are few 2D datasets for training HPE systems specialized in SPE, such as Leeds Sports Pose, Penn Action, and PoseTrack, along with some others that are not specifically designed for this area but include some content about sports or physical activities, such as the widely used 2D HPE benchmarks Common Objects in Context (COCO) and Max Planck Institut Informatik (MPII). Regarding 3D datasets, a lack of samples as well as of variety in terms of activities is detected; some datasets can be found, such as Demo for Martial Arts, Dancing and Sport Dataset (MADS), but they are still not enough to improve systems specialized in SPE. Then, the literature review is presented, first introducing the methodology and criteria used for the paper evaluation. Finally, the paper closes with a discussion of the analyzed field, presenting some key ideas and conclusions, as well as some possible future paths for the application of HPE in SPE.
5. Discussion
In this section, the objective is to answer the questions in
Table 1, as well as provide a conclusion regarding all the content presented in this literature review, and analyze the possibilities concerning the future applications and paths.
First, regarding the statistical information provided in the previous section, the number of papers published indicates that the topic of this systematic review can be considered a hot topic, which has been attracting the interest of the research community, mainly since 2017.
After analyzing the review
Table 5 from the previous section, and
in terms of overall form and content of the papers, it can be concluded that concerning paper quality, implementation, use of HPE in SPE, performed evaluation, and obtained results, [
46] can be considered a reference paper to replicate in terms of form. In this paper, the authors have a specific objective that is clearly presented, as is the method they follow. They analyze the needs of the specific context in which HPE is to be applied, and state-of-the-art general-purpose HPE methods are analyzed, used as examples, and adapted to those needs. This method is combined with other technologies to contribute to a specific area of SPE, and the results are compared with those of other methods using well-known metrics and taking into account aspects of the systems other than accuracy, such as speed or real-time applicability. Publicly available benchmarks are used, which makes it possible to compare the performance of the system with others. A dataset including images of the specific use case is developed as well, and the results obtained on it are compared with those of other SOTA HPE systems, which is a very good way of evaluating the developed system. The only negative aspect of the paper is related to the replicability of the work: even though a comparison of the developed method with other HPE systems is provided, neither the code nor the developed dataset is publicly available. Keeping the work private is understandable because the developed system could have future commercial use, but publishing the dataset used for evaluation and/or training would be a valuable contribution to the research community and would enable others to compare their systems.
In general, all the papers provide a good abstract and explain their experiments and evaluation properly, but, in many cases, an analysis of the limitations of the study, or of the problems faced, is missing. This can be interpreted as an attempt to show only the positive aspects of the work to make it more attractive, but analyzing and reporting the negative aspects is a good habit that helps the research community improve the quality of the systems. In any case, most of the papers provide innovative solutions applicable to sport or physical exercise, with good results.
As this topic is quite specific, most of the works are quite recent, and there are few research papers per year, the number of citations per paper is quite low. In some cases, there are no citations, but, as explained in a previous paragraph, this may be because some papers have been published only recently.
Different conclusions can be reached regarding different aspects of the analyzed information during the literature review. First of all, as a general conclusion, the lack of publications regarding the specific topic of HPE applied to SPE can be detected. Even if hundreds of papers can be found using related terms for the search, finally, few related high-quality papers are available.
Regarding the topic of the evaluation data, the conclusions that can be reached after the analysis of its availability are:
A bigger amount of 3D data is needed.
A higher variety in the type of actions/sports present in 3D datasets is needed.
The amount of 2D data could be enough for the development of a generic 2D HPE system to be applied in sports, but, when applying that system to specific sports, with their specific characteristics and problems, the error could be higher than expected from the overall sport evaluation. So, more variety of sports is needed, and a bigger amount of data per action/activity, including different challenges for the task of HPE.
Publishing the datasets developed by each author could be a very good way of contributing to solving this lack of publicly available data. Each contribution will be part of the data that could be used by different systems to solve the problems faced by the dataset authors or related problems of similar sports or activities.
As seen in
Table A4,
most of the HPE systems applied to sport or exercise are 2D systems, and those that are 3D systems have developed their own dataset for the specific use case, usually without making it available to the research community. This predominance of 2D systems may be due to the previously mentioned lack of 3D HPE datasets for SPE, so there is a need for a larger number of samples as well as an increase in the variety of activities. In addition, there are publicly available high-accuracy and fast systems such as OpenPose, whose method is introduced in [
58], a paper on which several works have built to apply HPE in different fields and for different applications, such as [
59], in which the authors' previous, less effective player tracking system is replaced by this model to implement an effective squash player tracker. The paper [
58] has been updated and extended in terms of detail and complexity, introducing [
21], which, as previously mentioned, has already served different authors to apply HPE in different sports. It will probably continue to be used for 2D HPE problems, and might also be applied to 3D HPE problems by integrating it with other methods that estimate the depth of the keypoints.
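As an illustration of how 2D keypoints could be combined with a separate depth estimate, the following minimal sketch lifts pixel coordinates to 3D camera coordinates with a standard pinhole camera model. It is not the method of any of the reviewed papers: the function name, the camera intrinsics (fx, fy, cx, cy), and the sample values are assumptions made only for this example, and the depth values are assumed to come from some other method or sensor.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def backproject_keypoints(
    keypoints_2d: List[Point2D],
    depths: List[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift 2D pixel keypoints to 3D camera coordinates (pinhole model).

    keypoints_2d: (u, v) pixel coordinates, e.g. from a 2D HPE system.
    depths: one depth value per keypoint, assumed to be given.
    fx, fy, cx, cy: hypothetical camera intrinsics (focal lengths and
    principal point, in pixels).
    """
    if len(keypoints_2d) != len(depths):
        raise ValueError("one depth value is required per keypoint")
    points_3d = []
    for (u, v), z in zip(keypoints_2d, depths):
        # Invert the projection u = fx * x / z + cx, v = fy * y / z + cy.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points_3d.append((x, y, z))
    return points_3d


if __name__ == "__main__":
    # Two illustrative keypoints located 2 m in front of the camera.
    pts = backproject_keypoints(
        [(320.0, 240.0), (420.0, 240.0)], [2.0, 2.0],
        fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    )
    print(pts)
```

The accuracy of such a lifting step depends entirely on the quality of the depth estimate and of the camera calibration, which is one reason why dedicated 3D datasets remain necessary for evaluation.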
One of the most surprising aspects of the available literature is that a big part of the papers does not use publicly available datasets to evaluate their systems, or they do not make their developed datasets public. As explained previously, data is a key aspect in the concept of replicability of work, as well as in terms of comparison with other systems, so, not including any evaluation with a dataset that can be accessed by other authors can be considered a quite negative aspect. Another key point regarding replicability is making the code available to other authors, and the code of the analyzed papers is not available in any case. When analyzing the literature of general use HPE systems, the code of several systems can be found. In any case, it is understandable that some authors do not consider publishing their code due to potential patent or product possibilities.
Regarding the
used data for the development and testing of the systems, on the one hand, several papers such as [
12,
13,
14,
15,
17,
18,
20,
37,
38,
42], developed their own datasets using manual annotations, MoCap systems, or other ground truth generation methods, but did not make them publicly available. Other papers created and published their dataset to contribute to the research community, such as [
22,
30]. A large number of papers use publicly available datasets, at least in the training phase of the system. Most of the public datasets used for evaluation are 2D datasets, and in some cases, other datasets such as UCF are used to provide qualitative results of the systems. In most cases, the type of data used is the same: the input is image data in combination with 2D or 3D joint locations as ground truth, and the output generated by the system is the estimated joint locations and, in some cases, some extra information related to the performance or other physical parameters of the use case.
As in the case of general-use HPE systems, CNNs are the basis of the methodology of most of the systems, in combination with different methods, such as the use of heatmaps and physical constraints to reduce the error by estimating only feasible body positions. Most of the authors use approaches previously introduced by other authors, pretrained on public datasets, as the basis of their system, and then apply methods to improve the usability of those systems in specific sports or exercise movements. It is also common to use HPE as a tool to generate new information regarding performance parameters, the location of the Center of Mass (CoM) of the athlete, the application of forces, etc.
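To illustrate the heatmap representation mentioned above, the following minimal sketch decodes one keypoint per heatmap by taking the highest-scoring cell. This is the simplest common decoding strategy, not the specific procedure of any reviewed paper; the function names, the stride value that maps heatmap cells back to image pixels, and the toy heatmap are assumptions made for this example.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows x columns of confidence scores


def decode_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-scoring cell of one heatmap."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def decode_pose(
    heatmaps: List[Heatmap], stride: int = 4
) -> List[Tuple[int, int, float]]:
    """Decode one keypoint per heatmap and scale it to image pixels.

    stride is a hypothetical downsampling factor between the input image
    and the heatmap resolution.
    """
    pose = []
    for heatmap in heatmaps:
        x, y, score = decode_heatmap(heatmap)
        pose.append((x * stride, y * stride, score))
    return pose


if __name__ == "__main__":
    # Toy 3x3 heatmap for a single joint, with its peak in the center.
    toy = [
        [0.0, 0.1, 0.0],
        [0.2, 0.9, 0.3],
        [0.0, 0.1, 0.0],
    ]
    print(decode_pose([toy]))
```

The score returned with each keypoint can be used to discard low-confidence joints before a physical constraint, such as a plausible limb length, is checked.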
Several approaches are trying to
solve specific estimation problems in different environments, such as the ones for basketball [
14], diving [
24], hockey [
35], etc, while others try to create a general sports use system, such as [
39,
41]. Taking into account the limited amount of work in specific sports, we can say that interesting research and development can be found regarding
HPE and hockey. Some of the authors of [
35,
60,
61] are involved in the three papers, starting from [
60], in which the dataset HARPE is introduced, with the work focusing more on action recognition than on HPE. Then, the paper [
35] is published, in which results of applying the network introduced in [
60], Stacked Hourglass, to the task of HPE are presented and compared with the newly introduced HyperStackNet. As could be expected, the newer network obtained better results: aside from being based on the previous network, it makes use of additional information apart from the image, including the position of the center of the body as input. Finally, in the paper [
61], the dataset introduced by the first paper is improved to HARPET, including temporal information. Thanks to this, without making use of any additional information apart from the image itself as input for the network, a high PCKh score is obtained, a little bit higher than the one obtained in [
60]. As a negative aspect, given that these works train Deep Neural Networks and that HARPET contains only 1,200 images, the amount of data used in these papers can be considered too low; in addition, the dataset has not been made publicly available in any of the publications. Clearly, there is a lot of work and experimentation still to do regarding HPE and its application in hockey, and more data is needed for training HPE for this specific task in this specific sport, but these three papers do a good job of showing some possible paths to follow.
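To make the PCKh metric mentioned above concrete, the following minimal sketch computes it under its usual definition: a predicted joint counts as correct when its distance to the ground truth is at most a fraction alpha of the head segment length, with alpha = 0.5 being the common choice. The function name, the argument layout, and the sample values are illustrative assumptions and are not taken from the reviewed papers.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pckh(
    predictions: Sequence[Sequence[Point]],
    ground_truth: Sequence[Sequence[Point]],
    head_sizes: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Fraction of joints predicted within alpha * head size of the truth.

    predictions, ground_truth: one list of (x, y) joints per sample.
    head_sizes: head segment length of each sample, in pixels.
    """
    correct = 0
    total = 0
    for pred_pose, gt_pose, head in zip(predictions, ground_truth, head_sizes):
        threshold = alpha * head
        for (px, py), (gx, gy) in zip(pred_pose, gt_pose):
            total += 1
            if math.hypot(px - gx, py - gy) <= threshold:
                correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # One sample with two joints and a head size of 10 px (threshold 5 px).
    pred: List[List[Point]] = [[(10.0, 10.0), (20.0, 20.0)]]
    truth: List[List[Point]] = [[(11.0, 10.0), (30.0, 20.0)]]
    print(pckh(pred, truth, head_sizes=[10.0]))
```

Because the threshold scales with the head size, the score is comparable across athletes who appear at different scales in the image, which is relevant for sports footage with varying camera distances.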
From the technical point of view, considering the carried-out research, and the results presented in
Table 4, it can be concluded that there is a variety in terms of HPE application in SPE. On the one hand, several papers can be found which
directly apply general-purpose HPE systems for a specific sport in a specific context, trying to measure the
applicability of those systems in that specific use case. On the other hand, several papers try to
improve existing systems or architectures that have shown good performance in general-purpose contexts, by
applying different methods focused on solving specific problems of specific contexts, which includes the type of exercise or sport, the environment, the involved tools, or the objective of the pose estimation. For example, [
24], to solve the problem of self-occlusions of athletes in the air, use the mutual relations between the key nodes in the heatmap generated by each level network. Ref. [
27] create a structural-aware Spatial-Temporal relation convolution module to solve a usual problem in sports videos, which is suffering from blur due to the fast movement of athletes. Ref. [
30] implement a hierarchical top-down HPE method, which makes the method invariant to rotation and occlusion, two problematic situations very common in dancing. [
35,
42] both focus on sports that involve the use of tools, one on hockey and the other on skiing, and implement methods that can learn non-body keypoints, with interesting applications for other sports as well. In [
37], the authors evaluate a widely used HPE system and find that, even though this general-purpose system does not perform badly in the case of HPE for swimmers, it can be improved. So, they implemented three methods to solve several problems related to the visually challenging environment.
Thus, it can be concluded that the need for a specialized HPE system will depend highly on the context in which it is going to be applied, as well as the objective of its application. Sometimes, using a general-purpose system could be enough to get acceptable performance, but, in other cases, with special needs/objectives or challenging characteristics, the implementation of some methods will be necessary. In any case, more experimentation is needed in this field, as the variety of contexts to apply HPE is high, and the needs differ.
In addition, we can see a very interesting method to reduce the needed amount of estimations or manual interactions when constructing a dataset in the paper [
12]. This
could be especially interesting in the case of some sports or the practice of physical exercise, and is probably the reason why the authors chose this use case to test their system. In many sports, there are sequences in which some “body configurations” are repeated cyclically, such as in rowing or running. In these cases, using a method similar to the one introduced in that paper could improve the performance of the system, as well as serve as a tool that makes the process of human labeling of body parts easier.
The paper [
25] obtained good results on public datasets related to sports, but does not handle occlusions and person pose inversions properly, so its field of application is quite limited. If its method were combined with a method to handle occlusions, and data augmentation were applied, it could achieve outstanding results, producing a system that could be applied in several SPE contexts.
Another aspect to be highlighted is that most of the systems focus on obtaining a higher accuracy or lower error, while few systems take into account other aspects such as the lightness or the speed of the system, such as [
13]. We find this surprising from the point of view of utility in sports, as the need for a real-time or fast system, or for a light model that runs on low-resource hardware, could be common in the field of SPE, and it seems that few authors are focusing on those aspects.
In terms of the results obtained by each paper, it can be said that the use of HPE in sports and exercise activities is very beneficial: apart from the biomechanical aspects of the body provided by the pose estimation itself, different parameters and valuable information can be generated for the athlete, as well as for coaches and other sports experts. The applicability and possibilities of HPE in sport are still at an early stage; there are still several sports and applications to test and systems to be developed. The number of sports in which HPE has not been applied, or has barely been applied, is huge and, as previously explained, development focused on aspects other than accuracy or low error, such as the speed or lightness of a system, a setup specialized for a concrete problem, or the use of low-cost hardware, could be a great opportunity to study.
Taking into account the
problems faced by different authors when applying HPE to specific sports or movements during physical exercise, apart from the interest in getting a higher accuracy in terms of low error regarding the prediction of the position of the joints, implementing methods to avoid the problems generated by occlusions could be an interesting branch of the field to research and develop. For example, in [
62], which aims to develop an analysis system for rowers, an important part of the ground truth data was excluded due to occlusion problems. Another recurrent problem when applying HPE to different sports is the large error when rare poses are present, such as in gymnastics, pole vault, swimming, dance, etc. There are some papers, such as [
63], that try to mitigate the problem using data augmentation methods, but there is still a lot of work to do on this topic.
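As a simple illustration of data augmentation for rare poses, the following sketch rotates the annotated 2D keypoints of a sample around a center point, so that, for example, an upright pose can be turned into an inverted one. It is a generic technique and not necessarily the one used in the cited paper; the function name and the sample values are assumptions, and in a real pipeline the image would have to be rotated with the same transform.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_keypoints(
    keypoints: List[Point], angle_deg: float, center: Point
) -> List[Point]:
    """Rotate 2D keypoints by angle_deg degrees around center.

    The standard rotation matrix is used. In image coordinates, where the
    y axis points down, a positive angle appears clockwise on screen.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in keypoints:
        dx, dy = x - cx, y - cy
        rotated.append((cx + cos_t * dx - sin_t * dy,
                        cy + sin_t * dx + cos_t * dy))
    return rotated


if __name__ == "__main__":
    # Head above hip in image coordinates; a 180 degree rotation around
    # the hip produces an inverted pose, as in a handstand.
    upright = [(100.0, 50.0), (100.0, 150.0)]  # (head, hip)
    print(rotate_keypoints(upright, 180.0, center=(100.0, 150.0)))
```

Sampling several angles per training sample is one way to expose a model to body orientations that are rare in general-purpose datasets but common in gymnastics, diving, or pole vault.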
Looking at the
future, there are
interesting paths to be explored and methods to be exploited, such as the use of GANs and synthetic datasets as a way of increasing the available data to train and test systems. As an example, there are works such as [
43], in which these methods are applied as a way of reducing the amount of human work and time needed, and, as a tool for data augmentation. It can be very interesting to analyze the results of these methods, applied in different sports, contexts, and integrated with other methods and with different configurations. Another interesting area of research combining HPE with other Computer Vision algorithms applied to SPE could be the analysis of the interactions and relationships between athletes and the tools and elements involved in the sports practices, such as balls, rackets… as presented in [
64]. Being able to obtain this data, estimate the pose of athletes with considerable accuracy, track the different elements involved in the game, and establish relationships among them could be a very useful tool for the field of sport and performance analytics. In these specific papers, the experimentation and presented results are quite limited and only qualitative results are included, but further research in this area could make huge contributions to the field of SPE.