*Article* **An Intelligent Athlete Signal Processing Methodology for Balance Control Ability Assessment with Multi-Headed Self-Attention Mechanism**

**Nannan Xu <sup>1</sup>, Xinze Cui <sup>2</sup>, Xin Wang <sup>2,\*</sup>, Wei Zhang <sup>3</sup> and Tianyu Zhao <sup>4,\*</sup>**


**Abstract:** In many kinds of sports, balance control ability plays an important role for every athlete. Therefore, coaches and athletes need accurate and efficient assessments of balance control ability to improve the athletes' training performance scientifically. With the fast growth of sport technology and training devices, intelligent and automatic assessment methods have been in high demand in recent years. This paper proposes a deep-learning-based method for balance control ability assessment involving an analysis of time-series signals from the athletes. The proposed method directly processes the raw data and provides the assessment results, with an end-to-end structure. This straightforward structure facilitates its practical application. A deep learning model with a multi-headed self-attention mechanism is employed to explore the target features, which is a new approach to sports assessments. In the experiments, real athletes' balance control ability assessment data are utilized to validate the proposed method. Through comparisons with different existing methods, the accuracy rate of the proposed method is shown to be more than 95% for all four tasks, which is higher than that of the other compared methods for tasks containing more than one athlete of each level. The results show that the proposed method works effectively and efficiently in real scenarios for athlete balance control ability evaluations. However, reducing the proposed method's calculation costs is an important task for future studies.

**Keywords:** athlete signal processing; deep learning; balance control ability; multi-headed self-attention mechanism

**MSC:** 68T07

### **1. Introduction**

Because of its significance, almost every sport requires accurate and efficient assessments of the balance control ability of athletes [1]. Meanwhile, the scientific management of athletes depends on good assessments of their balance control ability, including for selection, training, and competition. Accurately assessing balance control ability is very difficult: massive and complex data are produced during training and events, and large amounts of expert knowledge and human labor are required to explore the underlying ability of the athletes from these data, making such assessments hard to carry out in practical scenarios [2,3].

**Citation:** Xu, N.; Cui, X.; Wang, X.; Zhang, W.; Zhao, T. An Intelligent Athlete Signal Processing Methodology for Balance Control Ability Assessment with Multi-Headed Self-Attention Mechanism. *Mathematics* **2022**, *10*, 2794. https://doi.org/10.3390/math10152794

Academic Editor: Jakub Nalepa

Received: 9 July 2022; Accepted: 3 August 2022; Published: 6 August 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In recent years, with the rapid development of measurement devices and artificial intelligence technology, data-driven methods of balance control ability assessment have demonstrated excellent results [4]. In this paper, all of the utilized data were collected from the athletes using a movement pressure measurement machine. When an athlete stands on the machine, pressure signals that reflect the athlete's balance control ability are collected. In general, a smaller movement pressure indicates better balance control ability, while a larger movement pressure indicates a lower level of balance control [5]. Therefore, we can analyze the data to assess the balance control ability of the athletes and even explore their underlying abilities.

In the traditional methods, statistical features such as the mean and root mean square are usually used to assess balance control ability. However, these features are too simple to reflect the complex patterns in the collected data. In recent years, many signal processing methods have been used to extract better features, such as wavelet analysis [6] and stochastic resonance [7] techniques. In addition, it is very popular to use machine learning and statistical inference techniques for related problems, such as artificial neural networks (ANNs) [8], support vector machines (SVMs), random forests, fuzzy inference, and other techniques [9–11]. Although the existing methods have achieved success, they are generally less capable of dealing with the collected movement pressure data, which contain a lot of noise. Furthermore, distinguishing balance control ability among athletes at different levels is quite hard, especially for professional high-level freestyle skiing athletes, which makes it difficult to use the existing methods for assessments of balance control ability. This is also a great challenge for the traditional data-driven methods applied to related problems.
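As a concrete illustration of such hand-crafted statistical features, the sketch below (our own minimal NumPy example; the function name and the synthetic trace are illustrative assumptions, not code from the cited works) computes the mean, root mean square, and two other common statistics from a one-dimensional pressure signal:

```python
import numpy as np

def baseline_features(signal):
    """Traditional hand-crafted statistics of a sway-pressure trace
    (illustrative sketch, not the paper's implementation)."""
    signal = np.asarray(signal, dtype=float)
    return {
        "mean": signal.mean(),                        # average pressure
        "rms": np.sqrt(np.mean(signal ** 2)),         # root mean square
        "std": signal.std(),                          # sway variability
        "peak_to_peak": signal.max() - signal.min(),  # sway range
    }

# Example: a synthetic 1 s pressure trace sampled at 100 Hz
t = np.linspace(0, 1, 100)
trace = 0.5 * np.sin(2 * np.pi * 1.5 * t)
feats = baseline_features(trace)
```

Features like these feed classical classifiers (SVM, random forest), but they discard the temporal structure that the deep model described later learns directly.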

With the rapid development of computing technologies, deep neural networks have become the advanced methods of choice for artificial intelligence in recent years [12–16] and have achieved effective and fruitful results in many fields, such as image recognition [17–21] and natural language processing [22]. Deep neural networks can achieve high prediction accuracy by training with big data to automatically learn the mapping function between the input data and the output target. They can automatically analyze the input data without prior knowledge of signal processing or domain expertise. Therefore, DNNs are quite suitable for the assessment of balance control ability with freestyle skiing athlete data.

For the analysis of time-series data, recent studies [23–28] show many good applications of deep neural network models, with which higher feature-learning efficiency can be achieved. Therefore, deep learning is being applied to various types of time-series data, such as in financial analyses, traffic monitoring, industrial optimization, machinery fault diagnosis problems, and so on [29–33]. In a related study [34], a deep-learning-based LSTM method was used for COVID-19 pandemic spread forecasting and achieved great success.

However, the simple structure of basic deep neural network models cannot be applied well in real tasks with complicated data. Normally, adding neurons and layers can enhance the learning ability, but the consumption of computing power also increases. On the other hand, a deep architecture generally causes losses of feature information.

In this paper, a novel multi-headed self-attention mechanism is proposed to address the assessment problem of balance control ability for freestyle skiing athletes. The main novelties and contributions of this paper are as follows:


3. A real freestyle skiing athlete under-feet movement pressure measurement dataset is adopted to validate the proposed method, which shows high assessment accuracy and promise for applications in real scenarios.

However, we have to note that the proposed method becomes inefficient when processing high-dimensional data because of the multi-headed self-attention structure, which causes excessive calculation costs.

In this paper, the preliminary aspects are described in Section 1. The proposed method is presented in Section 2. The experiments used to validate the proposed method and the results are presented in Section 3. We close the paper with our conclusions in Section 4.

#### **2. Dataset and Methodology**

#### *2.1. Dataset*

#### 2.1.1. Introduction of the Dataset

In this paper, a dataset collected from real freestyle skiing aerial athletes involved in the balance control ability assessment task is used to validate the proposed method. The dataset includes a number of people at different balance control levels. They are required to stand with a balance meter under their feet and try their best to keep still with their eyes closed. This is done to achieve a better balance control effect by reducing the vision disturbance and focusing on the body control. The area of the balance meter measures 65 cm × 40 cm, and the balance meter can collect the movement pressure data in the anteroposterior and mediolateral directions, which are denoted as Y and X, respectively. The scenarios for data collection are shown in Figure 1.

**Figure 1.** The scenarios of the athlete movement pressure data collection experiments.

The levels of balance control ability are divided into four classes. Specifically, they include people from different groups: top freestyle skiing athletes, professional skill athletes, normally trained students from non-skill sports, and common people. The four classes are denoted as A, B, C, and D, respectively. The balance control ability levels decrease from A to D; for instance, the A group has the best ability, and the B group has the second best. We select three athletes from each level, who are represented by #1, #2, and #3, respectively. The athletes are required to keep their upper bodies stationary and to stand on two feet with their eyes closed, so as to reduce their body sway. The movement pressure data sampling frequency is 100 Hz. We show the information for the dataset used in this study in Table 1.

**Table 1.** Information for the athlete movement pressure measurement dataset used in this paper.

| Athlete Level | Number of Athletes | Code Names | Sampling Frequency |
|---|---|---|---|
| A | 3 | A#1, A#2, A#3 | 100 Hz |
| B | 3 | B#1, B#2, B#3 | 100 Hz |
| C | 3 | C#1, C#2, C#3 | 100 Hz |
| D | 3 | D#1, D#2, D#3 | 100 Hz |


#### 2.1.2. Pre-Processing of the Dataset

In this study, the task involves predicting different athlete balance control ability levels through learning features from the collected data with the proposed method. In order to fully examine the performance of the proposed method, we implement 5 tasks with different training and testing datasets, which include different athletes at each level. The tasks are demonstrated in Table 2. Every sample of the tasks contains 200 continuous points. The proposed method and the compared methods can be fairly evaluated through these tasks by using a wide range of experimental settings.

**Table 2.** Information for the different athlete balance control ability evaluation tasks used in this study.

| Task Name | Concerned Athletes | Sample Number of Every Athlete | Ratio of Training to Testing |
|---|---|---|---|
| T0 | A#1, B#1, C#1, D#1 | 200 | 4:1 |
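Under the settings above, the pre-processing can be sketched as follows (a hypothetical NumPy implementation assuming a recording stored as a `(T, 2)` array of X/Y pressure values; the paper does not publish its pre-processing code):

```python
import numpy as np

def make_samples(pressure_xy, window=200):
    """Slice a (T, 2) pressure recording (X and Y channels) into
    non-overlapping samples of 200 continuous points each."""
    pressure_xy = np.asarray(pressure_xy)
    n = len(pressure_xy) // window
    return pressure_xy[: n * window].reshape(n, window, 2)

def split_train_test(samples, ratio=4):
    """Split samples with a ratio of `ratio`:1 (training:testing)."""
    n_train = len(samples) * ratio // (ratio + 1)
    return samples[:n_train], samples[n_train:]

# Example: 60 s of synthetic 2-channel data at 100 Hz -> 30 samples
recording = np.random.randn(6000, 2)
samples = make_samples(recording)             # shape (30, 200, 2)
train_set, test_set = split_train_test(samples)  # 24 training, 6 testing
```

In practice a random shuffle before splitting may be preferable; the sequential split here is only to keep the sketch deterministic.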


#### *2.2. Methods*

#### 2.2.1. Proposed Method

The flow chart of our proposed method is displayed in Figure 2.

In this paper, we propose a novel method based on the Transformer. The Transformer is one type of auto-encoder (AE). Auto-encoders are some of the most popular neural network structures in current research, and are widely used in many application scenarios, such as image classification tasks, speech recognition problems, video processing problems, and so on [35–37].

In general, an auto-encoder includes an encoder and a decoder, which are symmetric. The function of the encoder is to compress the input data, while the function of the decoder is to decompress the encoder's output back to something close to the original data. In brief, the auto-encoder is used to reproduce the input data. In this way, the auto-encoder can explore the features of the input data automatically. The process can be expressed as:

$$\mathbf{h}(f(\mathbf{x})) \approx \mathbf{x} \tag{1}$$

where *x* is the input data. The function *f* represents the encoder and the function *h* represents the decoder, which are inverse processes. The encoder and decoder can be built in different ways. For example, the Vanilla Auto-Encoder is made of fully connected neural networks and is the most primitive auto-encoder. Convolutional neural networks (CNNs) are also used to build auto-encoders [38].
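Equation (1) can be made concrete with a minimal linear auto-encoder trained by gradient descent (an illustrative NumPy sketch on synthetic low-rank data; all sizes and learning settings are our own assumptions, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data lying in a 2-D subspace of R^8
Z = rng.normal(size=(256, 2))
X = Z @ rng.normal(size=(2, 8))

# Linear encoder f(x) = x W_e and decoder h(z) = z W_d, trained so h(f(x)) ~ x
W_e = rng.normal(scale=0.1, size=(8, 2))
W_d = rng.normal(scale=0.1, size=(2, 8))

def loss(We, Wd):
    """Mean squared reconstruction error of h(f(X)) against X."""
    R = X @ We @ Wd - X
    return np.mean(R ** 2)

lr = 0.01
first = loss(W_e, W_d)
for _ in range(500):
    H = X @ W_e             # code (compressed representation)
    R = H @ W_d - X         # reconstruction residual
    # Gradients of the mean squared reconstruction error
    g_d = 2 * H.T @ R / X.size
    g_e = 2 * X.T @ (R @ W_d.T) / X.size
    W_d -= lr * g_d
    W_e -= lr * g_e
final = loss(W_e, W_d)      # final << first: the AE reproduces its input
```

After training, the reconstruction error drops sharply because the encoder learns the 2-D subspace the data lie in; this is the self-supervised feature-extraction principle that deeper auto-encoders generalize.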

As one of the latest and most powerful auto-encoders, the Transformer was originally used in natural language processing [39]; its encoder and decoder mainly rely on the self-attention mechanism. In addition to natural language processing problems, the Transformer has also been adapted to image classification tasks and video processing problems [40,41]. Its effectiveness has been well validated for analyzing time-series signals.

The basic Transformer consists of an input layer, a multi-headed self-attention block, a normalization layer, a feedforward layer, and residual connections. Because the basic Transformer is used in natural language processing, the input layer includes word embedding and position embedding. The word embedding is used to transform the words of input sentences into a series of vectors, while the position embedding is used to describe the information about the corresponding positions of the words in the sentence. The multi-headed self-attention block is the most important part for exploring the features of the input data. The details of the structure of the basic Transformer are illustrated in Figure 3.

**Figure 3.** The architecture of the basic Transformer.

The basic Transformer consists of an encoder and a decoder. It is mostly used in natural language processing tasks, in which both the input data and the target data are sentences that contain complex information. In such tasks, researchers use the encoder to analyze the input data and the decoder to analyze the target data. The underlying connection between the two results is also explored. In this paper, only the encoder part of the basic Transformer is adopted. This is because the target data for our task are class numbers without complex information such as sentences. This means we only need to explore the input data and predict their class. The detailed structure is illustrated in Figure 4.

**Figure 4.** The detailed architecture of our proposed Transformer.

Before the Transformer encoder block, the dimensions of the input data should be extended with a trainable pre-training linear layer in order to explore the deep information. In the results of the experiments, we will show the significant effectiveness of such layers. In addition, similar to most existing methods for natural language processing [42], we propose a learnable embedding approach for the input data, whose state at the output of the Transformer encoder serves as the representation of the input data.

After the aforementioned operations, the input data for the Transformer encoder are transformed into a series of vectors. Therefore, word embedding layers are not required. However, the position embedding layer is still needed to describe the information about the time point order of the time series, analogous to the word positions in a sentence.

There are two common ways to achieve the position embedding. The first one is to randomly generate a series of vectors and update them during training, while the second one is to encode the information with sin and cos functions. We choose the first method in this paper. In both methods, we create a matrix whose shape is the same as that of the input data and assign its parameters using one of the above approaches. Afterwards, the matrix is added to the input data.

The core of the Transformer is the self-attention mechanism. Its function is to calculate the relationships between all parts of the input data, which are always sequential data; the relationships are expressed by a series of probabilities whose sum is one. According to these probabilities, the mechanism distributes different weights to the corresponding parts of the input data.

In this study, the self-attention mechanism is modified from the attention mechanism. In the attention mechanism, the input part consists of three matrices, *Q*, *K*, and *V*. *K* and *V* come from the input data, while *Q* generally comes from the output data. In the self-attention mechanism, all three matrices come from the input data. In addition, the attention mechanism is usually used to connect the outputs of the encoder and decoder, whereas the self-attention mechanism is the core of the structures of the encoder and decoder. The generation of the three matrices can be expressed as:

$$\begin{array}{l} X \bullet W_Q = Q \\ X \bullet W_K = K \\ X \bullet W_V = V \end{array} \tag{2}$$

where *X* is the input data, whose length equals the number of time steps; *W<sub>Q</sub>*, *W<sub>K</sub>*, and *W<sub>V</sub>* are three matrices with the same shape but different parameters, and these parameters are updated during training. The operation • is the dot product. The calculation of the self-attention mechanism can be defined as:

$$\mathrm{Self\text{-}Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q \bullet K^T}{\sqrt{d_k}}\right) \bullet V \tag{3}$$

where *d<sub>k</sub>* is the dimension of the matrix *K*; dividing by its square root prevents the results of the dot product from becoming too large. The *softmax* function transforms the results of the dot product into probabilities, which act as weights describing the relationships between all parts of the data.
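Equations (2) and (3) together can be sketched compactly in NumPy (the sequence length, embedding width, and weight scales below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Generate Q, K, V from the input (Eq. 2) and apply
    scaled dot-product self-attention (Eq. 3)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # Eq. (2)
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to one
    return weights @ V, weights                # Eq. (3)

# Example: 200 time steps, 16-dimensional embeddings
X = rng.normal(size=(200, 16))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
```

Each row of `attn` is a probability distribution over all 200 time steps, i.e. the weights the mechanism distributes to the corresponding parts of the input.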

Based on the self-attention mechanism, the multi-headed self-attention block can explore the features of the input data effectively. The multi-headed self-attention approach uses many self-attention blocks to process the same data in parallel and then integrates the results of every block. It should be noted that one block is called one head.

In general, there are two ways to achieve a multi-headed self-attention block. In the first, we map the input data to *Q*, *K*, and *V* without changing the shape, evenly divide them into many small matrices, and then calculate each with the self-attention mechanism. In the second, we map the input data to *Q*, *K*, and *V* with a shape that equals the input data dimension multiplied by the number of heads needed; we then calculate them with the self-attention mechanism and finally map the result back to a matrix with the same shape as the input data using a linear projection. In this way, we can set the number of heads freely, although more computing power is consumed. In this paper, we select the latter approach.
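The latter approach can be sketched as follows (an illustrative NumPy example; the head count and dimensions are assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, params):
    """Second approach described above: map the input to Q, K, V whose width
    is (input dim x number of heads), attend per head, then project back."""
    T, d = X.shape
    Wq, Wk, Wv, Wo = params                   # Wq/Wk/Wv: (d, d*h); Wo: (d*h, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # widened to (T, d*h)
    heads = []
    for h in range(n_heads):                  # one self-attention block per head
        s = slice(h * d, (h + 1) * d)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d)
        heads.append(softmax(scores) @ V[:, s])
    concat = np.concatenate(heads, axis=1)    # (T, d*h)
    return concat @ Wo                        # linear projection back to (T, d)

T, d, H = 200, 16, 8
X = rng.normal(size=(T, d))
params = (rng.normal(scale=0.1, size=(d, d * H)),
          rng.normal(scale=0.1, size=(d, d * H)),
          rng.normal(scale=0.1, size=(d, d * H)),
          rng.normal(scale=0.1, size=(d * H, d)))
out = multi_head_self_attention(X, H, params)  # same shape as the input
```

The cost of the per-head score matrices grows with both the head count and the sequence length, which is the source of the calculation-cost limitation acknowledged in the Introduction.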

In the *softmax* function, we let *x*<sup>(*i*)</sup> denote the input samples and *r*<sup>(*i*)</sup> denote the corresponding class labels, for *i* = 1, 2, · · · , *N*, where *i* indexes the training samples and *N* is the number of samples. We also have *x*<sup>(*i*)</sup> ∈ *R*<sup>*d*×1</sup> and *r*<sup>(*i*)</sup> ∈ {1, 2, · · · , *L*}, where *L* is the total number of target classes in this paper. According to the input data *x*<sup>(*i*)</sup>, the function gives the probability *p*(*r*<sup>(*i*)</sup> = *j* | *x*<sup>(*i*)</sup>) for the different class labels. The calculation is based on the formula below:

$$J_{\lambda}(x^{(i)}) = \begin{bmatrix} p(r^{(i)} = 1 \,|\, x^{(i)}; \lambda) \\ p(r^{(i)} = 2 \,|\, x^{(i)}; \lambda) \\ \vdots \\ p(r^{(i)} = L \,|\, x^{(i)}; \lambda) \end{bmatrix} = \frac{1}{\sum_{l=1}^{L} e^{\lambda_l^T \bullet x^{(i)}}} \begin{bmatrix} e^{\lambda_1^T \bullet x^{(i)}} \\ e^{\lambda_2^T \bullet x^{(i)}} \\ \vdots \\ e^{\lambda_L^T \bullet x^{(i)}} \end{bmatrix} \tag{4}$$

where *λ* = [*λ*<sub>1</sub>, *λ*<sub>2</sub>, · · · , *λ*<sub>*L*</sub>]<sup>*T*</sup> represents the coefficients of the *softmax* function. The output values of the *softmax* function are all positive and sum to one. Therefore, the result of the *softmax* function can be used to predict the probabilities of the target classes and to evaluate the relationships among the parts of the input data in the self-attention mechanism.
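Equation (4) is the familiar softmax classifier, which can be sketched as follows (an illustrative NumPy example; the coefficient matrix is random here, whereas in the model it is learned):

```python
import numpy as np

def softmax_classifier(x, lam):
    """Equation (4): class probabilities p(r = j | x) given coefficients
    lam = [λ1, ..., λL], stored one column per class (illustrative sketch)."""
    logits = lam.T @ x                  # λ_j^T • x for every class j
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

# Example: d = 4 features, L = 4 balance ability levels (A, B, C, D)
rng = np.random.default_rng(0)
lam = rng.normal(size=(4, 4))
p = softmax_classifier(rng.normal(size=4), lam)
predicted_class = int(np.argmax(p))     # index of the most likely level
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but avoids overflow, a standard safeguard when implementing Equation (4).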

After the multi-headed self-attention block, there is a feedforward layer block, which further processes the output of the multi-headed self-attention block. The core of this block is an MLP model that consists of two linear layers with a GELU non-linearity activation function. In the Transformer encoder, the normalization layer is applied before the multi-headed self-attention block and the feedforward layer block, and there are residual connections after every block. The MLP head is set after the Transformer encoder and acts as a classifier to predict the class of the input data. The MLP head contains two linear layers. Finally, we select the Adam optimizer to train the proposed method.
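The ordering described above (normalization before each block, a residual connection after it, and a two-layer GELU feedforward) can be sketched as a single pre-norm encoder block (an illustrative single-head NumPy example; the actual model stacks 12 such layers with 8 heads and trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    """Tanh approximation of the GELU non-linearity."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def encoder_block(x, p):
    # Pre-norm: normalization BEFORE each sub-block, residual AFTER it
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    h = layer_norm(x)
    x = x + gelu(h @ p["W1"]) @ p["W2"]   # two-layer feedforward with GELU
    return x

d, T = 16, 200
p = {k: rng.normal(scale=0.1, size=(d, d)) for k in ("Wq", "Wk", "Wv")}
p["W1"] = rng.normal(scale=0.1, size=(d, 64))  # feedforward hidden width 64
p["W2"] = rng.normal(scale=0.1, size=(64, d))
x = rng.normal(size=(T, d))
z = encoder_block(x, p)                   # shape preserved: (200, 16)
```

Because the block preserves the input shape, several such blocks can be stacked, with the MLP head consuming the final representation to predict the class.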

#### 2.2.2. Compared Methods

The proposed Transformer model offers a new perspective for the assessment of the athletes' balance control performance with artificial intelligence technology. In this paper, we also implement some popular methods in the current literature for comparisons in order to prove the effectiveness and superiority of the proposed method. The following methods are included.

1. NN

As a typical neural network method, we select the basic neural network (NN) for the comparisons; it includes one hidden layer with 1000 neurons, a leaky ReLU activation function, and other typical operations.

2. DNN

The deep neural network (DNN) is based on the basic neural network structure. The DNN used here consists of three hidden layers with 1000, 1000, and 500 neurons, respectively. Likewise, similar techniques are employed, such as a leaky ReLU activation function and so on.

3. DSCNN

The deep single-scale convolution neural network (DSCNN) method is a basic and popular deep learning neural network, which is widely used as a basic cell to build many complex networks, such as LeNet-5, AlexNet, VGG-16, and so on [43–45]. In the comparison, we use a basic network with a single convolutional filter size for the feature extraction.

4. RNN

The recurrent neural network (RNN) method is a typical deep learning neural network that works well with sequential data. Including it therefore allows us to better demonstrate the advantage of the proposed method.

5. Random Forest

The random forest is a classical machine learning approach that is widely used for classification tasks. It consists of many decision trees, each of which works independently. The approach is robust to noise, so it serves as a suitable comparison method.

#### **3. Results**

#### *3.1. Experiment Description*

We organize our experiments as follows. Experiment 1 aims to show the necessity of the pre-training linear layer before the Transformer encoder. Experiment 2 aims to find the optimal hyper-parameters for the proposed method. Experiment 3 aims to show the superiority of the proposed method through comparisons with the compared methods. The parameters used in the experiments are listed in Table 3. Note that the test data are involved in the parameter selection process, so the accuracy scores might potentially be biased. The selected parameters are popular choices for deep learning frameworks and can be generally used in different applications.

**Table 3.** Parameter information.

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| Batch size | 32 | Learning rate | 1 × 10⁻⁴ |
| Epoch number | 100 | Sample dimension | 200 × 2 |

#### *3.2. Experiment and Results Analysis*

#### 3.2.1. Experiment 1

In experiment 1, we aim to investigate the influence of the pre-training linear layer. Therefore, we set one group with a linear layer and a control group without one. We then set 512 neurons for the pre-training linear layer, with a 12-layer Transformer encoder using 8 heads for the multi-headed self-attention part. Each head has 32 dimensions, and the output layer of the feedforward part contains 64 neurons. The results of the 5 tasks are displayed in Figure 5.

**Figure 5.** The experimental results of the proposed method with and without the pre-training linear layer.

In the figure, the accuracy of the method with the pre-training linear layer is significantly higher than that of the method without it. The pre-training linear layer plays an important role in exploring features of the input data, which we investigate further in the following section.
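A NumPy sketch of the configuration described above: the pre-training linear layer expands each 2-D time step to 512 dimensions, after which 8 attention heads of 32 dimensions each attend over the 200 time steps. The random weights and the omission of residual connections, layer normalization, and the feedforward block are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_in, d_model, heads, d_head = 200, 2, 512, 8, 32

# Pre-training linear layer: expand each 2-D time step to 512 dimensions.
w_pre = rng.normal(0, 0.02, (d_in, d_model))
x = rng.normal(size=(seq_len, d_in)) @ w_pre             # (200, 512)

# Independent Q/K/V projections per head (8 heads x 32 dims each).
w_q, w_k, w_v = (rng.normal(0, 0.02, (heads, d_model, d_head)) for _ in range(3))

head_outputs = []
for h in range(heads):
    q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]         # each (200, 32)
    scores = softmax(q @ k.T / np.sqrt(d_head))          # (200, 200) attention map
    head_outputs.append(scores @ v)

out = np.concatenate(head_outputs, axis=-1)              # (200, 256)
print(out.shape)  # (200, 256)
```

Each row of `scores` is a probability distribution over all 200 time steps, which is how every point of the sequence is connected to every other point.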

#### 3.2.2. Experiment 2

In experiment 2, we investigate the optimal hyper-parameters for the proposed method: the depth of the Transformer encoder block, the number of pre-training layer neurons, the number of multi-headed self-attention heads, the dimensions of every head, and the output dimensions of the feedforward block. We select the T2 dataset to train the proposed method. In each of the following tests, the tuple lists the values at which the remaining hyper-parameters are fixed.

Firstly, we investigate the influence of the depth of the Transformer encoder block, fixing the number of pre-training layer neurons, the number of heads, the dimensions of every head, and the output dimensions of the feedforward block at (512, 8, 32, 64). The results are shown in Figure 6.

**Figure 6.** The influence of the depth of the Transformer encoder.

According to the results, the multi-layer approaches are much more effective than the mono-layer approach. However, a larger Transformer encoder depth does not always lead to better results: the accuracy does not increase significantly as the depth of the multi-layer approach increases.

Secondly, the number of pre-training layer neurons is an important factor in the effectiveness of the proposed method. We set the remaining parameters as (6, 8, 32, 64). According to Figure 7a, the accuracy of the proposed method increases as the number of neurons in the pre-training linear layer increases. In particular, the accuracy becomes significantly higher when the number of neurons exceeds the dimensions of the input data. According to Figure 7b, the training loss of the methods whose neuron number is smaller than the input dimensions decreases slowly or changes little, whereas it decreases rapidly once the neuron number exceeds the input dimensions. In this study, the number of neurons is found to have the largest influence on the training of the proposed method.

**Figure 7.** The influence of the dimensions of the pre-training linear layer. (**a**) influence on testing accuracy; (**b**) influence on training loss.

Thirdly, the influence of the number of multi-headed self-attention heads is shown in Figure 8, with the remaining parameters set as (6, 512, 32, 64). More than 2 heads is suitable, which means that multi-headed self-attention is more effective than basic self-attention.

**Figure 8.** The influence of the number of multi-headed self-attention heads.

In addition, the dimensions of every head also play an important role in the proposed method, as displayed in Figure 9, with the remaining parameters set as (2, 64, 4, 64). In general, the testing accuracy increases as the dimensions of every head increase.

**Figure 9.** The influence of the dimensions of the multi-headed self-attention heads.

Finally, the output dimensions of the feedforward layer are investigated with the remaining parameters set as (4, 128, 4, 64). The results are shown in Figure 10. The dimensions have a great influence on the testing accuracy: the minimum accuracy is about 8 percentage points lower than the maximum accuracy.

**Figure 10.** The influence of the dimensions of the feedforward layer.
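The one-factor-at-a-time sweep of experiment 2 can be sketched as follows. The candidate value lists and the exact baseline tuple are illustrative assumptions; only the idea of varying one hyper-parameter while fixing the others mirrors the text:

```python
# One-factor-at-a-time sweep: vary one hyper-parameter while holding
# the others at an assumed baseline.
baseline = {"depth": 6, "pre_neurons": 512, "heads": 8,
            "head_dim": 32, "ff_dim": 64}
sweeps = {
    "depth": [1, 2, 4, 6, 8, 12],
    "pre_neurons": [64, 128, 256, 512, 1024],
    "heads": [1, 2, 4, 8, 16],
    "head_dim": [8, 16, 32, 64],
    "ff_dim": [16, 32, 64, 128],
}

configs = []
for name, values in sweeps.items():
    for v in values:
        # Copy the baseline, then override the single swept hyper-parameter.
        configs.append({**baseline, name: v})
print(len(configs))  # 24
```

Each config would then be trained on T2 and scored on held-out data; the best value per sweep is read off from plots like Figures 6–10.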

#### 3.2.3. Experiment 3

In this experiment, we compare the proposed method with the methods mentioned before, in order to demonstrate its superiority and effectiveness. The results are displayed in Figure 11.

**Figure 11.** The results of the different compared methods for different tasks.

In Figure 11, it is obvious that the learning ability of the proposed method is much better than that of the others: its testing accuracy for all tasks is higher than 95%. Every method performs well in task T0, which contains only one athlete's data for every level. Although task T1 also contains only one athlete's data for every level, the accuracy of all methods except the proposed one is reduced to different degrees. According to the results for T2 and T4, when different athletes' data are used, every method's accuracy is reduced; the proposed method maintains an accuracy higher than 95%, while the others all drop below 90%.

In addition, in Figures 12–14, we use the t-SNE algorithm to reduce the dimensions of the learned features for visualization [46]. In particular, we compare the proposed method with the DNN method. It is clear that the discrimination effect of the proposed method is better: the clusters produced by the DNN method overlap more than those of the proposed method.
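A minimal sketch of this t-SNE step with scikit-learn, using synthetic stand-in features in place of the models' learned features (the cluster layout, feature dimension, and perplexity are assumptions for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in "learned features": three clusters, e.g. three ability levels.
feats = np.vstack([rng.normal(c, 0.5, (30, 64)) for c in (0.0, 3.0, 6.0)])

# t-SNE reduces the 64-D features to 2-D for a scatter plot.
embedded = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(feats)
print(embedded.shape)  # (90, 2)
```

The 2-D embedding is then scattered with one color per ability level, as in Figures 12–14.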

**Figure 12.** The visualization results of the learned features using different methods for task T2. The different colors represent different athletes' balance control ability levels: (**a**) the result of our proposed method; (**b**) the result of the DNN method.

**Figure 13.** The visualization results of the learned features using different methods for task T3. The different colors represent different athletes' balance control ability levels: (**a**) the result of our proposed method; (**b**) the result of the DNN method.

**Figure 14.** The visualization results of the learned features using different methods for task T4. The different colors represent different athletes' balance control ability levels: (**a**) the result of our proposed method; (**b**) the result of the DNN method.

From Figure 11, we can see that the DNN is the best of the compared methods; its accuracy is sometimes close to that of the proposed method, such as for task T3. However, Figure 13 shows their clustering effects using scatter diagrams. The scatter diagram of the proposed method shows clusters with clear boundaries, whereas the clusters of the DNN are mixed together, which means there is a great difference between the features learned by the two methods, and the proposed method's effect is significantly better.

In addition, Figures 12 and 14 show that the proposed method maintains its excellent ability to explore the deep features of the data when the sampling frequency becomes sparser and the number of athletes at every level increases. In contrast, the results of the DNN method become more chaotic.

#### **4. Conclusions and Future Works**

In this paper, a simplified Transformer-based deep neural network model was proposed for the assessment of athlete balance control ability, which processes and analyzes the time-series pressure measurement data from the balance meter. The original data were directly used as the inputs to the model for an automatic assessment without any prior knowledge. Therefore, it is well suited for real applications in various industries.

The multi-headed self-attention process is the core of the proposed method; it calculates the deep connections between every point of the input time-series data and explores complex features via these calculations. In addition, the pre-training linear layer is also necessary; it expands the dimensions of the raw input data to expose the deep information. The combination of the two parts enhances the model training efficiency and quality, making the method well suited for many tasks with time-series data. The real freestyle skiing athletes' under-foot pressure measurement dataset was used in the experiments for validation. The results showed that the proposed method has many advantages in the intelligent assessment of freestyle skiing athletes' balance control abilities, and it holds promise for practical implementation in real scenarios.

However, it should be pointed out that the proposed method generally requires significant computing power, especially with large amounts of freestyle skiing athlete data. In addition, in order to assess the balance control ability and other related abilities more efficiently and accurately, higher-dimensional data for freestyle skiing athletes could be used in the future. Therefore, a reduction in the computational burden of the proposed method will be investigated, as well as optimization of the deep neural network architecture. Better pre-processing methods will also be proposed in the next study.
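To make the computational-cost concern concrete, a back-of-the-envelope parameter count for the configuration used in experiment 1 (512 pre-training neurons, 12 encoder layers, 8 heads of 32 dimensions, feedforward width 64). Layer norms and the classification head are ignored, and the formula is an assumption about the sketch architecture, not the authors' exact model:

```python
def transformer_param_count(d_in=2, d_model=512, depth=12, heads=8,
                            d_head=32, d_ff=64):
    # Weight-plus-bias count; layer norms and the classifier are ignored.
    pre = d_in * d_model + d_model              # pre-training linear layer
    d_attn = heads * d_head                     # concatenated head width
    qkv = 3 * (d_model * d_attn + d_attn)       # Q, K, V projections
    out = d_attn * d_model + d_model            # attention output projection
    ff = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    return pre + depth * (qkv + out + ff)

print(transformer_param_count())  # 7101696 (about 7.1 M)
```

Even this simplified count shows why depth and the pre-training layer width dominate the model size, motivating the planned reduction in computational burden.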

**Author Contributions:** Conceptualization, X.W.; Formal analysis, N.X.; Investigation, X.C.; Project administration, W.Z.; Resources, T.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was funded by the Key R&D Plan of China for the Winter Olympics (No. 2021YFF0306401), the Key Special Project of the National Key Research and Development Program "Technical Winter Olympics" (2018YFF0300502 and 2021YFF0306400), and the Key Research Program of Liaoning Province (2020JH2/10300112).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

