1. Introduction
Deaf people communicate through sign language, which consists of a series of gestural signs articulated with the hands and accompanied by facial expressions, intentional gaze and body movement, endowed with linguistic function [1]. According to data from the World Health Organization (WHO), 1.5 billion people live with some degree of hearing loss. According to the National Institute of Statistics and Geography (INEGI), 1.3% of the Mexican population aged three years or older cannot hear. Mexican Sign Language (MSL) is officially recognized as a national language and is part of the linguistic heritage of the Mexican nation [2,3]. However, this form of communication has not yet spread throughout the entire population of Mexico, since fewer than 500,000 people communicate with this language [4]. It is therefore important to develop communication tools for deaf people; otherwise, their development, access to information, social inclusion and participation in everyday life remain limited [5].
In recent years, several works have been conducted on hand gesture recognition based on visual information [6,7,8] in order to develop human–computer applications. In these works [9], machine learning algorithms are used to recognize static and moving signs in various human–computer interaction applications, such as controlling a robot. However, there are fewer works in which an animation or avatar is developed to communicate with deaf people. The authors of [10] implemented an avatar that performs signs; motion capture (MoCap) was employed to capture body, limb and head movements in 3D space. Consequently, these movements had to be corrected during the post-production process, and additional animations had to be made by moving each finger bone to the required sign position.
In [11], facial animations are created from two images with a numerical traced algorithm. Their methodology uses the homotopy curve path to generate intermediate frames for different values of the homotopy parameter λ. The intermediate frames are the deformations from the initial image to the final image. A hyperspherical tracking method establishes deformations with visually consistent and smooth changes. In the experiments, the radius of the hypersphere is constant. This method showed good results in the examples presented.
In this research, we are interested in creating animations between pairs of sign language letters using the method proposed in [11]. The original contribution of this research is to use a genetic algorithm [12] to optimize the radius and its increment when plotting the homotopy curve of the numerical traced algorithm, in order to calculate the animation between pairs of letters of the sign language alphabet. In this way, from a base of images of letters of the alphabet (https://acortar.link/1KWigu (accessed on 8 February 2024)), animations can be generated to spell words in sign language. One advantage of generating animations with this algorithm is that once an animation is generated between a pair of letters, such as (a,b), the same animation can be used for the pair (b,a) by executing it in reverse order. The files containing the animations between pairs of letters weigh 18.3 KB on average, and they are executed in Matlab (R2024a).
The manuscript is organized as follows: Section 2 describes the hand gesture animation system. In Section 3, the homotopy-based animation method is introduced. In Section 4, optimization with a genetic algorithm is explained. After that, in Section 5, the experimental design and the obtained results are presented. In Section 6, a brief discussion is presented. Finally, Section 7 summarizes the findings of this research and sets up future work.
3. Homotopy-Based Animation Method
Homotopy continuation methods [15] are based on the insertion of a homotopy parameter λ into non-linear algebraic equations in order to obtain a continuous deformation from a trivial state to a non-linear state:

H(x, λ) = 0,  x = (x_1, x_2, …, x_n)  (1)

In (1), n is the number of variables and x is the set of variables from the system of equations.
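The convex homotopy underlying this family of methods can be sketched in a few lines of Python (an illustrative sketch, not the paper's implementation): H(x, λ) = λF(x) + (1 − λ)G(x), which reduces to the trivial system G at λ = 0 and to the target system F at λ = 1.

```python
def homotopy(F, G, lam):
    """Convex homotopy H(x, lam) = lam*F(x) + (1 - lam)*G(x).

    F and G are callables returning the residuals of the target and
    trivial systems; lam is the homotopy parameter in [0, 1].
    """
    return lambda x: [lam * f + (1 - lam) * g for f, g in zip(F(x), G(x))]

# At lam = 0 the homotopy is the trivial system; at lam = 1, the target one.
F = lambda x: [x[0] ** 2 - 4]   # target system: root at x = 2
G = lambda x: [x[0] - 1]        # trivial system: root at x = 1
```

Increasing λ from 0 to 1 continuously deforms the trivial system into the target system, which is the mechanism the animation method exploits.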
Transitions between the starting and ending hand gestures are calculated by applying the homotopy-based animation method (HAM) explained in [11]. Following the same notation, the initial hand gesture is named G1 and the final hand gesture is named G2. At the stage of hand joint position detection (Figure 1), each gesture is stored in a matrix of 21 rows and 3 columns, since 21 joints with 3 components are detected in each hand (see the first column of Table 1). In (2), the initial G1 and ending G2 hand gestures are introduced to the system of equations:

H(x, y, λ) = λ g(y) + (1 − λ) f(x) = 0  (2)

where λ represents the homotopy parameter and f(x) and g(y) are the G1 and G2 system equations, respectively. Each joint corresponds to variables x_i for the initial hand gesture (second column in Table 1) and to y_i for the end hand gesture (third column in Table 1).
According to f(x) and g(y), the starting and ending hand gestures can be established as follows:

f(x) = Ax − B = 0  (3)

g(y) = Ay − D = 0  (4)
The systems of equations shown in (3) and (4) are substituted into (2) to obtain a global system of equations that combines the systems of the starting and ending hand gestures in order to create the animation. To achieve the deformations, or transitions, from the initial hand gesture G1 when λ = 0 to the end hand gesture G2 when λ = 1, it is necessary to track the homotopic curve using a numerical traced algorithm. For this purpose, the hypersphere equation [16] is introduced:

(x_1 − c_1)² + (x_2 − c_2)² + … + (x_n − c_n)² + (λ − λ_i)² = r²  (5)
where λ_i is the value of λ in each transition. To start the tracing of the curve [17], the value of λ_i is 0; n + 1 is the dimension of the hypersphere; c_1, …, c_n are the coordinates of the center of the hypersphere; and r is the radius of the hypersphere. Therefore, the system of equations to be solved to calculate the transitions from a starting hand gesture to a final hand gesture contains (2) and (5).
The numerical traced algorithm calculates the transitions between the hand gestures G1 and G2 as follows:
- 1.
Matrices A and C are created with random values; for this research, A and C are equal to simplify the calculations.
- 2.
Matrix B is calculated using the values of the initial hand gesture joints and matrix A. Matrix D is calculated using the values of the joints of the end hand gesture and matrix A. B and D are kept constant during the execution of the algorithm.
- 3.
Since G1 is Ax = B and G2 is Ay = D, these systems can be rewritten as Ax − B = 0 and Ay − D = 0; substituting them into (2) gives (6):

λ(Ay − D) + (1 − λ)(Ax − B) = 0  (6)
- 4.
In (6), x and y correspond to the joint positions in G1 and G2, respectively, and both sets of variables correspond to the same joint positions in the intermediate gestures of the hand. Therefore, x and y correspond to the same joint, and the variable y is changed to x. Then, to calculate the intermediate transitions, (6) is changed as follows:

λ(Ax − D) + (1 − λ)(Ax − B) = 0  (7)
- 5.
The system of Equations (8) is formed by combining (7) with the hypersphere Equation (5). Thus, by solving (8), x contains the transitions needed to obtain animations between pairs of hand gestures.
- 6.
The centers c_i of the hypersphere are substituted by the values of G1, and λ_i is set to the initial value of λ, which is 0. A value is assigned to r. The system of Equations (8) is solved iteratively with the Newton–Raphson method [18].
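The steps above can be sketched end to end in Python. The sketch below (using NumPy; the predictor step, the well-conditioned random choice of A and all names are our own simplifications, not the paper's exact code) builds B and D from the joint coordinates of G1 and G2, then traces the homotopy curve by repeatedly centering the hypersphere at the current point and applying Newton–Raphson to the homotopy equations combined with the hypersphere Equation (5):

```python
import numpy as np

def trace_homotopy(x_start, x_end, r=0.2, max_steps=200, tol=1e-10):
    """Trace the homotopy curve between two gestures.

    x_start / x_end are the flattened joint coordinates of G1 and G2.
    Returns the list of (x, lambda) transition points, from lambda = 0
    (gesture G1) until lambda reaches 1 (gesture G2).
    """
    n = len(x_start)
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # random, well-conditioned A
    B = A @ x_start                                  # B from the joints of G1
    D = A @ x_end                                    # D from the joints of G2

    x, lam = x_start.astype(float), 0.0
    path = [(x.copy(), lam)]
    tangent = np.append(np.zeros(n), 1.0)            # first predictor: push lambda up
    for _ in range(max_steps):
        c, lam_i = x.copy(), lam                     # center the hypersphere here
        z = np.append(x, lam) + r * tangent          # predictor of length r
        for _ in range(50):                          # Newton-Raphson corrector
            xs, ls = z[:n], z[n]
            # Residual: homotopy equations plus the hypersphere constraint.
            F = np.append(A @ xs - B + ls * (B - D),
                          np.sum((xs - c) ** 2) + (ls - lam_i) ** 2 - r ** 2)
            J = np.zeros((n + 1, n + 1))             # Jacobian of the system
            J[:n, :n] = A
            J[:n, n] = B - D
            J[n, :n] = 2 * (xs - c)
            J[n, n] = 2 * (ls - lam_i)
            step = np.linalg.solve(J, F)
            z -= step
            if np.linalg.norm(step) < tol:
                break
        tangent = z - np.append(x, lam)
        tangent /= np.linalg.norm(tangent)
        x, lam = z[:n], z[n]
        path.append((x.copy(), lam))
        if lam >= 1.0:                               # reached the final gesture G2
            break
    return path
```

Each accepted point is one transition of the animation; a smaller radius r yields more intermediate gestures, and tracing stops once λ reaches 1.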
Figure 3 shows the hand gestures corresponding to the letters a, b, c and d. The transitions between the pairs of hand gestures (a,b), (b,c), (d,c) and (b,d) were calculated. For each pair of gestures, the numerical traced algorithm was run 7 times, and the value of the radius r was set as shown in Table 2.
Figure 4 shows a graph for each of the following pairs of hand gestures: (a,b), (b,c), (d,c) and (b,d). Each graph shows the 7 runs of the numerical traced algorithm, with each line corresponding to an animation. The x-axis shows the iterations executed to solve the system of Equations (8), and the y-axis shows the value of λ calculated in each iteration. In each graph, it is observed that the lines start at λ equal to 0, which corresponds to G1. As the iterations are executed to solve the system of equations, the value of λ must increase; then, according to (2), when λ is equal to 1, the algorithm has calculated the transition from G1 to G2 successfully.
Circles on the transition lines indicate the transition gestures that were calculated for each value of r. In each graph, the line that reached a value of λ close to 1 is highlighted in black. For λ in the interval [0,1], the black transition line has 5 circles for (a,b), 7 circles for (b,c), 2 circles for (d,c) and 5 circles for (b,d). No transitions were created for (d,c), since the solution to the system of equations is the initial gesture G1 when λ equals 0 and the final gesture G2 when λ equals 1, and in the next iteration the value of λ decreases.
Figure 5 shows the animations created for each letter pair (a,b), (b,c), (d,c) and (b,d).
According to [19], if the radius of the hypersphere varies, more transitions can be calculated. To verify this, the letter pair (d,c) was chosen, because only two transitions were obtained with a fixed radius value in each run. In each run, the value of the radius was increased whenever no change in the value of λ was observed between the current and previous iteration. Several tests were performed, and one of the best results is shown in Figure 6. Five transition lines are shown with initial radius values of 0.05, 0.1, 0.15, 0.20 and 0.25, respectively; in each run the radius increment was set to 0.05, and all runs reached λ equal to 1. The line highlighted in black corresponds to an initial radius of 0.1 and has 20 circles, which means that the animation has 20 transitions. Figure 7 shows the 20 transitions calculated for the letters (d,c) using an initial radius of 0.1 increased by 0.05.
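The radius-increase rule used in these runs reduces to a one-line check (a sketch; the stall tolerance eps is our assumption, while the default increment of 0.05 mirrors the value used above):

```python
def adapt_radius(lam_prev, lam_curr, r, increment=0.05, eps=1e-6):
    """Enlarge the hypersphere radius when the homotopy parameter stalls,
    i.e., when lambda did not change between consecutive iterations."""
    return r + increment if abs(lam_curr - lam_prev) < eps else r
```

Calling this after every accepted transition lets the tracer escape regions where a fixed radius would keep intersecting the curve at the same λ.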
In order to automate the generation of transitions in this research, a genetic algorithm (GA) was used to optimize the radius parameter and its increment in the numerical traced algorithm to obtain transitions between pairs of letters of the sign alphabet.
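Section 4 details the GA; as a minimal illustration of the idea, a real-coded GA over the two parameters (radius, increment) can be sketched as follows. The operators (tournament selection, uniform crossover, Gaussian mutation, elitism) and all constants are illustrative assumptions, not the paper's exact configuration:

```python
import random

def genetic_optimize(fitness, bounds, pop_size=20, generations=30,
                     mut_rate=0.2, seed=0):
    """Real-coded GA: tournament selection, uniform crossover,
    Gaussian mutation clipped to the parameter bounds, 2-elitism."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = sorted(pop, key=fitness, reverse=True)[:2]   # keep the 2 best
        while len(new_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)          # tournament of 3
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            if rng.random() < mut_rate:                        # Gaussian mutation
                i = rng.randrange(len(child))
                lo, hi = bounds[i]
                child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.1 * (hi - lo))))
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)
```

For the task at hand, `fitness` would run the numerical traced algorithm with an individual's (radius, increment) values and score how close the final λ comes to 1.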
5. Experiments and Results
The experiments designed to evaluate the animations created with the numerical traced algorithm optimized with a GA are as follows:
Three videos were recorded in which a person spells the following pairs of letters: (h,o), (o,l) and (l,a).
With the MediaPipe library and Python, the positions of the joints were obtained and recorded in a .txt file. The file structure is as follows: the first column corresponds to the x-coordinate, the second to the y-coordinate and the third to the z-coordinate of the joints; the first 21 rows correspond to the 21 joints in the first frame, the next 21 rows to the second frame, and so on. The videos, text files and the Matlab program that shows the animations have been uploaded to the following link: https://acortar.link/YI1ajV (accessed on 8 February 2024).
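The .txt layout described above can be parsed into per-frame joint matrices with a short helper (an illustrative sketch; reading the file into a string is left to the caller):

```python
def load_joint_frames(text, joints_per_frame=21):
    """Split the whitespace-separated x, y, z rows of a joint file into
    frames: each frame is a list of 21 (x, y, z) tuples."""
    rows = [tuple(float(v) for v in line.split())
            for line in text.splitlines() if line.strip()]
    if len(rows) % joints_per_frame != 0:
        raise ValueError("file must contain a whole number of frames")
    return [rows[i:i + joints_per_frame]
            for i in range(0, len(rows), joints_per_frame)]
```

The first and last frames returned by this helper are exactly the inputs the numerical traced algorithm needs to build an animation between the two recorded gestures.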
The positions of the joints in the first and last frame of each .txt file were used to create the animation with the numerical traced algorithm optimized with a GA. The GA was run 30 times (10 for each pair of letters). Table 4 shows the statistical results. Figure 9 shows the execution of the three best individuals: one for calculating the animation of (h,o), another for (o,l) and the last one for (l,a).
The similarity between the animations created with the numerical traced algorithm and the recorded sequences was measured using Dynamic Time Warping (DTW) [23]. DTW [24,25] is useful because it lets us compare time series with different numbers of frames.
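DTW itself reduces to a short dynamic program. The sketch below is the textbook formulation (not tied to any particular library); `dist` compares two frames, so for joint matrices it could be a Euclidean distance over all 21 joints:

```python
def dtw_distance(seq_a, seq_b, dist=lambda u, v: abs(u - v)):
    """Classic O(len_a * len_b) dynamic time warping distance.
    Works for sequences of different lengths; `dist` compares two frames."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Best alignment ending at (i, j): insertion, deletion or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

A lower DTW value means the two motion sequences are more similar, regardless of how many frames each one contains.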
Table 5 shows the similarity value between the recordings made and the animations created. The diagonal corresponds to the similarity when comparing the recording of a pair of letters to its corresponding animation. In each row, the cell on the diagonal has the smallest value, so each recording is more similar to its corresponding animation than to any other.
Table 6 shows the similarity value between the recordings, and Table 7 shows the similarity value between the animations. When comparing the real sequences and the animations, the greatest difference is between the sequences (h,o) and (l,a), followed by (o,l) and (l,a), and then (h,o) and (o,l); thus, the similarity measures between the real recordings and the animations maintain the same order of similarity.
For the last experiment, for each of the images uploaded to the following link, https://acortar.link/1KWigu (accessed on 8 February 2024), the positions (x,y,z) of the 21 joints were obtained and stored in a .txt file. Subsequently, 156 animations, calculated using the numerical algorithm optimized by a GA from pairs of gestures taken from the 20 .txt files containing the positions (x,y,z) of the 21 hand joints, were loaded into the animation folder. When creating the animations, we realized that having the animation of, for example, (b,c) means that it is not necessary to create a file for the animation (c,b); we can simply run the animation (b,c) in reverse order. In this way, it is not necessary to record animations between all pairs of letters. A file that uses 30 frames to create an animation weighs 18.3 KB on average.
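Reusing a stored animation for the reverse letter pair amounts to a dictionary lookup plus a reversed frame list (a sketch; the `animations` mapping is a hypothetical in-memory stand-in for the files in the animation folder):

```python
def frames_for_pair(animations, start, end):
    """Look up the stored animation for (start, end); if only the reverse
    pair (end, start) exists, play that animation backwards instead."""
    if (start, end) in animations:
        return animations[(start, end)]
    if (end, start) in animations:
        return animations[(end, start)][::-1]
    raise KeyError(f"no animation stored for {start!r}->{end!r}")
```

This halves the number of animation files needed to spell arbitrary words.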
For the 156 animations created, it was measured whether the last frame corresponds to the letter indicated in the sequence. For example, in the sequence “lt”, we want to know whether the last frame corresponds to the letter t. For each of the 156 sequences, we compared, using Euclidean distance, the joint positions of the last frame in the sequence with the joint positions of the 20 gestures a, b, c, d, e, f, g, h, i, l, m, n, o, p, r, s, t, u, v and w. The gesture is identified as the one that is closest in distance.
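This last-frame check is a nearest-neighbour classification. A minimal sketch (the `templates` mapping from letters to 21-joint poses is assumed to come from the 20 gesture files):

```python
import math

def classify_gesture(frame, templates):
    """Return the label of the template gesture whose 21 joint positions
    are closest to `frame` in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2
                             for ja, jb in zip(a, b)      # pair up joints
                             for p, q in zip(ja, jb)))    # pair up x, y, z
    return min(templates, key=lambda label: dist(frame, templates[label]))
```

An animation ends correctly when the label returned for its last frame matches the second letter of the pair.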
Table 8 shows the result of this classification. The first column shows the final letter of an animation, and the second column shows the number of sequences that end with that letter and are correctly classified; the third column shows that only two sequences ending with the letter v were confused with the letter u. The accuracy of creating animations between pairs of letters in which the final frame corresponds to the desired letter is 98.8%.
Finally, two files, phrase1.m and phrase2.m, were uploaded to the animation folder to show the animation of the phrases “We eat some bananas” and “The table is big”. In these files, it can be observed that, to spell the word banana, the animation of “an” was used to spell both “an” and “na”; the program indicates the order in which this sequence is executed.
6. Discussion
Using the numerical traced algorithm proposed in [11], gesture transitions between pairs of letters of the sign alphabet were calculated. The transitions have an associated λ value in the interval [0,1]. The calculated transitions are smooth changes from an initial gesture G1 to a final gesture G2. Calculating the final gesture and the number of transitions depends on the value assigned to the radius of the hypersphere and on its increment. Figure 5c shows that, with a fixed radius value in all iterations, two transitions were calculated for the letter pair (d,c) with different values of λ in [0,1]; subsequently, by increasing the radius, 20 transitions were created (Figure 7).
In this research, we proposed the use of a GA to set the value of the radius and its increment to calculate the animation between pairs of letters. For each individual, the numerical algorithm was executed 30 times, and the best individual had a fitness equal to 1 at the end of the iterations. In this manner, the best individual must contain at least 30 different gestures. To decrease the number of gestures in the animation, fewer iterations must be performed; to increase the number of gestures, the number of iterations must be increased. DTW was used to measure the similarity between the animations created for three pairs of letters and sequences recorded by a person; the distance between a recording and the animation corresponding to the same pair of letters is smaller, so the animations created can be used as patterns for dynamic sign recognition.
7. Conclusions
In this research, the radius parameter and its increment were optimized to obtain animations between pairs of letters of the sign alphabet using the numerical traced algorithm presented in [11]. We performed experiments with the proposed method and observed the following: a value is assigned to the radius of the hypersphere, and the number of intermediate images calculated depends on this radius value. There is no guarantee that a final image will be calculated [19]; better results are obtained if the value of the radius is increased while plotting the homotopy curves that calculate the deformations between the initial and the final image. The number of transitions can be changed by changing the number of iterations that the numerical algorithm executes for each individual.
The animations created with the proposed optimization were compared with the real recordings and were most similar to their corresponding recordings. With the proposal made in this research, it is concluded that animations can be generated between pairs of sign language letters to implement applications that communicate with and provide assistance to deaf people.
From 20 gestures of the sign alphabet, animations between pairs of letters were created to form sentences. The advantage is that each animation between a pair of letters weighs 18.3 KB on average and that the same animation, such as (a,b), can be used to execute its inverse animation, (b,a); only the direction in which the animation is executed is changed. This can be seen in the files uploaded to the link https://acortar.link/1KWigu (accessed on 8 February 2024), which makes it easy to create animations of words and phrases.
Future work based on this research should focus on implementing these animations in avatars that are controlled by joint movement. Additionally, in [26], actions were recognized from a set of key poses using DTW. The animations calculated by the method proposed in this research can be used as key poses to recognize dynamic sign language.