**4. Results**

Here, we present a detailed analysis of the proposed sign language learning system using the 3D avatar model. This section consists of four subsections: the sign database (Section 4.1), speech recognition results (Section 4.2), results of the translation process (Section 4.3), and generation of sign movement from an ISL sentence (Section 4.4).

#### *4.1. Sign Database*

We created a sign language database that contains sentences based on 50 commonly used ISL words (e.g., I, my, come, home, welcome, sorry, rain, you, baby, wind, man, woman, etc.) and other dialogues between different users. We created 150 sentences that contain 763 different words, including the most used words in ISL. For each word, the sign movements were defined in the Blender toolkit. Table 4 describes the sign database. The vocabulary items were created based on the unique words in ISL. For better understanding, each word is represented by four animation sequences. For the sake of simplicity, Table 5 presents some example sign movements using two animated sequences each, along with their corresponding English words. All such sign movements were defined with the help of a sign language expert from a school for the hearing impaired ('Anushruti') at the Indian Institute of Technology Roorkee.

**Table 4.** Dataset description.



**Table 5.** English word to sign movements (each sign movement consists of 2 sequential motions).

#### *4.2. Speech Recognition Results*

Speech recognition is performed using the "IBM-Watson speech to text" service, which converts English audio recordings or audio files into the corresponding text. The service takes a speech or audio file (.wav, .flac, or .mp3 format), which may have different sampling frequencies, and returns the recognized text as output. The sampling frequency of our audio files is 16 kHz. The results of the speech recognition module for both isolated words (discrete speech) and complete sentences (continuous speech) are presented in Table 6. For each speech signal shown, the *X*-axis denotes the time in seconds, and the *Y*-axis represents the amplitude of the signal. For the sake of simplicity, we take two discrete and two continuous speech signals for conversion.
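As a small illustration of the audio properties mentioned above, the following sketch reads the sampling frequency and duration of a WAV file and builds the time axis (in seconds) against which an amplitude plot would be drawn. It uses only Python's standard `wave` module; the helper names `write_silence`, `wav_info`, and `time_axis` are hypothetical, and the silent test file merely stands in for a real recording. It does not call the speech-to-text service itself.

```python
import os
import tempfile
import wave


def write_silence(path, seconds, rate=16000):
    """Write a mono, 16-bit WAV file of silence (a stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)   # sampling frequency in Hz
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


def wav_info(path):
    """Return (sampling frequency in Hz, duration in seconds) of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
    return rate, n_frames / rate


def time_axis(n_samples, rate):
    """Time stamp in seconds for each sample, i.e. the X-axis of a waveform plot."""
    return [i / rate for i in range(n_samples)]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "example.wav")
    write_silence(path, seconds=0.5)
    print(wav_info(path))
```

Checking the sampling frequency before submitting a file is a cheap way to confirm that a recording matches the 16 kHz setting stated above.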


**Table 6.** Mapping of the speech signal to text for different types of speech.

#### *4.3. Results of Translation Process*

This section illustrates the translation process of the proposed model. The translation process includes English-to-ISL sentence conversion and ISL sentence-to-sign representation. The proposed system was evaluated by dividing the generated sentences in an 80:20 ratio between the training and testing sets, respectively, and recording the Word Error Rate (WER) of the input words. The result of the text processing system is presented in Table 7, where the WER metric is derived from the Levenshtein distance (edit distance function). Here, we compare the words of the reference sentence with those of the output sentence. The distance counts the number of edits (insertions, deletions, and substitutions) required to convert the input text to the correct reference text. In Table 7, Ins, Del, and Sub refer to the number of insertion, deletion, and substitution operations, respectively, needed to convert the source text to the proper target text.
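The WER computation described above can be sketched as follows. This is a minimal word-level Levenshtein implementation, not the authors' code: the function names `edit_ops` and `wer` are hypothetical, the example sentences are illustrative, and the counts follow the convention stated in the text (operations needed to turn the output text into the reference text). WER is taken as the total number of edits divided by the number of reference words, expressed as a percentage.

```python
def edit_ops(output, reference):
    """Count (ins, dele, sub) word edits that turn `output` into `reference`."""
    n, m = len(output), len(reference)
    # dist[i][j] = minimum edits to turn output[:i] into reference[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every output word
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every reference word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if output[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back through the table to attribute each edit to a type.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and output[i - 1] == reference[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1             # words match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub


def wer(output, reference):
    """Word Error Rate in percent: total edits over reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    return 100.0 * sum(edit_ops(output, reference)) / len(reference)


if __name__ == "__main__":
    out = "come my home".split()
    ref = "come to my home".split()
    print(edit_ops(out, ref), wer(out, ref))
```

When several alignments have the same minimal cost, the split between Ins, Del, and Sub can differ between implementations, but the total edit count, and therefore the WER, stays the same.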

**Table 7.** Text processing results based on Word Error Rate (WER).


Three metrics were considered for evaluating the performance of the translation system: SER, BLEU, and NIST. SER measures the sign error rate during the generation of each sign from the ISL sentence. In this work, we recorded an SER of 10.50 on the test data. This error arises from word errors in the text input, which result in wrong sign generation by the avatar. BLEU and NIST are used for evaluating the quality of the text produced during translation from English (source language) to ISL (target language). The translation is scored against multiple reference texts (drawn from the vocabulary), and the precision score is calculated over unigram, bigram, ..., *n*-gram matches, where *n* is the number of words in the reference text.

BLEU assigns equal weights to all *n*-grams, whereas NIST gives more importance to rare words and smaller weights to frequently used words in the sentence, so the overall NIST score is higher than the BLEU score. The result of the rule-based translation process is presented in Table 8, which shows that the NIST score exceeds the BLEU score.
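To make the equal-weight *n*-gram scoring concrete, the following is a minimal sketch of sentence-level BLEU with clipped *n*-gram precision, multiple references, and a brevity penalty. It is not the evaluation script used in this work: the function names, the `max_n` parameter, and the example token sequences are assumptions for illustration, and the information-weighted NIST variant is not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against multiple references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with equal weights on the 1..max_n gram precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_avg)


if __name__ == "__main__":
    # Illustrative token sequences only, not taken from the sign database.
    print(bleu(["i", "home", "go"], [["i", "home", "come"]], max_n=2))
```

For short sentences such as those in this kind of dataset, a small `max_n` avoids zero scores caused by missing higher-order *n*-grams; a smoothed variant would be another option.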

**Table 8.** Performance analysis of the proposed translation system. SER: Sign Error Rate; BLEU: Bilingual Evaluation Understudy; NIST: National Institute of Standards and Technology.


#### *4.4. Generation of Sign Movement from ISL Sentence*

After converting the English sentence into the ISL sentence in module 2 (Section 3.2), we proceeded to generate the sign movement for the ISL sentence. The avatar generates animated sign movements for each meaningful word. The avatar was created using the Blender software. We have plotted the ISL movements corresponding to several English sentences. Figure 6A shows the actions (action 1, action 2, and action 3) of the sign language representation for the English sentence "Come to my home". Figure 6B depicts the sign language representation of the English sentence "Hello, Good morning" (action 1, action 2, and action 3), and Figure 6C shows the sign language representation of the English sentence "Bye baby" (action 1, action 2). Figure 6D shows the sign language representation (action 1, action 2) of the English sentence "Please come".

**Figure 6.** Sign language representations of English sentences: (**A**) Come to my home, (**B**) Hello, good morning, (**C**) Bye baby, (**D**) Please come.

The quality of the proposed system was evaluated by adopting the Absolute Category Rating (ACR) [38] scheme. The ratings presented to the users are sorted by quality in decreasing order: Excellent, Good, Fair, Poor, and Bad. The performance is measured based on the output sign movements produced for various speech or text inputs. Among the 25 raters, the majority rating was "Good". A prototype video demonstration of the system is also available on YouTube (https://www.youtube.com/watch?v=jTtRi8PG0cs&ab\_channel=PradeepKumar (accessed on 22 December 2020)).
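A minimal sketch of how ACR responses could be aggregated is given below. It assumes the common 5-to-1 numeric mapping of the five categories and reports both the most frequent category and the mean opinion score; the function name `summarize_acr` is hypothetical, and the ratings in the usage example are made up for illustration rather than taken from the study.

```python
from collections import Counter

# Assumed numeric mapping of the five ACR categories (5 = best, 1 = worst).
ACR_SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}


def summarize_acr(ratings):
    """Return (most frequent category, mean opinion score) for a list of labels."""
    if not ratings:
        raise ValueError("at least one rating is required")
    unknown = set(ratings) - set(ACR_SCALE)
    if unknown:
        raise ValueError(f"unknown ACR labels: {sorted(unknown)}")
    counts = Counter(ratings)
    # Ties in frequency are broken in favour of the higher-quality category.
    top = max(counts, key=lambda label: (counts[label], ACR_SCALE[label]))
    mean_score = sum(ACR_SCALE[r] for r in ratings) / len(ratings)
    return top, mean_score


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    example = ["Good", "Good", "Good", "Fair", "Excellent"]
    print(summarize_acr(example))
```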
