1. Introduction
Every country has a different sign language, which is based on its native language. It is not easy for us to speak when we know the other person cannot hear us. Even those of us with sufficient hearing tend to ignore or avoid communication with those who do not hear, and for those who cannot hear it becomes even more difficult. Having the skill to talk to those who cannot hear not only bridges the gap between the two but also helps in the exchange of ideas and new thoughts, which could encourage these people to contribute to the development of technology. Every mind can contribute to turning unknowns into knowns and making the impossible possible.
1.1. Indian Sign Language
Indian Sign Language can facilitate people to create an inclusive society in which people with disabilities have equal chances for growth and development so that they can live productive, safe, and dignified lives. In India’s hard-of-hearing community, Indian Sign Language (ISL) is widely utilized. However, ISL is not utilized to teach hard-of-hearing students in deaf schools. Teacher education programs do not train teachers to use ISL in their classrooms. Sign language is not included in any of the teaching materials. The parents of hard-of-hearing children are often unaware of sign language’s value in bridging communication gaps. ISL interpreters are in high demand at institutes and other locations where hard-of-hearing and hearing individuals communicate, yet India only has about 300 licensed interpreters.
ISL initiatives aim to achieve the following:
To train people to use Indian Sign Language (ISL) and to educate and conduct research on the language, including bilingualism.
To encourage hard-of-hearing students in primary, intermediate, and higher education to use Indian Sign Language as a form of instruction.
To educate and train diverse groups, such as government officials, teachers, professionals, community leaders, and the public, on Indian Sign Language and how to use it.
To promote and propagate Indian Sign Language in collaboration with hard-of-hearing groups and other institutions working on disabilities.
1.2. HamNoSys vs. ISL Gestures
Unlike other sign language scripts, HamNoSys is not intended to be a practical writing tool for everyday communication. Rather, its motivation is similar to that of the International Phonetic Alphabet, which is intended to consistently transcribe the sounds of any spoken language [1]. Because the large number of possible parameter variants precluded the use of a well-known alphabet, the newly produced glyphs had to be designed so that memorizing or deducing the meaning of the symbols is as simple as possible [2].
HamNoSys consists of handshapes and gestures, as shown in Figure 1. Indian Sign Language consists of gestures, as shown in Table 1. HamNoSys contains pointers, as shown in Figure 2. Indian Sign Language consists of gestures, as shown in Table 2.
Sign language [3] can be defined as not just hand gestures to express words or sentences but also their meanings. Moreover, when it comes to defining sign language in depth, it can be explained in three broad categories (as described in Figure 3): nonmanual, one-handed, and two-handed.
When it comes to body gestures, as the name suggests, these include body movements other than hand gestures, such as head movements, facial expressions, and mouth shapes, and they can be seen as modulation: when speaking, we modulate our voice to show variation in our speech.
One-handed signs comprise static and dynamic movements, which are further divided into manual and nonmanual.
Two-handed signs are divided into the above categories too, but their movements can be further divided into Type 0 and Type 1. Type 0 includes signs performed by both hands equally, whereas Type 1 includes signs that also involve both hands but in which one hand is dominant over the other, taking the lead.
2. Literature Survey
2.1. Sign Language in English
A two-way communication system is suggested in the paper [4], but the authors are only able to convert the 26 letters of the alphabet and three characters, with an accuracy rate of 99.78%, using CNN models. The authors suggest that future work be conducted in the field of natural language processing to convert speech into sign language.
In the paper [5], the authors propose a system that converts sign language into English and Malayalam. They suggest using an Arduino Uno with a pair of gloves to recognize the signs and translate them from ISL into the preferred language. The system is useful, as it recognizes two-handed and motion signs.
The Indian Sign Language interpreter presented in the paper [6] uses hybrid CNN models to detect multiple sign gestures and then predicts the sentence that the user is trying to gesture by using natural language processing techniques. The system achieves 80–95% accuracy under various conditions.
In another study, the HSR model is used by the authors to convert ISL signs into text. The HSR model gives an advantage over RGB-based models, but the system's accuracy ranges from 30% to 100%, depending on the illumination, hand position, finger position, etc. [7].
The authors of the paper [8] propose a system that recognizes 26 ASL signs and converts them into English text. They use principal component analysis to detect the signs in MATLAB.
The ASL to sign language synthesis tool uses VRML avatars and plays them using a BAP player. The major problem with the system is that many complex movements are not possible using the current VRML avatars. For example, touching the hand to any part of the body is not possible in the current system [9].
In another study [10], a video-based sign language translation system converts signs from ISL, BSL, and ASL with an overall accuracy of 92.4%. The software utilizes CNNs and RNNs for the real-time recognition of dynamic signs. The system converts the signs into text and then uses a text-to-speech API to give audio output to the user.
The authors of another paper first use the Microsoft Kinect 360 camera to capture the movement of the ISL signs. The Unity engine is used to display the Blender 3D animation created by the authors. Although the system can successfully convert words into sign language, it is not able to convert phrases/multiple words into ISL.
2.2. Sign Language in Other Languages
The work presented by the authors of [11] is another bidirectional sign language system. The system achieves 97% accuracy when translating sign languages into text or audio. The authors use the Google API to convert speech to text, and the system then produces a 3D figure using the Unity engine after extracting keywords from the input [12].
Another system, proposed by the authors in the paper [13], converts Malayalam text and gives a 3D animated avatar as the sign language output. The system uses HamNoSys notation, as it is the main structure of the signs [14]. A unique Russian text to Russian sign language system [15] utilizes semantic analysis algorithms to convert text to sign language, focusing on the lexical meanings of the words. Although the system can reduce the sentence into gestures, the authors observe that the sentence proposition could be improved by making the algorithm more efficient.
3. Comparison
As shown in Table 3, most of the existing models that convert English text to a sign language [16], whether BSL, ASL, or ISL, use natural language processing. The major problem that almost all the existing models face is that the conversion of text to sign language only happens if the sign language for that particular word is present in the database.
Our model not only overcomes this problem but also takes it one step further by converting and displaying sign language for phrases/sentences: if the input from the user contains any combination of words for which a particular sign is present in the ISL database, then the proposed system displays the sign for that combination of words in one go.
4. Proposed Work
The proposed system presented in this paper is a real-time audio to Indian Sign Language conversion system which will assist hearing-impaired people to communicate easily with hearing people. The system comprises six main components:
Audio-to-text conversion if the input is audio.
The tokenization of English text into words.
Parsing the English text into phrase structure trees.
The reordering of sentences based on Indian Sign Language grammar rules.
Using lemmatization along with part-of-speech tagging so that synonyms of words or the root form of a word can be used if the exact word is not present in the database.
Indian Sign Language video output.
The overall efficiency of the system is improved, as it splits a word into letters: if the video for the corresponding word is not present in the database, the system shows the video output letter by letter so that no word is skipped. Another unique feature of the system is that it can recognize phrases in the sentence and show the sign language video corresponding to the phrase, if present in the database, instead of proceeding word by word. The database contains around 1000+ videos, which are a combination of videos self-recorded by the authors and open-source ISL faculty videos. Thus, it increases the scope and coverage of the system.
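The phrase-first, then word, then letter-by-letter fallback can be sketched as follows. This is a minimal illustration, assuming a toy set of database keys; the real database maps each key to an ISL video file, and all names here are hypothetical.

```python
# Sketch of the phrase -> word -> letter fallback.
# DATABASE stands in for the keys of the real video database.
DATABASE = {"new delhi", "of", "india", "change in temperature"}

def select_videos(keywords):
    """Greedily match the longest phrase present in the database;
    fall back to single words, then to individual letters."""
    videos = []
    i = 0
    while i < len(keywords):
        # Try the longest phrase starting at position i first.
        for j in range(len(keywords), i, -1):
            phrase = " ".join(keywords[i:j]).lower()
            if phrase in DATABASE:
                videos.append(phrase)
                i = j
                break
        else:
            # Neither a phrase nor the single word matched:
            # spell the word letter by letter.
            videos.extend(list(keywords[i].lower()))
            i += 1
    return videos

print(select_videos(["New", "Delhi", "national"]))
```

Greedy longest-match first is what lets "New Delhi" or "Change in Temperature" play as a single clip before any per-word or per-letter fallback is attempted.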
5. System Architecture
Figure 4 shows the system architecture. The user has the option to enter the input either as text or as audio. The input is processed through the natural language model designed by the authors, and keywords are given as the output. If the text within the keywords contains phrases or combinations of multiple words for which a sign language video is present in the database, then those videos are shown; otherwise, the keywords are tokenized further into words or letters.
6. Hidden Markov Model
The hidden Markov model is one of the models that may be used as a classifier; it consists of a set of states where the transition from one state to the next is determined by a specific input. As a result, the shift from state to state continues until the output state, or observation, is reached. Furthermore, the likelihood of a specific transition is influenced by the likelihood of the transition into the current state. A probability model is made up of three basic components: a well-defined experiment, a sample space (Ω) that contains all possible events, and an event chosen from the sample space. HMM, on the other hand, is based on conditional probability, which implies that the likelihood of a specific event X occurring relies on the probability of a previous event Y occurring [27]. This conditional probability can be expressed as follows: P(X|Y) = P(X ∩ Y)/P(Y).
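The conditional probability P(X|Y) = P(X ∩ Y)/P(Y) can be illustrated by estimating tag transition probabilities from bigram counts. The tag sequence below is invented purely for this example.

```python
from collections import Counter

# Toy tag sequence, invented for illustration only.
tags = ["noun", "verb", "noun", "adj", "noun", "verb", "adj", "noun"]

# Count tag bigrams (Y followed by X) and occurrences of Y.
bigrams = Counter(zip(tags, tags[1:]))
unigrams = Counter(tags[:-1])  # only tags that have a successor

def p_transition(y, x):
    """P(next tag = x | current tag = y), estimated from counts:
    count(y, x) / count(y)."""
    return bigrams[(y, x)] / unigrams[y]

# e.g. how often is "noun" followed by "verb"?
print(p_transition("noun", "verb"))
```

Here the joint count of the bigram (Y, X) plays the role of P(X ∩ Y), and the count of Y plays the role of P(Y), matching the formula above.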
Assume we have three states: verb, noun, and adjective, as indicated in Figure 5. Table 4 shows the transition matrix describing the transition from one state to another; for example, the transition probability from verb to adjective is 0.01 and from adjective to verb is 0.02.
HMM tagging, also known as sequence labeling, is a method of mapping a tag sequence to an input sequence. Assume there is a tagging process with inputs X1, X2, X3, …, Xm. The output will be a tag sequence, or state sequence, Y1, Y2, Y3, …, Ym. For part-of-speech tagging, the sentence is the input, and the tag for each word in the sentence is the output. For example, if we have a five-word sentence, the output will be five tags, each representing a part of speech. In machine translation, if the input is a sentence in the source language, the label will be a sentence in the target language.
There are three methods for dealing with the tagging problem: the first is the rule-based strategy, which relies on the use of predefined rules. However, rule-based systems can have several issues, including grammatical leaks, the inability to list all the rules, and the last issue, which is the variation in the rules over time, place, and a variety of other factors. The statistical-based strategy can also be used to tackle the tagging problem. Furthermore, a statistically based model is reliant on statistics as well as the availability of a trainable and already labeled corpus. Finally, the hybrid model, which includes both rule-based and statistical-based models, is the most common and practical solution to dealing with the tagging problem.
7. Methodology
Figure 6 shows the natural language model’s architecture. If the text given as input to the model matches a video in the database, then the input itself is given as the keyword; otherwise, the input text is processed through various NLP techniques and the keywords are then generated. Each part is discussed in more detail below.
NLTK is the heart of the audio to Indian Sign Language conversion system, as it is one of the most powerful open-source NLP libraries for working with human language data. Text processing is performed using NLTK and involves various steps, such as tokenization, the removal of stop words, lemmatization, parse tree generation, part-of-speech (POS) tagging, etc.
Tokenization is the process of splitting text into a list of words, also known as tokens. NLTK has a tokenize module that provides, among others, word_tokenize to split a sentence into a sequence of words and sent_tokenize to split a paragraph into a list of sentences.
Stop words are very common but less informative words that can be ignored, for example: her, me, has, itself, he, so, too, they, them, etc. Since they are not so important in the sentence, they can be removed, which improves the overall performance of the system.
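These two steps can be sketched in pure Python. This is only an illustrative stand-in: the actual system uses NLTK's word_tokenize and its much larger English stopwords corpus, and the regex and stop-word list below are simplified assumptions.

```python
import re

# Illustrative stop-word list; NLTK's English stopwords corpus is far larger.
STOP_WORDS = {"is", "the", "a", "an", "are", "and"}

def tokenize(sentence):
    """Split a sentence into lowercase word tokens
    (a simplified stand-in for nltk.word_tokenize)."""
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stop_words(tokens):
    """Drop common, low-information words before the ISL video lookup."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("New Delhi is the national capital of India")
print(remove_stop_words(tokens))
```

On the sample sentence, "is" and "the" are filtered out, leaving only the content words that the video lookup needs.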
Parsing is the syntax analysis phase, in which the system checks whether the string obtained after tokenization belongs to a proper grammar. Parsing helps to adjust the text based on the target language’s grammar structure. One of the most widely used parsers is the Stanford Parser.
The process of transforming the inflected forms of a word into its root, dictionary form is known as lemmatization. This dictionary form of a word is referred to as a lemma. This step is important for ISL, as it requires root words.
To check the results of lemmatization, we analyze its results through sample sentences.
For example—“He was playing and eating at the same time”.
The results in Table 5 show that lemmatization alone is not sufficient to give accurate root words, as it does not take the context of the sentence into consideration. To overcome this problem, part-of-speech tagging comes into the picture.
POS tagging refers to the process of labeling words with different constructs of English grammar, such as adverbs, adjectives, nouns, verbs, prepositions, etc. POS is a collection of a list of tuples where the first part of the tuple is the word itself and the second part is a tag that identifies whether the word is an adjective, verb, noun, etc.
To check whether part-of-speech tagging can improve the results obtained after lemmatization, we analyze the same sample sentence used above in lemmatization.
The results in Table 6 show that integrating part-of-speech tagging with lemmatization gives the correct base form of a word, which in turn improves the accuracy of word to base word conversion.
Thus, the combination of part-of-speech tagging and lemmatization is used in the proposed system to enhance its accuracy.
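Why the POS tag matters can be sketched with a toy POS-aware lemma lookup. The lemma table below is hand-made for this example; the actual system relies on NLTK's lemmatizer together with its POS tagger.

```python
# Toy lemma table keyed by (word, part of speech); invented for illustration.
LEMMAS = {
    ("playing", "VERB"): "play",
    ("eating", "VERB"): "eat",
    ("was", "VERB"): "be",
}

def lemmatize(word, pos):
    """Look up the base form of `word` given its part of speech;
    fall back to the word itself when no entry exists."""
    return LEMMAS.get((word, pos), word)

# With the VERB tag supplied by the POS tagger, each inflected form maps
# to the root word that the ISL video database is keyed on.
print([lemmatize(w, "VERB") for w in ("was", "playing", "eating")])
```

Without the POS information, a lemmatizer given "playing" in isolation may leave it unchanged (as Table 5 illustrates); supplying the tag selects the correct root.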
8. Performance Evaluation
The performance of our system is measured using the net promoter score (NPS) system. The net promoter score measures the willingness of customers to recommend products or services to their friends and family. A survey was conducted with 30 disabled people, asking them to rate how likely they were to recommend our system to a friend or family member who is hearing-impaired.
- Promoters: responses from 9 to 10.
- Passives: responses from 7 to 8.
- Detractors: responses from 0 to 6.
- Total number of people who participated in the survey: 30.
- Total number of promoters: 26 (86.67%).
- Total number of passives: 3 (10%).
- Total number of detractors: 1 (3.33%).
- Net promoter score = percentage of promoters − percentage of detractors = 86.67 − 3.33 = 83.34 ≈ 83.
- A net promoter score above 50 is considered excellent by the creators of the NPS system.
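The NPS arithmetic above can be reproduced directly. The ratings list reconstructs the reported 26/3/1 split; the individual scores within each band are assumed for the example.

```python
def nps(ratings):
    """Net promoter score: % of promoters (9-10) minus % of detractors (0-6)."""
    n = len(ratings)
    promoters = sum(r >= 9 for r in ratings) / n * 100
    detractors = sum(r <= 6 for r in ratings) / n * 100
    return promoters - detractors

# 26 promoters, 3 passives, 1 detractor, as in the survey of 30 users
# (the exact score within each band is an assumption).
ratings = [9] * 26 + [7] * 3 + [5]
print(round(nps(ratings)))
```

Passives count toward the total but toward neither percentage, which is why only the promoter and detractor shares appear in the formula.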
9. Results and Discussion
Case 1:
Input: New Delhi is the national capital of India
ISL sentence: New Delhi national capital of India
Videos shown: {New Delhi, n,a,t,i,o,n,a,l, c,a,p,i,t,a,l, of, India}
In Case 1 (as shown in Figure 7 and Figure 8), the user enters the sentence “New Delhi is the national capital of India”. The sentence/keywords after applying NLP and ISL grammar rules are “New Delhi national capital of India”: “is” and “the” are removed, as they are stop words. The following videos are shown to the user.
“New Delhi” is shown in one video, as the video for the same is present in the database. This is an example of how the system identifies multiple words/phrases in the sentence for which videos are present in the database.
The keywords “national” and “capital” are broken down into letters, and a sign language video for each letter is shown to the user, as videos for neither “national” nor “capital” are present in the database.
The videos for keywords “of” and “India” are shown to the user, as the sign language videos for both are present in the database and there is no need to further break them into letters.
Case 2
Input: Change in Temperature
Videos shown: {Change in Temperature}
In Case 2 (as shown in Figure 9), the user enters “Change in Temperature” as the input. The video for the entire input is present in the database, and therefore only the video which depicts “Change in Temperature” in sign language is shown to the user.
Case 3
Input: Teacher
Videos shown: {t,e,a,c,h,e,r}
In Case 3 (as shown in Figure 10 and Figure 11), the user enters “Teacher” as the input. There is no video present in the database for “Teacher”; therefore, the system breaks the input into letters and shows the videos for each individual letter.
Case 4
Input: exchange rate
Videos shown: {exchange rate}
In Case 4 (as shown in Figure 12), the user enters “exchange rate” as the input. The video for the entire input is present in the database, and therefore only the video which depicts “exchange rate” in sign language is shown to the user.
Case 5
Input: Kangaroo is an animal
Videos shown: {Kangaroo, a,n,i,m,a,l}
In Case 5 (as shown in Figure 13 and Figure 14), the user enters “Kangaroo is an animal”. The sentence/keywords after applying NLP and ISL grammar rules are “Kangaroo animal”: “is” and “an” are removed, as they are stop words. The following videos are shown to the user.
“Kangaroo” is shown in one video, as the sign language video for the entire word is present in the database.
“animal” is broken into letters and videos for the individual letters are shown, as the sign language video for “animal” is not present in the database.
Case 6
Input: Letter of authority
Videos shown: {Letter of authority}
In Case 6 (as shown in Figure 15), the user enters “Letter of authority” as the input. The video for the entire input is present in the database, and therefore only the video which depicts “Letter of authority” in sign language is shown to the user.
Case 7
Input: 2
Videos shown: {2}
In Case 7 (as shown in Figure 16), the user enters the number “2” as the input. Since the numbers between 0 and 9 are present in the database, only one video, which depicts “2” in sign language, is shown to the user.
Case 8
Input: 30
Videos shown: {3, 0}
In Case 8 (as shown in Figure 17 and Figure 18), the user enters the number “30” as the input. There is no video present in the database for the number “30”; thus, the system breaks the input into two components, i.e., “3” and “0”, and shows separate videos for both of them.
Case 9
Input: How are you?
Videos shown: {How, you}
In Case 9 (as shown in Figure 19 and Figure 20), the user enters “How are you”. The sentence/keywords after applying NLP and ISL grammar rules are “How you”: “are” is removed, as it is a stop word. Since there is no video present in the database for the remaining phrase, the sentence is broken into words, i.e., “how” and “you”.
Case 10
Input: Good Evening
Videos shown: {Good Evening}
In Case 10 (as shown in Figure 21), the user enters “Good Evening” as the input. The video for the entire input is present in the database, and therefore only the video which depicts “Good Evening” in sign language is shown to the user.
10. Conclusions
Through this paper, we have presented a user-friendly audio/text to Indian Sign Language translation system specially developed for the hearing- and speaking-impaired community of India. The main aim of the system is to bring a feeling of inclusion to the hearing-impaired community in society. The system not only helps people with a hearing disability but is also beneficial for hearing people who want to understand the sign language of a hearing-impaired person so that they can communicate with them in their language. The core of the system is based on natural language processing and Indian Sign Language grammar rules. The integration of this system in areas such as hospitals, buses, railway stations, post offices, and even video conferencing applications could soon prove a boon for the hearing-impaired community in India.
In the future, the features of the system could be enhanced by integrating the reverse functionality, i.e., an Indian Sign Language to audio/text translation system, which would open the path to two-way communication. In addition, the database of the system could be expanded to enhance its coverage and scope.