1. Introduction
Deaf communities globally encounter significant challenges in accessing vital services like education, healthcare, and employment due to language barriers, rather than auditory limitations [1]. Their primary language is often a signed language, such as American, French, German, or Greek Sign Language, each a unique and complete language, distinct from spoken languages and from each other. These languages, with over two hundred identified varieties, possess the same depth and expressive power as spoken languages [2,3]. However, for the Deaf, any spoken language is a second language, which contributes to low literacy rates; for instance, in the U.S., deaf high school graduates read, on average, at a third- to fourth-grade level [4]. This language gap not only hinders everyday interactions with the hearing, non-signing population but also limits access to critical services. While certified sign language interpreters are the best solution for essential services, their scarcity and cost render them impractical for everyday, brief interactions. Thus, the development of effective automatic translation systems between spoken and signed languages could significantly improve communication and inclusivity for the Deaf community.
To overcome this barrier, technological solutions have been developed. Translator gloves [5,6,7], mobile applications, and automatic translators are the leading technologies that have been used for unidirectional or bidirectional communication.
Our research aims to develop a bi-directional sign language translator that can translate from Spanish to Mexican Sign Language (MSL) and vice versa, bridging the gap between these two languages. The system involves two operation modes: from Mexican Sign Language to Spanish (MSL-SPA) and from Spanish to Mexican Sign Language (SPA-MSL). In the MSL-SPA mode, the system captures live video and processes it to recognize the sign and translate it into Spanish, as shown in the upper path in Figure 1. Conversely, in the SPA-MSL mode, the user types a phrase and the system displays a sign language animation, as shown in the lower path in Figure 1.
Our system is based on deep learning techniques, which have shown great success in various computer vision and natural language processing tasks. Specifically, we used MediaPipe for keypoint detection, an advanced, real-time framework that uses machine learning to detect and track keypoints on objects, faces, hands, or poses in images and videos. We also used recurrent networks such as RNN, BRNN, LSTM, and GRU, as well as an encoder-only Transformer, treating the translation process as a time-series classification task.
One of the main challenges in developing a bi-directional sign language translator is the variability and complexity of sign language gestures, as well as the need to capture the nuances and context of the conversation. Another challenge is the lack of large and diverse sign language datasets, which are crucial for training accurate models. To address these challenges, we collected a new dataset consisting of gestures from MSL, which we used to train and evaluate our system.
The proposed bi-directional sign language translator has the potential to significantly improve the communication and integration of the deaf community into society by allowing them to communicate more effectively with hearing people. Moreover, it can facilitate the learning of sign language for hearing people and promote a more inclusive and diverse society.
To provide an overview of the paper, we have organized it in the following manner: Section 2 summarizes the relevant literature, Section 3 outlines the methodology we employed in our project, Section 4 showcases the results we obtained, and Section 5 presents our concluding thoughts.
2. Related Work
The landscape of sign language translation and recognition research is rich and varied, marked by a series of interconnected advancements that build upon each other. This section weaves through these developments, highlighting how each contribution sets the stage for the next.
Starting with Bungeroth & Ney [8], we see the foundations being laid with a German Sign Language (DGS) translation system. This innovative approach, integrating audio feedback and animated representation, utilizes IBM Models 1-4 and Hidden Markov Models (HMMs) for training. The challenge they faced due to limited training samples echoes the necessity for a robust corpus, as further exemplified by the notation method of [9].
Building on the concept of practical translation, San-Segundo et al. [10] introduced a real-time method for Spanish-to-sign-language translation. Their dual approach, blending rule-based and statistical methods, demonstrated adaptability and precision, particularly in contexts with limited vocabulary.
Pichardo-Lagunas et al. [11] continued this trajectory, focusing on Mexican Sign Language (MSL). They brought a meticulous, analytical lens to Spanish text, using Freeling to classify words for accurate translation. This method, though currently limited to one-way translation, reflects the evolving complexity of sign language translation systems.
Turning to pose detection and classification, Qiao et al. [12] utilized the OpenPose model, demonstrating a significant leap in motion analysis without dependency on specialized hardware. This development represents a shift towards more accessible and cost-effective solutions in the field.
Barrera-Melchor et al. [13] then added a new dimension by applying these technologies to the translation of educational content into MSL. Their cloud-based system, which translates speech to text and then to MSL using a 3D avatar, exemplifies the integration of cloud computing in sign language translation.
In a similar vein, focusing on a specific application area, Sosa-Jimenez et al. [14] developed a research prototype tailored for primary care health services in Mexican Sign Language. Their use of Microsoft Kinect sensors [15] and HMMs highlights the trend of specialized systems addressing distinct contexts like healthcare.
Parallel to these developments, Martínez-Gutiérrez et al. [16] and Martinez-Seis et al. [17] focused on MSL alphabet recognition through advanced computational methods, each achieving notable accuracy in their respective areas.
Carmona et al. [18] introduced a system for recognizing the static alphabet of Mexican Sign Language using Leap Motion and MS Kinect 1 sensors. Their application of 3D affine moment invariants for sign recognition demonstrated a significant improvement in accuracy, showcasing the potential of 3D modeling in sign language recognition.
Naranjo et al. [19] expanded the field by developing a graphical tool to aid in learning Costa Rican Sign Language (LESCO). Their methodology, utilizing phonological parameters and a similarity formula, provides a bridge for learners to grasp the nuances of sign languages, emphasizing the role of educational tools in sign language dissemination.
Complementing these efforts, Trujillo et al. [20] presented a translation system from Mexican Sign Language to spoken language, employing 3D hand movement trajectories. Their approach to refining movement patterns and using algorithms like KNN highlights the continuous push for higher precision and efficiency in translation systems.
In a similar spirit of refinement, Jimenez et al. [21] and Cervantes et al. [22] each contributed distinct methodologies for sign language recognition, whether through 3D affine invariants or sophisticated video analysis. These studies underscore the diverse technological avenues being explored to enhance sign language translation and recognition accuracy.
With the advent of the Transformer as a powerful deep learning model for translation, it has been used to improve the accuracy of sign language translation by effectively extracting joint visual-text features and capturing contextual information [23]. One approach is to design an efficient transformer-based deep network architecture that exploits multi-level spatial and temporal contextual information, such as the proposed heterogeneous attention-based transformer (HAT) model [24]. Another approach is to address local temporal relations as well as non-local and global context modeling in sign videos, using techniques like a multi-stride position encoding scheme and an adaptive temporal interaction module [25]. Additionally, transfer learning with pretrained language models, such as BERT, can be used to initialize sign language translation models and improve performance [26]. Furthermore, incorporating content-aware and position-aware convolution layers, as well as injecting relative position information into the attention mechanism, can enhance sign language understanding and improve translation quality [27].
Using avatars to translate sign language presents both challenges and benefits. One of the main challenges is the complexity of sign languages, which requires a deep understanding of their grammatical and inflectional mechanisms [28]. Additionally, the lack of direct participation from the deaf community and the underestimation of sign language complexity have resulted in structural issues with signing avatar technologies [29]. However, the benefits of using avatars for sign language translation include increased accessibility for the deaf community and the potential for automation and efficiency in translating spoken or written language to sign language [30,31,32]. Avatars can also be used as educational tools and have the potential to improve the naturalness and believability of sign language motion, whether from text to animation [33], from animation to text [23,34], or in both directions, as proposed in this work.
In summary, this collective body of work forms a tapestry of innovation, each research piece contributing to a greater understanding and capability in the field of sign language translation and recognition. Our research aims to add to this rich tapestry by developing a bi-directional translator between Spanish and Mexican Sign Language (MSL). Leveraging techniques such as MediaPipe keypoint detection and deep learning models, our goal is to bridge the communication gap for the deaf community. Section 3 details our approach, situating it within this dynamic and evolving research landscape.
3. Methods
This section describes the methodology pursued to develop the bidirectional translation system. The development stages were the following:
Hardware selection
Feature selection
Data collection
Model definition
Graphical user interface
3.1. Hardware Selection
To select the computing board, we compared the Raspberry Pi 4 Model B [35], the Up Squared [36], and the Nvidia Jetson Nano [37], running a benchmark to evaluate the inference speed of each.
For this, we ran MediaPipe’s Holistic model on each board. This model detects keypoints on the body, hands, and face, making it computationally expensive. The Raspberry Pi 4 ran at four frames per second, the Up Squared at six frames per second, and the Jetson Nano, the most powerful of the three thanks to its GPU, at 13 frames per second.
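The sketch below illustrates the kind of benchmark used to obtain these figures; the frame count, camera index, and model complexity are illustrative assumptions rather than the exact settings of our tests.

```python
# Minimal sketch of an FPS benchmark for MediaPipe Holistic on a single board.
# Frame count, camera index, and model_complexity are assumed values.
import time
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def benchmark_holistic(num_frames: int = 200, camera_index: int = 0) -> float:
    """Return the average frames per second over num_frames inferences."""
    cap = cv2.VideoCapture(camera_index)
    with mp_holistic.Holistic(model_complexity=1) as holistic:
        start = time.time()
        processed = 0
        while processed < num_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB images; OpenCV captures BGR.
            holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            processed += 1
        elapsed = time.time() - start
    cap.release()
    return processed / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    print(f"Average FPS: {benchmark_holistic():.1f}")
```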
3.2. Feature Selection
The inference and translation of the model depend on the input of keypoints they receive, so it is necessary to define those features, or in this case, keypoints, that are statistically significant and contribute to the model’s inference process, always seeking the balance between the number of features to process and computational cost.
To optimize the Jetson Nano’s resources for keypoint coordinate detection, we reduced the number of features. This reduction freed up computational capacity for other tasks in our translation system. We conducted performance tests using MediaPipe’s pose detection, hand detection, and holistic pipelines. The holistic pipeline was the most resource-intensive, leading us to combine pose and hand detection pipelines for greater efficiency. This combination created a lighter version than the holistic model by eliminating the dense facial keypoint mesh computation.
Figure 2 shows the full face mesh, containing 468 keypoints, and the eleven keypoints we ended up using, shown in blue. We chose this approach because the facial mesh keypoints added little value to our model’s inference, particularly since the signs we needed to identify mainly involve arm movements and finger positions.
For the body, we reduced the body keypoints to five: four from the original BlazePose model and one midpoint between the shoulder keypoints for chest detection. This selection was due to the movements in the signs being above the waist, making leg keypoints irrelevant for our model. For the hands, we kept all 21 keypoints because hand and finger positions are crucial for distinguishing between signs.
Figure 3 displays the final topology of our translation system, comprising 58 keypoints. We calculated the X, Y, and Z coordinates for each, resulting in 174 features for processing.
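As an illustration of this topology, the sketch below shows how the 174-value feature vector could be assembled from MediaPipe outputs. The specific pose landmark indices chosen here for the 11 facial and 4 body keypoints are assumptions for illustration; the exact selection is the one defined by Figure 3.

```python
# Sketch of assembling the 58-keypoint / 174-feature vector.
# FACE_IDX and BODY_IDX are assumed landmark indices, shown for illustration only.
import numpy as np

FACE_IDX = list(range(0, 11))   # 11 facial landmarks taken from the pose model (assumed)
BODY_IDX = [11, 12, 13, 14]     # shoulders and elbows (assumed)

def frame_features(pose_landmarks, left_hand, right_hand):
    """Flatten the selected keypoints into a 174-dimensional feature vector."""
    def xyz(lm):
        return [lm.x, lm.y, lm.z]

    feats = []
    # 11 face + 4 body keypoints from the pose model
    for i in FACE_IDX + BODY_IDX:
        feats.extend(xyz(pose_landmarks.landmark[i]))
    # 1 chest keypoint: midpoint between the shoulders
    ls, rs = pose_landmarks.landmark[11], pose_landmarks.landmark[12]
    feats.extend([(ls.x + rs.x) / 2, (ls.y + rs.y) / 2, (ls.z + rs.z) / 2])
    # 21 keypoints per hand (missing detections would need to be handled, e.g. zero-filled)
    for hand in (left_hand, right_hand):
        for lm in hand.landmark:
            feats.extend(xyz(lm))
    return np.array(feats, dtype=np.float32)  # shape: (58 * 3,) = (174,)
```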
By reducing the keypoints, we optimized the model’s input layer, thereby decreasing its computational demands. This optimization made both the training and inference processes more efficient and reduced the data volume needed for training, validation, and testing splits of the model.
3.3. Data Collection
To make the system manageable, we chose a subset of ten signs, specifically phrases applicable in a school setting. The selected phrases are: “Hello”, “Are there any questions?”, “Help me”, “Good morning”, “Good afternoon”, “Good night”, “What is the homework?”, “Is this correct?”, “The class is over”, and “Can you repeat it?”. Approximately 1000 samples of each sign were collected from six individuals, comprising an equal gender split of three women and three men, with ages ranging from 22 to 55.
For sample collection of these phrases, we developed a Python script that uses MediaPipe’s keypoint detector to gather samples of each sign. Each sample was recorded from a distance of 2 m and contains 15 frames with detections. We found that, in practice, all signs fit within this time window.
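A simplified sketch of such a recording loop is given below; the file layout and the extract_keypoints helper are hypothetical stand-ins for the script’s actual implementation, which builds the 174-value vector described in Section 3.2.

```python
# Sketch of the per-sample recording loop (hypothetical file layout).
import os
import numpy as np
import cv2

FRAMES_PER_SAMPLE = 15

def extract_keypoints(frame):
    """Hypothetical stand-in for the MediaPipe-based extractor of Section 3.2.
    Returns a (174,) vector, or None when no detection is available."""
    raise NotImplementedError  # replace with the real MediaPipe pipeline

def record_sample(sign_label: str, sample_id: int, out_dir: str = "dataset"):
    cap = cv2.VideoCapture(0)
    frames = []
    while len(frames) < FRAMES_PER_SAMPLE:
        ok, frame = cap.read()
        if not ok:
            continue
        keypoints = extract_keypoints(frame)
        if keypoints is not None:          # only keep frames with detections
            frames.append(keypoints)
    cap.release()
    sample = np.stack(frames)              # shape: (15, 174)
    path = os.path.join(out_dir, sign_label)
    os.makedirs(path, exist_ok=True)
    np.save(os.path.join(path, f"{sample_id}.npy"), sample)
```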
For each keypoint, we calculated the X, Y, and Z coordinates. To compute the Z coordinate, we used the depth provided by the OAK-D camera [38]. The depth camera is composed of a stereo pair of OMNIVISION OV9282 1 MP grayscale image sensors [39]. The depth accuracy varies with the distance to the measured object, being higher at closer ranges; from 0.7 m to 4 m, the camera maintains an absolute depth error below 1.5 cm [40], which is sufficient for our application. We used perspective projection to determine the position of each keypoint of interest relative to the camera, as shown in Equation (1).
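In a standard pinhole-camera formulation of this projection (an assumed form, shown here for clarity), back-projecting a pixel (u, v) with measured stereo depth Z and camera intrinsics f_x, f_y, c_x, c_y gives the keypoint coordinates relative to the camera:

```latex
% Assumed standard pinhole back-projection, shown for clarity;
% the original Equation (1) may use different notation.
\begin{equation}
  X = \frac{(u - c_x)\,Z}{f_x}, \qquad
  Y = \frac{(v - c_y)\,Z}{f_y}, \qquad
  Z = d(u, v)
\end{equation}
```

where d(u, v) is the depth reported by the OAK-D camera at that pixel.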
To increase the variability of the samples, signs were collected from six different individuals, aiming to reduce sample bias. For each of the ten signs, we gathered approximately 900 samples on average, resulting in a total of around 9300 samples.
3.4. Model Definition
Given the specific challenges of our project, we chose to implement a Recurrent Neural Network (RNN) model within our translation system. RNNs are particularly effective for tasks like natural language processing, video analysis, and machine translation, mainly because of their ability to maintain a form of memory. This memory helps in understanding sequences, as it can track changes over time.
To find the most suitable model for classifying signs in Mexican Sign Language (MSL), we evaluated several recurrent architectures, as well as an encoder-only Transformer. The models we considered included the following:
RNN
BRNN
LSTM
Bidirectional LSTM
GRU
Encoder-only Transformer
Each of these models was trained and evaluated for its effectiveness in classifying MSL signs. The architectures designed for the RNN and BRNN used in our tests are depicted in Figure 4.
The LSTM and Bidirectional LSTM architectures are shown in Figure 5, and the GRU architecture is shown in Figure 6.
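As a concrete illustration, one of these candidates can be expressed as a compact Keras model operating on the 15-frame, 174-feature samples; the layer sizes below are illustrative assumptions rather than the exact architectures depicted in Figures 4-6.

```python
# Illustrative Keras definition of one candidate model (an LSTM classifier
# over 15-frame sequences of 174 features, 10 sign classes).
# Layer sizes are assumed; the exact architectures are those in Figures 4-6.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, NUM_FEATURES, NUM_CLASSES = 15, 174, 10

def build_lstm_classifier() -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The bidirectional variants wrap the recurrent layers in layers.Bidirectional, and the GRU variant replaces layers.LSTM with layers.GRU.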
3.5. Graphical User Interface Design
To improve user interaction with our translation system, we developed a graphical user interface (GUI). This GUI is aimed at enhancing the usability and accessibility of the system. A User Interface (UI) is essentially the point of interaction between the user and the system, enabling the user to input commands and data and to access the system’s content. UIs are integral to a wide variety of systems, including computers, mobile devices, and games.
Beyond the UI, we also focused on User Experience (UX). UX is about the overall experience of the user, encompassing their emotions, thoughts, reactions, and behavior during both direct and indirect engagement with the system, product, or service. This aspect of design is critical because it shapes how users perceive and interact with the system.
The outcomes of our efforts to develop a compelling UI and UX for the translation system are detailed in Section 4.2.
5. Conclusions
Our bidirectional Mexican Sign Language (MSL) translation system aims to bridge the communication gap between the deaf community and the hearing world. Utilizing machine learning, including recurrent neural networks, transformers, and keypoint detection, our system shows promise in enabling seamless communication and in integrating individuals with hearing disabilities into society and education more effectively. This innovation emphasizes the importance of inclusive technology and the role of artificial intelligence in surmounting language barriers. The project’s key successes lie in its bidirectional translation system, characterized by efficiency and accuracy. The use of MediaPipe for keypoint detection, along with RNN, BRNN, LSTM, GRU, and Transformer architectures, facilitates accurate translation of signs into text and vice versa. The system’s real-time functionality, adaptability to signing variations, and user-friendly interface make it a practical tool for everyday use.
However, the study faced challenges related to the variability and complexity of sign language gestures, and the scarcity of diverse sign language datasets, impacting the training accuracy of the models. The system’s performance under different real-world conditions like varied lighting and backgrounds also presents ongoing challenges. These issues highlight the necessity for further research, particularly in dataset development and enhancing the system’s adaptability. Future directions include expanding the system to more languages and sign language variants, refining algorithms for complex signs and non-manual signals, and collaborating with the deaf community for feedback and improvements. This research opens pathways for more inclusive communication technologies, aiming to significantly reduce communication barriers for the deaf and hard-of-hearing, leading to a more inclusive society.