1. Introduction
The printed circuit board (PCB) manufacturing industry plays a vital role in the production of modern electronic devices, from consumer equipment to military and aerospace technology. In this context, the evolution towards more decentralized and flexible manufacturing systems is critical to address challenges such as design complexity, quality standards, and efficiency in production processes [
1]. Today, traditional strategies are being complemented by smart technologies such as the Industrial Internet of Things (IIoT), cyber–physical systems (CPSs), and artificial intelligence (AI) tools. These enable real-time monitoring, control, and manipulation in increasingly decentralized and dynamic production lines, providing a competitive advantage in the market.
As the complexity and miniaturization of PCB designs continue to grow, collaborative robots (cobots) are emerging as a crucial solution to address accuracy, efficiency, and adaptability challenges in assembly processes. Furthermore, the use of cobots enables mass customization and the transition to smart factories, thereby optimizing workflows. Technological advancements driven by the Fourth Industrial Revolution (4IR) have introduced innovative solutions for integrating industrial ecosystem components with IIoT services. These developments emphasize interoperability between elements like robots, sensors, and control systems to optimize processes in advanced manufacturing cells.
The optimization of manufacturing cell interoperability represents a current research topic [
2]. This is due, among other factors, to the numerous hardware and software architectures and communication protocols. The objective of such optimization is to enhance the efficiency of the processes involved in the transformation of raw materials, thus reducing waste and energy consumption.
As robots become more sophisticated, their autonomy could be extended to learn or retain human-observed behaviors as skills, develop these skills through practice, and then use them in novel task environments (autonomous behaviors) [
3,
4]. As a result, robots and their interactions with humans will become more personalized, interactive, and engaging than ever before, providing assistance in many areas of life, such as manufacturing [
5]. It is expected that robots will be able to perform tasks autonomously in a variety of environments and communicate safely with humans. This can be achieved through various approaches, such as digital twins or the development of sensor-based interaction and communication systems, known as human–robot interactions (HRIs) [
6]. HRI approaches have become a fundamental element of advanced automation, enabling effective human–robot collaboration in industrial settings. HRI ecosystems have integrated several key technologies, including AI, speech recognition, gestures, and visual perception, thereby enhancing communication and efficiency in manufacturing cells.
The advent of novel sensors and edge computing has accelerated the development of HRI systems capable of interpreting human commands through an array of channels or modes (including speech and gesture recognition), leveraging the power of deep learning. This improves operational efficiency and safety in industrial processes. In this context, where precision-requiring tasks are performed, automatic speech recognition (ASR) becomes a crucial complement, serving to overcome the dependence on other types of sensors [
7].
The prevailing trend is toward the creation of more intelligent industrial environments, which allow greater customization and enhanced operational efficiency. However, this trend also presents novel challenges, including those related to security, management, and hierarchical control over PCB manufacturing cells. Because various manipulation alternatives are granted, there is a latent risk regarding the permissions and privileges that users hold over the cell components [1]: what can users do, and what should they not be allowed to do? HRI management systems must be developed in a way that fosters an efficient collaborative dynamic while simultaneously ensuring security and productivity.
In modern manufacturing cells, CPS control approaches facilitate uninterrupted interaction between machines and humans. In addition to performing automation tasks, robots are also capable of adapting to changing environmental conditions [
8]. The security and privacy risks associated with IoT devices, including denial of service (DoS) attacks and vulnerabilities in communication protocols, remain topics of active study. Various mitigation strategies, such as the use of robust authentication and encryption protocols, are proposed in [
9].
This paper describes the development of an HRI platform that provides the ability to simultaneously control multiple components of a PCB manufacturing cell (under development at the National Laboratory for Research in Digital Technologies). The HRI platform consists of various devices for controlling components using gesture and speech recognition, and it allows new control devices to be integrated into the system so that multiple users can control multiple components simultaneously. The proposed platform is designed to be scalable, facilitating the registration of new interactions (such as gestures or speech commands) and input devices and their subsequent linkage to the components of a manufacturing cell. Furthermore, it offers the capability to control all the components of the cell simultaneously. The manuscript is organized as follows:
Section 2 presents a review of the current literature.
Section 3 outlines the materials and methods used, describes the technical specifications of all devices and software employed, and delineates the methodology implemented. Thereafter, the results obtained in gesture control, speech control, and decentralized multimodal control are presented and discussed. Finally, the conclusions are presented.
2. State of the Art
As stated previously, one of the most prevalent techniques for gesture recognition employs the use of vision-based devices, such as the Leap Motion Controller (LMC) [
10]. This sensor, which is capable of highly accurate tracking of hand movements, has been the subject of extensive evaluation and implementation in recent years. The LMC device is able to track hands even beyond its nominal maximum range of 60 cm (field of view), in some cases up to 100 cm [
11]. The control of a Delta robot using the LMC for gesture recognition through robot kinematics is demonstrated in [12]. A comparable approach was employed to develop a neural network (NN)-based control system for the DLR-HIT II robotic hand utilizing the LMC [
13]. This resulted in enhanced stability and accuracy in teleoperation through an inverse kinematics methodology.
A combination of Generative Adversarial Network (GAN) and Convolutional Neural Network (CNN) techniques for hand gesture classification is proposed in [
14]. By generating synthetic data, they achieved significant improvement in classification compared to other deep learning models. In contrast, a system composed of a mmWave radar and a TRANS-CNN model for gesture recognition, achieving an accuracy of 98.5%, is presented in [
15]. This analysis demonstrates that the choice of AI model has a significant impact on the accuracy and robustness of a gesture recognition system with the LMC.
Another device that has recently been the subject of study with regard to its potential for use in gesture-based HRI applications is the Tap Strap 2 (TS2). The device comprises a series of inertial measurement units (IMUs), accelerometers, and haptic feedback mechanisms [
16]. The usability of this device is investigated in [
17], which proposes an RNN-based machine learning model to reduce the error rate in virtual keyboard applications on various surfaces. Furthermore, ref. [
17] investigated methods for enhancing the precision of device input readings through the implementation of three long short-term memory (LSTM) recurrent neural network (RNN) models. The models that were constructed were a standard LSTM model, a model with both CNN and LSTM layers, and a convolutional LSTM (ConvLSTM). The LSTM was found to have the highest accuracy, with an average of 97.470%. The two previous studies demonstrate the effectiveness of RNNs, specifically the LSTM variant, in processing accelerometer data. Similarly, a trajectory clustering machine learning model that recognizes gestures resembling the rotation of a dial, using trajectory data from the TS2 optical mouse sensor, is proposed in [
18].
Other devices have been employed in the context of HRI systems, such as Microsoft Kinect. The Kinect device has been employed to control the joints of a robotic head via hand gestures, with a success rate exceeding 90% [
19]. Alternatively, electromyographic sensors have been employed in conjunction with algorithms such as K-Nearest Neighbors (K-NN) and Dynamic Time Warping (DTW) [
20].
Recent advancements in AI, particularly in automatic speech recognition (ASR) and Large Language Models (LLMs), are increasingly being adopted in the manufacturing sector. ASR technologies facilitate hands-free speech control and real-time data capture, improving operational efficiency and worker safety in industrial environments. Meanwhile, LLMs enable advanced processing of unstructured textual data, supporting applications such as predictive maintenance through log analysis. The integration of these technologies is accelerating the transition toward smarter, more adaptive HRI frameworks aligned with Industry 4.0 principles [
21,
22].
These approaches are also capable of inferring latent or absent components of given commands based on semantic associations learned from noisy text corpora [
23,
24]. In other words, these systems acquire new knowledge in order to complete incomplete instructions. Accordingly, interfaces for speech command recognition can be constructed, and they can be converted to code for controlling robot movements [
25].
Some case studies of pick-and-place operations using vision–language instructions are proposed in [26,27]. The results demonstrate a high degree of accuracy, greater than 99%, allowing the robots to execute tasks reliably. By incorporating visual information such as lip reading and gesture recognition, audiovisual speech recognition (AVSR) systems can improve the accuracy of human speech interpretation in noisy environments [
28,
29].
In more intricate control scenarios, which occasionally demand the use of multiple devices, Qi et al. [
30] developed a gesture recognition system based on multiple LMCs, utilizing RNN-LSTM for the teleoperation of surgical robots. The technique demonstrated a high level of recognition accuracy and inference speed. Similarly, Liu et al. [
31] combined radar and vision data, enhancing the system’s adaptability and robustness through a deformable fusion network with LSTM-RNN. This device fusion approach is particularly advantageous in scenarios where environmental conditions are variable, effectively addressing the constraints associated with systems that rely exclusively on a single device. In the construction industry, a multimodal control system that employs TS2 for gesture recognition and incorporates a gaze awareness layer with
Tobii Pro Glasses 3 is reported [
16]. This system demonstrated an accuracy rate of 93.8% in tasks involving construction machinery.
The implementation of access control and privilege strategies for control and manipulation tasks in CPS-based manufacturing approaches presents a significant challenge. Inadequate access control can result in malicious activities, including the alteration of system behavior, the theft of sensitive data, or the infliction of physical damage. Adequate access control, by contrast, preserves the integrity of the cyber (software and data) and physical (hardware and machinery) components of the system. CPS regulatory standards vary from one industry to another, but several frameworks and guidelines have been established to ensure security and interoperability [
32,
33]. The system approach proposed by Mthetwa [
34] is based on blockchain technology for credential management and access control in HRI systems.
The use of Lightweight Directory Access Protocol (LDAP)-based systems has proven to be an effective solution for managing user identities and controlling access to devices connected to CPSs [
35]. In a particular use case, the Mycros system provides LDAP-based enterprise identity management, not only centralizing user administration but also improving security by enforcing an access hierarchy [36]. There are also container-based alternatives for encapsulating services and applications (e.g., Docker and Kubernetes), which allow decentralized management of the devices connected to the HRI. They optimize both resource allocation and data transmission latency between the devices and the control system [
37].
Although container-based HRIs are flexible, they can introduce security breaches if not managed properly. A solution combining network isolation mechanisms with network hardening, using a container firewall and a communication bridge based on Open vSwitch, is proposed in [38]. On the other hand, centralizing control through systems entirely based on the encapsulation of services and applications within a virtual system provides flexibility and security in resource management. These platforms optimize resource allocation and data transmission latency between devices and the control system, enabling decentralized management of connected devices [
37].
In the context of the increasing prevalence of CPSs and accelerated HRI development, it is of critical importance to ensure the security and privacy of interactions between physical and digital components. An additional layer of security is provided when the CPS incorporates biometric recognition systems [
39].
Biometrics has emerged as a crucial tool for authentication, utilizing distinctive features such as fingerprints and facial and iris patterns [
40,
41]. These technologies facilitate the implementation of robust control systems to prevent unauthorized access and reinforce security through the authentication of unique and non-transferable characteristics [
42,
43]. Furthermore, developments in artificial intelligence have enhanced the precision and versatility of these biometric systems, leading to novel prospects in automation.
3. Materials and Methods
The proposed methodology delineates the development of an accurate HRI platform, integrating multiple devices within a PCB manufacturing cell. Furthermore, it introduces a comprehensive strategy for managing users, including their permissions, control privileges, and the implementation of biometric security protocols. This user management framework facilitates efficient request handling on the multi-user platform while enabling intelligent manipulation of system components.
Figure 1 illustrates the proposed methodology for developing an HRI platform, which comprises five levels. These levels are as follows: (1) user, (2) data acquisition and processing, (3) decentralized multimodal control server (DMCS), (4) MQTT broker, and (5) cell components.
At the user level, instructions are provided for manipulating the cell components. These instructions can be given in different modes, including speech or hand-based gestures. Furthermore, these instructions can be accessed by multiple devices, even simultaneously.
At the next level, the acquisition of user instructions is conducted via the implementation of input devices, such as standard microphones, LMC devices, and TS2 devices (for technical specifications, refer to
Table 1). The LMC device serves as an interface that enables users to control components through gestures and hand/finger movements in the air, obviating the necessity for physical contact with the device. The TS2 is a device placed on the hand as a set of rings connected to the fingers. Its purpose is to facilitate interaction with electronic devices through gestures. The data obtained from both devices are subjected to processing in order to facilitate the recognition of gestures through the utilization of deep learning (DL) techniques.
The LMC and TS2 devices offer several advantages in the context of HRI platforms, including real-time responsiveness and touchless interaction. These features enhance the usability and functionality, which in turn facilitate a more natural and intuitive HRI. Both devices provide high-resolution tracking of hand movements with sub-millimeter accuracy, which is essential for applications requiring precise inputs, such as robotic manipulation or teleoperation. Furthermore, they have low latency and fast processing, ensuring user-friendly interactions. It is also important to mention that software development kits (SDKs) are available for both devices, allowing for customized interactions.
The data obtained from the input devices are then transferred to processing units, hereafter referred to as data transcoders (DTs). The LMC and TS2 data are processed using an Intel NUC 11, while a Jetson AGX Orin was employed to process the data obtained from the audio devices.
Table 2 presents the technical specifications of the two DTs. The DMCS also requires specific user data for identification purposes, such as biometric data, sensor index, and interaction name. Furthermore, specialized software was implemented in the DTs for device communication, data processing (e.g., hand gesture recognition/interpretation and natural language processing), and the construction of the DMCS.
Table 3 shows the software specifications.
DTs were selected on the basis of their high computational performance and their efficacy in processing large amounts of data in real time, due to their edge AI capabilities and their capacity to optimize latency. It is also noteworthy that these devices consume considerably less energy than traditional architectural solutions, which is an additional advantage.
The next level, referred to as decentralized multimodal control, oversees all interactions between users and components within the manufacturing cell. Manipulation requests from the speech and gesture recognition systems are processed and filtered according to the permissions and privileges granted to each user, and they are scheduled so as to prevent conflicts during task execution. In addition, a container network was constructed using TCP sockets (through port 5000) for communication between the DTs and the containerized server, thereby enabling these devices to establish a connection with the Docker server. The server platform is based on Node.js technology, which serves as the central processing point with a relational database; asynchronous queries are employed to provide efficient access for real-time verification of user permissions. All interactions, requests, and actions are recorded in an event log.
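As a point of reference, the following minimal sketch (an assumption, not the authors' exact implementation) shows how a DT could forward a recognized instruction to the DMCS as a JSON message over the container network on port 5000; the field names and the host name `dmcs` are illustrative.

```python
# Minimal sketch of a data transcoder (DT) forwarding a recognized instruction
# to the DMCS over the container network (TCP, port 5000). Field names
# ("user_biometric_id", "sensor_index", "interaction") and the host "dmcs"
# are illustrative assumptions.
import json
import socket

def send_request(interaction: str, sensor_index: int, user_biometric_id: str,
                 host: str = "dmcs", port: int = 5000) -> str:
    request = {
        "user_biometric_id": user_biometric_id,  # identity resolved by facial recognition
        "sensor_index": sensor_index,            # which input device produced the command
        "interaction": interaction,              # e.g., a gesture or speech intent name
    }
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(json.dumps(request).encode("utf-8"))
        # The DMCS replies with the outcome of the queued instruction.
        return sock.recv(4096).decode("utf-8")
```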
As HRI systems frequently comprise a multitude of components (e.g., vision processing, NLP, and control logic), the Docker platform enables the execution of these as isolated microservices, thereby enhancing maintainability and scalability. Moreover, it eliminates issues that arise from discrepancies in the environment, such as differences in dependency versions. Furthermore, it supports continuous integration and deployment pipelines, which facilitate seamless integration of updates, testing of functionality, and deployment of improvements to HRI systems. For its part, Node.js is particularly adept at managing asynchronous, real-time communication. This is highly advantageous for HRI systems, where robots frequently require the processing of inputs such as sensor data, commands, or feedback in real time. Additionally, its capacity to handle concurrent tasks with a non-blocking I/O model enables the development of scalable and responsive applications, even when managing multiple robotic systems.
At the following level, the MQTT broker enables bidirectional communication via messages between the components of the manufacturing cell and the DMCS. The broker is responsible for distributing the orders received from the server to the corresponding components through a tree of predefined topics. Furthermore, this level has the capacity to manage several topic changes simultaneously, allowing multiple cell components to receive and respond to commands in parallel, which improves system synchronization and efficiency.
At the final level, there are the components of the PCB manufacturing cell. These components are subscribed to specific topics through the MQTT broker, which enables them to receive operational data and perform the requisite movements and commands. Upon receipt of a command, these components, which include collaborative robots, automation actuators, and specific components for PCB manufacturing, execute the corresponding action and transmit a confirmation back to the central server. The response is then processed and stored in the database, thereby updating the status of the task in progress and allowing the system to make decisions in parallel.
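This broker-mediated exchange can be pictured with the hedged sketch below, in which a cell component subscribes to its command topic and publishes a confirmation; the topic names under `cell/...`, the broker address, and the paho-mqtt 1.x callback style are assumptions, not the platform's actual configuration.

```python
# Illustrative sketch (not the authors' exact implementation) of a cell
# component subscribing to its command topic and returning a confirmation.
# Topic names and the broker host are assumptions; paho-mqtt 1.x callback
# style is assumed.
import paho.mqtt.client as mqtt

COMMAND_TOPIC = "cell/zone1/cobot1/command"   # hypothetical topic in the tree
STATUS_TOPIC = "cell/zone1/cobot1/status"

def on_connect(client, userdata, flags, rc):
    client.subscribe(COMMAND_TOPIC)

def on_message(client, userdata, msg):
    command = msg.payload.decode("utf-8")
    # ... execute the corresponding movement or action here ...
    client.publish(STATUS_TOPIC, f"done:{command}")  # confirmation back to the DMCS

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.local", 1883)  # hypothetical broker address
client.loop_forever()
```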
Metrics
Several metrics were calculated to validate the accuracy of our system as well as the fine-tuning processes.
Confusion matrices were generated to evaluate the accuracy of gesture recognition and NLP. These matrices were constructed by organizing the results obtained in the validation stage of the systems in a table. The rows represent the real classes of gesture and speech recognition and the columns the predicted classes. Each cell indicates the number of times a real class was correctly or incorrectly classified into a predicted class. The system’s overall accuracy is obtained by dividing the total number of correct predictions (the sum of the diagonal elements of the matrix) by the total number of examples evaluated.
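As a worked illustration of this computation, the short snippet below derives the overall accuracy from the diagonal of an example confusion matrix; the matrix values are illustrative only.

```python
# Sketch of the overall-accuracy computation described above: the sum of the
# diagonal of the confusion matrix divided by the total number of evaluated
# samples. The example matrix values are illustrative only.
import numpy as np

confusion = np.array([
    [97,  2,  1],   # rows: real classes
    [ 3, 95,  2],   # columns: predicted classes
    [ 1,  4, 95],
])

overall_accuracy = np.trace(confusion) / confusion.sum()
print(f"Overall accuracy: {overall_accuracy:.2%}")  # 95.67% for this example
```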
Two common metrics were used to evaluate the fine-tuning applied to the ASR model. The Word Error Rate (WER) metric was employed for the assessment of the ASR’s performance. The accuracy of the transcriptions generated by the system was evaluated by comparing the output text with the correct reference text [
44]. The Character Error Rate (CER) metric is analogous to the WER metric, but it is calculated at the level of individual characters rather than whole words [
8].
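For reference, the standard formulations of both metrics are as follows, where the unit of analysis is the word for the WER and the character for the CER.

```latex
% S, D, and I are the numbers of substitutions, deletions, and insertions,
% and N is the length of the reference (words for WER, characters for CER).
\begin{equation*}
  \mathrm{WER} = \frac{S_w + D_w + I_w}{N_w},
  \qquad
  \mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}
\end{equation*}
```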
4. Results
This section presents and discusses the results obtained in developing the multimodal human–robot interaction platform: first the results of gesture control, then those of speech control, and finally the proposed decentralized multimodal control server.
The proposed HRI platform is integrated into a PCB production cell (see
Figure 2). To facilitate the conveyance of raw materials to the various stages of PCB assembly, the cell is divided into three zones, each of which is equipped with a cobot. The sequence of the manufacturing process is illustrated by the red line. It is noteworthy that our proposal allows for the integration of multiple devices for gesture recognition. In the case of TS2s, each user who operates the device can send manipulation commands, which are then handled by the DMCS. The subsequent sections will provide further details on this process. In the case of LMCs, a procedure must be followed prior to the transmission of requests to the DMCS. The objective of establishing a network of LMCs is to expand the field of view. As each LMC is capable of detecting both hands, it is feasible to detect as many pairs of hands as there are connected devices. Consequently, the distance of each hand is measured in relation to each LMC, and only the data corresponding to the shortest distance for each hand are processed. This guarantees that only one pair of hands is identified.
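A minimal sketch of this nearest-sensor selection rule is given below; the data structure used to represent a tracked hand is an assumption for illustration.

```python
# Minimal sketch of the hand-selection rule described above: with several LMCs,
# each reported hand is kept only if it comes from the sensor closest to it,
# so that exactly one pair of hands is processed. The data structure is assumed.
from dataclasses import dataclass, field

@dataclass
class TrackedHand:
    side: str            # "left" or "right"
    distance_mm: float   # distance from the reporting LMC to the palm
    lmc_index: int
    features: dict = field(default_factory=dict)  # finger states, positions, etc.

def select_closest_hands(detections: list[TrackedHand]) -> dict[str, TrackedHand]:
    """Keep, per hand side, the detection with the shortest sensor-to-hand distance."""
    closest: dict[str, TrackedHand] = {}
    for hand in detections:
        best = closest.get(hand.side)
        if best is None or hand.distance_mm < best.distance_mm:
            closest[hand.side] = hand
    return closest  # at most one left and one right hand are forwarded
```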
4.1. Hand Gesture Control
Gesture control represents a fundamental component of the proposed HRI platform, as it allows for user interaction through the use of specific hand gestures, which are tracked by the LMC and/or TS2 devices. Initially, a gesture dictionary was established with the aim of standardizing gesture control, thereby ensuring that any user is able to interact with the cell components, regardless of the peripheral device being used.
Figure 3 shows the proposed gesture dictionary. This approach was devised to ensure a straightforward and efficient learning curve for all system users. The definition strategy for this dictionary was based on the assignment of a unique gesture to each component of the manufacturing cell (see
Figure 2), as well as to each action that can be performed [
14,
31]. To represent the order in which the cell components interact according to the production sequence of a PCB, gestures were assigned to each component. With regard to the actions, each one is associated with a gesture that reflects the action it triggers (e.g., move or pick). Furthermore, a gesture has been defined to enable access to and exit from the different zones of the cell. Consequently, in order to control a component, it is first necessary to access the zone in which it is located.
Table 4 shows the gesture–component relationship for each zone.
The gesture control algorithms were implemented on the DT:NUC. The Leap Motion software development kit (SDK, version 4.1) was employed to identify the extended fingers for the LMC. Furthermore, functionality was incorporated to quantify the number of active fingers in each interaction, considering both hands. Additionally, the distance between fingers was measured to detect more complex gestures such as Pick. This information was utilized to recognize the gestures. Listing 1 illustrates the algorithm for hand gesture recognition with the LMC. In contrast, given that TS2 devices transmit large amounts of data from their multiple accelerometers and IMUs, LSTM models were configured for gesture recognition. These architectures are particularly well suited to processing continuous data inputs, as evidenced by their efficacy in the context of such applications [
17].
Listing 1. The execution of gesture recognition (TS2 and LMC), speech recognition, and facial recognition is described in general terms, as well as the sending of instructions to the DMCS via TCP.
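A hedged, rule-based sketch of the recognition logic outlined in Listing 1 is shown below; the pinch threshold and any gesture names not mentioned in the text are illustrative assumptions rather than the implemented values.

```python
# Hedged sketch of the rule-based recognition outlined in Listing 1: gestures
# are inferred from the number of extended fingers reported by the LMC and,
# for "Pick", from the thumb-index distance. The threshold and any gesture
# names beyond those mentioned in the text are illustrative assumptions.
PICK_DISTANCE_MM = 30.0  # hypothetical pinch threshold

def recognize_lmc_gesture(extended_fingers: list[bool],
                          thumb_index_distance_mm: float) -> str:
    n_extended = sum(extended_fingers)
    if thumb_index_distance_mm < PICK_DISTANCE_MM and n_extended <= 2:
        return "Pick"                      # complex gesture based on finger distance
    if n_extended == 0:
        return "Close"
    if n_extended == 5:
        return "Open"
    return {1: "One", 2: "Two", 3: "Three", 4: "Four"}.get(n_extended, "Unknown")
```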
To assess the precision of gesture recognition, a series of experimental trials were conducted with 10 end-users of the HRI platform. Each subject was instructed to perform each gesture with each hand 100 times. A dataset comprising 6000 samples (per hand) was employed. In total, 70% of the samples were allocated for training, while 30% were reserved for testing and validation. The Optuna framework was employed for optimizing the hyperparameters [
45]. The configuration is presented in
Table 5. The sequence length was one time step, with 15 features per sample, which aligns with the nature of the TS2 data. Moreover, the
Adam optimizer was employed to adaptively adjust the learning rate, and the categorical cross-entropy loss function was used for multiclass classification. The models were trained for 50 epochs, during which the weights were adjusted and performance improved.
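A sketch of an LSTM classifier consistent with this configuration is shown below; the layer sizes, dropout rate, and number of gesture classes are placeholders standing in for the Optuna-tuned values in Table 5, not the model actually deployed.

```python
# Sketch of an LSTM classifier consistent with the configuration described in
# the text (input of one time step with 15 features, Adam, categorical
# cross-entropy, 50 epochs). Layer sizes, dropout, and the number of gesture
# classes are placeholders for the Optuna-tuned values in Table 5.
import tensorflow as tf

N_FEATURES = 15   # features per TS2 sample
N_CLASSES = 12    # hypothetical number of gestures in the dictionary

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, N_FEATURES)),           # sequence length of one time step
    tf.keras.layers.LSTM(64),                         # units chosen via hyperparameter search
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```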
Figure 4 illustrates the degree of accuracy achieved for each device (LMC and TS2). Concerning the LMC, accuracies of 95.17% and 95.66% were obtained for the left and right hands, respectively. As illustrated in the confusion matrix in
Figure 4a (left hand), the
Open and
Close gestures exhibit reduced accuracy. These gestures involve the little finger, and it is possible that anatomical differences among users' hands affect their recognition. A similar phenomenon occurs for gestures Three and Four (Figure 4b, right hand). In this case, because the gestures involve hiding either the thumb or the little finger, variations related to users' anatomical differences are negligible. This demonstrates the accuracy of the proposal for gesture recognition tasks.
In the case of TS2, the accuracy was found to be 97.59% and 97.56% for the left and right hands, respectively. The LSTM architecture has been demonstrated to be an accurate and stable approach to gesture recognition.
Figure 4c,d illustrate the confusion matrices for each hand. Nevertheless, the high degree of accuracy is contingent upon the specific characteristics of the training dataset. Furthermore, due to the error propagation inherent to accelerometers and IMUs, it is plausible that new users may experience a decrease in gesture recognition capacity when utilizing TS2. To mitigate the impact of this error, a calibration algorithm was devised for new users, whereby correction factors are calculated for the data acquired with the TS2. The correction factors for each accelerometer are derived from the measurement uncertainty (type A and B) with a coverage factor of 2.57 [
46] and are stored in JSON format. The correction factors are then applied to the accelerometer signals before they are fed to the pre-trained LSTM models for gesture prediction.
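One possible realization of this calibration step (an assumption, not the authors' exact procedure) is sketched below: per-axis bias and expanded uncertainty are estimated from repeated readings in a reference pose and stored in JSON, and the bias is removed from subsequent signals before inference.

```python
# A possible realization (not the authors' exact procedure) of the per-user
# calibration described above: repeated readings in a reference pose yield a
# Type A uncertainty per axis, combined with an assumed Type B contribution
# and expanded with a coverage factor of 2.57; the bias correction is stored
# in JSON and applied before the LSTM inference.
import json
import numpy as np

COVERAGE_FACTOR = 2.57
TYPE_B_UNCERTAINTY = 0.01  # hypothetical datasheet-based contribution (g)

def calibration_factors(rest_readings: np.ndarray, reference: np.ndarray) -> dict:
    """rest_readings: (n_samples, n_axes) captured in a known reference pose."""
    bias = rest_readings.mean(axis=0) - reference             # systematic offset per axis
    u_a = rest_readings.std(axis=0, ddof=1) / np.sqrt(len(rest_readings))
    u_c = np.sqrt(u_a ** 2 + TYPE_B_UNCERTAINTY ** 2)          # combined uncertainty
    return {
        "bias": bias.tolist(),
        "expanded_uncertainty": (COVERAGE_FACTOR * u_c).tolist(),
    }

def apply_calibration(signal: np.ndarray, factors: dict) -> np.ndarray:
    return signal - np.asarray(factors["bias"])                # corrected signal for the LSTM

# Example of persisting the factors for a new user:
# with open("ts2_user_calibration.json", "w") as f:
#     json.dump(calibration_factors(rest_readings, reference), f)
```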
The potential for misuse of gesture control devices by any individual represents a significant security risk. In this regard, a security level was defined to ensure that only authorized users are able to take control of the components of the manufacturing cell. A facial recognition system was developed with the objective of validating the identity of the user who performs the gestures. Furthermore, this system oversees the allocation of permissions and privileges, as detailed in
Section 5. The Local Binary Pattern Histogram (LBPH) algorithm was incorporated into the DT:NUC, thereby conferring upon the system the capacity to withstand fluctuations in illumination and facial appearance [
47]. Consequently, upon the recognition of a gesture, the facial recognition system calculates the pertinent biometric information, which is then transmitted via TCP along with the corresponding instruction to the DMCS.
4.2. Speech Control
Automatic speech recognition (ASR) systems are driven by two intertwined pillars operating in synchrony: an acoustic model and a language model. The DeepSpeech engine can perform speech recognition in real time, thereby ensuring low-latency responses and facilitating efficient operation on edge devices. This is a crucial attribute for fluid interactions between humans and robots. Furthermore, it can run on local hardware without requiring an internet connection, which ensures privacy, security, and consistent performance. Additionally, the DeepSpeech engine is robust to background noise, which is a common occurrence in real-world HRI scenarios. The Common Crawl Large Multilingual Text-to-Voice (CCLMTV) dataset was used [
48].
Figure 5 shows the audio processing flow for speech recognition. The signals are processed through convolutional layers (Conv-BN-ReLU), which are essential for capturing spatial hierarchies in the input signal, reducing internal covariate shift, and introducing non-linearities that enhance feature learning [
49]. These are followed by repeated blocks of time and channel separable convolutions (TCSConv-BN-ReLU), a configuration that reduces parameter complexity and improves the model’s capacity to capture temporal dependencies [
50]. These blocks include batch normalization and ReLU activations, employed to extract temporal features effectively. The implemented model incorporates residual connections and deep convolutions, thereby enhancing efficiency and accuracy. Subsequently, the output is subjected to a Connectionist Temporal Classification (CTC) layer, which aligns variable-length input signals with their textual outputs without requiring pre-segmented data, optimizing the transcription process [
51]. The returned text must be included within the corpus. A corpus is defined as a set of textual data comprising the phrases or commands that a system is expected to identify and assimilate. To address the aforementioned issue, a corpus oriented towards the tasks of robot and device control within a production environment was developed.
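Before detailing the corpus, the block structure described above can be illustrated with the following hedged Keras sketch of a TCSConv-BN-ReLU block with a residual connection; the filter counts, kernel sizes, and repetition factor are placeholders rather than the model's actual hyperparameters.

```python
# Illustrative Keras sketch of one time-and-channel separable convolution block
# (TCSConv-BN-ReLU) with a residual connection, as described in the text.
# Filter counts, kernel size, and repetition count are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def tcs_conv_block(x, filters: int, kernel_size: int, repeat: int = 3):
    residual = layers.Conv1D(filters, 1, padding="same")(x)    # match channels for the skip
    residual = layers.BatchNormalization()(residual)
    for _ in range(repeat):
        x = layers.SeparableConv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.ReLU()(layers.Add()([x, residual]))           # residual connection

# Example: features shaped (time, mel_bands); the CTC layer would follow the
# final blocks to align the variable-length output with the transcription.
inputs = tf.keras.Input(shape=(None, 64))
x = layers.Conv1D(128, 11, padding="same")(inputs)              # initial Conv-BN-ReLU
x = layers.ReLU()(layers.BatchNormalization()(x))
x = tcs_conv_block(x, filters=128, kernel_size=13)
```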
The proposed corpus has been designed with specific instructions for interfacing with the components of the manufacturing cell. It is important to mention that the language defined for the ASR is Spanish. Each call-to-action phrase (utterance) comprises three fundamental components: actions (as illustrated in the bottom line of
Figure 3), the specific devices or components involved, and the zone or destination where the action is to be executed (see
Figure 2).
An intent represents the intention behind an utterance issued by the user. Each intent groups elements into different slots, such as action, component, and destination. This enables the system to ascertain the intention behind the interaction and to execute the appropriate action by capturing all elements of an utterance.
Figure 6 illustrates an instance of an utterance, delineating the constituent slots. As previously stated, the language utilized is Spanish. To allow a more effective understanding, the rounded gray boxes describe the corresponding English translations of the actions, components, and destinations. The wake word “
Cellya” activates the system, “
mueve” (move) is the action slot, “Cobot” is the component slot, and “
zona de limpieza” (clean zone) is the destination slot. The associated intent, designated as
MoveUR3CleanZone, aggregates all of the aforementioned elements to facilitate the desired interaction. The corpus incorporates not only the textual phrases themselves but also the various ways in which users might express them, thus enabling the speech recognition system to adapt to different jargon. To illustrate, the corpus includes alternative forms of expression for the command “
Cellya, mueve Cobot a la zona de limpieza” (
Cellya, move Cobot to the cleaning zone). Such variations permit the system to adapt more effectively to the inherent flexibility of human language.
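A hedged sketch of how such an utterance could be mapped to its intent and slots is shown below; the lookup logic and the single phrase variant are illustrative, since the real corpus contains many more alternatives per intent.

```python
# Hedged sketch of mapping a recognized utterance to an intent and its slots,
# following the example of Figure 6. The lookup logic is illustrative; the
# real corpus holds many more phrase variants per intent.
INTENTS = {
    "MoveUR3CleanZone": {
        "action": "mueve",                  # move
        "component": "cobot",
        "destination": "zona de limpieza",  # cleaning zone
    },
}

def match_intent(utterance: str, wake_word: str = "cellya"):
    text = utterance.lower()
    if wake_word not in text:
        return None                         # system stays idle without the wake word
    for name, slots in INTENTS.items():
        if all(value in text for value in slots.values()):
            return name, slots              # all slots found in the utterance
    return None

print(match_intent("Cellya, mueve Cobot a la zona de limpieza"))
```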
4.3. Fine-Tuning
A process of fine-tuning was performed on the DeepSpeech ASR model in order to adapt it to the requirements of the HRI platform. For this process, a dataset of 850 audio recordings collected from 10 different users, each interacting with the different components of the system, was constructed. The recordings were normalized and processed at 16 kHz in mono format to ensure optimal compatibility with the DeepSpeech model. This dataset was then divided into training, validation, and testing subsets (70/15/15%, respectively). The hyperparameters configured for the fine-tuning are shown in
Table 6.
To evaluate the accuracy of the ASR, the WER and CER metrics were calculated. The WER was employed to evaluate the precision of the model in the speech-to-text task. Fifty audio recordings containing a total of 500 words were used. A total of 111 errors were counted, comprising 40 substitutions, 35 insertions, and 36 deletions, yielding a WER of 22.22%, i.e., 22.22% of the words were transcribed incorrectly. Conversely, the CER quantified discrepancies at the character level. A total of 3000 characters were examined, and 357 errors were identified, comprising 130 substitutions, 110 insertions, and 117 deletions, resulting in a CER of 11.90%.
The ASR model pre-trained with the CCLMTV dataset yields a WER of 44.44% and a CER of 19.05%. In this sense, the fine-tuning process using our own dataset reduces the WER by 50% (from 44.44% to 22.22%) and the CER by roughly 37.5% (from 19.05% to 11.90%). For speech-to-text tasks, the fine-tuned model achieves WER values within the commonly cited acceptable upper limit of 20–25% [
52,
53].
The WER and CER metrics provide insight into the capacity of the ASR model for speech-to-text transcription; however, they do not reflect the system's ability to interpret instructions in a real-world setting. To evaluate the performance of the model, an experiment was designed whereby users interacted with different components of the manufacturing cell, issuing instructions at random from a defined corpus. A total of 10 users provided 100 valid instructions per action, in addition to 100 supplementary instructions that had a similar phonological structure but did not correspond to concrete instructions (labeled as None).
Figure 7a shows the results of the accuracy obtained by the ASR model for open/close instructions for two randomly selected components (LPKF and oven). The figure displays the confusion matrix for the LPKF component, with an accuracy of 93.07%. A trend of false positives is observed for the label class ‘None’.
Figure 7b shows the accuracy of the oven component. For this test, an accuracy of 92.47% was achieved. Finally,
Figure 7c shows the confusion matrix, which evaluates the interaction of the cobots performing different tasks such as “
mover, tomar y colocar” (move, pick, and place), achieving an overall accuracy of 92.85%. In general terms, false positives may be due to phonetic similarity between valid and undefined instructions, as well as to environmental noise or variations in users’ pronunciation. Finally, the model achieved an accuracy of 100% for instructions that should not be recognized, even though they are phonetically similar to valid ones.
5. Decentralized Multimodal Control Server
The objective of the DMCS is to control and supervise the interactions between users and system components. This interaction is based on the hierarchy of permissions and the priority assigned to each user in accordance with their access level.
The priority assigned to users in the DMCS enhances efficiency and safety, since it lets users interact with the manufacturing cell according to their expertise. High-priority users are enabled to perform tasks right away, whereas users with lower priority have limited access granted only to specific functions; given that such users have received little training, they focus on operations with a low risk of incidents. Medium priority is assigned to regular users with moderate training to manage medium-risk interactions; this keeps the workflow organized and avoids access conflicts.
Table 7 illustrates the roles, access level, and priority level associated with each user created within the system. Each user has the ability to control some component of the manufacturing cell; however, with the exception of the
admin and the
super_user, none of them has full control of all the components of the cell. Furthermore,
Table 7 illustrates the commands that are available to users. Each command is associated with a script that initiates the designated action in accordance with the system’s predefined logic. Subsequently, Listing 2 describes the logic underlying the functionality of each command.
Listing 2. DMCS pseudocode describing a task processing system that handles user requests, validates permissions, processes tasks based on access level, and logs results, using a queue for user validation.
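A hedged Python sketch of the logic summarized in Listing 2 is given below: requests are validated against the user's permissions, queued by priority, executed, and logged. The role names, permission table, and priority values are illustrative, not the platform's actual configuration.

```python
# Hedged sketch of the Listing 2 logic: validate a request against the user's
# permissions, queue it by priority, execute it, and log the result.
# Role names, permission table, and priority values are illustrative.
import heapq
import logging

logging.basicConfig(level=logging.INFO)
PERMISSIONS = {"operator_zone1": {"open_lpkf", "move_cobot1"}}   # hypothetical roles
PRIORITY = {"admin": 0, "super_user": 0, "operator_zone1": 1}    # lower value = served first

queue: list[tuple[int, int, dict]] = []
counter = 0  # tie-breaker so equal-priority requests stay in arrival order

def enqueue(request: dict) -> bool:
    global counter
    user, command = request["user"], request["command"]
    if command not in PERMISSIONS.get(user, set()):
        logging.warning("Denied %s for %s", command, user)        # rejected, still logged
        return False
    heapq.heappush(queue, (PRIORITY.get(user, 2), counter, request))
    counter += 1
    return True

def process_next():
    if queue:
        _, _, request = heapq.heappop(queue)
        # ... publish the command to the MQTT broker here ...
        logging.info("Executed %s for %s", request["command"], request["user"])

enqueue({"user": "operator_zone1", "command": "move_cobot1"})
process_next()
```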
Figure 8 displays a sequence diagram (SD:Main) of the complete interaction within the CMC. It delineates the communication and message-sending behaviors exhibited by all elements of the system, along with the interactions between users and system components. The section entitled “Data Acquisition and Processing” outlines how the user interacts with the platform, including the use of gestures and speech commands, and describes the processes employed by the DTs to process the data. The instructions are identified and conveyed to the DMCS via a JSON, which is appended to the request queue. These JSON-like structures contain data that the DMCS requires in order to identify the user, including biometric data, sensor index, and interaction name. The DMCS then performs a search in the DBMS for permission and privilege (see the SD:DMCS sequence diagram in
Figure 9).
The interactions between the DMCS and the DBMS are performed using the TCP protocol (port 3306), and are located within a network of containers within the Docker platform.
The sequence of DMCS interactions is illustrated in
Figure 9. The process begins with the DMCS verifying the permissions and privileges that the user has within the database. In accordance with the assigned priority level, the instruction is then inserted into the process queue. In the event that the instruction is requested by a user with a high priority level, it is executed without delay. The DMCS establishes a connection with the broker via the MQTT protocol and subsequently publishes the instruction. The broker employs Transport Layer Security (TLS), which guarantees the confidentiality and integrity of communications between the client and the broker through the use of encryption. A connection is not feasible in the absence of valid user credentials (see Listing 3).
Listing 3. Pseudocode for task management using MQTT, including status verification, command execution, and result reporting.
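The TLS-secured connection outlined in Listing 3 can be sketched as follows; the broker address, certificate path, credentials, and topic are placeholders, and the paho-mqtt 1.x callback style is assumed.

```python
# Sketch of the TLS-secured broker connection described in the text. The
# broker address, certificate path, credentials, and topic are placeholders.
# Without valid credentials and the CA certificate, the connection is refused.
import paho.mqtt.client as mqtt

client = mqtt.Client()                                     # paho-mqtt 1.x style assumed
client.username_pw_set("dmcs_service", "secret")           # per-user credentials
client.tls_set(ca_certs="/etc/mqtt/ca.crt")                # enables TLS encryption
client.connect("broker.local", 8883)                       # TLS port
client.publish("cell/zone1/cobot1/command", "move:clean_zone", qos=1)
client.disconnect()
```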
Subsequently, the broker requests that the component execute the instruction. Upon completion of the instruction, the broker returns the result (see
Figure 9). The
resultwork is then processed in the section entitled ’Disconnection and Tracker Resume’. At this point, the DMCS disconnects from the broker. Finally, the DMCS notifies the user of the outcome of the instruction, indicating whether it was completed successfully or if there was a failure via audio (see Listing 2).
All interactions and state changes are recorded in the DBMS, thereby integrating an event log. The log serves to facilitate the monitoring and identification of errors, thereby ensuring the traceability of events through the documentation of execution irregularities. The instruction cycle persists in accordance with the process queue.
In general, the DMCS allows critical administrative tasks to be performed in a way that ensures efficiency and security. The scope of these management tasks encompasses a range of activities, including access validation, instruction queue processing, connectivity with the MQTT broker, monitoring of the DMCS, feedback to the user, and system traceability.
Table 8 provides a description of the tasks that are managed and the DMCS subsystems that are involved.
One of the principal attributes of the proposed HRI is to afford users the capacity to interact with the cell through a multiplicity of devices. It is also possible for multiple users to interact with several cell components simultaneously. The DMCS facilitates the integration of additional DTs and devices (LMC, TS2, and audio). This enables flexibility and modularity in scaling the platform.
Figure 10 illustrates a deployment diagram for this capability. For example, if a new DT needs to be integrated, the process only requires installing the specified packages within the speech and gesture control subsystems and adding the respective peripheral devices (LMC, TS2, and audio). In other words, in a test case involving a cell with
N components, the system successfully supported concurrent operation by
N users, each employing a distinct input device. This functionality not only illustrates the system’s robustness but also underscores its potential to enhance productivity and adaptability in multi-operator industrial environments.
6. Conclusions
In recent years, human–robot interaction (HRI) technologies have advanced significantly, enabling more intuitive, secure, and efficient collaborations between humans and machines. The manufacturing sector, in particular, has benefited from innovations in gesture recognition, speech control, and biometric authentication, which have the potential to streamline operations, improve accuracy, and enhance safety within industrial environments. This paper examines the development and application of several HRI tools—specifically, the Leap Motion Controller (LMC), Tap Strap 2 (TS2), and the Cellya speech control system—each uniquely contributing to a multimodal, secure, and adaptable system for controlling manufacturing processes.
In conclusion, both the Leap Motion Controller (LMC) and Tap Strap 2 (TS2) have proven to be reliable for gesture recognition in human–robot interaction (HRI) settings. By employing multiple LMC devices in a daisy-chained configuration, we effectively expanded the gesture recognition field beyond a limited viewpoint, while the TS2 calibration tailored recognition accuracy to individual user anatomy and gesture variations. Moreover, integrating a facial recognition module for credential management added a security layer, verifying user identity before granting access.
The gesture recognition accuracy achieved with TS2, supported by the LSTM architecture, reached 97.59% for left hand gestures and 97.56% for right hand gestures, underscoring the model’s pattern recognition effectiveness with IMU data. LMC results were similar, with 95.17% and 95.66% accuracy for left and right hands, respectively.
The speech control system, Cellya, demonstrated effective performance in industrial settings, with fine-tuning significantly lowering the Word Error Rate (WER) from 44.44% to 22.22% and the Character Error Rate (CER) from 19.05% to 11.90%, resulting in more accurate command recognition for manufacturing cell control.
The DMCS integrates these advanced HRI and biometric components to improve operational accuracy, reduce unauthorized access, and ensure efficient task execution. This system is highly adaptable and scalable, with significant potential for enhancing productivity and security in various manufacturing environments. Future research could explore DMCS’s scalability in more complex manufacturing scenarios and its integration with advanced AI and IoT technologies, further enhancing real-time data processing, operational efficiency, and system performance in smart manufacturing contexts.
This multimodal, multi-device approach demonstrates high suitability for PCB manufacturing, especially using gesture and speech recognition.
The results obtained confirm the effectiveness of our HRI platform within smart manufacturing environments. The integrated multimodal system provides the adaptability and processing speed needed to execute complex manufacturing commands with high precision. Both gesture recognition systems considered, LMC and TS2, demonstrated levels of precision sufficient to enhance user interaction with the components. Error rates decreased further after fine-tuning, especially in the Cellya speech recognition subsystem, which now shows improved speech processing capabilities. Task management is enhanced by the DMCS, whose priority-based access control ensures that users perform tasks aligned with their expertise levels and granted security clearances.
The proposed HRI platform demonstrates significant scalability, allowing for the effortless incorporation of new input methods—such as gestures or speech commands—and their dynamic association with manufacturing cell components. This flexibility extends to enabling simultaneous multi-component control, which is particularly advantageous in complex manufacturing scenarios. For instance, a manufacturing cell with N components can be efficiently operated by N users in parallel, each utilizing a dedicated input device. These capabilities underline the platform’s potential to enhance adaptability, streamline operations, and accommodate diverse industrial requirements, positioning it as an accurate and robust solution for modern manufacturing environments.