1. Introduction
The printed circuit board (PCB) manufacturing industry plays a vital role in the production of modern electronic devices, from consumer equipment to military and aerospace technology. In this context, the evolution towards more decentralized and flexible manufacturing systems is critical to address challenges such as design complexity, quality standards, and efficiency in production processes [
1]. Today, traditional strategies are being complemented by smart technologies such as the Industrial Internet of Things (IIoT), cyber–physical systems (CPSs), and artificial intelligence (AI) tools. These enable real-time monitoring, control, and manipulation in increasingly decentralized and dynamic production lines, providing a competitive advantage in the market.
As the complexity and miniaturization of PCB designs continue to grow, collaborative robots (cobots) are emerging as a crucial solution to address accuracy, efficiency, and adaptability challenges in assembly processes. Furthermore, the use of cobots enables mass customization and the transition to smart factories, thereby optimizing workflows. Technological advancements driven by the Fourth Industrial Revolution (4IR) have introduced innovative solutions for integrating industrial ecosystem components with IIoT services. These developments emphasize interoperability between elements like robots, sensors, and control systems to optimize processes in advanced manufacturing cells.
The optimization of manufacturing cell interoperability represents a current research topic [
2]. This is due, among other factors, to the numerous hardware and software architectures and communication protocols. The objective of such optimization is to enhance the efficiency of the processes involved in the transformation of raw materials, thus reducing waste and energy consumption.
As robots become more sophisticated, their autonomy could be extended to learn or retain human-observed behaviors as skills, develop these skills through practice, and then use them in novel task environments (autonomous behaviors) [
3,
4]. As a result, robots and their interactions with humans will become more personalized, interactive, and engaging than ever before, providing assistance in many areas of life, such as manufacturing [
5]. It is expected that robots will be able to perform tasks autonomously in a variety of environments and communicate safely with humans. This can be achieved through various approaches, such as digital twins or the development of sensor-based interaction and communication systems, known as human–robot interactions (HRIs) [
6]. HRI approaches have become a fundamental element of advanced automation, enabling effective human–robot collaboration in industrial settings. HRI ecosystems have integrated several key technologies, including AI, speech recognition, gestures, and visual perception, thereby enhancing communication and efficiency in manufacturing cells.
The advent of novel sensors and edge computing has accelerated the development of HRI systems capable of interpreting human commands through an array of channels or modes (including speech and gesture recognition), leveraging the power of deep learning. This improves operational efficiency and safety in industrial processes. In this context, where precision-requiring tasks are performed, automatic speech recognition (ASR) becomes a crucial complement, serving to overcome the dependence on other types of sensors [
7].
The prevailing trend is toward the creation of more intelligent industrial environments, which allow greater customization and enhanced operational efficiency. However, this trend also presents novel challenges, including those related to security, management, and hierarchical control over PCB manufacturing cells. Because various manipulation alternatives are granted, there is a latent risk regarding the permissions and privileges that users hold over the cell components [1]: what can users do, and what should they not be allowed to do? HRI management systems must be developed in a way that fosters an efficient collaborative dynamic while simultaneously ensuring security and productivity.
In modern manufacturing cells, CPS control approaches facilitate uninterrupted interaction between machines and humans. In addition to performing automation tasks, robots are also capable of adapting to changing environmental conditions [
8]. The security and privacy risks associated with IoT devices, including denial of service (DoS) attacks and vulnerabilities in communication protocols, remain topics of active study. Various mitigation strategies, such as the use of robust authentication and encryption protocols, are proposed in [
9].
This paper describes the development of an HRI platform that provides the ability to simultaneously control multiple components of a PCB manufacturing cell (under development at the National Laboratory for Research in Digital Technologies). The HRI platform consists of various devices for controlling components using gesture and speech recognition, and it allows new control devices to be integrated into the system so that multiple users can control multiple components simultaneously. The proposed platform is designed to be scalable, facilitating the registration of new interactions (such as gestures or speech commands) and input devices and their subsequent linkage to the components of a manufacturing cell. Furthermore, it offers the capability to control all the components of the cell simultaneously. The manuscript is organized as follows:
Section 2 presents a review of the current literature.
Section 3 outlines the materials and methods used, describes the technical specifications of all devices and software employed, and delineates the methodology implemented. Thereafter, the results obtained in gesture control, speech control, and decentralized multimodal control are presented and discussed. Finally, the conclusions are presented.
2. State of the Art
As stated previously, one of the most prevalent techniques for gesture recognition employs the use of vision-based devices, such as the Leap Motion Controller (LMC) [
10]. This sensor, which is capable of highly accurate tracking of hand movements, has been the subject of extensive evaluation and implementation in recent years. The LMC device is able to track hands even beyond its nominal maximum range of 60 cm (field of view), in some cases up to 100 cm [
11]. The control of a Delta robot using the LMC for gesture recognition through robot kinematics is demonstrated in [12]. A comparable approach was employed to develop a neural network (NN)-based control system for the DLR-HIT II robotic hand utilizing the LMC [
13]. This resulted in enhanced stability and accuracy in teleoperation through an inverse kinematics methodology.
A combination of Generative Adversarial Network (GAN) and Convolutional Neural Network (CNN) techniques for hand gesture classification is proposed in [
14]. By generating synthetic data, they achieved significant improvement in classification compared to other deep learning models. In contrast, a system composed of a mmWave radar and a TRANS-CNN model for gesture recognition, achieving an accuracy of 98.5%, is presented in [
15]. This analysis demonstrates that the choice of AI model has a significant impact on the accuracy and robustness of a gesture recognition system with the LMC.
Another device that has recently been the subject of study with regard to its potential for use in gesture-based HRI applications is the Tap Strap 2 (TS2). The device comprises a series of inertial measurement units (IMUs), accelerometers, and haptic feedback mechanisms [
16]. The usability of this device is investigated in [
17], which proposes an RNN-based machine learning model to reduce the error rate in virtual keyboard applications on various surfaces. Furthermore, ref. [
17] investigated methods for enhancing the precision of device input readings through the implementation of three long short-term memory (LSTM) recurrent neural network (RNN) models. The models that were constructed were a standard LSTM model, a model with both CNN and LSTM layers, and a convolutional LSTM (ConvLSTM). The LSTM was found to have the highest accuracy, with an average of 97.470%. The two previous studies demonstrate the effectiveness of RNNs, specifically the LSTM variant, in processing accelerometer data. Similarly, a trajectory clustering machine learning model that recognizes gestures resembling the rotation of a dial, using trajectory data from the TS2 optical mouse sensor, is proposed in [
18].
Other devices have been employed in the context of HRI systems, such as Microsoft Kinect. The Kinect device has been employed to control the joints of a robotic head via hand gestures, with a success rate exceeding 90% [
19]. Alternatively, electromyographic sensors have been employed in conjunction with algorithms such as K-Nearest Neighbors (K-NN) and Dynamic Time Warping (DTW) [
20].
Recent advancements in AI, particularly in automatic speech recognition (ASR) and Large Language Models (LLMs), are increasingly being adopted in the manufacturing sector. ASR technologies facilitate hands-free speech control and real-time data capture, improving operational efficiency and worker safety in industrial environments. Meanwhile, LLMs enable advanced processing of unstructured textual data, supporting applications such as predictive maintenance through log analysis. The integration of these technologies is accelerating the transition toward smarter, more adaptive HRI frameworks aligned with Industry 4.0 principles [
21,
22].
These approaches are also capable of inferring latent or absent components of given commands based on semantic associations learned from noisy text corpora [
23,
24]. In other words, these systems acquire new knowledge in order to complete incomplete instructions. Accordingly, interfaces for speech command recognition can be constructed, and they can be converted to code for controlling robot movements [
25].
Some case studies of pick-and-place operations using vision–language instructions are proposed in [26,27]. The results demonstrate a high degree of accuracy, greater than 99%, allowing the robots to execute tasks reliably. By incorporating visual information such as lip reading and gesture recognition, audiovisual speech recognition (AVSR) systems can improve the accuracy of human speech interpretation in noisy environments [
28,
29].
In more intricate control scenarios, which occasionally demand the use of multiple devices, Qi et al. [
30] developed a gesture recognition system based on multiple LMCs, utilizing RNN-LSTM for the teleoperation of surgical robots. The technique demonstrated a high level of recognition accuracy and inference speed. Similarly, Liu et al. [
31] combined radar and vision data, enhancing the system’s adaptability and robustness through a deformable fusion network with LSTM-RNN. This device fusion approach is particularly advantageous in scenarios where environmental conditions are variable, effectively addressing the constraints associated with systems that rely exclusively on a single device. In the construction industry, a multimodal control system that employs TS2 for gesture recognition and incorporates a gaze awareness layer with
Tobii Pro Glasses 3 is reported [
16]. This system demonstrated an accuracy rate of 93.8% in tasks involving construction machinery.
The implementation of access control and privilege strategies for control and manipulation tasks in CPS-based manufacturing approaches presents a significant challenge. Inadequate access control can result in malicious activities, including the alteration of system behavior, the theft of sensitive data, or the infliction of physical damage. Adequate access control, by contrast, preserves the integrity of the cyber (software and data) and physical (hardware and machinery) components of the system. CPS regulatory standards vary from one industry to another, but several frameworks and guidelines have been established to ensure security and interoperability [
32,
33]. The system approach proposed by Mthetwa [
34] is based on blockchain technology for credential management and access control in HRI systems.
The use of Lightweight Directory Access Protocol (LDAP)-based systems has proven to be an effective solution for managing user identities and controlling access to devices connected to CPSs [
35]. In a particular use case, the Mycros system provides LDAP-based enterprise identity management, not only centralizing user administration but also improving security by enforcing an access hierarchy [36]. There are also container-based alternatives for encapsulating services and applications (e.g., Docker and Kubernetes), which allow decentralized management of the devices connected to the HRI. They optimize both resource allocation and data transmission latency between the devices and the control system [
37].
Although container-based HRIs are flexible, they can introduce security breaches if not managed properly. A solution combining network isolation mechanisms with network hardening, using a container firewall and a communication bridge based on Open vSwitch, is proposed in [38]. On the other hand, centralizing control through systems entirely based on the encapsulation of services and applications within a virtual system provides flexibility and security in resource management. These platforms optimize resource allocation and data transmission latency between devices and the control system, enabling decentralized management of connected devices [
37].
In the context of the increasing prevalence of CPSs and accelerated HRI development, it is of critical importance to ensure the security and privacy of interactions between physical and digital components. An additional layer of security is provided when the CPS incorporates biometric recognition systems [
39].
Biometrics has emerged as a crucial tool for authentication, utilizing distinctive features such as fingerprints and facial and iris patterns [
40,
41]. These technologies facilitate the implementation of robust control systems to prevent unauthorized access and reinforce security through the authentication of unique and non-transferable characteristics [
42,
43]. Furthermore, developments in artificial intelligence have enhanced the precision and versatility of these biometric systems, leading to novel prospects in automation.
3. Materials and Methods
The proposed methodology delineates the development of an accurate HRI platform, integrating multiple devices within a PCB manufacturing cell. Furthermore, it introduces a comprehensive strategy for managing users, including their permissions, control privileges, and the implementation of biometric security protocols. This user management framework facilitates efficient request handling on the multi-user platform while enabling intelligent manipulation of system components.
Figure 1 illustrates the proposed methodology for developing an HRI platform, which comprises five levels. These levels are as follows: (1) user, (2) data acquisition and processing, (3) decentralized multimodal control server (DMCS), (4) MQTT broker, and (5) cell components.
At the user level, instructions are provided for manipulating the cell components. These instructions can be given in different modes, including speech or hand-based gestures. Furthermore, these instructions can be accessed by multiple devices, even simultaneously.
At the next level, the acquisition of user instructions is conducted via the implementation of input devices, such as standard microphones, LMC devices, and TS2 devices (for technical specifications, refer to
Table 1). The LMC device serves as an interface that enables users to control components through gestures and hand/finger movements in the air, obviating the necessity for physical contact with the device. The TS2 is a device placed on the hand as a set of rings connected to the fingers. Its purpose is to facilitate interaction with electronic devices through gestures. The data obtained from both devices are subjected to processing in order to facilitate the recognition of gestures through the utilization of deep learning (DL) techniques.
The LMC and TS2 devices offer several advantages in the context of HRI platforms, including real-time responsiveness and touchless interaction. These features enhance the usability and functionality, which in turn facilitate a more natural and intuitive HRI. Both devices provide high-resolution tracking of hand movements with sub-millimeter accuracy, which is essential for applications requiring precise inputs, such as robotic manipulation or teleoperation. Furthermore, they have low latency and fast processing, ensuring user-friendly interactions. It is also important to mention that software development kits (SDKs) are available for both devices, allowing for customized interactions.
The data obtained from the input devices are then transferred to processing units, hereafter referred to as data transcoders (DTs). The LMC and TS2 data are processed using an Intel NUC 11, while a Jetson AGX Orin was employed to process the data obtained from the audio devices.
Table 2 presents the technical specifications of the two DTs. The DMCS also requires specific user data for identification purposes, such as biometric data, sensor index, and interaction name. Furthermore, specialized software was implemented in the DTs for device communication, data processing (e.g., hand gesture recognition/interpretation and natural language processing), and the construction of the DMCS.
Table 3 shows the software specifications.
DTs were selected on the basis of their high computational performance and their efficacy in processing large amounts of data in real time, due to their edge AI capabilities and their capacity to optimize latency. It is also noteworthy that these devices consume considerably less energy than traditional architectural solutions, which is an additional advantage.
The next level, referred to as decentralized multimodal control, oversees all interactions between users and components within the manufacturing cell. Manipulation requests from the speech and gesture recognition systems are processed and filtered according to the permissions and privileges granted to each user, and they are scheduled so as to prevent conflicts during task execution. In addition, a container network was constructed using TCP sockets (through port 5000) for communication between the DTs and the containerized server, thereby enabling these devices to establish a connection with the Docker server. The server platform is based on Node.js technology, which serves as the central processing point with a relational database; asynchronous queries are employed to provide efficient access for real-time verification of user permissions. All interactions, requests, and actions are recorded in an event log.
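As a point of reference, the following minimal sketch (an assumption, not the authors' exact implementation) shows how a DT could forward a recognized instruction to the DMCS as a JSON message over the container network on port 5000; the field names and the host name `dmcs` are illustrative.

```python
# Minimal sketch of a data transcoder (DT) forwarding a recognized instruction
# to the DMCS over the container network (TCP, port 5000). Field names
# ("user_biometric_id", "sensor_index", "interaction") and the host "dmcs"
# are illustrative assumptions.
import json
import socket

def send_request(interaction: str, sensor_index: int, user_biometric_id: str,
                 host: str = "dmcs", port: int = 5000) -> str:
    request = {
        "user_biometric_id": user_biometric_id,  # identity resolved by facial recognition
        "sensor_index": sensor_index,            # which input device produced the command
        "interaction": interaction,              # e.g., a gesture or speech intent name
    }
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(json.dumps(request).encode("utf-8"))
        # The DMCS replies with the outcome of the queued instruction.
        return sock.recv(4096).decode("utf-8")
```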
As HRI systems frequently comprise a multitude of components (e.g., vision processing, NLP, and control logic), the Docker platform enables the execution of these as isolated microservices, thereby enhancing maintainability and scalability. Moreover, it eliminates issues that arise from discrepancies in the environment, such as differences in dependency versions. Furthermore, it supports continuous integration and deployment pipelines, which facilitate seamless integration of updates, testing of functionality, and deployment of improvements to HRI systems. For its part, Node.js is particularly adept at managing asynchronous, real-time communication. This is highly advantageous for HRI systems, where robots frequently require the processing of inputs such as sensor data, commands, or feedback in real time. Additionally, its capacity to handle concurrent tasks with a non-blocking I/O model enables the development of scalable and responsive applications, even when managing multiple robotic systems.
At the following level, the MQTT broker enables bidirectional communication via messages between the components of the manufacturing cell and the DMCS. The broker is responsible for distributing the orders received from the server to the corresponding components through a tree of predefined topics. Furthermore, this level has the capacity to manage several topic changes simultaneously, allowing multiple cell components to receive and respond to commands in parallel, which improves system synchronization and efficiency.
At the final level, there are the components of the PCB manufacturing cell. These components are subscribed to specific topics through the MQTT broker, which enables them to receive operational data and perform the requisite movements and commands. Upon receipt of a command, these components, which include collaborative robots, automation actuators, and specific components for PCB manufacturing, execute the corresponding action and transmit a confirmation back to the central server. The response is then processed and stored in the database, thereby updating the status of the task in progress and allowing the system to make decisions in parallel.
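This broker-mediated exchange can be pictured with the hedged sketch below, in which a cell component subscribes to its command topic and publishes a confirmation; the topic names under `cell/...`, the broker address, and the paho-mqtt 1.x callback style are assumptions, not the platform's actual configuration.

```python
# Illustrative sketch (not the authors' exact implementation) of a cell
# component subscribing to its command topic and returning a confirmation.
# Topic names and the broker host are assumptions; paho-mqtt 1.x callback
# style is assumed.
import paho.mqtt.client as mqtt

COMMAND_TOPIC = "cell/zone1/cobot1/command"   # hypothetical topic in the tree
STATUS_TOPIC = "cell/zone1/cobot1/status"

def on_connect(client, userdata, flags, rc):
    client.subscribe(COMMAND_TOPIC)

def on_message(client, userdata, msg):
    command = msg.payload.decode("utf-8")
    # ... execute the corresponding movement or action here ...
    client.publish(STATUS_TOPIC, f"done:{command}")  # confirmation back to the DMCS

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.local", 1883)  # hypothetical broker address
client.loop_forever()
```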
Metrics
Several metrics were calculated to validate the accuracy of our system as well as the fine-tuning processes.
Confusion matrices were generated to evaluate the accuracy of gesture recognition and NLP. These matrices were constructed by organizing the results obtained in the validation stage of the systems in a table. The rows represent the real classes of gesture and speech recognition and the columns the predicted classes. Each cell indicates the number of times a real class was correctly or incorrectly classified into a predicted class. The system’s overall accuracy is obtained by dividing the total number of correct predictions (the sum of the diagonal elements of the matrix) by the total number of examples evaluated.
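As a worked illustration of this computation, the short snippet below derives the overall accuracy from the diagonal of an example confusion matrix; the matrix values are illustrative only.

```python
# Sketch of the overall-accuracy computation described above: the sum of the
# diagonal of the confusion matrix divided by the total number of evaluated
# samples. The example matrix values are illustrative only.
import numpy as np

confusion = np.array([
    [97,  2,  1],   # rows: real classes
    [ 3, 95,  2],   # columns: predicted classes
    [ 1,  4, 95],
])

overall_accuracy = np.trace(confusion) / confusion.sum()
print(f"Overall accuracy: {overall_accuracy:.2%}")  # 95.67% for this example
```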
Two common metrics were used to evaluate the fine-tuning applied to the ASR model. The Word Error Rate (WER) metric was employed for the assessment of the ASR’s performance. The accuracy of the transcriptions generated by the system was evaluated by comparing the output text with the correct reference text [
44]. The Character Error Rate (CER) metric is analogous to the WER metric, but it is calculated at the level of individual characters rather than whole words [
8].
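For reference, the standard formulations of both metrics are as follows, where the unit of analysis is the word for the WER and the character for the CER.

```latex
% S, D, and I are the numbers of substitutions, deletions, and insertions,
% and N is the length of the reference (words for WER, characters for CER).
\begin{equation*}
  \mathrm{WER} = \frac{S_w + D_w + I_w}{N_w},
  \qquad
  \mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}
\end{equation*}
```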
4. Results
This section presents and discusses the results obtained in developing the multimodal human–robot interaction platform: first the results of gesture control, then those of speech control, and finally the proposed decentralized multimodal control server.
The proposed HRI platform is integrated into a PCB production cell (see
Figure 2). To facilitate the conveyance of raw materials to the various stages of PCB assembly, the cell is divided into three zones, each of which is equipped with a cobot. The sequence of the manufacturing process is illustrated by the red line. It is noteworthy that our proposal allows for the integration of multiple devices for gesture recognition. In the case of TS2s, each user who operates the device can send manipulation commands, which are then handled by the DMCS. The subsequent sections will provide further details on this process. In the case of LMCs, a procedure must be followed prior to the transmission of requests to the DMCS. The objective of establishing a network of LMCs is to expand the field of view. As each LMC is capable of detecting both hands, it is feasible to detect as many pairs of hands as there are connected devices. Consequently, the distance of each hand is measured in relation to each LMC, and only the data corresponding to the shortest distance for each hand are processed. This guarantees that only one pair of hands is identified.
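A minimal sketch of this nearest-sensor selection rule is given below; the data structure used to represent a tracked hand is an assumption for illustration.

```python
# Minimal sketch of the hand-selection rule described above: with several LMCs,
# each reported hand is kept only if it comes from the sensor closest to it,
# so that exactly one pair of hands is processed. The data structure is assumed.
from dataclasses import dataclass, field

@dataclass
class TrackedHand:
    side: str            # "left" or "right"
    distance_mm: float   # distance from the reporting LMC to the palm
    lmc_index: int
    features: dict = field(default_factory=dict)  # finger states, positions, etc.

def select_closest_hands(detections: list[TrackedHand]) -> dict[str, TrackedHand]:
    """Keep, per hand side, the detection with the shortest sensor-to-hand distance."""
    closest: dict[str, TrackedHand] = {}
    for hand in detections:
        best = closest.get(hand.side)
        if best is None or hand.distance_mm < best.distance_mm:
            closest[hand.side] = hand
    return closest  # at most one left and one right hand are forwarded
```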
4.1. Hand Gesture Control
Gesture control represents a fundamental component of the proposed HRI platform, as it allows for user interaction through the use of specific hand gestures, which are tracked by the LMC and/or TS2 devices. Initially, a gesture dictionary was established with the aim of standardizing gesture control, thereby ensuring that any user is able to interact with the cell components, regardless of the peripheral device being used.
Figure 3 shows the proposed gesture dictionary. This approach was devised to ensure a straightforward and efficient learning curve for all system users. The definition strategy for this dictionary was based on the assignment of a unique gesture to each component of the manufacturing cell (see
Figure 2), as well as to each action that can be performed [
14,
31]. To represent the order in which the cell components interact according to the production sequence of a PCB, gestures were assigned to each component. With regard to the actions, each one is associated with a gesture that reflects the action it triggers (e.g., move or pick). Furthermore, a gesture has been defined to enable access to and exit from the different zones of the cell. Consequently, in order to control a component, it is first necessary to access the zone in which it is located.
Table 4 shows the gesture–component relationship for each zone.
The gesture control algorithms were implemented on the DT:NUC. The Leap Motion software development kit (SDK, version 4.1) was employed to identify the extended fingers for the LMC. Furthermore, functionality was incorporated to quantify the number of active fingers in each interaction, considering both hands. Additionally, the distance between fingers was measured to detect more complex gestures such as Pick. This information was utilized to recognize the gestures. Listing 1 illustrates the algorithm for hand gesture recognition with the LMC. In contrast, given that TS2 devices transmit large amounts of data from their multiple accelerometers and IMUs, LSTM models were configured for gesture recognition. These architectures are particularly well suited to processing continuous data inputs, as evidenced by their efficacy in the context of such applications [
17].
Listing 1. The execution of gesture recognition (TS2 and LMC), speech recognition, and facial recognition is described in general terms, as well as the sending of instructions to the DMCS via TCP.
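A hedged, rule-based sketch of the recognition logic outlined in Listing 1 is shown below; the pinch threshold and any gesture names not mentioned in the text are illustrative assumptions rather than the implemented values.

```python
# Hedged sketch of the rule-based recognition outlined in Listing 1: gestures
# are inferred from the number of extended fingers reported by the LMC and,
# for "Pick", from the thumb-index distance. The threshold and any gesture
# names beyond those mentioned in the text are illustrative assumptions.
PICK_DISTANCE_MM = 30.0  # hypothetical pinch threshold

def recognize_lmc_gesture(extended_fingers: list[bool],
                          thumb_index_distance_mm: float) -> str:
    n_extended = sum(extended_fingers)
    if thumb_index_distance_mm < PICK_DISTANCE_MM and n_extended <= 2:
        return "Pick"                      # complex gesture based on finger distance
    if n_extended == 0:
        return "Close"
    if n_extended == 5:
        return "Open"
    return {1: "One", 2: "Two", 3: "Three", 4: "Four"}.get(n_extended, "Unknown")
```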
To assess the precision of gesture recognition, a series of experimental trials were conducted with 10 end-users of the HRI platform. Each subject was instructed to perform each gesture with each hand 100 times. A dataset comprising 6000 samples (per hand) was employed. In total, 70% of the samples were allocated for training, while 30% were reserved for testing and validation. The Optuna framework was employed for optimizing the hyperparameters [
45]. The configuration is presented in
Table 5. The sequence length was one time step, with 15 features per sample, which aligns with the nature of the TS2 data. Moreover, the
Adam optimizer was employed to adaptively adjust the learning rate, and the categorical cross-entropy loss function was used for multiclass classification. The models were trained for 50 epochs, during which the weights were adjusted and performance improved.
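A sketch of an LSTM classifier consistent with this configuration is shown below; the layer sizes, dropout rate, and number of gesture classes are placeholders standing in for the Optuna-tuned values in Table 5, not the model actually deployed.

```python
# Sketch of an LSTM classifier consistent with the configuration described in
# the text (input of one time step with 15 features, Adam, categorical
# cross-entropy, 50 epochs). Layer sizes, dropout, and the number of gesture
# classes are placeholders for the Optuna-tuned values in Table 5.
import tensorflow as tf

N_FEATURES = 15   # features per TS2 sample
N_CLASSES = 12    # hypothetical number of gestures in the dictionary

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, N_FEATURES)),           # sequence length of one time step
    tf.keras.layers.LSTM(64),                         # units chosen via hyperparameter search
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50)
```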
Figure 4 illustrates the degree of accuracy achieved for each device (LMC and TS2). Concerning the LMC, accuracies of 95.17% and 95.66% were obtained for the left and right hands, respectively. As illustrated in the confusion matrix in
Figure 4a (left hand), the
Open and
Close gestures exhibit reduced accuracy. These gestures involve the little finger, and it is possible that anatomical differences among users' hands affect their recognition. A similar phenomenon occurs for gestures Three and Four (Figure 4b, right hand). In this case, because the gestures involve hiding either the thumb or the little finger, variations related to users' anatomical differences are negligible. This demonstrates the accuracy of the proposal for gesture recognition tasks.
In the case of TS2, the accuracy was found to be 97.59% and 97.56% for the left and right hands, respectively. The LSTM architecture has been demonstrated to be an accurate and stable approach to gesture recognition.
Figure 4c,d illustrate the confusion matrices for each hand. Nevertheless, the high degree of accuracy is contingent upon the specific characteristics of the training dataset. Furthermore, due to the error propagation inherent to accelerometers and IMUs, it is plausible that new users may experience a decrease in gesture recognition capacity when utilizing TS2. To mitigate the impact of this error, a calibration algorithm was devised for new users, whereby correction factors are calculated for the data acquired with the TS2. The correction factors for each accelerometer are derived from the measurement uncertainty (type A and B) with a coverage factor of 2.57 [
46] and are stored in JSON format. The correction factors are then applied to the accelerometer signals before they are fed to the pre-trained LSTM models for gesture prediction.
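One possible realization of this calibration step (an assumption, not the authors' exact procedure) is sketched below: per-axis bias and expanded uncertainty are estimated from repeated readings in a reference pose and stored in JSON, and the bias is removed from subsequent signals before inference.

```python
# A possible realization (not the authors' exact procedure) of the per-user
# calibration described above: repeated readings in a reference pose yield a
# Type A uncertainty per axis, combined with an assumed Type B contribution
# and expanded with a coverage factor of 2.57; the bias correction is stored
# in JSON and applied before the LSTM inference.
import json
import numpy as np

COVERAGE_FACTOR = 2.57
TYPE_B_UNCERTAINTY = 0.01  # hypothetical datasheet-based contribution (g)

def calibration_factors(rest_readings: np.ndarray, reference: np.ndarray) -> dict:
    """rest_readings: (n_samples, n_axes) captured in a known reference pose."""
    bias = rest_readings.mean(axis=0) - reference             # systematic offset per axis
    u_a = rest_readings.std(axis=0, ddof=1) / np.sqrt(len(rest_readings))
    u_c = np.sqrt(u_a ** 2 + TYPE_B_UNCERTAINTY ** 2)          # combined uncertainty
    return {
        "bias": bias.tolist(),
        "expanded_uncertainty": (COVERAGE_FACTOR * u_c).tolist(),
    }

def apply_calibration(signal: np.ndarray, factors: dict) -> np.ndarray:
    return signal - np.asarray(factors["bias"])                # corrected signal for the LSTM

# Example of persisting the factors for a new user:
# with open("ts2_user_calibration.json", "w") as f:
#     json.dump(calibration_factors(rest_readings, reference), f)
```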
The potential for misuse of gesture control devices by any individual represents a significant security risk. In this regard, a security level was defined to ensure that only authorized users are able to take control of the components of the manufacturing cell. A facial recognition system was developed with the objective of validating the identity of the user who performs the gestures. Furthermore, this system oversees the allocation of permissions and privileges, as detailed in
Section 5. The Local Binary Pattern Histogram (LBPH) algorithm was incorporated into the DT:NUC, thereby conferring upon the system the capacity to withstand fluctuations in illumination and facial appearance [
47]. Consequently, upon the recognition of a gesture, the facial recognition system calculates the pertinent biometric information, which is then transmitted via TCP along with the corresponding instruction to the DMCS.
4.2. Speech Control
Automatic speech recognition (ASR) systems are driven by two intertwined pillars operating in synchrony: an acoustic model and a language model. The DeepSpeech engine can perform speech recognition in real time, thereby ensuring low-latency responses and facilitating efficient operation on edge devices. This is a crucial attribute for fluid interactions between humans and robots. Furthermore, it can run on local hardware without requiring an internet connection, which ensures privacy, security, and consistent performance. Additionally, the DeepSpeech engine is robust to background noise, which is a common occurrence in real-world HRI scenarios. The Common Crawl Large Multilingual Text-to-Voice (CCLMTV) dataset was used [
48].
Figure 5 shows the audio processing flow for speech recognition. The signals are processed through convolutional layers (Conv-BN-ReLU), which are essential for capturing spatial hierarchies in the input signal, reducing internal covariate shift, and introducing non-linearities that enhance feature learning [
49]. These are followed by repeated blocks of time and channel separable convolutions (TCSConv-BN-ReLU), a configuration that reduces parameter complexity and improves the model’s capacity to capture temporal dependencies [
50]. These blocks include batch normalization and ReLU activations, employed to extract temporal features effectively. The implemented model incorporates residual connections and deep convolutions, thereby enhancing efficiency and accuracy. Subsequently, the output is subjected to a Connectionist Temporal Classification (CTC) layer, which aligns variable-length input signals with their textual outputs without requiring pre-segmented data, optimizing the transcription process [
51]. The returned text must be included within the corpus. A corpus is defined as a set of textual data comprising the phrases or commands that a system is expected to identify and assimilate. To address the aforementioned issue, a corpus oriented towards the tasks of robot and device control within a production environment was developed.
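Before detailing the corpus, the block structure described above can be illustrated with the following hedged Keras sketch of a TCSConv-BN-ReLU block with a residual connection; the filter counts, kernel sizes, and repetition factor are placeholders rather than the model's actual hyperparameters.

```python
# Illustrative Keras sketch of one time-and-channel separable convolution block
# (TCSConv-BN-ReLU) with a residual connection, as described in the text.
# Filter counts, kernel size, and repetition count are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

def tcs_conv_block(x, filters: int, kernel_size: int, repeat: int = 3):
    residual = layers.Conv1D(filters, 1, padding="same")(x)    # match channels for the skip
    residual = layers.BatchNormalization()(residual)
    for _ in range(repeat):
        x = layers.SeparableConv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.ReLU()(layers.Add()([x, residual]))           # residual connection

# Example: features shaped (time, mel_bands); the CTC layer would follow the
# final blocks to align the variable-length output with the transcription.
inputs = tf.keras.Input(shape=(None, 64))
x = layers.Conv1D(128, 11, padding="same")(inputs)              # initial Conv-BN-ReLU
x = layers.ReLU()(layers.BatchNormalization()(x))
x = tcs_conv_block(x, filters=128, kernel_size=13)
```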
The proposed corpus has been designed with specific instructions for interfacing with the components of the manufacturing cell. It is important to mention that the language defined for the ASR is Spanish. Each call-to-action phrase (utterance) comprises three fundamental components: actions (as illustrated in the bottom line of
Figure 3), the specific devices or components involved, and the zone or destination where the action is to be executed (see
Figure 2).
An intent represents the intention behind an utterance issued by the user. Each intent groups elements into different slots, such as action, component, and destination. This enables the system to ascertain the intention behind the interaction and to execute the appropriate action by capturing all elements of an utterance.
Figure 6 illustrates an instance of an utterance, delineating the constituent slots. As previously stated, the language utilized is Spanish. To allow a more effective understanding, the rounded gray boxes describe the corresponding English translations of the actions, components, and destinations. The wake word “
Cellya” activates the system, “
mueve” (move) is the action slot, “Cobot” is the component slot, and “
zona de limpieza” (clean zone) is the destination slot. The associated intent, designated as
MoveUR3CleanZone, aggregates all of the aforementioned elements to facilitate the desired interaction. The corpus incorporates not only the textual phrases themselves but also the various ways in which users might express them, thus enabling the speech recognition system to adapt to different jargon. To illustrate, the corpus includes alternative forms of expression for the command “
Cellya, mueve Cobot a la zona de limpieza” (
Cellya, move Cobot to the cleaning zone). Such variations permit the system to adapt more effectively to the inherent flexibility of human language.
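A hedged sketch of how such an utterance could be mapped to its intent and slots is shown below; the lookup logic and the single phrase variant are illustrative, since the real corpus contains many more alternatives per intent.

```python
# Hedged sketch of mapping a recognized utterance to an intent and its slots,
# following the example of Figure 6. The lookup logic is illustrative; the
# real corpus holds many more phrase variants per intent.
INTENTS = {
    "MoveUR3CleanZone": {
        "action": "mueve",                  # move
        "component": "cobot",
        "destination": "zona de limpieza",  # cleaning zone
    },
}

def match_intent(utterance: str, wake_word: str = "cellya"):
    text = utterance.lower()
    if wake_word not in text:
        return None                         # system stays idle without the wake word
    for name, slots in INTENTS.items():
        if all(value in text for value in slots.values()):
            return name, slots              # all slots found in the utterance
    return None

print(match_intent("Cellya, mueve Cobot a la zona de limpieza"))
```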
4.3. Fine-Tuning
A process of fine-tuning was performed on the DeepSpeech ASR model in order to adapt it to the requirements of the HRI platform. For this process, a dataset of 850 audio recordings collected from 10 different users, each interacting with the different components of the system, was constructed. The recordings were normalized and processed at 16 kHz in mono format to ensure optimal compatibility with the DeepSpeech model. This dataset was then divided into training, validation, and testing subsets (70/15/15%, respectively). The hyperparameters configured for the fine-tuning are shown in
Table 6.
To evaluate the accuracy of the ASR, the WER and CER metrics were calculated. The WER was employed to evaluate the precision of the model in the speech-to-text task. Fifty audio recordings containing a total of 500 words were used. A total of 111 errors were counted, comprising 40 substitutions, 35 insertions, and 36 deletions, yielding a WER of 22.22%, i.e., 22.22% of the words were transcribed incorrectly. Conversely, the CER quantified discrepancies at the character level. A total of 3000 characters were examined, and 357 errors were identified, comprising 130 substitutions, 110 insertions, and 117 deletions, resulting in a CER of 11.90%.
The ASR model pre-trained with the CCLMTV dataset yields a WER of 44.44% and a CER of 19.05%. In this sense, the fine-tuning process using our own dataset reduces the WER by 50% (from 44.44% to 22.22%) and the CER by roughly 37.5% (from 19.05% to 11.90%). For speech-to-text tasks, the fine-tuned model achieves WER values within the commonly cited acceptable upper limit of 20–25% [
52,
53].
The WER and CER metrics provide insight into the capacity of the ASR model for speech-to-text transcription; however, they do not reflect the system's ability to interpret instructions in a real-world setting. To evaluate the performance of the model, an experiment was designed whereby users interacted with different components of the manufacturing cell, issuing instructions at random from a defined corpus. A total of 10 users provided 100 valid instructions per action, in addition to 100 supplementary instructions that had a similar phonological structure but did not correspond to concrete instructions (labeled as None).
Figure 7a shows the results of the accuracy obtained by the ASR model for open/close instructions for two randomly selected components (LPKF and oven). The figure displays the confusion matrix for the LPKF component, with an accuracy of 93.07%. A trend of false positives is observed for the label class ‘None’.
Figure 7b shows the accuracy of the oven component. For this test, an accuracy of 92.47% was achieved. Finally,
Figure 7c shows the confusion matrix, which evaluates the interaction of the cobots performing different tasks such as “
mover, tomar y colocar” (move, pick, and place), achieving an overall accuracy of 92.85%. In general terms, false positives may be due to phonetic similarity between valid and undefined instructions, as well as to environmental noise or variations in users’ pronunciation. Finally, the model achieved an accuracy of 100% for instructions that should not be recognized, even though they are phonetically similar to valid ones.
5. Decentralized Multimodal Control Server
The objective of the DMCS is to control and supervise the interactions between users and system components. This interaction is based on the hierarchy of permissions and the priority assigned to each user in accordance with their access level.
The priority assigned to users in the DMCS enhances efficiency and safety, since it lets users interact with the manufacturing cell according to their expertise. High-priority users are enabled to perform tasks right away, whereas users with lower priority have limited access granted only to specific functions; given that such users have received little training, they focus on operations with a low risk of incidents. Medium priority is assigned to regular users with moderate training to manage medium-risk interactions; this keeps the workflow organized and avoids access conflicts.
Table 7 illustrates the roles, access level, and priority level associated with each user created within the system. Each user has the ability to control some component of the manufacturing cell; however, with the exception of the
admin and the
super_user, none of them has full control of all the components of the cell. Furthermore,
Table 7 illustrates the commands that are available to users. Each command is associated with a script that initiates the designated action in accordance with the system’s predefined logic. Subsequently, Listing 2 describes the logic underlying the functionality of each command.
Listing 2. DMCS pseudocode describing a task processing system that handles user requests, validates permissions, processes tasks based on access level, and logs results, using a queue for user validation.
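A hedged Python sketch of the logic summarized in Listing 2 is given below: requests are validated against the user's permissions, queued by priority, executed, and logged. The role names, permission table, and priority values are illustrative, not the platform's actual configuration.

```python
# Hedged sketch of the Listing 2 logic: validate a request against the user's
# permissions, queue it by priority, execute it, and log the result.
# Role names, permission table, and priority values are illustrative.
import heapq
import logging

logging.basicConfig(level=logging.INFO)
PERMISSIONS = {"operator_zone1": {"open_lpkf", "move_cobot1"}}   # hypothetical roles
PRIORITY = {"admin": 0, "super_user": 0, "operator_zone1": 1}    # lower value = served first

queue: list[tuple[int, int, dict]] = []
counter = 0  # tie-breaker so equal-priority requests stay in arrival order

def enqueue(request: dict) -> bool:
    global counter
    user, command = request["user"], request["command"]
    if command not in PERMISSIONS.get(user, set()):
        logging.warning("Denied %s for %s", command, user)        # rejected, still logged
        return False
    heapq.heappush(queue, (PRIORITY.get(user, 2), counter, request))
    counter += 1
    return True

def process_next():
    if queue:
        _, _, request = heapq.heappop(queue)
        # ... publish the command to the MQTT broker here ...
        logging.info("Executed %s for %s", request["command"], request["user"])

enqueue({"user": "operator_zone1", "command": "move_cobot1"})
process_next()
```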
Figure 8 displays a sequence diagram (SD:Main) of the complete interaction within the CMC. It delineates the communication and message-sending behaviors exhibited by all elements of the system, along with the interactions between users and system components. The section entitled “Data Acquisition and Processing” outlines how the user interacts with the platform, including the use of gestures and speech commands, and describes the processes employed by the DTs to process the data. The instructions are identified and conveyed to the DMCS via a JSON, which is appended to the request queue. These JSON-like structures contain data that the DMCS requires in order to identify the user, including biometric data, sensor index, and interaction name. The DMCS then performs a search in the DBMS for permission and privilege (see the SD:DMCS sequence diagram in
Figure 9).
The interactions between the DMCS and the DBMS are performed using the TCP protocol (port 3306), and are located within a network of containers within the Docker platform.
The sequence of DMCS interactions is illustrated in
Figure 9. The process begins with the DMCS verifying the permissions and privileges that the user has within the database. In accordance with the assigned priority level, the instruction is then inserted into the process queue. In the event that the instruction is requested by a user with a high priority level, it is executed without delay. The DMCS establishes a connection with the broker via the MQTT protocol and subsequently publishes the instruction. The broker employs Transport Layer Security (TLS), which guarantees the confidentiality and integrity of communications between the client and the broker through the use of encryption. A connection is not feasible in the absence of valid user credentials (see Listing 3).
Listing 3. Pseudocode for task management using MQTT, including status verification, command execution, and result reporting.
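The TLS-secured connection outlined in Listing 3 can be sketched as follows; the broker address, certificate path, credentials, and topic are placeholders, and the paho-mqtt 1.x callback style is assumed.

```python
# Sketch of the TLS-secured broker connection described in the text. The
# broker address, certificate path, credentials, and topic are placeholders.
# Without valid credentials and the CA certificate, the connection is refused.
import paho.mqtt.client as mqtt

client = mqtt.Client()                                     # paho-mqtt 1.x style assumed
client.username_pw_set("dmcs_service", "secret")           # per-user credentials
client.tls_set(ca_certs="/etc/mqtt/ca.crt")                # enables TLS encryption
client.connect("broker.local", 8883)                       # TLS port
client.publish("cell/zone1/cobot1/command", "move:clean_zone", qos=1)
client.disconnect()
```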
Subsequently, the broker requests that the component execute the instruction. Upon completion of the instruction, the broker returns the result (see
Figure 9). The
resultwork is then processed in the section entitled ’Disconnection and Tracker Resume’. At this point, the DMCS disconnects from the broker. Finally, the DMCS notifies the user of the outcome of the instruction, indicating whether it was completed successfully or if there was a failure via audio (see Listing 2).
All interactions and state changes are recorded in the DBMS, thereby integrating an event log. The log serves to facilitate the monitoring and identification of errors, thereby ensuring the traceability of events through the documentation of execution irregularities. The instruction cycle persists in accordance with the process queue.
In general, the DMCS allows critical administrative tasks to be performed in a way that ensures efficiency and security. The scope of these management tasks encompasses a range of activities, including access validation, instruction queue processing, connectivity with the MQTT broker, monitoring of the DMCS, feedback to the user, and system traceability.
Table 8 provides a description of the tasks that are managed and the DMCS subsystems that are involved.
One of the principal attributes of the proposed HRI is to afford users the capacity to interact with the cell through a multiplicity of devices. It is also possible for multiple users to interact with several cell components simultaneously. The DMCS facilitates the integration of additional DTs and devices (LMC, TS2, and audio). This enables flexibility and modularity in scaling the platform.
Figure 10 illustrates a deployment diagram for this capability. For example, if a new DT needs to be integrated, the process only requires installing the specified packages within the speech and gesture control subsystems and adding the respective peripheral devices (LMC, TS2, and audio). In other words, in a test case involving a cell with
N components, the system successfully supported concurrent operation by
N users, each employing a distinct input device. This functionality not only illustrates the system’s robustness but also underscores its potential to enhance productivity and adaptability in multi-operator industrial environments.
6. Conclusions
In recent years, human–robot interaction (HRI) technologies have advanced significantly, enabling more intuitive, secure, and efficient collaborations between humans and machines. The manufacturing sector, in particular, has benefited from innovations in gesture recognition, speech control, and biometric authentication, which have the potential to streamline operations, improve accuracy, and enhance safety within industrial environments. This paper examines the development and application of several HRI tools—specifically, the Leap Motion Controller (LMC), Tap Strap 2 (TS2), and the Cellya speech control system—each uniquely contributing to a multimodal, secure, and adaptable system for controlling manufacturing processes.
In conclusion, both the Leap Motion Controller (LMC) and Tap Strap 2 (TS2) have proven to be reliable for gesture recognition in human–robot interaction (HRI) settings. By employing multiple LMC devices in a daisy-chained configuration, we effectively expanded the gesture recognition field beyond a limited viewpoint, while the TS2 calibration tailored recognition accuracy to individual user anatomy and gesture variations. Moreover, integrating a facial recognition module for credential management added a security layer, verifying user identity before granting access.
The gesture recognition accuracy achieved with TS2, supported by the LSTM architecture, reached 97.59% for left hand gestures and 97.56% for right hand gestures, underscoring the model’s pattern recognition effectiveness with IMU data. LMC results were similar, with 95.17% and 95.66% accuracy for left and right hands, respectively.
The speech control system, Cellya, demonstrated effective performance in industrial settings, with fine-tuning significantly lowering the Word Error Rate (WER) from 44.44% to 22.22% and the Character Error Rate (CER) from 19.05% to 11.90%, resulting in more accurate command recognition for manufacturing cell control.
The DMCS integrates these advanced HRI and biometric components to improve operational accuracy, reduce unauthorized access, and ensure efficient task execution. This system is highly adaptable and scalable, with significant potential for enhancing productivity and security in various manufacturing environments. Future research could explore DMCS’s scalability in more complex manufacturing scenarios and its integration with advanced AI and IoT technologies, further enhancing real-time data processing, operational efficiency, and system performance in smart manufacturing contexts.
This multimodal, multi-device approach demonstrates high suitability for PCB manufacturing, especially using gesture and speech recognition.
The results obtained confirm the effectiveness of our HRI platform within smart manufacturing environments. The integrated multimodal system provides the adaptability and processing speed needed to execute complex manufacturing commands with high precision. Both gesture recognition systems considered, LMC and TS2, demonstrated levels of precision sufficient to enhance user interaction with the components. Error rates decreased further after fine-tuning, especially in the Cellya speech recognition subsystem, which now shows improved speech processing capabilities. Task management is enhanced by the DMCS, whose priority-based access control ensures that users perform tasks aligned with their expertise levels and granted security clearances.
The proposed HRI platform demonstrates significant scalability, allowing for the effortless incorporation of new input methods—such as gestures or speech commands—and their dynamic association with manufacturing cell components. This flexibility extends to enabling simultaneous multi-component control, which is particularly advantageous in complex manufacturing scenarios. For instance, a manufacturing cell with N components can be efficiently operated by N users in parallel, each utilizing a dedicated input device. These capabilities underline the platform’s potential to enhance adaptability, streamline operations, and accommodate diverse industrial requirements, positioning it as an accurate and robust solution for modern manufacturing environments.