Search Results (18)

Search Parameters:
Keywords = human-robot speech communication

28 pages, 1791 KB  
Article
Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis
by Sandeep Gupta, Udit Mamodiya and Ahmed J. A. Al-Gburi
Automation 2025, 6(3), 25; https://doi.org/10.3390/automation6030025 - 24 Jun 2025
Cited by 2 | Viewed by 2438
Abstract
This paper describes an innovative wireless mobile robotics control system based on speech recognition, where the ESP32 microcontroller is used to control motors, facilitate Bluetooth communication, and deploy an Android application for the real-time speech recognition logic. With speech processed on the Android device and motor commands handled on the ESP32, the study achieves significant performance gains through distributed architectures while maintaining low latency for feedback control. In experimental tests over a range of 1–10 m, stable 110–140 ms command latencies, with low variation (±15 ms) were observed. The system’s voice and manual button modes both yield over 92% accuracy with the aid of natural language processing, resulting in training requirements being low, and displaying strong performance in high-noise environments. The novelty of this work is evident through an adaptive keyword spotting algorithm for improved recognition performance in high-noise environments and a gradual latency management system that optimizes processing parameters in the presence of noise. By providing a user-friendly, real-time speech interface, this work serves to enhance human–robot interaction when considering future assistive devices, educational platforms, and advanced automated navigation research. Full article
(This article belongs to the Section Robotics and Autonomous Systems)
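The distributed pipeline described above (speech recognized on the phone, short motor commands executed on the ESP32) can be pictured with a brief sketch. The snippet below is a hypothetical phone-side dispatcher, assuming a pyserial-backed Bluetooth serial port; the command bytes, port name, and threshold values are illustrative assumptions rather than the authors' protocol, and the adaptive threshold only gestures at the paper's noise-aware keyword spotting.

```python
import serial  # pyserial; assumes the ESP32 is paired as a Bluetooth serial port

# Hypothetical single-byte protocol; the paper does not publish its command set.
COMMANDS = {"forward": b"F", "back": b"B", "left": b"L", "right": b"R", "stop": b"S"}

def confidence_threshold(noise_rms, base=0.70, slope=0.25):
    """Raise the keyword-acceptance threshold as ambient noise grows
    (a simple stand-in for adaptive keyword spotting in noisy rooms)."""
    return min(0.95, base + slope * noise_rms)

def dispatch(keyword, confidence, noise_rms, port):
    """Send a motor command only if the recognizer is confident enough."""
    if keyword in COMMANDS and confidence >= confidence_threshold(noise_rms):
        port.write(COMMANDS[keyword])
        return True
    return False

if __name__ == "__main__":
    with serial.Serial("/dev/rfcomm0", 115200, timeout=1) as port:  # assumed port name
        dispatch("forward", confidence=0.88, noise_rms=0.3, port=port)
```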

20 pages, 5632 KB  
Article
Filtering Unintentional Hand Gestures to Enhance the Understanding of Multimodal Navigational Commands in an Intelligent Wheelchair
by Kodikarage Sahan Priyanayana, A. G. Buddhika P. Jayasekara and R. A. R. C. Gopura
Electronics 2025, 14(10), 1909; https://doi.org/10.3390/electronics14101909 - 8 May 2025
Viewed by 595
Abstract
Natural human–human communication consists of multiple modalities interacting together. When an intelligent robot or wheelchair is being developed, it is important to consider this aspect. One of the most common modality pairs in multimodal human–human communication is speech–hand gesture interaction. However, not all the hand gestures that can be identified in this type of interaction are useful. Some hand movements can be misinterpreted as useful hand gestures or intentional hand gestures. Failing to filter out these unintentional gestures could lead to severe faulty identifications of important hand gestures. When speech–hand gesture multimodal systems are designed for disabled/elderly users, the above-mentioned issue could result in grave consequences in terms of safety. Gesture identification systems developed for speech–hand gesture systems commonly use hand features and other gesture parameters. Hence, similar gesture features could result in the misidentification of an unintentional gesture as a known gesture. Therefore, in this paper, we have proposed an intelligent system to filter out these unnecessary gestures or unintentional gestures before the gesture identification process in multimodal navigational commands. Timeline parameters such as time lag, gesture range, gesture speed, etc., are used in this filtering system. They are calculated by comparing the vocal command timeline and gesture timeline. For the filtering algorithm, a combination of the Locally Weighted Naive Bayes (LWNB) and K-Nearest Neighbor Distance Weighting (KNNDW) classifiers is proposed. The filtering system performed with an overall accuracy of 94%, sensitivity of 97%, and specificity of 90%, and it had a Cohen’s Kappa value of 88%. Full article
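The filtering step lends itself to a compact illustration: a distance-weighted k-nearest-neighbour classifier over the timeline features named in the abstract (time lag, gesture range, gesture speed). The sketch below is a plain scikit-learn baseline in that spirit, not the authors' LWNB/KNNDW combination, and the feature values are invented.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row: [time lag between vocal command and gesture (s), gesture range (deg), gesture speed (deg/s)]
# Labels: 1 = intentional gesture, 0 = unintentional hand movement. Values are illustrative only.
X = np.array([[0.2, 45.0, 90.0], [0.3, 60.0, 110.0], [1.8, 10.0, 20.0],
              [2.5, 8.0, 15.0], [0.4, 50.0, 95.0], [2.1, 12.0, 25.0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Distance weighting gives nearer neighbours more influence, loosely mirroring KNNDW.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

candidate = np.array([[0.25, 55.0, 100.0]])   # a new speech-gesture pair to filter
print(clf.predict(candidate))                  # 1 -> keep for gesture identification
```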

21 pages, 2476 KB  
Article
Enhancing Human–Agent Interaction via Artificial Agents That Speculate About the Future
by Casey C. Bennett, Young-Ho Bae, Jun-Hyung Yoon, Say Young Kim and Benjamin Weiss
Future Internet 2025, 17(2), 52; https://doi.org/10.3390/fi17020052 - 21 Jan 2025
Viewed by 1483
Abstract
Human communication in daily life entails not only talking about what we are currently doing or will do, but also speculating about future possibilities that may (or may not) occur, i.e., “anticipatory speech”. Such conversations are central to social cooperation and social cohesion in humans. This suggests that such capabilities may also be critical for developing improved speech systems for artificial agents, e.g., human–agent interaction (HAI) and human–robot interaction (HRI). However, to do so successfully, it is imperative that we understand how anticipatory speech may affect the behavior of human users and, subsequently, the behavior of the agent/robot. Moreover, it is possible that such effects may vary across cultures and languages. To that end, we conducted an experiment where a human and autonomous 3D virtual avatar interacted in a cooperative gameplay environment. The experiment included 40 participants, comparing different languages (20 English, 20 Korean), where the artificial agent had anticipatory speech either enabled or disabled. The results showed that anticipatory speech significantly altered the speech patterns and turn-taking behavior of both the human and the agent, but those effects varied depending on the language spoken. We discuss how the use of such novel communication forms holds potential for enhancing HAI/HRI, as well as the development of mixed reality and virtual reality interactive systems for human users. Full article
(This article belongs to the Special Issue Human-Centered Artificial Intelligence)

17 pages, 1064 KB  
Review
Vocal Communication Between Cobots and Humans to Enhance Productivity and Safety: Review and Discussion
by Yuval Cohen, Maurizio Faccio and Shai Rozenes
Appl. Sci. 2025, 15(2), 726; https://doi.org/10.3390/app15020726 - 13 Jan 2025
Cited by 3 | Viewed by 2016
Abstract
This paper explores strategies for fostering efficient vocal communication and collaboration between human workers and collaborative robots (cobots) in assembly processes. Vocal communication supports the worker's division of attention, as it frees the worker's visual attention and hands, which remain dedicated to the task at hand. Speech generation and speech recognition are prerequisites for effective vocal communication. This study focuses on cobot assistive tasks, where the human is in charge of the work and performs the main tasks while the cobot assists the worker in various peripheral jobs, such as bringing tools, parts, or materials, returning or disposing of them, or screwing or packaging the products. A nuanced understanding is necessary for optimizing human–robot interactions and enhancing overall productivity and safety. Through a comprehensive review of the relevant literature and an illustrative example with worked scenarios, this manuscript identifies key factors influencing successful vocal communication and proposes practical strategies for implementation. Full article
(This article belongs to the Special Issue Artificial Intelligence Applications in Industry)
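As a toy illustration of the kind of vocal command vocabulary such assistive scenarios imply, the sketch below maps a recognized worker utterance to a cobot action and target with simple keyword matching; the action and object lists are assumptions made for illustration, not taken from the review.

```python
# Minimal keyword-based parser for the assistive requests the review discusses
# (bring/return/dispose of tools, parts, or materials). Vocabulary is illustrative.
ACTIONS = {"bring": "BRING", "fetch": "BRING", "return": "RETURN", "dispose": "DISPOSE"}
OBJECTS = {"tool", "part", "material", "screwdriver", "package"}

def parse_request(utterance: str):
    """Map a recognized worker utterance to an (action, object) pair, or None."""
    tokens = utterance.lower().split()
    action = next((ACTIONS[t] for t in tokens if t in ACTIONS), None)
    target = next((t for t in tokens if t in OBJECTS), None)
    return (action, target) if action and target else None

print(parse_request("please bring the screwdriver"))  # ('BRING', 'screwdriver')
```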

25 pages, 6212 KB  
Article
Qualitative Analysis of Responses in Estimating Older Adults' Cognitive Functioning in Spontaneous Speech: Comparison of Questions Asked by AI Agents and Humans
by Toshiharu Igarashi, Katsuya Iijima, Kunio Nitta and Yu Chen
Healthcare 2024, 12(21), 2112; https://doi.org/10.3390/healthcare12212112 - 23 Oct 2024
Viewed by 1632
Abstract
Background/Objectives: Artificial Intelligence (AI) technology is gaining attention for its potential in cognitive function assessment and intervention. AI robots and agents can offer continuous dialogue with the elderly, helping to prevent social isolation and support cognitive health. Speech-based evaluation methods are promising as they reduce the burden on elderly participants. AI agents could replace human questioners, offering efficient and consistent assessments. However, existing research lacks sufficient comparisons of elderly speech content when interacting with AI versus human partners, and detailed analyses of factors like cognitive function levels and dialogue partner effects on speech elements such as proper nouns and fillers. Methods: This study investigates how elderly individuals’ cognitive functions influence their communication patterns with both human and AI conversational partners. A total of 34 older people (12 men and 22 women) living in the community were selected from a silver human resource centre and day service centre in Tokyo. Cognitive function was assessed using the Mini-Mental State Examination (MMSE), and participants engaged in semi-structured daily conversations with both human and AI partners. Results: The study examined the frequency of fillers, proper nouns, and “listen back” in conversations with AI and humans. Results showed that participants used more fillers in human conversations, especially those with lower cognitive function. In contrast, proper nouns were used more in AI conversations, particularly by those with higher cognitive function. Participants also asked for explanations more often in AI conversations, especially those with lower cognitive function. These findings highlight differences in conversation patterns based on cognitive function and the conversation partner being either AI or human. Conclusions: These results suggest that there are differences in conversation patterns depending on the cognitive function of the participants and whether the conversation partner is a human or an AI. This study aims to provide new insights into the effective use of AI agents in dialogue with the elderly, contributing to the improvement of elderly welfare. Full article
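The speech-element counts reported here (fillers and "listen back" clarification requests) reduce to simple transcript bookkeeping. The sketch below shows one way to tally them for English-like transcripts; the token lists are illustrative assumptions (the study's conversations are presumably in Japanese), and proper-noun counting would additionally require a part-of-speech tagger.

```python
import re
from collections import Counter

# Illustrative token lists; the real filler and clarification vocabularies would differ
# for the study's transcripts.
FILLERS = {"um", "uh", "er", "well"}
LISTEN_BACK = {"pardon", "what do you mean", "could you repeat"}

def speech_counts(turns):
    """Count fillers and clarification requests across one participant's turns."""
    counts = Counter()
    for turn in turns:
        lowered = turn.lower()
        tokens = re.findall(r"[a-z']+", lowered)
        counts["fillers"] += sum(1 for t in tokens if t in FILLERS)
        counts["listen_back"] += int(any(p in lowered for p in LISTEN_BACK))
    return counts

turns = ["Um, I went to the market yesterday.", "Pardon? Could you repeat that?"]
print(speech_counts(turns))  # Counter({'fillers': 1, 'listen_back': 1})
```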

23 pages, 4654 KB  
Article
Effective Acoustic Model-Based Beamforming Training for Static and Dynamic HRI Applications
by Alejandro Luzanto, Nicolás Bohmer, Rodrigo Mahu, Eduardo Alvarado, Richard M. Stern and Néstor Becerra Yoma
Sensors 2024, 24(20), 6644; https://doi.org/10.3390/s24206644 - 15 Oct 2024
Cited by 1 | Viewed by 2658
Abstract
Human–robot collaboration will play an important role in the fourth industrial revolution in applications related to hostile environments, mining, industry, forestry, education, natural disaster and defense. Effective collaboration requires robots to understand human intentions and tasks, which involves advanced user profiling. Voice-based communication, rich in complex information, is key to this. Beamforming, a technology that enhances speech signals, can help robots extract semantic, emotional, or health-related information from speech. This paper describes the implementation of a system that provides substantially improved signal-to-noise ratio (SNR) and speech recognition accuracy to a moving robotic platform for use in human–robot interaction (HRI) applications in static and dynamic contexts. This study focuses on training deep learning-based beamformers using acoustic model-based multi-style training with measured room impulse responses (RIRs). The results show that this approach outperforms training with simulated RIRs or matched measured RIRs, especially in dynamic conditions involving robot motion. The findings suggest that training with a broad range of measured RIRs is sufficient for effective HRI in various environments, making additional data recording or augmentation unnecessary. This research demonstrates that deep learning-based beamforming can significantly improve HRI performance, particularly in challenging acoustic environments, surpassing traditional beamforming methods. Full article
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)
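The abstract contrasts learned beamformers with traditional ones; the simplest traditional variant, delay-and-sum, is sketched below for a linear microphone array using integer-sample delays. The array geometry, steering angle, and signals are placeholder assumptions, and this is a baseline illustration, not the paper's deep learning-based beamformer.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, fs, c=343.0):
    """Classical delay-and-sum beamformer for a linear array.

    signals: (n_mics, n_samples) array of microphone recordings.
    mic_positions: (n_mics,) mic coordinates along the array axis in metres.
    angle_deg: steering direction measured from broadside.
    fs: sample rate in Hz; c: speed of sound in m/s.
    """
    theta = np.deg2rad(angle_deg)
    delays = mic_positions * np.sin(theta) / c   # plane-wave arrival delay at each mic
    delays -= delays.min()                       # keep all shifts non-negative
    shifts = np.round(delays * fs).astype(int)   # integer-sample approximation
    aligned = np.stack([np.roll(sig, -shift) for sig, shift in zip(signals, shifts)])
    return aligned.mean(axis=0)                  # constructive sum toward the target

if __name__ == "__main__":
    fs, n = 16000, 16000
    rng = np.random.default_rng(0)
    mics = np.array([0.00, 0.05, 0.10, 0.15])    # 4 mics, 5 cm spacing (assumed)
    noisy = rng.standard_normal((4, n))          # placeholder recordings
    enhanced = delay_and_sum(noisy, mics, angle_deg=30.0, fs=fs)
```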

29 pages, 6331 KB  
Article
Multimodal Affective Communication Analysis: Fusing Speech Emotion and Text Sentiment Using Machine Learning
by Diego Resende Faria, Abraham Itzhak Weinberg and Pedro Paulo Ayrosa
Appl. Sci. 2024, 14(15), 6631; https://doi.org/10.3390/app14156631 - 29 Jul 2024
Cited by 10 | Viewed by 4038
Abstract
Affective communication, encompassing verbal and non-verbal cues, is crucial for understanding human interactions. This study introduces a novel framework for enhancing emotional understanding by fusing speech emotion recognition (SER) and sentiment analysis (SA). We leverage diverse features and both classical and deep learning models, including Gaussian naive Bayes (GNB), support vector machines (SVMs), random forests (RFs), multilayer perceptron (MLP), and a 1D convolutional neural network (1D-CNN), to accurately discern and categorize emotions in speech. We further extract text sentiment from speech-to-text conversion, analyzing it using pre-trained models like bidirectional encoder representations from transformers (BERT), generative pre-trained transformer 2 (GPT-2), and logistic regression (LR). To improve individual model performance for both SER and SA, we employ an extended dynamic Bayesian mixture model (DBMM) ensemble classifier. Our most significant contribution is the development of a novel two-layered DBMM (2L-DBMM) for multimodal fusion. This model effectively integrates speech emotion and text sentiment, enabling the classification of more nuanced, second-level emotional states. Evaluating our framework on the EmoUERJ (Portuguese) and ESD (English) datasets, the extended DBMM achieves accuracy rates of 96% and 98% for SER, 85% and 95% for SA, and 96% and 98% for combined emotion classification using the 2L-DBMM, respectively. Our findings demonstrate the superior performance of the extended DBMM for individual modalities compared to individual classifiers and the 2L-DBMM for merging different modalities, highlighting the value of ensemble methods and multimodal fusion in affective communication analysis. The results underscore the potential of our approach in enhancing emotional understanding with broad applications in fields like mental health assessment, human–robot interaction, and cross-cultural communication. Full article
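A minimal way to picture the late-fusion step is an accuracy-weighted combination of the per-modality class posteriors. The sketch below is that generic baseline, not the authors' DBMM or two-layered 2L-DBMM (which update their weights dynamically); all probabilities and weights are invented.

```python
import numpy as np

# Per-modality posteriors over four example emotion classes for a single utterance.
# Values are invented; in the paper they would come from the SER and SA models.
classes = ["neutral", "happy", "sad", "angry"]
p_speech = np.array([0.10, 0.60, 0.20, 0.10])   # speech emotion recognition output
p_text   = np.array([0.05, 0.45, 0.40, 0.10])   # text sentiment mapped onto the same classes

# Generic late fusion: weight each modality by its validation accuracy and renormalize.
w_speech, w_text = 0.96, 0.85                    # e.g. per-modality validation accuracies
fused = w_speech * p_speech + w_text * p_text
fused /= fused.sum()

print(classes[int(np.argmax(fused))])            # 'happy'
```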

18 pages, 4295 KB  
Article
Deep Learning-Based Cost-Effective and Responsive Robot for Autism Treatment
by Aditya Singh, Kislay Raj, Teerath Kumar, Swapnil Verma and Arunabha M. Roy
Drones 2023, 7(2), 81; https://doi.org/10.3390/drones7020081 - 23 Jan 2023
Cited by 110 | Viewed by 8270
Abstract
Recent studies state that, for a person with autism spectrum disorder, learning and improvement is often seen in environments where technological tools are involved. A robot is an excellent tool to be used in therapy and teaching. It can transform teaching methods, not just in classrooms but also in in-house clinical practice. With the rapid advancement of deep learning techniques, robots have become more capable of handling human behaviour. In this paper, we present a cost-efficient, socially designed robot called 'Tinku', developed to assist in teaching children with special needs. 'Tinku' is low cost but full of features and able to produce human-like expressions. Its design is inspired by the widely recognized animated character 'WALL-E'. Its capabilities include offline speech processing and computer vision, using lightweight object detection models such as Yolo v3-tiny and the single shot detector (SSD), for obstacle avoidance, non-verbal communication, expressing emotions in an anthropomorphic way, and related tasks. It uses an onboard deep learning technique to localize objects in the scene and uses that information for semantic perception. We have developed several lessons for training using these features; a sample lesson about brushing is discussed to show the robot's capabilities. Tinku combines an appealing appearance with a rich feature set and coordinated management of all onboard processes. It was developed under the supervision of clinical experts, and its conditions for application have been taken into account. A small survey on its appearance is also discussed. More importantly, it was tested with young children to assess acceptance of the technology and compatibility in terms of voice interaction. It helps autistic children using state-of-the-art deep learning models. Autism spectrum disorders are being identified increasingly often in today's world, and studies show that children tend to interact more comfortably with technology than with a human instructor. To meet this demand, we present a cost-effective solution in the form of a robot with a set of common lessons for training a child affected by autism. Full article
(This article belongs to the Topic Artificial Intelligence in Sensors)
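The lightweight detectors named in the abstract can be run with OpenCV's DNN module. The sketch below is a generic YOLOv3-tiny inference loop, not Tinku's actual pipeline; the standard Darknet file names, the placeholder image, the input size, and the confidence threshold are assumptions.

```python
import cv2
import numpy as np

# Generic YOLOv3-tiny inference with OpenCV's DNN module; the file names are the
# standard Darknet releases and stand in for whatever model the robot actually ships.
net = cv2.dnn.readNetFromDarknet("yolov3-tiny.cfg", "yolov3-tiny.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect(frame, conf_threshold=0.5):
    """Return (class_id, confidence, centre_x, centre_y) tuples for detected objects."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    h, w = frame.shape[:2]
    detections = []
    for output in net.forward(layer_names):
        for row in output:                      # row = [cx, cy, w, h, objectness, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                detections.append((class_id, confidence, row[0] * w, row[1] * h))
    return detections

frame = cv2.imread("scene.jpg")                 # placeholder input image
print(detect(frame))
```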

28 pages, 2109 KB  
Review
Deep Learning for Intelligent Human–Computer Interaction
by Zhihan Lv, Fabio Poiesi, Qi Dong, Jaime Lloret and Houbing Song
Appl. Sci. 2022, 12(22), 11457; https://doi.org/10.3390/app122211457 - 11 Nov 2022
Cited by 102 | Viewed by 24003
Abstract
In recent years, gesture recognition and speech recognition, as important input methods in Human–Computer Interaction (HCI), have been widely used in the field of virtual reality. In particular, with the rapid development of deep learning, artificial intelligence, and other computer technologies, gesture recognition and speech recognition have achieved breakthrough research progress. The search platforms used in this work were mainly Google Academic and the Web of Science literature database. Using keywords related to HCI and deep learning, such as "intelligent HCI", "speech recognition", "gesture recognition", and "natural language processing", nearly 1000 studies were selected. Then, nearly 500 studies were retained on the basis of their research methods, and 100 studies were finally selected as the research content of this work after screening by publication year (2019–2022). First, the current situation of the HCI intelligent system is analyzed, the realization of gesture interaction and voice interaction in HCI is summarized, and the advantages brought by deep learning are singled out for study. Then, the core concepts of gesture interaction are introduced and the progress of gesture recognition and speech recognition interaction is analyzed. Furthermore, the representative applications of gesture recognition and speech recognition interaction are described. Finally, the current HCI in the direction of natural language processing is investigated. The results show that the combination of intelligent HCI and deep learning is widely applied in gesture recognition, speech recognition, emotion recognition, and intelligent robotics. A wide variety of recognition methods have been proposed in related research fields and verified by experiments. Compared with interactive methods without deep learning, high recognition accuracy was achieved. In Human–Machine Interfaces (HMIs) with voice support, context plays an important role in improving user interfaces. Whether it is voice search, mobile communication, or children's speech recognition, HCI combined with deep learning can maintain better robustness. The combination of convolutional neural networks and long short-term memory networks can greatly improve the accuracy and precision of action recognition. Therefore, in the future, the application field of HCI will involve more industries, and greater prospects are expected. Full article
(This article belongs to the Special Issue Virtual Reality, Digital Twins and Metaverse)
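The review's closing claim, that combining convolutional and long short-term memory layers improves action recognition, can be pictured with a tiny Keras model. The input shape and class count below are arbitrary placeholders, and the model is a generic illustration rather than any specific architecture surveyed.

```python
import tensorflow as tf

# Toy Conv1D + LSTM classifier of the kind the review credits with improved action
# recognition: 100 time steps of 40-dimensional features, 10 gesture classes
# (all dimensions are placeholder assumptions).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 40)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),  # local motion patterns
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(64),                                      # longer-range temporal context
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```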

23 pages, 8614 KB  
Article
Multimodal Interface for Human–Robot Collaboration
by Samu Rautiainen, Matteo Pantano, Konstantinos Traganos, Seyedamir Ahmadi, José Saenz, Wael M. Mohammed and Jose L. Martinez Lastra
Machines 2022, 10(10), 957; https://doi.org/10.3390/machines10100957 - 20 Oct 2022
Cited by 14 | Viewed by 5135
Abstract
Human–robot collaboration (HRC) is one of the key aspects of Industry 4.0 (I4.0) and requires intuitive modalities for humans to communicate seamlessly with robots, such as speech, touch, or bodily gestures. However, utilizing these modalities is usually not enough to ensure a good user experience and a consideration of the human factors. Therefore, this paper presents a software component, Multi-Modal Offline and Online Programming (M2O2P), which considers such characteristics and establishes a communication channel with a robot with predefined yet configurable hand gestures. The solution was evaluated within a smart factory use case in the Smart Human Oriented Platform for Connected Factories (SHOP4CF) EU project. The evaluation focused on the effects of the gesture personalization on the perceived workload of the users using NASA-TLX and the usability of the component. The results of the study showed that the personalization of the gestures reduced the physical and mental workload and was preferred by the participants, while overall the workload of the tasks did not significantly differ. Furthermore, the high system usability scale (SUS) score of the application, with a mean of 79.25, indicates the overall usability of the component. Additionally, the gesture recognition accuracy of M2O2P was measured as 99.05%, which is similar to the results of state-of-the-art applications. Full article
(This article belongs to the Special Issue Intelligent Factory 4.0: Advanced Production and Automation Systems)
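For readers unfamiliar with the usability metric reported here, the System Usability Scale score is computed from ten 1-5 Likert items: odd items contribute (response - 1), even items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0-100 score. The sketch below reproduces that arithmetic with invented responses; it is background on the metric, not data from the study.

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 Likert responses."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items positive, even items negative
    return total * 2.5

# Invented responses from one participant; the result lands near the paper's mean of 79.25.
print(sus_score([5, 2, 4, 1, 4, 2, 5, 2, 4, 2]))       # 82.5
```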

16 pages, 875 KB  
Article
When Robots Fail—A VR Investigation on Caregivers’ Tolerance towards Communication and Processing Failures
by Kim Klüber and Linda Onnasch
Robotics 2022, 11(5), 106; https://doi.org/10.3390/robotics11050106 - 7 Oct 2022
Cited by 3 | Viewed by 3097
Abstract
Robots are increasingly used in healthcare to support caregivers in their daily work routines. To ensure an effortless and easy interaction between caregivers and robots, communication via natural language is expected from robots. However, robotic speech bears a large potential for technical failures, which includes processing and communication failures. It is therefore necessary to investigate how caregivers perceive and respond to robots with erroneous communication. We recruited thirty caregivers, who interacted in a virtual reality setting with a robot. It was investigated whether different kinds of failures are more likely to be forgiven with technical or human-like justifications. Furthermore, we determined how tolerant caregivers are with a robot constantly returning a process failure and whether this depends on the robot’s response pattern (constant vs. variable). Participants showed the same forgiveness towards the two justifications. However, females liked the human-like justification more and males liked the technical justification more. Providing justifications with any reasonable content seems sufficient to achieve positive effects. Robots with a constant response pattern were liked more, although both patterns achieved the same tolerance threshold from caregivers, which was around seven failed requests. Due to the experimental setup, the tolerance for communication failures was probably increased and should be adjusted in real-life situations. Full article
(This article belongs to the Special Issue Communication with Social Robots)

11 pages, 4223 KB  
Article
Expressing Robot Personality through Talking Body Language
by Unai Zabala, Igor Rodriguez, José María Martínez-Otzeta and Elena Lazkano
Appl. Sci. 2021, 11(10), 4639; https://doi.org/10.3390/app11104639 - 19 May 2021
Cited by 22 | Viewed by 5194
Abstract
Social robots must master the nuances of human communication as a means to convey an effective message and generate trust. It is well known that non-verbal cues are very important in human interactions, and therefore a social robot should produce body language coherent with its discourse. In this work, we report on a system that endows a humanoid robot with the ability to adapt its body language according to the sentiment of its speech. A combination of talking beat gestures with emotional cues such as eye lighting, body posture, and voice intonation and volume permits a rich variety of behaviors. The developed approach is not purely reactive, and it makes it easy to assign a kind of personality to the robot. We present several videos of the robot in two different scenarios, showing discreet and histrionic personalities. Full article
(This article belongs to the Special Issue Social Robotics: Theory, Methods and Applications)

23 pages, 1829 KB  
Article
Integration of Industrially-Oriented Human-Robot Speech Communication and Vision-Based Object Recognition
by Adam Rogowski, Krzysztof Bieliszczuk and Jerzy Rapcewicz
Sensors 2020, 20(24), 7287; https://doi.org/10.3390/s20247287 - 18 Dec 2020
Cited by 11 | Viewed by 3124
Abstract
This paper presents a novel method for integration of industrially-oriented human-robot speech communication and vision-based object recognition. Such integration is necessary to provide context for task-oriented voice commands. Context-based speech communication is easier, the commands are shorter, hence their recognition rate is higher. In recent years, significant research was devoted to integration of speech and gesture recognition. However, little attention was paid to vision-based identification of objects in industrial environment (like workpieces or tools) represented by general terms used in voice commands. There are no reports on any methods facilitating the abovementioned integration. Image and speech recognition systems usually operate on different data structures, describing reality on different levels of abstraction, hence development of context-based voice control systems is a laborious and time-consuming task. The aim of our research was to solve this problem. The core of our method is extension of Voice Command Description (VCD) format describing syntax and semantics of task-oriented commands, as well as its integration with Flexible Editable Contour Templates (FECT) used for classification of contours derived from image recognition systems. To the best of our knowledge, it is the first solution that facilitates development of customized vision-based voice control applications for industrial robots. Full article
(This article belongs to the Section Intelligent Sensors)

26 pages, 17277 KB  
Article
Fusing Hand Postures and Speech Recognition for Tasks Performed by an Integrated Leg–Arm Hexapod Robot
by Jing Qi, Xilun Ding, Weiwei Li, Zhonghua Han and Kun Xu
Appl. Sci. 2020, 10(19), 6995; https://doi.org/10.3390/app10196995 - 7 Oct 2020
Cited by 5 | Viewed by 3259
Abstract
Hand postures and speech are convenient means of communication for humans and can be used in human–robot interaction. Based on the structural and functional characteristics of our integrated leg–arm hexapod robot, which is intended to perform reconnaissance and rescue tasks in public security applications, a method linking robot movement and manipulation through the visual and auditory channels is proposed, and a system based on hand posture and speech recognition is described. The developed system contains a speech module, a hand posture module, a fusion module, a mechanical structure module, a control module, a path planning module, and a 3D SLAM (Simultaneous Localization and Mapping) module. In this system, three modes, i.e., the hand posture mode, the speech mode, and a combination of the hand posture and speech modes, are used in different situations. The hand posture mode is used for reconnaissance tasks, and the speech mode is used to query the path and control the movement and manipulation of the robot. The combination of the two modes can be used to avoid ambiguity during interaction. A semantic understanding-based task slot structure is developed using the visual and auditory channels. In addition, a method of task planning based on answer-set programming is developed, and a network-based data interaction system is designed to control movements of the robot remotely over a wide area network using Chinese instructions. Experiments were carried out to verify the performance of the proposed system. Full article
(This article belongs to the Special Issue Novel Approaches and Applications in Ergonomic Design)
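The task-slot idea, where speech supplies the action and a hand posture supplies or disambiguates the target, can be sketched as a simple slot merge. The slot names and example values below are illustrative assumptions, not the authors' semantic-understanding structure.

```python
# Generic illustration of filling a task slot structure from two channels: speech
# supplies the intended action, a recognized hand posture supplies the target.
def fuse_command(speech_intent, hand_posture):
    slots = {"action": speech_intent.get("action"), "target": None}
    # Prefer an explicitly spoken target; fall back to what the posture points at.
    slots["target"] = speech_intent.get("target") or hand_posture.get("pointed_object")
    if slots["action"] and slots["target"]:
        return slots
    return None   # ambiguous command: ask the operator to repeat or combine modes

speech = {"action": "inspect", "target": None}          # e.g. "inspect that"
gesture = {"pointed_object": "doorway_2"}               # pointing posture
print(fuse_command(speech, gesture))                    # {'action': 'inspect', 'target': 'doorway_2'}
```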

23 pages, 2483 KB  
Article
Generation of Head Movements of a Robot Using Multimodal Features of Peer Participants in Group Discussion Conversation
by Hung-Hsuan Huang, Seiya Kimura, Kazuhiro Kuwabara and Toyoaki Nishida
Multimodal Technol. Interact. 2020, 4(2), 15; https://doi.org/10.3390/mti4020015 - 29 Apr 2020
Cited by 4 | Viewed by 3827
Abstract
In recent years, companies have been seeking communication skills from their employees. Increasingly more companies have adopted group discussions during their recruitment process to evaluate the applicants’ communication skills. However, the opportunity to improve communication skills in group discussions is limited because of the lack of partners. To solve this issue as a long-term goal, the aim of this study is to build an autonomous robot that can participate in group discussions, so that its users can repeatedly practice with it. This robot, therefore, has to perform humanlike behaviors with which the users can interact. In this study, the focus was on the generation of two of these behaviors regarding the head of the robot. One is directing its attention to either of the following targets: the other participants or the materials placed on the table. The second is to determine the timings of the robot’s nods. These generation models are considered in three situations: when the robot is speaking, when the robot is listening, and when no participant including the robot is speaking. The research question is: whether these behaviors can be generated end-to-end from and only from the features of peer participants. This work is based on a data corpus containing 2.5 h of the discussion sessions of 10 four-person groups. Multimodal features, including the attention of other participants, voice prosody, head movements, and speech turns extracted from the corpus, were used to train support vector machine models for the generation of the two behaviors. The performances of the generation models of attentional focus were in an F-measure range between 0.4 and 0.6. The nodding model had an accuracy of approximately 0.65. Both experiments were conducted in the setting of leave-one-subject-out cross validation. To measure the perceived naturalness of the generated behaviors, a subject experiment was conducted. In the experiment, the proposed models were compared. They were based on a data-driven method with two baselines: (1) a simple statistical model based on behavior frequency and (2) raw experimental data. The evaluation was based on the observation of video clips, in which one of the subjects was replaced by a robot performing head movements in the above-mentioned three conditions. The experimental results showed that there was no significant difference from original human behaviors in the data corpus and proved the effectiveness of the proposed models. Full article
(This article belongs to the Special Issue Multimodal Conversational Interaction and Interfaces)
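The modelling setup described above, support vector machines over multimodal features evaluated with leave-one-subject-out cross-validation, maps directly onto scikit-learn. The sketch below uses random placeholder features and group-wise splitting as a stand-in for the corpus; it illustrates the evaluation protocol, not the paper's actual features or results.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder multimodal features (e.g. peers' attention, prosody, head movement, turn state)
# for frames from 10 four-person groups; labels: 1 = robot should nod now, 0 = no nod.
X = rng.standard_normal((400, 12))
y = rng.integers(0, 2, size=400)
groups = np.repeat(np.arange(10), 40)          # which discussion group each frame came from

# Leave-one-subject(-group)-out evaluation, mirroring the protocol named in the abstract.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneGroupOut(), groups=groups)
print(scores.mean())
```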
