Article

Robot Control Platform for Multimodal Interactions with Humans Based on ChatGPT

by Jingtao Qu 1,†, Mateusz Jarosz 2,† and Bartlomiej Sniezynski 2,*,†

1 Engineering School of Digital Technologies, EFREI, Paris-Panthéon-Assas University, 30-32 Avenue de la République, 94800 Villejuif, France
2 Institute of Computer Science, Faculty of Computer Science, AGH University of Krakow, al. A. Mickiewicza 30, 30-059 Krakow, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(17), 8011; https://doi.org/10.3390/app14178011
Submission received: 26 July 2024 / Revised: 1 September 2024 / Accepted: 5 September 2024 / Published: 7 September 2024

Abstract
This paper presents the architecture of a multimodal human–robot interaction control platform that leverages the advanced language capabilities of ChatGPT to facilitate more natural and engaging conversations between humans and robots. Implemented on the Pepper humanoid robot, the platform aims to enhance communication by providing a richer and more intuitive interface. The motivation behind this study is to enhance robot performance in human interaction through cutting-edge natural language processing technology, thereby improving public attitudes toward robots, fostering the development and application of robotic technology, and reducing the negative attitudes often associated with human–robot interactions. To validate the system, we conducted experiments measuring participants’ negative attitudes towards robots scale (NARS) and robot anxiety scale (RAS) scores before and after interacting with the robot. Statistical analysis of the data revealed a significant improvement in the participants’ attitudes and a notable reduction in anxiety following the interaction, indicating that the system holds promise for fostering more positive human–robot relationships.

1. Introduction

With the progressive evolution of artificial intelligence (AI), our lives have witnessed unprecedented convenience and enrichment. The discourse on human–robot interaction has a rich history, dating back to the emergence of Eric, the world’s first robot, in 1928 [1], and the showcasing of Elektro at the 1939 World’s Fair in New York City [2]. The exploration of robotics has continuously expanded with technological advancements.
By the early 21st century, robots had begun demonstrating the capability to engage in straightforward dialogues and communication with humans. Today, robots find application across diverse domains: industrial robots collaborate with humans in executing manufacturing tasks [3]; medical robots aid in the restoration of hand and finger movement while providing elderly care and companionship [4]; social robots provide companionship and care for the elderly, as well as aid children with autism [5]; and ubiquitous self-driving vehicles [6], unmanned aerial vehicles (UAVs), and unmanned underwater vehicles (UUVs) exemplify their versatility.
The Chat Generative Pre-trained Transformer (ChatGPT), introduced in November 2022, has revolutionized natural language processing, supporting natural language dialogues and assisting with tasks, answering questions, and generating content. With over 100 million users [7] and 1.0 billion monthly visits by February of the following year [8,9], ChatGPT has become a significant player in the AI landscape.
As we explore the convergence of AI and robotics, we must consider the technological evolution, societal implications, and ethical considerations [1]. The journey from early robots like Eric to the launch of ChatGPT highlights a continuum of innovation that enriches our lives and raises questions about human–robot interaction and the role of intelligent machines in shaping our future.
Despite these successes, challenges persist. While contemporary chatbots excel at standardized questions, their database capacity limits their ability to handle undefined queries. To address this, we envision integrating ChatGPT with a robust interactive system. This initiative aims to fuse natural language processing and intelligent dialogue with robotic systems, creating more intelligent and adaptable machine partners.
The goal is to elevate human–robot interaction by leveraging ChatGPT’s potent engine for comprehending and generating natural language at a sophisticated level. This integration aims to enable robots to communicate more effectively, ensuring a more accurate understanding of user inputs and responding in a natural, emotionally resonant, and personalized manner.
As we embark on this transformative journey, it is crucial to delve into the multifaceted dimensions of the convergence between artificial intelligence, natural language processing, and robotics. The ethical considerations inherent in creating intelligent machines that interact seamlessly with humans demand careful scrutiny. Societal implications, ranging from the potential impact on employment to the redefinition of social norms, necessitate thoughtful exploration. Moreover, the evolving role of robots in our daily lives raises questions about autonomy, accountability, and the ethical frameworks that should guide their interactions with humans.
The societal impact of human–robot interaction extends beyond specific domains to shape the way we perceive technology, relationships, and our roles in an increasingly automated world. As intelligent machines become more integrated into our daily lives [10], fostering a sense of trust and understanding between humans and robots becomes paramount. Ethical considerations such as transparency in AI decision-making, the mitigation of biases, and the establishment of clear communication channels between humans and machines are critical in building this trust.
The convergence of AI, natural language processing, and robotics brings forth questions about consciousness, autonomy, and the ethical treatment of intelligent machines. As machines exhibit more sophisticated behaviors and responses, the ethical considerations extend to issues of machine rights, responsibilities, and the potential development of machine consciousness. While these scenarios may currently reside in the realm of speculative fiction, ongoing discussions and ethical frameworks are essential to navigate potential future developments responsibly.
In conclusion, the integration of artificial intelligence, natural language processing, and robotics represents a paradigm shift in our interaction with intelligent machines. The combination of ChatGPT with robotic systems demonstrates the potential for creating more intelligent, adaptable, and emotionally resonant machine partners.
The main contributions of this paper are the architecture of the robot control platform for multimodal interactions with humans based on ChatGPT, its implementation on humanoid robot Pepper, and experimental verification of the system based on analysis of anxiety and attitudes towards robots before and after interaction with the implemented system.

2. Related Research

The research at Opole University of Technology [11] has focused on enhancing Pepper’s capabilities for front-of-house applications. In this study, advanced modules, including an external video module and a speech-to-text recognition module, were integrated to facilitate human–computer interaction. The research demonstrated how these enhancements improved Pepper’s communication skills and enabled the implementation of an automatic reception control system, addressing challenges such as mechanical noise interference. This work highlights the significance of external speech recognition modules in enabling natural language interactions, emphasizing the need for noise interference mitigation. This study, therefore, lays a foundation for complex human–machine collaboration systems, demonstrating the potential for integrating robots into everyday tasks.
Similarly, a collaboration between Scaled Foundations and the Microsoft Autonomous Systems and Robotics Research group [12] introduced an innovative strategy for seamlessly integrating ChatGPT into robotics workflows. This research utilized large language models (LLMs) and multimodal modeling for tasks like visual language navigation and human–robot interaction. The deployment pipeline incorporated diverse prompting technologies supported by a high-level function library, illustrating ChatGPT’s proficiency in various robotic tasks. The iterative development cycle, which involved user feedback, ensured continuous refinement and optimization for effective deployment. This approach suggests a significant leap toward integrating advanced AI tools into practical robotic applications.
Furthermore, a study from Stanford University [13] examined the use of generative agents as platforms for rehearsing interpersonal communication. These agents, which engage with users through avatar-based interactions, were found to improve communication skills and offer significant potential for applications in social science research. By interacting with users in natural language, generative agents can store and recall experiences, enabling them to dynamically plan and adapt their behavior. This dual function—both as a tool for testing social theories and as a practical aid for communication skill development—positions generative agents as valuable resources for studying and navigating complex social dynamics. Their ability to facilitate more natural dialogue in human–computer interactions also suggests broader applications for enhancing communication and understanding between humans and intelligent systems across various contexts.
In the era of Industry 4.0 [14], the integration of AI, IoT, and big data is transforming manufacturing into “smart factories.” However, the adoption of AI in these environments faces significant challenges. One of the main obstacles is access to large, diverse datasets that are necessary to effectively train AI models [15]. The complexity of industrial processes complicates data collection, leading to challenges in data volume and diversity [16]. In addition, the introduction of AI into industrial environments raises concerns about the reliability and robustness of these systems, as modern industrial control systems are inherently complex, less deterministic, and therefore, more difficult to predict and control [17]. These issues highlight the need for rigorous testing and validation to ensure the safe and effective deployment of AI technologies in industrial environments.
ChatGPT can be used on many different levels and in different ways in robotics and human–robot interaction. It can be used as a natural language processing tool with powerful question and answer (Q&A) capabilities, available to end users as an interface, similar to our approach. On the other hand, in [18], Vemprala et al. propose using ChatGPT as a tool for programmers to streamline robot programming and, ultimately, to support non-technical users in programming robots. They propose creating a high-level library with simple, self-explanatory function names, also explaining them to the LLM if needed; ChatGPT then uses these functions to control the robot. The authors also offer useful guidance on how to prompt ChatGPT and what data to provide (e.g., the weight of a moving object), and they explain what to do when the generated code is not perfect. This approach was tested on several tasks, such as controlling a drone through high-level loops with an operator, moving colored blocks to form the Microsoft logo, or navigating a robot in a new environment.
Long-term automatic task planning with AI models can be challenging. In [19], the authors proposed using ChatGPT to convert a high-level task description, given to a robot by a user in natural language, into low-level executable code. A user command can be classified into two categories: a high-level task (e.g., “heat me up a dinner”) or a low-level subtask. Low-level tasks are created from a base library of preprogrammed functions and a dynamic movement primitive (DMP) library with motion sequences constructed from user input. The proposed framework was tested on simple tasks, such as stacking items, and more complex ones, like roasting an apple, which is decomposed into “open the oven”, “put the apple into the oven”, “close the oven”, and “power on the oven”, executed in this order. Simple tasks achieved a high execution success rate of more than 80%, while more complex ones reached only 56.5%. This suggests that, with further refinement, such systems could significantly enhance the capabilities of autonomous robots; in the future, this approach could be a valuable extension of our work.
The integration of large language models (LLMs) has revolutionized robotics, enabling robots to communicate, understand, and reason with human-like proficiency. Recent research has explored the multifaceted impact of LLMs on robotics [20], categorizing and analyzing their applications within core robotics elements: communication, perception, planning, and control. Focusing on LLMs developed post-GPT-3.5, primarily in text-based and multimodal approaches, this research provides actionable insights and guidelines for integrating LLMs into robotic systems. Through tutorial-level examples and structured prompt engineering, the study offers practical guidance for harnessing LLMs in robotics development, serving as a roadmap for researchers navigating this evolving landscape.
One of the significant challenges in human–robot interaction (HRI) is measuring the quality of interactions between humans and robots. Trust in social robots can predict how willing users are to use robots in certain situations, e.g., in healthcare. Various approaches to this problem have been explored, as discussed in [21,22]. Key among these are the negative attitudes towards robots scale (NARS) [23] and the robot anxiety scale (RAS) [24]. The NARS is widely used to assess general attitudes towards robots, capturing whether they are perceived positively or negatively. This scale has been validated and translated into multiple languages, including Polish [25], making it a robust tool for cross-cultural studies. The RAS, on the other hand, specifically measures the anxiety individuals may feel towards robots, with its application extending to various contexts, such as healthcare [24,26].
Both scales have demonstrated utility in predicting how user attitudes and anxiety affect their interactions with robots. For instance, studies have shown that high levels of robot anxiety and negative attitudes can significantly impact the effectiveness of HRI [24,26]. These scales have been employed in numerous studies to assess interaction quality, with findings consistently highlighting their value in understanding user reactions and improving robot design [21,22].
This review illustrates the breadth of the existing research in human–robot interaction, emphasizing the need for continued exploration of multimodal interaction and natural language processing theories. We utilized a combination of the NARS and RAS questionnaires administered before and after interactions with the robot. This approach allows us to gauge the influence of the interaction on participants’ attitudes and anxiety towards the robot and to assess the overall quality of the interaction. By building on these theoretical foundations, our study seeks to contribute to the development of more effective and natural human–robot communication systems.

3. Software Architecture

The primary objective of this research is to craft a robot control architecture based on a sophisticated conversational interface, fostering a smooth and natural interaction with humans. It extends our multi-platform intelligent system for multimodal human–computer interaction proposed in [27], allowing multimodal data inputs and parallel data processing with the possibility to offload computation to external services (e.g., perception data extraction from raw data for emotion recognition or gaze direction detection). Our solution also provides flexible scenario definition and a simple decision module, with more advanced features planned for a future publication.

3.1. Multimodal Human–Computer Interaction Framework

Figure 1 shows the high-level structure of the multi-platform intelligent system for the multimodal human–computer interaction framework. The framework consists of, among others, the following modules: Main, MQTT Broker (using the Message Queuing Telemetry Transport protocol), Pepper, Avatar, Microservices close, and Microservices in cloud. The Avatar module is an Android application that can be deployed on most Android devices; it displays a virtual assistant that behaves similarly to a robot and may be used for testing or for performing experiments in a virtual environment. Microservices close are services fully developed by us that can be deployed anywhere. They consist of perception modules (Emotion, Gaze, and Question detection) and utility services (Discovery and an application programming interface (API) gateway Edge server) that ease the use and development of the framework. Speech recognition can be offloaded to an external service via the Speech recognition module or executed locally in the Pepper or Avatar modules. The ChatGPT module represents the connection with the ChatGPT large language model; in the future, we are considering using local LLMs such as the large language model Meta AI (Llama). The response from ChatGPT is presented to users through two communication channels: audio (speech) and visual, using Pepper’s tablet.
The Behavior Planner is part of the Main module. It is responsible for executing scripted scenarios and choosing appropriate paths in the script using the Decision module, according to the data from the perception components. Other functions of the Main module are communication with and coordination of the other modules and preparing the robot or avatar for scenario execution. The perception components are responsible for extracting higher-level data about the user’s interaction with the robot from the raw data provided by the robot sensors or (in the future) external ones. They include components such as face tracking, gaze direction estimation, question detection, and speech recognition. The MQTT Broker is responsible for communication between the Pepper and Avatar modules and the Main module; some communication also happens using the native Pepper API. The Pepper module consists of the Pepper native application, which handles robot movements and data sources (cameras, microphones, etc.), and the Android application running on Pepper’s tablet, which allows us to use Android services such as TextToSpeech and speech-to-text for speech processing, as well as to display the conversation and other important messages to the user during interaction. Reverse data flow with data collected from the tablet sensors is also possible; see [27] for more details. Our framework allows for offloading heavy computational procedures from the Main module executed on a robot platform to a local or global cloud, depending on the available resources. A good example of such an approach is the perception modules, which work as Close microservices but have fallback implementations in the Main module. The framework can also run on many operating systems and can handle different types of robots or virtual avatars seamlessly. By utilizing robot abstraction, adding new robot types to our system is relatively simple: it requires implementing the robot interface using a robot-native solution and creating a minimal configuration.
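To illustrate the robot abstraction mentioned above, the following Kotlin sketch shows what such an interface could look like. The interface name and its methods are hypothetical examples and are not taken from the framework's actual code.

interface RobotPlatform {
    fun say(text: String)                     // speak through the platform's text-to-speech channel
    fun performAction(name: String): Boolean  // execute a named behavior, e.g., "play saxophone"
    fun showText(message: String)             // display a message on the tablet or avatar screen
}

class PepperPlatform : RobotPlatform {
    override fun say(text: String) { /* delegate to Pepper's native speech API */ }
    override fun performAction(name: String): Boolean { /* map the name to a native Pepper behavior */ return true }
    override fun showText(message: String) { /* update the tablet UI */ }
}

Adding a new robot or avatar type would then amount to providing one such implementation plus a minimal configuration, as described above.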
An essential element of the user experience is conversation with the robot and articulation of responses through the robot, epitomizing the convergence of state-of-the-art technology and tangible human–robot interaction. The application architecture delineates distinct modules for speech input, text transfer, ChatGPT interaction, result processing, and speech output, each contributing to the overall effectiveness of the conversational paradigm. Stringent security measures are implemented to safeguard the integrity and confidentiality of user interactions, while meticulous error-handling strategies preempt unforeseen situations.
The procedure involves four distinct components: the human user, the Pepper robot, the specialized application, and the ChatGPT interface; see Figure 1.
The solution is implemented and tested on the Pepper robot, which involves integration with the ChatGPT API using Kotlin in Android Studio. The application’s foundational structure revolves around a voice input mechanism designed to capture user queries precisely. This mechanism converts spoken input into text, ensuring a fluid and responsive experience. The recognized text is then transmitted to the ChatGPT API, initiating dynamic interactions capable of comprehending complex natural language and generating coherent, contextual responses. These responses flow back into the application, enhancing the overall communication.
Concurrently, the MQTT queue is employed to invoke Naoqi’s API, enabling control over the Pepper robot to execute specified instructions and providing the person conducting the experiment with additional controls, such as starting or stopping the experiment and invoking user speech input in case of an error. This functionality is provided by the MQTT Broker module.
This structural stratification harmonizes the constituent elements, culminating in a comprehensive framework for human–computer interaction in which the Pepper robot and the ChatGPT system are integrated seamlessly. The collaboration among these elements affords a sophisticated and dynamic environment, fostering a fluid exchange between the user, the robotic interface, and the advanced conversational capabilities provided by the ChatGPT interface.

Upon initiation of the application installed on the robot, the user engages in spoken dialogue with the robot by posing inquiries. The application’s speech-to-text system captures and transcribes the user’s question, presenting it on the interface for visibility. Subsequently, the transcribed query is checked against a list of known instructions that the robot can perform. If the instruction is recognized, e.g., “Please play me some music” or “Can you dance?”, the robot executes the action and then continues the conversation. The instruction list contains predefined robot actions. The action execution can be short, like playing the saxophone, which takes about 15 s, or long and complex, e.g., a magic trick, in which the robot performs a simple perception trick for the user and which takes about 300 s. If the instruction is not recognized, the query is dispatched to ChatGPT via the API. If the API is functioning correctly, the response to the user’s question is acquired. The obtained results are then shown on the application’s interface and concurrently broadcast audibly through voice output. This iterative process persists until the user opts to cease further inquiries, thereby concluding the ongoing conversation. The interaction flow can be observed in Figure 2.

3.2. Pepper Robot Description

In the rapidly evolving landscape of humanoid robotics, robot Pepper emerges as a pioneering force, seamlessly blending state-of-the-art technology with a thoughtful design ethos. This section delves into the intricate details of Pepper’s introduction, design principles, safety considerations, affordability strategies, interactive capabilities, and autonomy features, providing a comprehensive view of its multifaceted nature.
Robot Pepper, initially conceptualized to address B2B (business-to-business) needs, later ventured into the B2C (business-to-consumer) market [11]. This humanoid robot boasts a diverse range of capabilities, including the ability to exhibit body language, sense and interact with its surroundings, and navigate through space. What distinguishes Pepper is its proficiency in analyzing people’s expressions and tone of voice, achieved through advancements in speech and emotion recognition, supported by proprietary algorithms that form the foundation for meaningful human interactions.
The core design principles of robot Pepper draw inspiration from the wealth of experience and insights gathered from previous robotic ventures. These principles encompass a dedication to a visually appealing appearance, safety, affordability, interactivity, and robust autonomy. The intricacies of appearance characteristics, such as size, shape, and sound, are meticulously considered, creating a robot that not only performs efficiently but also resonates aesthetically. Informed by user feedback on predecessors like Nao, Pepper’s design deliberately avoids an overly human-like appearance to steer clear of the eerie “uncanny valley” [28]. Infused with Japanese influences, Pepper’s design incorporates large, comic-like eyes and hip joints that allow for respectful bows, establishing a gender-neutral and non-stereotypical visual identity. Even the voice, intentionally crafted to be childlike and androgynous, eliminates potential stereotypes and fosters a universally appealing presence [12].
In our work, we use the Pepper robot to run our framework and perform preplanned simple movements and more complex scenarios.

3.3. Implementation

The Android application running on Pepper’s tablet is written in Kotlin and structured into three distinct components: speech recognition, voice broadcast, and communication with the main module, which handles ChatGPT and robot communication. The underlying design philosophy of the user interface is to transcend age and regional barriers, ensuring universal accessibility. The UI is deliberately simple and user-friendly, so that users from diverse backgrounds can seamlessly navigate and use the technology. For every user message, a two-step process maintains the conversation logic: first, the message is added to the conversation history, and then an API call is initiated. Upon receiving responses from the API, they are integrated into the conversation history and displayed in the chat (see Figure 3).
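A minimal Kotlin sketch of this two-step handling is shown below. The data class, list, and function names are illustrative rather than the application's actual identifiers, and the role/content message format follows the common ChatGPT API convention.

data class ChatMessage(val role: String, val content: String)

val conversationHistory = mutableListOf<ChatMessage>()

fun onUserMessage(text: String) {
    // Step 1: append the user message to the history shown in the chat view.
    conversationHistory.add(ChatMessage(role = "user", content = text))
    // Step 2: initiate the API call, passing the accumulated history as context.
    sendConversationToChatGpt(conversationHistory)
}

fun onApiResponse(answer: String) {
    // The reply is stored as well, so the next request keeps the dialogue context.
    conversationHistory.add(ChatMessage(role = "assistant", content = answer))
}

fun sendConversationToChatGpt(history: List<ChatMessage>) {
    // Placeholder: the actual HTTP request is sketched after Algorithm 1.
}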
The intricacies of TextToSpeech posed unique challenges. By default, text is broadcast in the system’s default language, which led to subtle issues: when the default language was English but the input text was in French, the resulting voice announcement followed English pronunciation rules, creating a discordant experience. This challenge extended to various language combinations, exemplified by the awkward intonation and pronunciation mismatches observed when the input text was in Chinese. The integration of Google ML Kit, a machine learning library, emerged as a transformative solution and became the cornerstone of language recognition within the application. The shift from manual keyword identification to an automated, machine-learning-driven approach not only enhanced accuracy but also paved the way for accommodating a broader spectrum of languages. By extracting the recognized language as a parameter, the TextToSpeech function was able to deliver voice playback that was linguistically accurate and preserved the richness and nuance of diverse languages.

The decision to employ the Google speech-to-text APK for speech-to-text conversion underscored a commitment to versatility, enabling the application to recognize and convert an array of languages: English, Spanish, Polish, Chinese, and more. Unfortunately, the robot runs an older version of Android that cannot handle automatic language changes for the speech-to-text operation; therefore, the language is set at the beginning of the conversation to the system language (or to any installed language chosen by the experimenter). The application was also tested on a newer device that can handle language changes and works correctly there. However, if a user manually inputs a message in any of the languages installed on the device, the application language changes and that language is used for the rest of the conversation.
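The following Kotlin sketch shows how ML Kit's on-device language identification can drive the TextToSpeech locale, which is the general approach described above; the surrounding wiring (where the TextToSpeech instance comes from and how the fallback is chosen) is simplified and assumed.

import com.google.mlkit.nl.languageid.LanguageIdentification
import android.speech.tts.TextToSpeech
import java.util.Locale

fun speakInDetectedLanguage(text: String, tts: TextToSpeech) {
    val identifier = LanguageIdentification.getClient()
    identifier.identifyLanguage(text)
        .addOnSuccessListener { code ->
            // "und" means the language could not be determined; keep the current locale then.
            if (code != "und") {
                tts.setLanguage(Locale.forLanguageTag(code))
            }
            tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, "utteranceId")
        }
        .addOnFailureListener {
            // On failure, fall back to the locale already configured on the engine.
            tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, "utteranceId")
        }
}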
In Algorithm 1, we can see how the main loop of the Android application works. The application is event-driven, so the loop is indirect and proceeds through a series of callbacks and events. The loop starts when the startInteraction function is called. It can be called by the mqttClient in the handleIncomingMessage callback when a message arrives from the framework, or by the user pressing a button visible on the tablet screen. Then, noise suppression is initiated, which helps the speech-to-text functionality. Next, a speech input prompt is created using the standard Android intent RecognizerIntent.ACTION_RECOGNIZE_SPEECH. When the speech recognition results are ready, onSpeechResult is called by the event, and the recognized message is handled by handleInput. If it is a known command, e.g., “Thank you, goodbye”, a message is sent to the mqttBroker and the appropriate action is executed by the framework; in this case, the conversation would end. Otherwise, the ChatGPT API is called in callChatGPTAPI and the response is handled, when ready, in callbackGPT. If we receive a successful response, the application displays it on the screen as a conversation and speaks it using TextToSpeech. Then, it waits three seconds and starts the next cycle of the loop by calling promptSpeechInput().
During application start, startMqttClient is called. It connects the Android application with the main framework via the mqttBroker. The mqttClient maintains the connection and handles incoming and outgoing messages. Incoming messages are handled by another callback function, handleIncomingMessage, which recognizes the command and invokes the appropriate functions. For example, on the command “start interaction”, received on the command MQTT topic, the startInteraction function is called. Other commands are handled similarly.
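A minimal sketch of this MQTT wiring, using the Eclipse Paho client, is given below. The broker URL, the topic name, and any commands beyond "start interaction" are assumptions, and the real MqttHelper class may differ.

import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken
import org.eclipse.paho.client.mqttv3.MqttCallback
import org.eclipse.paho.client.mqttv3.MqttClient
import org.eclipse.paho.client.mqttv3.MqttMessage

fun startMqttClient(brokerUrl: String, startInteraction: () -> Unit): MqttClient {
    val client = MqttClient(brokerUrl, MqttClient.generateClientId())
    client.setCallback(object : MqttCallback {
        override fun connectionLost(cause: Throwable?) {
            // A production client would schedule a reconnect here.
        }
        override fun messageArrived(topic: String, message: MqttMessage) {
            // handleIncomingMessage: dispatch framework commands received on the command topic.
            when (String(message.payload)) {
                "start interaction" -> startInteraction()
                // other commands (e.g., stop, repeat prompt) would be handled similarly
            }
        }
        override fun deliveryComplete(token: IMqttDeliveryToken?) {}
    })
    client.connect()
    client.subscribe("command")  // topic name as described above; the actual name may differ
    return client
}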
Assertions and error cases were not included in Algorithm 1 for simplicity. They work by logging debug messages and, if the user expects some reaction from the robot, by showing and speaking an appropriate simplified message to the user. For example, if the user wants the robot to play the saxophone and, for some reason, the framework cannot perform this action, the message “Sorry I can’t play saxophone at this moment” is presented to the user.
During implementation, we encountered a few interesting challenges; the three most important ones were noise, latency, and user privacy. Environmental noise can make the speech recognition process difficult; to mitigate this risk, we used the noise suppressor available in the Android library. Potential delays in processing voice input pose a real risk to the seamless flow of conversations. By using coroutines for asynchronous processing where possible and minimizing the number of calls in the conversation loop, we minimize latency. The most time-consuming action is the call to the ChatGPT API; this delay could be lowered by using local LLMs, which we plan to do in our next study, although this requires a powerful GPU installed locally. User privacy is an important challenge; we handle it by storing user data in a minimal number of places, only for the duration of the conversation, and by transmitting it via a secured channel such as a hypertext transfer protocol secure (HTTPS) connection. For detailed information about the framework implementation, please refer to our previous article [27].
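As a sketch of the coroutine-based latency mitigation mentioned above (the scope, dispatcher, and wrapper function names are ours, not the application's exact code):

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

// Hypothetical suspending wrapper around the blocking ChatGPT request.
suspend fun requestChatGptCompletion(userText: String): String =
    withContext(Dispatchers.IO) {
        // The blocking HTTP call runs on the IO dispatcher, off the main thread.
        "response for: $userText"  // placeholder result
    }

fun handleUserTextAsync(scope: CoroutineScope, userText: String, showTextToUser: (String) -> Unit) {
    scope.launch {
        val reply = requestChatGptCompletion(userText)  // suspends without freezing the UI
        showTextToUser(reply)                           // resume and update the conversation view
    }
}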
Algorithm 1 Android application main algorithm.

procedure startInteraction
    initNoiseSuppression()
    promptSpeechInput()

procedure onSpeechResult(requestCode, resultCode, intent)
    if resultCode == RESULT_OK then
        input ← intent.getResult()
        handleInput(input)

procedure handleInput(input)
    data ← input.trim { it <= ' ' }.toLowerCase()
    if command = checkIfKnownCommand(data) then
        mqttClient ← getMqttClient()
        mqttClient.sendMessageToFramework(command)
        showTextToUser("let me try to perform " + command)
    else
        language ← detectLanguage
        setLanguage(language)
        callChatGPTAPI(data)

procedure callChatGPTAPI(data)
    header ← prepareChatGPTHeader()
    body ← prepareChatGPTBody(data)
    client ← OkHttpClient.Builder().connectTimeout(30, TimeUnit.SECONDS).build()
    request ← Request.Builder().url(APIURL).header(header).post(body).build()
    client.newCall(request).enqueue(callbackGPT)

procedure callbackGPT(response)
    if response.isSuccessful() then
        cleanResult ← getCleanResult(response.body)
        showTextToUser(cleanResult)

procedure showTextToUser(message)
    tts ← getTextToSpeech()
    messageList ← getMessageList()
    messageList.add(message)
    showMessageList(messageList)
    tts.speak(message)
    delay(3000)
    promptSpeechInput()

procedure startMqttClient
    mqttClient ← MqttHelper(context)
    mqttClient.setCallback(handleIncomingMessage)
    mqttClient.connect(BROKER_URL)
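For completeness, a runnable Kotlin sketch of the callChatGPTAPI and callbackGPT procedures from Algorithm 1 is given below. It assumes OkHttp 4, the org.json classes bundled with Android, and the public OpenAI chat completions endpoint; the model name, request shape, and error handling are simplified and may differ from the deployed application.

import okhttp3.Call
import okhttp3.Callback
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import okhttp3.Response
import org.json.JSONArray
import org.json.JSONObject
import java.io.IOException
import java.util.concurrent.TimeUnit

val client: OkHttpClient = OkHttpClient.Builder().connectTimeout(30, TimeUnit.SECONDS).build()

fun callChatGPTAPI(data: String, apiKey: String, showTextToUser: (String) -> Unit) {
    // prepareChatGPTBody: a single user message here; a real call would send the whole history.
    val body = JSONObject()
        .put("model", "gpt-3.5-turbo")  // assumed model name
        .put("messages", JSONArray().put(JSONObject().put("role", "user").put("content", data)))
        .toString()
    // prepareChatGPTHeader: bearer authentication against the API.
    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(body.toRequestBody("application/json".toMediaType()))
        .build()
    client.newCall(request).enqueue(object : Callback {
        override fun onFailure(call: Call, e: IOException) {
            showTextToUser("Sorry, I cannot reach the language model at this moment.")
        }
        override fun onResponse(call: Call, response: Response) {
            if (!response.isSuccessful) return
            // getCleanResult: extract the assistant's text from the JSON reply.
            val raw = response.body?.string() ?: return
            val answer = JSONObject(raw)
                .getJSONArray("choices").getJSONObject(0)
                .getJSONObject("message").getString("content")
            showTextToUser(answer.trim())
        }
    })
}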

4. Experiments

To verify the influence of the developed robot control platform on users, an experiment with users was designed. A group of users interacted with the Pepper robot, and their negative attitudes towards robots scale [23] and robot anxiety scale [24] scores were measured before and after the interaction using the appropriate questionnaires. The questionnaires were translated into Polish to accommodate users who do not know English or do not feel comfortable using it. In both cases, we prepared our own translation, which, in the case of the NARS, is almost identical to [25].
In Figure 4, the interaction flow used in the experiments is presented. Initially, an experimenter welcomed the participant and seated them at a counter, where the participant filled out the “before” questionnaires and was informed about the experiment procedure; afterwards, they were taken to the robot. Next, the participant was asked to sit in front of Pepper and wait a moment for the experiment to start. At the beginning, the purely technical part of the experiment was explained to the participant, i.e., how to handle the robot interaction. After everything was prepared and the participant was ready, the experimenter left the robot and the participant alone, and the interaction began shortly thereafter. In the opening sequence, the robot states its name, purpose, and capabilities, i.e., what interactions are possible. Then the robot waits for user input, transforms it into text using the speech recognition module, and acts according to Figure 4 by executing the recognized action on the robot using the Pepper module or by sending a message to ChatGPT. Once ChatGPT answers or Pepper’s action execution ends, the robot waits for the next input from the user. The conversation continues in this manner until the user says goodbye or it is ended by the experimenter at an appropriate moment (when the user has lost interest in the conversation or forgotten how to end it themselves). At the end, the robot says the goodbye sequence, thanking the user for the conversation. The user can freely pick the topic of conversation; as we use ChatGPT’s vast knowledge base, we deemed it unnecessary to limit the conversation or lead the user to a particular topic. After the experiment, the participants filled out the “after” questionnaires and had a short conversation with the experimenter about the experiment and robots.
The tested population consisted of 20 people of both sexes, with ages ranging from 18 to 60. The participants were mainly teachers and staff hired for a teachers’ conference, as we decided to perform the experiment alongside said conference. Unfortunately, due to the uncooperativeness of some of the participants (missing one of the questionnaires or questions) and unpredictable external interactions, we had to exclude half of the participants from the final results, leaving 10 participants in the final data.
The results, which compare the NARS scores before and after the interaction, are presented in Figure 5. As we can see, our system works well and positively influences user perception of robots. After the conversation with Pepper, participants reported being more concerned about whether the robot could be alive, and they felt less paranoid when talking with the robot. After the conversation, users also more often predicted that robots will dominate the future. At the same time, the feeling of unease about being given a job that requires cooperation with a robot decreased, and the perception of robots having emotions and of the possibility of making friends with them improved. The negative attitude increased for the items concerning robots having a bad influence on children and users becoming too dependent on robots. The concern about children might not be connected to the robots themselves but to ChatGPT’s capabilities and how they will influence children’s learning. A neutral attitude towards the word “robot” suggests no bias towards robots.
The fourteen NARS questions can be divided into three groups. By summing the answers in these groups, three factors are obtained: S1—“Negative Attitude toward Situations of Interaction with Robots”, S2—“Negative Attitude toward Social Influence of Robots”, and S3—“Negative Attitude toward Emotions in Interaction with Robots”. The results measured using these factors are shown in Figure 6. As we can see, the negative attitude towards interactions with robots and towards emotions in such interactions decreased after users had a conversation with the robot. More statistical data can be found in Table 1. The last three columns are the results of a one-way ANOVA, which confirms statistical significance for the S1 result; the S2 and S3 results cannot be confirmed as statistically significant in our experiment. By statistically significant, we mean p < 0.05. The effect size for S1, calculated as eta squared (η²), was 0.64, indicating a large effect size [29] and meaning that the conversation with the robot had a significant and positive impact on users.
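For reference, eta squared in a one-way ANOVA is the proportion of the total variance explained by the group factor (here, before vs. after the interaction):

η² = SS_between / SS_total = SS_between / (SS_between + SS_within),

so the value of 0.64 reported for S1 means that about 64% of the variance in the S1 scores is attributable to the before/after difference.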
The results of the second survey, the RAS, conducted before and after the interaction with the robot, are presented in Figure 7. As we can see, anxiety towards the robot dropped in all cases. The biggest improvements concern users’ fear of talking about irrelevant things with a robot and the smoothness of conversation with a robot; the flexibility and significance of the conversation were highly rated by users. Furthermore, anxiety about understanding the robot and about communication dropped. Likewise, concern about robots having too much power and about the speed at which robots move decreased significantly.
Here, three factors can also be defined: S1—“Anxiety toward Communication Capability of Robots”, S2—“Anxiety toward Behavioral Characteristics of Robots”, and S3—“Anxiety toward Discourse with Robots”. The results measured using these factors are presented in Figure 8. We conducted a statistical analysis of the results of this survey analogous to that for the NARS; the details are presented in Table 2. According to the ANOVA, the results for S1 and S2 are statistically significant (p < 0.05). The effect sizes, calculated as eta squared (η²), were 0.51 and 0.69, respectively, indicating large effect sizes. This means that the level of anxiety related to the robot’s communication capabilities and its behavioral characteristics decreased significantly after conversations with the robot. Generally, p < 0.05 is accepted as sufficient to reject the null hypothesis, and η² > 0.14 is considered a large effect size.

5. Conclusions and Further Research

As technology continues to advance and programming continues to improve, the integration of robotics into everyday life has gone beyond the traditional domain of simple information retrieval and processing. In the future, it will expand into the realm of complex human behavior, positioning itself within the domain of human–robot interaction simulation. The quest for augmented interaction applications has led to iterative improvements in dynamic prototyping. This paradigm shift in robotics marks the departure of robots from their traditional roles and their emergence as entities capable of executing complex, task-specific instructions. No longer limited to simple queries and conversations, users can now command robots to engage in a range of complex activities, including garbage disposal, precise plant watering, cooking tasks of varying degrees of complexity, thorough sweeping, and nuanced responsibilities such as walking the dog. As their comprehension and execution capabilities improve, robots are evolving into intelligent, multifunctional entities adept at executing a wide range of behavioral commands, thus relieving users of repetitive and menial daily tasks.
The increased capabilities of robots not only represent a major advancement in technological progress but also signal a future where the seamless integration of robots with all aspects of daily life will become the norm. The combination of techniques of artificial intelligence, immersive environments, and dynamic prototyping establishes a holistic framework that transforms robots from mere tools to interactive partners capable of recognizing and executing complex behavioral commands.
In our work, we designed the architecture of a robot control platform for multimodal interactions with humans based on ChatGPT. It was successfully implemented on the humanoid robot Pepper. In our experiments, we confirmed that using large language models with a humanoid robot can decrease negative attitudes towards the robot and anxiety about conversing with robots. Thus, a more complex integration between LLMs, robots, and other cloud services is in high demand and should be an area for further study.
In future work, we would like to explore more advanced applications of LLMs like ChatGPT in robotics, especially in task planning and automatic code writing for robotic behaviors. Repeating the experiment with a larger population would also be important to strengthen the results of the current study.

Author Contributions

Conceptualization, B.S.; Software, J.Q. and M.J.; Writing—original draft, J.Q. and M.J.; Writing—review & editing, M.J. and B.S.; Supervision, B.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research presented in this paper received support from funds assigned by the Polish Ministry of Science and Technology to AGH University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Riskin, J. The Restless Clock: A History of the Centuries-Long Argument Over What Makes Living Things Tick; University of Chicago Press: Chicago, IL, USA, 2016. [Google Scholar] [CrossRef]
  2. Nocks, L. The Robot: The Life Story of a Technology; Johns Hopkins University Press: Baltimore, MD, USA, 2008. [Google Scholar] [CrossRef]
  3. Sheridan, T.B. Human–Robot Interaction: Status and Challenges. Hum. Factors 2016, 58, 525–532. [Google Scholar] [CrossRef] [PubMed]
  4. Balasubramanian, S.; Klein, J.; Burdet, E. Robot-assisted rehabilitation of hand function. Curr. Opin. Neurol. 2010, 23, 661–670. [Google Scholar] [CrossRef] [PubMed]
  5. Sawik, B.; Tobis, S.; Baum, E.; Suwalska, A.; Kropińska, S.; Stachnik, K.; Pérez-Bernabeu, E.; Cildoz, M.; Agustin, A.; Wieczorowska-Tobis, K. Robots for Elderly Care: Review, Multi-Criteria Optimization Model and Qualitative Case Study. Healthcare 2023, 11, 1286. [Google Scholar] [CrossRef] [PubMed]
  6. Petillot, Y.; Antonelli, G.; Casalino, G.; Ferreira, F. Underwater Robots: From Remotely Operated Vehicles to Intervention Autonomous Underwater Vehicles. IEEE Robot. Autom. Mag. 2019, 26, 94–101. [Google Scholar] [CrossRef]
  7. Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.L.; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136. [Google Scholar] [CrossRef]
  8. SEO.AI. How Many Users Does ChatGPT Have? Statistics & Facts (2024). Available online: https://seo.ai/blog/how-many-users-does-chatgpt-have (accessed on 28 June 2024).
  9. Exploding Topics. Number of ChatGPT Users (Jun 2024). Available online: https://explodingtopics.com/blog/chatgpt-users (accessed on 28 June 2024).
  10. Sołtysik, M.; Gawłowska, M.; Sniezynski, B.; Gunia, A. Artificial Intelligence, Management and Trust; Routledge: London, UK, 2024. [Google Scholar] [CrossRef]
  11. Pandey, A.K.; Gelin, R. A mass-produced sociable humanoid robot: Pepper: The first machine of its kind. IEEE Robot. Autom. Mag. 2018, 25, 40–48. [Google Scholar] [CrossRef]
  12. Siegel, M.; Breazeal, C.; Norton, M.I. Persuasive robotics: The influence of robot gender on human behavior. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St Louis, MO, USA, 11–15 October 2009; pp. 2563–2568. [Google Scholar] [CrossRef]
  13. Gardecki, A.; Podpora, M.; Beniak, R.; Klin, B. The Pepper humanoid robot in front desk application. In Proceedings of the 2018 Progress in Applied Electrical Engineering (PAEE), Koscielisko, Poland, 18–22 June 2018; pp. 1–7. [Google Scholar] [CrossRef]
  14. Bécue, A.; Praça, I.; Gama, J. Artificial intelligence, cyber-threats and Industry 4.0: Challenges and opportunities. Artif. Intell. Rev. 2021, 54, 3849–3886. [Google Scholar] [CrossRef]
  15. Tao, F.; Qi, Q.; Liu, A. Data-driven smart manufacturing. J. Manuf. Syst. 2018, 48, 157–169. [Google Scholar] [CrossRef]
  16. Wang, S.; Wan, J.; Zhang, D.; Li, D.; Zhang, C. Towards smart factory for industry 4.0: A self-organized multi-agent system with big data based feedback and coordination. Comput. Netw. 2016, 101, 158–168. [Google Scholar] [CrossRef]
  17. Al Balushi, N. A Review of the Reliability Analysis of the Complex Industrial Systems. Adv. Dyn. Syst. Appl. 2021, 16, 257–297. [Google Scholar] [CrossRef]
  18. Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. Chatgpt for robotics: Design principles and model abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar] [CrossRef]
  19. Liu, H.; Zhu, Y.; Kato, K.; Tsukahara, A.; Kondo, I.; Aoyama, T.; Hasegawa, Y. Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration. IEEE Robot. Autom. Lett. 2024, 9, 6904–6911. [Google Scholar] [CrossRef]
  20. Kim, Y.; Kim, D.; Choi, J.; Park, J.; Oh, N.; Park, D. A Survey on Integration of Large Language Models with Intelligent Robots. Intell. Serv. Robot. 2024. [Google Scholar] [CrossRef]
  21. Krägeloh, C.U.; Bharatharaj, J.; Sasthan Kutty, S.K.; Nirmala, P.R.; Huang, L. Questionnaires to measure acceptability of social robots: A critical review. Robotics 2019, 8, 88. [Google Scholar] [CrossRef]
  22. Naneva, S.; Sarda Gou, M.; Webb, T.L.; Prescott, T.J. A systematic review of attitudes, anxiety, acceptance, and trust towards social robots. Int. J. Soc. Robot. 2020, 12, 1179–1201. [Google Scholar] [CrossRef]
  23. Nomura, T.; Kanda, T.; Suzuki, T.; Kato, K. Psychology in human-robot communication: An attempt through investigation of negative attitudes and anxiety toward robots. In Proceedings of the RO-MAN 2004, 13th IEEE International Workshop on Robot and Human Interactive Communication (IEEE Catalog No. 04TH8759), Kurashiki, Japan, 20–22 September 2004; pp. 35–40. [Google Scholar] [CrossRef]
  24. Nomura, T.; Suzuki, T.; Kanda, T.; Kato, K. Measurement of anxiety toward robots. In Proceedings of the ROMAN 2006—15th IEEE International Symposium on Robot and Human Interactive Communication, Hatfield, UK, 6–8 September 2006; pp. 372–377. [Google Scholar] [CrossRef]
  25. Pochwatko, G.; Giger, J.C.; Różańska-Walczuk, M.; Świdrak, J.; Kukiełka, K.; Możaryn, J.; Piçarra, N. Polish version of the negative attitude toward robots scale (NARS-PL). J. Autom. Mob. Robot. Intell. Syst. 2015, 9, 65–72. [Google Scholar] [CrossRef]
  26. Nomura, T.; Kanda, T.; Yamada, S.; Suzuki, T. Exploring influences of robot anxiety into HRI. In Proceedings of the 6th International Conference on Human-Robot Interaction, Lausanne, Switzerland, 8–11 March 2011; pp. 213–214. [Google Scholar] [CrossRef]
  27. Jarosz, M.; Nawrocki, P.; Sniezynski, B.; Indurkhya, B. Multi-Platform Intelligent System for Multimodal Human-Computer Interaction. Comput. Inform. 2021, 40, 83–103. [Google Scholar] [CrossRef]
  28. Mori, M.; MacDorman, K.F.; Kageki, N. The uncanny valley [from the field]. IEEE Robot. Autom. Mag. 2012, 19, 98–100. [Google Scholar] [CrossRef]
  29. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Routledge: London, UK, 2013. [Google Scholar] [CrossRef]
Figure 1. Robot Control Platform for Multimodal Interactions with Humans based on ChatGPT.
Figure 2. Sequence of interactions in the proposed architecture, highlighting envisioned actions.
Figure 3. Application working on Pepper robot, user view of the robot during conversation.
Figure 4. Interaction flow used in experiments.
Figure 5. Results before and after experiment with NARS survey.
Figure 6. Results before and after experiment with NARS survey, grouped into three factors: S1—negative attitude towards interaction with robots, S2—negative attitude towards social influence of robots, and S3—negative attitude toward emotions in interaction with robots.
Figure 7. Results before and after experiment with RAS survey.
Figure 8. Results before and after experiment with RAS survey, grouped into three factors: S1—anxiety towards communication capability of robots, S2—anxiety towards behavioral characteristics of robots, S3—anxiety towards discourse with robots.
Table 1. NARS survey statistical analysis results, grouped into three factors: S1—negative attitude towards interaction with robots, S2—negative attitude towards social influence of robots and S3—negative attitude towards emotions in interaction with robots.
        Before                              After
        Mean   Std. Dev.  Min  Max          Mean   Std. Dev.  Min  Max          f      p     η²
S1      16.71  5.15       9    26           12.86  5.79       7    25           10.62  0.02  0.64
S2      15     5.92       5    23           17     4.51       12   23           3      0.13  0.33
S3      8      4.58       3    15           6.43   4.08       3    15           3.08   0.13  0.34
Table 2. RAS survey statistical analysis results, grouped into three factors: S1—anxiety towards communication capability of robots, S2—anxiety towards behavioral characteristics of robots, S3—anxiety towards discourse with robots.
        Before                              After
        Mean   Std. Dev.  Min  Max          Mean   Std. Dev.  Min  Max          f      p     η²
S1      9.86   4.88       4    16           5.71   3.40       3    13           6.28   0.05  0.51
S2      13.29  6.05       6    22           10     5.29       4    19           13.56  0.01  0.69
S3      14.86  6.04       9    23           10.29  6.52       5    20           3.96   0.94  0.13
