Perspective

The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare

by Souren Pashangpour 1,* and Goldie Nejat 1,2,3,*
1 Autonomous Systems and Biomechatronics Laboratory (ASBLab), Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada
2 KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
3 Rotman Research Institute, Baycrest Health Sciences, North York, ON M6A 2E1, Canada
* Authors to whom correspondence should be addressed.
Robotics 2024, 13(8), 112; https://doi.org/10.3390/robotics13080112
Submission received: 30 May 2024 / Revised: 6 July 2024 / Accepted: 18 July 2024 / Published: 23 July 2024
(This article belongs to the Special Issue Robots and Artificial Intelligence for a Better Future of Health Care)

Abstract:
The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand put on healthcare systems around the world with respect to an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to uniquely identify the needed system requirements for designing health-specific LLM-based robots in terms of multi-modal communication through human–robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging innovative field.

1. Introduction

In healthcare, the need for new technology to maintain the quality and efficiency of care is paramount. This demand has been amplified by an increase in the overall older population of the world. Namely, by 2050, 22% of the global population is expected to be over 65 years old [1]. This demographic shift leads to a rising prevalence of chronic diseases such as dementia, diabetes, and heart disease, which require continuous monitoring and long-term management, further straining healthcare resources [2,3]. In 2022, Canada alone had 143,695 job vacancies for healthcare professionals [4]. In the U.S., it is predicted that by 2026 there will be a shortage of up to 3.2 million healthcare workers, highlighting a staggering workforce gap [5]. Furthermore, the vast amounts of health data generated in this sector require efficient management and use to improve patient outcomes and reduce healthcare costs, a task well suited for generative AI and deep learning models [6,7,8]. For example, generative AI models, such as large language models (LLMs), can be trained on large Electronic Health Record (EHR) datasets to learn patterns between symptoms, diagnoses, and recommendations in order to help with the management of data, the retrieval of information, and decision-making processes in healthcare [9].
The need for new technologies in healthcare is multi-faceted, encompassing the following: (1) classical and deep learning methods to facilitate medical imaging analysis [10], (2) deep neural networks (DNNs) to perform automated disease detection and prediction [11], and (3) LLMs for clinical decision-making and teleconsultation [12]. In particular, LLMs have already been integrated into medicine to assist with clinical note-taking by rewriting and summarizing clinicians’ notes for clarity and have also been leveraged through chatbots to assist patients with reminders and general questions about medications [12]. Furthermore, LLMs have also been used to personalize treatment plans for individual patients by assisting with EHR data extraction [13]. The use of deep learning and LLMs has brought many benefits and advancements to the delivery of healthcare.
In addition to the aforementioned software technologies, innovations in robotics have allowed their integration into healthcare, including the following: (1) surgical robots being used to perform various minimally invasive surgical procedures for brain, spine, and neck surgeries, where accuracy and reliability are paramount [14,15], (2) rehabilitation robots in the form of robotic assistive wheelchairs [16], prostheses [17], and robotic arms and exoskeletons for lower and upper limb rehabilitation [18,19], (3) mobile medication delivery robots utilized as automated medication carts [20], and (4) humanoid robots monitoring patient vital signs [21]. The need for robots to assist in healthcare was especially evident during the COVID-19 pandemic, when the healthcare sector turned to robotics in response to public health emergencies for various tasks to minimize person-to-person contact and the spread of the virus [22]. This included autonomous disinfecting robots using UV-C irradiation for surface decontamination in hospitals [23] and service robots used to facilitate social distancing measures in hospitals and long-term care homes by conducting initial screening of COVID-19 symptoms and detecting face masks [23,24,25,26]. Social robots have also been used to provide companionship to older adults in healthcare environments (i.e., long-term care homes) by engaging in conversations, such as sharing stories or telling jokes, thus creating a more personal and comforting presence by interacting with users [27,28]. In general, the integration of robots into healthcare aims to enhance patient experiences and outcomes, support skill augmentation, and improve the overall quality of care while reducing the workload of care providers [29].
Despite the use of both healthcare robots and LLMs in healthcare, the integration and deployment of these two technologies still remains unexplored. To the authors’ knowledge, there have only been three instances where robots and LLMs have been applied directly for healthcare applications: two instances involving social robots [30,31] and one instance involving a surgical robot [32]. However, this potential marriage of robotics and LLMs for healthcare presents an untapped opportunity, as the combination of a robot’s physical capabilities with the understanding and generative abilities of LLMs has the potential to provide person-centered care, as well as streamline operational workflows (e.g., logistical tasks), and reduce the workload of healthcare professionals. This integration will result in the utilization of vast healthcare data to further refine diagnostic, therapeutic, and predictive healthcare services. The potential of healthcare robots using LLMs to provide such services underlines the urgent need for research and development in this emerging area.
In this paper, we present the first investigation into the emerging field of healthcare robots using LLMs. Namely, we explore the innovative developments which aim to address the challenge of enhancing the quality and efficiency of patient care during a time when the demographic is shifting towards an older population and there is a shortage of healthcare professionals. Our objective is to identify the needed system requirements for multi-modal communication through human–robot interactions (HRIs), semantic reasoning, and task planning for designing health-specific LLM-based robotic solutions. We will also discuss the ethical issues associated with the potential development and utilization of healthcare robots leveraging LLMs and the open challenges and potential future research directions for this field.

2. Large Language Models (LLMs) for Healthcare

The versatility of generative AI is apparent in its ability to be trained on an array of data types, from textual (i.e., EHRs) [33] and visual content (i.e., medical imaging data) [34] to genetic sequences [29], in order to learn and capture the underlying patterns and distributions in such data. This makes it especially valuable for healthcare tasks that require adaptability and continuous learning [30]. LLMs represent a significant advancement in generative AI. They primarily utilize the transformer architecture [35] and contain key features including multi-head attention for parallel processing, positional encodings for sequence awareness, layer normalization, and feedforward networks for data refinement [36]. The encoder–decoder structure in many LLMs facilitates complex tasks such as language translation and content generation [37]. The transformer architecture gives LLMs the potential to be instrumental in healthcare robot development due to their proficiency in generating human-like text [38], understanding the lexical semantics of the physical world [36,39], and making decisions regarding appropriate robot behaviors to implement [40].
Table 1 presents an overview of five specialized LLMs designed for healthcare: (1) BiomedBERT [41], (2) MED-PaLM 2 [42], (3) DRG-LLaMA [43], (4) GPT2-BioPT [44], and (5) Clinical-T5 [45]. It also includes seven general LLMs that have been incorporated into robotic tasks for HRIs, semantic reasoning, and planning: (1) GPT-3 [46], (2) GPT-3.5 [46], (3) GPT-4 [47], (4) T5 [48], (5) DialogLED [49], (6) PaLM [50], and (7) the Stanford Alpaca [51]. Furthermore, Table 1 includes two popular general-use LLMs: (1) BERT [52] and (2) PaLM 2 [53]. To date, the specialized LLMs have not yet been deployed in robotics for healthcare applications. However, they have the potential to be applied to tasks such as diagnostic assistance, patient data analysis, and treatment and procedure recommendations. These specialized LLMs are created by fine-tuning the model parameters of their respective base models, namely, BERT [52], PaLM [50], PaLM 2 [53], LLaMA 2 [54], GPT-2 [55], and T5 [48], on healthcare datasets such as MedQA [56], MedMCQA [57], and PubMedQA [58]. It is important to note that these base models are considered foundational models, which can be either open-source or closed-source. Open-source models are those whose underlying code and training procedures are freely available for use, modification, and distribution by anyone, facilitating transparency and collaboration in model development [59]. In contrast, closed-source models are proprietary, with their code, methodologies, and data often kept confidential by the organization that developed them, limiting access and modification by external entities [60]. This distinction is crucial, as it influences the breadth of application and customization potential of each model in healthcare settings. Moreover, these base models are built from the ground up, having unique architectures, and they are trained on expertly curated datasets to give them general knowledge about the world [61]. The training data size is often reported in gigabytes, while the context length (prompt length) is reported in tokens, which represent variable numbers of characters depending on the tokenization process used. The size and the source of the training data (i.e., chat forums, scholarly journals) are what give breadth (general knowledge) to the model, while the context length dictates the size of the input to the model and constitutes the context for what the model generates [62,63]. The general LLMs reported in Table 1 are the models discussed in the following sections of this paper with respect to their current use in robotics and their potential implications for healthcare. We will present two design studies in Section 8 showcasing how these three types of LLMs can potentially be used for healthcare robotics applications.
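To make this fine-tuning step concrete, the following minimal sketch shows how a general seq2seq base model such as T5 could be adapted on a handful of medical question–answer pairs with the Hugging Face Trainer; the toy examples, prompt prefix, model checkpoint, and hyperparameters are illustrative assumptions rather than the actual recipes used for the models in Table 1.

```python
# Minimal sketch: adapting a general T5 base model on medical QA pairs.
# The checkpoint, toy data, prompt prefix, and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Toy stand-in for a healthcare QA corpus such as MedQA or PubMedQA.
examples = [
    {"question": "What is a common first-line medication for type 2 diabetes?",
     "answer": "Metformin, combined with lifestyle changes."},
    {"question": "Which vital sign does a pulse oximeter measure?",
     "answer": "Blood oxygen saturation (SpO2)."},
]

def preprocess(batch):
    # Prepend a task prefix and tokenize questions and answers.
    model_inputs = tokenizer(
        ["answer the medical question: " + q for q in batch["question"]],
        truncation=True, max_length=256)
    labels = tokenizer(batch["answer"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = Dataset.from_list(examples).map(
    preprocess, batched=True, remove_columns=["question", "answer"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="clinical-t5-sketch",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```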
There are several common prompting techniques that are essential for the interaction between robotic systems and LLMs, and these can also be extended to healthcare settings. Prompts serve as the conduit through which queries or tasks are communicated to LLMs by agents (people or robots), with the model generating responses based on the given prompt structure. Outputs from LLMs are probabilistic, leading to ‘prompt engineering’, where different prompt structures are used to bias the output of the LLM towards a desired outcome [64]. Moreover, the context window of an LLM refers to the contiguous sequence of tokens considered by the LLM in a forward pass (i.e., at generation time) [65]. This window, typically limited by memory constraints, determines the amount of preceding and succeeding text data the model utilizes to maintain coherence and accuracy in tasks such as word prediction [66]. This is evident in few-shot prompting, where input–output pairs provided to the model fill its context window with consistent structures, reducing variations in the input text [67,68]. This uniformity in the context window influences the attention mechanism to focus more narrowly, enhancing the relevance of certain parts of the input to the generated output [69,70]. Therefore, the effectiveness of a prompt and, consequently, the utility of the model’s response are highly contingent upon the prompt’s design, and choosing a suitable prompting method is critical for achieving specific objectives. Table 2 describes key prompting strategies, detailing their operational frameworks, exemplary applications we have identified within healthcare, and their respective advantages and limitations. It is worth noting that prompts can be multi-modal, including text, images, or audio. In order for LLMs to respond to such a variety of informational inputs [71], the non-textual components are converted into textual descriptions through preprocessing models such as CLIP, which aligns images and text in a joint embedding space [72], or ScanRefer, which aligns point clouds and sentences in a joint embedding space [73]. This multi-modal integration broadens the scope of LLM applications, enhancing their adaptability and effectiveness in complex healthcare environments where diverse data types are prevalent, such as medical imaging data, recorded conversations between patients and healthcare professionals, and domain-specific images such as maps of healthcare environments or medication labels. The prompting techniques presented in Table 2 are those used to integrate LLMs into robotics in the literature discussed in the subsequent sections. The example prompts provided by us in Table 2 highlight their potential applications in healthcare.
Table 1. LLMs with potential applications in healthcare.
| Model Name | Parameter Size | Context Length | Type of Training Data | Foundational Model | Papers That Use the Model |
|---|---|---|---|---|---|
| MED-PaLM 2 [42] | 540 B | 8196 | MedQA, MedMCQA, HealthSearchQA, LiveQA, and MedicationQA | PaLM 2 [53] | — |
| DRG-LLaMA [43] | 7 B, 13 B, 70 B | 4096 | 236,192 MIMIC-IV discharge summaries | LLaMA 2 [54] | — |
| GPT2-BioPT [44] | 124 M, 770 M | 1024 | PorTuguese-2 with biomedical literature [44] | GPT-2 [55] | — |
| Clinical-T5 [45] | 220 M, 770 M, 3 B, 11 B | Variable, memory-constrained | Approximately 2 million textual notes from MIMIC-III | T5 [48] | — |
| BiomedBERT [41] | 110 M, 340 M | 512 | BREATHE, containing research articles and abstracts from different sources (BMJ, arXiv, medRxiv, bioRxiv, CORD-19, Springer Nature, NCBI, JAMA, and BioASQ) | BERT [52] | — |
| BERT [52] | 110 M, 340 M | 512 | BookCorpus (800 M words) and English Wikipedia (2500 M words) | NA | — |
| T5 [48] | 60 M, 220 M, 770 M, 3 B, 11 B | Variable length | 750 GB of Colossal Clean Crawled Corpus (C4) | NA | [31,74,75] |
| GPT-3 [46] | 175 B | 4096 | Not provided | NA | [76,77,78,79,80] |
| GPT-3.5 [46] | 175 B | 4096 | Not provided | NA | [31,32,81,82,83,84] |
| GPT-4 [47] | 1.8 T | 128,000 | Not provided | NA | [74,85,86] |
| PaLM [50] | 8 B, 62 B, 540 B | 2048 | Social media conversations (multilingual): 50%; filtered webpages (multilingual): 27%; books (English): 13%; GitHub (code): 5%; Wikipedia (multilingual): 4%; news (English): 1% | NA | [87] |
| PaLM 2 [53] | Not available | Not available | Not available | NA | — |
| Stanford Alpaca [51] | 7 B | 4096 | A mix of publicly available online data and synthetic data generated by GPT-3 | NA | [88] |
| DialogLED [49] | 41 M | 4096 | Books, English Wikipedia, real news, and stories | NA | [31] |
Table 2. Prompting techniques for LLMs for healthcare.
| Prompting Method | Prompt Description | Example Prompt | Advantages | Disadvantages | Papers That Use Method |
|---|---|---|---|---|---|
| Direct questioning | Direct questioning about a topic of interest. | ‘What are the primary symptoms of Type 2 diabetes?’ | Straightforward, clear, and easy to understand. Effective for factual inquiries. | May not elicit detailed or nuanced responses; limited by the user’s knowledge to ask the right questions. | — |
| Chain of thought | A problem is presented, followed by a step-by-step reasoning process to solve it. | ‘To determine the Body Mass Index (BMI), first divide the weight in kilograms by the height in meters squared.’ | Breaks down complex processes into understandable steps; useful for teaching and clarification. This approach can help the model in complex problem-solving tasks. | Can be time-consuming; requires accurate initial logic to be effective. | [75,86,87] |
| Zero-shot | Providing little to no context (zero-shot) to guide the LLM on how to respond or what format to follow. | ‘Describe the process of cellular respiration in human cells.’ | Tests the model’s ability to respond based on its pre-trained knowledge. | Responses may lack context or specificity; dependent on the model’s existing knowledge. | [31,74,75,81,84,85] |
| Few-shot learning | Giving a few examples (few-shot) to guide the LLM on how to respond or what format to follow. | [Example 1: ‘An apple is a fruit that can help with digestion’.] [Example 2: ‘A treadmill is a device used for physical exercise’.] ‘What is an ultrasound?’ | Provides context through examples; improves the accuracy and relevance of responses. | The quality of the response depends on the quality of the examples provided. | [31,32,76] |
| EmotionPrompt [89] | Incorporating emotional cues into prompts and/or asking the LLM to emphasize emotional stimuli in its output. | ‘It’s crucial for my family’s well-being. Can you provide advice on maintaining a balanced diet for heart health?’ | Results in more engaging and less generic LLM outputs. | Can overexaggerate emotional stimuli and lead to excessive gestures. | [82,83,86] |
| Multi-modal prompting | Incorporating more than just text in the prompts, such as images or data, especially in models that can process multi-modal inputs. This is useful for tasks that require interpretation across different types of information. | ‘Here is an MRI image of a knee. Can you explain the common injuries indicated by this type of scan?’ | Incorporates different data types for a more holistic understanding; useful for diagnostics and treatment planning. | Requires LLM models capable of processing and interpreting multiple data modes effectively. | [77,85] |
| Task-oriented prompting | Combination of the previous prompting methods with the addition of primitive robot actions and feedback from the robot and its operating environment, as well as the user request. | ‘Your role is to generate robotic plans for an X embodied robot capable of <primitive actions: moveTo(location), grasp(object), scan()>. The current state of the robot is ${state}; generate robotic plans as Pythonic code using the primitive action functions. The user request is ${userRequest}.’ | Enables LLMs to be integrated with robotic systems and used to generate robotic plans that take into consideration the abilities of the robot and the conditions in its environment. | Requires expert programming to integrate into an autonomous system. Limited by the context length of LLM models. | [32,76,78,79,80,84,86,87,88] |
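As a simple illustration of the task-oriented prompting pattern in Table 2, the sketch below assembles a prompt from a robot’s primitive actions, its current state, and a user request; the action names, state fields, and the call_llm stub are hypothetical placeholders for an actual robot API and LLM endpoint.

```python
# A minimal sketch of task-oriented prompting: the prompt bundles the robot's
# primitive actions, its current state, and the user request. All names are
# illustrative assumptions, not a specific robot's interface.
PRIMITIVE_ACTIONS = ["moveTo(location)", "grasp(object)", "scan()"]

def build_task_prompt(state: dict, user_request: str) -> str:
    return (
        "You generate robot plans as Python code using only these primitive "
        f"actions: {', '.join(PRIMITIVE_ACTIONS)}.\n"
        f"Current robot state: {state}\n"
        f"User request: {user_request}\n"
        "Return only the plan."
    )

def call_llm(prompt: str) -> str:
    # Stub: replace with a real LLM client call.
    return "moveTo('supply_room')\ngrasp('saline_bag')\nmoveTo('room_12')"

prompt = build_task_prompt({"location": "corridor_B", "gripper": "empty"},
                           "Bring a saline bag to patient room 12.")
print(call_llm(prompt))
```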

3. Human–Robot Interaction (HRI) and Communication

The field of HRI for healthcare is focused on the development of appropriate design and implementation strategies for robots to assist with different tasks for a wide range of users, from healthcare professionals [23] to patients [89]. Therefore, HRI approaches require robotic systems to understand and adapt to the needs of users. The types of HRIs used in healthcare applications include the following: (1) teleoperation, e.g., through graphical user interfaces for socially assistive robots to provide therapy and cognitive interactions [90,91]; (2) social interactions using natural communication modes, for example, for companionship [92]; and (3) mechanical interfaces used, for example, for surgical robots [93]. Surgical robots require mechanical interfaces (joysticks, switches, etc.), which can result in high cognitive workloads for surgeons [94]. On the other hand, social HRI utilizes natural language processing (NLP) to recognize user verbal requests and commands in order to provide reminders and to engage in conversations [95]. These requests and commands depend heavily on pronunciation and word choices, which can potentially result in incorrect command calls being detected [96].
In general, it is important that interactive healthcare robots have intelligent communication abilities using multiple modalities such as spoken natural language, gaze, facial expressions, and illustrative gestures in order to be able to recognize the intent of a user and effectively convey their own intent using these modes. Current healthcare robots are capable of both single- and multi-modal interactions; however, they still have not yet been widely adopted in hospitals, clinics, and/or long-term care homes due mainly to their linguistic limitations, which can lead to critical procedural mistakes (misinterpretation of user requests) and/or frustration and lack of trust in the technology [97].
LLMs can address the aforementioned challenges in order to improve user experiences and the intent to use the technology [98]. The ability of LLMs to generate dynamic multi-modal communication and to detect and utilize emotional cues through emotional prompting techniques such as EmotionPrompt [99] will result in natural HRIs that more closely resemble human–human interactions [100]. The introduction of LLMs in healthcare robots aims to improve HRI via the cohesion between multiple communication modes; however, to date, the use of LLMs in HRI has mainly considered only a single mode [101].

3.1. Single-Modal Communication

Single-modal robot communication in healthcare applications has mainly consisted of either textual or verbal information exchange [102,103]. A primary challenge for such single-mode communication is ensuring that a robot does not become monotonous or repetitive, as this can reduce user engagement and trust and also negatively impact the adoption of healthcare robots [104].
In [76], GPT-3 was integrated into the ‘Mini’ small character-like social robot to provide companionship to older adults with mild cognitive impairment. The robot engaged in cognitive stimulation games and general conversations with the older adults. The Babbage and Davinci versions of GPT-3 were used to generate user-adapted semantic descriptions and to paraphrase prewritten texts in Spanish. This integration was tailored to achieve a single consistent mode of communication, enhancing the robot’s capabilities in natural and adaptive dialogues. The research highlights the potential of streamlined adaptable interactions in social robotics made possible by the capabilities of GPT-3 to create tailored responses. The use of GPT-3, despite requiring a translation step from English, was justified by its high performance and adaptability to the specific requirements of the application compared to other models including T5 multilingual [105], PEGASUS [106], and BERT2BERT [107].
In [32], a natural language interface using GPT-3.5 was implemented on a daVinci surgical robot [108] to provide a user-friendly interface for surgeons. The aim was to minimize the cognitive load of surgeons and improve efficiency. A surgeon inputs a verbal command using a microphone, which is then preprocessed using an off-the-shelf speech-to-text model and provided as a prompt to GPT-3.5. This prompt also includes a dictionary of possible actions that the daVinci robot can perform. GPT-3.5 is asked to match the natural language input to a robot action by generating an output based on the contents of its context window. The specific actions for the robot to execute include camera position settings, video and picture recording, and finding and tracking surgical tools based on the surgeon’s command. A Robot Operating System (ROS) [109] node structure was used to directly provide execution commands to the daVinci robot. The system usability was tested in a laboratory where 275 natural language commands were given to the system. It was able to correctly identify and execute the intended robot action with a success rate of 94.2%. A time delay existed between the command request and the execution of the robotic action, which was attributed to the response (computation) time of the GPT-3.5 model.
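A minimal sketch of this kind of voice-command pipeline is shown below, assuming a transcribed command, a fixed action dictionary embedded in the prompt, and an exact-match check before execution; the action names and the call_llm/dispatch stubs are illustrative and are not the implementation used in [32].

```python
# Sketch: match a transcribed surgeon command to one entry of a fixed action
# dictionary via an LLM, and execute only exact matches. Names are illustrative.
ACTIONS = {
    "track_tool": "Keep the camera centered on the named surgical tool",
    "move_camera_left": "Shift the endoscope view to the left",
    "start_recording": "Begin video recording of the procedure",
}

def build_prompt(transcribed_command: str) -> str:
    catalogue = "\n".join(f"- {name}: {desc}" for name, desc in ACTIONS.items())
    return (
        "Match the surgeon's request to exactly one action name from this list, "
        f"or reply 'none':\n{catalogue}\nRequest: \"{transcribed_command}\""
    )

def call_llm(prompt: str) -> str:
    return "track_tool"  # stub for the LLM response

def dispatch(action_name: str) -> None:
    if action_name in ACTIONS:
        print(f"executing {action_name}")  # e.g., forward to a ROS node here
    else:
        print("command not recognized; asking the surgeon to rephrase")

dispatch(call_llm(build_prompt("please follow the needle driver")).strip())
```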
In [31], GPT-3.5-Turbo was used to identify the intent of individuals in multi-party conversations between the social robot ARI [110] and patients and their companions in a memory clinic in a hospital. ARI provided directions and responded to visitor/patient questions. A dataset of multi-party interactions was obtained and annotated for the intention and goal of each speaker. This dataset was created using a Wizard of Oz procedure where the operator would choose the response of the ARI robot. The T5-Large, DialogLED, and GPT-3.5-Turbo LLMs were tested on these tasks using the dataset. The models were evaluated using zero-shot and few-shot approaches due to the limited dataset of patients. The ANOVA and Tukey HSD statistical significance tests were used to evaluate the model performance. The ANOVA test determined a significant difference in the percentage of correct annotations for intent slot recognition between GPT-3.5-Turbo and both T5-Large and DialogLED. The Tukey HSD test validated that GPT-3.5-Turbo significantly outperformed all other models in terms of exact match and partial correctness of annotations in few-shot settings. It was found that GPT-3.5-Turbo used in a few-shot setting with a ‘reasoning’-style prompt, where the reasoning for the desired output was explained, had the highest recognition and tracking correctness, namely, 69.57% correct intent recognition and 62.3% correct goal tracking. However, all models were prone to hallucination when the prompt was phrased as a story. This can be an issue in healthcare settings, as incorrect information can be provided to patients and healthcare professionals.
In [81], the ChatGPT web interface was used for speaker diarization of HRIs between people and the social robot Furhat [111], which has human-like expressions and conversational capabilities, to investigate the potential of LLMs for improving the HRI experience. Namely, diarization concepts such as ‘who’ spoke and ‘when’ they spoke were explored in multi-party human interactions. The robot would then be able to address specific users in group interactions while requiring a less resource-intensive system capable of diarization. The developed system used ChatGPT in zero-shot settings to determine if the model could identify different speakers within a transcribed conversation based on the linguistic patterns of each participant involved in the conversation. A test dataset was created by using OpenAI’s Whisper model [112] to transcribe a video of a discussion between two hiring managers, a candidate, and a Furhat robot. The model achieved 77% in exact matches, 0.88 in sentence-level annotation accuracy, 0.92 in word-level annotation accuracy, and 0.18 in Jaccard similarity, demonstrating that large language models can be used for diarization. Currently, the time required to generate speaker labels is too slow for real-world implementation. However, ChatGPT integrated into healthcare robots for diarization tasks has the potential to reduce the administrative burden on caregivers by assuming the role of an assistant and generating structured clinical notes based on caregiver–patient interactions. Furthermore, its attention head mechanism can identify the caregiver and the patient by analyzing the linguistic patterns of each and subsequently documenting the interaction into diagnosis, symptom, and treatment sections.
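The sketch below illustrates this zero-shot diarization idea on a transcribed clinical exchange: the LLM is asked to attribute each utterance to a speaker role purely from linguistic cues. The roles, transcript, and call_llm stub are illustrative assumptions, not the setup used in [81].

```python
# Sketch: zero-shot speaker attribution from an unlabeled transcript.
TRANSCRIPT = """I've been getting dizzy when I stand up too quickly.
How long has that been happening, and are you still taking the beta blocker?
About two weeks, and yes, every morning."""

def diarization_prompt(transcript: str) -> str:
    return (
        "Label each line of this clinical conversation as PATIENT or CLINICIAN, "
        "based only on what is said:\n" + transcript
    )

def call_llm(prompt: str) -> str:
    # Stub LLM output; a real model would return the labeled transcript.
    return "PATIENT: ...\nCLINICIAN: ...\nPATIENT: ..."

print(call_llm(diarization_prompt(TRANSCRIPT)))
```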

3.2. Multi-Modal Communication

In HRI, multi-modal communication is far more engaging and effective in building trust [113] when compared to single-modal communication, providing a more interactive and nuanced patient experience through the use of gestures (e.g., animated speech), body poses, and varying speech intonations [114,115]. Multi-modal HRI emulates human interaction patterns closely [116], and LLMs can generate emblems (non-verbal gestures or body language that have specific meanings) for a robot to display in a zero-shot approach. LLMs are also adaptable to variations in user interactions due to their word-level contextual understanding and attention head mechanisms, which dynamically increase the importance of relevant parts of the input text. These mechanisms allow a contextual understanding of user interactions, which is required when generating an appropriate speech response and emblem. The ability of LLMs to generate contextually accurate responses and emblems in HRI without the need to train on every possible interaction makes LLMs suitable for facilitating HRI in healthcare settings, especially considering the significant variance in demographics and the diverse nature of interactions encountered in such settings.
LLM frameworks applied in multi-modal HRI have the potential to make HRI in healthcare settings more engaging. For example, by aligning non-linguistic commands to natural language and generating dynamic responses to user queries, they can potentially provide more natural communication during robotic-guided therapy sessions, breaking down language barriers and providing emotional support. For example, in [82], GPT-3.5 was used to align non-linguistic communication cues with the natural language responses of any robot capable of multi-modal communication. The Empathetic Social Robot Design Framework utilized consisted of the following modules: Speech, Action, Facial Expression, and Emotion (SAFE), alongside the user request. For Speech, GPT-3.5 considered seven types of speech, from ‘high and fast speech’ to ‘slow speech in neutral tones’. Action encompassed seven gestures, such as turning the head towards the speaker, nodding, shaking the head, interlocking hands on the table, and eye contact. Facial Expressions included ten options including frown, light smile, pout, no expression, bright smile, raised eyebrows, grin, lowered eyebrows, jaw drop, and widened eyes. For Emotion, the framework provided GPT-3.5 with ten emotional states for the robot to display: joy, liveliness, sadness, surprise, anger, worry, calmness, indifference, absence of emotion, and disgust. In a user study analogous to the Turing test, GPT-3.5 was prompted with a specific problem according to the SAFE prompt structuring, such as ‘I am too nervous for the upcoming internship interview’. It was also provided with an example of how it should respond. The response generated by the GPT-3.5 was then compared to a response of a human presented with the same problem. The average alignment score for speech, action, facial expression, and emotion was 26%, 10%, 31%, 32%, and 25%, respectively. Such a robot system can be potentially useful for Reminiscence/Rehabilitation Interactive Therapy and Activities [117] for those living with dementia. A social robot can engage older adults in reminiscence activities such as music, TV shows, and movies. It can interpret users’ non-verbal responses and adapt its interactions to suit their emotional states, providing cognitive engagement and fostering emotional well-being.
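A minimal sketch of this SAFE-style prompting is given below, in which the LLM is asked for a structured response covering speech, action, facial expression, and emotion drawn from fixed option lists; the option subsets, JSON schema, and call_llm stub are illustrative assumptions rather than the exact framework of [82].

```python
import json

# Sketch: prompt for a structured Speech/Action/Facial Expression/Emotion response.
ACTIONS = ["nod", "shake head", "turn head towards speaker", "eye contact"]
FACES = ["light smile", "raised eyebrows", "no expression", "frown"]
EMOTIONS = ["calmness", "worry", "joy", "sadness"]

def safe_prompt(user_utterance: str) -> str:
    return (
        "Respond as an empathetic social robot. Return JSON with keys "
        "'speech', 'action', 'facial_expression', 'emotion'. "
        f"Choose 'action' from {ACTIONS}, 'facial_expression' from {FACES}, "
        f"'emotion' from {EMOTIONS}.\nUser: {user_utterance}"
    )

def call_llm(prompt: str) -> str:
    # Stub LLM response in the requested JSON format.
    return json.dumps({"speech": "That sounds stressful; let's go over it together.",
                       "action": "nod", "facial_expression": "light smile",
                       "emotion": "calmness"})

behaviour = json.loads(call_llm(safe_prompt("I am too nervous for my interview.")))
print(behaviour["speech"], behaviour["action"])
```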
In [83], the text-davinci-003 model of GPT-3.5 was used to generate dynamic responses of the Furhat robot to visitor questions regarding news and research being conducted at the National Robotarium in the UK. GPT-3.5 was integrated into Furhat to enable human-like speech and contextually accurate gestures while creating consistency between these modes of communication. The Furhat SDK provided the following: (1) an automatic speech recognition (ASR) module to transcribe speech to text, (2) a natural language understanding (NLU) module to identify user intent, (3) a dialogue manager (DiaL) to maintain conversational flow, and (4) natural language generation (NLG) using GPT-3.5. More specifically, the NLG module generated responses based on engineered prompts containing the user request from the NLU module, Furhat’s personality, and past dialogue histories from the DiaL module. Each response generated by the NLG module had an associated emblem, which was parsed by the Furhat SDK and presented by the robot using both verbal and non-verbal communication.

3.3. Summary and Outlook

The integration of LLM frameworks for HRI into healthcare robots can potentially improve the cohesiveness and engagement of interactions between healthcare robots and patients, visitors, and stakeholders. LLMs have been embedded into social robots to improve HRI by (1) providing non-repetitive single-mode (verbal) [76] and multi-modal communication, where the latter consists of embedding non-verbal communication into the verbal responses of healthcare robots (i.e., gestures, eye contact, and facial expressions) to users [82,83], and (2) identifying the linguistic patterns of the user [31,81]. Moreover, LLMs have the potential to provide a more efficient interaction interface through the use of natural language to control surgical robots such as the daVinci robot [32].
The use of LLMs for multi-modal communication can augment robots in order to extend their use in healthcare environments by accommodating non-verbal communication with individuals living with cognitive impairments, autism spectrum disorder, and/or learning disabilities, where verbal communication alone is not always feasible. Bidirectional multi-modal communication can be used when a healthcare robot needs to effectively convey and recognize various non-verbal cues such as gestures/body language, facial expressions, and vocal intonations. For example, a healthcare robot should recognize when a patient is stressed or upset using these non-verbal cues and respond with an appropriate emotional behavior (i.e., concerned) rather than being cheerful or happy, thereby aligning its responses with the patient’s emotional states.

4. Semantic Reasoning

Significant amounts of information must be processed in healthcare, including images, data, and text, in order to minimize errors and improve the efficiency of personalized care techniques [118]. The semantic reasoning of healthcare information requires experts to identify relationships between different factors such as genetic predispositions, lifestyle choices, environmental exposures, and/or social determinants of health, which may all influence the health of an individual [119,120]. Therefore, semantic reasoning encompasses the understanding of meanings, concepts, and relationships between data and medical knowledge. In particular, ontologies and knowledge graphs created from patient EHRs are currently used to infer how various symptoms, diseases, and treatments are interrelated and influence one another in order to make predictions for clinical decision-making and patient care [121].
COVID-19 increased the demand for telehealth by 367% in adults aged 55–65 and by 406% in adults aged 65 years and older [122]. However, frameworks such as the eCoach personalized referral program, which helps people stay active and achieve physical activity goals [123], and the Babylon Chatbot, which provides healthcare consultations through a mobile app [124], both require complex ontologies and access to large amounts of contextual information to generate personalized recommendations. Therefore, it is not only a matter of creating healthcare-focused datasets and ontologies to train deep learning and NLP models to provide predictions and inference; there also needs to be an effective approach to creating such inferences in order to (1) facilitate the seamless exchange of semantic reasoning frameworks among healthcare institutions [125] and (2) simplify the creation of frameworks that can effectively capture and convey the complex semantics of medical datasets, terminologies, and environments (hospitals, clinics, etc.) [126].
Traditional robotic semantic reasoning frameworks consist of three core components [127]: (1) knowledge resources (raw data), (2) computational frameworks (mathematical models), and (3) world representations (scene/environment representations). Knowledge resources include the data from which semantic knowledge is extracted (i.e., EHRs, clinical trial results). These data are used to train the LLMs. Computational frameworks consist of models such as transformers (LLMs) [128], probabilistic models (Bayesian networks) [129], or deep learning models (long short-term memory (LSTM) networks) used to capture temporal dependencies in data [130]. These computational frameworks are the models that encode the relationships between concepts [131]. They then use the encoded knowledge to perform inference [127]. World representations are used by robots to model their surrounding environments and their own behaviors [132]. In the case of LLMs, the world representation provides an LLM with a scene description of a robot’s working environment in the context window and thereby ‘grounds the LLM’ to its current environment [20]. The aforementioned core components allow robots to perceive, understand, and generalize semantic knowledge in order to improve performance in real-world tasks. The incorporation of LLMs for semantic reasoning in healthcare robots offers multiple advantages. Namely, healthcare robots need semantic reasoning to identify relationships between a task, an environment, and a user command. Semantic reasoning can be used (1) when generating object manipulation plans for surgical tools, equipment, instrumentation, and medical supplies, (2) to navigate to specific areas and regions in the environment such as the OR and/or patient rooms, (3) to interact with people (to provide assistive HRI) and other robots, and (4) to complete task management functionalities or specific tasks and services (patient education, guidance, informatics).
To date, only a handful of robots have been incorporated with LLMs for the purpose of semantic reasoning. For example, in [77], the semi-humanoid robot NICOL (Neuro-Inspired COLlaborator) used GPT-3 to reason about multi-modal inputs (sound, haptics, visuals, weight, texture) in order to improve robotic perception in object manipulation to help differentiate between visually similar objects. Namely, GPT-3 was used to perform interactive multi-modal perception and robot behavior explanations. The MATCHA (multi-modal environmental chatting) prompting technique provided specific action prompts to GPT-3 which included stored information about the robot’s environment. Namely, these action prompts included descriptions of the actions, as well as an example of an expected response. GPT-3 provided instructions to the robot to determine the target block by knocking on and weighing the blocks one by one. The robot then reported the findings (stored information) of each step back to GPT-3. This process was repeated until GPT-3 was able to determine which was the correct block with high accuracy (>90%). The system has been tested only in simulation by prompting NICOL to pick up a specific block made of metal out of three blocks, where the characteristics of the blocks were provided rather than perceived by the robot. Even though this task has not been implemented directly in a healthcare setting, the potential of MATCHA can be explored particularly for improving operational efficiency and decision-making for patient care in terms of multi-modal information gathering and reasoning for robot manipulation tasks such as retrieving and handing over (1) medical supplies on shelves in a supply room or (2) surgical tools on a table in the OR.
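The interactive perception loop described above can be sketched as follows, with the LLM requesting one exploratory action at a time and the robot’s observation appended to the dialogue history until the LLM commits to an answer; the action names, observations, and call_llm stub are illustrative stand-ins for the MATCHA prompts and NICOL’s sensing in [77].

```python
# Sketch: LLM-driven interactive perception. The LLM alternates between requesting
# exploratory actions and reading back observations, then commits to an answer.
def call_llm(history: str) -> str:
    # Stub: a real LLM would choose 'knock(blockN)', 'weigh(blockN)', or 'answer: ...'
    return "answer: block2" if "metallic" in history else "knock(block2)"

def execute(action: str) -> str:
    # Stub robot feedback; in [77] this came from sound, weight, and haptic sensing.
    return "block2 sounds metallic" if action.startswith("knock") else "block2 weighs 300 g"

history = "Task: find the metal block among block1, block2, block3.\n"
for _ in range(5):                      # bounded number of exploration rounds
    reply = call_llm(history)
    if reply.startswith("answer:"):
        print("LLM decision:", reply)
        break
    observation = execute(reply)
    history += f"Robot did {reply}; observed: {observation}\n"
```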
In [87], the ‘SayCan’ method integrated the PaLM 540 B parameter model, as PaLM-SayCan, into a mobile manipulator [133] for semantic reasoning in the context of human-centered environments (i.e., a kitchen in an office). The manipulator provided the perception and manipulation capabilities, while PaLM provided high-level semantic knowledge about tasks to promote successful task completion. The system was trained in a mock kitchen and tested in an office kitchen environment. SayCan is not only prompt-based but also uses a temporal difference reinforcement learning (RL) approach. This approach learns the reward (value) of each action and therefore prioritizes the actions that are executable from the robot’s current state. PaLM 540 B is provided with prompts that include the robot’s capabilities and their descriptions and the actions the robot can take expressed as functions (e.g., ‘goto()’). PaLM 540 B is given a task, such as find object X, and asked to generate multiple possible actions paired with the likelihood of each prediction. The action probabilities are then multiplied by the probability of success (acquired through the temporal difference method in RL). The action function with the highest combined probability is then executed by the robot’s low-level control system. This process is repeated for all subsequent steps in the generated plan. The approach was benchmarked by obtaining the plan success rate and the execution success rate in completing a task such as bringing a bag of rice chips from a drawer. The SayCan development improved task execution in human-centered environments by leveraging semantic reasoning. The potential healthcare applications of ‘SayCan’ can be extended to mobile social robots directing patients/visitors to specific rooms or departments in a hospital and/or mobile manipulators fetching or localizing essential medical supplies.
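The core SayCan scoring step can be illustrated with a small numerical sketch: each candidate skill’s LLM likelihood is multiplied by its learned probability of succeeding from the current state, and the highest-scoring skill is executed. All of the numbers below are illustrative, not values reported in [87].

```python
# Sketch: combine LLM usefulness scores with learned success probabilities.
llm_likelihood = {          # p(skill helps with "bring the rice chips from the drawer")
    "goto(drawer)": 0.55,
    "open(drawer)": 0.25,
    "pick(rice_chips)": 0.15,
    "goto(counter)": 0.05,
}
affordance = {              # p(skill succeeds from the robot's current state)
    "goto(drawer)": 0.95,
    "open(drawer)": 0.30,   # low: the robot is not yet at the drawer
    "pick(rice_chips)": 0.10,
    "goto(counter)": 0.95,
}
scores = {a: llm_likelihood[a] * affordance[a] for a in llm_likelihood}
next_skill = max(scores, key=scores.get)
print(next_skill, scores[next_skill])   # -> goto(drawer) 0.5225
```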
In [85], LLM-Grounder, an open-vocabulary zero-shot LLM-based 3D visual grounding pipeline, was introduced. LLM-Grounder integrated GPT-4 and a simulated robotic agent to reason about the semantic relationships between high-level commands given by a user (i.e., find the grey monitor on top of the smaller curved desk) and the working environment (a simulated office) for object localization tasks. GPT-4 was used to break down natural language commands into their semantic constituents, namely, a target and a landmark. The ability of GPT-4 to (1) identify the landmark and the target and (2) differentiate between the two provides an efficient method to translate high-level user commands into the robot actions needed to complete a goal. Namely, OpenScene [134], a 3D visual grounding method that uses the transformer architecture to generate 3D scene layouts based on textual descriptions, was used. GPT-4 provided OpenScene with (1) the target name and its attribute (monitor, light grey) and (2) the landmark name and relation (curved desk, small). OpenScene returned to GPT-4 the respective bounding box volumes and distances. GPT-4 was then used to reason about the size of objects and their location relative to the landmark, deciding which found target and landmark pair(s) had the highest likelihood of being correct. Therefore, GPT-4 provides an efficient method for robots to understand the semantics of high-level user commands and to act on these commands. LLM-Grounder was evaluated on the ScanRefer [73] benchmark, which is a standard dataset for assessing 3D vision–language grounding capabilities. The ScanRefer dataset consists of scenes ranging from wildlife to home environments, where each point cloud and image has a text description [73]. LLM-Grounder demonstrated state-of-the-art performance when used for zero-shot open-vocabulary grounding, excelling in the understanding of complex language queries (an increased number of nouns in the command) over approaches that do not use LLMs and rely solely on CLIP. The ability of a robot to correctly understand user commands in the context of locating objects in human-centered environments is crucial for the successful implementation of healthcare robots. A robot deployed in a healthcare environment must be able to identify the correct target (out of many) in cases where items are not easily distinguishable, such as identifying a specific medication on a medication cart where labels are not visible. This can only be achieved if the healthcare robot is able to understand the pill bottle description and its semantic relationship to the landmark specified by the user in natural language.
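A minimal sketch of the LLM-Grounder-style decomposition and spatial reasoning is given below: the query is split into a target and a landmark, a visual grounder returns candidate boxes for each, and the target closest to a landmark is selected. The parsing and grounding functions and all box values are illustrative stand-ins for GPT-4 and OpenScene.

```python
import math

# Sketch: target/landmark decomposition followed by simple spatial reasoning.
def parse_query(query: str) -> dict:            # stand-in for the LLM decomposition
    return {"target": "grey monitor", "landmark": "small curved desk"}

def ground(phrase: str):                        # stand-in for visual-grounder candidates
    return ([{"center": (1.0, 2.0, 0.9), "volume": 0.02},
             {"center": (4.0, 0.5, 0.9), "volume": 0.02}]
            if "monitor" in phrase else
            [{"center": (1.1, 2.1, 0.4), "volume": 0.6}])

def choose(targets, landmarks):
    # Prefer the target closest to any landmark ("on top of" implies proximity).
    best = min(((t, l) for t in targets for l in landmarks),
               key=lambda pair: math.dist(pair[0]["center"], pair[1]["center"]))
    return best[0]

parts = parse_query("find the grey monitor on top of the smaller curved desk")
print(choose(ground(parts["target"]), ground(parts["landmark"])))
```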
In [74], the ‘Lang2LTL’ method integrated the SPOT quadruped robot [135] with GPT-4 and a T5 base model to provide SPOT with the semantic reasoning capabilities required to understand and act on user speech commands in the context of navigating environments ranging from offices to city streets. Lang2LTL used these LLMs to break down navigational commands (i.e., ‘Go to the store on Main Street but only after visiting the bank’) into Linear Temporal Logic (LTL) in the form of sequential objectives, such as (1) go to the bank and (2) go to the store on Main Street after. User commands in the form of natural language were given to GPT-4 to identify the referring expressions (REs), such as the store on Main Street or the bank. After being identified, the REs were grounded to known physical locations (retrieved from a database) by being compared, using cosine similarity, to known proposition embeddings, which are location description/coordinate pairings. The grounded REs were used as an input to the fine-tuned T5 base model to generate LTL formulas. The fine-tuned T5 model was then used to (1) generalize the initial input command (go to ‘a’ but only after visiting ‘b’) and (2) generate the LTL formula required by the planner to facilitate navigation via the robot’s low-level controller. Lang2LTL was tested on SPOT in an indoor environment consisting of bookshelves, desks, couches, elevators, tables, etc. SPOT successfully grounded 52 commands, including 40 satisfiable (executable) and 12 unsatisfiable (unsafe to execute) ones. The potential applications of ‘Lang2LTL’ in healthcare are robot delivery tasks within clinical settings. Namely, the added ability to identify unsafe robotic action executions can aid a robot planner. This can be achieved by identifying the semantic importance of user commands and the order of operations, such as retrieve a walking aid for a patient, then visit the patient’s room to provide them with the aid.
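The grounding step of this pipeline can be sketched as a cosine-similarity lookup against a landmark database, after which the grounded symbols are slotted into an LTL sequencing template (eventually reach the bank, and after that eventually the store); the landmark names and tiny vectors below are illustrative, not the actual embeddings or formulas used in Lang2LTL [74].

```python
import numpy as np

# Sketch: ground referring expressions by cosine similarity, then build an LTL formula.
landmark_db = {
    "bank_main_st": np.array([0.9, 0.1, 0.0]),
    "store_main_st": np.array([0.1, 0.9, 0.1]),
    "pharmacy_2f": np.array([0.0, 0.2, 0.9]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ground(expr_embedding):
    return max(landmark_db, key=lambda name: cosine(expr_embedding, landmark_db[name]))

store = ground(np.array([0.2, 0.8, 0.0]))   # embedding of "the store on Main Street"
bank = ground(np.array([0.8, 0.2, 0.1]))    # embedding of "the bank"
ltl = f"F({bank} & F({store}))"             # eventually the bank, and afterwards the store
print(ltl)
```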

Summary and Outlook

In general, the integration of LLM frameworks for semantic reasoning into healthcare robotics has the potential to improve robot autonomy in complex and dynamic healthcare environments. Currently, LLM-based semantic reasoning has increased robot autonomy, as the transformer architecture of LLMs semantically reasons about the entire input at once, facilitating faster and more accurate decision-making in comparison to traditional sequential processing models such as Recurrent Neural Networks (RNNs) [39]. This approach allows robots to understand complex relationships among object characteristics, robotic action outcomes, the spatial importance of target objects versus landmarks, and user-imposed constraints. Healthcare robots can use frameworks such as MATCHA [77] for autonomous capabilities such as the detection and manipulation of medical supplies based on characteristics such as weight and texture. Additional applications include robots augmented with SayCan [87], LLM-Grounder [85], or Lang2LTL [74] frameworks for (1) navigating efficiently within healthcare facilities by understanding spatial layouts and identifying potential unsafe surfaces (e.g., wet floors), (2) executing patient-specific tasks by interpreting natural language commands given by care providers such as retrieving a vital signs monitor cart before visiting a patient in their room, or (3) increasing the surgical team’s situational awareness by alerting them to equipment needs and preemptively managing robotic tool positions, thus ensuring sustained operational focus and efficiency.

5. Planning

An existing challenge for healthcare robotics is autonomously planning safe and effective behaviors in real time for healthcare tasks. These tasks can include (1) navigating through complex hospital environments to find healthcare professionals and patients and to escort visitors, (2) managing and delivering medications and medical supplies, (3) assisting in surgery by tool handovers and supporting precise tool movements, and (4) facilitating both physical and cognitive rehabilitation with different user groups. Each of these tasks requires a robot to perceive and interpret its surroundings and make real-time decisions that ensure safety and task effectiveness. For example, surgical robots operating within soft tissues and navigating curved paths face a particularly uncertain and dynamic environment due to the complex and variable nature of human anatomy. Soft tissues can shift or deform during procedures, altering expected pathways and requiring real-time adjustments in the robot’s movements. Additionally, the inherent variability in patient anatomy means that pre-planned paths may not always apply precisely, necessitating continuous sensory feedback and adaptive control strategies to accurately guide surgical tools without causing unintended damage. This environment demands high levels of precision and adaptability from surgical robots to ensure safety and effectiveness in their operations.
In general, healthcare robotic task plans need to be adaptable to different situations and people. Currently, the majority of surgical robots are teleoperated by a surgeon [136,137]. An autonomous surgical robot needs to adjust to changes in patient anatomy in real time [138]. Presently, the lack of real-time 3D sensing is a significant constraint [139]. This constraint prevents surgical robots from (1) operating in realistic conditions (i.e., lighting changes, occlusions) and (2) operating on non-planar surfaces [139]. While minimally invasive methods such as the multi-camera CARET system proposed in [140] attempt to enhance the surgical field of view without making additional incisions, they do rely on complex intra-camera tracking to maintain an expanded view. Namely, this system determines the correspondence between different cameras at the initialization stage and updates the expanded view frequently when there is enough overlap between views. However, this can still lead to inaccurate mosaicking results due to error accumulation over time [141]. Although the visual field provided by instruments like laparoscopes is not always optimal, surgeons are able to effectively conduct surgeries due to their continuous learning and ability to interpret surgical situations and semantically reason about the current state of the operation. Integrating LLMs with surgical robots will allow for this decision-making process by semantically reasoning about the current state of the surgery and generating robot action plans from the limited information about the operating environment provided by imaging sensors such as laparoscopes and the multi-camera CARET system.
Existing robotic planning frameworks that have incorporated autonomy have mainly used heuristic and DL models. Heuristic methods, such as (1) genetic algorithms [142], (2) Greedy Best-First Search [143], and (3) Simulated Annealing [144], aim to identify the sequence of robot actions for task planning. Additionally, DL methods, such as (1) the path planning and collision network (PPCNet) [145], (2) DRL models [146], and (3) LSTM networks [147], learn from training examples and episodes in order to autonomously plan robotic actions and generalize to real-world situations. The unpredictable nature of healthcare environments, with sudden medical emergencies and changing patient conditions, requires a robot to autonomously react to new or quickly changing scenarios. Heuristic models are based on a set of rules and tend to yield satisfactory rather than optimal solutions [148]. Furthermore, the efficacy of DL models in generalizing to unseen data is significantly influenced by the diversity and size of training datasets [149,150]. Ensuring that robotic planners can adapt to new and varying conditions beyond their training data is crucial for their effective deployment in healthcare settings. Consequently, there is a need for healthcare robotics planning methods that can manage extensive datasets encompassing immense amounts of high-quality data describing human-centered scenes, medical and general knowledge, and various medical procedures to enable robots to determine optimal plans for complex tasks. These include robot-guided surgery for precise operational assistance, real-time diagnostic analysis during various medical procedures to inform decisions, and robot-led interventions and rehabilitation that adjust treatment plans based on ongoing patient evaluations across different healthcare settings.
LLMs can interpret complex instructions and patient data, facilitating robots to make informed decisions in real time. Unlike heuristic models that rely on predefined rules, LLMs take advantage of attention head mechanisms which add bias towards generating outputs related to the contents of the context window, which include information from the environment [151]. The contents can be updated in real time to contain the current state of both the robot and the healthcare environment, thereby having the LLM generate robotic plans specific to these states. Furthermore, both the ability of LLMs to dynamically add bias when generating a plan and to semantically reason allows LLMs to adapt already generated robotic plans by supplementing the model input with state changes and iteratively refining the sequence of planned robot actions to ensure successful execution. LLMs mimic human planning in contrast to the aforementioned DL methods. Namely, the latter lack the ability to iteratively and efficiently update specific portions of robotic plans to address run-time issues such as equipment failures in surgical rooms or variations in treatment responses during rehabilitation sessions [152]. Healthcare robots need the adaptable planning capabilities of LLMs to enhance their effectiveness in diverse and dynamic environments, such as (1) emergency rooms where conditions can change rapidly, (2) ORs that require precision under varying circumstances, and (3) rehabilitation settings where patient responses can be unpredictable and varied. Therefore, using LLMs in healthcare robotics planning can provide (1) recovery in cases of sudden environmental changes, for example, an influx of patients due to an outbreak where the robot has to prioritize which patients it should assist first, (2) robotic plans suitable for execution around vulnerable people by inferring about social norms, for example, when escorting frail patients to their rooms, (3) fine-tuned motion plans to facilitate rehabilitation for stroke survivors and physically impaired patients, and (4) dynamically generated robotic plans that can be interpreted by non-expert roboticists (healthcare professionals), for example, in surgical robotics, where the surgeon must anticipate the robot’s next planned motion. Furthermore, Vision–Language Models (VLMs) can provide visually anchored features to assist with task planning in healthcare robotics [153]. Namely, VLMs can aid in grounding healthcare robotic applications in the physical world by integrating visual and textual data into a joint embedding space, which LLMs lack on their own, enabling robots to understand physical contexts and respond more appropriately to environmental cues.
To date, a handful of robots have used LLMs for task planning in human-centered environments. For example, in [88], the LLM-BRAIn method used the Stanford Alpaca 7B parameter LLM to generate behavior trees (BTs) for identifying and retrieving objects with mobile manipulator robots. BTs provide a modular behavior structure consisting of nodes (robot execution steps) that is easily scalable. Initially, GPT-3 was used to create a synthetic training dataset where each example is an XML file containing a randomly generated BT for a mobile manipulator robot and a description of the robot’s movement and object manipulation steps. The synthetic dataset was used to fine-tune the Alpaca model parameters for generating robot BTs. To generate BTs based on user requests, the fine-tuned Alpaca model was prompted with a list of robot actions to execute (e.g., take object, scan area) and the user request (e.g., if object is visible, move towards it, take it, and process it). A ROS2 BT interpreter [149] took the output of the fine-tuned Stanford Alpaca model (in XML format) and generated executable actions for the robot. The LLM-BRAIn method presented the robot’s functionality at all planning steps in a format understandable to the user. To incorporate the LLM-BRAIn method into healthcare robots, the BTs used to train LLM-BRAIn should include healthcare-related tasks instead of the general localization and retrieval of objects, thus allowing healthcare professionals to understand a healthcare robot’s intentions and actions and potentially stop the execution of a task if there are any concerns, ensuring safe interactions when providing patient care.
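To illustrate how an LLM-generated behavior tree could be consumed on the robot side, the sketch below parses a small XML tree and ticks each leaf action in sequence; the XML schema, action names, and executor are illustrative assumptions and do not follow the exact LLM-BRAIn or ROS2 BT interpreter format.

```python
import xml.etree.ElementTree as ET

# Sketch: parse an LLM-generated behavior tree (XML) and tick its actions in order.
BT_XML = """
<BehaviorTree>
  <Sequence name="deliver_medication">
    <Action name="moveTo" target="pharmacy"/>
    <Action name="grasp" target="medication_tray"/>
    <Action name="moveTo" target="room_12"/>
    <Action name="handOver" target="nurse"/>
  </Sequence>
</BehaviorTree>
"""

def tick(node) -> bool:
    if node.tag == "Sequence":                    # succeed only if all children succeed
        return all(tick(child) for child in node)
    if node.tag == "Action":
        print(f"executing {node.get('name')}({node.get('target')})")
        return True                               # stub: assume the robot succeeds
    return False

tree = ET.fromstring(BT_XML)
tick(tree[0])
```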
In [84], GPT-3.5 was used to guide a Franka Emika Panda robot arm operating in a mock kitchen to facilitate the handover of dirty kitchen utensils (i.e., fork, spoon, knife) to a human washing the dishes. The zero-shot capabilities of GPT-3.5 in planning collaborative robotic tasks that align with human social norms were investigated. GPT-3.5 was prompted with examples from the MANNERS-DB dataset [154], which contains HRI scenarios in which appropriate robot behaviors are represented. GPT-3.5 was then tested for its ability to (1) establish/maintain human trust while guiding the robot to hand off sharp objects (i.e., knives) and (2) predict cultural and social norms using (a) the Trust-Transfer dataset [155,156], which contains 189 instances of driving and household tasks where participants rated their trust in the robot completing the task on a seven-point Likert scale, and (b) the SocialIQA dataset [157], which contains 1954 testing examples, each consisting of a context, a question, three possible answers, and a ground truth, which can be used to test a robot’s commonsense reasoning. In a physical experiment consisting of the robot and a user, GPT-3.5 facilitated robot-assisted utensil washing while allowing for user intervention. Successes and failures in handovers were reported, with only 28.1% of participants trusting the robot with knife handovers. GPT-3.5 can improve patient and care provider HRI by utilizing in-context learning to refine robotic task plans, ensuring they align with user perceptions through analyzing past failures and applying necessary corrections. The potential applications in healthcare-related tasks can include (1) delivering medications/supplies by navigating busy hospital corridors (e.g., giving way to humans and incorporating other social etiquette) and (2) autonomously cleaning patient rooms while patients are still present (e.g., keeping a safe distance, smooth navigation).
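As a concrete, hedged illustration of the in-context learning idea described above, the short sketch below builds a prompt in which a few rated HRI scenarios precede a new handover question; the example scenarios and the rating labels are invented for illustration and are not drawn from MANNERS-DB.

```python
# Hedged sketch of few-shot prompting for social-norm-aware handover decisions.
# The scenarios and labels below are illustrative placeholders only.
few_shot_examples = [
    ("Robot vacuums loudly while a person naps on the sofa", "inappropriate"),
    ("Robot hands a cup of water to a seated adult", "appropriate"),
]

def build_prompt(new_scenario):
    lines = ["Rate whether the robot behavior is socially appropriate."]
    for scenario, label in few_shot_examples:
        lines.append(f"Scenario: {scenario} -> {label}")
    lines.append(f"Scenario: {new_scenario} ->")   # the LLM completes this line
    return "\n".join(lines)

print(build_prompt("Robot hands a kitchen knife, handle first, to a person washing dishes"))
```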
In [78], the ProgPrompt method integrated GPT-3 into the Franka Emika Panda manipulator robot [158] to facilitate object manipulation in a physical mock kitchen and simulated VirtualHome [159]. The prompts were based on object manipulation and sorting, for example, ‘sort fruits on the plate, and sort bottles in the box’. In the implementation of the ProgPrompt method, two instances of GPT-3 were used to generate and refine robot task plans. The first instance of GPT-3 was given a prompt containing robot primitive actions, a list of available objects in the environment (which were dynamically identified through an open-vocabulary object detection model in real-world applications or predefined in simulations), and example plans illustrating the desired task structure and outcomes. GPT-3 then generated an initial plan with natural language comments detailing each step and logical assertions (e.g., ‘put the banana on the plate’) to provide a method for error recovery in case of execution failures. The second GPT-3 instance iterated over the initial plan with a focus on reasoning about the current semantic state of the environment and the task at hand. The plan was modified by adding or removing steps to successfully achieve the goal provided in the first instance. For example, if the task involved microwaving salmon, the second instance considered the state of relevant objects and actions (such as whether the microwave door is open or the robot is holding the salmon) to determine the next logical step in the plan. This iterative process allowed for dynamic adaptation and refinement of the plan, ensuring it was contextually appropriate and executable within the given environment. The ProgPrompt framework was tested in the VirtualHome environment using GPT-3 to generate task plans across 10 tasks, such as ‘put salmon in the fridge’ or ‘put banana on plate’. The ProgPrompt method excelled in simple tasks. However, it was less effective in complex sequential tasks, such as ‘put the banana on the plate and the pear in the bowl, sort the fruits on the plate and the bottles in the box’. The ProgPrompt method integrated with GPT-3 can potentially be used in healthcare robotics to enable robotic medication dispensing tasks such as precisely dispensing into measuring cups, using logic assertions and semantic reasoning based on the textual description of the scene to determine when the cup is full, thus eliminating the need for specialized hardware such as prescription dispensing systems [160] and also relying on vision to control a robotic manipulator for accurate dispensation.
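To make the ProgPrompt-style structure more tangible, the following minimal Python sketch shows a program-like plan with logical assertions for a hypothetical medication-dispensing task; the primitive actions and scene state are stubbed placeholders rather than the actual API or environment used in [78].

```python
# Minimal sketch of a ProgPrompt-style plan with assertions for a hypothetical
# medication-dispensing task. The primitive actions and scene dictionary are
# stand-ins, not the real robot API or simulator from [78].
scene = {"medication_bottle": "on_shelf", "measuring_cup": "empty"}

def grab(obj):                      # primitive action stub
    scene[obj] = "in_hand"

def pour_into(src, dst):            # primitive action stub
    scene[dst] = "full"

def place(obj, location):           # primitive action stub
    scene[obj] = f"on_{location}"

def assert_state(obj, expected):
    # Logical assertion used for error recovery: if the check fails, the LLM
    # would be re-prompted with the current scene description to repair the plan.
    if scene.get(obj) != expected:
        raise RuntimeError(f"recovery needed: {obj} is {scene.get(obj)}")

def dispense_liquid_medication():
    grab("medication_bottle")
    assert_state("medication_bottle", "in_hand")
    pour_into("medication_bottle", "measuring_cup")
    assert_state("measuring_cup", "full")   # stop pouring once the cup reads full
    place("medication_bottle", "tray")

dispense_liquid_medication()
print(scene)
```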
In [79], the RoboGPT framework utilized three instances of GPT-3 in the Franka Emika manipulator robot [158] to improve robotic manipulation tasks such as word spelling using letter blocks, moving letter blocks, bin packing and pyramid stacking, and house building using cubes. The first instance of GPT-3 was a decision bot, which generated a sequence of robot actions based on a task-oriented prompt. The prompt provided detailed background information, including (1) the robot’s capabilities, (2) a description of the scene, and (3) guidelines on the model’s response format. It also included robot API commands such as ‘envs.pickObject(object_name)’ and the coordinates of objects within the scene. The decision bot iterated on the generated sequence of robot actions to resolve any run-time errors to ensure the robot can perform each step of the plan. The evaluation bot (second instance) was deployed to verify that the sequence of robot actions generated by the decision bot aligned with the task requirements, effectively serving as unit tests to confirm the correctness of each action. Lastly, the corrector bot was given the generated plan and the results from the evaluation bot in order to identify the reason for the failure of the plan. The prompt of the corrector bot included background information, describing its role specifically as a ‘code corrector’, and the coordinates of the objects in the scene. The prompt also outlined a structured analysis process, which included the following steps: guessing the intended spatial relationships between objects from input codes, determining actual spatial relationships from final object states, and analyzing discrepancies to suggest possible reasons for any failures. The evaluation bot validated the decision bot’s plan by verifying the correct stacking of cubes. It used robot API functions such as the ‘self.checkOnTop(object_1_name, object_2_name)’ method to confirm that the objects were accurately placed according to the task requirements and object attributes provided. This collaborative system enabled precise task execution through the iterative refinement and evaluation of the generated plan. After a task was successfully completed using the plan, the plan and the outcome of the plan (i.e., success or failure) served as a demonstration example used to train a DRL model to perform the same tasks, thereby improving the efficiency of generated robot action plans by reducing the reliance on the decision, evaluation, and corrector bots. The RoboGPT framework can be advantageous for healthcare robotics in terms of assisting care providers with repetitive and non-repetitive tasks. For many repetitive tasks, such as organizing medical equipment or retrieving surgical tools in the OR, the framework can enable the robot to rely on the DRL model for robot task execution instead of generating repetitive plans using LLMs, thus improving efficiency. As for non-repetitive tasks, such as triaging, the three bots can provide a healthcare robot with the versatility needed to adapt to changes in the environment and the constraints of the task and patients.
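The decision–evaluation–correction loop described above can be summarized with the hedged sketch below, in which llm() is a stand-in for any chat-completion endpoint and the prompts are simplified placeholders rather than the prompts used in RoboGPT.

```python
# Conceptual sketch of the decision/evaluation/correction loop described for
# RoboGPT [79]. llm() is a placeholder for an LLM call; the role prompts and
# the failure check are simplified for illustration.
def llm(role_prompt, content):
    # Placeholder: a real system would query an LLM endpoint here.
    return f"[{role_prompt}] response to: {content}"

def plan_with_self_correction(task, max_rounds=3):
    plan = llm("decision bot: output robot API calls", task)
    for _ in range(max_rounds):
        verdict = llm("evaluation bot: do these calls satisfy the task?", plan)
        if "fail" not in verdict.lower():
            return plan                      # plan passes the unit-test-style check
        diagnosis = llm("corrector bot: explain the failure", plan + verdict)
        plan = llm("decision bot: revise the plan", task + diagnosis)
    return plan

print(plan_with_self_correction("stack the gauze boxes on the supply cart"))
```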
In [80], GPT-3 was used as the robotic planner for the Toyota HSR service robot operating in a home environment. The robot fulfilled user requests with respect to general-purpose service tasks, such as ‘pick up the apple from the bookcase and put it on the storage table’. A task instantiation module was used to execute robot actions planned by GPT-3, such as the primitive robot actions ‘move_to()’, ‘grasp()’, ‘pass_to()’, and ‘visual_question_answering()’. An inference module was used to extract information from the environment. The inference module consisted of (1) Google Cloud speech recognition [161] to transcribe user commands, (2) object detection using YOLOv7 [162] to recognize household objects in 2D images, (3) a visual question answering (VQA) model to scan the environment based on RGB images with the purpose of locating a specific object, (4) the open-vocabulary object detection model Detic [163] to obtain textual descriptions of the environment to provide context to GPT-3, and (5) the EZPOSE human pose estimation model [164] to identify people for human guidance. Once GPT-3 was prompted with a task-oriented prompt containing the transcribed user request (using speech-to-text), it generated a sequence of primitive robot action function calls using the user-specified targets as arguments; for example, for the request ‘locate the fruits in the dining room’, the argument passed to the VQA model would be ‘where are the fruits?’. Furthermore, the locations of the arguments in the generated plan were checked against an object location database before execution. If the location was found, the plan was executed; otherwise, the inference module was queried with images of the environment using point cloud and RGB information. Then, GPT-3 was directly questioned about the possible locations of the object in the context of the current scene description. This framework was tested at the RoboCup@Home Japan Open competition, where it won first place. The robot planner can be extended to healthcare environments to generate robot action plans based on all the available information from the scene. The framework can be classified as a VLM-based approach, as RGB and point cloud data are used in conjunction with an LLM to generate plans which consider all features of the environment (i.e., people, objects, obstacles). In healthcare settings, this integration can effectively provide patient/visitor navigation guidance through social navigation, enabling healthcare robots to autonomously generate and update navigation paths in real time while adapting to dynamic changes such as increasing crowds or new emergency situations. These robots can interact using natural language, adjust their navigation pace and route based on verbal feedback to ensure the user’s comfort, and cater to specific needs such as slower movement for elderly patients or quick access for emergencies. This approach can enhance the overall efficiency and user experience in navigating large hospital environments.
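A minimal sketch of the location-check step described above is given below; the object location database, the plan format, and the fallback to a VQA query are illustrative stand-ins for the implementation in [80].

```python
# Sketch of checking a generated plan against an object location database and
# falling back to the inference (VQA) module when a location is unknown.
# All names, the plan format, and the check rule are simplified placeholders.
object_locations = {"apple": "bookcase"}          # toy stand-in for the database

generated_plan = [
    ("move_to", "bookcase"),
    ("grasp", "apple"),
    ("move_to", "storage table"),
    ("pass_to", "storage table"),
]

def locate_with_vqa(obj):
    # Placeholder for querying the VQA model / GPT-3 about a likely location.
    return f"where is the {obj}?  -> (answer inferred from scene images)"

for action, target in generated_plan:
    # Simplification: only object-handling actions need a known object location.
    if action == "grasp" and target not in object_locations:
        print(locate_with_vqa(target))            # fall back to the inference module
    else:
        print(f"execute {action}({target!r})")
```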
In [86], GPT-4 was integrated into the Alter3 android robot [165] for the self-planning of the robot’s physical actions in order to adapt its own pose to user speech requests in applications which include ‘Take a selfie’, ‘Pretending to be a ghost’, and ‘I was enjoying a movie while eating popcorn in the theater, when I suddenly realized that I was actually eating the popcorn of the person next to me’. Two frameworks were developed. The first framework consisted of task-oriented prompting to plan the limb movements of the Alter3 robot and used EmotionPrompt to ensure that there are expressive gestures included in the generated movement plan. For example, GPT-4 was instructed to generate a high-level plan of how Alter3 should choreograph its limbs in response to, e.g., ‘drink some tea’. The task-oriented prompt used the high-level plan to generate a sequence of robot actions corresponding to each limb of the robot to realize the planned expressions and movements, where the emotional prompt placed emphasis on exaggerating the emotional and facial expressions associated with the user request. Namely, the task-oriented prompt generated a sequence of function calls that were executed by the control system of Alter3, where each function call controls one of Alter3’s 43 motion axes. Each limb (eyebrows, shoulders, index finger, etc.) has multiple axes. The effectiveness of this approach in articulating an android robot was explored in a user study. In the study, participants interacted with Alter3, where GPT-4 generated an action plan based on input prompts. The participants failed to distinguish between GPT-4-generated and robotic-expert-programmed movements. The second framework introduced a closed-loop system that builds onto the first framework by storing generated sequences of robot actions in a JSON database to be used again for similar requests. For example, if the robot has been asked to ‘take a selfie’, the plan which articulates the robot’s limbs will only need to be generated once, and any subsequent request to ‘take a selfie’ will use the already generated plan stored in the JSON database, therefore reducing plan generation time and improving responsiveness. Additionally, this framework was further expanded to incorporate a ‘Social Brain’ mechanism, employing multiple instances of GPT-4, each with a distinct role, to mimic human collaboration in problem-solving. A potential healthcare application for GPT-4 in robotic planning is to autonomously generate detailed rehabilitation plans that can be reviewed by a physiotherapist before they are executed, thereby keeping the therapist involved. The generated plans can be updated in real time based on inputs from patients with respect to their comfort and pain levels during a rehabilitation session. This dynamic adjustment helps in providing personalized care, and the initial plans serve as a clear communication tool for healthcare professionals to understand and oversee the treatment protocol.
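The closed-loop action memory described above can be approximated with the following hedged sketch, in which previously generated motion plans are cached in a JSON file and reused for repeated requests; the axis indices and plan format are invented for illustration.

```python
# Minimal sketch of a JSON action memory in the spirit of the Alter3 closed-loop
# framework [86]: plans are generated once per request and reused afterwards.
# The axis numbers and plan structure below are illustrative assumptions.
import json, os

CACHE_PATH = "action_memory.json"

def load_cache():
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def generate_plan_with_llm(request):
    # Placeholder for a GPT-4 call that maps a request to per-axis motor commands.
    return [{"axis": 21, "value": 120}, {"axis": 22, "value": 80}]

def get_plan(request):
    cache = load_cache()
    if request not in cache:              # generate only once per unique request
        cache[request] = generate_plan_with_llm(request)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[request]

print(get_plan("take a selfie"))          # a second call would reuse the cached plan
```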
In [75], the KnowledgeBot framework used the T5 LLM as the backbone of a robot action planner used in conjunction with a conversational embodied agent in the Alexa Prize SimBot Challenge environment [166]. A T5-Large model was used to train an object generator that produces a list of objects of interest based on a user-provided task description, such as ‘faucet, house plant, cup’ for ‘water the plant’. Then, each object was concatenated with the task description and used by the encoder–decoder-based planner (T5 backbone) to generate step-by-step robot actions to execute. A variation of this procedure was developed in order to emulate human cognitive processes in task planning and execution, ensuring that each step is informed by the overall task context and the progress made. To generate step-by-step robotic action plans, the task and its previously generated steps are passed to the object generator and an encoder in parallel. Then, the output from the encoder (1) goes to the decoder and (2) is paired with the generated list of objects and analyzed by the attention mechanism, which biases the objects relevant to the current generation step. Finally, the output from the attention mechanism is also provided to the decoder, where the next step of the plan is generated. The generated step is concatenated to the task description (initial input), and the process is repeated until a complete plan is generated. In the Alexa Arena environment [167], the KnowledgeBot framework was used to generate robot actions for an embodied AI in gaming sessions. The framework was evaluated based on the Goal Completion metric, which measures the fraction of game sessions successfully completed. The success rate achieved on unseen scenarios in the Alexa Arena environment was 13.6% on tasks such as ‘pick up the milk in the fridge and place it on the table’, a 1.72% improvement over the Alexa Prize Team Baseline. The potential applications of the KnowledgeBot framework in healthcare could involve using a model with domain-specific knowledge, such as MED-PaLM 2 [41], as the object generator, where the LLM interprets patient conditions or staff instructions to generate lists of relevant objects and actions such as specific medications, medical devices, and procedural steps, which are then used by the LLM-based planner (e.g., GPT-4) to plan and execute precise, context-aware assistance tasks.

Summary and Outlook

LLMs used for robot planning purposes are able to create high-level plans by generating a sequence of atomic actions based on textual descriptions of the robot’s environment provided to the model in the input prompt. Task plans have been used mainly for object handling tasks, including object (1) localization and retrieval [80,88], (2) handover to humans [84], and (3) sorting [78,79]. Extending these abilities to healthcare robotics by integrating LLMs and VLMs into planning frameworks can potentially improve the efficacy of healthcare robots by enabling real-time decision-making in complex and unpredictable medical environments. These models can autonomously generate and modify task sequences in response to dynamic conditions such as emergency interventions or sudden changes in a patient’s health status. Furthermore, LLMs can be used with surgical robots to generate human-understandable plans and provide reminders and cues based on the current state of the OR (i.e., understanding non-verbal communication of surgeons or OR nurses) to increase the situational awareness of the operating team. This functionality allows surgeons and physicians to stay informed and involved, ensuring that plans are verified and approved before execution in order to maintain high levels of transparency during procedures. For rehabilitation robots, LLMs can facilitate the customization of therapeutic exercises in real time, again generating human-understandable plans so that the therapist is kept informed, and can adapt treatments to improve the recovery progress of patients. Moreover, the strength of LLMs in generating high-level task plans enables them to serve as frameworks for other deep learning models, such as DRL systems, to execute more detailed low-level commands for the control of robot arms, end effectors, and mobile platforms. This layered approach to task planning ensures that while LLMs handle broad strategic decisions, the finer tactical aspects of robot control are refined through continuous learning models, ensuring the precise execution of complex tasks. This integration not only maximizes the effectiveness of interventions but also ensures that robotic operations are adapted to the immediate needs of the healthcare setting.

6. Ethical Considerations of Robotics Using LLMs in Healthcare

Ethics is a critical aspect of healthcare, encompassing caregiver–patient interactions and the use of technology to enhance patient care [168]. Ethical considerations for the use of healthcare robots embedded with LLMs need to be addressed for the widespread adoption of this emerging technology in order to support patients, healthcare professionals, and advancements in this area. The ethical considerations include accountability, humanizing care, and privacy [169]. A holistic approach is essential, ensuring patient autonomy over their body and medical information with the expectation of improving their health. Maintaining the privacy and protection of personal health information is paramount, emphasizing the importance of informed consent for data usage and increased efforts to stop data commercialization while increasing transparency in how the data are used [170]. Regardless of the robotic care delivery method, maintaining quality and equitable access across all demographics is important.
There have been extensive separate reviews of the ethics of the use of LLMs in healthcare [171,172,173,174,175,176,177,178,179,180] and the use of robots in healthcare [181,182,183,184,185,186,187,188,189,190,191]. Although there exist frameworks such as Ethically Aligned Design from IEEE [192], The Toronto Declaration: Protecting the right to equality and non-discrimination in machine learning systems [193], and the AI Universal Guidelines [194] which provide guidelines on how the ethics of society should be considered during the design phase, they do not provide enough insight into exactly how regulations should be established for generative AI models. Moreover, although the current gaps in regulatory frameworks for AI have been identified and countries such as the United States and China have started the process of establishing regulatory frameworks for the use of AI in healthcare, these frameworks are not applicable to generative AI as the technology is still evolving and it is difficult to develop a framework which covers all the potential impacts of generative AI [195]. Therefore, current guidelines and research do not yet address the problems that arise from combining generative AI (LLMs) and healthcare robots.
This section aims to introduce a discussion on the use of LLM frameworks for healthcare robots for the facilitation of (1) multi-modal communication, (2) semantic reasoning about healthcare environments and patient and care provider needs, and (3) generating and executing safe robot action plans around vulnerable people, including frail and cognitively impaired individuals. We discuss these topics as they pertain to accountability, humanizing care, and privacy, the three ethical concerns that most directly relate to the robot multi-modal communication, semantic reasoning, and robotic task planning themes of this paper.

6.1. Accountability

Accountability in healthcare robotics centers on identifying who is responsible for system errors and adverse events [169]. To date, there has been no consensus on how the accountability of healthcare robots embedded with LLMs should be considered and who should be accountable. LLMs are trained on large datasets containing human language, and therefore, LLM frameworks used in robotics for multi-modal communication are often perceived as proficient in understanding human language and kinesics [151]. However, LLM frameworks are not sensitive to changes in sentence structure, word choice, or grammar when identifying user requests [196], thereby promoting the illusion that they understand human communication. Although LLMs appear to understand human language through (1) generating grammatically correct text and (2) propositional reasoning about sentences such as ‘The doctor treated the child with the fever’ or ‘The nurse examined the patient with the burn’, it is a mirage and does not mean that the LLM actually understands the underlying thought and intent being conveyed through language; rather, it is using probabilities to predict the next plausible word in the sentence [197]. Therefore, the robot does not understand the medical advice it is providing or the modes through which it is providing it (i.e., gestures, body posture used). It can be risky if healthcare professionals become over-reliant on LLM-augmented healthcare robots to deliver, for example, patient education. This arises from the fact that the robot will be able to generate coherent and convincing (but incorrect) advice which may mislead patients, resulting in injury or the worsening of the patients’ health conditions.
As semantic reasoning focuses on obtaining new knowledge and associations from existing knowledge, it can help healthcare robots with numerous clinical reasoning tasks from diagnosis to therapy design. However, it is important that healthcare providers understand that the LLM-augmented healthcare robots cannot understand the data which they have embedded and are merely generating/identifying patterns which were present during training [173]. Therefore, a clinician should not base a patient’s diagnosis on only a pre-screening interaction the patient had with an LLM-augmented healthcare robot. While LLMs like GPT-4 achieve performance in the 75th percentile on the Medical Knowledge Self-Assessment Program [47], they are not qualified to function like physicians. Physicians should avoid over-relying on LLM-augmented healthcare robots for interpreting patient EHRs. The division of accountability among stakeholders remains ambiguous due to the lack of legal precedents in this domain.
For healthcare robot planning, it is important to distinguish between a feasible and an optimal plan [198], especially where robots will be operating in close proximity to vulnerable individuals. An optimal plan for a healthcare robot assisting with patient rehabilitation is one that accounts for the ability of a patient to perform the movements planned by the robot. However, even if an LLM-augmented healthcare robot is intelligent, it has no conceptual understanding of the human condition, as AI is not sentient [199]; therefore, the therapist should oversee the plan generated by the healthcare robot and ensure that it is safe for a patient to perform with the robot. Secondly, healthcare robots that generate plans for real-time execution, such as surgical robots, should always make the surgeon aware of the generated plan before execution [200]. For example, there should be feedback from the surgical robot, either in the form of a graphical interface or audible cues, which informs the surgical team of the robot’s next planned action. This will ensure that the surgical team can efficiently maintain their situational awareness, remain in control, and be accountable to prevent near-miss events and errors during operation [200].

6.2. Humanizing Care

Humanizing care in the context of healthcare robotics using LLM frameworks involves ensuring that intelligent robots improve the compassion elements of care, which are integrity, excellence, compassion, altruism, respect, empathy, and service [201,202].
Multi-modal communication facilitated by an LLM framework uses prompt engineering to align the embedding space of an LLM to the context of a conversation [203], a process which is probabilistic and does not always produce repeatable results [204]. Therefore, it cannot be stated with certainty that an LLM will always respond appropriately to patients due to the probabilistic nature of prompt engineering. Moreover, prompts engineered to facilitate multi-modal communication are analogous to emotion prompting [99], which can result in (1) the overexaggeration of non-verbal cues and (2) the generation of oversimplified language as a result of the emphasis on emotional responses, which can omit pertinent information. Both of these can lead to patient infantilization, especially in repeated interactions between vulnerable demographics and LLM-augmented healthcare robots [205]. The infantilization of patients and the lack of control over a healthcare robot’s responses to patients violate the compassion, respect, and empathy aspects of providing compassionate care. Additionally, the use of LLM-augmented healthcare robots has the potential to worsen the disparity in equal access to healthcare while also increasing the gap in the standard of care between demographics. For example, GPT-4 supports 26 different languages, and when tested on LLM benchmarks such as Massive Multitask Language Understanding (MMLU) [206], its three-shot accuracy ranges from 85.5% (English) to 62% (Telugu) [47]. Therefore, an LLM-augmented healthcare robot developed using GPT-4 as the backbone can differ in (1) the quality of care and (2) access to care in cases of unsupported languages and dialects, ultimately creating inequalities in healthcare environments.
The data used to train foundational LLMs (i.e., GPT-4, GPT-3, PaLM 2) consist of data available from the internet [207]. However, not all demographics have equal opportunities to contribute to this data due to a variety of reasons (access, awareness) [208]. Therefore, healthcare-specialized models such as MED-PaLM 2 [42] derived from fine-tuning foundational models (PaLM 2) on medical information will also be unrepresentative of health conditions prevalent in minority demographics [209]. As a result of these skewed training data, an LLM’s embedding space will not have the knowledge needed to reason based on patient information such as (1) cultural background and (2) patient health conditions in order to provide culturally appropriate medical advice such as suggesting dietary choices. This needs to be addressed so that healthcare robots do not potentially exacerbate equality gaps between minority and majority demographic groups.

6.3. Privacy

The combination of LLM frameworks and healthcare robots amplifies data privacy risks. Traditionally, LLMs require a user to create a prompt and initiate an interaction [210]. However, healthcare robots augmented with LLM frameworks are mobile and continuously use environmental stimuli as input into their models for the purpose of multi-modal communication, semantic reasoning, and robot action planning. These inputs include video of patients and audio of conversations in the environment. This mobility allows the robots to pick up information from various locations (i.e., waiting and patient rooms, triage stations, etc.), thereby increasing the likelihood of processing confidential information and contributing to potentially exposing this information. Furthermore, there are a myriad of scenarios in which healthcare robots must not disclose confidential patient information during direct communication with patients. For example, if a malicious user queries a healthcare robot about a recent patient, the robot may inadvertently disclose sensitive patient details due to information retained in the context window of the LLM [211]. Moreover, LLM-augmented healthcare robots, lacking sentience, also have a limited understanding of privacy nuances, which can lead to the inadvertent disclosure of sensitive medical information about a patient they are interacting with [212].
If healthcare robots use a closed-source LLM (i.e., GPT-4) for semantic reasoning, they are also providing access to their developers to use the interaction history to improve their models [213,214]. This can lead to privacy issues, as once an LLM is trained on interaction histories, it incorporates these data into the model’s weights, effectively embedding the learned information within the transformer architecture [215]. Consequently, when used for semantic reasoning about patient information, if the model encounters an input that is similar to a previous interaction (e.g., similar patient descriptions), it is likely to reference the part of the embedding space where this previous interaction was stored. This can potentially lead to the reuse or inadvertent disclosure of specific details from those prior interactions in its output [216]. This method of extracting model training data is referred to as a model inversion attack [217]. In a model inversion attack, an adversary uses prompt engineering to extract sensitive training data details embedded in the model’s weights and revealed through its outputs [217,218]. Furthermore, a model inversion attack can also be used to recover previous prompts provided to the model [219], which in the case of healthcare may include the personally identifiable information of a patient or visitor who has previously interacted with the healthcare robot. By analyzing these responses, the attacker can infer and reconstruct aspects of the original data, especially when the inputs mimic the training. A model inversion attack targeted towards an LLM-augmented healthcare robot can result in a breach of the confidentiality of patients and should be a major privacy concern that needs to be addressed.

Summary and Ethical Outlook

This section has outlined key ethical issues concerning accountability, humanizing care, and privacy related to the potential incorporation of LLM-embedded healthcare robots in healthcare settings. These are important concerns for developers, researchers, and healthcare professionals to consider. It is also important to conduct long-term studies with such technologies to explore their impact directly on the delivery of healthcare. Such studies would need to specifically consider the workload and burden of care staff, patient outcomes, and management of tasks. In particular, these studies will help to better understand how healthcare robots using LLMs will add value to clinical practices and patient interactions over time while autonomously augmenting patient care. Furthermore, they will be crucial in identifying training and deployment strategies to ensure the ethical, effective, and responsible use of this emerging technology.

7. Open Challenges and Future Research Directions in Healthcare Robots Using LLMs

The potential use of LLM frameworks in healthcare robotics can enhance robot intelligence by generating natural language representations of semantic knowledge, promoting autonomy through task planning, and improving HRI capabilities through multi-modal communication. In this section, we discuss the open technical research challenges and potential research directions of this emerging field.

7.1. Open Research Challenges

There are three main research challenges that need to be addressed before healthcare robots augmented with LLM frameworks can be widely adopted in real-world care environments: (1) the slow response speed of LLMs in real-time healthcare robotics interactions, (2) open- versus closed-source embedded LLMs, and (3) generalizability for healthcare robotics. These challenges are described in the following:
1. Slow response speed: The first technical challenge is due to the time required for an LLM to generate an appropriate output such as a plan or action. For example, the time required to generate a robot plan using remotely hosted LLMs such as GPT-series models [46,47] has been found to take anywhere from 36.89 to 220.58 s depending on the task complexity [220], while the time required by locally hosted models such as LLaMa [221] can range from 73 to 234 s [222]. In general, the response time increases as the total number of tokens per query increases, thereby limiting the horizon of robot action plans and communication [223]. However, healthcare robots have real-time constraints and need to be able to generate robot action plans and/or communication behavior in real time and adapt to user and environmental changes. Failure to generate a plan in real time can result in obsolete action plans and/or task incompletion. In comparison, classical planning methods such as Hierarchical Task Networks (HTNs) [224] or Answer Set Planning (ASP) [225] solvers such as Clingo4 [226] are capable of real-time plan generation. For example, HTN planning can take approximately 2–17 s for object localization and retrieval tasks [224]. In order to improve the time performance of LLM-augmented healthcare robots, we need to further explore (1) the use of prompt engineering in healthcare robots to update the context embedding using as few tokens as possible to reduce computational overhead [227]; (2) optimizing the context window using ‘attention sinks’ to preserve the Key and Value states of the initial tokens, ensuring that the initial instructions to the LLM are not discarded [228] and, thus, additional tokens are not needed to re-align a healthcare robot; and (3) context caching, which stores Key–Value activations from previously processed tokens and references these cached activations during inference rather than recomputing them for each new token, reducing the response generation time [229]. These methods aim to reduce memory bandwidth, memory usage, and computation in an attempt to increase the response speeds of the models.
2. Open- versus closed-source models: Both open-source and closed-source models can be considered in the selection of LLMs for healthcare robots. Closed-source models, such as GPT-4 [47], typically outperform open-source alternatives across benchmarks such as MMLU [206] and HellaSwag [230], which have a direct correlation with a model’s ability to be used for multi-modal communication, semantic reasoning, and robot action planning [231]. However, these models have drawbacks, notably, a lack of transparency and restricted control over data usage, raising privacy and compliance issues with standards such as HIPAA [232], GDPR [233], and PIPEDA [234]. On the other hand, open-source models such as PaLM 2 [53], LLaMa 2 [54], GPT-2 [55], T5 [48], and BERT [52] offer full transparency by making their architecture and code publicly accessible, allowing hospitals to host these models on their local servers and save the interaction data locally to ensure control over the data. Despite these benefits, open-source models often underperform compared to closed-source models [235]. This discrepancy is a result of limited research capital, ultimately leading to lower-quality training datasets and a lack of computing resources to train bigger and more capable models. Improving open-source LLMs for healthcare robotics involves enriching training data with specific details such as medical procedure execution and patient data processing. Furthermore, adding datasets on robotic path planning in healthcare environments, procedural compliance, staff interaction protocols, and patient safety can significantly enhance a model’s relevance and effectiveness in healthcare settings, leading to more accurate and compliant outcomes. Moreover, collaborative developments between healthcare institutes should be encouraged to share costs and expertise. These steps, while requiring a significant investment of time, money, and expertise, are critical for optimizing open-source LLMs for healthcare applications.
3. Generalizability for healthcare robotics: LLM frameworks need to be adaptable across different types of healthcare robots without being restricted to specific robot models. Hospitals seek long-term investments which often surpass a decade [236], and therefore, LLM frameworks that do not necessitate redevelopment for each new robotic system should be prioritized. To future-proof an LLM framework for healthcare robots, the framework should adopt a modular architecture designed around a general application programming interface, similar to the ROS [109] framework for robotics. This architecture should be designed to facilitate a bridge between the LLM framework and the robot’s perception and control systems, which manage actuator control and sensor data collection (a minimal sketch of such a bridge is provided below). The modular approach should utilize current communication protocols to provide the LLM with insights into a robot’s capabilities and allow the robot to receive high-level commands or natural language scripts to be used in HRI. The integration should also provide care providers with a user-friendly GUI to review and make efficient changes to the prompts used by the LLM in the background for robot behavior control using natural language, thereby not overwhelming healthcare staff with technical complexities.
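The sketch below illustrates such an LLM-robot bridge (challenge 3 above) in Python; the class and method names are illustrative assumptions rather than an existing API, and the lambda stands in for any chat-completion endpoint.

```python
# Hedged sketch of a modular bridge between an LLM planning service and a
# robot's perception/control stack, in the spirit of a ROS-style interface.
# All names are illustrative assumptions, not an existing framework.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RobotCapability:
    name: str          # e.g., "move_to", "grasp"
    description: str   # natural language description exposed to the LLM

class LLMRobotBridge:
    def __init__(self, capabilities: List[RobotCapability],
                 llm_call: Callable[[str], str]):
        self.capabilities = capabilities
        self.llm_call = llm_call           # any chat/completion endpoint

    def plan(self, task: str, scene_description: str) -> str:
        # Expose the robot's capabilities and the current scene to the LLM,
        # and request a high-level plan as a list of capability calls.
        prompt = (
            "Available robot actions:\n"
            + "\n".join(f"- {c.name}: {c.description}" for c in self.capabilities)
            + f"\nScene: {scene_description}\nTask: {task}\nPlan:"
        )
        return self.llm_call(prompt)

# Usage with a stubbed LLM endpoint:
bridge = LLMRobotBridge(
    [RobotCapability("move_to", "navigate to a named location"),
     RobotCapability("grasp", "pick up a named object")],
    llm_call=lambda p: "move_to(supply_room); grasp(gauze_box)",
)
print(bridge.plan("restock gauze in ward 3", "robot at charging dock"))
```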

7.2. Future Research Directions

The aim of the emerging field of healthcare robots with embedded LLMs is to develop intelligent healthcare robots capable of (1) adapting to and functioning in varying environments from emergency and urgent care departments to surgery and acute care to rehabilitation centers and long-term care facilities; (2) manipulating, fetching, and delivering a wide range of objects including medical instruments and tools, medications, lab specimens, soft goods such as gauze and bandages, food and nutritional supplies, and personal care items like blankets and pillows; and (3) interacting with diverse people from surgeons, doctors, and nurses to patients with various conditions and family members. With respect to the latter, incorporating the perspectives of healthcare professionals and patient users in the deployment of healthcare robotics using LLMs is crucial to bridging research with real-world healthcare applications. In particular, co-design and user-centered design approaches can directly include the insights and experiences of these individuals in the technology development process while maintaining transparency in decision-making in order to closely align such technology with user needs and preferences. Clear and understandable explanations of robotic actions and behaviors to users will further build trust, enhancing the reliability and safety essential for widespread adoption [237].
Regulatory compliance frameworks should be considered to ensure that regulations, guidelines, and/or legislation are met when incorporating LLM-embedded healthcare robots in healthcare settings to ensure safety and security in their use. In particular, the need for such frameworks exists; however, the frameworks themselves have not yet been designed, and there are no universal standards [238,239,240]. Establishing stringent guidelines on data privacy, transparency, the explainability of decisions made by healthcare robots, and the protocols for the human oversight of robot actions will not only promote legal and ethical use but also improve the integration of these technologies into the healthcare sector. This improvement will be a result of clear and enforceable guidelines which help healthcare organizations navigate legal and ethical complexities, thereby fostering trust among stakeholders and patients [241] and improving patient care while minimizing the dehumanization of patients.
To date, existing robots have yet to generalize to such a wide range of tasks, environments, and HRI scenarios. However, we believe that healthcare robots augmented with LLMs can provide (1) effective HRI interfaces through intuitive natural language communication to enable smoother interactions across various healthcare settings and (2) versatility in generating high-level robot action plans and semantically reasoning about a myriad of possible scenarios which could take place in ORs, patient wards, and outpatient clinics and also to handle a wide range of tasks from surgical assistance to patient care and administrative duties. Future technical research directions can include the incorporation of VLMs with LLMs in healthcare robots to increase perception capabilities, the use of multilingual LLMs to allow for application with diverse users, the incorporation of automated prompting to handle varying healthcare scenarios, and the development of custom LLMs for healthcare robots. We selected these future research directions as they are unexplored research avenues for LLM architectures, in particular in terms of their applicability to healthcare applications. These future research directions are described in the following:
1. Incorporation of Vision–Language Models (VLMs): VLMs are language models which co-embed image and text data. They are typically trained on extensive datasets, such as MS-COCO [242] and Visual Genome [243], composed of images ranging from natural scenery to common objects alongside their corresponding textual descriptions. VLMs excel at recognizing and narratively describing visual content as a result of the image–text joint embedding space [244]. However, in healthcare, the conventional reliance on RGB images for VLM inputs is an ethical concern due to the potential breach of patient and visitor confidentiality [245]. To address this, healthcare applications may preferentially use point cloud data, which capture three-dimensional spatial information by representing scenes or objects as a collection of vertices in a coordinate system [246]. Adapting VLMs to work with point clouds involves retraining the models using datasets which include labeled 3D spatial data and generating a co-embedding space between point clouds and texts [247]. This adaptation not only helps mitigate privacy concerns but also expands the utility of VLMs in healthcare, offering a new dimension of data. For example, the transition from RGB to point clouds can improve multi-modal communication and patient intent recognition through the analysis of 3D point clouds of human body poses [248], which can in turn provide the LLM with more information to be used during semantic reasoning, allowing it to generate assistance plans that are cognizant of the patient’s psychological state.
2. Leveraging multilingual LLMs for diverse populations: Closed-source LLMs such as GPT-4 [47] contain multilingual capabilities. For example, GPT-4 supports 27 languages ranging from English to Urdu [47]. However, open-source models such as LLaMa 2 [54] are only trained in English [249], and expanding the list of supported languages is a community effort that researchers need to undertake [250]. Collecting/refining a multilingual training dataset and training a 70 B parameter model is resource-intensive and difficult for researchers to carry out. Furthermore, there is a variety of demographics present in healthcare settings. LLM-augmented healthcare robots should be capable of interacting with all demographics with equal proficiency to maintain the compassionate elements of care. Therefore, it is worthwhile to investigate methods such as transfer learning [251,252] which can increase the efficiency of training LLMs to support new languages. For example, in [251], it was revealed that approximately 1% of the total model parameters of LLaMa 2 correspond to linguistic competence, which represents an LLM’s knowledge of grammatical rules and patterns [253]. Therefore, by holding the weights of the model constant in specific regions of the embedding space during further training (where a region encodes the linguistic knowledge of a particular language), models not only retain previously acquired languages more effectively but also demonstrate a heightened capacity for rapid adaptation to new linguistic environments (a minimal sketch of this selective freezing is provided after this list).
3. Incorporating automated prompting to handle various healthcare scenarios: Healthcare settings provide challenging but realistic real-world environments. For example, for healthcare robots to provide multi-modal communication, prompt engineering can be used to adapt robot assistive responses based on the age, health conditions, and/or cognitive or physical disabilities of patients. The advantage of prompt engineering in healthcare robotics for multi-modal communication is that it facilitates customized, context-sensitive interactions tailored to individual patient profiles. Namely, by using prompts to align the LLM to better represent the user, the model is directed into the relevant embedding space region based on the current context. Thereby, the model dynamically modifies robotic responses based on specific patient data such as age, health conditions, and cognitive or physical abilities. Automated prompting in healthcare settings can improve the efficiency of prompt engineering through techniques such as prompt compression, which aims to extract only the essential information from prompts by using knowledge distillation (teaching simpler models to mimic more complex ones with shorter inputs) and encoding (reducing prompts into compact vector representations) [254], thereby condensing complex medical data into actionable prompts. Prompt optimization further refines these interactions using gradient-based optimization to select the most effective format [254]. This dynamic modification of robotic responses ensures multi-modal communication that is not only responsive and context-sensitive but also adapts to diverse clinical scenarios to enhance the overall quality of care in real-world healthcare environments.
4. Customized LLMs for healthcare robotics: In healthcare applications, technologies must not only perform optimally but also adhere to strict ethical standards such as those outlined by HIPAA [231]. The adoption of LLMs in healthcare settings may require the design of custom LLM models that provide transparency into how and what data are used, ensure control over the data, and maintain high performance in generating cohesive and appropriate outputs for robot multi-modal communication, semantic reasoning, and task planning. For example, OpenAI offers custom GPTs [255], which allow users to fine-tune GPT models on their own data without having to share their interactions with the developers of the GPT (i.e., OpenAI). However, using these custom GPTs does not give healthcare organizations full control over the model behavior, as the model is designed and its behavior is fine-tuned by the developers using reinforcement learning from human feedback [238], and the data are stored on remote servers, limiting hospitals’ control over data handling protocols. Furthermore, current methods used for removing personally identifiable information from LLM models, such as manual data scrubbing [256] and retraining models with sanitized datasets [257], can be costly and time-consuming [256,258]. In particular, manual data scrubbing requires extensive human oversight to identify and remove personally identifiable information [259], which is labor-intensive and prone to errors. Namely, datasets used for training LLMs, such as Common Crawl, contain petabytes of data [260], so complete human oversight over the data cleaning process is not possible; hence, heuristic methods are used to find and replace words in tandem with human oversight [48]. Retraining models with sanitized data requires significant computational resources and also involves lengthy validation cycles in which humans interact with the model to ensure satisfactory responses and preserve the performance of a model such as GPT-4 [238]; this process is often referred to as reinforcement learning from human feedback (RLHF) [261]. Therefore, there is a need to develop new, efficient differential privacy strategies [262] which can minimize the risks of data contamination and the leakage of personally identifiable information (PII) during the initial training of an LLM. Namely, in [263], it was noted that using differential privacy guidelines such as adding noise to the training data results in a 10% increase in accuracy in the PII inference of the GPT-2 model. However, adding excessive noise and data scrubbing using Named Entity Recognition [264] to further limit the inference of PII significantly degrades the utility of the models. Therefore, it is worthwhile to investigate new methods to scrub PII and prevent its leakage. For example, in [265], a LLaMa 7 B model was fine-tuned on user–LLM interactions with differential privacy using DP-Adam [266], and then the fine-tuned model was used to generate a synthetic dataset. The synthetic dataset was then resampled using a DP histogram to align the distribution of the synthetic dataset with the real dataset, and the resampled dataset was used to train a subsequent model. This method showed promise, producing an 8.6% relative improvement in performance compared to using the initial dataset.
The aforementioned methods attempt to minimize the risks of leaked PII; however, their accuracy may not meet regulations such as the EU’s Right to Be Forgotten, which mandates the removal of users’ personal data [267]. Therefore, research is still needed to determine how such approaches can be used for cleaning datasets used to train LLMs which are structurally similar to web crawls (i.e., massive textual corpuses).
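Relating to future direction 2 above, the following minimal PyTorch sketch illustrates selectively freezing a region of a model’s parameters while the rest remain trainable for a new language; the selection rule (by parameter name) and the toy model are assumptions for illustration only, not the procedure used in [251].

```python
# Hedged sketch of selective parameter freezing for multilingual transfer
# learning: parameters assumed to encode "linguistic competence" are held
# constant while the remaining parameters are fine-tuned on a new language.
# The toy model and the name-based selection rule are illustrative only.
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2
)

frozen = 0
for name, param in model.named_parameters():
    if "layers.0" in name:              # stand-in for the "linguistic" region
        param.requires_grad = False     # exclude from gradient updates
        frozen += param.numel()

print(f"frozen {frozen} parameters; the rest remain trainable for the new language")
```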

8. Design of Potential Healthcare Applications of LLM-Embedded Healthcare Robots

In this section, we further explore in detail specific potential healthcare application designs for LLM-based healthcare robots in terms of multi-modal communication during assistive HRI and semantic reasoning and robot task planning for robotic surgery.

8.1. Design 1: Multi-Modal Communication for a Socially Assistive Robot

Scenario: A socially assistive robot can be used for Reminiscence/Rehabilitation Interactive Therapy [268] with individuals living with dementia. These reminiscence activities can include the recall of past events, including listening and singing to favorite songs, watching and discussing favorite TV shows or movies, and discussing significant historical events. The robot should be capable of interpreting the verbal and non-verbal responses of the users and adapting its interactions to their emotional states to promote engagement and emotional well-being.
To facilitate therapy, a socially assistive robot would require either a generated general knowledge base containing details about historical events, movies, TV shows, and music or the ability to search the web for this information. However, the robot would require dedicated search algorithms to efficiently find relevant information for the assistive HRI scenarios. With the use of datasets, the robot is restricted to only the limited information available in a dataset, while web searches focus on keywords within a webpage and then convert all text within that webpage into speech. LLMs can be used to address these challenges. They do not require a dataset for online knowledge retrieval (only for training purposes), allowing them to obtain new information on the fly from the web or from additional datasets known as vector databases as needed. The latter is known as retrieval-augmented generation (RAG) [269]. Furthermore, traditional web searches do not analyze or summarize the content on webpages, whereas LLMs generate responses containing new content which are based on patterns and associations they have learned and that are available to them on new websites, increasing flexibility and adaptability to new topics and activities [39]. In particular, we consider the utilization of GPT-4 [47] for Personalized Reminiscence Therapy. Below, we outline the design of a potential framework for a socially assistive robot that could leverage GPT-4 to facilitate Reminiscence Therapy, Figure 1.
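A hedged sketch of the RAG idea mentioned above is given below, using a toy in-memory ‘vector database’ of reminiscence material; the embedding function is a simple stand-in for a real embedding model, and the document entries are invented for illustration.

```python
# Minimal sketch of retrieval-augmented generation (RAG): reminiscence material
# is embedded, the entry most similar to the user's request is retrieved, and
# both are combined into an LLM prompt. embed() is a toy stand-in embedding.
import numpy as np

documents = [
    "1969 moon landing: Apollo 11 news coverage and public reaction",
    "Classic 1950s jazz standards and big-band recordings",
    "Popular 1970s sitcoms and their theme songs",
]

def embed(text):
    # Placeholder embedding: hash words into a fixed-size normalized vector.
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query):
    sims = doc_vectors @ embed(query)            # cosine-style similarity
    return documents[int(np.argmax(sims))]

user_request = "Can we talk about the moon landing?"
context = retrieve(user_request)
prompt = f"Using this material: '{context}', start a reminiscence conversation."
print(prompt)
```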
Dynamic content retrieval: GPT-4 can improve the efficiency of a socially assistive robot in providing reminiscence activities by conducting web searches and generating abstractive summaries, rather than providing verbatim information found on webpages, as traditional web searches do. More specifically, GPT-4 can retrieve information such as historical events, queue movies and TV shows, and play music through API calls to video sharing platforms such as YouTube and SoundCloud. The retrieved information can be transferred into robot speech, and the media can be played through a robot’s display screen and speakers to provide Reminiscence Therapy. To facilitate dynamic content retrieval that is personalized to each user, a therapist can provide an initial spoken language prompt which provides information about the user’s interests and past hobbies to GPT-4. This information is stored within GPT-4’s context window, and future user requests for media will retrieve relevant content based on the user’s interest without the need to provide the specific names of songs, movies, or TV shows.
Emotion recognition: GPT-4v [270], which is the multi-modal variant of GPT-4, can detect the emotions of the user from RGB images taken by the socially assistive robot’s onboard camera. GPT-4v can detect the visual features of a person and use these detected features to influence future textual responses. For example, when prompted with images of a person showing clear signs of distress, GPT-4v engages in empathetic conversations. The capability of GPT-4v to identify the emotional states of a user enables the socially assistive robot to adapt its verbal and non-verbal response during assistive HRI to the user. For example, the robot can change discussion topics, simplify a conversation, play uplifting music, or continue to converse about a topic which is observed to be effective. This functionality enables the robot to interact using multiple modes of communication and to also understand multiple modes of communication, thereby facilitating bidirectional multi-modal HRI without the need for extensive pre-programming and complex designs, which is a current barrier to the adoption of socially assistive robots for use with older adults diagnosed with dementia [271].
Multi-modal interactions: GPT-4 can generate speech and incorporate predefined atomic actions for the socially assistive robot using few-shot learning. These atomic actions can include gestures, body language, and facial expressions. Namely, a textual response is generated based on (1) a natural language prompt from a therapist, which includes contextual information about the user such as their interests and the cognitive impairment level of the older adult, and (2) a transcription of the user request provided to GPT-4 by the speech-to-text service used by the robot. The response is then conveyed verbally to the user using text-to-speech software such as the Google Cloud text-to-speech service [161]. To facilitate non-verbal communication, a separate prompt for GPT-4 is created containing (1) the non-verbal atomic actions, (2) instructions on how to embed the atomic actions into the generated textual response resulting from the first prompt, and (3) the generated textual response itself. The resulting generated text will include the textual response with non-verbal elements (atomic actions) embedded within it. Specifically, the attention head mechanisms of GPT-4 apply a bias to each token (word) of the second input prompt, thereby prioritizing atomic actions that are most relevant to the current interaction state stored in the context window of the model. The interaction state represents all interactions that have occurred, including all inputs to GPT-4 and the generated outputs of GPT-4 (i.e., atomic actions, user requests, responses). Consequently, the actions which are embedded within the textual response of GPT-4 are the non-verbal elements most appropriate for the current state of the interaction, ensuring a more seamless and contextually appropriate multi-modal communication experience.
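The two-stage prompting described above can be summarized with the following hedged sketch; the prompt wording, the atomic action names, and the llm() stand-in are simplified assumptions rather than the exact prompts of the proposed design.

```python
# Hedged sketch of two-stage prompting: the first prompt produces the verbal
# reply, and a second prompt asks the model to embed predefined atomic actions
# into that reply. All names and prompt text are illustrative assumptions.
ATOMIC_ACTIONS = ["<smile>", "<nod>", "<open_arms>", "<lean_forward>"]

def llm(prompt):
    # Placeholder: a real system would call a GPT-4 chat-completion endpoint here.
    return f"(LLM output for a prompt of {len(prompt)} characters)"

def multimodal_response(user_request, patient_context):
    # Stage 1: verbal response conditioned on the therapist-provided context.
    reply = llm(f"Context: {patient_context}\nUser: {user_request}\nRespond warmly.")
    # Stage 2: embed the most relevant atomic actions into the reply.
    second_prompt = (
        f"Available atomic actions: {ATOMIC_ACTIONS}\n"
        "Embed the most contextually appropriate actions into this reply, "
        f"keeping the wording unchanged:\n{reply}"
    )
    return llm(second_prompt)

print(multimodal_response("Can we listen to some jazz?", "mild dementia, enjoys 1950s jazz"))
```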
Framework integration: Three instances of GPT-4 are created, where each instance is a distinct configuration of the model with its own role, its own prompts (outlined in Table 3), and its own context. The first instance of GPT-4 is used to determine the user’s intent, the second is used for the dynamic retrieval of the required media from the web, and the third is used to embed atomic actions (non-verbal elements) into the textual responses generated by GPT-4v. Additionally, one instance of GPT-4v is utilized for emotion recognition. All instances are set up using the OpenAI API [272], which provides access to the GPT-4 and GPT-4v models through HTTPS requests.
The first instance of GPT-4 is used in the Decision module in the framework (Figure 1) to determine whether the user is seeking entertainment, such as TV shows, movies, or music, or whether they are looking to discuss historical events or engage in general reminiscence conversations. The Decision module uses Prompt 1, prompting this instance of GPT-4 with ‘Determine the intent of the user request, does the user seek entertainment? If yes return “entertain(user_request)” or if the user is seeking historical events and conversation return “converse(user_request)”’, and concatenates the transcription of the user request (provided by the speech-to-text service) to the prompt. GPT-4 generates a textual response containing either the entertain(user_request) function or the converse(user_request) function. The function specified in the response of GPT-4 is then executed by the corresponding Entertain or Converse module. The second instance of GPT-4 is embedded in the Entertain module for the purpose of dynamically retrieving content based on the user’s interests, which are defined by the therapist’s natural language prompt to GPT-4. Additionally, Prompt 2 is used to provide the model with examples of how to search for and obtain media from, for example, the YouTube API [273] based on the user’s request. The Entertain module then generates a textual response consisting of API calls that retrieve and play back media on the robot’s onboard screen, controlled through its low-level controller.
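The routing step of the Decision module could, for example, be sketched as follows, where entertain() and converse() are hypothetical callables standing in for the Entertain and Converse modules, and the dispatch logic (string matching on GPT-4’s reply) is an assumption rather than a prescribed implementation.

```python
# Illustrative sketch of the Decision module: Prompt 1 plus the speech-to-text
# transcription are sent to GPT-4, and the returned function string selects the
# Entertain or Converse module. entertain() and converse() are hypothetical.
import re
from openai import OpenAI

client = OpenAI()

PROMPT_1 = ('Determine the intent of the user request, does the user seek '
            'entertainment? If yes return "entertain(user_request)" or if the user is '
            'seeking historical events and conversation return "converse(user_request)"')


def route(user_request: str, entertain, converse):
    """Dispatch the request to whichever module GPT-4 names in its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"{PROMPT_1}\nUser request: {user_request}"}],
    )
    decision = response.choices[0].message.content
    if re.search(r"\bentertain\(", decision):
        return entertain(user_request)
    return converse(user_request)
```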
The Converse module is used to facilitate multi-modal HRI and uses GPT-4v for emotion recognition and response generation to user requests. The Converse module will concatenate the therapist’s verbal prompt to Prompt 3 to be used with GPT-4v. Prompt 3 outlines GPT-4v’s role and includes RGB images of the user obtained from the Sensor module, for example, ‘you are embedded within a socially assistive robot to provide Reminiscence/Rehabilitation Interactive Therapy to older adults with dementia. You will also be provided with images of the scene, if you detect the user to have negative emotions, start reminiscence therapy based on this information [patient information]’. The user request will also be concatenated to this prompt to enable GPT-4v to generate a contextually accurate textual response. The textual response from GPT-4v is then concatenated to Prompt 4 to be used with the third instance of GPT-4. The third instance of GPT-4 is also within the Converse module and is used to embed atomic actions (non-verbal elements) into the textual response generated by GPT-4v. Namely, Prompt 4 defines the atomic actions (gestures, facial expressions, body poses) and shows examples of how they will be integrated into the response of GPT-4v (verbal response to the user). This prompt will result in GPT-4 generating the output of the Converse module containing the robot’s verbal response to the user, embedded with atomic actions to provide robot emotional interactions and to allow the robot to facilitate multi-modal HRI.
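Once the Converse module returns a script with embedded atomic actions, the robot’s low-level controller must separate speech from action markers. A minimal sketch of such a parser is shown below; it assumes the ^start(...)/^stop(...) tag format from the example in Table 3 (Prompt 4), and the parsing strategy itself is an assumption.

```python
# Illustrative sketch of splitting the Converse module's annotated script into
# speech text and atomic-action markers for the robot's low-level controller.
import re

TAG = re.compile(r"\^(start|stop)\(([^)]+)\)")


def parse_annotated_script(script: str):
    """Return (plain speech text, ordered list of (command, animation) tags)."""
    actions = [(m.group(1), m.group(2)) for m in TAG.finditer(script)]
    speech = TAG.sub("", script).strip()
    return speech, actions


speech, actions = parse_annotated_script(
    "^start(animations/Stand/Gestures/Explain) Welcome to a fascinating journey "
    "into the realm of robotic learning!")
# speech  -> "Welcome to a fascinating journey into the realm of robotic learning!"
# actions -> [("start", "animations/Stand/Gestures/Explain")]
```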
Table 3. Prompts for design 1.
Prompt 1: Determine the intent of the user request, does the user seek entertainment? If yes return “entertain(user_request)” or if the user is seeking historical events and conversation return “converse(user_request)”
Prompt 2: You are a part of the Entertain module within a socially assistive robot. Your role is to access and provide entertainment based on the preferences and requests of the user. Given the textual transcription of the user’s spoken request, use the following sequence of function calls to guide your response.
Example 1:
User Request: ‘I want to watch a documentary about space’.
API Call: searchYouTube(‘documentary about space’)
Function Calls:
1. video_id = fetchVideoID(‘documentary about space’)
2. video_path = saveVideo(video_id)
3. playMedia(video_path)
Example 2:
User Request: ‘Play some classical music’.
API Call: searchYouTube(‘classical music playlist’)
Function Calls:
1. video_id = fetchVideoID(‘classical music playlist’)
2. video_path = saveVideo(video_id)
3. playMedia(video_path)
Based on the user’s current request, follow these steps to retrieve the video ID, save it, and then play the media. Use the appropriate API calls to search YouTube and handle the responses effectively.
Prompt 3: you are embedded within a socially assistive robot to provide Reminiscence/Rehabilitation Interactive Therapy to older adults with dementia. You will also be provided with images of the scene, if you detect the user to have negative emotions, start reminiscence therapy based on this information [patient information]
Prompt 4: Our goal is to integrate non-verbal communication into the text-based script that the socially assistive robot will use to respond to older adults with dementia. The robot’s script should include atomic actions to perform specific gestures, body movements and facial expressions, improving its interactions and providing a more comforting presence. These are the atomic actions: “
Yes: nods head downwards; Explain: moves both hands in front of robot and then apart from each other; Confident: robot tilts hip backwards and stands with a wide stance;”
you need to take this <script> presentation and match/re-write it to include the appropriate gestures and body movements embedded within the text. Here is an example:
r“^start(animations/Stand/Gestures/Explain) Welcome to a fascinating journey into the realm of robotic learning!”
r“Just like humans, robots can learn and evolve.^stop(animations/Stand/Gestures/Confident)”

8.2. Design 2: Semantic Reasoning and Planning

Scenario: A surgical robot needs to recognize, grasp, and hand over specific surgical tools during an operation. Namely, the robot needs to first semantically reason about the current state of the surgical operation to identify the tool required by the surgeon, before generating a plan to localize, retrieve, and hand over the tool to the surgeon. Below, we outline the design of a potential framework for a surgical robot that could leverage LLMs for both semantic reasoning and task planning (Figure 2).
The implementation of semantic reasoning and planning for surgical robots is challenging, as semantic reasoning alone requires (1) a knowledge source specific to surgical procedures, which contains class-level knowledge such as ‘bone saws are typically used for cutting bones’ as well as instance-level knowledge such as the intended use of a sternal saw for cutting through the sternum, and (2) a computational framework such as an LLM which can process and compare the tools currently available against world representations acquired through verbal feedback from the surgeon or open-vocabulary perception models such as OpenScene [134] or Detic [163]. However, the implementation and design of each of the aforementioned requirements specifically for surgical operations is labor-intensive and non-trivial and requires cross-domain expertise. LLMs can address these challenges, as they provide an all-in-one solution. In particular, the embedding space of an LLM acts as a knowledge source that can be trained specifically for medical use, as in Med-PaLM 2 [42] presented in Section 2. Furthermore, the implementation of an LLM in a surgical robot for semantic reasoning is beneficial, as an LLM also includes a computational framework, provided by the transformer architecture, thereby streamlining the implementation. Lastly, the world representations can be provided through prompts which depict or describe the surgical operation in either images or text. Therefore, an LLM has the potential to be incorporated into surgical robots to intelligently assist surgeons during surgical operations.
Moreover, we consider the use of an LLM for surgical robot task planning, as it provides the following advantages: (1) generated surgical plans are human-readable, (2) surgical staff can input changes to the plan via verbal commands to the surgical robot without much added complexity, (3) the LLM for surgical robot task planning interacts directly with the semantic reasoning modules via natural language prompts, further reducing complexity, and (4) unlike deep learning models such as PPCNet [145] or heuristic methods such as genetic algorithms [142], which often fail to generalize and therefore require retraining or redesign for new environments, LLMs such as PaLM-E [274] are not environment-specific, allowing for better adaptation to unencountered settings.
We consider the use of Med-PaLM 2 and PaLM-E, which are multi-modal LLMs, to provide a surgical robot with the following capabilities:
Automated documentation and reporting: Med-PaLM 2 can automatically document each step of the surgical process, generating detailed reports that include the current state of the surgical procedure, tool usage, and any encountered medical complications. This is achieved by prompting Med-PaLM 2 with (1) a role, (2) RGB images of the surgical operation provided by cameras in the OR, (3) background information about the surgical operation taking place, and (4) the transcription of verbal communication taking place within the OR. Med-PaLM 2 then generates a procedural report of the surgical operation documenting the actions of the surgical team in the order that they occurred. This functionality aids in post-operative review and quality control, providing a breakdown of the surgery for record keeping and ensuring that all surgical actions are traceable and transparent. This can facilitate easier follow-up and assessment by medical professionals, without overburdening surgical staff with documentation writing tasks.
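As a sketch of how the four-part documentation prompt described above could be assembled, consider the following; query_med_palm2() is a hypothetical helper standing in for Google Cloud API access to Med-PaLM 2, as no specific client library is prescribed here.

```python
# Illustrative sketch of assembling the four-part documentation prompt;
# query_med_palm2() is a hypothetical helper standing in for the Google Cloud
# API access to Med-PaLM 2.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DocumentationPrompt:
    role: str               # (1) the role assigned to Med-PaLM 2
    or_images: List[bytes]  # (2) RGB frames from the OR cameras
    background: str         # (3) background information on the operation
    transcript: str         # (4) transcription of verbal communication in the OR


def build_report(prompt: DocumentationPrompt, query_med_palm2: Callable) -> str:
    """Send the combined prompt and return the generated procedural report."""
    text = (f"{prompt.role}\n"
            f"Operation background: {prompt.background}\n"
            f"OR transcript: {prompt.transcript}\n"
            "Document each surgical action in the order it occurred.")
    return query_med_palm2(text=text, images=prompt.or_images)
```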
Human-understandable plan generation: PaLM-E is specifically designed to be used for long horizon planning in embodied robots and supports multi-modal prompts which can include text, images, and state estimations [274]. Moreover, PaLM-E can generate plans in human-understandable formats such as behavior trees [275] and can take into account feedback from surgeons when generating plans. This functionality is crucial for surgical robots for two reasons: (1) it provides the robot with a recovery method in cases where the robot fails to localize the surgical tool, and (2) the recovery method utilizes verbal feedback from the OR staff, providing an intuitive way for humans to guide the robot. This direct communication minimizes delays by enabling the robot to immediately navigate to the correct location for tool retrieval. It avoids the inefficiencies of trial-and-error searches or reliance on external databases to search for the tool, which not only slow down the process but also do not update the database in real time if tools have been misplaced, potentially leading to a failed robot state.
Tool selection and handover: A surgical robot can leverage Med-PaLM 2 to interpret the current state of a surgical procedure in real time based on verbal commands from surgeons and RGB images provided by cameras in the OR. Moreover, Med-PaLM 2, augmented with a RAG module that provides surgery-specific information, can infer which surgical tool is needed based on a prompt containing the current step of the surgery from (1) the most recent action described in the surgical report, (2) RGB images of the operation, and (3) the transcription of verbal communication between surgical staff in the OR. Once the surgical tool is identified by Med-PaLM 2, the result can be used as an input to PaLM-E to dynamically generate a robot plan in the form of a behavior tree to localize, grasp, and hand over the surgical tool to a surgeon. Furthermore, by automating both the prompting of Med-PaLM 2 and of PaLM-E, the surgical robot can infer which tool a surgeon may need before the surgeon explicitly requests it. The ability of a surgical robot to infer and retrieve surgical tools in a timely manner can potentially reduce the cognitive load of the surgical team and provide a short-term solution to problems such as shortages of OR staff [276].
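This reasoning-to-planning handoff could be sketched as follows; retrieve_context(), query_med_palm2(), and query_palm_e() are hypothetical helpers (the RAG store, the Med-PaLM 2 endpoint, and the PaLM-E endpoint, respectively), and the prompt wording is illustrative.

```python
# Illustrative sketch of the reasoning-to-planning handoff: a RAG-augmented
# Med-PaLM 2 query infers the needed tool, and its answer seeds the PaLM-E
# planning prompt. All three helpers are hypothetical stand-ins.
from typing import Callable, List


def select_and_plan(report_step: str, or_images: List[bytes], transcript: str,
                    retrieve_context: Callable, query_med_palm2: Callable,
                    query_palm_e: Callable) -> str:
    # RAG step: pull surgery-specific passages relevant to the current step.
    context = retrieve_context(report_step)
    tool = query_med_palm2(
        text=(f"Surgery-specific context: {context}\n"
              f"Most recent action: {report_step}\n"
              f"OR transcript: {transcript}\n"
              "Which surgical tool is needed next? Answer with the tool name only."),
        images=or_images)
    # Planning step: PaLM-E returns a behavior tree (XML) to localize, grasp,
    # and hand over the inferred tool.
    return query_palm_e(goal=f"Localize, grasp, and hand over the {tool}.",
                        images=or_images)
```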
Framework integration: Three instances of Med-PaLM 2 are set up through the Google Cloud API platform [277], each with a distinct role, distinct prompts as outlined in Table 4, and a distinct context. The first instance of Med-PaLM 2 is used to generate a detailed surgical procedure, providing a step-by-step breakdown of the surgery, including the necessary surgical tools. The second instance is used to extract key information from the detailed surgical procedure, for example, identifying the surgical tool required for each step of surgery. The third instance is used to generate a textual response which documents the entirety of the surgical operation and provides feedback to PaLM-E. Furthermore, one instance of PaLM-E is utilized to generate surgical robot plans to localize, grasp, and hand over surgical tools. PaLM-E generates a behavior tree in XML format which is then utilized by the surgical robot’s low-level controller to facilitate the handover of surgical tools. The first two instances of Med-PaLM 2 are used for semantic reasoning, while the third instance of Med-PaLM 2 and the only instance of PaLM-E are used for planning as shown in Figure 2.
The first instance of Med-PaLM 2 is used in the Knowledge module (1) to determine ambiguities in the results from the QueryDB module, a vector database of common surgical procedures, and (2) to generate a textual response to address the ambiguities. The first instance of Med-PaLM 2 uses Prompt 1, which is as follows: ‘Are there any ambiguities in the retrieved data with respect to the [surgical operation]?, your response should address the ambiguities or any missing tools or procedural steps relevant to the ongoing surgical operation [surgical operation], only include the surgical procedure and the required tools, do not provide explanations for how the ambiguities are identified’. The generated response from this instance of Med-PaLM 2 is concatenated to Prompt 2 to be used with the second instance of Med-PaLM 2. The second instance is also within the Knowledge module, and its role is to identify the target objects (surgical tools) required by a surgeon for each step of the surgery and to generate a prompt for the Planner module. Prompt 2 is as follows: ‘analyze the current surgical procedure details. For each step, identify the required surgical tool and its location as observed in the accompanying images. Generate a response formatted as a three-part entry for each step, delineated by colons. The format should be: procedure step number, name of the surgical tool, and the tool’s location. For example, ‘Step 1: Bone Saw: Tool Cart’.
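For illustration, the QueryDB lookup that feeds Prompt 1 could be sketched as a simple cosine-similarity search over embedded surgical-procedure entries; the embedding model and database layout below are assumptions.

```python
# Illustrative sketch of the QueryDB lookup: the operation description is
# embedded and matched against stored procedure entries by cosine similarity;
# the embedding model and database layout are assumptions.
from typing import Callable, List, Tuple
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def query_db(operation_description: str, embed: Callable[[str], np.ndarray],
             db: List[Tuple[str, np.ndarray]]) -> str:
    """db holds (procedure_text, embedding) pairs; return the best-matching text."""
    q = embed(operation_description)
    return max(db, key=lambda entry: cosine(q, entry[1]))[0]

# The top-ranked procedure text is what gets concatenated to Prompt 1 above.
```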
The response from the second instance of Med-PaLM 2 is concatenated to Prompt 3 which is to be used with PaLM-E in the Planner module. Prompt 3 includes RGB images from the OR, the output of the Knowledge module, the description of the available atomic actions the surgical robot can perform, and an example of a generated plan in XML format. These atomic actions are graspTool(tool) and navigateTo(location). Prompt 3 also instructs PaLM-E to generate the plan as a behavior tree in XML format which is utilized by the surgical robot’s low-level controller to facilitate the handover of surgical tools to a surgeon.
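A minimal sketch of how the low-level controller could walk the behavior tree generated by PaLM-E is shown below. It assumes the <BehaviorTree> element from the example in Table 4 is parsed as a single XML root with straight quotes in its function attributes, the action handlers (navigateTo, graspTool, releaseTool, reportFailure) are hypothetical robot-side callables, and error handling is reduced to the onFailure attribute used in that example.

```python
# Illustrative sketch of walking the <BehaviorTree> element from Table 4 and
# dispatching its atomic actions; handlers maps action names to hypothetical
# robot-side callables that return True on success.
import re
import xml.etree.ElementTree as ET

CALL = re.compile(r"(\w+)\((?:'([^']*)')?\)")


def execute_behavior_tree(xml_text: str, handlers: dict) -> None:
    root = ET.fromstring(xml_text)          # assumes a single <BehaviorTree> root
    for action in root.iter("Action"):
        match = CALL.match(action.get("function", ""))
        if match is None:
            continue
        name, arg = match.groups()          # e.g. ("navigateTo", "Tool Cart")
        ok = handlers[name](arg)
        if not ok and action.get("onFailure"):
            handlers[action.get("onFailure")](None)  # e.g. reportFailure
            break
```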
The third instance of Med-PaLM 2 is within the Evaluation module. The role of Med-PaLM 2 here is to (1) document the surgery for post-operative review and (2) generate feedback for the Planner module on changes that should be made based on a surgeon’s feedback in response to a failure to hand over surgical tools to the surgeon. Prompt 4 is used to instruct Med-PaLM 2 with ‘you are tasked with two functions: (1) Documenting the surgical procedure as a list surrounded by *(documentation)*. Here is an example: “*(1. Start of procedure, 2. Beginning on the rise of nose, …)*”, and (2) if there is failure in handover of surgical tools to a surgeon, identify the issues in the tool handover process based on surgeon feedback, and generate recommendations for PaLM-E to consider in generating the new plan. The suggestion to PaLM-E should be surrounded by &(suggestion)&, and should include the step(s) of the plan which have resulted in errors and the associated target object(s) and location(s) based on the corresponding images of the OR and feedback from a surgeon’. The output from the Evaluation module is then passed to the Planner module, where the cycle repeats until there is successful handover of the surgical tool.
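Finally, the Evaluation module’s output could be separated into its two delimited fields, the *(...)* documentation list and the &(...)& suggestion for the Planner module, with a simple parser such as the sketch below; the delimiters follow Prompt 4, while the parsing itself and the example strings are assumptions.

```python
# Illustrative sketch of extracting the two delimited fields from the
# Evaluation module's output, following the delimiters defined in Prompt 4.
import re

DOC = re.compile(r"\*\((.*?)\)\*", re.DOTALL)
SUGGESTION = re.compile(r"&\((.*?)\)&", re.DOTALL)


def parse_evaluation(output: str):
    """Return (documentation list, suggestion for the Planner module)."""
    doc = DOC.search(output)
    hint = SUGGESTION.search(output)
    return (doc.group(1) if doc else None,
            hint.group(1) if hint else None)


doc, hint = parse_evaluation(
    "*(1. Start of procedure, 2. Incision along the sternum)* "
    "&(Step 3 failed: the Scalpel is on the Surgical Table, not the Tool Cart)&")
```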
Table 4. Prompts for design 2.
Prompt 1: Are there any ambiguities in the retrieved data with respect to the [surgical operation]?, your response should address the ambiguities or any missing tools or procedural steps relevant to the ongoing surgical operation [surgical operation], only include the surgical procedure and the required tools, do not provide explanations for how the ambiguities are identified.
Prompt 2: Analyze the current surgical procedure details. For each step, identify the required surgical tool and its location as observed in the accompanying images. Generate a response formatted as a three-part entry for each step, delineated by colons. The format should be: procedure step number, name of the surgical tool, and the tool’s location. For example, ‘Step 1: Bone Saw: Tool Cart.’
Prompt 3: your role is to manage and facilitate the retrieval of surgical tools through generating behavior trees written in XML format. You are designed to interpret surgeon commands and feedback.
Functional Capabilities:
graspTool(tool): Grasps a specified surgical tool necessary for the procedure.
releaseTool(tool): Releases the currently held tool back into the tool tray.
navigateTo(location): Moves the robot’s arms to a specified location within the surgical field.
reportFailure(): Logs an error and signals for human assistance if a task cannot be completed.
Example:
<BehaviorTree>
<Sequence name = “Tool Retrieval for Surgery Preparation”>
<Action function = “navigateTo(‘Tool Cart’)”/>
<Action function = “graspTool(‘Scalpel’)” onFailure = “reportFailure”/>
<Action function = “navigateTo(‘Surgical Table’)” onFailure = “reportFailure”/>
<Action function = “releaseTool(‘Scalpel’)” onFailure = “reportFailure”/>
<Action function = “navigateTo(‘Tool Cart’)” onFailure = “reportFailure”/>
<Action function = “graspTool(‘Scissors’)” onFailure = “reportFailure”/>
<Action function = “navigateTo(‘Surgical Table’)” onFailure = “reportFailure”/>
<Action function = “releaseTool(‘Scissors’)” onFailure = “reportFailure”/>
<Action function = “navigateTo(‘Tool Cart’)” onFailure = “reportFailure”/>
<Action function = “graspTool(‘Suture Kit’)” onFailure = “reportFailure”/>
<Action function = “navigateTo(‘Surgical Table’)” onFailure = “reportFailure”/>
<Action function = “releaseTool(‘Suture Kit’)” onFailure = “reportFailure”/>
</Sequence>
</BehaviorTree>
<SubTree>
<Action name = “reportFailure”>
<Log message = “STUCK: Assistance required.”/>
<Signal function = “requestHelp”/>
</Action>
</SubTree>
Prompt 4: You are tasked with two functions: (1) Documenting the surgical procedure as a list surrounded by *(documentation)*. Here is an example: “*(1. Start of procedure, 2. Beginning on the rise of nose, …)*”, and (2) if there is failure in handover of surgical tools to a surgeon, identify the issues in the tool handover process based on surgeon feedback, and generate recommendations for PaLM-E to consider in generating the new plan. The suggestion to PaLM-E should be surrounded by &(suggestion)&, and should include the step(s) of the plan which have resulted in errors and the associated target object(s) and location(s) based on the corresponding images of the OR and feedback from a surgeon.

Author Contributions

Conceptualization, S.P. and G.N.; methodology, S.P. and G.N.; validation, S.P. and G.N.; formal analysis, S.P. and G.N.; investigation, S.P. and G.N.; resources, G.N.; data curation, S.P. and G.N.; writing—original draft preparation, S.P. and G.N.; writing—review and editing, S.P. and G.N.; visualization, S.P. and G.N.; supervision, G.N.; project administration, G.N.; funding acquisition, G.N. All authors have read and agreed to the published version of the manuscript.

Funding

AGE-WELL Inc., Canadian Frailty Network (CFN), Canada Research Chairs (CRC) program, Natural Sciences and Engineering Research Council of Canada (NSERC), and NSERC HeRo CREATE program.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors would like to thank and acknowledge the assistance of Clara Naini in helping to find and organize some of the scholarly papers on LLMs for robots used in the sections of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of this manuscript; or in the decision to publish the results.

References

  1. World Health Organization. Ageing and Health. Available online: https://www.who.int/news-room/fact-sheets/detail/ageing-and-health (accessed on 3 January 2024).
  2. Hornstein, J. Chronic Diseases in America|CDC. Available online: https://www.cdc.gov/chronicdisease/resources/infographic/chronic-diseases.htm (accessed on 19 January 2024).
  3. Hacker, K.A. COVID-19 and Chronic Disease: The Impact Now and in the Future. Prev. Chronic. Dis. 2021, 18, E62. [Google Scholar] [CrossRef] [PubMed]
  4. Express Entry Targeted Occupations: How Many Healthcare Workers Does Canada Need?|CIC News. Available online: https://www.cicnews.com/2023/10/express-entry-targeted-occupations-how-many-healthcare-workers-does-canada-need-1040056.html (accessed on 19 January 2024).
  5. Fact Sheet: Strengthening the Health Care Workforce|AHA. Available online: https://www.aha.org/fact-sheets/2021-05-26-fact-sheet-strengthening-health-care-workforce (accessed on 25 June 2024).
  6. Tulane University. Big Data in Health Care and Patient Outcomes. Available online: https://publichealth.tulane.edu/blog/big-data-in-healthcare/ (accessed on 19 January 2024).
  7. Gibson, K. The Impact of Health Informatics on Patient Outcomes. Available online: https://graduate.northeastern.edu/resources/impact-of-healthcare-informatics-on-patient-outcomes/ (accessed on 19 January 2024).
  8. Northeastern University Graduate Programs. Using Data Analytics to Predict Outcomes in Healthcare. Available online: https://journal.ahima.org/page/using-data-analytics-to-predict-outcomes-in-healthcare (accessed on 19 January 2024).
  9. Yu, P.; Xu, H.; Hu, X.; Deng, C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare 2023, 11, 2776. [Google Scholar] [CrossRef] [PubMed]
  10. Shen, D.; Wu, G.; Suk, H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef] [PubMed]
  11. Park, D.J.; Park, M.W.; Lee, H.; Kim, Y.-J.; Kim, Y.; Park, Y.H. Development of Machine Learning Model for Diagnostic Disease Prediction Based on Laboratory Tests. Sci. Rep. 2021, 11, 7567. [Google Scholar] [CrossRef] [PubMed]
  12. Webster, P. Six Ways Large Language Models Are Changing Healthcare. Nat. Med. 2023, 29, 2969–2971. [Google Scholar] [CrossRef] [PubMed]
  13. Benary, M.; Wang, X.D.; Schmidt, M.; Soll, D.; Hilfenhaus, G.; Nassir, M.; Sigler, C.; Knödler, M.; Keller, U.; Beule, D.; et al. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw. Open 2023, 6, e2343689. [Google Scholar] [CrossRef] [PubMed]
  14. UC Davis Health Minimally Invasive and Robotic Surgery|Comprehensive Surgical Services|UC Davis Health. Available online: https://health.ucdavis.edu/surgicalservices/minimally_invasive_surgery.html (accessed on 25 January 2024).
  15. Robotic Surgery: Robot-Assisted Surgery, Advantages, Disadvantages. Available online: https://my.clevelandclinic.org/health/treatments/22178-robotic-surgery (accessed on 19 January 2024).
  16. Sivakanthan, S.; Candiotti, J.L.; Sundaram, A.S.; Duvall, J.A.; Sergeant, J.J.G.; Cooper, R.; Satpute, S.; Turner, R.L.; Cooper, R.A. Mini-Review: Robotic Wheelchair Taxonomy and Readiness. Neurosci. Lett. 2022, 772, 136482. [Google Scholar] [CrossRef] [PubMed]
  17. Fanciullacci, C.; McKinney, Z.; Monaco, V.; Milandri, G.; Davalli, A.; Sacchetti, R.; Laffranchi, M.; De Michieli, L.; Baldoni, A.; Mazzoni, A.; et al. Survey of Transfemoral Amputee Experience and Priorities for the User-Centered Design of Powered Robotic Transfemoral Prostheses. J. Neuroeng. Rehabil. 2021, 18, 168. [Google Scholar] [CrossRef] [PubMed]
  18. MIT-Manus Robot Aids Physical Therapy of Stroke Victims. Available online: https://news.mit.edu/2000/manus-0607 (accessed on 20 January 2024).
  19. Maciejasz, P.; Eschweiler, J.; Gerlach-Hahn, K.; Jansen-Troy, A.; Leonhardt, S. A Survey on Robotic Devices for Upper Limb Rehabilitation. J. Neuroeng. Rehabil. 2014, 11, 3. [Google Scholar] [CrossRef] [PubMed]
  20. Teng, R.; Ding, Y.; See, K.C. Use of Robots in Critical Care: Systematic Review. J. Med. Internet Res. 2022, 24, e33380. [Google Scholar] [CrossRef] [PubMed]
  21. Abdullahi, U.; Muhammad, B.; Masari, A.; Bugaje, A. A Remote-Operated Humanoid Robot Based Patient Monitoring System. IRE J. 2023, 7, 17–22. [Google Scholar]
  22. Gonzalez, C. Service Robots Used for Medical Care and Deliveries—ASME. Available online: https://www.asme.org/topics-resources/content/are-service-bots-the-new-future-post-covid-19 (accessed on 3 January 2024).
  23. Sarker, S.; Jamal, L.; Ahmed, S.F.; Irtisam, N. Robotics and Artificial Intelligence in Healthcare during COVID-19 Pandemic: A Systematic Review. Robot. Auton. Syst. 2021, 146, 103902. [Google Scholar] [CrossRef] [PubMed]
  24. How Robots Became Essential Workers in the COVID-19 Response—IEEE Spectrum. Available online: https://spectrum.ieee.org/how-robots-became-essential-workers-in-the-covid19-response (accessed on 20 January 2024).
  25. The Clever Use of Robots during COVID-19—EHL Insights|Business. Available online: https://hospitalityinsights.ehl.edu/robots-during-covid-19 (accessed on 20 January 2024).
  26. Getson, C.; Nejat, G. The Adoption of Socially Assistive Robots for Long-Term Care: During COVID-19 and in a Post-Pandemic Society. Healthc. Manag. Forum 2022, 35, 301–309. [Google Scholar] [CrossRef] [PubMed]
  27. Henschel, A.; Laban, G.; Cross, E.S. What Makes a Robot Social? A Review of Social Robots from Science Fiction to a Home or Hospital Near You. Curr. Robot. Rep. 2021, 2, 9–19. [Google Scholar] [CrossRef] [PubMed]
  28. Kim, J.; Kim, S.; Kim, S.; Lee, E.; Heo, Y.; Hwang, C.-Y.; Choi, Y.-Y.; Kong, H.-J.; Ryu, H.; Lee, H. Companion Robots for Older Adults: Rodgers’ Evolutionary Concept Analysis Approach. Intell. Serv. Robot. 2021, 14, 729–739. [Google Scholar] [CrossRef] [PubMed]
  29. Denecke, K.; Baudoin, C.R. A Review of Artificial Intelligence and Robotics in Transformed Health Ecosystems. Front. Med. 2022, 9, 795957. [Google Scholar] [CrossRef]
  30. Sevilla-Salcedo, J.; Fernádez-Rodicio, E.; Martín-Galván, L.; Castro-González, Á.; Castillo, J.C.; Salichs, M.A. Using Large Language Models to Shape Social Robots’ Speech. Int. J. Interact. Multimed. Artif. Intell. 2023, 8, 6. [Google Scholar] [CrossRef]
  31. Addlesee, A.; Sieińska, W.; Gunson, N.; Garcia, D.H.; Dondrup, C.; Lemon, O. Multi-Party Goal Tracking with LLMs: Comparing Pre-Training, Fine-Tuning, and Prompt Engineering 2023. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czechia, 11–15 September 2023. [Google Scholar]
  32. Pandya, A. ChatGPT-Enabled daVinci Surgical Robot Prototype: Advancements and Limitations. Robotics 2023, 12, 97. [Google Scholar] [CrossRef]
  33. Elgedawy, R.; Srinivasan, S.; Danciu, I. Dynamic Q&A of Clinical Documents with Large Language Models. arXiv 2024, arXiv:2401.10733. [Google Scholar]
  34. Hu, M.; Pan, S.; Li, Y.; Yang, X. Advancing Medical Imaging with Language Models: A journey from n-grams to chatgpt. arXiv 2023, arXiv:2304.04920. [Google Scholar]
  35. A Comprehensive Overview of Large Language Models. Available online: https://ar5iv.labs.arxiv.org/html/2307.06435 (accessed on 8 March 2024).
  36. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  37. Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder. Available online: https://ar5iv.labs.arxiv.org/html/2304.04052 (accessed on 8 March 2024).
  38. King, J.; Baffour, P.; Crossley, S.; Holbrook, R.; Demkin, M. LLM—Detect AI Generated Text. Available online: https://kaggle.com/competitions/llm-detect-ai-generated-text (accessed on 8 March 2024).
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; 2017; pp. 6000–6010. [Google Scholar]
  40. Burns, K.; Jain, A.; Go, K.; Xia, F.; Stark, M.; Schaal, S.; Hausman, K. Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks. arXiv 2023, arXiv:2404.06645. [Google Scholar]
  41. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. Available online: https://arxiv.org/abs/2007.15779v6 (accessed on 8 March 2024).
  42. Gupta, A.; Waldron, A. Sharing Google’s Med-PaLM 2 Medical Large Language Model, or LLM. Available online: https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model (accessed on 20 January 2024).
  43. Wang, H.; Gao, C.; Dantona, C.; Hull, B.; Sun, J. DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-Related Group for Hospitalized Patients. Npj Digit. Med. 2024, 7, 1–9. [Google Scholar] [CrossRef] [PubMed]
  44. Schneider, E.T.R.; de Souza, J.V.A.; Gumiel, Y.B.; Moro, C.; Paraiso, E.C. A GPT-2 Language Model for Biomedical Texts in Portuguese. In Proceedings of the 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), Aveiro, Portugal, 7–9 June 2021; pp. 474–479. [Google Scholar]
  45. Lehman, E.; Johnson, A. Clinical-T5: Large Language Models Built Using MIMIC Clinical Text. PhysioNet 2023. [Google Scholar] [CrossRef]
  46. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; p. 159. [Google Scholar]
  47. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  48. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  49. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  50. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  51. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T. Alpaca: A Strong, Replicable Instruction-Following Mode; Stanford Center for Research on Foundation Models: Stanford, CA, USA, 2023; Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 8 March 2024).
  52. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  53. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
  54. Touvron, H.; Martin, L.; Stone, K.R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2020, arXiv:2307.09288. [Google Scholar]
  55. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. Available online: https://openai.com/index/better-language-models/ (accessed on 26 June 2024).
  56. Jin, D.; Pan, E.; Oufattole, N.; Weng, W.-H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Appl. Sci. 2020, 11, 6421. [Google Scholar] [CrossRef]
  57. Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. In Proceedings of the Machine Learning Research (PMLR), ACM Conference on Health, Inference, and Learning (CHIL), Virtual, 7 April 2022; Volume 174, pp. 248–260. [Google Scholar]
  58. Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 2567–2577. [Google Scholar]
  59. Tozzi, C.; Zittrain, J. Introduction. In For Fun and Profit: A History of the Free and Open Source Software Revolution; MIT Press: Cambridge, MA, USA, 2017; Available online: https://ieeexplore.ieee.org/document/8047084 (accessed on 29 May 2024).
  60. Spirling, A. Why Open-Source Generative AI Models Are an Ethical Way Forward for Science. Nature 2023, 616, 413. [Google Scholar] [CrossRef] [PubMed]
  61. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
  62. Self-Influence Guided Data Reweighting for Language Model Pre-Training. Available online: https://ar5iv.labs.arxiv.org/html/2311.00913 (accessed on 8 March 2024).
  63. Solaiman, I.; Dennison, C. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Sydney, Australia, 6–14 December 2021; p. 448. [Google Scholar]
  64. Prompt Design and Engineering: Introduction and Advanced Methods. Available online: https://ar5iv.labs.arxiv.org/html/2401.14423 (accessed on 8 March 2024).
  65. Ratner, N.; Levine, Y.; Belinkov, Y.; Ram, O.; Magar, I.; Abend, O.; Karpas, E.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. Parallel Context Windows for Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 6383–6402. [Google Scholar]
  66. Chen, H.; Pasunuru, R.; Weston, J.; Celikyilmaz, A. Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading. arXiv 2023, arXiv:2310.05029. [Google Scholar]
  67. Fairness-Guided Few-Shot Prompting for Large Language Models. Available online: https://ar5iv.labs.arxiv.org/html/2303.13217 (accessed on 8 March 2024).
  68. Skill-Based Few-Shot Selection for In-Context Learning. Available online: https://ar5iv.labs.arxiv.org/html/2305.14210 (accessed on 8 March 2024).
  69. Extending Context Window of Large Language Models via Position Interpolation. Available online: https://ar5iv.labs.arxiv.org/html/2306.15595 (accessed on 8 March 2024).
  70. Parallel Context Windows Improve In-Context Learning of Large Language Models. Available online: https://ar5iv.labs.arxiv.org/html/2212.10947 (accessed on 8 March 2024).
  71. MM-LLMs: Recent Advances in MultiModal Large Language Models. Available online: https://ar5iv.labs.arxiv.org/html/2401.13601 (accessed on 8 March 2024).
  72. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  73. Chen, D.; Chang, A.; Nießner, M. ScanRefer: 3D Object Localization in RGB-DScans Using Natural Language. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 12 November 2020; pp. 202–221. [Google Scholar]
  74. Liu, J.X.; Yang, Z.; Idrees, I.; Liang, S.; Schornstein, B.; Tellex, S.; Shah, A. Grounding Complex Natural Language Commands for Temporal Tasks in Unseen Environments. In Proceedings of the 7th Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023. [Google Scholar]
  75. Liu, M.; Shen, Y.; Yao, B.M.; Wang, S.; Qi, J.; Xu, Z.; Huang, L. KnowledgeBot: Improving Assistive Robot for Task Completion and Live Interaction via Neuro-Symbolic Reasoning. In Proceedings of the Alexa Prize SimBot Challenge, Virtual Event, 6 April 2023. [Google Scholar]
  76. Salichs, M.A.; Castro-González, Á.; Salichs, E.; Fernández-Rodicio, E.; Maroto-Gómez, M.; Gamboa-Montero, J.J.; Marques-Villarroya, S.; Castillo, J.C.; Alonso-Martín, F.; Malfaz, M. Mini: A New Social Robot for the Elderly. Int. J. Soc. Robot. 2020, 12, 1231–1249. [Google Scholar] [CrossRef]
  77. Zhao, X.; Li, M.; Weber, C.; Hafez, M.B.; Wermter, S. Chat with the Environment: Interactive Multimodal Perception Using Large Language Models. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1 October 2023; pp. 3590–3596. [Google Scholar]
  78. Singh, I.; Blukis, V.; Mousavian, A.; Goyal, A.; Xu, D.; Tremblay, J.; Fox, D.; Thomason, J.; Garg, A. ProgPrompt: Program Generation for Situated Robot Task Planning Using Large Language Models. Auton. Robots 2023, 47, 999–1012. [Google Scholar] [CrossRef]
  79. Jin, Y.; Li, D.; A, Y.; Shi, J.; Hao, P.; Sun, F.; Zhang, J.; Fang, B. RobotGPT: Robot Manipulation Learning from ChatGPT. IEEE Robot. Autom. Lett. 2023, 9, 2543–2550. [Google Scholar] [CrossRef]
  80. Obinata, Y.; Kanazawa, N.; Kawaharazuka, K.; Yanokura, I.; Kim, S.; Okada, K.; Inaba, M. Foundation Model Based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots. arXiv 2023, arXiv:2308.03357. [Google Scholar]
  81. Murali, P.; Steenstra, I.; Yun, H.S.; Shamekhi, A.; Bickmore, T. Improving Multiparty Interactions with a Robot Using Large Language Models. In Proceedings of the Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 19 April 2023; pp. 1–8. [Google Scholar]
  82. Paiva, A.; Leite, I.; Boukricha, H.; Wachsmuth, I. Empathy in Virtual Agents and Robots: A Survey. ACM Trans. Interact. Intell. Syst. 2017, 7, 11. [Google Scholar] [CrossRef]
  83. Cherakara, N.; Varghese, F.; Shabana, S.; Nelson, N.; Karukayil, A.; Kulothungan, R.; Farhan, M.A.; Nesset, B.; Moujahid, M.; Dinkar, T.; et al. FurChat: An Embodied Conversational Agent Using LLMs, Combining Open and Closed-Domain Dialogue with Facial Expressions. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czechia, 11–15 September 2023; pp. 588–592. [Google Scholar]
  84. Zhang, B.; Soh, H. Large Language Models as Zero-Shot Human Models for Human-Robot Interaction. arXiv 2023, arXiv:2303.03548. [Google Scholar]
  85. Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. arXiv 2023, arXiv:2309.12311. [Google Scholar]
  86. Yoshida, T.; Masumori, A.; Ikegami, T. From Text to Motion: Grounding GPT-4 in a Humanoid Robot “Alter3”. arXiv 2023, arXiv:2312.06571. [Google Scholar]
  87. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv 2022, arXiv:2204.01691. [Google Scholar]
  88. Lykov, A.; Tsetserukou, D. LLM-BRAIn: AI-Driven Fast Generation of Robot Behaviour Tree Based on Large Language Model. arXiv 2023, arXiv:2305.19352. [Google Scholar]
  89. Kubota, A.; Cruz-Sandoval, D.; Kim, S.; Twamley, E.W.; Riek, L.D. Cognitively Assistive Robots at Home: HRI Design Patterns for Translational Science. In Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Hokkaido, Japan, 7–10 March 2022; pp. 53–62. [Google Scholar]
  90. Elbeleidy, S.; Rosen, D.; Liu, D.; Shick, A.; Williams, T. Analyzing Teleoperation Interface Usage of Robots in Therapy for Children with Autism. In Proceedings of the ACM Interaction Design and Children Conference, Athens, Greece, 18 May 2021; pp. 112–118. [Google Scholar]
  91. Louie, W.-Y.G.; Nejat, G. A Social Robot Learning to Facilitate an Assistive Group-Based Activity from Non-Expert Caregivers. Int. J. Soc. Robot. 2020, 12, 1159–1176. [Google Scholar] [CrossRef]
  92. Mishra, D.; Romero, G.A.; Pande, A.; Nachenahalli Bhuthegowda, B.; Chaskopoulos, D.; Shrestha, B. An Exploration of the Pepper Robot’s Capabilities: Unveiling Its Potential. Appl. Sci. 2024, 14, 110. [Google Scholar] [CrossRef]
  93. Anderson, P.L.; Lathrop, R.A.; Herrell, S.D.; Webster, R.J. Comparing a Mechanical Analogue With the Da Vinci User Interface: Suturing at Challenging Angles. IEEE Robot. Autom. Lett. 2016, 1, 1060–1065. [Google Scholar] [CrossRef] [PubMed]
  94. Muradore, R.; Fiorini, P.; Akgun, G.; Barkana, D.E.; Bonfe, M.; Boriero, F.; Caprara, A.; De Rossi, G.; Dodi, R.; Elle, O.J.; et al. Development of a Cognitive Robotic System for Simple Surgical Tasks. Int. J. Adv. Robot. Syst. 2015, 12, 37. [Google Scholar] [CrossRef]
  95. Łukasik, S.; Tobis, S.; Suwalska, J.; Łojko, D.; Napierała, M.; Proch, M.; Neumann-Podczaska, A.; Suwalska, A. The Role of Socially Assistive Robots in the Care of Older People: To Assist in Cognitive Training, to Remind or to Accompany? Sustainability 2021, 13, 10394. [Google Scholar] [CrossRef]
  96. Natural Language Robot Programming: NLP Integrated with Autonomous Robotic Grasping. Available online: https://ar5iv.labs.arxiv.org/html/2304.02993 (accessed on 8 March 2024).
  97. Papadopoulos, I.; Koulouglioti, C.; Lazzarino, R.; Ali, S. Enablers and Barriers to the Implementation of Socially Assistive Humanoid Robots in Health and Social Care: A Systematic Review. BMJ Open 2020, 10, e033096. [Google Scholar] [CrossRef]
  98. Kim, C.Y.; Lee, C.P.; Mutlu, B. Understanding Large-Language Model (LLM)-Powered Human-Robot Interaction. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI ′24), Boulder, CO, USA, 11–14 March 2024; pp. 371–380. [Google Scholar]
  99. Emotion Is All You Need?-Boosting ChatGPT Performance with Emotional Stimulus-FlowGPT. Available online: https://flowgpt.com/blog/emoGPT (accessed on 30 January 2024).
  100. Mishra, C.; Verdonschot, R.; Hagoort, P.; Skantze, G. Real-Time Emotion Generation in Human-Robot Dialogue Using Large Language Models. Front. Robot. AI 2023, 10, 1271610. [Google Scholar] [CrossRef] [PubMed]
  101. Wang, C.; Hasler, S.; Tanneberg, D.; Ocker, F.; Joublin, F.; Ceravola, A.; Deigmoeller, J.; Gienger, M. LaMI: Large Language Models for Multi-Modal Human-Robot Interaction. In Proceedings of the Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; p. 218. [Google Scholar]
  102. Townsend, D.; MajidiRad, A. Trust in Human-Robot Interaction Within Healthcare Services: A Review Study. In Proceedings of the Volume 7: 46th Mechanisms and Robotics Conference (MR), St. Louis, MI, USA, 14 August 2022; p. V007T07A030. [Google Scholar]
  103. Abdi, J.; Al-Hindawi, A.; Ng, T.; Vizcaychipi, M.P. Scoping Review on the Use of Socially Assistive Robot Technology in Elderly Care. BMJ Open 2018, 8, e018815. [Google Scholar] [CrossRef] [PubMed]
  104. Abbott, R.; Orr, N.; McGill, P.; Whear, R.; Bethel, A.; Garside, R.; Stein, K.; Thompson-Coon, J. How Do “Robopets” Impact the Health and Well-being of Residents in Care Homes? A Systematic Review of Qualitative and Quantitative Evidence. Int. J. Older People Nurs. 2019, 14, e12239. [Google Scholar] [CrossRef] [PubMed]
  105. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [Google Scholar]
  106. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. arXiv 2020. [Google Scholar] [CrossRef]
  107. Cañete, J.; Chaperon, G.; Fuentes, R.; Ho, J.-H.; Kang, H.; Pérez, J. Spanish Pre-Trained BERT Model and Evaluation Data. arXiv 2023, arXiv:2308.02976. [Google Scholar]
  108. Da Vinci Robotic Surgical Systems|Intuitive. Available online: https://www.intuitive.com/en-us/products-and-services/da-vinci (accessed on 9 March 2024).
  109. ROS: Home. Available online: https://www.ros.org/ (accessed on 21 September 2023).
  110. Palrobot ARI—The Social and Collaborative Robot. Available online: https://pal-robotics.com/robots/ari/ (accessed on 9 March 2024).
  111. The World’s Most Advanced Social Robot. Available online: https://furhatrobotics.com/ (accessed on 9 March 2024).
  112. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; p. 1182. [Google Scholar]
  113. Su, H.; Qi, W.; Chen, J.; Yang, C.; Sandoval, J.; Laribi, M.A. Recent Advancements in Multimodal Human–Robot Interaction. Front. Neurorobot. 2023, 17, 1084000. [Google Scholar] [CrossRef] [PubMed]
  114. Saunderson, S.; Nejat, G. How Robots Influence Humans: A Survey of Nonverbal Communication in Social Human–Robot Interaction. Int. J. Soc. Robot. 2019, 11, 575–608. [Google Scholar] [CrossRef]
  115. Maurtua, I.; Fernández, I.; Tellaeche, A.; Kildal, J.; Susperregi, L.; Ibarguren, A.; Sierra, B. Natural Multimodal Communication for Human–Robot Collaboration. Int. J. Adv. Robot. Syst. 2017, 14, 1729881417716043. [Google Scholar] [CrossRef]
  116. Schreiter, T.; Morillo-Mendez, L.; Chadalavada, R.T.; Rudenko, A.; Billing, E.; Magnusson, M.; Arras, K.O.; Lilienthal, A.J. Advantages of Multimodal versus Verbal-Only Robot-to-Human Communication with an Anthropomorphic Robotic Mock Driver. In Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Busan, Republic of Korea, 28–31 August 2023; pp. 293–300. [Google Scholar]
  117. RITA–Reminiscence Interactive Therapy and Activities—mPower. Available online: https://mpowerhealth.eu/impact/reducing-the-digital-divide-connecting-and-empowering/rita-reminiscence-interactive-therapy-and-activities/ (accessed on 27 May 2024).
  118. Roles and Challenges of Semantic Intelligence in Healthcare Cognitive Computing; Carbonaro, A.; Tiwari, S.; Ortiz-Rodriguez, F.; Janev, V. (Eds.) Studies on the Semantic Web/Ssw; IOS Press: Amsterdam, The Netherlands, 2023; ISBN 978-1-64368-460-4. [Google Scholar]
  119. National Library of Medicine The Semantic Network. Available online: https://www.nlm.nih.gov/research/umls/new_users/online_learning/OVR_003.html (accessed on 29 January 2024).
  120. Zhang, H.; Hu, H.; Diller, M.; Hogan, W.R.; Prosperi, M.; Guo, Y.; Bian, J. Semantic Standards of External Exposome Data. Environ. Res. 2021, 197, 111185. [Google Scholar] [CrossRef] [PubMed]
  121. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.Z.; Humayun, M. Capturing Semantic Relationships in Electronic Health Records Using Knowledge Graphs: An Implementation Using MIMIC III Dataset and GraphDB. Healthcare 2023, 11, 1762. [Google Scholar] [CrossRef] [PubMed]
  122. Busso, M.; Gonzalez, M.P.; Scartascini, C. On the Demand for Telemedicine: Evidence from the COVID-19 Pandemic. Health Econ. 2022, 31, 1491–1505. [Google Scholar] [CrossRef] [PubMed]
  123. Chatterjee, A.; Prinz, A.; Riegler, M.A.; Meena, Y.K. An Automatic and Personalized Recommendation Modelling in Activity eCoaching with Deep Learning and Ontology. Sci. Rep. 2023, 13, 10182. [Google Scholar] [CrossRef] [PubMed]
  124. Barisevičius, G.; Coste, M.; Geleta, D.; Juric, D.; Khodadadi, M.; Stoilos, G.; Zaihrayeu, I. Supporting Digital Healthcare Services Using Semantic Web Technologies. In Proceedings of the 17th International Semantic Web Conference, Monterey, CA, USA, 8–12 October 2018; pp. 291–306. [Google Scholar]
  125. Yu, W.D.; Jonnalagadda, S.R. Semantic Web and Mining in Healthcare. In Proceedings of the HEALTHCOM 2006 8th International Conference on e-Health Networking, Applications and Services, New Delhi, India, 17–19 August 2006; pp. 198–201. [Google Scholar]
  126. Kara, N.; Dragoi, O.A. Reasoning with Contextual Data in Telehealth Applications. In Proceedings of the Third IEEE International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob 2007), White Plains, NY, USA, 8–10 October 2007; p. 69. [Google Scholar]
  127. Liu, W.; Daruna, A.; Patel, M.; Ramachandruni, K.; Chernova, S. A Survey of Semantic Reasoning Frameworks for Robotic Systems. Robot. Auton. Syst. 2023, 159, 104294. [Google Scholar] [CrossRef]
  128. Tang, X.; Zheng, Z.; Li, J.; Meng, F.; Zhu, S.-C.; Liang, Y.; Zhang, M. Large Language Models Are In-Context Semantic Reasoners Rather than Symbolic Reasoners. arXiv 2023, arXiv:2305.14825. [Google Scholar]
  129. Wen, Y.; Zhang, Y.; Huang, L.; Zhou, C.; Xiao, C.; Zhang, F.; Peng, X.; Zhan, W.; Sui, Z. Semantic Modelling of Ship Behavior in Harbor Based on Ontology and Dynamic Bayesian Network. ISPRS Int. J. Geo-Inf. 2019, 8, 107. [Google Scholar] [CrossRef]
  130. Zheng, W.; Liu, X.; Ni, X.; Yin, L.; Yang, B. Improving Visual Reasoning Through Semantic Representation. IEEE Access 2021, 9, 91476–91486. [Google Scholar] [CrossRef]
  131. Pise, A.A.; Vadapalli, H.; Sanders, I. Relational Reasoning Using Neural Networks: A Survey. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2021, 29, 237–258. [Google Scholar] [CrossRef]
  132. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. arXiv 2023, arXiv:2210.13382. [Google Scholar]
  133. Available online: https://everydayrobots.com (accessed on 23 February 2024).
  134. Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; Funkhouser, T. OpenScene: 3D Scene Understanding with Open Vocabularies. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 815–824. [Google Scholar]
  135. Boston Dynamics Spot|Boston Dynamics. Available online: https://bostondynamics.com/products/spot/ (accessed on 24 February 2024).
  136. Avgousti, S.; Christoforou, E.G.; Panayides, A.S.; Voskarides, S.; Novales, C.; Nouaille, L.; Pattichis, C.S.; Vieyres, P. Medical Telerobotic Systems: Current Status and Future Trends. Biomed. Eng. OnLine 2016, 15, 96. [Google Scholar] [CrossRef] [PubMed]
  137. Yang, G.; Lv, H.; Zhang, Z.; Yang, L.; Deng, J.; You, S.; Du, J.; Yang, H. Keep Healthcare Workers Safe: Application of Teleoperated Robot in Isolation Ward for COVID-19 Prevention and Control. Chin. J. Mech. Eng. 2020, 33, 47. [Google Scholar] [CrossRef]
  138. Battaglia, E.; Boehm, J.; Zheng, Y.; Jamieson, A.R.; Gahan, J.; Fey, A.M. Rethinking Autonomous Surgery: Focusing on Enhancement Over Autonomy. Eur. Urol. Focus 2021, 7, 696–705. [Google Scholar] [CrossRef] [PubMed]
  139. Leonard, S.; Wu, K.L.; Kim, Y.; Krieger, A.; Kim, P.C.W. Smart Tissue Anastomosis Robot (STAR): A Vision-Guided Robotics System for Laparoscopic Suturing. IEEE Trans. Biomed. Engineering 2014, 61, 1305–1317. Available online: https://ieeexplore.ieee.org/document/6720152 (accessed on 27 May 2024). [CrossRef] [PubMed]
  140. Takada, C.; Suzuki, T.; Afifi, A.; Nakaguchi, T. Hybrid Tracking and Matching Algorithm for Mosaicking Multiple Surgical Views. In Computer-Assisted and Robotic Endoscopy; Peters, T., Yang, G.-Z., Navab, N., Mori, K., Luo, X., Reichl, T., McLeod, J., Eds.; Springer International Publishing: Athens, Greece, 17 October 2017; pp. 24–35. [Google Scholar]
  141. Afifi, A.; Takada, C.; Yoshimura, Y.; Nakaguchi, T. Real-Time Expanded Field-of-View for Minimally Invasive Surgery Using Multi-Camera Visual Simultaneous Localization and Mapping. Sensors 2021, 21, 2106. [Google Scholar] [CrossRef] [PubMed]
  142. Lamini, C.; Benhlima, S.; Elbekri, A. Genetic Algorithm Based Approach for Autonomous Mobile Robot Path Planning. Procedia Comput. Sci. 2018, 127, 180–189. [Google Scholar] [CrossRef]
  143. Xiang, D.; Lin, H.; Ouyang, J.; Huang, D. Combined Improved A* and Greedy Algorithm for Path Planning of Multi-Objective Mobile Robot. Sci. Rep. 2022, 12, 13273. Available online: https://www.nature.com/articles/s41598-022-17684-0 (accessed on 10 March 2024). [CrossRef] [PubMed]
  144. de Sales Guerra Tsuzuki, M.; de Castro Martins, T.; Takase, F.K. Robot Path Planning Using Simulated Annealing. IFAC Proc. Vol. 2006, 39, 175–180. Available online: https://www.sciencedirect.com/science/article/pii/S1474667015358250 (accessed on 10 March 2024). [CrossRef]
  145. End-to-End Deep Learning-Based Framework for Path Planning and Collision Checking: Bin Picking Application. Available online: https://ar5iv.labs.arxiv.org/html/2304.00119 (accessed on 3 March 2024).
  146. Quinones-Ramirez, M.; Rios-Martinez, J.; Uc-Cetina, V. Robot Path Planning Using Deep Reinforcement Learning. arXiv 2023, arXiv:2302.09120. [Google Scholar]
  147. Nicola, F.; Fujimoto, Y.; Oboe, R. A LSTM Neural Network applied to Mobile Robots Path Planning. In Proceedings of the IEEE International Conference on Industrial Informatics (INDIN), Porto, Portugal, 18–20 July 2018; pp. 349–354. [Google Scholar]
  148. Hjeij, M.; Vilks, A. A Brief History of Heuristics: How Did Research on Heuristics Evolve? Humanit. Soc. Sci. Commun. 2023, 10, 64. [Google Scholar] [CrossRef]
  149. Kawaguchi, K.; Kaelbling, L.; Bengio, Y. Generalization in Deep Learning. In Mathematical Aspects of Deep Learning; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  150. On the Generalization Mystery in Deep Learning. Available online: https://ar5iv.labs.arxiv.org/html/2203.10036 (accessed on 10 March 2024).
  151. Understanding LLMs: A Comprehensive Overview from Training to Inference. Available online: https://ar5iv.labs.arxiv.org/html/2401.02038 (accessed on 3 May 2024).
  152. ADaPT: As-Needed Decomposition and Planning with Language Models. Available online: https://arxiv.org/abs/2311.05772 (accessed on 10 March 2024).
  153. Wang, J.; Wu, Z.; Li, Y.; Jiang, H.; Shu, P.; Shi, E.; Hu, H.; Ma, C.; Liu, Y.; Wang, X.; et al. Large Language Models for Robotics: Opportunities, Challenges, and Perspectives. Available online: https://arxiv.org/abs/2401.04334v1 (accessed on 3 March 2024).
  154. Tjomsland, J.; Kalkan, S.; Gunes, H. Mind Your Manners! A Dataset and A Continual Learning Approach for Assessing Social Appropriateness of Robot Actions. Front. Robot. AI 2022, 9, 669420. [Google Scholar] [CrossRef] [PubMed]
  155. Soh, H.; Pan, S.; Min, C.; Hsu, D. The Transfer of Human Trust in Robot Capabilities across Tasks. In Proceedings of the Robotics: Science and Systems XIV; Robotics: Science and Systems Foundation, Pittsburgh, PA, USA, 26 June 2018. [Google Scholar]
  156. Soh, H.; Xie, Y.; Chen, M.; Hsu, D. Multi-Task Trust Transfer for Human–Robot Interaction. Int. J. Robot. Res. 2020, 39, 233–249. [Google Scholar] [CrossRef]
  157. Sap, M.; Rashkin, H.; Chen, D.; LeBras, R.; Choi, Y. SocialIQA: Commonsense Reasoning about Social Interactions. arXiv 2019, arXiv:1904.09728. [Google Scholar]
  158. FRANKA RESEARCH 3. Available online: https://franka.de/ (accessed on 10 March 2024).
  159. Puig, X.; Ra, K.; Boben, M.; Li, J.; Wang, T.; Fidler, S.; Torralba, A. VirtualHome: Simulating Household Activities Via Programs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8494–8502. [Google Scholar]
  160. Renfrow, J. New Robot from Pillo Health, Black + Decker Offers In-Home Monitoring, Medication Dispensing. Fierce Healthcare. Available online: https://www.fiercehealthcare.com/tech/new-robot-offers-home-monitoring-and-medication-dispensing (accessed on 3 May 2024).
  161. Speech-to-Text AI: Speech Recognition and Transcription|Google Cloud. Available online: https://cloud.google.com/speech-to-text (accessed on 9 March 2024).
  162. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  163. Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting Twenty-Thousand Classes Using Image-Level Supervision. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 350–368. [Google Scholar]
  164. Tobeta, M.; Sawada, Y.; Zheng, Z.; Takamuku, S.; Natori, N. E2Pose: Fully Convolutional Networks for End-to-End Multi-Person Pose Estimation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 532–537. [Google Scholar]
  165. Alternative Machine. Android Alter3. Available online: http://alternativemachine.co.jp/en/project/alter3/ (accessed on 4 March 2024).
  166. Shi, H.; Ball, L.; Thattai, G.; Zhang, D.; Hu, L.; Gao, Q.; Shakiah, S.; Gao, X.; Padmakumar, A.; Yang, B.; et al. Alexa, Play with Robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI. arXiv 2023, arXiv:2308.05221. [Google Scholar]
  167. Gao, Q.; Thattai, G.; Shakiah, S.; Gao, X.; Pansare, S.; Sharma, V.; Sukhatme, G.; Shi, H.; Yang, B.; Zheng, D.; et al. Alexa Arena: A User-Centric Interactive Platform for Embodied AI. arXiv 2023. [Google Scholar] [CrossRef]
  168. Ethics of Care in Technology-Mediated Healthcare Practices: A Scoping Review. Scand. J. Caring Sci. 2023. Available online: https://onlinelibrary.wiley.com/doi/full/10.1111/scs.13186 (accessed on 10 March 2024).
  169. Ethical Implications of AI and Robotics in Healthcare: A Review. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10727550/ (accessed on 10 March 2024).
  170. The Value and Importance of Health Information Privacy: Beyond the HIPAA Privacy Rule. NCBI Bookshelf. Available online: https://www.ncbi.nlm.nih.gov/books/NBK9579/ (accessed on 10 March 2024).
  171. Harrer, S. Attention Is Not All You Need: The Complicated Case of Ethically Using Large Language Models in Healthcare and Medicine. eBioMedicine 2023, 90, 104512. [Google Scholar] [CrossRef] [PubMed]
  172. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  173. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event, 3 March 2021; pp. 610–623. [Google Scholar]
  174. Lareyre, F.; Raffort, J. Ethical Concerns Regarding the Use of Large Language Models in Healthcare. EJVES Vasc. Forum 2023, 61, 1. [Google Scholar] [CrossRef] [PubMed]
  175. Li, H.; Moon, J.T.; Purkayastha, S.; Celi, L.A.; Trivedi, H.; Gichoya, J.W. Ethics of Large Language Models in Medicine and Medical Research. Lancet Digit. Health 2023, 5, e333–e335. [Google Scholar] [CrossRef] [PubMed]
  176. Jeyaraman, M.; Balaji, S.; Jeyaraman, N.; Yadav, S. Unraveling the Ethical Enigma: Artificial Intelligence in Healthcare. Cureus 2023, 15, e43262. [Google Scholar] [CrossRef] [PubMed]
  177. Wang, C.; Liu, S.; Yang, H.; Guo, J.; Wu, Y.; Liu, J. Ethical Considerations of Using ChatGPT in Health Care. J. Med. Internet Res. 2023, 25, e48009. [Google Scholar] [CrossRef] [PubMed]
  178. Stahl, B.C.; Eke, D. The Ethics of ChatGPT–Exploring the Ethical Issues of an Emerging Technology. Int. J. Inf. Manag. 2024, 74, 102700. [Google Scholar] [CrossRef]
  179. Oniani, D.; Hilsman, J.; Peng, Y.; Poropatich, R.K.; Pamplin, J.C.; Legault, G.L.; Wang, Y. Adopting and Expanding Ethical Principles for Generative Artificial Intelligence from Military to Healthcare. Npj Digit. Med. 2023, 6, 225. [Google Scholar] [CrossRef] [PubMed]
  180. Clusmann, J.; Kolbinger, F.R.; Muti, H.S.; Carrero, Z.I.; Eckardt, J.-N.; Laleh, N.G.; Löffler, C.M.L.; Schwarzkopf, S.-C.; Unger, M.; Veldhuizen, G.P.; et al. The Future Landscape of Large Language Models in Medicine. Commun. Med. 2023, 3, 141. [Google Scholar] [CrossRef] [PubMed]
  181. Murphy, K.; Di Ruggiero, E.; Upshur, R.; Willison, D.J.; Malhotra, N.; Cai, J.C.; Malhotra, N.; Lui, V.; Gibson, J. Artificial Intelligence for Good Health: A Scoping Review of the Ethics Literature. BMC Med. Ethics 2021, 22, 14. [Google Scholar] [CrossRef] [PubMed]
  182. Sharkey, A.; Sharkey, N. Granny and the Robots: Ethical Issues in Robot Care for the Elderly. Ethics Inf. Technol. 2012, 14, 27–40. [Google Scholar] [CrossRef]
  183. Siqueira-Batista, R.; Souza, C.R.; Maia, P.M.; Siqueira, S.L. Robotic Surgery: Bioethical Aspects. ABCD Arq. Bras. Cir. Dig. (São Paulo) 2016, 29, 287–290. [Google Scholar] [CrossRef] [PubMed]
  184. O’Brolcháin, F. Robots and People with Dementia: Unintended Consequences and Moral Hazard. Nurs. Ethics 2019, 26, 962–972. [Google Scholar] [CrossRef] [PubMed]
  185. House of Lords. AI in the UK: Ready, Willing and Able. 2017. Available online: https://publications.parliament.uk/pa/ld201719/ldselect/ldai/100/100.pdf (accessed on 10 March 2024).
  186. Decker, M. Caregiving Robots and Ethical Reflection: The Perspective of Interdisciplinary Technology Assessment. AI Soc. 2008, 22, 315–330. [Google Scholar] [CrossRef]
  187. Coeckelbergh, M.; Pop, C.; Simut, R.; Peca, A.; Pintea, S.; David, D.; Vanderborght, B. A Survey of Expectations About the Role of Robots in Robot-Assisted Therapy for Children with ASD: Ethical Acceptability, Trust, Sociability, Appearance, and Attachment. Sci. Eng. Ethics 2016, 22, 47–65. [Google Scholar] [CrossRef]
  188. Feil-Seifer, D.; Matarić, M.J. Socially Assistive Robotics. IEEE Robot. Autom. Mag. 2011, 18, 24–31. [Google Scholar] [CrossRef]
  189. Luxton, D.D. Recommendations for the Ethical Use and Design of Artificial Intelligent Care Providers. Artif. Intell. Med. 2014, 62, 1–10. [Google Scholar] [CrossRef] [PubMed]
  190. Nielsen, S.; Langensiepen, S.; Madi, M.; Elissen, M.; Stephan, A.; Meyer, G. Implementing Ethical Aspects in the Development of a Robotic System for Nursing Care: A Qualitative Approach. BMC Nurs. 2022, 21, 180. [Google Scholar] [CrossRef]
  191. Yasuhara, Y. Expectations and Ethical Dilemmas Concerning Healthcare Communication Robots in Healthcare Settings: A Nurse’s Perspective. In Information Systems-Intelligent Information Processing Systems, Natural Language Processing, Affective Computing and Artificial Intelligence, and an Attempt to Build a Conversational Nursing Robot; IntechOpen: London, UK, 2021; ISBN 978-1-83962-360-8. [Google Scholar]
  192. Chatila, R.; Havens, J.C. The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. In Robotics and Well-Being; Aldinhas Ferreira, M.I., Silva Sequeira, J., Singh Virk, G., Tokhi, M.O., Kadar, E., Eds.; Intelligent Systems, Control and Automation: Science and Engineering; Springer International Publishing: Cham, Switzerland, 2019; Volume 95, pp. 11–16. ISBN 978-3-030-12523-3. [Google Scholar]
  193. The Toronto Declaration. Available online: https://www.torontodeclaration.org/declaration-text/english/ (accessed on 26 June 2024).
  194. AI Universal Guidelines. The Public Voice. Available online: https://thepublicvoice.org/ai-universal-guidelines/ (accessed on 26 June 2024).
  195. Chakraborty, A.; Karhade, M. Global AI Governance in Healthcare: A Cross-Jurisdictional Regulatory Analysis. Available online: https://arxiv.org/abs/2406.08695v1 (accessed on 24 June 2024).
  196. Birhane, A.; Kasirzadeh, A.; Leslie, D.; Wachter, S. Science in the Age of Large Language Models. Nat. Rev. Phys. 2023, 5, 277–280. [Google Scholar] [CrossRef]
  197. Browning, J.; LeCun, Y. Language, Common Sense, and the Winograd Schema Challenge. Artif. Intell. 2023, 325, 104031. [Google Scholar] [CrossRef]
  198. Abdulsaheb, J.A.; Kadhim, D.J. Classical and Heuristic Approaches for Mobile Robot Path Planning: A Survey. Robotics 2023, 12, 93. [Google Scholar] [CrossRef]
  199. Müller, V.C. Ethics of Artificial Intelligence and Robotics. In The Stanford Encyclopedia of Philosophy; Zalta, E.N., Nodelman, U., Eds.; Metaphysics Research Lab, Stanford University: Stanford, CA, USA, 2023. [Google Scholar]
  200. Clanahan, J.M.; Awad, M.M. How Does Robotic-Assisted Surgery Change OR Safety Culture? AMA J. Ethics 2023, 25, 615–623. [Google Scholar] [CrossRef]
  201. Ullah, E.; Parwani, A.; Baig, M.M.; Singh, R. Challenges and Barriers of Using Large Language Models (LLM) Such as ChatGPT for Diagnostic Medicine with a Focus on Digital Pathology–A Recent Scoping Review. Diagn. Pathol. 2024, 19, 43. [Google Scholar] [CrossRef] [PubMed]
  202. Lown, B.A.; Rosen, J.; Marttila, J. An Agenda For Improving Compassionate Care: A Survey Shows About Half Of Patients Say Such Care Is Missing. Health Aff. 2011, 30, 1772–1778. [Google Scholar] [CrossRef]
  203. Getting Started with LLM Prompt Engineering. Microsoft Learn. Available online: https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/prompt-engineering (accessed on 28 May 2024).
  204. Wang, L.; Chen, X.; Deng, X.; Wen, H.; You, M.; Liu, W.; Li, Q.; Li, J. Prompt Engineering in Consistency and Reliability with the Evidence-Based Guideline for LLMs. Npj Digit. Med. 2024, 7, 1–9. [Google Scholar] [CrossRef] [PubMed]
  205. Marson, S.M.; Powell, R.M. Goffman and the Infantilization of Elderly Persons: A Theory in Development. J. Sociol. Soc. Welf. 2014, 41, 143–158. [Google Scholar] [CrossRef]
  206. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  207. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar]
  208. Busselle, R.; Reagan, J.; Pinkleton, B.; Jackson, K. Factors Affecting Internet Use in a Saturated-Access Population. Telemat. Inform. 1999, 16, 45–58. [Google Scholar] [CrossRef]
  209. Ali, M.R.; Lawson, C.A.; Wood, A.M.; Khunti, K. Addressing Ethnic and Global Health Inequalities in the Era of Artificial Intelligence Healthcare Models: A Call for Responsible Implementation. J. R. Soc. Med. 2023, 116, 260–262. [Google Scholar] [CrossRef] [PubMed]
  210. Prompt Engineering with Semantic Kernel. Microsoft Learn. Available online: https://learn.microsoft.com/en-us/semantic-kernel/prompts/ (accessed on 28 May 2024).
  211. Das, B.C.; Amini, M.H.; Wu, Y. Security and Privacy Challenges of Large Language Models: A Survey. arXiv 2024, arXiv:2402.00888. [Google Scholar]
  212. Mireshghallah, N.; Kim, H.; Zhou, X.; Tsvetkov, Y.; Sap, M.; Shokri, R.; Choi, Y. Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  213. OpenAI Privacy Policy. Available online: https://openai.com/policies/privacy-policy/ (accessed on 28 May 2024).
  214. OpenAI. How Your Data Is Used to Improve Model Performance. OpenAI Help Center. Available online: https://help.openai.com/en/articles/5722486-how-your-data-is-used-to-improve-model-performance (accessed on 13 May 2024).
  215. Liu, L.; Ning, L. USER-LLM: Efficient LLM Contextualization with User Embeddings. Available online: http://research.google/blog/user-llm-efficient-llm-contextualization-with-user-embeddings/ (accessed on 28 May 2024).
  216. Staab, R.; Vero, M.; Balunović, M.; Vechev, M.T. Beyond Memorization: Violating Privacy via Inference with Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  217. Chen, Y.; Lent, H.; Bjerva, J. Text Embedding Inversion Security for Multilingual Language Models. arXiv 2024, arXiv:2401.12192. [Google Scholar]
  218. Zhang, Y.; Jia, R.; Pei, H.; Wang, W.; Li, B.; Song, D. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event, 13–19 June 2020; pp. 250–258. [Google Scholar]
  219. Morris, J.X.; Zhao, W.; Chiu, J.T.; Shmatikov, V.; Rush, A.M. Language Model Inversion. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  220. Liu, Y.; Palmieri, L.; Koch, S.; Georgievski, I.; Aiello, M. DELTA: Decomposed Efficient Long-Term Robot Task Planning Using Large Language Models. arXiv 2024, arXiv:2404.03275. [Google Scholar]
  221. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  222. BTGenBot: Behavior Tree Generation for Robotic Tasks with Lightweight LLMs. Available online: https://ar5iv.labs.arxiv.org/html/2403.12761 (accessed on 14 May 2024).
  223. OpenAI Platform. Available online: https://platform.openai.com (accessed on 8 March 2024).
  224. Montreuil, V.; Clodic, A.; Ransan, M.; Alami, R. Planning Human Centered Robot Activities. In Proceedings of the 2007 IEEE International Conference on Systems, Man and Cybernetics, Montreal, QC, Canada, 7–10 October 2007; pp. 2618–2623. [Google Scholar]
  225. Son, T.C.; Pontelli, E.; Balduccini, M.; Schaub, T. Answer Set Planning: A Survey. Theory Pract. Log. Program. 2023, 23, 226–298. [Google Scholar] [CrossRef]
  226. Gebser, M.; Kaminski, R.; Kaufmann, B.; Schaub, T. Clingo = ASP + Control: Preliminary Report. arXiv 2014, arXiv:1405.3694. [Google Scholar]
  227. Wang, J.; Shi, E.; Yu, S.; Wu, Z.; Ma, C.; Dai, H.; Yang, Q.; Kang, Y.; Wu, J.; Hu, H.; et al. Prompt Engineering for Healthcare: Methodologies and Applications. arXiv 2024, arXiv:2304.14670. [Google Scholar]
  228. Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient Streaming Language Models with Attention Sinks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  229. Hooper, C.; Kim, S.; Mohammadzadeh, H.; Mahoney, M.W.; Shao, Y.S.; Keutzer, K.; Gholami, A. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. arXiv 2024, arXiv:2401.18079. [Google Scholar]
  230. Papers with Code—HellaSwag Benchmark (Sentence Completion). Available online: https://paperswithcode.com/sota/sentence-completion-on-hellaswag (accessed on 28 May 2024).
  231. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.A.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2024. [Google Scholar] [CrossRef]
  232. Office for Civil Rights (OCR). Summary of the HIPAA Privacy Rule. Available online: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (accessed on 20 May 2024).
  233. General Data Protection Regulation (GDPR)–Legal Text. Available online: https://gdpr-info.eu/ (accessed on 20 May 2024).
  234. Office of the Privacy Commissioner of Canada. PIPEDA Requirements in Brief. Available online: https://www.priv.gc.ca/en/privacy-topics/privacy-laws-in-canada/the-personal-information-protection-and-electronic-documents-act-pipeda/pipeda_brief/ (accessed on 20 May 2024).
  235. Han, X.; You, Q.; Liu, Y.; Chen, W.; Zheng, H.; Mrini, K.; Lin, X.; Wang, Y.; Zhai, B.; Yuan, J. InfiMM-Eval: Complex Open-Ended Reasoning Evaluation for Multi-Modal Large Language Models. arXiv 2023, arXiv:2311.11567. [Google Scholar]
  236. Seo, G.; Park, S.; Lee, M. How to Calculate the Life Cycle of High-Risk Medical Devices for Patient Safety. Front. Public Health 2022, 10, 989320. [Google Scholar] [CrossRef]
  237. Javaid, M.; Estivill-Castro, V. Explanations from a Robotic Partner Build Trust on the Robot’s Decisions for Collaborative Human-Humanoid Interaction. Robotics 2021, 10, 51. [Google Scholar] [CrossRef]
  238. How Should AI Systems Behave, and Who Should Decide? Available online: https://openai.com/index/how-should-ai-systems-behave/ (accessed on 24 June 2024).
  239. Altman, S.; Brockman, G.; Sutskever, I. Governance of Superintelligence. Available online: https://openai.com/index/governance-of-superintelligence/ (accessed on 24 June 2024).
  240. Leike, J.; Sutskever, I. Introducing Superalignment. Available online: https://openai.com/index/introducing-superalignment/ (accessed on 24 June 2024).
  241. Raval, V.; Shah, S. The Practical Aspect: Privacy Compliance—A Path to Increase Trust in Technology. ISACA 2020, 6, 15–19. [Google Scholar]
  242. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  243. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
  244. Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; Cai, D. PandaGPT: One Model To Instruction-Follow Them All. arXiv 2023, arXiv:2305.16355. [Google Scholar]
  245. Sunnybrook Hospital. Patient and Visitor Recording Policy. Available online: https://sunnybrook.ca/content/?page=privacy-recording-policy (accessed on 28 May 2024).
  246. Bello, S.A.; Yu, S.; Wang, C. Review: Deep Learning on 3D Point Clouds. Remote. Sens. 2020, 12, 1729. [Google Scholar] [CrossRef]
  247. Xu, R.; Wang, X.; Wang, T.; Chen, Y.; Pang, J.; Lin, D. PointLLM: Empowering Large Language Models to Understand Point Clouds. arXiv 2023, arXiv:2308.16911. [Google Scholar]
  248. Robinson, F.; Nejat, G. A Deep Learning Human Activity Recognition Framework for Socially Assistive Robots to Support Reablement of Older Adults. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: London, UK, 2023; pp. 6160–6167. [Google Scholar]
  249. Meta. Meta Llama 2. Available online: https://llama.meta.com/llama2/ (accessed on 28 May 2024).
  250. Colossal-AI. One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models: An Open-Source and Commercial-Free Domain-Specific LLM Solution. Available online: https://hpc-ai.com/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution (accessed on 28 May 2024).
  251. Zhang, Z.; Zhao, J.; Zhang, Q.; Gui, T.; Huang, X. Unveiling Linguistic Regions in Large Language Models. arXiv 2024, arXiv:2402.14700. [Google Scholar]
  252. Wen-Yi, A.; Mimno, D. Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings. arXiv 2023. [Google Scholar]
  253. Mahowald, K.; Ivanova, A.A.; Blank, I.A.; Kanwisher, N.; Tenenbaum, J.B.; Fedorenko, E. Dissociating Language and Thought in Large Language Models. Trends Cogn. Sci. 2024, 28, 517–540. [Google Scholar] [CrossRef] [PubMed]
  254. Chang, K.; Xu, S.; Wang, C.; Luo, Y.; Xiao, T.; Zhu, J. Efficient Prompting Methods for Large Language Models: A Survey. arXiv 2024, arXiv:2404.01077. [Google Scholar]
  255. OpenAI. Introducing GPTs. Available online: https://openai.com/index/introducing-gpts/ (accessed on 24 June 2024).
  256. Choi, J.; Yun, J.; Jin, K.; Kim, Y. Multi-News+: Cost-Efficient Dataset Cleansing via LLM-Based Data Annotation. arXiv 2024, arXiv:2404.09682. [Google Scholar]
  257. Ishibashi, Y.; Shimodaira, H. Knowledge Sanitization of Large Language Models. arXiv 2024, arXiv:2309.11852. [Google Scholar]
  258. Faraboschi, P.; Giles, E.; Hotard, J.; Owczarek, K.; Wheeler, A. Reducing the Barriers to Entry for Foundation Model Training. arXiv 2024, arXiv:2404.08811. [Google Scholar]
  259. Guo, M.; Wang, Y.; Yang, Q.; Li, R.; Zhao, Y.; Li, C.; Zhu, M.; Cui, Y.; Jiang, X.; Sheng, S.; et al. Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint. Interact. J. Med. Res. 2023, 12, e44310. [Google Scholar] [CrossRef]
  260. Common Crawl—Overview. Available online: https://commoncrawl.org/overview (accessed on 24 June 2024).
  261. Chaudhari, S.; Aggarwal, P.; Murahari, V.; Rajpurohit, T.; Kalyan, A.; Narasimhan, K.; Deshpande, A.; da Silva, B.C. RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs. arXiv 2024, arXiv:2404.08555. [Google Scholar]
  262. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, Proceedings of the Third Theory of Cryptography Conference (TCC 2006), New York, NY, USA, 4–7 March 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
  263. Lukas, N.; Salem, A.; Sim, R.; Tople, S.; Wutschitz, L.; Zanella-Béguelin, S. Analyzing Leakage of Personally Identifiable Information in Language Models. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), Los Alamitos, CA, USA, 22–24 May 2023; pp. 346–363. [Google Scholar]
  264. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 260–270. [Google Scholar]
  265. Yu, D.; Kairouz, P.; Oh, S.; Xu, Z. Privacy-Preserving Instructions for Aligning Large Language Models. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  266. Li, X.; Tramèr, F.; Liang, P.; Hashimoto, T. Large Language Models Can Be Strong Differentially Private Learners. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
  267. Everything You Need to Know about the “Right to Be Forgotten”. Available online: https://gdpr.eu/right-to-be-forgotten/ (accessed on 24 June 2024).
  268. Fife Health Charity. RITA (Reminiscence/Rehabilitation & Interactive Therapy Activities). Available online: https://www.nhsfife.org/fife-health-charity/what-weve-funded/rita-reminiscencerehabilitation-interactive-therapy-activities/ (accessed on 3 July 2024).
  269. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar]
  270. OpenAI. GPT-4V(ision) System Card. Available online: https://openai.com/index/gpt-4v-system-card/ (accessed on 4 July 2024).
  271. Koh, W.Q.; Felding, S.A.; Budak, K.B.; Toomey, E.; Casey, D. Barriers and Facilitators to the Implementation of Social Robots for Older Adults and People with Dementia: A Scoping Review. BMC Geriatr. 2021, 21, 351. [Google Scholar] [CrossRef]
  272. OpenAI API. Available online: https://openai.com/index/openai-api/ (accessed on 4 July 2024).
  273. YouTube Data API. Available online: https://developers.google.com/youtube/v3 (accessed on 4 July 2024).
  274. Stockton, T. Organizations Fear Ontario’s Investment to Reduce Surgical Wait Times Will Endanger Patients Because of Nursing Shortage. Capital Current. 2021. Available online: https://capitalcurrent.ca/organizations-fear-ontarios-investment-to-reduce-surgical-wait-times-will-endanger-patients-because-of-nursing-shortage/ (accessed on 4 July 2024).
  275. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; p. 340. [Google Scholar]
  276. Iovino, M.; Scukins, E.; Styrud, J.; Ögren, P.; Smith, C. A Survey of Behavior Trees in Robotics and AI. Robot. Auton. Syst. 2022, 154, 104096. [Google Scholar] [CrossRef]
  277. Cloud Computing Services. Available online: https://cloud.google.com/ (accessed on 5 July 2024).
Figure 1. LLM-embedded framework architecture for the socially assistive robot in design 1.
Figure 2. LLM-embedded framework architecture for the surgical robot in design 2.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
