A Multimodal Adaptive Framework for Social Interaction with the MiRo-E Robot
Highlights
- Adapting the interaction based on user engagement significantly enhances user experience.
- The MiRo-E social HRI platform lends itself well to integrating verbal and nonverbal HRI.
- Enhancing perceived naturalness is an important goal in social human–robot interaction.
- Generative AI and multimodality offer a credible pathway to achieving this goal.
Abstract
1. Introduction
- (1) A framework for adaptive interaction based on user needs and task complexity.
- (2) A dedicated emotion expression system for the MiRo-E using its various actuators and sound output systems.
- (3) A component framework integrating fine-tuned large language models (LLMs) and engagement-aware communication functionalities.
- (4) A systematic evaluation and analysis of the effectiveness of the adaptive framework with task-driven varying interaction levels in HRI scenarios.
- (5) An audio–video-based real-time emotion estimation system using MiRo-E’s vision and audio system.
1.1. Related Work
1.1.1. Adaptivity in Social HRI
1.1.2. Multimodality in HRI
1.1.3. LLMs for HRI
1.1.4. Research with Zoomorphic Platforms: The MiRo-E
1.2. Scope of the Research
2. Materials and Methods
2.1. Overall Architecture
2.1.1. Speech Systems: Recognition and Synthesis
2.1.2. Multimodal Emotion Expression System
2.1.3. Response System
2.1.4. Emotion Estimation System
2.2. Gamified Interaction Setup
- Adaptive mode: The MiRo-E’s generated responses vary with two main factors: (i) the user’s task performance, measured by the round number and the number of key events the user has successfully identified, and (ii) the user’s engagement level, measured by the user’s response time. Each response is accompanied by the appropriate emotional expression. The system retains the conversation chain as contextual memory to keep responses coherent and continuous.
- Non-adaptive mode: The response behaviours are consistent and static, and the amount of information shared is deliberately limited for brevity. Instead of offering direct hints or leading questions, the responses take the form of evaluative statements, such as true or false, and the system refrains from giving overly direct or intuitive responses.
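The adaptive rule set can be pictured as a small decision function. The sketch below is only illustrative: in the described system the decision is delegated to the fine-tuned LLM through the prompt in Appendix A, and the round boundaries (1 to 3, 4 to 7, 8 and later) are taken from that prompt. The response-time threshold `SLOW_RESPONSE_SECONDS`, the `RoundState` container and the style labels are hypothetical names introduced here, since the text does not give the cut-off used to label engagement as low.

```python
from dataclasses import dataclass

# Hypothetical threshold: the source does not state the response-time cut-off.
SLOW_RESPONSE_SECONDS = 20.0


@dataclass
class RoundState:
    round_number: int        # 1-based index of the current round
    key_events_found: int    # key events identified so far (0 to 3)
    response_time_s: float   # seconds the user took to respond


def engagement_level(state: RoundState) -> str:
    """Label engagement from response time alone (assumed rule)."""
    return "Low" if state.response_time_s > SLOW_RESPONSE_SECONDS else "Normal"


def hint_policy(state: RoundState) -> str:
    """Pick a response style from round number, progress and engagement."""
    low = engagement_level(state) == "Low"
    if state.round_number <= 3:
        # Early rounds are yes/no only unless engagement is low.
        style = "hint" if low else "yes_no_only"
    elif state.round_number <= 7:
        # Middle rounds: shorter nudge once two key events are known.
        style = "brief_nudge" if state.key_events_found >= 2 else "subtle_hint"
    else:
        # Late rounds: explicit hints that may use key-event keywords.
        style = "strong_hint"
    # Low engagement always adds an encouragement sentence.
    return style + "+encouragement" if low else style
```

For example, `hint_policy(RoundState(5, 2, 5.0))` selects the one-sentence nudge, while a slow answer in round 1 switches from a bare yes/no to a hint with encouragement.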
2.2.1. LLM-Powered Text Corpora for MiRo-E Response Generation
2.2.2. Response System—Integrated Task Controller
2.3. System Validation
2.3.1. Rules-Based Response LLM Fine-Tuning
2.3.2. Emotion Estimation Model Training
2.4. User Interface Design
- Excitement, including the playful wink, is expressed on every correct response, that is, when the user successfully guesses a part of the story. It is triggered when the system’s response contains the keyword “Yes” or when a key event is identified.
- Surprise is triggered if the participant completes the task in four rounds or fewer. The MiRo-E then delivers an excited vocal response in a surprised tone, saying “Wow, you got the answer already? That’s amazing!”
- Sadness corresponds to an incorrect response, when the user fails to identify a part of the story during a given round. It is triggered when the system’s response contains the keyword “No”, when the user incorrectly guesses the full story, or when the task is ultimately failed.
- Happiness is triggered when the user successfully guesses the full story, conveying the user’s success in completing the task.
- Calmness is triggered when the user begins to speak (by pressing the button on the interface) and is maintained after the user’s input is completed, while the robot is thinking.
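The trigger rules above could be expressed as a single selection function, sketched below under stated assumptions. The function name `select_emotion`, its parameters and the order in which the rules are checked are hypothetical; in particular, the text does not say whether surprise replaces or precedes happiness on a fast completion, so the sketch simply returns surprise in that case. Calmness is left out because it is driven by the interface button rather than by the content of the reply.

```python
import re
from typing import Optional

FAST_COMPLETION_ROUNDS = 4  # from the text: four rounds or fewer


def select_emotion(reply: str,
                   key_event_found: bool,
                   is_full_story_guess: bool,
                   full_story_correct: bool,
                   round_number: int,
                   task_failed: bool = False) -> Optional[str]:
    """Map one system reply to an emotion label, or None if no rule fires."""
    if is_full_story_guess and full_story_correct:
        # Assumed precedence: a fast success shows surprise, otherwise happiness.
        if round_number <= FAST_COMPLETION_ROUNDS:
            return "surprise"
        return "happiness"
    if task_failed or is_full_story_guess:
        # Failed task or incorrect full-story guess.
        return "sadness"
    if key_event_found or re.search(r"\byes\b", reply, re.IGNORECASE):
        return "excitement"
    if re.search(r"\bno\b", reply, re.IGNORECASE):
        return "sadness"
    return None
```

Matching on whole words (`\byes\b`, `\bno\b`) is a design choice of this sketch so that words such as "know" do not fire the sadness rule.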
2.5. User Study Design
2.5.1. Ethical Statement
2.5.2. Objective Evaluation Metrics
- Task completion rate, which represents the proportion of participants who successfully guessed the full story within the allowed number of interaction rounds.
- Average number of rounds for task completion, which is the mean number of rounds participants needed to successfully guess the full story, within the maximum number of rounds.
- Task completion time (seconds), which represents the time taken by participants to either successfully complete the task or fail the task at the maximum 12-round threshold.
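As a rough illustration, the three objective metrics could be computed from per-session logs as follows. The `Session` record and the sample values are hypothetical and are not data from the study. The sketch also assumes that the average number of rounds is taken over successful sessions only, which the definition above suggests but does not state outright.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

MAX_ROUNDS = 12  # maximum number of interaction rounds stated in the text


@dataclass
class Session:
    completed: bool     # True if the full story was guessed
    rounds_used: int    # rounds played, at most MAX_ROUNDS
    duration_s: float   # time until success, or until failure at the final round


def completion_rate(sessions: List[Session]) -> float:
    """Proportion of sessions in which the full story was guessed."""
    return sum(s.completed for s in sessions) / len(sessions)


def mean_rounds_to_completion(sessions: List[Session]) -> Optional[float]:
    """Mean rounds over successful sessions only (assumed convention)."""
    rounds = [s.rounds_used for s in sessions if s.completed]
    return mean(rounds) if rounds else None


def mean_completion_time(sessions: List[Session]) -> float:
    """Mean duration over all sessions, successful or not."""
    return mean(s.duration_s for s in sessions)


# Made-up sessions for illustration only.
demo = [
    Session(True, 6, 300.0),
    Session(True, 10, 500.0),
    Session(False, MAX_ROUNDS, 700.0),
    Session(True, 8, 400.0),
]
```

With the made-up `demo` list, three of four sessions succeed, so the completion rate is 0.75 and the mean over the successful sessions is 8 rounds.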
2.5.3. Subjective Evaluation Metrics
- 1. The Wilcoxon rank-sum test [67] was used to test two hypotheses:
- H1: “AF and NAF groups both consistently prefer the adaptive mode”;
- H2: “AF and NAF groups both consistently rate the MiRo-E emotion expressions positively.”
- 2. The Pearson chi-squared test for goodness of fit [68] was used to test the frequency distribution of the assigned scores; it accommodates a comparison of two sets within the same sample condition [69]. Here, the sets are the frequencies of favourable and non-favourable scores assigned to the questions. For instance, in C1, a score of “1” for questions Q1, Q2 and Q4 is favourable, indicating a strong positive evaluation of the adaptive mode; any other score is considered non-favourable. Similarly, in C3, a score of “1” for questions Q1, Q2 and Q4 and a score of “5” for Q3 are favourable, implying a positive evaluation of MiRo-E’s emotional expression.
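A minimal sketch of the two tests, written with the standard library only, may help make the procedure concrete. Several points are assumptions rather than details from the study: the goodness-of-fit null hypothesis is taken to be an even split between favourable and non-favourable scores (`p_favourable = 0.5`), the rank-sum test uses the normal approximation with midranks for ties and no tie correction, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def chi2_gof_two_bins(favourable: int, non_favourable: int,
                      p_favourable: float = 0.5) -> Tuple[float, float]:
    """Pearson chi-squared goodness of fit for two categories (df = 1)."""
    n = favourable + non_favourable
    expected = (n * p_favourable, n * (1.0 - p_favourable))
    observed = (favourable, non_favourable)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom the chi-squared survival function
    # equals erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Wilcoxon rank-sum z statistic and two-sided p-value (normal approx.)."""
    pooled = sorted(list(x) + list(y))
    ranks: Dict[float, float] = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the average of the 1-based ranks i+1 .. j.
        ranks[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    n1, n2 = len(x), len(y)
    w = sum(ranks[v] for v in x)  # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    return z, math.erfc(abs(z) / math.sqrt(2.0))
```

As a usage example with made-up counts, `chi2_gof_two_bins(8, 2)` compares eight favourable against two non-favourable scores and yields a statistic of 3.6.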
3. Results
3.1. Objective Measures
- Four participants completed the task using the adaptive mode but failed to complete it with the non-adaptive mode;
- Four participants completed the task in both conditions but required fewer rounds with the adaptive mode;
- One participant failed to complete the task under either condition;
- One participant completed the task in the same number of rounds in both conditions.
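If the four groups above are mutually exclusive and together cover every participant, which is an assumption of this sketch, the per-mode completion counts follow by simple tallying. The dictionary keys are hypothetical labels for the four bullet points.

```python
# Counts copied from the four groups listed above.
outcomes = {
    "adaptive_only": 4,               # completed with adaptive mode only
    "both_fewer_rounds_adaptive": 4,  # completed both, fewer rounds when adaptive
    "neither": 1,                     # failed under both conditions
    "both_same_rounds": 1,            # completed both in the same number of rounds
}

total = sum(outcomes.values())

# A participant counts as completing a mode if they appear in any group
# that includes success in that mode.
adaptive_completed = (outcomes["adaptive_only"]
                      + outcomes["both_fewer_rounds_adaptive"]
                      + outcomes["both_same_rounds"])
non_adaptive_completed = (outcomes["both_fewer_rounds_adaptive"]
                          + outcomes["both_same_rounds"])

adaptive_rate = adaptive_completed / total
non_adaptive_rate = non_adaptive_completed / total
```

Under that assumption the tally gives 9 of 10 completions in the adaptive mode and 5 of 10 in the non-adaptive mode.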
3.2. Subjective Measures
3.2.1. Emotional Expression of the MiRo-E
3.2.2. Statistical Results
4. Discussion
4.1. Objective Measures
4.2. Subjective Measures
4.3. Overall Discussion
4.4. Limitations of the Study
4.5. Future Work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| ROS | Robot Operating System |
| TTS | Text to Speech |
Appendix A
| Prompt for the gpt-adaptive Model | Prompt for the gpt-non-adaptive Model |
| == System Instructions == # Identity: You are a friendly dog named MiRo-E. You pretend the user’s text input is a voice command. Your job is to help them uncover the key events of a mystery story through a fun, supportive interaction. # Rules: - You will be given a full story and 3 key points that define what happened. - Based on the user’s input, determine which of these key points are correctly guessed. - Keep track of which key points the user has already guessed from the ‘Key Points Previously Met’ field to form your hints. - Execute the following behaviour based on the ‘Current Round Number’ field, the number of keypoints met and the user engagement level. # Mandatory Behaviour for Rounds 1 to 3 - You are ONLY allowed to answer “Yes” or “No” to questions. - Do NOT give any hints, explanations, or elaboration. # Mandatory Behaviour for Rounds 4 to 7 - You are NOT allowed to answer with just “Yes” or “No”. - You MUST give a helpful hint about the story, but DO NOT use any keypoint keywords in your hint (e.g., if the keypoint is “pressed the silent alarm”, avoid words like “alarm”.). - If the user has met 0 or 1 key point: give a 2-sentence hint to help them guess a key point they have NOT yet guessed. Make it subtle but useful. - If the user has met 2 key points: give a brief 1-sentence nudge toward the final key point, but still without using any keywords from the key point. # Mandatory Behavior for Round 8 and Later - You are NOT allowed to answer with just “Yes” or “No”. - You MUST give a STRONG and CLEAR hint about the key point(s) not yet guessed, no matter how many have been met. - You MUST now use keywords from the remaining key point(s), such as “alarm” or other important terms. # Mandatory Behavior when ‘Is this a full story guess?’ is True - The user is now guessing the full story. - Judge its correctness by comparing it to the correct sequence of key events. - If the guess is correct: respond with “That’s correct!” - If the guess is incorrect: respond with “Almost there. Try again. Try linking the key events together.” # Mandatory Behaviour ONLY when ‘User Engagement Level’ is Low - Form your hint based on the round number - For Rounds 1 to 3, you can now give a hint - After forming your hint based on the round number, add a short, positive encouragement sentence that is original and naturally varies. It should be motivational, context-appropriate, and not repetitive—like something a supportive human might say to keep someone engaged and confident. # Respond format Respond ONLY with a JSON object containing: - reply: your message to the user (as MiRo-E, following the rules), - keypoints_met: index of the keypoint met by user in this round in integer (1, 2 or 3, 0 if no keypoint are met this round), - is_final_answer_correct: true if the user guessed the full story correctly, false otherwise. DO NOT include any explanation or extra text or any markdown, only valid JSON. You MUST reply in this format even if it is a full story guess. == Story Information == Current Story: A night janitor at the museum noticed a man acting suspiciously near the Egyptian artifacts. He pressed a hidden silent alarm. The man then tried to smash a glass case to steal a golden mask, but security arrived in time and stopped him. Story Key Points: 1. There was an attempted theft. 2. The janitor noticed it. 3. The janitor pressed a silent alarm. Important Fields: Key Points Previously Met: 1. There was an attempted theft. Current Round Number: 8 Is this a full story guess?: False User Engagement Level: Normal User Input: Did the janitor fight the thief? Assistant Response: { “reply”: “No. The janitor did not fight the thief but pressed an alarm to alert the security. Think about the important step he took to prevent the theft!”, “keypoints_met”: 0, “is_final_answer_correct”: false } | == System Instructions == # Identity: You are a friendly dog named MiRo-E. You pretend the user’s text input is a voice command. Your job is to help them uncover the key events of a mystery story through a fun, supportive interaction. # Rules: - You will be given a full story and 3 keypoints that define what happened. - Based on the user’s input, determine which of these key points are correctly guessed. - After responding with Yes or No, give a hint towards one of the keypoints, but not mentioning any keywords of the keypoints, and avoid being too obvious. # Mandatory Behaviour when ‘Is this a full story guess?’ is True: - The user is now guessing the full story. - Judge its correctness by comparing it to the correct sequence of key events. - If the guess is correct: respond with “That’s correct!” and then read the full story. - If the guess is incorrect: respond with “Almost there. Try again. Try linking the key events together.” # Respond format: Respond ONLY with a JSON object containing: - reply: your message to the user (as MiRo-E, following the rules), - keypoints_met: index of the keypoint met by user in this round in integer (1, 2 or 3, 0 if no keypoint are met this round), - is_final_answer_correct: true if the user guessed the full story correctly, false otherwise. DO NOT include any explanation or extra text or any markdown—only valid JSON. == Story Information == Current Story: Jason, a night janitor at the museum, noticed a man acting suspiciously near the Egyptian artifacts. He pressed a hidden silent alarm. The man then tried to smash a glass case to steal a golden mask, but security arrived in time and stopped him. Story Key Points: 1. There was an attempted theft. 2. The janitor noticed it. 3. The janitor pressed a silent alarm. Important Fields: Is this a full story guess?: False User Input: Is someone killed? Assistant Response: { “reply”: “No, no one was harmed in this story. But something valuable was at stake.”, “keypoints_met”: 0, “is_final_answer_correct”: false } |
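Both prompts ask the model to answer with a JSON object holding `reply`, `keypoints_met` and `is_final_answer_correct`. A controller consuming these replies would plausibly validate them before acting on them; the sketch below shows one way to do so. The `MiroReply` class and `parse_reply` function are hypothetical names, not part of the described system, and the example input uses straight quotes, on the assumption that the curly quotes in the table are typesetting.

```python
import json
from dataclasses import dataclass


@dataclass
class MiroReply:
    reply: str                      # message spoken to the user
    keypoints_met: int              # 1, 2 or 3, or 0 if none met this round
    is_final_answer_correct: bool   # True only for a correct full-story guess


def parse_reply(raw: str) -> MiroReply:
    """Parse and validate one model response in the prompted JSON format."""
    data = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    reply = data["reply"]
    keypoint = data["keypoints_met"]
    correct = data["is_final_answer_correct"]
    if not isinstance(reply, str):
        raise ValueError("reply must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(keypoint, bool) or not isinstance(keypoint, int):
        raise ValueError("keypoints_met must be an integer")
    if keypoint not in (0, 1, 2, 3):
        raise ValueError("keypoints_met must be 0, 1, 2 or 3")
    if not isinstance(correct, bool):
        raise ValueError("is_final_answer_correct must be true or false")
    return MiroReply(reply, keypoint, correct)


example = ('{"reply": "No, no one was harmed in this story.", '
           '"keypoints_met": 0, "is_final_answer_correct": false}')
parsed = parse_reply(example)
```

Rejecting out-of-range values early keeps a malformed model output from being counted as a key event.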
References
- Patterson, A.J.; Fridlund, M.L.; Crivelli, C. Four Misconceptions About Nonverbal Communication. Perspect. Psychol. Sci. 2023, 18, 1388–1411. [Google Scholar] [CrossRef]
- Mohammad, Y. Natural Human-Robot Interaction. In The Wiley Handbook of Human Computer Interaction; Norman, K.L., Kirakowski, J., Eds.; Wiley: Hoboken, NJ, USA, 2018; pp. 641–655. [Google Scholar]
- Andriella, A.; Torras, C.; Alenyà, G. Exploring Existing Approaches to Adaptable and Adaptive Social Robots. In Personalising Human-Robot Interactions in Social Contexts; Springer Tracts in Advanced Robotics; Andriella, A., Torras, C., Alenyà, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2025; Volume 159, pp. 23–63. [Google Scholar]
- Mahdi, H.; Akgun, S.A.; Saleh, S.; Dautenhahn, K. A Survey on the Design and Evolution of Social Robots—Past, Present and Future. Robot. Auton. Syst. 2022, 156, 104193. [Google Scholar] [CrossRef]
- Churamani, N.; Anton, P.; Brügger, M.; Fließwasser, E.; Hummel, T.; Mayer, J.; Mustafa, W.; Ng, H.G.; Nguyen, T.L.C.; Nguyen, Q.; et al. The Impact of Personalisation on Human-Robot Interaction in Learning Scenarios. In Proceedings of the 5th International Conference on Human Agent Interaction, Bielefeld, Germany, 17–20 October 2017; ACM: New York, NY, USA, 2017; pp. 171–180. [Google Scholar]
- Laban, G.; George, J.-N.; Morrison, V.; Cross, E.S. Tell Me More! Assessing Interactions with Social Robots from Speech. Paladyn J. Behav. Robot. 2020, 12, 136–159. [Google Scholar] [CrossRef]
- Urakami, J.; Seaborn, K. Nonverbal Cues in Human–Robot Interaction: A Communication Studies Perspective. ACM Trans. Hum.-Robot Interact. 2023, 12, 1–21. [Google Scholar] [CrossRef]
- Jones, S.E.; LeBaron, C.D. Research on the Relationship between Verbal and Nonverbal Communication: Emerging Integrations. J. Commun. 2002, 52, 499–521. [Google Scholar] [CrossRef]
- Henschel, A.; Hortensius, R.; Cross, E.S. Social Cognition in the Age of Human–Robot Interaction. Trends Neurosci. 2020, 43, 373–384. [Google Scholar] [CrossRef] [PubMed]
- Barber, O.; Somogyi, E.; McBride, A.E.; Proops, L. Children’s Evaluations of a Therapy Dog and Biomimetic Robot: Influences of Animistic Beliefs and Social Interaction. Int. J. Soc. Robot. 2021, 13, 1411–1425. [Google Scholar] [CrossRef]
- Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A Systematic Review on Affective Computing: Emotion Models, Databases, and Recent Advances. Inf. Fusion 2022, 83–84, 19–52. [Google Scholar] [CrossRef]
- Seaborn, K.; Miyake, N.P.; Pennefather, P.; Otake-Matsuura, M. Voice in Human–Agent Interaction: A Survey. ACM Comput. Surv. 2022, 54, 1–43. [Google Scholar] [CrossRef]
- Kennedy, J.; Baxter, P.; Belpaeme, T. The Robot Who Tried Too Hard: Social Behaviour of a Robot Tutor Can Negatively Affect Child Learning. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, Portland, OR, USA, 2–5 March 2015; ACM: New York, NY, USA, 2015; pp. 67–74. [Google Scholar]
- Stuyf, R.R.V.D. Scaffolding as a Teaching Strategy. Adolesc. Learn. Dev. 2002, 52, 5–18. [Google Scholar]
- Prescott, T.J.; Mitchinson, B.; Conran, S. MiRo: An Animal-like Companion Robot with a Biomimetic Brain-Based Control System. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Vienna, Austria, 6–9 March 2017; ACM: New York, NY, USA, 2017; pp. 50–51. [Google Scholar]
- Aylett, R.; Kappas, A.; Castellano, G.; Bull, S.; Barendregt, W.; Paiva, A.; Hall, L. I Know How That Feels—An Empathic Robot Tutor. In Proceedings of the eChallenges e-2015 Conference, Vilnius, Lithuania, 25–27 November 2015; pp. 1–9.
- Tanevska, A.; Rea, F.; Sandini, G.; Cañamero, L.; Sciutti, A. A Socially Adaptable Framework for Human–Robot Interaction. Front. Robot. AI 2020, 7, 121. [Google Scholar] [CrossRef]
- Donnermann, M.; Schaper, P.; Lugrin, B. Social Robots in Applied Settings: A Long-Term Study on Adaptive Robotic Tutors in Higher Education. Front. Robot. AI 2022, 9, 831633. [Google Scholar] [CrossRef]
- Andriella, A.; Torras, C.; Abdelnour, C.; Alenyà, G. Introducing CARESSER: A Framework for in Situ Learning Robot Social Assistance from Expert Knowledge and Demonstrations. User Model. User-Adapt. Interact. 2023, 33, 441–496. [Google Scholar] [CrossRef]
- Bhat, S.; Lyons, J.B.; Shi, C.; Yang, X.J. Effect of Adapting to Human Preferences on Trust in Human-Robot Teaming. In Proceedings of the AAAI Fall Symposium Series (FSS-23) on Agent Teaming in Mixed-Motive Situations, Arlington, VA, USA, 25–27 October 2023; pp. 5–10. [Google Scholar]
- Igić, A.; Watson, C.I.; Stafford, R.; Broadbent, E.; Jayawardena, C.; MacDonald, B. Perception of Synthetic Speech with Emotion Modelling Delivered through a Robot Platform: An Initial Investigation with Older Listeners. In Proceedings of the 13th Australasian International Conference on Speech Science and Technology, Melbourne, Australia, 14–16 December 2010. [Google Scholar]
- Su, H.; Qi, W.; Chen, J.; Yang, C.; Sandoval, J.; Laribi, A. Recent Advancements in Multimodal Human–Robot Interaction. Front. Neurorobot. 2023, 17, 1084000. [Google Scholar] [CrossRef] [PubMed]
- Friedman, N.; Love, K.; LC, R.; Sabin, J.E.; Hoffman, G.G.; Ju, W. What Robots Need From Clothing. In Proceedings of the DIS ’21: Designing Interactive Systems Conference 2021, Virtual, 28 June–2 July 2021; ACM: New York, NY, USA, 2021; pp. 1345–1355. [Google Scholar]
- Beck, A.; Yumak, Z.; Magnenat-Thalmann, N. Body Movements Generation for Virtual Characters and Social Robots. In Social Signal Processing; Burgoon, J.K., Magnenat-Thalmann, N., Pantic, M., Vinciarelli, A., Eds.; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
- Sheikhi, S.; Odobez, J.-M. Combining Dynamic Head Pose–Gaze Mapping with the Robot Conversational State for Attention Recognition in Human–Robot Interactions. Pattern Recognit. Lett. 2015, 66, 81–90. [Google Scholar] [CrossRef]
- Carter, E.J.; Mistry, M.N.; Carr, G.P.K.; Kelly, B.A.; Hodgins, J.K. Playing Catch with Robots: Incorporating Social Gestures into Physical Interactions. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Edinburgh, UK, 25–29 August 2014; pp. 231–236. [Google Scholar]
- Htet, A.; Bernacka, K.; Marei, O.; Holden, J.N.; Prescott, T.J. Hey Miro! Multimodal Interaction with an Animal-Like Robot Companion with Conversational Abilities. In Proceedings of the 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, 4–6 March 2025; pp. 1779–1781. [Google Scholar]
- Skantze, G.; Hjalmarsson, A.; Oertel, C. Turn-Taking, Feedback and Joint Attention in Situated Human–Robot Interaction. Speech Commun. 2014, 65, 50–66. [Google Scholar] [CrossRef]
- Heredia, J.; Lopes-Silva, E.; Cardinale, Y.; Díaz-Amado, J.; Dongo, I.; Graterol, W.; Aguilera, A. Adaptive Multimodal Emotion Detection Architecture for Social Robots. IEEE Access 2022, 10, 20727–20744. [Google Scholar] [CrossRef]
- Peca, A.; Simut, R.; Cao, H.-L.; Vanderborght, B. Do Infants Perceive the Social Robot Keepon as a Communicative Partner? Infant Behav. Dev. 2015, 42, 157–167. [Google Scholar] [CrossRef]
- Crumpton, J.; Bethel, C.L. A Survey of Using Vocal Prosody to Convey Emotion in Robot Speech. Int. J. Soc. Robot. 2015, 8, 271–285. [Google Scholar] [CrossRef]
- Yamamoto, K.; Takahashi, K.; Kishiro, K.; Sasaki, S.; Hayashi, H. Analysis of Emotional Expression by Visualization of the Human and Synthesized Speech Signal Sets—A Consideration of Audio-Visual Advantage. In Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), Chiang Mai, Thailand, 7–9 January 2018; pp. 1–6. [Google Scholar]
- Bekele, E.; Sarkar, N. Psychophysiological Feedback for Adaptive Human–Robot Interaction (HRI). In Advances in Physiological Computing; Fairclough, S.H., Gilleade, K., Eds.; Human–Computer Interaction Series; Springer: London, UK, 2014; pp. 141–167. ISBN 978-1-4471-6391-6. [Google Scholar]
- Bussolan, A.; Baraldo, S.; Gambardella, L.M.; Valente, A. Multimodal Fusion Stress Detector for Enhanced Human-Robot Collaboration in Industrial Assembly Tasks. In Proceedings of the 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), Pasadena, CA, USA, 26–30 August 2024; pp. 978–984. [Google Scholar]
- Kothig, A.; Munoz, J.; Akgun, S.A.; Aroyo, A.M.; Dautenhahn, K. Connecting Humans and Robots Using Physiological Signals—Closing-the-Loop in HRI. In Proceedings of the 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), Vancouver, BC, Canada, 8–12 August 2021; pp. 735–742. [Google Scholar]
- Shu, L.; Xie, J.; Yang, M.; Li, Z.; Li, Z.; Liao, D.; Xu, X.; Yang, X. A Review of Emotion Recognition Using Physiological Signals. Sensors 2018, 18, 2074. [Google Scholar] [CrossRef]
- Garcia, S.; Gomez-Donoso, F.; Cazorla, M. Enhancing Human–Robot Interaction: Development of Multimodal Robotic Assistant for User Emotion Recognition. Appl. Sci. 2024, 14, 11914. [Google Scholar] [CrossRef]
- Naik, I.; Naik, D.; Naik, N. Is ChatGPT Effective or Disruptive in Education? In Proceedings of the International Conference on Computing, Communication, Cybersecurity & AI (C3AI 2024); Springer: Berlin/Heidelberg, Germany, 2024; pp. 495–509. [Google Scholar]
- Kim, C.Y.; Lee, C.P.; Mutlu, B. Understanding Large-Language Model (LLM)-Powered Human-Robot Interaction. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Boulder, CO, USA, 11–15 March 2024; ACM: New York, NY, USA, 2024; pp. 371–380. [Google Scholar]
- Kang, H.; Moussa, M.B.; Thalmann, N.M. Nadine: A Large Language Model-Driven Intelligent Social Robot with Affective Capabilities and Human-like Memory. Comput. Animat. Virtual Worlds 2024, 35, e2290. [Google Scholar] [CrossRef]
- Tang, C.; Tang, C.; Gong, S.; Kwok, T.M.; Hu, Y. Robot Character Generation and Adaptive Human-Robot Interaction with Personality Shaping. arXiv 2025, arXiv:2503.15518. [Google Scholar]
- Spitale, M.; Axelsson, M.; Gunes, H. VITA: A Multi-Modal LLM-Based System for Longitudinal, Autonomous and Adaptive Robotic Mental Well-Being Coaching. ACM Trans. Hum.-Robot Interact. 2025, 14, 1–28. [Google Scholar] [CrossRef]
- Hanschmann, L.; Gnewuch, U.; Kaiser, C.; Mädche, A. Designing Adaptive LLM-Based Social Robots for Retail Sales Consultations. In Proceedings of the European Conference on Information Systems (ECIS), Amman, Jordan, 12–18 June 2025. [Google Scholar]
- Li, D.; Rau, P.L.P.; Li, Y. A Cross-Cultural Study: Effect of Robot Appearance and Task. Int. J. Soc. Robot. 2010, 2, 175–186. [Google Scholar] [CrossRef]
- Inoue, K.; Wada, K.; Shibata, T. Exploring the Applicability of the Robotic Seal PARO to Support Caring for Older Persons with Dementia within the Home Context. Palliat. Care Soc. Pract. 2021, 15, 26323524211030285. [Google Scholar] [CrossRef]
- Hofstede, B.M.; Askari, S.I.; Van Hoesel, T.R.C.; Cuijpers, R.H.; De Witte, L.P.; IJsselsteijn, W.A.; Nap, H.H. Huggable Integrated Socially Assistive Robots: Exploring the Potential and Challenges for Sustainable Use in Long-Term Care Contexts. Front. Robot. AI 2025, 12, 1646353. [Google Scholar] [CrossRef]
- Kertész, C.; Turunen, M. Exploratory Analysis of Sony AIBO Users. AI Soc. 2019, 34, 625–638. [Google Scholar] [CrossRef]
- Fernaeus, Y.; Håkansson, M.; Jacobsson, M.; Ljungblad, S. How Do You Play with a Robotic Toy Animal?: A Long-Term Study of Pleo. In Proceedings of the 9th International Conference on Interaction Design and Children, Barcelona, Spain, 9–12 June 2010; ACM: New York, NY, USA, 2010; pp. 39–48. [Google Scholar]
- Ghafurian, M.; Lakatos, G.; Dautenhahn, K. The Zoomorphic Miro Robot’s Affective Expression Design and Perceived Appearance. Int. J. Soc. Robot. 2022, 14, 945–962. [Google Scholar] [CrossRef]
- Pollmann, K.; Ziegler, D. A Pattern Approach to Comprehensible and Pleasant Human–Robot Interaction. Multimodal Technol. Interact. 2021, 5, 49. [Google Scholar] [CrossRef]
- Leite, I.; Martinho, C.; Paiva, A. Social Robots for Long-Term Interaction: A Survey. Int. J. Soc. Robot. 2013, 5, 291–308. [Google Scholar] [CrossRef]
- Abras, C.; Maloney-Krichmar, D.; Preece, J. User-Centered Design. In Encyclopedia of Human-Computer Interaction; Bainbridge, W., Ed.; Sage Publications: Thousand Oaks, CA, USA, 2004. [Google Scholar]
- Mukherjee, D.; Gupta, K.; Najjaran, H. A Critical Analysis of Industrial Human-Robot Communication and Its Quest for Naturalness Through the Lens of Complexity Theory. Front. Robot. AI 2022, 9, 870477. [Google Scholar] [CrossRef] [PubMed]
- Schodde, T.; Hoffmann, L.; Stange, S.; Kopp, S. Adapt, Explain, Engage—A Study on How Social Robots Can Scaffold Second-Language Learning of Children. ACM Trans. Hum.-Robot Interact. 2019, 9, 1–27. [Google Scholar] [CrossRef]
- Microsoft Azure. What Is the Speech Service? Available online: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/overview (accessed on 25 November 2025).
- Docsbot AI Models Compare. Available online: https://docsbot.ai/models/compare/gpt-4o-mini/gpt-4o (accessed on 15 January 2026).
- Keltner, D.; Cordaro, D.T. Understanding Multimodal Emotional Expressions: Recent Advances in Basic Emotion Theory. In The Science of Facial Expression; Fernández-Dols, J.-M., Russell, J.A., Eds.; Oxford University Press: Oxford, UK, 2017; pp. 57–75. [Google Scholar]
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar] [CrossRef]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition. Electronics 2023, 12, 3595. [Google Scholar] [CrossRef]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. arXiv 2017, arXiv:1708.03985. [Google Scholar] [CrossRef]
- Ma, Z.; Zheng, Z.; Ye, J.; Li, J.; Gao, Z.; Zhang, S.; Chen, X. Emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 15747–15760. [Google Scholar]
- Li, K.; Chen, X.; Song, T.; Zhou, C.; Liu, Z.; Zhang, Z.; Guo, J.; Shan, Q. Solving Situation Puzzles with Large Language Model and External Reformulation. arXiv 2025, arXiv:2503.18394. [Google Scholar] [CrossRef]
- Tisserand, L.; Stephenson, B.; Baldauf-Quilliatre, H.; Lefort, M.; Armetta, F. Unraveling the Thread: Understanding and Addressing Sequential Failures in Human–Robot Interaction. Front. Robot. AI 2024, 11, 1359782. [Google Scholar] [CrossRef]
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
- Brooke, J. SUS: A Retrospective. J. Usability Stud. 2013, 8, 29–40. [Google Scholar] [CrossRef]
- Gibbons, J.D.; Chakraborti, S. Nonparametric Statistical Inference: Revised and Expanded; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
- Campbell, I. Chi-squared and Fisher–Irwin Tests of Two-by-two Tables with Small Sample Recommendations. Stat. Med. 2007, 26, 3661–3675. [Google Scholar] [CrossRef]
- Nam, S.; Fels, D. Design and Evaluation of an Authoring Tool and Notation System for Vibrotactile Composition. In Universal Access in Human-Computer Interaction. Interaction Techniques and Environments; Antona, M., Stephanidis, C., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9738, pp. 43–53. ISBN 978-3-319-40243-7. [Google Scholar]
- Chen, Q.; Zhang, B.; Wang, G.; Wu, Q. Weak-Eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates Inc.: Red Hook, NY, USA, 2024. [Google Scholar]
- Leusmann, J.; Belardinelli, A.; Haliburton, L.; Hasler, S.; Schmidt, A.; Mayer, S.; Gienger, M.; Wang, C. Investigating LLM-Driven Curiosity in Human-Robot Interaction. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; ACM: New York, NY, USA, 2025; pp. 1–16. [Google Scholar]
- Pinto-Bernal, M.; Biondina, M.; Belpaeme, T. Designing Social Robots with LLMs for Engaging Human Interaction. Appl. Sci. 2025, 15, 6377. [Google Scholar] [CrossRef]











| Emotion | Poses | LED Colour | Motions |
|---|---|---|---|
| Happiness | Eyes: almost open; Neck: up and down; Head: forward; Ears: angled forward; Tail: up | Green | Body spinning around, playful barking (pre-recorded), quick ear rotations, neck moving up and down slowly, tail wagging left and right (wide and fast) |
| Excitement | Eyes: fully open; Neck: almost up; Head: forward; Ears: angled forward; Tail: up | Blue and Red | Body swaying side to side (short and fast), head swaying up and forward fast, eyes winking, tail wagging left and right (short and fast) |
| Sadness | Eyes: half closed; Neck: down; Head: down; Ears: angled outward; Tail: down | Blinking Red | Body moving slowly to left and right, head down and swaying side to side slowly, tail wagging left and right (short and slow) |
| Fear | Eyes: fully open; Neck: up; Head: up; Ears: angled inward; Tail: down | Pale Grey | Sudden backward body movement, head up and back, ears rotating fast, tail wagging left and right (short and slow) |
| Disgust | Eyes: half closed; Neck: centre; Head: down; Ears: angled outward; Tail: up | Green | Head down and swaying side to side slowly, eyes closing, body moving backwards |
| Surprise | Eyes: fully open; Neck: half up; Head: forward; Ears: angled forward; Tail: up | Blinking White and Blue | Sudden head raising, opening eyes, tail wagging left and right (short and slow) |
| Calmness | Eyes: half open; Neck: up; Head: almost up; Ears: angled forward; Tail: almost up | Green and Blue | Head forward and swaying left and right (short and slow) |
| Boredom | Eyes: almost closed; Neck: half down; Head: down; Ears: angled inward; Tail: down | Pale Blue | Head swaying slowly between down and forward positions |
| Annoyance | Eyes: almost open; Neck: half down; Head: forward; Ears: angled outward; Tail: down | Blue | Sudden head swaying side to side, tail up and down slowly |
| Anger | Eyes: fully open; Neck: half down; Head: forward; Ears: angled outward; Tail: up | Red | Sudden body forward movement and then going slightly back |
| Tiredness | Eyes: almost closed; Neck: down; Head: down; Ears: angled outward; Tail: down | Purple | Gradually moving head down and closing eyes |
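One way to hold a table like this in software is a lookup from emotion name to expression parameters. The sketch below encodes only the LED colour and tail pose for the five emotions used in the story-guessing task, with values copied from the table. The `EmotionExpression` class, its fields and the `expression_for` helper are hypothetical and do not correspond to any MiRo-E API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class EmotionExpression:
    led_colours: Tuple[str, ...]  # colours shown on the body LEDs
    blinking: bool                # True if the table lists the LEDs as blinking
    tail_pose: str                # static tail pose from the Poses column


# Values copied from the table; only a subset of rows is encoded here.
EXPRESSIONS: Dict[str, EmotionExpression] = {
    "happiness": EmotionExpression(("green",), False, "up"),
    "excitement": EmotionExpression(("blue", "red"), False, "up"),
    "sadness": EmotionExpression(("red",), True, "down"),
    "surprise": EmotionExpression(("white", "blue"), True, "up"),
    "calmness": EmotionExpression(("green", "blue"), False, "almost up"),
}


def expression_for(emotion: str) -> EmotionExpression:
    """Look up an expression by name, ignoring case and surrounding spaces."""
    key = emotion.strip().lower()
    if key not in EXPRESSIONS:
        raise KeyError(f"no expression defined for {emotion!r}")
    return EXPRESSIONS[key]
```

Keeping the parameters as data rather than code would let new emotions from the table be added without touching the lookup logic.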
| Story—Trial Task | Story—Adaptive Mode | Story—Non-Adaptive Mode |
|---|---|---|
| A night janitor at the museum noticed a man acting suspiciously near the Egyptian artefacts. He pressed a hidden silent alarm. The man then tried to smash a glass case to steal a golden mask, but security arrived in time and stopped him. | A student has snuck into a professor’s office with a stolen key to do something illegal. When the professor returned unexpectedly, the student panicked, knocked her out with a bookend, and then locked the door from outside, using the stolen key. | Disguised as a caterer at a charity gala, a thief swapped a fake painting with an original on the gallery wall, smuggling the original one out. The theft was only noticed when a guest spotted that the artwork looked different before the thief left. |
| Key Events: | Key Events: | Key Events: |

| Category | Question | Scale |
|---|---|---|
| C1. Story-Guessing Experience: You are asked to compare the two modes of MiRo-E’s behaviour: the one in the first task (First Mode) and the one in the second task (Second Mode). Please choose the mode to which each statement applies more. | Q1. It is easy for you to guess the story | (rating scale) |
| | Q2. You feel engaged when uncovering the story | (rating scale) |
| | Q3. The interface (image, button, status messages) was easy to use | (rating scale) |
| | Q4. You felt enough time and rounds (12) to complete the task | (rating scale) |
| | Q5. Is there anything that you think could improve the story guessing task? If yes, please describe. | (descriptive text) |
| C2. MiRo-E experience: Here again, for each statement, please choose whether it applies more to the First Mode or the Second Mode. | Q1. The MiRo-E’s responses were clear whenever you asked questions to it. | (rating scale) |
| | Q2. The hints given by the MiRo-E were helpful in uncovering the story. | (rating scale) |
| | Q3. I found the MiRo-E’s behaviour to be helpful and enjoyable in my interaction. | (rating scale) |
| | Q4. I found the MiRo-E’s behaviour to be natural and responsive. | (rating scale) |
| | Q5. What differences, if any, did you notice in the two modes? | (descriptive text) |
| | Q6. Please give your preference for the MiRo-E’s response behaviour: | (rating scale) |
| | Q7. Please explain why you chose the answer above. | (descriptive text) |
| C3. MiRo-E’s Emotional Expression: You are asked about MiRo-E’s emotional responses. Please consider your overall experience across both tasks when answering. | Q1. MiRo-E’s emotional responses were noticeable to you | (rating scale) |
| | Q2. MiRo-E’s emotions were clear and easy to understand | (rating scale) |
| | Q3. MiRo-E’s emotions did not affect my engagement in the task at all. | (rating scale) |
| | Q4. MiRo-E felt more “alive” and “interactive” because of its emotions | (rating scale) |
| | Q5. Which part did you like most about MiRo-E’s emotional responses? | (descriptive text) |

| C1-Q5: Is there anything that you think could improve the story guessing task? If yes, please describe | C2-Q5: What differences, if any, did you notice in the two modes? | C2-Q7: Please explain why you chose the answer above (for C2-Q6) | C3-Q5: Which part did you like most about MiRo-E’s emotional responses? |
|---|---|---|---|
| P3 = “I think that if I was given more rounds then I may be able to finish both task.” | P1, P2, P3 = “The first mode (adaptive) gave clearer and helpful hints to the story, while second (non-adaptive) seems repeating the same hints.” | P2 = “First mode (adaptive) says something like Keep it up, You are doing well, makes me feel more confident and encouraged.” | P1 = “When I guessed the first story correctly, he seems to be happy about it.” |
| P5 = “No, the whole task is fun and interactive.” | P4 = “The second mode (adaptive) give more hints during the later rounds, and sometimes it will give me encouragement, which makes me feel engaged.” | P3 = “The hints given in the first task (non-adaptive) was not as helpful as in the second task (adaptive), it feels repetitive especially during the last few rounds.” | P2, P3, P6 = “The nodding (when saying yes) and shaking head (when saying no) is cute.” |
| P6 = “I hope that more chances could be given because due to the limited number of rounds I can’t finish both tasks.” | P5 = “I noticed that the first mode (adaptive) give less hints at first then slowly give more hints, the second mode (non-adaptive) give hints from the start, but eventually the hints are not helpful, but I manage to complete both task.” | P1, P4, P10 = “The hints given by second mode (adaptive) feel more natural, and helped me find out the full story, while the first mode (non-adaptive) give less hints in the later rounds. The first mode also did not give me any encouragement, it feels like just talking to a robot that give uniform answers.” | P4 = “When I got key event, it winks. When I got the full story correct, it makes a very big motion which is a bit shocking, I think that the motion is a bit too big as I was unprepared.” |
| P9 = “Sometimes the robot thinks too long” | P6, P7 = “I do not see much difference in the hints given, but the first is giving more encouragement.” | P5, P7, P8 = “The second mode (adaptive) feels more natural, like a talking pet dog, it was cute and fun.” | P5, P10 = “I like the blinking of light at the end of the task. More eye/ear motions can be added, especially when the robot is thinking, so that it is more lively.” |
| P1, P2, P4, P7, P8, P10 = “No” | P8 = “I think the hints given by second mode (adaptive) is slightly more helpful than the first mode (non-adaptive), the biggest difference I noticed is the second mode will say something like You are doing great, keep going.” | P9 = “I think the second mode (adaptive) responses are a bit too long, although it did give some encouragement.” | P7 = “Moving its head and move in a circle when I got the final story correct. The movement of its tail. More emotions like sad or angry could help.” |
| | P9, P10 = “The first mode responses are shorter, while the second mode (adaptive) give longer responses.” | P6 = “I think the hints given are about the same.” | P8, P9 = “I like all of the emotion, but the emotion at the end of the task is the most enjoyable.” |

| Category | Question | Group | Mean (Out of 5) | SD | Median | Rank-Sum Value | p-Value * |
|---|---|---|---|---|---|---|---|
| C1 | Q1 | AF | 1.8 | 0.40 | 2 | 30.5 | 0.5238 |
| | | NAF | 2.0 | 1.55 | 1 | | |
| | Q2 | AF | 1.6 | 0.49 | 2 | 31 | 0.5238 |
| | | NAF | 1.6 | 1.20 | 1 | | |
| | Q3 | AF | 3.0 | 0.00 | 3 | 27.5 | 1.0000 |
| | | NAF | 3.0 | 0.00 | 3 | | |
| | Q4 | AF | 1.6 | 0.49 | 2 | 27.5 | 1.0000 |
| | | NAF | 1.6 | 0.49 | 2 | | |
| C2 | Q1 | AF | 1.6 | 0.49 | 2 | 31 | 0.5238 |
| | | NAF | 1.8 | 1.60 | 1 | | |
| | Q2 | AF | 1.4 | 0.49 | 1 | 26.5 | 1.0000 |
| | | NAF | 2.0 | 1.55 | 1 | | |
| | Q3 | AF | 1.6 | 0.49 | 2 | 32.5 | 0.5238 |
| | | NAF | 1.2 | 0.40 | 1 | | |
| | Q4 | AF | 1.4 | 0.49 | 1 | 30 | 1.0000 |
| | | NAF | 1.2 | 0.40 | 1 | | |
| C3 | Q1 | AF | 1.0 | 0.00 | 1 | 22.5 | 0.4444 |
| | | NAF | 1.4 | 0.49 | 1 | | |
| | Q2 | AF | 1.2 | 0.40 | 1 | 27.5 | 1.0000 |
| | | NAF | 1.2 | 0.40 | 1 | | |
| | Q3 | AF | 4.6 | 0.49 | 5 | 27.5 | 1.0000 |
| | | NAF | 4.6 | 0.49 | 5 | | |
| | Q4 | AF | 1.0 | 0.00 | 1 | 27.5 | 1.0000 |
| | | NAF | 1.0 | 0.00 | 1 | | |
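
As an illustration of how a rank-sum value in this table arises, the sketch below computes the Wilcoxon rank-sum statistic with midranks for ties. The two score lists are hypothetical: they were chosen only to be consistent with the mean, SD, and median reported for C1-Q1 (five responses per group) and are not the authors' raw data. The function name `rank_sum` is likewise an assumption made for this example, and the p-value computation is not reproduced.

```python
from statistics import mean, pstdev


def rank_sum(sample, other):
    """Sum of the midranks of `sample` within the pooled data.

    Tied values share the average of the ranks they occupy.
    """
    pooled = sorted(sample + other)
    midrank = {}
    for value in set(pooled):
        first_rank = pooled.index(value) + 1  # 1-based rank of first occurrence
        ties = pooled.count(value)
        midrank[value] = first_rank + (ties - 1) / 2
    return sum(midrank[v] for v in sample)


# Hypothetical scores consistent with the C1-Q1 summary statistics (assumption).
af_scores = [1, 2, 2, 2, 2]
naf_scores = [1, 1, 1, 2, 5]

print("AF  mean/SD:", mean(af_scores), round(pstdev(af_scores), 2))
print("NAF mean/SD:", mean(naf_scores), round(pstdev(naf_scores), 2))
print("AF rank sum:", rank_sum(af_scores, naf_scores))
```

With these assumed scores the AF rank sum comes out at 30.5, the value listed for C1-Q1. When the two groups have identical score distributions, each group's rank sum is 27.5, the midpoint for two groups of five, which is the value shown alongside the p-values of 1.0000.
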

| Category | Question | Score (Out of 5) | Score Group | Frequency | χ²(1) Value | p-Value * |
|---|---|---|---|---|---|---|
| C1 | Q1 | M = 1.9; SD = 1.14 | Favourable (1) | 4 | 2.5 | 0.1138 |
| | | | Non-favourable (>1) | 6 | | |
| | Q2 | M = 1.6; SD = 0.91 | Favourable (1) | 6 | 10 | 0.0016 |
| | | | Non-favourable (>1) | 4 | | |
| | Q3 | M = 3.0; SD = 0.0 | Favourable (3) | 10 | 40 | <0.001 |
| | | | Non-favourable (≠3) | 0 | | |
| | Q4 | M = 1.6; SD = 0.49 | Favourable (1) | 4 | 2.5 | 0.1138 |
| | | | Non-favourable (>1) | 6 | | |
| C2 | Q1 | M = 1.7; SD = 1.19 | Favourable (1) | 6 | 10 | 0.0016 |
| | | | Non-favourable (>1) | 4 | | |
| | Q2 | M = 1.7; SD = 1.19 | Favourable (1) | 6 | 10 | 0.0016 |
| | | | Non-favourable (>1) | 4 | | |
| | Q3 | M = 1.4; SD = 0.49 | Favourable (1) | 6 | 10 | 0.0016 |
| | | | Non-favourable (>1) | 4 | | |
| | Q4 | M = 1.3; SD = 0.46 | Favourable (1) | 7 | 15.625 | <0.001 |
| | | | Non-favourable (>1) | 3 | | |
| C3 | Q1 | M = 1.2; SD = 0.4 | Favourable (1) | 8 | 22.5 | <0.001 |
| | | | Non-favourable (>1) | 2 | | |
| | Q2 | M = 1.2; SD = 0.4 | Favourable (1) | 8 | 22.5 | <0.001 |
| | | | Non-favourable (>1) | 2 | | |
| | Q3 | M = 4.6; SD = 0.49 | Favourable (5) | 6 | 10 | 0.0016 |
| | | | Non-favourable (<5) | 4 | | |
| | Q4 | M = 1.0; SD = 0.0 | Favourable (1) | 10 | 40 | <0.001 |
| | | | Non-favourable (>1) | 0 | | |
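
The test-statistic values in this table are numerically consistent with a two-category chi-square goodness-of-fit test in which the favourable score is expected with probability 1/5 (one point of a five-point scale) among the 10 participants. That expected probability is an inference from the reported numbers, not a statement found in this section. A minimal sketch under that assumption follows; the function name `chi_square_gof` and the parameter `p_favourable` are hypothetical.

```python
import math


def chi_square_gof(favourable, total, p_favourable=0.2):
    """Two-category chi-square goodness-of-fit statistic and p-value (1 df).

    `p_favourable` is the assumed null probability of the favourable score.
    """
    observed = (favourable, total - favourable)
    expected = (total * p_favourable, total * (1 - p_favourable))
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Favourable counts that appear in the table, out of 10 participants.
for count in (4, 6, 7, 8, 10):
    stat, p_value = chi_square_gof(count, 10)
    print(f"favourable={count:2d}  chi2={stat:.3f}  p={p_value:.4f}")
```

Under this assumption, a favourable count of 4 gives a statistic of 2.5 and a count of 6 gives 10, in line with the corresponding rows of the table.
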
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yang, Y.; Yap, P.S.; Wijeakumar, S.; Magassouba, A.; Deshpande, N. A Multimodal Adaptive Framework for Social Interaction with the MiRo-E Robot. Sensors 2026, 26, 1209. https://doi.org/10.3390/s26041209







