Embodied Co-Creation with Real-Time Generative AI: An Ukiyo-E Interactive Art Installation
Abstract
1. Introduction
2. Background and Related Work
2.1. Embodied Interaction and Tangible User Interfaces
2.2. Context-Aware Multimedia and Immersive Media Systems
2.3. Human-AI Co-Creation and Creative Interfaces
2.4. Situated Cognition and Affect in Interaction
3. Methodology
3.1. System Description
3.2. Input Modalities
3.2.1. Gesture and Pose Detection
3.2.2. Object-Based Input
3.3. Real-Time Adaptation Mechanism
3.3.1. Camera-Based Context Sensing
3.3.2. AI Generation Loop with Style Conditioning
3.3.3. Output Display and User Feedback Loop
3.4. Context-Aware Design Features
3.4.1. Immediate Visual Feedback
3.4.2. Adaptive Scene Generation Based on Social/Multi-User Input
4. Observational Study Design and Mixed-Methods Analysis
4.1. Observational Ethnographic Protocol
4.1.1. Ethics and Limitations
4.1.2. Participants
4.1.3. Data Collection
4.1.4. System Specifications
4.1.5. Analytical Framework
4.1.6. Thematic Coding
4.1.7. Interaction Logging with Time-Stamped Behavioral Annotations
4.2. User Agency and Co-Creative Engagement Quantitative Analysis
4.2.1. Framework for Assessing User Agency and Co-Creative Engagement
4.2.2. Weighting Profiles and Weighted Engagement Score (WES)
5. Results
5.1. Qualitative Analysis Results
5.1.1. Modalities and Emergent Patterns of Embodied Interaction
- Dominance and Intuitiveness of Tangible Object Manipulation.
- The most consistent and impactful user interactions were mediated through tangible objects. Participants rapidly grasped that presenting physical objects to the camera would influence the generative system’s output in real time.
- Color Mapping: Object color had a direct and often immediate influence on the overall palette of the Ukiyo-e scene. Red sheets frequently triggered sunset-toned skies and reddish landscapes. Blue objects, especially the shiny blue sheet, evoked cool-toned skies and water bodies.
- Thematic Object Association: Specific objects consistently introduced thematic elements. For example, the pink rose reliably triggered cherry blossoms or floral motifs.
- Material Properties: Beyond color, material properties such as shine or texture affected the output. The shiny blue sheet often yielded more reflective water, suggesting the AI model recognized surface gloss.
- Object Hierarchy and Specificity: Some objects acted as dominant “scene keys”: notably, patterned pillows like the polka dot design regularly produced checkerboard-textured terrain. These objects could override prior inputs and establish strong stylistic themes.
- Bodily Pose and Gesture as Significant Modifiers.
- Presence and General Appearance: Simply entering the interaction zone activated the system and changed the visual state. Clothing color also subtly influenced the visual tone.
- Compositional Poses: Outstretched arms led to more expansive landscape views, while vertical gestures (e.g., arms raised or holding props high) shifted focus to clouds or mountain peaks.
- Framing Gestures: A recurring gesture was using hands to form a frame or diamond shape, often resulting in scenes centered around Mt. Fuji or moon elements.
- Pointing, Gaze, and Proximity: Though pointing was less consistent in its effect, looking upward or moving objects close to the camera often intensified or magnified specific elements.
- The Crucial Role of the Real-Time Feedback Loop.
- Discovering Correlations: Immediate output changes helped users intuitively grasp the input-output relationship.
- Rapid Iteration and Sustained Engagement: Participants engaged in fluid, trial-and-error interactions. The absence of “undo” features encouraged playful experimentation, with users adjusting objects or poses to shape the scene iteratively in real time.
5.1.2. Dynamics of Co-Creation and Co-Design
- User-AI Co-Creation.
- Learning the AI’s Language: Users progressively developed internal models of how the system responded. Repeated use of particular objects (e.g., the pink rose or red sheet) illustrated a learned mapping.
- Negotiated Agency: Users directed scene themes through inputs, but the AI maintained stylistic control over composition and blending. Blended outputs, in which user-chosen themes merged with AI-determined composition, demonstrated an emergent, shared authorship.
- Overriding Influence: Some objects exerted more control than others. Patterned pillows often overrode or reoriented the entire scene despite the previous input context.
- Influence of Co-Present Social Interaction.
- Collaborative Exploration: Users often engaged collaboratively, suggesting inputs to one another, reacting collectively, and co-managing object placement.
- Turn-Taking and Object Swapping: Passing and swapping objects (especially pillows) was common and often served as an informal handoff of creative control.
- Verbal and Non-Verbal Communication: Although speech was not directly recorded, visible gestures and facial expressions indicated that users engaged in active dialog and feedback loops during co-creation.
- Observational Learning: Users frequently mimicked successful strategies observed in others’ interactions, whether in their group or prior participants.
- Photo-Taking as Social Capture: Photo-taking occurred 14 times across the logged video segments. Often, sociable participants took photos of each other or the collaborative artwork, indicating a sense of shared achievement.
5.1.3. Cognitive and Affective Engagement
- User Agency and Sense of Control: The direct and immediate impact of physical actions on the generated Ukiyo-e scene fostered a strong sense of user agency. Participants could see their inputs shaping the artwork, leading to feelings of control and authorship, even if the AI contributed significant stylistic interpretation. The ability to quickly change the scene by introducing new objects or poses reinforced this sense.
- Mental Model Formation: Users rapidly developed mental models of the system’s logic, associating specific objects or actions with predictable (or at least categorizable) outputs. This was evident in their targeted reuse of certain objects to achieve particular effects (e.g., consistently using the “polka dot pillow” for the checkerboard landscape). The learning curve appeared relatively shallow for core object-based interactions.
- Reduced Cognitive Load: Compared to complex software with numerous menus and parameters, the embodied interface seemed to impose a lower cognitive load for basic creative control. Users could leverage their intuitive understanding of physical objects and space rather than needing to learn abstract symbolic commands.
- Positive Affective Engagement: A striking and consistent finding was the high positive affect exhibited by participants. “Smiling/Happy” was the most frequently observed emotion (noted in over 100 interaction rows). Laughter, expressions of delight, and engaged discussion were common, particularly during collaborative interactions or when the AI produced surprising or aesthetically pleasing results. These observations suggest that the experience was intrinsically motivating and enjoyable.
- Reduced Fear of Failure/Playful Experimentation: The ease of changing the scene, together with the absence of the “undo” buttons standard in digital software, did not hinder exploration; instead, it encouraged a more playful “what if?” approach. If the output was not desired, users introduced a new object or changed their pose, iteratively sculpting the scene without apparent frustration over “mistakes.”
5.2. Quantitative Analysis Results
6. Discussion
7. Implications, Future Work and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Table: Key background domains and their role in this study (Section 2).

| Key Domains | Role in This Study |
|---|---|
| Embodied interaction & TUIs | Treats the body and props as the primary control surface for the generative AI. |
| Context-aware sensing and conditional diffusion | Enables real-time pose, prop, and gesture sensing to control a generative model. |
| Human–AI co-creation | Provides a framework for mixed-initiative, shared agency, and embodied co-creation. |
| Situated cognition & affect | Emphasizes the importance of immediacy, flow, and legibility in embodied systems. |

Table: Fields recorded in the time-stamped interaction log (Section 4.1.7).

| Log Field | Description |
|---|---|
| Document | An identifier for the distinct video segment/interaction block. |
| Timestamp | Time within the segment. |
| Participants | Anonymized identifier(s) for active participants, including descriptive notes (e.g., clothing, accessories) and group indicators. |
| Action | A description of the primary action performed (e.g., “Enter scene,” “Holds object,” “Poses”). |
| Object Used | Specific tangible object(s) actively used. |
| Pose/Movement Description | Details of the bodily pose or movement. |
| Observed Emotion | Inferred emotional state based on facial expressions and body language (e.g., “Smiling/Happy,” “Neutral,” “Curious”). |
| Photo Taken | Binary indicator if a participant used their own device to photograph the screen. |
| Ukiyo-e Output Description | A qualitative description of the generated Ukiyo-e scene. |
| Notes/Correlation | Interpretive notes on the perceived relationship between user input and AI output, and other relevant observations. |
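
For concreteness, the sketch below shows how one such time-stamped log row might be represented in Python. The field names mirror the table above, but the concrete types and the example values are illustrative assumptions, not the study’s actual schema.

```python
# Minimal sketch of one interaction-log row; field names follow the
# table above, while the types and sample values are assumptions.
from dataclasses import dataclass

@dataclass
class LogRow:
    document: str            # identifier of the video segment / interaction block
    timestamp: str           # time within the segment, e.g., "03:12"
    participants: str        # anonymized participant identifier(s) and notes
    action: str              # primary action, e.g., "Holds object"
    object_used: str         # tangible object(s) actively used; "" if none
    pose_movement: str       # description of the bodily pose or movement
    observed_emotion: str    # inferred state, e.g., "Smiling/Happy"
    photo_taken: bool        # participant photographed the screen
    output_description: str  # qualitative description of the Ukiyo-e scene
    notes: str               # interpretive input-output correlation notes

# A hypothetical row, for illustration only.
row = LogRow("Doc-01", "03:12", "P1 (red jacket)", "Holds object",
             "pink rose", "Arms extended toward camera", "Smiling/Happy",
             False, "Cherry blossoms over a river", "Rose -> floral motif")
```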

Table: Indicators used to assess user agency and co-creative engagement (Section 4.2.1).

| Indicator | Description |
|---|---|
| DIU | Diversity of Inputs Used: raw count of unique tangible objects (e.g., “polka-dot pillow”, “red sheet”) recorded in the “Object Used” column per participant within a session. |
| IC | Intentionality and Control: heuristic points granted for explicit co-design or successful complex compositional blending. |
| IRE | Iterative Refinement and Exploration: number of iterative actions (total actions − 1). |
| ESR | Exploitation of the System’s Expressive Range: count of unique Ukiyo-e output features triggered. |
| UAE | User Affect and Engagement (proxy): points assigned for observed smiles/happiness and for photos taken. |
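
As an illustration of how the log-derived indicators could be computed per participant session, the following minimal sketch operates on rows represented as dictionaries keyed like the log fields above. The concrete scoring rules shown here are assumptions: ESR is approximated by counting unique output descriptions, UAE counts one point per observed smile and per photo taken, and IC remains a coder-assigned heuristic passed in directly.

```python
def compute_indicators(rows, ic_points=0):
    """Derive DIU, IC, IRE, ESR, and UAE for one participant session."""
    objects = {r.get("object_used") for r in rows if r.get("object_used")}
    features = {r.get("output_description") for r in rows
                if r.get("output_description")}
    smiles = sum(1 for r in rows if r.get("observed_emotion") == "Smiling/Happy")
    photos = sum(1 for r in rows if r.get("photo_taken"))
    return {
        "DIU": len(objects),           # unique tangible objects used
        "IC": ic_points,               # coder-assigned heuristic points
        "IRE": max(len(rows) - 1, 0),  # iterative actions: total actions - 1
        "ESR": len(features),          # unique output features (approximated
                                       # here by unique scene descriptions)
        "UAE": smiles + photos,        # affect proxy: smiles + photos taken
    }

# Hypothetical two-action session:
session = [
    {"object_used": "pink rose", "observed_emotion": "Smiling/Happy",
     "photo_taken": False, "output_description": "Cherry blossoms"},
    {"object_used": "red sheet", "observed_emotion": "Smiling/Happy",
     "photo_taken": True, "output_description": "Sunset-toned sky"},
]
print(compute_indicators(session, ic_points=2))
# -> {'DIU': 2, 'IC': 2, 'IRE': 1, 'ESR': 2, 'UAE': 3}
```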

Table: Weighting profiles for the Weighted Engagement Score (Section 4.2.2).

| Weighting Profile | w_diu | w_ic | w_ire | w_esr | w_uae | Description |
|---|---|---|---|---|---|---|
| Balanced | 0.20 | 0.25 | 0.20 | 0.25 | 0.10 | Even importance across exploration, control, and outcome. |
| Emphasis on Diversity and Exploration | 0.35 | 0.15 | 0.25 | 0.15 | 0.10 | Prioritizes breadth of input and iterative interaction. |
| Emphasis on Intentional Control | 0.15 | 0.40 | 0.15 | 0.20 | 0.10 | Prioritizes achievement of specific, controlled outcomes. |
| Emphasis on Output Variation | 0.15 | 0.20 | 0.15 | 0.40 | 0.10 | Prioritizes generation of a diverse range of artistic styles/elements. |
| Emphasis on Affective Engagement | 0.15 | 0.15 | 0.15 | 0.15 | 0.40 | Prioritizes user enjoyment and valuation of the co-created artifact. |
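
To make the aggregation explicit, the sketch below computes the Weighted Engagement Score (WES) as the weighted sum of the five indicators under each profile in the table above. Min-max normalizing each indicator across participants before weighting is an assumption introduced here so that indicators on different scales are comparable; it is not necessarily the study’s exact procedure.

```python
# Profile weights are copied from the table above; the normalization
# step is an added assumption, not the study's documented procedure.
KEYS = ("DIU", "IC", "IRE", "ESR", "UAE")

PROFILES = {
    "Balanced":              (0.20, 0.25, 0.20, 0.25, 0.10),
    "Diversity/Exploration": (0.35, 0.15, 0.25, 0.15, 0.10),
    "Intentional Control":   (0.15, 0.40, 0.15, 0.20, 0.10),
    "Output Variation":      (0.15, 0.20, 0.15, 0.40, 0.10),
    "Affective Engagement":  (0.15, 0.15, 0.15, 0.15, 0.40),
}

def min_max_normalize(indicators_by_participant):
    """Rescale each indicator to [0, 1] across all participants (assumed step)."""
    normed = {pid: {} for pid in indicators_by_participant}
    for key in KEYS:
        values = [ind[key] for ind in indicators_by_participant.values()]
        lo, hi = min(values), max(values)
        for pid, ind in indicators_by_participant.items():
            normed[pid][key] = (ind[key] - lo) / (hi - lo) if hi > lo else 0.0
    return normed

def wes(normed_indicators, profile="Balanced"):
    """Weighted Engagement Score: weighted sum of the normalized indicators."""
    return sum(w * normed_indicators[k] for w, k in zip(PROFILES[profile], KEYS))
```

Under this sketch, the same participant can rank differently depending on the profile applied, which is what makes the profiles useful as alternative analytic lenses on the same log data.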
