Article

Coordination of Speaking Opportunities in Virtual Reality: Analyzing Interaction Dynamics and Context-Aware Strategies

1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan
2 Faculty of Arts and Science, Kyushu University, Fukuoka 819-0395, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(24), 12071; https://doi.org/10.3390/app142412071
Submission received: 22 November 2024 / Revised: 18 December 2024 / Accepted: 19 December 2024 / Published: 23 December 2024

Abstract

This study explores the factors influencing turn-taking coordination in virtual reality (VR) environments, with a focus on identifying key interaction dynamics that affect the ease of gaining speaking opportunities. By analyzing VR interaction data through logistic regression and clustering, we identify significant variables impacting turn-taking success and categorize typical interaction states that present unique coordination challenges. The findings reveal that features related to interaction proactivity, individual status, and communication quality significantly impact turn-taking outcomes. Furthermore, clustering analysis identifies five primary interaction contexts: high competition, intense interaction, prolonged single turn, high-status role, and low activity, each with unique turn-taking coordination requirements. This work provides insights into enhancing turn-taking support systems in VR, emphasizing contextually adaptive feedback to reduce speaking overlap and turn-taking failures, thereby improving overall interaction flow in immersive environments.

1. Introduction

Virtual reality (VR) technology has seen widespread application in various fields such as communication, tourism, education, and entertainment as it continues to evolve [1,2]. During the COVID-19 outbreak, social VR platforms (e.g., VRChat, AltspaceVR, and Mozilla Hubs) emerged as alternatives to face-to-face interactions, enabling people to stay connected in virtual spaces while mitigating the risk of virus transmission from in-person meetings. Compared to traditional video communication tools, VR has gained widespread recognition for its heightened sense of immersion and co-presence [3,4,5].
However, the quality of VR interactions remains far from matching that of real-world experiences due to technological limitations. For instance, VR interactions primarily rely on avatars, which often lack precise facial expression control and detailed posture tracking, making it difficult to convey rich non-verbal cues [6]. Moreover, the field of view in VR environments is generally narrower than that of human vision, constraining users’ ability to perceive their surroundings through visual channels [7]. These technical constraints lead to the loss or misinterpretation of non-verbal signals, complicating multi-party interactions in VR [8,9,10]. As a result, users often struggle to accurately perceive others’ intentions, leading to confusion about when to speak or interrupt, which in turn causes issues such as overlapping speech or “awkward silences” [10,11].
Previous research has explored the potential of using behavioral tracking data to predict speech behavior, demonstrating the feasibility of real-time conversational interventions in social VR [9,12,13,14,15]. However, systematic studies on how and when such models can effectively intervene in conversations and improve turn-taking among users remain limited. In particular, identifying the key factors contributing to communication breakdowns in VR and understanding users’ communication needs across different contexts are critical for designing effective VR interaction support systems. Therefore, this study focuses on addressing the challenges of turn-taking coordination in VR environments under different interaction states. It employs a quantitative analysis approach to explore the underlying causes and mechanisms, aiming to provide critical insights for the development of context-aware VR interaction support systems. Specifically, we address the following research questions:
  • RQ1: What interaction factors significantly influence participants’ ability to successfully acquire speaking turns in VR communication?
  • RQ2: How do the challenges participants face in acquiring speaking turns vary across different interaction scenarios?
To answer these questions, we analyzed a VR communication dataset of six groups (four participants each) engaging in a 15 min social game. Participants annotated their speaking intentions post-experiment, and logistic regression analysis was conducted to examine the effects of individual status, spatial relationships, interaction proactivity, and communication quality on turn-taking coordination. Our contributions include identifying three key dimensions that influence the acquisition of speaking rights: (1) Individual Status: Higher-status participants encounter fewer coordination failures. (2) Interaction Proactivity: More proactive participation increases speech overlap and turn-taking failures. (3) Communication Quality: While initial speech overlap causes confusion, it later facilitates smoother turn transitions through spontaneous adjustments. Building on these findings, we used clustering to identify five typical interaction scenarios for self-initiated turn-taking: high-competition, intense interaction, prolonged single-turn, high-status role, and low-activity contexts. As another contribution of our work, we analyzed these scenarios to compare their distinct turn-taking challenges and discussed targeted support strategies. For instance, in high-competition scenarios, systems can reduce the likelihood of turn-taking failures by providing priority cues or delayed feedback. In intense interaction and low-activity scenarios, systems can optimize turn-taking by offering non-verbal cues and social signals.
We hope these findings will inform the development of context-aware VR communication support systems capable of adapting to diverse interaction scenarios and mitigating communication barriers caused by inappropriate interventions.

2. Related Work

In multi-party conversations, turn management is a crucial mechanism to ensure smooth interactions. Sacks et al. [16] observed that natural conversations are typically well-coordinated, with only one person speaking at a time and minimal overlap. They hypothesized the existence of an implicit mechanism that enables efficient coordination of speaking turns. Subsequent research explored various non-verbal cues that play a role in turn management, including grammar, prosody, gaze, body posture, and gestures [17,18,19,20,21,22,23,24,25]. For instance, individuals can predict turn completions and openings based on silence, changes in intonation, speech rate adjustments, or prosodic patterns [18,19,20]. Streeck et al. [22] discussed the role of gestures in turn management, showing how listeners use gestures to indicate speaking intentions or initiate new turns. Jokinen et al. [24,25] emphasized the importance of gaze and head movements in coordinating conversational turns and facilitating information flow. Additionally, Petukhova et al. [26] demonstrated that combining multiple cues, such as head and mouth movements, can significantly enhance participants’ competitiveness in turn-taking.
With advancements in technology, interactions have expanded into online and virtual environments (e.g., video conferencing, VR social platforms). Unlike face-to-face communication, these interactions are mediated by technology and thus subject to its limitations. For example, network-induced delays [27], tracking errors in VR avatars [28], and insufficient non-verbal cues such as gaze and facial expressions [6,29] pose challenges for turn management. Users struggle to accurately perceive or interpret each other’s speaking intentions, making it difficult to find appropriate moments to insert themselves into conversations or claim speaking turns [10,11,30,31,32]. This directly affects participants’ equal opportunities to contribute [33], and turn-taking balance is closely tied to communication effectiveness and quality [34,35]. As some studies have pointed out, the difficulties in managing turn-taking hinder effective workplace meetings [36,37] and educational activities in virtual settings [5,38].
To address these challenges, researchers have proposed solutions to supplement or reconstruct non-verbal cues to support turn management in virtual environments. For example, Hu et al. introduced proxemic metaphors to assist turn-taking in multi-party video calls [39], while He et al. generated 3D photos with gaze direction to improve engagement and efficiency in online meetings [40]. Lou et al. [41] used additional EMG sensors to capture and reconstruct facial expressions on VR avatars to address the lack of non-verbal cues. Kurzweg et al. [42] suggested designing avatar body language to indicate participants’ conversational states, attention, and engagement. Li et al. [43] utilized visualization of participants’ conversational turn-taking in shared VR environments to promote balanced turn allocation.
Advances in turn-taking recognition, leveraging behavioral tracking data such as mouth movements [12], head movements [9], breathing patterns [44], gaze behavior [13,45], and multimodal motion data [15], have spurred efforts to develop intelligent systems capable of understanding turn transitions. Mizuno et al. [46] envisioned systems that predict the next speaker and display relevant information to enhance communication. Such systems could also benefit social agents for natural conversational interactions [47], provide personalized assistance to users who struggle to interpret non-verbal cues due to platform differences or physical abilities [48], and alert leaders in group discussions to adjust their strategies when others’ chances to speak are being limited or suppressed [49].
Despite advancements in VR communication research, the practical application of turn-taking support systems in real-world scenarios remains underexplored. A key challenge lies in determining the optimal timing and method of intervention to ensure support is both timely and non-disruptive to natural interactions. While prior studies have examined issues such as interruptions, turn acquisition, and overlapping speech in VR [3,10,11], their findings are largely derived from qualitative analyses, primarily through participant interviews. Quantitative evaluations of turn-taking challenges across different VR scenarios remain limited.
This study addresses this gap by quantitatively analyzing interaction dynamics to identify factors contributing to turn-taking difficulties and assessing the risk of turn-taking failures in varying scenarios. By providing empirical evidence, this work aims to inform the development of predictive models for turn-taking failures and lay the groundwork for context-aware turn-taking support systems.

3. Methods

We utilized a dataset collected from multi-party interactions in a VR environment to examine participants’ expression of speaking intentions and the eventual acquisition of speaking turns. By analyzing the outcomes of turn-taking and the interaction states that participants were in at the time, we aim to identify the factors that influence the risk of turn-taking failures.

3.1. Dataset

The dataset used in this study was derived from VR communication experiments conducted in our previous work [15]. Figure 1 illustrates the overall workflow of the experiment. The experiment involved 24 participants, evenly distributed between males (12) and females (12), aged between 22 and 30, with most being graduate students. Of these participants, 13 had prior experience with VR, while 11 did not. To minimize potential biases caused by varying levels of VR familiarity, a pre-experiment was conducted for participants without VR experience to familiarize them with the VR equipment and virtual environment before the formal experiment.
The experiment was conducted in a virtual meeting room environment (as shown in Figure 2). Previous research has demonstrated that VR environments can influence users in a manner similar to physical environments [50,51,52]. Based on this, we selected a conventional and formal virtual environment to encourage participants to focus on the experimental tasks [53]. In the virtual environment, each participant was represented by a simplified avatar consisting of a head, upper torso, and hands. To control for any additional effects of avatar design on communication [54,55], all participants were provided with identical avatars to ensure consistency of experimental variables. The experimental task was conducted in groups, with participants engaging in a social game called “Two Truths and a Lie” [56]. This type of icebreaker game was chosen because it fosters a relaxed and lively discussion atmosphere, allowing participants to naturally integrate into the discussion and encouraging the expression of speaking intentions [57,58].
During the experimental implementation phase, all participants were randomly divided into six groups, each consisting of four members, with an equal distribution of two males and two females in each group to ensure gender balance. Each group used Meta Quest 2 headsets to enter the virtual meeting room, where they communicated for approximately 15 min. In the virtual space, participants’ avatars could move and rotate freely using the joystick on handheld controllers, with motions tracked to reflect participants’ physical movements.
During the data collection phase, we recorded VR tracking data throughout the interaction, including the position, rotation, velocity, acceleration, angular velocity, and angular acceleration of participants’ avatars. Additionally, the discussion process was fully recorded using the devices’ built-in screen recording feature. These recordings were used to annotate speech events by segmenting speech into inter-pausal units (IPUs) using a 200 ms pause threshold [59,60]. Speech intervals exceeding this threshold were considered separate speech units.
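As an illustration, the following is a minimal sketch of how such IPU segmentation could be implemented; the data layout (per-participant lists of speech start/end times in seconds) is our assumption and not the original annotation pipeline.

```python
# Sketch: merge annotated speech segments into inter-pausal units (IPUs)
# using the 200 ms pause threshold described above.

PAUSE_THRESHOLD = 0.2  # seconds

def merge_into_ipus(segments, pause_threshold=PAUSE_THRESHOLD):
    """Merge consecutive (start, end) speech segments separated by pauses
    shorter than the threshold; longer pauses start a new IPU."""
    if not segments:
        return []
    segments = sorted(segments)
    ipus = [list(segments[0])]
    for start, end in segments[1:]:
        if start - ipus[-1][1] < pause_threshold:
            ipus[-1][1] = max(ipus[-1][1], end)  # pause too short: same IPU
        else:
            ipus.append([start, end])            # pause long enough: new IPU
    return [tuple(u) for u in ipus]

# Example: a 0.15 s pause is bridged, a 0.5 s pause splits the units.
print(merge_into_ipus([(0.0, 1.2), (1.35, 2.0), (2.5, 3.1)]))
# -> [(0.0, 2.0), (2.5, 3.1)]
```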
To capture moments when participants attempted to acquire speaking turns, we employed a retrospective cue-based approach to collect speaking intention tags. After the experiment, participants reviewed the recorded discussions and annotated the specific timestamps when they began expressing speaking intentions. During annotation, participants referred to social signals they displayed, such as audible lip-smacking or tongue-clicking sounds, posture adjustments, deep breaths, or noticeable mouth movements [14,26], to determine the moments when they clearly exhibited speaking intentions and attempted to gain speaking turns. Notably, speaking intentions are not always directly linked to actual speaking behavior. Some speaking intentions failed to result in speech, possibly because participants abandoned their attempts due to a lack of suitable opportunities. Conversely, some speaking behaviors might be driven by group pressure rather than explicit speaking intentions [61].
As part of the annotation process, speech behaviors that did not involve speaking intentions were excluded. For example, when participants were explicitly assigned speaking turns by others, their speech was typically a passive response rather than an expression of active intent. In such cases, participants did not need to exhibit additional (non-verbal) behaviors to coordinate turn-taking [26]. Additionally, backchannel responses (e.g., brief supportive utterances) were excluded, as they were used to express support for the speaker without involving turn acquisition.
The speaking intention tags were categorized based on whether the intention successfully translated into speech: successful speaking intentions (intentions that resulted in speech) and unsuccessful speaking intentions (intentions that did not result in speech). Figure 3 presents an example of annotated speech data. To ensure annotation accuracy, all annotated speaking intention tags underwent rigorous review by researchers, with erroneous tags removed. For example, tags labeled as successful intentions but timestamped later than the actual speech onset, or those labeled as successful intentions but lacking corresponding speech behaviors, were excluded. After filtering, the dataset included 414 successful speaking intention tags and 77 unsuccessful speaking intention tags.

3.2. Turn-Taking Failures

This study evaluates the challenges of turn-taking by analyzing the outcomes of participants’ speaking intentions. In addition to unsuccessful speaking intentions (i.e., turn-taking failures), we also examine instances of speech conflicts, as they reflect breakdowns in coordination during the turn-taking process.
In an ideal multi-party conversation, smooth turn transitions rely on the effective transmission and reception of social cues among all participants, ensuring seamless alternation of speech. However, when participants fail to secure speaking turns through their cues, it can lead to “interruptions”, where a participant initiates their turn before the current speaker finishes [62]. Such intrusive behaviors not only disrupt conversational flow [63] but may also negatively affect interpersonal attitudes among participants [64].
Another form of speech conflict arises when multiple participants simultaneously initiate their speaking turns. This scenario indicates a failure to accurately perceive others’ intentions, resulting in inappropriate timing of speech initiation and triggering a turn-taking conflict.
To systematically analyze the outcomes following the expression of speaking intentions, we classified them into the following categories based on whether the intention successfully transitioned into a speaking turn and whether a speech conflict occurred during turn initiation:
  • Obstructed: The speaking intention did not result in speaking behavior, or the turn initiation involved overlapping speech.
  • Unobstructed: The speaking intention successfully transitioned into speaking behavior, and the turn initiation was characterized by non-overlapping, smooth turn transitions.
Notably, in evaluating overlapping speech, we considered not only overlaps at the exact moment of turn initiation but also overlaps occurring within 0.2 s after the turn began. This is because, even if no apparent overlap occurs at the onset, a very short delay between the turns of two speakers might indicate a failure to detect the first speaker’s intention. Figure 4 illustrates the structural relationship between speaking intention tags and the categories of turn-taking outcomes.
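The outcome classification can be expressed compactly. The sketch below assumes each intention tag carries the resulting turn interval (or none, if the intention never led to speech) together with the other participants’ turn intervals; this simplified representation is our assumption rather than the study’s annotation format.

```python
# Sketch: label a speaking-intention tag as "obstructed" or "unobstructed",
# checking for overlap at turn onset and within 0.2 s after onset.

OVERLAP_WINDOW = 0.2  # seconds after turn onset to check for overlap

def overlaps(a, b):
    """True if two (start, end) intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def classify_intention(own_turn, other_turns, overlap_window=OVERLAP_WINDOW):
    """own_turn: (start, end) of the speech produced by the intention, or None
    if the intention never resulted in speech; other_turns: (start, end)
    intervals spoken by other participants."""
    if own_turn is None:
        return "obstructed"                      # unsuccessful speaking intention
    onset_zone = (own_turn[0], own_turn[0] + overlap_window)
    if any(overlaps(onset_zone, t) for t in other_turns):
        return "obstructed"                      # overlapping turn initiation
    return "unobstructed"

# Example: the turn starts at 10.0 s while another participant speaks until 10.1 s.
print(classify_intention((10.0, 12.5), [(8.0, 10.1)]))  # -> "obstructed"
```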

3.3. Interaction Dynamic Features

Based on prior research on social dynamics, we explore the key factors potentially contributing to turn-taking failures across four dimensions: individual status, spatial relationships, interaction proactivity, and communication quality.
Individual Status: Status disparities among participants are a critical factor influencing conversational equality [65,66]. For instance, the time interval since a participant’s last speech reflects their status: participants who have spoken recently bear greater conversational responsibilities, whereas those who remain silent for extended periods risk marginalization [67]. To capture this characteristic, we introduce the feature Negative Logarithm of Speaking Interval, where the time interval refers to the duration between a participant’s last speech and their expression of speaking intention. By applying a logarithmic transformation to the time interval, the impact of extreme values is reduced, and taking its negative reflects an inverse relationship, where shorter intervals indicate higher status.
Spatial Relationships: Spatial relationships play a crucial role in shaping interpersonal interactions [68]. Physical distance has a significant impact on social influence and interaction frequency [69,70,71], as greater distances can weaken social signal transmission [69] and reduce communication efficiency [71]. To quantify spatial relationships, we introduce the feature Group Area, defined as the area of the smallest rectangle enclosing the positions of all four participants. This feature provides insights into how spatial distribution influences the fluidity and coordination of turn-taking.
Interaction Proactivity: Higher interaction proactivity often leads to increased competition for turns and more frequent turn exchanges, thereby complicating turn-taking coordination. The low fidelity of VR environments may exacerbate this issue by reducing participants’ ability to detect and respond to non-verbal cues. To measure this dimension, we include two voice-channel-related features: the Speaking Duration Ratio and the Utterance Count. These features have been widely used in previous studies to assess conversational activity and interaction patterns [32,72,73]. Additionally, to complement the measurement of interaction intensity from a non-verbal perspective, we calculate the Speaking Intention Count, which represents the number of speaking intention tags from others. This feature reflects the active expression of speaking intentions, providing additional support for a comprehensive assessment of interaction proactivity.
Communication Quality: Communication quality is closely related to the effectiveness of interactions. Clear expression reduces misunderstandings among participants, enhancing the accuracy and consistency of information exchange. Balanced participation fosters diverse viewpoints and cultivates an open and inclusive conversational environment. To measure differences in communication quality, we introduce two key features: Speaking Overlap Ratio and Max Speaker Duration Ratio. The speaking overlap ratio represents the proportion of overlapping speaking time to the total speaking time, reflecting the level of conversational chaos, as simultaneous speech by multiple participants can hinder effective information exchange. The max speaker duration ratio measures the ratio of the longest speaker’s speaking duration to the total speaking time, indicating the balance of participation. A lower ratio suggests more open and balanced communication, which we hypothesize may facilitate the acceptance of participants’ speaking intentions. Finally, Table 1 provides a summary of all features. Table 2 provides the descriptive statistics for each feature.
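As an illustration, the sketch below shows one way the Table 1 features could be computed inside the analysis window defined in the next paragraph. The data layout, the column names, the axis-aligned bounding rectangle used for group area, and the pairwise summation of overlap time are simplifying assumptions on our part, not the study’s exact implementation.

```python
import numpy as np

def _total_overlap(intervals):
    """Summed pairwise overlap time between speech intervals (simplified measure)."""
    total = 0.0
    for i, (s1, e1) in enumerate(intervals):
        for s2, e2 in intervals[i + 1:]:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def window_features(turns, positions, other_intentions, tag_time, last_speech_end,
                    window_length):
    """turns: dict participant_id -> list of (start, end) speech intervals in the
    window; positions: list of (x, z) avatar coordinates; other_intentions:
    timestamps of other participants' intention tags inside the window."""
    all_intervals = [iv for ivs in turns.values() for iv in ivs]
    total_speech = sum(e - s for s, e in all_intervals)
    features = {}
    # Individual status: negative log of the time since the tagging participant
    # last spoke (a small floor avoids log(0)).
    features["neg_log_interval"] = -np.log(max(tag_time - last_speech_end, 1e-3))
    # Spatial relationship: area of the rectangle enclosing all avatars
    # (axis-aligned here for simplicity).
    xs, zs = zip(*positions)
    features["group_area"] = (max(xs) - min(xs)) * (max(zs) - min(zs))
    # Interaction proactivity.
    features["speaking_duration_ratio"] = total_speech / window_length
    features["utterance_count"] = len(all_intervals)
    features["speaking_intention_count"] = len(other_intentions)
    # Communication quality.
    overlap = _total_overlap(all_intervals)
    features["speaking_overlap_ratio"] = overlap / total_speech if total_speech else 0.0
    longest = max((sum(e - s for s, e in ivs) for ivs in turns.values()), default=0.0)
    features["max_speaker_duration_ratio"] = longest / total_speech if total_speech else 0.0
    return features
```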
To calculate the relevant features, we selected data within a time window of t seconds before and after participants annotated their speaking intentions. This window was designed as a key event period to analyze the interaction dynamics during participants’ attempts to acquire speaking turns. The rationale behind this choice was to capture the critical time period influencing the expression of speaking intentions and their transition to actual speaking behavior. To determine the optimal length of t, we performed a statistical analysis of the time intervals between participants’ annotated speaking intentions and the actual onset of speaking. The analysis showed that, in the majority of cases, this interval was no longer than 1.5 s (mean = 0.867 s, std = 0.534 s, min = 0.055 s, median = 0.780 s, max = 2.507 s; 87% of intervals were ≤ 1.5 s). Based on these findings, we set t to 1.5 s, as this duration effectively covers the typical transition period from intention to action while minimizing the inclusion of irrelevant or extraneous data outside the scope of this process.
It is important to note that, to prevent potential bias in feature calculation caused by speech behaviors triggered by the speaking intention itself, we adjusted the time window for certain tags. Specifically, when speech occurs within 1.5 s after a tag, the window is shifted earlier to ensure that the subsequent speech behavior is excluded from feature computation (as illustrated in Figure 5).
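A minimal sketch of one possible reading of this window adjustment (shifting the whole window earlier so it ends at the participant’s own speech onset) is given below; the exact shifting rule is illustrated in Figure 5 and this code is only our interpretation of it.

```python
T = 1.5  # seconds before/after the tag used for feature extraction

def analysis_window(tag_time, own_speech_onset=None, t=T):
    """Return (start, end) of the feature window around an intention tag.
    If the participant's own speech starts within t seconds after the tag,
    the whole window is shifted earlier so that this speech is excluded."""
    end = tag_time + t
    if own_speech_onset is not None and own_speech_onset < end:
        end = own_speech_onset
    return end - 2 * t, end
```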

4. Results

4.1. Regression Analysis

We employed a logistic regression model to systematically analyze the significance and impact of various features on turn-taking success. Specifically, the outcome of turn-taking (categorized as “obstructed” or “unobstructed”) was set as the dependent variable, while interaction dynamics features were used as independent variables. Prior to model construction, all feature data were standardized, and the variance inflation factors (VIF) for all independent variables were verified to be below 5, indicating no significant multicollinearity issues. The regression analysis results are presented in Table 3. The model’s McFadden pseudo-R² value is 0.192, close to 0.2, indicating that the model effectively captures key relationships between the independent and dependent variables.
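A minimal sketch of this regression setup, assuming a pandas DataFrame `df` holding the Table 1 features plus a binary `obstructed` outcome (column names are ours), could look as follows using statsmodels:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

feature_cols = ["neg_log_interval", "group_area", "speaking_duration_ratio",
                "utterance_count", "speaking_intention_count",
                "speaking_overlap_ratio", "max_speaker_duration_ratio"]

def fit_turn_taking_model(df):
    # Standardize all independent variables before fitting.
    X = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
    # Multicollinearity check: all VIFs should stay below 5.
    vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                    index=feature_cols)
    print(vif)
    # Logistic regression of the binary outcome (1 = obstructed) on the features.
    model = sm.Logit(df["obstructed"], sm.add_constant(X)).fit()
    print("McFadden pseudo-R2:", model.prsquared)
    print(model.summary())
    return model
```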
From the regression coefficients (β values) and their significance (p-values), the group area (β = −0.0852, p = 0.429) and max speaker duration ratio (β = 0.3258, p = 0.051) were found to have no significant effect on turn-taking success. However, the utterance count (β = −0.5211, p = 0.006) and speaking duration ratio (β = −0.9142, p = 0.000) exhibited significant negative effects, indicating that high-frequency or continuous speaking scenarios increase the difficulty for participants to successfully acquire speaking turns.
Additionally, the feature negative logarithm of speaking interval (β = 0.2710, p = 0.014) positively influenced turn-taking success, indicating that participants whose intentions followed shortly after their last turn were more likely to have those intentions acknowledged. On the other hand, the speaking intention count (β = −0.5344, p = 0.000) had a significant negative impact on coordination outcomes, meaning that when speaking intentions occurred in clusters, the difficulty of turn-taking coordination increased accordingly.
An unexpected finding was that the speaking overlap ratio (β = 0.3875, p = 0.009) had a positive effect on successful turn-taking, suggesting that in scenarios with a higher proportion of overlapping speech, the failure rate of turn-taking decreased. This phenomenon may indicate that participants actively coordinate turn-taking during overlaps to avoid prolonged speech conflicts.
To further validate the predictive performance of the significant features, we retrained the logistic regression model using these features to predict turn-taking failures. The dataset was divided into 90% for training and 10% for testing, with the test set containing 50 samples (24 positive cases and 26 negative cases). On the test set, we achieved an overall accuracy of 82%, with an AUC value of 0.83. This indicates that the model has high accuracy in predicting failures and demonstrates strong discriminative ability. These results further support the critical role of these features in turn-taking coordination and confirm their potential for application and effective prediction in new datasets.
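A sketch of this validation step, assuming the same DataFrame layout as above (the split seed and column names are ours; only the significant features are used), is shown below using scikit-learn:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

significant = ["neg_log_interval", "speaking_duration_ratio", "utterance_count",
               "speaking_intention_count", "speaking_overlap_ratio"]

def evaluate_failure_prediction(df, test_size=0.1, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[significant], df["obstructed"], test_size=test_size,
        random_state=seed, stratify=df["obstructed"])
    scaler = StandardScaler().fit(X_tr)          # fit scaling on the training split only
    clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
    acc = accuracy_score(y_te, clf.predict(scaler.transform(X_te)))
    auc = roc_auc_score(y_te, clf.predict_proba(scaler.transform(X_te))[:, 1])
    return acc, auc
```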

4.2. Cluster Analysis

Based on the regression analysis, we further conducted a clustering analysis of the significant variables to identify representative group interaction patterns. These patterns, with their concentrated trends in specific interaction features, provide key insights into the causes of coordination difficulties across different scenarios.
In this study, we used the K-means clustering method. To determine the optimal number of clusters, we experimented with configurations ranging from 2 to 10 clusters and evaluated them using the elbow method and silhouette coefficient. Ultimately, we selected five clusters as the optimal solution. Figure 6 provides a detailed view of the distribution of each feature across the different categories.
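A sketch of the cluster-number selection, assuming the standardized significant features as input (implementation details such as the random seed are ours), could look as follows:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def select_k(feature_matrix, k_range=range(2, 11), seed=0):
    """Run K-means for each candidate k and record the elbow and silhouette criteria."""
    X = StandardScaler().fit_transform(feature_matrix)
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
        results[k] = {"inertia": km.inertia_,                       # elbow method
                      "silhouette": silhouette_score(X, km.labels_)}
    return results
```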
Based on the feature distribution across clusters, we summarized the main characteristics of each cluster:
  • C1 (Prolonged Single-turn Scenario): This cluster is characterized by the highest proportion of speaking duration but a low number of speaking turns and almost no speech overlap. Participants wishing to speak hold a relatively low status in the current interaction. This suggests that the interactions in this cluster are primarily dominated by long monologues from a few participants, with limited group interaction, leaning toward a unidirectional information-sharing pattern.
  • C2 (High-status Role Scenario): This cluster exhibits moderate levels of speaking duration and speaking turns, with a low proportion of speech overlap. A notable feature is that participants wishing to speak have a relatively high status. This reflects interactions where participants expressing speaking intentions are closely tied to the ongoing exchange and hold a relatively dominant role.
  • C3 (Intense Interaction Scenario): This cluster features the highest rate of speech overlap and speaking turns, with a speaking duration proportion close to the highest. Participants wishing to speak have high status, and the speaking intentions of other participants are also strong. This type of interaction demonstrates a highly intense interactive state, where group members frequently speak, indicating high engagement and interaction density.
  • C4 (Low-activity Scenario): This cluster is characterized by long periods without speech activity, with participants speaking infrequently, resulting in an overall silent state. This indicates a “cold” interaction phase where group members exhibit low willingness to participate, lacking active speaking and engagement.
  • C5 (High-competition Scenario): This cluster shows the highest number of speaking intentions, moderate levels of speaking duration and speaking turns, and a low proportion of speech overlap. Such interactions likely occur when a topic of broad interest prompts participants to express speaking intentions collectively, resulting in turn-taking competition and reflecting a high level of interaction demand.
Table 4 shows the number of tags in each cluster and the proportion of turn-taking failures (including overlaps and unsuccessful attempts) during turn acquisition. Among the clusters, C5 has the highest obstructed ratio, with both its overlap ratio and its unsuccessful-attempt ratio being the highest of all clusters. The failure rates of the other clusters, in descending order, are C3, C1, and C2, with C4 having the lowest failure rate.
Notably, while C3 and C1 have similar overall failure rates, their specific patterns differ: C3 exhibits a higher overlap ratio, whereas C1 shows a higher ratio of unsuccessful attempts. In comparison, C2 has the lowest proportion of unsuccessful attempts, and C4 has the lowest proportion of overlap cases among all clusters.

4.3. Survival Analysis

To further investigate the effects of different conditions on turn-timing, we applied survival analysis to evaluate the time delay between the expression of speaking intentions and the actual speech initiation. By comparing survival curves across different clusters, we aim to reveal behavioral differences in turn-timing choices under varying scenarios.
Specifically, for all successful speaking intention tags (including both non-overlapping and overlapping cases), the time of intention expression was set as the starting point for survival analysis, while the actual speech initiation was defined as the “event”. By measuring the time interval between the tag and the event, we calculated the “survival rate” of intentions at each time point. Figure 7 presents the survival curves for each category, providing a visual representation of the dynamics of speaking intention realization across different scenarios.
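A sketch of this analysis using the lifelines package is shown below; it assumes a DataFrame with a per-tag delay (from intention expression to speech onset, in seconds) and a cluster label, and these column names are ours rather than the study’s.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_clusters(df):
    """df columns (assumed): 'delay' = seconds from intention tag to speech onset,
    'cluster' = scenario label (C1..C5). All rows are successful intentions, so
    every observation counts as an observed event."""
    for label, group in df.groupby("cluster"):
        kmf = KaplanMeierFitter()
        kmf.fit(group["delay"], event_observed=[1] * len(group), label=label)
        print(label, "median delay:", kmf.median_survival_time_)
    # Pairwise comparison, e.g., C1 vs C4 as reported in the text.
    c1, c4 = df[df.cluster == "C1"], df[df.cluster == "C4"]
    print("C1 vs C4 log-rank p-value:", logrank_test(c1["delay"], c4["delay"]).p_value)
```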
The log-rank test results indicate that the difference between C1 and C4 scenarios is statistically significant (p < 0.01). Through the analysis of the survival curves, it can be observed that in C4, the likelihood of speaking intentions being converted into actual speech shortly after expression is significantly lower than in other categories (i.e., the survival probability is higher). This suggests that in C4 (low-activity scenario), participants are more inclined to delay the execution of their speaking intentions rather than speak immediately. In contrast, in C1, the survival probability drops to the lowest among all categories after 1 s, suggesting that speaking behaviors in C1 (prolonged single-turn scenario) are less likely to be delayed for a longer period.
Further analysis of survival probability after 1.5 s reveals that these probabilities in C1 and C5 are significantly lower compared to C2, C3, and C4. This suggests that in C1 (prolonged single-turn scenario) and C5 (high-competition scenario), speaking intentions that successfully transition into actions are mostly concentrated within 1.5 s after the intention is expressed. Speaking intentions that are not acted upon within 1.5 s are relatively rare, indicating that they are often abandoned and become unsuccessful intentions. In contrast, in C2 (high-status role scenario), C3 (intense interaction scenario), and C4 (low-activity scenario), approximately 20% of speaking intentions remain unfulfilled even after 1.5 s. This difference may reflect variations in participants’ patience or strategies for acquiring speaking turns across different scenarios, highlighting the influence of contextual factors on speaking intention outcomes.

5. Discussion

5.1. Key Features and Predictions of Turn-Taking Coordination Failures

Using logistic regression analysis, we addressed RQ1, identifying which interaction dynamics significantly influence participants’ success in acquiring speaking turns in VR environments. We also explored the potential of these factors for predicting coordination failures. The analysis identified five variables across three dimensions with significant impacts on turn-taking success. Below, we provide an in-depth discussion of these variables to uncover the underlying mechanisms driving their effects.
Firstly, a participant’s individual status significantly affects their success in acquiring a turn. The negative logarithm of the speaking interval reflects this, with participants expressing intentions shortly after their last turn being more likely to succeed. This phenomenon can be attributed to several underlying mechanisms. Previous studies have identified the “short-term gain effect” in communication, where participants who have just spoken are more likely to speak again within a short period [74]. The regularity in such speaking sequences increases the predictability of the next speaker for group members [75]. Consequently, participants who have just spoken are better able to sustain and capture others’ attention, making their social signals of speaking intention more easily perceived and responded to.
Interaction proactivity features, such as speaking duration ratio, utterance count, and speaking intention count, strongly influence turn-taking coordination. An increase in speaking turns, speaking time, and participants expressing speaking intentions heightens the risk of turn-taking failures. The successful implementation of rapid turn exchanges relies on a clear signaling system to effectively manage the turn-taking process. However, existing literature highlights that the interpretation of signals often depends on specific contextual factors and is not always unambiguous, further complicating rapid turn exchanges [76]. In high-intensity interaction scenarios, participants are required to not only process others’ spoken content quickly but also dynamically interpret and adapt to evolving social signals, which significantly increases the complexity of interactions. Moreover, technical limitations in VR environments, such as delays or inaccuracies in the transmission of non-verbal signals, may exacerbate these challenges. Taken together, these factors make interaction proactivity a critical contributor to speech overlaps and turn-taking failures.
Interestingly, the communication quality dimension’s speaking overlap ratio yielded counterintuitive results: increased overlap was associated with a higher likelihood of successfully acquiring speaking turns. This phenomenon may indicate that during overlapping speech, participants tend to actively coordinate to reach a conversational consensus. According to the “one speaker at a time” principle [16], extended periods of overlap may prompt participants to adjust their speaking rhythms to clarify a single active speaker. Through this coordination, participants become more attuned to others’ speaking intentions, creating conditions for smoother turn acquisition. Thus, in specific contexts, overlapping speech may serve as a mechanism for achieving smoother turn transitions rather than as a barrier. As for the max speaker duration ratio, no significant effect was observed. Initially, we hypothesized that more balanced speaking distributions would create a more open conversational atmosphere, facilitating turn acquisition. However, this assumption was not supported. This suggests that the relationship between openness in turn-taking and participation balance may be more complex than initially anticipated.
Finally, spatial relationships, measured by the group area feature, showed no significant effect on turn-taking coordination. This result may be due to the spatial constraints of the VR environment used in the study, which simulated the size of a meeting room. The relatively limited range of movement might not have been sufficient to observe the effects of spatial distance on non-verbal signal transmission. Future studies could investigate this factor in larger VR environments, such as plazas or halls, where increased activity ranges may amplify the impact of distance on turn-taking coordination.
We validated the predictive performance of the logistic regression model using the significant features for turn-taking failure prediction. The model demonstrated high accuracy on the test set, supporting the critical role of these features in turn-taking coordination and confirming their robust predictive capability in new data. From an application perspective, these significant features can guide the development of VR systems equipped with failure prediction and real-time feedback capabilities. By integrating a prediction model for turn-taking difficulties, systems can provide assistive notifications when participants are at risk of failing to acquire a turn. This would not only enhance the natural flow of turn transitions but also minimize unnecessary disruptions introduced by the system during interactions.

5.2. Impacts of Different Interaction Scenarios on Turn-Taking and System Support Strategies

To address RQ2, we performed a clustering analysis of significant features, identifying five typical interaction states when participants expressed speaking intentions. These findings reveal the distinct mechanisms through which different interaction states impact turn-taking success. For each scenario, we discuss potential design strategies to address the associated challenges.
In the high-competition scenario (C5), the failure rate for acquiring speaking turns is the highest. Participants tend to speak quickly, exhibiting “competitive” behavior. In such scenarios, the simultaneous expression of speaking intentions by multiple participants makes synchronization and coordination extremely challenging. The resulting “first-come, first-served” pattern highlights the immediacy and prioritization of turn transitions under competitive dynamics, which significantly increases the failure rate. To address this challenge, system support could focus on providing clearer priority cues or offering delayed feedback during moments of reduced competition to optimize the allocation of speaking turns and reduce instances of turn-taking failures.
In the intense interaction scenario (C3) and the prolonged single-turn scenario (C1), turn-taking failures also occur frequently, but their underlying causes differ. In the intense interaction scenario (C3), frequent turn transitions make it difficult for participants to accurately identify appropriate speaking opportunities, resulting in increased overlapping speech. This rapid interaction pace requires careful consideration in system intervention design. Overly direct or frequent system support could exacerbate communication chaos and disrupt the natural flow of turn-taking. Notably, overlapping speech can sometimes facilitate spontaneous coordination among participants, helping them quickly reach a consensus on turn allocation. Therefore, when providing support, systems should carefully balance the intensity and approach of interventions to avoid disrupting the highly dynamic rhythm of turn exchanges.
In contrast, the prolonged single-turn scenario (C1) restricts opportunities for others to speak, resulting in more turn-taking failures, which in turn impact group engagement and interaction balance. From the perspective of turn-timing, after waiting for a brief period, participants may lose patience and choose to interrupt the current speaker to forcibly claim a turn. To address this, support systems could prompt long-turn speakers to acknowledge the speaking intentions of other participants and encourage them to voluntarily relinquish the floor. Simultaneously, the system can provide clear feedback to participants attempting to express intentions, helping them confirm that their intentions have been recognized by others. Such mechanisms can increase opportunities for multi-party participation, effectively reduce interruptions caused by poor transmission of information, and improve the overall fluidity and coordination of interactions.
In the high-status role scenario (C2), the process of acquiring speaking turns is relatively smooth, with participants exhibiting greater patience in choosing the timing of their speech. This may reflect role awareness and stability associated with differences in participant status, which helps maintain conversational rhythm. In this scenario, the system does not need to implement extensive interventions but should focus on providing targeted support to non-dominant participants to help address the imbalance in speaking intentions and participation opportunities caused by status differences.
In the low-activity scenario (C4), participants’ self-initiated turns rarely lead to turn-taking failures, but they often exhibit longer hesitation periods before speaking. This may be due to the greater psychological pressure associated with breaking the silence in low-activity contexts, where participants require more social cues as triggers for speaking. To address this, the system could share information about speaking intentions within the group, helping participants recognize that their intentions have been perceived by others and understand whether there are other potential speakers. This could help alleviate participants’ psychological pressure, thereby reducing hesitation time, encouraging more interaction, and improving the overall efficiency of the conversation.
Finally, we provide several examples of how these mechanisms can be implemented in practical VR applications. For instance, in virtual meetings during rapid brainstorming sessions, the system can display a dynamic speaking queue to visually manage the speaking order, effectively preventing speech conflicts. When prolonged single-speaker situations arise, the system can send private prompts to long-time speakers, suggesting they summarize their points and invite others to contribute. Simultaneously, the system can enhance the presence of non-dominant users through visual cues (e.g., markers or aura effects), promoting balanced participation. In contrast, when interactions fall into a state of silence, the system can use auditory prompts or animations to encourage participants to break the silence.

5.3. Limitations and Future Work

The current study was subject to several limitations. First, the sample size and experimental context are relatively limited. As noted earlier, the constrained spatial range of the virtual environment may have diminished the significance of spatial relationship features. Additionally, the characteristics of social interactions may vary across diverse social groups [77], while the participants in this study were predominantly university students, representing a relatively homogeneous demographic. Future research should aim to include more diverse populations and broader interaction scenarios to enhance the generalizability of the findings.
Secondly, this study prioritized features that are easily detectable by automated systems, which introduces certain limitations in examining influencing factors. For example, personality traits, which have been shown to significantly impact interaction behaviors [78,79], were not included in this study. Future research could consider incorporating additional features, such as personality traits, using pre-experiment questionnaires to evaluate the risk of turn-taking failures more comprehensively.

6. Conclusions

This study explored the key factors influencing turn-taking coordination in VR environments and analyzed the challenges of acquiring speaking turns and strategies for system support across different interaction scenarios. Our findings addressed which interaction dynamics factors affect smooth turn-taking coordination (RQ1) and examined the differences in turn-taking difficulty across various interaction states and their implications for system support (RQ2).
Logistic regression analysis identified key features such as negative logarithm of speaking interval, utterance count, speaking duration ratio, speaking intention count, and speaking overlap ratio, all of which significantly impact turn-taking outcomes. Clustering analysis further revealed five typical interaction scenarios: prolonged single-turn scenario (C1), high-status role scenario (C2), intense interaction scenario (C3), low-activity scenario (C4), and high-competition scenario (C5). Each scenario presents unique challenges and requires different system support strategies.
The findings provide recommendations for developing context-aware VR turn-taking support systems. Future VR systems can adjust intervention strategies based on interaction states, such as optimizing priority management in high-competition scenarios or enhancing interaction prompts in low-activity contexts. These differentiated support strategies can reduce barriers to turn-taking, improve communication fluidity, and foster more coordinated multi-party interactions in VR environments, ultimately enhancing users’ immersive experience.

Author Contributions

Conceptualization, J.C., C.G. and S.K.; methodology, J.C. and C.G.; software, J.C. and C.G.; validation, J.Z. and Z.L.; formal analysis, J.C.; investigation, J.Z. and Z.L.; resources, S.K.; data curation, J.Z. and Z.L.; writing—original draft preparation, J.C. and C.G.; writing—review and editing, B.M. and S.K.; visualization, J.C.; supervision, S.K.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI grant numbers JP20H00622, JP23H03507, and JP23K02469.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Kyushu University (protocol code 202204, 5 December 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jerald, J. The VR Book: Human-Centered Design for Virtual Reality; Morgan & Claypool: San Rafael, CA, USA, 2015. [Google Scholar]
  2. Yassien, A.; ElAgroudy, P.; Makled, E.; Abdennadher, S. A design space for social presence in VR. In Proceedings of the 11th Nordic Conference on Human-Computer Interaction: Shaping Experiences, Shaping Society, Tallinn, Estonia, 25–29 October 2020; pp. 1–12. [Google Scholar]
  3. Barreda-Ángeles, M.; Horneber, S.; Hartmann, T. Easily applicable social virtual reality and social presence in online higher education during the covid-19 pandemic: A qualitative study. Comput. Educ. X Real. 2023, 2, 100024. [Google Scholar] [CrossRef]
  4. Steinicke, F.; Lehmann-Willenbrock, N.; Meinecke, A.L. A first pilot study to compare virtual group meetings using video conferences and (immersive) virtual reality. In Proceedings of the 2020 ACM Symposium on Spatial User Interaction, Virtual Event, 30 October–1 November 2020; pp. 1–2. [Google Scholar]
  5. Yoshimura, A.; Borst, C.W. Remote Instruction in Virtual Reality: A Study of Students Attending Class Remotely from Home with VR Headsets. In Mensch und Computer 2020—Workshopband; Gesellschaft für Informatik e.V.: Bonn, Germany, 2020. [Google Scholar] [CrossRef]
  6. Tanenbaum, T.J.; Hartoonian, N.; Bryan, J. “How do I make this thing smile?” An Inventory of Expressive Nonverbal Communication in Commercial Social Virtual Reality Platforms. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–13. [Google Scholar]
  7. Williamson, J.R.; O’Hagan, J.; Guerra-Gomez, J.A.; Williamson, J.H.; Cesar, P.; Shamma, D.A. Digital proxemics: Designing social and collaborative interaction in virtual environments. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–12. [Google Scholar]
  8. Maloney, D.; Freeman, G.; Wohn, D.Y. “Talking without a Voice” Understanding Non-verbal Communication in Social Virtual Reality. Proc. ACM Hum.-Comput. Interact. 2020, 4, 1–25. [Google Scholar] [CrossRef]
  9. Ishii, R.; Kumano, S.; Otsuka, K. Prediction of next-utterance timing using head movement in multi-party meetings. In Proceedings of the 5th International Conference on Human Agent Interaction, Bielefeld, Germany, 17–20 October 2017; pp. 181–187. [Google Scholar]
  10. Moustafa, F.; Steed, A. A longitudinal study of small group interaction in social virtual reality. In Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, Tokyo, Japan, 28 November–1 December 2018; pp. 1–10. [Google Scholar]
  11. Williamson, J.; Li, J.; Vinayagamoorthy, V.; Shamma, D.A.; Cesar, P. Proxemics and social interactions in an instrumented virtual reality workshop. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–13. [Google Scholar]
  12. Ishii, R.; Kumano, S.; Otsuka, K. Analyzing mouth-opening transition pattern for predicting next speaker in multi-party meetings. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 209–216. [Google Scholar]
  13. Lee, M.C.; Trinh, M.; Deng, Z. Multimodal Turn Analysis and Prediction for Multi-party Conversations. In Proceedings of the 25th International Conference on Multimodal Interaction, Paris, France, 9–13 October 2023; pp. 436–444. [Google Scholar]
  14. Li, L.; Molhoek, J.; Zhou, J. Inferring Intentions to Speak Using Accelerometer Data In-the-Wild. arXiv 2024, arXiv:2401.05849. [Google Scholar]
  15. Chen, J.; Gu, C.; Zhang, J.; Liu, Z.; Konomi, S. Sensing the Intentions to Speak in VR Group Discussions. Sensors 2024, 24, 362. [Google Scholar] [CrossRef]
  16. Sacks, H.; Schegloff, E.A.; Jefferson, G. A simplest systematics for the organization of turn-taking for conversation. Language 1974, 50, 696–735. [Google Scholar] [CrossRef]
  17. Duncan, S. Some signals and rules for taking speaking turns in conversations. J. Personal. Soc. Psychol. 1972, 23, 283–292. [Google Scholar] [CrossRef]
  18. Ford, C.E.; Thompson, S.A. Interactional units in conversation: Syntactic, intonational, and pragmatic resources for the management of turns. Stud. Interact. Socioling. 1996, 13, 134–184. [Google Scholar]
  19. Grosjean, F. Using prosody to predict the end of sentences in English and French: Normal and brain-damaged subjects. Lang. Cogn. Process. 1996, 11, 107–134. [Google Scholar] [CrossRef]
  20. De Ruiter, J.P.; Mitterer, H.; Enfield, N.J. Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation. Language 2006, 82, 515–535. [Google Scholar] [CrossRef]
  21. Novick, D.G.; Hansen, B.; Ward, K. Coordinating turn-taking with gaze. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP’96, Philadelphia, PA, USA, 3–6 October 1996; Volume 3, pp. 1888–1891. [Google Scholar]
  22. Streeck, J.; Hartge, U. Previews: Gestures at the transition place. In The Contextualization of Language; John Benjamins Publishing Company: Amsterdam, The Netherlands, 1992; pp. 135–157. [Google Scholar]
  23. Argyle, M.; Cook, M.; Cramer, D. Gaze and mutual gaze. Br. J. Psychiatry 1994, 165, 848–850. [Google Scholar] [CrossRef]
Figure 1. Workflow of the VR communication experiment.
Figure 2. Virtual environment used for the experiment (A). Participants engaged in discussions within this environment (B).
Figure 3. Example data of speaking behavior from the dataset, including participant-labeled speaking intention tags (light gray rectangles indicate successful speaking intention tags, dark gray rectangles indicate unsuccessful speaking intention tags; colored rectangles represent the utterances of individual participants).
Figure 4. The relationship between the participant-labeled tags and the categories of turn-taking outcomes.
Figure 5. Example of analysis window selection for tags. Dark gray rectangles represent unsuccessful speaking intention tags, while light gray rectangles represent successful speaking intention tags. t denotes 1.5 s. UT indicates utterances. The third example illustrates the case of window shifting.
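To make the windowing step concrete, the following is a minimal sketch, assuming speaking-intention tags are stored as timestamps and utterances as (speaker, start, end) tuples. The placement of the 1.5 s window relative to a tag is an illustrative assumption, and the window-shifting rule shown in the third example of Figure 5 is not reproduced.

```python
# Minimal sketch of analysis-window selection around a speaking-intention tag
# (t = 1.5 s, as in Figure 5). Window placement and the data layout are
# assumptions for illustration; the window-shifting rule is not reproduced.
from typing import List, Tuple

T = 1.5  # analysis window length in seconds

Utterance = Tuple[str, float, float]  # (speaker_id, start_s, end_s)

def analysis_window(tag_time: float, t: float = T) -> Tuple[float, float]:
    """Fixed-length window anchored at the tag timestamp (placement assumed)."""
    return (tag_time - t, tag_time)

def utterances_in_window(utts: List[Utterance],
                         window: Tuple[float, float]) -> List[Utterance]:
    """Utterances that intersect the analysis window."""
    start, end = window
    return [u for u in utts if u[2] > start and u[1] < end]
```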
Figure 6. Box plots of feature values for each cluster (A) and radar chart of normalized features (B).
Figure 7. The Kaplan–Meier survival curves for each cluster, where the survival state indicates that speaking intentions have not yet been converted into actual speaking behavior.
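As a rough illustration of how curves like those in Figure 7 could be produced, the sketch below uses the lifelines package; this tooling choice, the file name, and the column names are assumptions, since the paper does not specify its implementation. Each record is assumed to hold the waiting time from a speaking-intention tag, a flag marking whether the intention was eventually converted into speech, and a cluster label.

```python
# Illustrative Kaplan-Meier sketch for Figure 7 (assumptions: lifelines package,
# hypothetical CSV with duration_s, converted, cluster columns). The "event" is
# a speaking intention being converted into actual speech, so the survival
# curve tracks intentions that have not yet been realized.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

df = pd.read_csv("intention_outcomes.csv")  # hypothetical file

ax = plt.gca()
kmf = KaplanMeierFitter()
for cluster_id, group in df.groupby("cluster"):
    kmf.fit(group["duration_s"], event_observed=group["converted"],
            label=f"Cluster {cluster_id}")
    kmf.plot_survival_function(ax=ax)  # one survival curve per interaction context
ax.set_xlabel("Time since speaking-intention tag (s)")
ax.set_ylabel("Proportion of unrealized intentions")
plt.show()
```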
Table 1. Summary of analyzed interaction dynamics. (IS: individual status, SR: spatial relationships, IP: interaction proactivity, CQ: communication quality).
Dimensions | Features | Description of Features
IS | Negative Logarithm of Speaking Interval | The negative logarithm of the time interval since the participant's last speech, representing their positional difference in conversational dynamics.
SR | Group Area | The area of the minimum bounding rectangle that encompasses all participants' positions in the virtual environment during a speaking intention expression.
IP | Speaking Duration Ratio | The proportion of time, within the analysis window, where at least one participant is speaking, indicating overall vocal activity.
IP | Utterance Count | The total number of utterances within the analysis window, measuring the frequency of speech events.
IP | Speaking Intention Count | The number of speaking intention tags from other participants within the analysis window, reflecting the level of competition for speaking turns.
CQ | Max Speaker Duration Ratio | The ratio of the longest speaker's speaking time to the total speaking time, indicating conversational balance or dominance.
CQ | Speaking Overlap Ratio | The proportion of overlapping speaking time to the total speaking time, reflecting the degree of conversational disorder.
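To make the feature definitions in Table 1 concrete, the following is a minimal computational sketch for one analysis window. The data structures (utterances as (speaker, start, end) tuples, avatar positions as 2-D coordinates), the natural-logarithm base, the axis-aligned bounding rectangle, and the definition of "total speaking time" are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of the Table 1 feature computations for one analysis window.
# Data layout and several details (log base, axis-aligned bounding box,
# definition of total speaking time) are assumptions for illustration.
import math
from typing import Dict, List, Tuple

Utterance = Tuple[str, float, float]  # (speaker_id, start_s, end_s)

def union_length(intervals: List[Tuple[float, float]]) -> float:
    """Total time covered by a set of (start, end) intervals."""
    covered, prev_end = 0.0, None
    for start, end in sorted(intervals):
        if prev_end is None or start > prev_end:
            covered += end - start
            prev_end = end
        elif end > prev_end:
            covered += end - prev_end
            prev_end = end
    return covered

def window_features(utts: List[Utterance],
                    positions: List[Tuple[float, float]],
                    window: Tuple[float, float],
                    seconds_since_own_last_speech: float,
                    other_intention_tags: int) -> Dict[str, float]:
    w_start, w_end = window
    # Clip utterances to the analysis window.
    clipped = [(s, max(st, w_start), min(en, w_end))
               for s, st, en in utts if en > w_start and st < w_end]

    per_speaker: Dict[str, float] = {}
    for s, st, en in clipped:
        per_speaker[s] = per_speaker.get(s, 0.0) + (en - st)
    total_speech = sum(per_speaker.values())                    # summed over speakers
    voiced = union_length([(st, en) for _, st, en in clipped])  # any-speaker coverage

    xs, ys = zip(*positions)  # avatar positions at the moment of the intention tag
    return {
        "neg_log_speaking_interval": -math.log(max(seconds_since_own_last_speech, 1e-6)),
        "group_area": (max(xs) - min(xs)) * (max(ys) - min(ys)),
        "speaking_duration_ratio": voiced / (w_end - w_start),
        "utterance_count": float(len(clipped)),
        "speaking_intention_count": float(other_intention_tags),
        "max_speaker_duration_ratio": (max(per_speaker.values()) / total_speech
                                       if total_speech > 0 else 0.0),
        "speaking_overlap_ratio": ((total_speech - voiced) / total_speech
                                   if total_speech > 0 else 0.0),
    }
```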
Table 2. Descriptive Statistics of Features.
Features | Mean | Std | Min | Median | Max
Neg Log Speaking Interval | −2.563 | 1.463 | −5.710 | −2.617 | 2.900
Group Area | 15.258 | 7.403 | 0.768 | 17.067 | 32.755
Speaking Duration Ratio | 0.476 | 0.364 | 0.000 | 0.481 | 1.000
Utterance Count | 1.049 | 0.841 | 0.000 | 1.000 | 5.000
Speaking Intention Count | 0.169 | 0.396 | 0.000 | 0.000 | 2.000
Max Speaker Duration Ratio | 0.729 | 0.416 | 0.000 | 1.000 | 1.000
Speaking Overlap Ratio | 0.043 | 0.138 | 0.000 | 0.000 | 0.889
Table 3. Results of the logistic regression analysis and variance inflation factor (VIF).
Dimensions | Features | coef (β) | std err | z | p-Value | VIF
IS | Neg Log Speaking Interval | 0.2710 * | 0.110 | 2.470 | 0.014 | 1.076
SR | Group Area | −0.0852 | 0.108 | −0.790 | 0.429 | 1.020
IP | Speaking Duration Ratio | −0.9142 ** | 0.160 | −5.721 | 0.000 | 2.250
IP | Utterance Count | −0.5211 ** | 0.191 | −2.733 | 0.006 | 2.992
IP | Speaking Intention Count | −0.5344 ** | 0.117 | −4.577 | 0.000 | 1.095
CQ | Max Speaker Duration Ratio | 0.3258 | 0.167 | 1.948 | 0.051 | 2.078
CQ | Speaking Overlap Ratio | 0.3875 ** | 0.148 | 2.614 | 0.009 | 1.711
 | Constant | 0.4707 ** | 0.108 | 4.364 | 0.000 | 1.000
Note: * p < 0.05, ** p < 0.01. IS: Individual Status, SR: Spatial Relationships, IP: Interaction Proactivity, CQ: Communication Quality.
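The sketch below shows the kind of analysis summarized in Table 3: a logistic regression of turn-taking success on the seven window features, together with a VIF screen for multicollinearity. The DataFrame, file name, and column names are assumptions for illustration; the statsmodels calls are standard, but this is not claimed to be the authors' exact pipeline.

```python
# Hedged sketch of a logistic regression with VIF checks (cf. Table 3).
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = ["neg_log_speaking_interval", "group_area", "speaking_duration_ratio",
            "utterance_count", "speaking_intention_count",
            "max_speaker_duration_ratio", "speaking_overlap_ratio"]

df = pd.read_csv("intention_windows.csv")   # hypothetical per-window feature table
X = sm.add_constant(df[features])           # adds the intercept term
y = df["turn_taking_success"]               # 1 = intention led to an actual turn

model = sm.Logit(y, X).fit()
print(model.summary())                      # coefficients, std err, z, p-values

# Variance inflation factors for multicollinearity screening.
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)
```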
Table 4. Speaking intention outcomes across clusters (conflicting cases and unsuccessful cases combine to form obstructed cases).
Cluster | Instances (Count) | Conflicting (Ratio) | Unsuccessful (Ratio) | Obstructed (Ratio)
Prolonged Single-turn Scenario (C1) | 154 | 29.9% | 18.8% | 48.7%
High-status Role Scenario (C2) | 96 | 27.1% | 8.3% | 35.4%
Intense Interaction Scenario (C3) | 36 | 38.9% | 13.9% | 52.8%
Low-activity Scenario (C4) | 136 | 5.2% | 11.8% | 17.0%
High-competition Scenario (C5) | 69 | 40.6% | 27.5% | 68.1%
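A rough sketch of the clustering step behind Table 4 follows: standardize the window features, partition them into five interaction contexts, and tabulate intention outcomes per cluster. The choice of five clusters follows the paper; the scaler, the k-means algorithm, and the file and column names are assumptions for illustration.

```python
# Hedged clustering sketch (cf. Table 4). k = 5 follows the paper; the
# algorithm, scaler, and column names (is_conflicting, is_unsuccessful)
# are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

features = ["neg_log_speaking_interval", "group_area", "speaking_duration_ratio",
            "utterance_count", "speaking_intention_count",
            "max_speaker_duration_ratio", "speaking_overlap_ratio"]

df = pd.read_csv("intention_windows.csv")            # hypothetical per-window table
X = StandardScaler().fit_transform(df[features])     # normalize before clustering

df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Outcome ratios per cluster; "obstructed" aggregates conflicting + unsuccessful.
summary = df.groupby("cluster").agg(
    instances=("cluster", "size"),
    conflicting=("is_conflicting", "mean"),
    unsuccessful=("is_unsuccessful", "mean"),
)
summary["obstructed"] = summary["conflicting"] + summary["unsuccessful"]
print(summary)
```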