Article

An Efficient Gaze Control System for Kiosk-Based Embodied Conversational Agents in Multi-Party Conversations

1 Department of Information Convergence Engineering, Pusan National University, Busan 46241, Republic of Korea
2 School of Computer Science and Engineering, Pusan National University, Busan 46241, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(8), 1592; https://doi.org/10.3390/electronics14081592
Submission received: 5 February 2025 / Revised: 11 April 2025 / Accepted: 13 April 2025 / Published: 15 April 2025
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

Abstract

The adoption of kiosks in public spaces is steadily increasing, with a trend toward providing more natural user experiences through embodied conversational agents (ECAs). To achieve human-like interactions, ECAs should be able to appropriately gaze at the speaker. However, kiosks in public spaces often face challenges, such as ambient noise and overlapping speech from multiple people, making it difficult to accurately identify the speaker and direct the ECA’s gaze accordingly. In this paper, we propose a lightweight gaze control system that is designed to operate effectively within the resource constraints of kiosks and the noisy conditions common in public spaces. We first developed a speaker detection model that identifies the active speaker in challenging noise conditions using only a single camera and microphone. The proposed model achieved a 91.6% mean Average Precision (mAP) in active speaker detection and a 0.6% improvement over the state-of-the-art lightweight model (Light ASD) (as evaluated on the noise-augmented AVA-Speaker Detection dataset), while maintaining real-time performance. Building on this, we developed a gaze control system for ECAs that detects the dominant speaker in a group and directs the ECA’s gaze toward them using an algorithm inspired by real human turn-taking behavior. To evaluate the system’s performance, we conducted a user study with 30 participants, comparing the system to a baseline condition (i.e., a fixed forward gaze) and a human-controlled gaze. The results showed statistically significant improvements in social/co-presence and gaze naturalness compared to the baseline, with no significant difference between the system and human-controlled gazes. This suggests that our system achieves a level of social presence and gaze naturalness comparable to a human-controlled gaze. The participants’ feedback, which indicated no clear distinction between human- and model-controlled conditions, further supports the effectiveness of our approach.

1. Introduction

The service industry is increasingly adopting devices, such as kiosks and robots, to reduce labor costs. While these automated solutions aim to efficiently handle tasks, such as ordering, payment, and customer service, kiosks can sometimes inconvenience users due to complex and inconsistently designed interfaces across different companies [1]. This complexity can be a significant barrier, especially for seniors or those unfamiliar with digital devices. To improve user convenience, intelligent kiosks that integrate voice recognition and natural language processing have been developed. These kiosks are frequently equipped with embodied conversational agents (ECAs), which are designed to engage in human-like conversations using AI technology and digital human interfaces.
The use of ECAs enables interactions that incorporate both verbal and non-verbal cues, thanks to their embodiment. Non-verbal cues, such as facial expressions, gestures, prosody, and eye gaze, make users perceive ECAs as more natural and human-like. For instance, synchronizing various facial expressions with speech enhances ECAs’ realism, significantly influencing users’ preferences and immersion [2]. Non-verbal behavior accompanied by utterances also helps users perceive ECAs’ speech more effectively and meaningfully, providing a better conversational experience [3]. It has also been reported that the gestures of ECAs positively influence perceptions of human likeness, liveliness, and intelligence, effectively capturing the user’s attention [4]. Therefore, by utilizing non-verbal cues, ECAs can provide more human-like interactions with users, thereby playing a crucial role in enhancing immersion and satisfaction.
Among these non-verbal cues, eye gaze is particularly important in multi-party conversations. It helps identify the main speaker and regulates the flow of dialogue by facilitating interruptions and assisting in preventing or repairing disruptions [5,6].
Even with ECAs presented on 2D displays, where the Mona Lisa effect makes it challenging for users to perceive accurate eye gaze [7], directing the gaze toward the speaker showed a positive influence on users [8]. This study highlights that speaker-directed gaze enhances perceived social presence and credibility during interactions, whereas misaligned gaze can diminish the quality of user engagement. This result emphasizes the need for methodologies that ensure precise gaze alignment, thereby fostering more natural and effective multi-party conversations.
Despite the importance of eye gaze behavior in ECAs, integrating it into kiosk-based systems is challenging. Kiosks are typically placed in public spaces, like outdoor attractions or shopping centers, where ambient noise and overlapping conversations from multiple users create difficulties in identifying the dominant speaker and recognizing speech.
While various methods have been proposed to recognize active speakers in multi-party and/or noisy settings (cf. Related Work section), most rely on specialized hardware, such as directional or array microphones, limiting their generalizability. For example, algorithm-based methods are highly dependent on specific equipment settings, and model-based methods require training specific to the hardware used.
One approach to avoid retraining and minimize device dependency is to leverage deep learning models that are pre-trained on large-scale datasets. For instance, the AVA-Speaker Detection dataset—constructed from various Hollywood movies—includes videos captured at diverse camera resolutions and audio recordings from different environments [9]. Many audio–visual models trained on this dataset have demonstrated strong performance [10,11,12,13]. These methods leverage the multi-modal data from video by encoding visual features from sequences of facial images alongside audio signals. The fused features are then used to determine whether an individual is speaking within a given frame. Typically, such methods employ multi-stage frameworks to enhance detection performance. However, this dataset does not fully represent the range of noise encountered in real-world scenarios; consequently, models trained on the AVA-Speaker Detection dataset tend to experience degraded performance under actual noisy conditions [14].
In addition, kiosk systems deployed in public spaces typically have limited resources, often featuring only a single camera and microphone. These constraints preclude advanced microphone arrays that capture directional audio signals in noisy environments, making methods dependent on multi-channel audio less practical. Therefore, it is essential to develop lightweight, noise-robust models capable of accurate detection using only single-camera and single-microphone setups, even in environments with significant ambient noise. This capability is particularly important for kiosk-based ECAs, which must exhibit natural gaze behavior and fluid interactions during multi-party conversations.
In this paper, we propose an efficient gaze control system for kiosk-based ECAs. Our system integrates a real-time speaker detection model with an algorithm for dominant speaker identification, enabling ECAs to establish appropriate eye contact, even in noisy public environments, using only a single camera and microphone. Our model achieves a 91.6% mAP—a 0.6% improvement over the state-of-the-art lightweight model (Light ASD), as evaluated on the AVA-Speaker Detection dataset in noisy conditions—while maintaining real-time performance. We further validate our system through a user study comparing fixed-forward, human-controlled, and model-controlled gaze conditions. The findings indicate that our model-controlled gaze closely replicates human-controlled behavior, enhancing social/co-presence and interaction naturalness. This work advances the development of socially aware ECAs suitable for resource-limited kiosk environments and contributes to improving ECA interactions in noisy public spaces.

2. Related Work

2.1. Gaze Behavior of Social Agents

ECAs can convey various non-verbal behaviors, such as prosody, facial expressions, and gestures, each of which plays a distinct role in conversation [15]. For instance, eye gaze aids in gathering information, signaling the focus on a subject, and expressing emotions toward the subject [16,17]. According to Ding et al. [5], participants can identify a speaker in a multi-party conversational environment using eye gaze alone, even in the absence of verbal information. Kendon [6] also demonstrated that the direction of eye gaze helps manage conversations. Furthermore, Kendon found that eye gaze plays a crucial role in preventing and repairing conversational breakdowns, thereby confirming its function in regulating the flow of dialogue. In interpersonal communication, eye gaze often manifests as eye contact. Mason et al. [18] showed that direct eye contact conveys interest and respect toward a conversational partner, whereas avoiding eye contact may indicate discomfort or a lack of interest. Thus, eye gaze is a vital non-verbal element with multiple roles in the conversational process.
Similarly, eye contact has been shown to facilitate social interaction with robots. Shimada et al. [19] conducted experiments using robots to explore the effects of eye contact on users. Their experiments confirmed that even if a robot makes eye contact with only a few people or merely appears to do so, it can still create a positive impression on those it gazes at. Likewise, controlling the robot to gaze at a person’s face can increase the amount of eye contact between humans and robots [20,21]. In a study by Kiilavuori et al. [22], gaze exchange in robots was found to trigger automatic emotional and attentional responses similar to those elicited by eye contact with humans, highlighting the importance of eye contact in creating an enjoyable experience during human–robot social interactions.
These findings have been similarly reflected in ECAs. Bee et al. [23] investigated how ECAs use eye contact to initiate conversations. In their study, ECAs established eye contact by aligning the agent’s gaze with the participant’s eye direction during one-on-one interactions. The results indicated that participants felt more comfortable and perceived the agent as more attentive when eye contact was made. The researchers employed subjective measures to assess factors, such as engagement, exclusion of external distractions, and perceived quality of gaze behavior, alongside objective measures analyzing participants’ own eye gaze behavior. Both subjective and objective findings confirmed that users were more inclined to continue interacting with agents that established eye contact. Kontogiorgos et al. [24] conducted a study comparing a smart speaker with two variations of ECA to determine whether subjects’ visual attention is similar across different forms of embodiment and social eye contact. The results indicated that ECA eye contact during conversation increased social facilitation and was highly rated as a natural means of communication. Their study also confirmed that performing social behaviors, beyond mere embodiment, plays a crucial role in fostering a sense of familiarity.
Recent research has highlighted the importance of eye contact in conversational interactions. Choi et al. [25] conducted an experiment using an ECA in a counseling context and found that non-verbal cues, particularly eye contact from the counselor, increased participants’ empathy, reduced their anger intensity, and improved the overall effectiveness of the counseling interaction. Similarly, Kum et al. [8] demonstrated that a 2D-display ECA employing eye contact in a multi-party conversational setting—via a Wizard of Oz method, where a human operator controlled the agent’s gaze—significantly enhanced users’ perceptions of the agent’s interpersonal skills, competence, and co-presence, despite limitations in precisely detecting eye contact. These findings suggest that ECAs deployed in public spaces, such as kiosks, should consistently maintain eye contact with users. To achieve this effectively in multi-party settings, it is crucial for ECAs to accurately identify the active speaker in the group.

2.2. Speaker Detection

Active speaker detection is the task of identifying the person currently speaking in a group, and it has been actively researched. In previous studies [26,27,28,29,30], multiple microphones have been utilized to determine the origin of sound. Subsequently, by tracking users, these systems identified the user at that location as the speaker. Other studies [31,32,33] have employed deep learning or reinforcement learning to estimate the speaker’s location using multiple microphones. More recent research [34,35,36] has focused on accurately tracking speakers in multi-user environments using models that integrate audio and video data simultaneously. However, methods relying on multiple microphone arrays to obtain accurate 3D speaker positions typically result in device-specific dependencies, limiting their general applicability.
For broader generalization, it is important that active speaker detection methods operate effectively across various devices. Therefore, instead of depending on specialized microphone setups, it is crucial to develop approaches that synchronize audio signals with visual cues, such as lip movements, using only a single microphone input. The AVA-Speaker Detection dataset [9] was specifically annotated to indicate active speakers in video. This large-scale dataset features diverse lighting conditions, image qualities, and audio quality levels, and it has facilitated the development of various models. One such approach is the lightweight end-to-end active speaker detection (Light ASD) model, which achieves real-time operation and a mean Average Precision (mAP) of 94.2% [12].
However, since the AVA-Speaker Detection dataset comprises scenes from Hollywood movies, it lacks recordings from environments containing various types of background noise [14]. To address this limitation, we aimed to develop a speaker detection system robust enough to function effectively in noisy, multi-party conversational environments. We trained our model by incorporating noisy conditions from the Living Environment Noise dataset (this research utilized datasets from the Open AI Dataset Project, AI-Hub, South Korea, and all of the data are available at www.aihub.or.kr), ensuring robust performance against noise interference. Additionally, we propose a system designed to facilitate seamless gaze processing during multi-party conversations.

3. Method

3.1. Real-Time Speaker Detection Model

This section presents a speaker detection model optimized for noisy environments. This model enhances the existing Light ASD model and is designed to perform robustly in the presence of noise [12]. The structure of the model is illustrated in Figure 1. It consists of three main components: an audio encoder that extracts features from audio data, a visual encoder that extracts features from the input sequence of face images, and a detector that combines these two feature vectors to determine whether the person is speaking.
Figure 2 shows the Time-Dilated Video (TDV) and Time-Dilated Audio (TDA) block structures. In traditional speaker detection models, 3D convolution has been used to construct the visual feature encoder [37,38]. This approach extracts temporal features by operating on consecutive images of the user’s face. However, 3D convolution increases the number of parameters and the computational load of the model. To address this issue, various studies have demonstrated the effectiveness of decomposing the operation into 2D and 1D convolutions, thereby reducing the computational burden associated with 3D convolutions [39,40]. This dimension-splitting approach has proven effective in extracting both spatial and temporal features, and it underpins the design of Light ASD [12].
The proposed TDV and TDA blocks follow this approach by dividing the convolution operation into per-frame operations and operations along the time axis. Specifically, dilated convolution is applied along the time axis, allowing the model to observe a wider receptive field over time. Dilated convolution differs from traditional convolution in that the size of the receptive field varies with the dilation rate used during the operation [41]. This type of convolution is commonly employed in models that analyze long time-series data, such as audio-based models [42,43]. Unlike the original Light ASD model, which uses only conventional (non-dilated) 1D convolutions along the time axis, the TDV and TDA blocks employ dilated convolution in their temporal layers. As Figure 2 shows, this allows a model with fewer layers to cover a broader temporal range. Each block consists of three layers: the first extracts spatial features from individual frames; the second extracts temporal features using dilated convolution, covering a broader span of the time series; and the last combines the features extracted with kernel sizes of 3 and 5 through a convolution with a kernel size of 1 to integrate the information.
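For illustration, the following PyTorch sketch shows one way such a block could be factorized into a per-frame spatial convolution, a dilated temporal convolution, and a kernel-size-1 fusion of the kernel-3 and kernel-5 branches. Channel counts, class names, and other details are assumptions rather than the authors' exact implementation.

```python
# Hypothetical sketch of a Time-Dilated Video (TDV) block, assuming an input of
# shape (batch, channels, time, height, width).
import torch
import torch.nn as nn

class TDVBranch(nn.Module):
    """One branch: per-frame spatial conv followed by a dilated temporal conv."""
    def __init__(self, c_in, c_out, k, dilation, stride=1):
        super().__init__()
        self.spatial = nn.Sequential(            # layer 1: spatial features per frame
            nn.Conv3d(c_in, c_out, (1, k, k), stride=(1, stride, stride),
                      padding=(0, k // 2, k // 2), bias=False),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))
        self.temporal = nn.Sequential(           # layer 2: dilated conv along the time axis
            nn.Conv3d(c_out, c_out, (k, 1, 1), dilation=(dilation, 1, 1),
                      padding=(dilation * (k // 2), 0, 0), bias=False),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.temporal(self.spatial(x))

class TDVBlock(nn.Module):
    """Kernel-3 and kernel-5 branches fused by a 1x1x1 convolution (layer 3)."""
    def __init__(self, c_in, c_out, dilation=1, stride=1):
        super().__init__()
        self.b3 = TDVBranch(c_in, c_out, 3, dilation, stride)
        self.b5 = TDVBranch(c_in, c_out, 5, dilation, stride)
        self.fuse = nn.Sequential(
            nn.Conv3d(2 * c_out, c_out, 1, bias=False),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.fuse(torch.cat([self.b3(x), self.b5(x)], dim=1))
```

A TDA block would follow the same pattern, with the per-frame spatial convolution replaced by a convolution over the MFCC coefficient axis.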
The visual encoder shown in Figure 3 receives gray-scale face images of the speaker at a resolution of 112 × 112 pixels. Three TDV blocks were used, with the dilation rate (D) set to 1 in the first block and 2 in the subsequent blocks. Each convolution operation was followed by batch normalization and ReLU activation. For computational efficiency, the stride was set to 2 in the first TDV block and 1 in the following blocks. Additionally, max pooling was applied twice, and, in the final stage, the visual features were averaged using global average pooling to obtain the video feature vector.
The structure of the audio encoder is shown in Figure 3. The audio input is processed into a Mel-Frequency Cepstral Coefficient (MFCC), which is a feature commonly used in speaker recognition models [10,12,37]. The MFCC feature map is a time series consisting of 13 coefficients. The input length of the audio data was four times that of the video frames. The audio encoder employed four TDA blocks to ensure that the receptive field along the time axis in the audio matched that of the video. The dilation rate was set to 1 in the first two TDA blocks and 2 in the following two blocks. Max pooling was applied twice along the temporal axis to match the video frames. The audio features were averaged in the final stage using global average pooling to obtain the audio feature vector. In addition, we incorporated a bidirectional GRU (BiGRU) block into the audio encoder model to further enhance the temporal context in both forward and backward directions, complementing the original Light ASD audio encoder.
Unlike Light ASD, which directly adds audio and video features, our approach concatenates these features and feeds them into a fully connected layer to produce an audio–visual (AV) feature, as shown in Figure 3. This AV feature is then passed through a BiGRU, enabling the model to capture bidirectional temporal and contextual information from the data. Finally, a fully connected layer with 128 features determines whether a person is speaking in each frame.
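A minimal sketch of this detector stage is given below, assuming both feature sequences are 128-dimensional per frame; the exact layer sizes and activations of the final classifier are assumptions.

```python
# Hypothetical sketch of the detector: the audio and visual feature sequences are
# concatenated, projected to an audio-visual (AV) feature by a fully connected layer,
# passed through a BiGRU, and classified per frame.
import torch
import torch.nn as nn

class AVDetector(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)   # concat -> AV feature
        self.bigru = nn.GRU(feat_dim, feat_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(                # final FC stage (sizes assumed)
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 1))

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (batch, time, feat_dim), one vector per video frame
        av = torch.relu(self.fuse(torch.cat([audio_feat, video_feat], dim=-1)))
        av, _ = self.bigru(av)                          # bidirectional temporal context
        return self.classifier(av).squeeze(-1)          # per-frame speaking logits
```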

3.2. Gaze Control System for Multi-Party Conversations

Based on the proposed speaker detection model, we implemented a gaze control system for ECAs engaged in multi-party conversations. The overall architecture is shown in Figure 4. We used the PyTorch framework with Python 3.9, and all of the experiments reported in this paper were conducted on a Windows 11 PC equipped with an NVIDIA GeForce RTX 4090 GPU. Images were captured at 25 fps using an Intel RealSense D455 camera, and the audio was recorded with a USB mono microphone. Both audio and video were preprocessed before being fed into the speaker detection model.
In the image processing step of the gaze control system, bounding boxes around the users’ faces were first detected and saved for each frame using YOLOv8 (https://github.com/ultralytics/ultralytics, accessed on 5 February 2025). When a new user appeared, a unique ID was assigned; if a user was not detected for more than ten consecutive frames, the ID was removed to improve computational efficiency. Additionally, the bounding boxes detected by YOLOv8 fluctuated in size from frame to frame, so post-processing was required to stabilize them.
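The following sketch illustrates this ID bookkeeping using the Ultralytics tracking API; the face-detection weights file and the data structures are assumptions, and only the ten-frame expiry rule is taken from the text.

```python
# Minimal sketch of per-user ID management around YOLOv8 tracking.
from ultralytics import YOLO

model = YOLO("yolov8n-face.pt")   # assumed face-detection weights (exact model file not specified)
missed = {}                        # track_id -> consecutive frames without a detection

def update_faces(frame, max_missed=10):
    """Return {track_id: [x1, y1, x2, y2]} for the current frame and expire stale IDs."""
    results = model.track(frame, persist=True, verbose=False)[0]
    boxes, seen = {}, set()
    if results.boxes.id is not None:
        for box, tid in zip(results.boxes.xyxy.tolist(), results.boxes.id.int().tolist()):
            boxes[tid] = box
            seen.add(tid)
            missed[tid] = 0
    # Remove IDs that have not been detected for more than max_missed frames
    for tid in list(missed):
        if tid not in seen:
            missed[tid] += 1
            if missed[tid] > max_missed:
                del missed[tid]
    return boxes
```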
x̄_t = α·x̄_(t−1) + (1 − α)·x_t,  0 ≤ α ≤ 1.  (1)

Noise reduction was performed using a low-pass filter, as shown in Equation (1), where x̄_t denotes the filtered bounding box position at frame time t, x_t represents the raw bounding box position at the same frame time, and the parameter α controls the extent to which the previously filtered value influences the current one.
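As a concrete example, the filter in Equation (1) can be applied to the four box coordinates as follows; the α value shown is illustrative, since the paper does not state the exact smoothing constant used for the bounding boxes.

```python
# Minimal sketch of the bounding-box low-pass filter in Equation (1).
def smooth_box(prev_box, new_box, alpha=0.8):
    """Exponential smoothing of [x1, y1, x2, y2] coordinates across frames."""
    if prev_box is None:          # first observation: nothing to smooth against
        return list(new_box)
    return [alpha * p + (1.0 - alpha) * n for p, n in zip(prev_box, new_box)]
```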
Alongside YOLOv8-based face detection, additional video processing was performed. For each unique user ID, the corresponding face region was cropped using the stored post-processed bounding box and resized to meet the input requirements of the speaker detection model. For audio processing, the segment corresponding to the input audio duration was converted into MFCCs using the python-speech-features library (https://python-speech-features.readthedocs.io/en/latest/, accessed on 5 February 2025). To improve performance, the input data length was set to approximately 2 s.
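A minimal sketch of this audio preprocessing step is given below, assuming 16 kHz mono input and the library's default 25 ms window with a 10 ms hop, which yields roughly four MFCC frames per 25 fps video frame, matching the ratio described earlier; the exact parameters used by the authors are not specified.

```python
# Sketch of converting an audio segment to a 13-coefficient MFCC time series.
import numpy as np
from python_speech_features import mfcc

def audio_to_mfcc(signal: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # 13 cepstral coefficients per 25 ms window, 10 ms step -> array of shape (num_frames, 13)
    return mfcc(signal, samplerate=sample_rate, numcep=13, winlen=0.025, winstep=0.01)
```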
Based on the model’s output, a post-processing step was performed to determine whether the identified speaker was the dominant speaker in the conversation. Specifically, the post-processed data were fed into the model, which evaluated each frame—even in the presence of noise or overlapping speech—to determine whether the user was speaking. For each frame, the model output a binary value: 1 if the user was speaking and 0 otherwise, thereby enabling reliable identification of the active speaker. We designed this post-processing method based on the findings of a previous study [44] that examined the relationship between participants’ respiration and subsequent turn-taking behavior in a four-person conversation. That study found statistically significant differences in the speakers’ breathing patterns, with a correlation coefficient of 0.43 associated with maintaining the speaking turn. The median breathing duration was approximately 0.6 s when the speaker maintained the turn and about 1.2 s when the turn transitioned to another speaker.
If any of the outputs within a 15-frame window (approximately 0.6 s) indicated that a user was speaking, the conversational turn was considered to have been maintained during that period. Subsequently, these 15 stored values were used to determine whether the speaker’s turn continued. Because user utterances could overlap, the overall score was calculated by assigning greater weight to more recent values. This score was computed using the same low-pass filter formula described in Equation (1), and it was applied to 15 values with α set to 0.7.
Through this process, we constructed an algorithm to determine the dominant speaker and the target of the ECA’s gaze in a multi-party conversational environment. Specifically, a score was computed for each user to represent their speaking activity in each frame, and the user with the highest score was identified as the dominant speaker. Additionally, if the current and new speakers received equal scores, the current speaker was retained as the dominant speaker.
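The sketch below illustrates this decision rule under the stated parameters (a 15-frame window, roughly 0.6 s at 25 fps, and α = 0.7); the function names and data structures are hypothetical.

```python
# Sketch of the dominant-speaker decision: a recency-weighted score per user,
# with ties resolved in favor of the current speaker.
def turn_score(speaking_flags, alpha=0.7):
    """Score over the last 15 binary speaking outputs, oldest first, so that
    more recent frames receive greater weight (Equation (1) applied recursively)."""
    score = 0.0
    for flag in speaking_flags:
        score = alpha * score + (1.0 - alpha) * float(flag)
    return score

def pick_dominant_speaker(windows, current_id):
    """windows: {user_id: list of 15 binary model outputs}."""
    scores = {uid: turn_score(flags) for uid, flags in windows.items()}
    best_id = max(scores, key=scores.get)
    if current_id in scores and scores[current_id] == scores[best_id]:
        return current_id                     # ties keep the current speaker
    return best_id
```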

4. Experiment 1

In Experiment 1, we compared the training results of our enhanced model with those of the base model (Light ASD).

4.1. Dataset

We used two datasets: the AVA-Speaker Detection dataset [9] and the Living Environment Noise dataset. The AVA-Speaker Detection dataset is a large-scale benchmark for detecting speakers. It consists of 262 Hollywood movie clips. Of these, 120 were used for training and 33 were used for validation. Because test data were not provided, the performance was compared using validation data, similar to previous studies [12,45,46]. The dataset labels were classified as SPEAKING AUDIBLE, SPEAKING NOT AUDIBLE, and NOT SPEAKING. In this study, only the SPEAKING AUDIBLE label was considered as the speaking state for training purposes. While the dataset includes various situations, such as overlapping speech, low resolution, diverse lighting conditions, and low-quality audio, it does not encompass a variety of audio situations in real-life noisy environments [14]. Therefore, relying solely on this dataset poses limitations when addressing noisy environments such as public places.
To overcome this issue, we used the Living Environment Noise dataset. This dataset contains various noises that occur in daily life, such as car driving sounds, ambient noise around school districts, construction site noise, and dog barking. It comprises four categories: inter-floor noise, industrial noise, construction noise, and traffic noise. There are 115,191 noise samples in total, divided into training (92,159 samples), validation (11,529 samples), and test (11,503 samples) sets. The training and validation data from this dataset were used for training.
We followed data augmentation methods from existing research to simulate noisy environments for speaker detection models [47,48,49]. Specifically, we synthesized noisy speech by mixing the speech data from the AVA-Speaker Detection dataset with noise samples from the Living Environment Noise dataset at signal-to-noise ratios (SNRs) between −10 and 10 dB. This approach simulates real-world noisy environments and improves the robustness of the model to noise.
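As an illustration, clean speech and noise can be mixed at a target SNR roughly as follows; the power-based scaling shown here is a common approach and an assumption about the exact augmentation code.

```python
# Minimal sketch of mixing a noise sample into clean speech at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise to the speech length
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise_scaled) equals snr_db
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: noisy = mix_at_snr(speech, noise, snr_db=np.random.uniform(-10, 10))
```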

4.2. Implementation Details

To compare structural performance with the existing model, we adopted the loss function used in the Light ASD model. Training was conducted on an NVIDIA GeForce RTX 4090 GPU. The model was trained for 30 epochs using the Adam optimizer [50], with the learning rate starting at 0.001 and decaying by 5% after each epoch. Following the standard protocol of the AVA-ActiveSpeaker dataset, we adopted mAP as the evaluation metric [9,10,11,12,13]. Each human face was represented as a 112 × 112 grayscale image for input, and the audio was converted to MFCC features.
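A hedged sketch of this training configuration is shown below; the use of PyTorch's ExponentialLR scheduler is an assumption, chosen only to reproduce the stated 5% per-epoch decay.

```python
# Sketch of the optimizer and learning-rate schedule described above.
import torch

def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # initial LR 0.001
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # ~5% decay per epoch
    return optimizer, scheduler
```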

4.3. Result and Discussion

Table 1 summarizes the performance evaluation results for the Light ASD model and the improved model proposed in this study. Light ASD (without retraining) refers to the model trained on the AVA-Speaker Detection dataset without noise conditions, which originally showed a 94.1% mAP on the non-noise-augmented validation set. Light ASD (with retraining) is the model retrained with the addition of the Living Environment Noise dataset, and Ours is the model proposed in this research. On the noise-augmented validation set, Light ASD (without retraining) achieved 90.2%, a decrease of 3.9% from its original performance. After retraining, the performance improved to 91.0%, indicating that including environmental noise during training made the model more robust to noise. The proposed model further improved the performance to 91.6%, an increase of 0.6% over the retrained Light ASD model. This confirms that the proposed model outperforms the existing model in everyday noisy environments.
For an accurate comparison, we analyzed performance across different SNR levels. Noise was applied to all evaluation data at levels of −10, −5, 0, 5, and 10 dB. As shown in Table 2, the Light ASD (with retraining) model improved its performance over the Light ASD (without retraining) model at all SNR levels except for 10 dB. In addition, our proposed model outperformed both the non-retrained and retrained versions across all SNR levels, including 10 dB. Notably, at −10 dB, retraining yielded a 2.4% improvement over the Light ASD (without retraining) model, and our proposed method further boosted performance by an additional 0.6% compared to the Light ASD (with retraining) model. Similarly, at −5 dB, the Light ASD (with retraining) model showed a 1.8% improvement, with our method providing an additional 0.6% gain over it.
To evaluate the capability of the visual encoder alone in determining speech, we conducted an experiment in which both the SPEAKING AUDIBLE and SPEAKING NOT AUDIBLE labels were set to True, and the NOT SPEAKING label was set to False. This allowed the model to predict speaking activity using only visual information. As shown in Table 3, our proposed visual encoder achieved an mAP of 81.12%, an improvement of 2.87% over the existing Light ASD visual encoder.
Table 4 presents the performance evaluation of the receptive field changes introduced by the TDA and TDV blocks. woD denotes the model whose visual and audio blocks do not incorporate any dilation. AD and VD are both derived from the woD model: AD integrates the TDA block, while VD incorporates the TDV block. Lastly, AD + VD applies both the TDA and TDV blocks. From the experimental results, we observed that VD exhibited the lowest performance at 91.3%. Moreover, compared to the video encoder performance shown in Table 3, extending the receptive field solely along the time axis for video resulted in a performance decline relative to not applying it. In the case of AD, performance improved by about 0.1% compared to woD. When both were applied (AD + VD), there was an improvement of 0.2%. These results confirm that expanding the receptive field along the time axis for both audio and video can improve performance in noisy environments.
Table 5 presents an experiment examining how adding a BiGRU layer to the audio and visual encoders affects model performance in a noisy environment. In woBiGRU, no BiGRU was added to either encoder of the proposed model. In ABiGRU, a BiGRU was added to the audio encoder, and, in VBiGRU, a BiGRU was added to the visual encoder of the woBiGRU model. Lastly, in ABiGRU + VBiGRU, BiGRUs were added to both encoders. The woBiGRU model showed a performance of 91.1%. In the case of ABiGRU, the performance improved by 0.5% to 91.6%. In contrast, the performance decreased in VBiGRU and ABiGRU + VBiGRU, where a BiGRU was added to the visual encoder. Accordingly, we added the BiGRU only to the audio encoder in the proposed model.
Table 6 compares the performance of the different multimodal fusion methods used in the detector. woConcat refers to the original Light ASD approach applied to our model, which simply adds the 128-dimensional visual and audio features, whereas Concat denotes the method proposed in this paper, which concatenates the visual and audio features and combines them through a fully connected layer. With the proposed method, we confirmed a 0.4% performance improvement over the existing approach.
As additional layers were included to improve model performance, the parameter count increased from 1.02 million in Light ASD to 1.27 million in our proposed model. Table 7 compares real-time operational performance: it reports the average inference time measured on an NVIDIA GeForce RTX 4090 under identical conditions, with each measurement repeated 1000 times. Overall, the inference time of the proposed model increased slightly. For approximately 40 s of data (1000 frames), the inference time increased by approximately 1.41 ms compared to Light ASD, and for a single frame the increase was approximately 0.97 ms. Interestingly, for 20 s of data (500 frames), our model was faster than Light ASD by approximately 0.34 ms. Although the inference time increased slightly owing to the larger model size, real-time operation remained unaffected.
In summary, we addressed the weaknesses of the Light ASD model in handling noise. The proposed model achieved a performance of 91.6%, an improvement of approximately 0.6% over the existing model’s 91.0%. This enhancement allows accurate tracking of whether a user is speaking, even in public places or outdoor settings where noise is present. Building on this model, we further evaluated its effectiveness by designing software that allows an ECA to detect speakers and make eye contact with them.

5. Experiment 2

In Experiment 2, a multi-party conversation gaze control system was applied to ECAs to evaluate whether natural eye contact could be achieved with the main speaker during a multi-party conversation. We conducted a user study to evaluate our gaze control system. This study was conducted with the approval of the Institutional Review Board (IRB) at the authors’ institution, and informed consent was obtained from all participants prior to the experiment.

5.1. Method

The experiment used a within-subject design, where participants experienced three conditions:
  • None: The ECA gazed straight ahead without focusing on a specific object, reflecting the typical eye gaze behavior of ECAs used in kiosks.
  • Human: A human operator controlled the ECA’s gaze, making it look at the target of attention during the conversation.
  • System: The gaze control system controlled the ECA’s gaze.
The participants experienced all conditions, and the order was counterbalanced using a Latin square. A single experimenter controlled the ECA’s gaze in the Human condition to ensure consistency, and the human-controlled gaze was designed to be as natural as possible. A graphical user interface (GUI) was developed to facilitate gaze control by the human operator, as shown in Figure 5. The GUI included buttons for setting the gaze conditions: None, System, and Human. The Center button reset the ECA’s gaze in the human-controlled condition. To ensure a consistent experience, the Wizard of Oz paradigm was used, allowing the experimenter to control the conversation content by pressing the appropriate buttons and thereby avoid introducing unintended conversation topics. Additional buttons—such as HI for greeting; Question: n to prompt a new question (with n being the question number); RE to repeat a question; and BYE to end a conversation—were included to manage the ECA’s interactions.
The participants’ 3D locations were essential for enabling the gaze interactions with ECAs. To achieve this, images from an Intel RealSense D455 camera were processed using the YOLOv8 model to detect the participants’ faces. The depth information from the camera was then used to calculate the actual 3D spatial position using Pyrealsense2 (https://dev.intelrealsense.com/docs/python2, accessed on 5 February 2025). The experimenter directed the ECA’s gaze by clicking on a red box around the user’s face in the captured image (see Figure 5). In addition, bounding box colors were visualized: green for a speaking person, and red for a non-speaking person.
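The following sketch shows one way to recover a participant's 3D position from the detected face box and the depth frame using pyrealsense2 deprojection; the helper name and the assumption that the color and depth frames are aligned are ours.

```python
# Hedged sketch of converting a face-box centre to a 3D point with pyrealsense2.
import pyrealsense2 as rs

def face_center_to_3d(depth_frame, box):
    """box: (x1, y1, x2, y2) in pixel coordinates of the aligned depth frame."""
    cx, cy = int((box[0] + box[2]) / 2), int((box[1] + box[3]) / 2)
    depth = depth_frame.get_distance(cx, cy)                          # metres at the box centre
    intrin = depth_frame.profile.as_video_stream_profile().intrinsics
    return rs.rs2_deproject_pixel_to_point(intrin, [cx, cy], depth)   # [X, Y, Z] in metres
```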

5.2. Materials

Figure 6 illustrates the experimental setup. The experiment was conducted in an open, uncontrolled office space. As shown in Figure 6a, a desk was placed in front of a 65-inch TV so that the virtual environment appeared to extend seamlessly from the real desk. This arrangement was designed to provide an immersive experience by merging the real and virtual spaces. The virtual space featuring the ECAs was developed using the Unity game engine (https://unity.com/, accessed on 5 February 2025). Since the experiment was conducted in an office setting, the ECA appeared within a virtual background that simulated an office environment. The two participants were positioned at a distance at which their gaze toward the 2D display could be recognized, allowing precise monitoring of their gaze [8]. Participants were also seated so that they could not see the experimenter controlling the ECAs.
Figure 7a shows the virtual environment used in the experiment. Each ECA had a unique hairstyle, clothing, and skin tone, as illustrated in Figure 7b. The three ECAs were modeled to resemble ordinary women, each based on photos of a different individual, and were created using Character Creator 3 (https://www.reallusion.com/character-creator/, accessed on 5 February 2025). Distinct voices for each ECA were generated using the Naver Cloud Platform’s CLOVA Voice API (https://developers.naver.com/products/clova/tts/, accessed on 5 February 2025), giving each character a unique voice. Lip movements for dialogue were synchronized using the SALSA LipSync Suite (https://crazyminnowstudio.com/unity-3d/lip-sync-salsa/, accessed on 5 February 2025), while gaze and body coordination were managed by Final IK (http://root-motion.com/, accessed on 5 February 2025), enabling the ECAs to engage in natural conversation and establish eye contact with the participants.

5.3. Procedure

Two participants participated in each session: one was the main participant and the other was a dummy participant. To maintain consistency, the participants were paired with a dummy of the opposite gender. The ECA was chosen randomly from a set of three for each condition.
The ECA questions were structured as shown in Table 8. The experimental scenario involved conversation among three entities: one ECA, one dummy, and one participant. Both the participant and the dummy responded to these questions. The questions covered topics of daily life. The dummy provided standardized responses in all experiments and answered first whenever the main participant hesitated, ensuring a smooth and natural flow of conversation. Furthermore, the conversation order was not predetermined, and anyone was free to initiate responses, allowing overlapping dialogue to occur naturally, as in real-life interactions. For consistency in the experiment, the dummy conducted conversations with fixed answers to all the questions.
The experiment proceeded as follows: upon arrival, participants completed a preliminary questionnaire on a tablet. Both the participant and the dummy then sat in the designated chairs, as shown in Figure 7. The seating arrangement was fixed for all sessions. Participants were instructed to respond freely to the ECA’s questions, providing at least three sentences in their answers. Each condition included a greeting, five questions, and a closing statement from the ECA. Following each condition, the participants completed a questionnaire. After all three conditions, the participants were interviewed to discuss any differences they noticed among the conditions. All sessions were recorded on video, with the participants notified in advance.

5.4. Measurement

The pre-questionnaire included demographic questions and assessed the participants’ attitudes toward robots using the Negative Attitudes toward Robots Scale (NARS) [51]. Also, the questionnaire included a 7-point Likert scale ranging from 1 (never) to 7 (daily) to gauge the participants’ previous interactions with ECAs or robots. NARS responses were measured using a 5-point Likert scale, and the results were analyzed by averaging the six items.
For the subjective measures, the post-questionnaire was administered three times (once for each condition) and focused on social presence, co-presence, and gaze naturalness.
  • Social Presence: We utilized the questionnaire developed by Bailenson et al. [52], which includes five items designed to measure participants’ perceptions of the extent to which they perceive the ECA as a social being.
  • Co-presence: We utilized the questionnaire developed by Harms and Biocca [53], which consists of six items designed to measure the participants’ sense of not being alone or secluded, their peripheral and focal awareness of others, and others’ awareness of them.
  • Gaze Naturalness: Participants evaluated whether the ECA’s gaze felt natural. This was measured using a single item on a 5-point Likert scale.
For objective measures, we recorded a video of the two participants seated side by side, captured their conversation on audio, and also recorded the display screen showing the ECA. Using the videos, we analyzed the participants’ speaking time and calculated the proportion of frames in which they visually focused on the ECA relative to the total frames.

5.5. Participants

Experiment 2 was approved for research with human participants. Participants were recruited through a post on an online community. A total of 30 participants took part in this study, including 18 males and 12 females, with an average age of 23.76 years (SD = 2.69). All participants took part voluntarily. Most had little or no experience interacting with ECAs or robots (M = 1.75, SD = 0.97), and they were recruited from various academic departments.

5.6. Result and Discussion

This section presents the statistical methods and the results of the analysis. Statistical evaluations were conducted using IBM SPSS version 27, with the significance level set at 5%. Each measure was analyzed using the Friedman test [54], followed by Wilcoxon signed-rank tests for pairwise comparisons with Bonferroni-corrected p-values (an illustrative script reproducing this pipeline is given after the list below). The measurements were averaged, and Cronbach’s α values for each construct are reported in Table 9. The results are shown in Figure 8.
  • Social Presence: There were significant differences between the None and Human (p < 0.01), as well as between the None and System (p < 0.05), conditions. There was no significant difference between the Human and System (see Figure 8a) conditions.
  • Co-presence: There were significant differences between the None and Human (p < 0.01), as well as between the None and System (p < 0.01), conditions. There was no significant difference between the Human and System conditions (see Figure 8b).
  • Gaze Naturalness: There were significant differences between the None and Human (p < 0.001), as well as between the None and System (p < 0.001), conditions. There was no significant difference between the Human and System conditions (see Figure 8c).
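For reference, the sketch below reproduces the same analysis pipeline in Python with SciPy rather than SPSS; it is an illustrative equivalent, not the software the authors used, and the variable names are hypothetical.

```python
# Friedman test across the three within-subject conditions, followed by pairwise
# Wilcoxon signed-rank tests with Bonferroni correction (illustrative equivalent of
# the SPSS analysis). scores_* are per-participant mean ratings for one measure.
from scipy import stats

def analyze(scores_none, scores_human, scores_system, alpha=0.05):
    chi2, p_friedman = stats.friedmanchisquare(scores_none, scores_human, scores_system)
    print(f"Friedman: chi2={chi2:.2f}, p={p_friedman:.4f}")
    if p_friedman < alpha:
        pairs = {"None vs Human": (scores_none, scores_human),
                 "None vs System": (scores_none, scores_system),
                 "Human vs System": (scores_human, scores_system)}
        for name, (a, b) in pairs.items():
            w, p = stats.wilcoxon(a, b)
            p_adj = min(p * len(pairs), 1.0)   # Bonferroni correction
            print(f"{name}: W={w:.1f}, adjusted p={p_adj:.4f}")
```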
From the experimental results, we confirmed significant differences between the None and Human conditions, as well as between the None and System conditions. However, we found no significant difference in social presence, co-presence, or gaze naturalness between the Human and System conditions. Regarding co-presence, the average scores were 3.71 for the Human condition and 3.72 for the System condition, which are very similar. Nevertheless, the scores remained moderate because, in this experiment, the ECAs only provided eye contact and did not respond to the participants’ answers or exhibit gestures typical of ordinary conversation. Although our primary objective was to examine the effect of eye contact on the user, this limited interaction may have contributed to a reduced sense of social presence among participants.
In terms of gaze naturalness, the Human condition scored 3.70, while the System condition scored 3.36. Although the difference was not statistically significant, a numerical gap was observed, for which we see two main reasons. First, in the Human condition, the operator directed the ECA’s gaze not only at the current speaker but also at users who performed sudden, attention-grabbing actions, such as abrupt large body movements.
Second, this issue arises from the multi-party conversation gaze control system. In this study, the algorithm was developed based on previous research, indicating that transitions between speaker turns typically involve a brief pause of approximately 1.2 s for breathing. Consequently, we identified the individual who spoke the most during this 1.2 s interval as the dominant speaker and directed the system’s gaze toward them. However, a problem occurs when a new speaker begins speaking before the previous speaker has finished, causing the response time to exceed 1 s.
To address this concern, we evaluated the model’s accuracy in predicting user turns. We defined the dominant speaker in a conversation as the individual whose utterance both initiated and concluded the turn, and we assessed the system’s performance based on this criterion. Due to audio recording issues, our analysis was limited to 20 out of 30 videos. The model achieved an accuracy of 89.8%, with most errors resulting from a slight delay rather than an immediate switch of gaze to the next speaker. Previous studies have found that, in conversational situations between robots and humans, naturalness is highest when the delay time is 1 s, and it then decreases thereafter [55]. Based on this, a delay of more than 1 s in the ECA response speed may have reduced naturalness.
In the interviews after the experiment, all but four participants were able to clearly distinguish the None condition from the Human and System conditions. No one explicitly mentioned the difference between the Human and System conditions. Regarding preference for the ECA in the two conditions, most participants expressed preferences related to other factors, such as the ECA’s facial expressions and voice, rather than awareness of the gaze. We found that users generally did not perceive any noticeable differences between the System and Human conditions.
For the objective measures, no statistically significant differences were found in the participants’ speaking time or in the duration of visual attention directed toward the ECA across conditions. These results were presumed to be attributable to a conformity effect induced by the dummy’s consistent responses, which may have led participants to exhibit similar speaking durations and behavioral patterns [56]. During the experiment, we introduced a dummy participant to simulate a multi-party conversational setting. This dummy consistently responded in the same manner across all conditions. Previous research has demonstrated that speakers tend to adopt their interlocutors’ speech characteristics and behaviors—a phenomenon known as convergence [57,58]. To more accurately assess the objective measures, future studies should consider an experimental design that minimizes the conformity effect induced by the interlocutor.

6. Conclusions

In this study, we presented a system for detecting the main speaker in multi-party conversations under noisy conditions. To achieve this goal, we trained an enhanced speaker detection model in noisy settings and developed software capable of identifying the main speaker in multi-party conversations. The trained model achieved a 0.6% improvement in mAP over the existing lightweight model (reaching a final score of 91.6%) and consistently outperformed it across various noise conditions. In the user evaluation, participants assessed whether they could perceive differences between ECAs’ gaze behavior controlled by a person and that controlled by our system. Social presence, co-presence, and gaze naturalness were compared, and the results indicated that participants could not clearly distinguish between human-controlled and system-controlled gaze behaviors. However, significant improvements over a baseline static gaze behavior were observed, suggesting that our model performs comparably to a human-controlled gaze. In summary, our method can accurately detect the active speaker in noisy environments using only a single microphone and a single camera, enabling smooth and natural gaze behavior during conversations. Furthermore, the lightweight design of the proposed model ensures efficient performance even under resource-constrained conditions, making it suitable for deployment in typical kiosk systems.
Despite its overall performance, our method has limitations. The active speaker detection model was trained on synthetic noisy data and tested in a multi-party conversation within an open office; it has not been evaluated in real-world outdoor settings, where factors like traffic and construction noise could affect performance. While the model requires minimal hardware—similar to a standard webcam—it relies on a single camera and mono microphone. Because the model was trained at 25 fps, using a webcam with a different frame rate may require frame-rate conversion during preprocessing. Although the gaze system generally mimics natural behavior, a slight difference in gaze naturalness was observed, likely due to the simple control algorithm that directs gaze solely toward the speaker. In real-world scenarios, gaze behavior is influenced by speech content, speaker movements, surrounding individuals, and environmental factors. In addition, our user study only included male–female pairs, which may have introduced gender bias. Future research should address these limitations while enhancing the ECA gaze model by incorporating gestures and environmental changes to achieve more interactive and natural behavior.

Author Contributions

Conceptualization, S.J. and J.K.; methodology and software, S.J.; validation, J.K. and M.L.; formal analysis, J.K.; writing—original draft preparation, S.J. and J.K.; writing—review and editing, M.L.; visualization, S.J.; supervision, project administration, and funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) (grant funded by the Korean government (MSIT) (No. 2022R1F1A1076025)) and by the Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2024-RS-2023-00254177) (grant funded by the Korean government (MSIT)).

Institutional Review Board Statement

This study was conducted in accordance with the guidelines detailed in the Declaration of Helsinki, and it was also approved by the Institutional Review Board of Pusan National University (protocol code 2023_35_HR, 23 March 2023).

Informed Consent Statement

Informed consent was obtained from all of the subjects involved in this study.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, N.; Chen, J.; Liu, Z.; Zhang, J. Public information system interface design research. In Proceedings of the Human-Computer Interaction—INTERACT 2013, Cape Town, South Africa, 2–6 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 247–259. [Google Scholar]
  2. Buisine, S.; Wang, Y.; Grynszpan, O. Empirical investigation of the temporal relations between speech and facial expressions of emotion. J. Multimodal User Interfaces 2009, 3, 263–270. [Google Scholar] [CrossRef]
  3. Freigang, F.; Klett, S.; Kopp, S. Pragmatic multimodality: Effects of nonverbal cues of focus and certainty in a virtual human. In Proceedings of the Intelligent Virtual Agents, Stockholm, Sweden, 27–30 August 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 142–155. [Google Scholar]
  4. He, Y.; Pereira, A.; Kucherenko, T. Evaluating data-driven co-speech gestures of embodied conversational agents through real-time interaction. In Proceedings of the 22nd ACM International Conference on Intelligent Virtual Agents, Faro, Portugal, 6–9 September 2022; pp. 1–8. [Google Scholar]
  5. Ding, Y.; Zhang, Y.; Xiao, M.; Deng, Z. A multifaceted study on eye contact based speaker identification in three-party conversations. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 3011–3021. [Google Scholar]
  6. Kendon, A. Some functions of gaze-direction in social interaction. Acta Psychol. 1967, 26, 22–63. [Google Scholar] [CrossRef]
  7. Moubayed, S.A.; Edlund, J.; Beskow, J. Taming Mona Lisa: Communicating gaze faithfully in 2D and 3D facial projections. ACM Trans. Interact. Intell. Syst. 2012, 1, 1–25. [Google Scholar] [CrossRef]
  8. Kum, J.; Jung, S.; Lee, M. The Effect of Eye Contact in Multi-Party Conversations with Virtual Humans and Mitigating the Mona Lisa Effect. Electronics 2024, 13, 430. [Google Scholar] [CrossRef]
  9. Roth, J.; Chaudhuri, S.; Klejch, O.; Marvin, R.; Gallagher, A.; Kaver, L.; Ramaswamy, S.; Stopczynski, A.; Schmid, C.; Xi, Z.; et al. Ava active speaker: An audio-visual dataset for active speaker detection. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4492–4496. [Google Scholar]
  10. Tao, R.; Pan, Z.; Das, R.K.; Qian, X.; Shou, M.Z.; Li, H. Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3927–3935. [Google Scholar]
  11. Min, K.; Roy, S.; Tripathi, S.; Guha, T.; Majumdar, S. Learning long-term spatial-temporal graphs for active speaker detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 371–387. [Google Scholar]
  12. Liao, J.; Duan, H.; Feng, K.; Zhao, W.; Yang, Y.; Chen, L. A light weight model for active speaker detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22932–22941. [Google Scholar]
  13. Wang, X.; Cheng, F.; Bertasius, G. Loconet: Long-short context network for active speaker detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 18462–18472. [Google Scholar]
  14. Roxo, T.; Costa, J.C.; Inácio, P.R.; Proença, H. WASD: A Wilder Active Speaker Detection Dataset. IEEE Trans. Biom. Behav. Identity Sci. 2024, 7, 61–70. [Google Scholar] [CrossRef]
  15. Laranjo, L.; Dunn, A.G.; Tong, H.L.; Kocaballi, A.B.; Chen, J.; Bashir, R.; Surian, D.; Gallego, B.; Magrabi, F.; Lau, A.Y.; et al. Conversational agents in healthcare: A systematic review. J. Am. Med. Inform. Assoc. 2018, 25, 1248–1258. [Google Scholar] [CrossRef]
  16. Argyle, M.; Cook, M.; Cramer, D. Gaze and mutual gaze. Br. J. Psychiatry 1994, 165, 848–850. [Google Scholar] [CrossRef]
  17. Kendon, A. Conducting Interaction: Patterns of Behavior in Focused Encounters; CUP Archive: Cambridge, UK, 1990; Volume 7. [Google Scholar]
  18. Mason, M.F.; Tatkow, E.P.; Macrae, C.N. The look of love: Gaze shifts and person perception. Psychol. Sci. 2005, 16, 236–239. [Google Scholar] [CrossRef]
  19. Shimada, M.; Yoshikawa, Y.; Asada, M.; Saiwaki, N.; Ishiguro, H. Effects of observing eye contact between a robot and another person. Int. J. Soc. Robot. 2011, 3, 143–154. [Google Scholar] [CrossRef]
  20. Xu, T.; Zhang, H.; Yu, C. See you see me: The role of eye contact in multimodal human-robot interaction. ACM Trans. Interact. Intell. Syst. (TiiS) 2016, 6, 2. [Google Scholar] [CrossRef]
  21. Kompatsiari, K.; Ciardo, F.; De Tommaso, D.; Wykowska, A. Measuring engagement elicited by eye contact in Human-Robot Interaction. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6979–6985. [Google Scholar]
22. Kiilavuori, H.; Sariola, V.; Peltola, M.J.; Hietanen, J.K. Making eye contact with a robot: Psychophysiological responses to eye contact with a human and with a humanoid robot. Biol. Psychol. 2021, 158, 107989. [Google Scholar] [CrossRef]
  23. Bee, N.; André, E.; Tober, S. Breaking the ice in human-agent communication: Eye-gaze based initiation of contact with an embodied conversational agent. In Proceedings of the Intelligent Virtual Agents, Amsterdam, The Netherlands, 14–16 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 229–242. [Google Scholar]
  24. Kontogiorgos, D.; Skantze, G.; Pereira, A.; Gustafson, J. The effects of embodiment and social eye-gaze in conversational agents. In Proceedings of the Annual Meeting of the Cognitive Science Society, Montreal, QC, Canada, 24–27 July 2019; Volume 41. [Google Scholar]
  25. Choi, D.S.; Park, J.; Loeser, M.; Seo, K. Improving counseling effectiveness with virtual counselors through nonverbal compassion involving eye contact, facial mimicry, and head-nodding. Sci. Rep. 2024, 14, 506. [Google Scholar] [CrossRef] [PubMed]
  26. Song, K.T.; Hu, J.S.; Tsai, C.Y.; Chou, C.M.; Cheng, C.C.; Liu, W.H.; Yang, C.H. Speaker attention system for mobile robots using microphone array and face tracking. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA), Orlando, FL, USA, 15–19 May 2006; pp. 3624–3629. [Google Scholar]
  27. Kim, H.D.; Kim, J.; Komatani, K.; Ogata, T.; Okuno, H.G. Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nice, France, 22–26 September 2008; pp. 1705–1711. [Google Scholar]
  28. Sanchez-Riera, J.; Alameda-Pineda, X.; Wienke, J.; Deleforge, A.; Arias, S.; Čech, J.; Wrede, S.; Horaud, R. Online multimodal speaker detection for humanoid robots. In Proceedings of the 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids), Osaka, Japan, 29 November–1 December 2012; pp. 126–133. [Google Scholar]
  29. Cech, J.; Mittal, R.; Deleforge, A.; Sanchez-Riera, J.; Alameda-Pineda, X.; Horaud, R. Active-speaker detection and localization with microphones and cameras embedded into a robotic head. In Proceedings of the 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), Atlanta, GA, USA, 15–17 October 2013; pp. 203–210. [Google Scholar]
  30. Ciuffreda, I.; Battista, G.; Casaccia, S.; Revel, G.M. People detection measurement setup based on a DOA approach implemented on a sensorised social robot. Meas. Sens. 2023, 25, 100649. [Google Scholar] [CrossRef]
  31. He, W.; Motlicek, P.; Odobez, J.M. Deep neural networks for multiple speaker detection and localization. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 74–79. [Google Scholar]
  32. Gonzalez-Billandon, J.; Belgiovine, G.; Tata, M.; Sciutti, A.; Sandini, G.; Rea, F. Self-supervised learning framework for speaker localisation with a humanoid robot. In Proceedings of the 2021 IEEE International Conference on Development and Learning (ICDL), Beijing, China, 23–26 August 2021; pp. 1–7. [Google Scholar]
  33. Humblot-Renaux, G.; Li, C.; Chrysostomou, D. Why talk to people when you can talk to robots? Far-field speaker identification in the wild. In Proceedings of the 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), Vancouver, BC, Canada, 8–12 August 2021; pp. 272–278. [Google Scholar]
  34. Qian, X.; Wang, Z.; Wang, J.; Guan, G.; Li, H. Audio-visual cross-attention network for robotic speaker tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 31, 550–562. [Google Scholar] [CrossRef]
  35. Shi, Z.; Zhang, L.; Wang, D. Audio–Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem. Appl. Sci. 2023, 13, 6056. [Google Scholar] [CrossRef]
  36. Berghi, D.; Jackson, P.J. Leveraging Visual Supervision for Array-Based Active Speaker Detection and Localization. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 32, 984–995. [Google Scholar] [CrossRef]
  37. Köpüklü, O.; Taseska, M.; Rigoll, G. How to design a three-stage architecture for audio-visual active speaker detection in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1193–1203. [Google Scholar]
  38. Alcázar, J.L.; Cordes, M.; Zhao, C.; Ghanem, B. End-to-end active speaker detection. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 126–143. [Google Scholar]
  39. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5533–5541. [Google Scholar]
  40. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
41. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  42. Gritsenko, A.; Salimans, T.; van den Berg, R.; Snoek, J.; Kalchbrenner, N. A spectral energy distance for parallel speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 13062–13072. [Google Scholar]
  43. Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  44. Ishii, R.; Otsuka, K.; Kumano, S.; Yamato, J. Using respiration to predict who will speak next and when in multiparty meetings. ACM Trans. Interact. Intell. Syst. (TiiS) 2016, 6, 1–20. [Google Scholar] [CrossRef]
  45. Datta, G.; Etchart, T.; Yadav, V.; Hedau, V.; Natarajan, P.; Chang, S.F. Asd-transformer: Efficient active speaker detection using self and multimodal transformers. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4568–4572. [Google Scholar]
  46. Zhang, Y.; Liang, S.; Yang, S.; Liu, X.; Wu, Z.; Shan, S.; Chen, X. Unicon: Unified context network for robust active speaker detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3964–3972. [Google Scholar]
  47. Wang, W.; Xing, C.; Wang, D.; Chen, X.; Sun, F. A robust audio-visual speech enhancement model. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7529–7533. [Google Scholar]
  48. Sivaraman, A.; Kim, M. Efficient personalized speech enhancement through self-supervised learning. IEEE J. Sel. Top. Signal Process. 2022, 16, 1342–1356. [Google Scholar] [CrossRef]
  49. Zhao, X.; Zhu, Q.; Hu, Y. An Experimental Comparison of Noise-Robust Text-To-Speech Synthesis Systems Based On Self-Supervised Representation. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11441–11445. [Google Scholar]
50. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  51. Syrdal, D.S.; Dautenhahn, K.; Koay, K.L.; Walters, M.L. The negative attitudes towards robots scale and reactions to robot behaviour in a live human-robot interaction study. In Adaptive and Emergent Behaviour and Complex Systems; The Society for the Study of Artificial Intelligence and the Simulation of Behaviour (AISB): Edinburgh, UK, 2009. [Google Scholar]
  52. Bailenson, J.N.; Blascovich, J.; Beall, A.C.; Loomis, J.M. Interpersonal distance in immersive virtual environments. Personal. Soc. Psychol. Bull. 2003, 29, 819–833. [Google Scholar] [CrossRef] [PubMed]
  53. Harms, C.; Biocca, F. Internal consistency and reliability of the networked minds measure of social presence. In Proceedings of the Seventh Annual International Workshop: Presence, Valencia, Spain, 13–15 October 2004; Volume 2004. [Google Scholar]
  54. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  55. Shiwa, T.; Kanda, T.; Imai, M.; Ishiguro, H.; Hagita, N. How quickly should communication robots respond? In Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, Amsterdam, The Netherlands, 12–15 March 2008; pp. 153–160. [Google Scholar]
  56. Crutchfield, R.S. Conformity and character. Am. Psychol. 1955, 10, 191. [Google Scholar] [CrossRef]
  57. Sanker, C. Comparison of Phonetic Convergence in Multiple Measures; Academia: Melbourne, Australia, 2015. [Google Scholar]
  58. Street, R.L., Jr. Speech convergence and speech evaluation in fact-finding interviews. Hum. Commun. Res. 1984, 11, 139–169. [Google Scholar] [CrossRef]
Figure 1. Proposed model architecture with an audio encoder (upper left), a visual encoder (bottom left), and a detector (right).
Figure 2. (a,b) Two blocks, both of which were designed to extract long-range temporal features.
Figure 3. Detailed model structure with internal components.
Figure 4. Multi-party conversation gaze control system flow.
Figure 5. Graphical user interface (GUI) for ECA gaze control: mode selection buttons (upper right), verbal expression controls for the ECA (lower left), and the camera-captured image used for system processing (upper left). In system mode, the detected speaker is indicated by a green box.
Figure 6. The environment configuration for Experiment 2.
Figure 7. The ECAs’ environment and appearance in Experiment 2.
Figure 8. Results of the subjective measurements (*: p < 0.05, **: p < 0.01, and ***: p < 0.001; ⧫: outlier).
Table 1. Performance comparisons using the original AVA-Speaker Detection dataset.

Method                            mAP (%)
Light ASD (without retraining)    90.2
Light ASD (with retraining)       91.0
Ours                              91.6
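For readers who want to sanity-check mAP figures such as those in Table 1, the sketch below approximates average precision from per-frame speaking scores with scikit-learn. It is only an illustration under assumed file and column names (val_predictions.csv, "label", "score"); results directly comparable to the table should be produced with the official AVA ActiveSpeaker evaluation tooling.

```python
# Minimal sketch: approximate average precision for active speaker detection
# from per-frame predictions. The CSV layout (columns "label" and "score") and
# the file name are assumptions; this is not the official AVA evaluation script.
import pandas as pd
from sklearn.metrics import average_precision_score

def approx_ap(csv_path: str) -> float:
    """Average precision over all face crops, treating speaking frames as positives."""
    df = pd.read_csv(csv_path)
    y_true = (df["label"] == "SPEAKING_AUDIBLE").astype(int)  # assumed label name
    y_score = df["score"].astype(float)                       # model confidence per face crop
    return average_precision_score(y_true, y_score)

if __name__ == "__main__":
    print(f"approx. AP: {approx_ap('val_predictions.csv'):.3f}")  # hypothetical file
```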
Table 2. Performance comparisons (mAP, %) at various SNR levels.

Method                            −10 dB    −5 dB    0 dB    5 dB    10 dB
Light ASD (without retraining)    86.3      88.6     90.1    92.1    92.9
Light ASD (with retraining)       88.1      89.8     91.3    92.3    92.8
Ours                              88.7      90.4     91.7    92.6    93.1
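Table 2 evaluates the models on noise-augmented audio at fixed signal-to-noise ratios. As a point of reference, the sketch below shows one common way to mix a noise track into clean speech at a target SNR; it illustrates the general technique only and is not the paper's exact augmentation pipeline.

```python
# Sketch of SNR-controlled noise mixing (general technique, not the authors'
# exact augmentation code). Inputs are 1-D float arrays of equal length.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12                     # avoid division by zero
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise

# Example: at 0 dB SNR the speech and noise carry equal average power.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1 s, 16 kHz speech clip
noise = rng.standard_normal(16000)    # stand-in for ambient kiosk noise
noisy = mix_at_snr(speech, noise, snr_db=0.0)
```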
Table 3. Performance comparisons of the visual encoders using the TDV block.

Method                         mAP (%)
Light ASD’s visual encoder     78.25
Our model’s visual encoder     81.12
Table 4. Performance evaluations of the TDA and TDV blocks.

Method     mAP (%)
WOD        91.4
AD         91.5
VD         91.3
AD + VD    91.6
Table 5. Performance comparisons with the addition of bidirectional GRU in the encoder.

Method             mAP (%)
woBiGRU            91.1
ABiGRU             91.6
VBiGRU             90.6
ABiGRU + VBiGRU    90.6
Table 6. Performance comparisons of the multi-modal combination method.

Method      mAP (%)
woConcat    91.2
Concat      91.6
Table 7. Performance comparisons for inference time.

Method       Video Frames    Inference Time (ms)
Light ASD    1 (0.04 s)      3.82
             500 (20 s)      23.56
             1000 (40 s)     45.68
Ours         1 (0.04 s)      4.79
             500 (20 s)      23.22
             1000 (40 s)     47.09
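The latency figures in Table 7 depend on hardware and input length, so they are best read as relative comparisons. For readers reproducing such measurements, a simple wall-clock timing routine for a PyTorch model is sketched below; the dummy model, call signature, and input shapes are placeholders, not the authors' benchmarking code.

```python
# Sketch of forward-pass timing for a two-stream PyTorch model (placeholder
# model and tensor shapes; not the authors' benchmarking code).
import time
import torch

@torch.no_grad()
def time_inference(model: torch.nn.Module, audio: torch.Tensor, video: torch.Tensor,
                   warmup: int = 5, runs: int = 20) -> float:
    """Return the mean forward-pass time in milliseconds."""
    model.eval()
    for _ in range(warmup):                  # warm-up passes to stabilise caches
        model(audio, video)
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # make GPU timing meaningful
    start = time.perf_counter()
    for _ in range(runs):
        model(audio, video)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

class DummyASD(torch.nn.Module):             # stand-in for an audio-visual model
    def forward(self, audio, video):
        return audio.mean() + video.mean()

model = DummyASD()
a = torch.randn(1, 1, 100, 13)               # assumed audio feature shape
v = torch.randn(1, 25, 112, 112)             # assumed 25-frame face-crop clip
print(f"{time_inference(model, a, v):.2f} ms per forward pass")
```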
Table 8. Example questions used in Experiment 2.

Questions
1. Which do you prefer, summer or winter? Why do you like it better?
2. Can you recommend a restaurant and a dish? Why?
3. What is the last movie you watched? Can you tell me about it?
4. So, what is your ideal type? I would love to hear about it.
5. What did you do last weekend? Tell me all about it.
Table 9. Summary of the Cronbach’s α and Friedman test results.

Measure             χ²        p-Value    Cronbach’s α
Social presence     17.142    0.005      0.859
Co-presence         22.604    0.001      0.883
Gaze naturalness    9.896     <0.001     –
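Table 9 summarises scale reliability (Cronbach’s α) and the non-parametric Friedman tests across the three gaze conditions. For readers unfamiliar with these statistics, the sketch below shows how both can be computed from rating matrices; the data here are synthetic and do not reproduce the reported values.

```python
# Sketch: Friedman test and Cronbach's alpha from rating matrices. The example
# data are synthetic and do not reproduce the values in Table 9.
import numpy as np
from scipy.stats import friedmanchisquare

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of scale items."""
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    k = items.shape[1]
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(42)

# Friedman test: 30 participants rating the same measure under three gaze conditions.
conditions = rng.integers(1, 8, size=(30, 3)).astype(float)
chi2, p = friedmanchisquare(conditions[:, 0], conditions[:, 1], conditions[:, 2])
print(f"Friedman chi-square = {chi2:.3f}, p = {p:.3f}")

# Reliability: computed over the questionnaire items of a scale
# (six hypothetical items per participant), not over conditions.
items = rng.integers(1, 8, size=(30, 6)).astype(float)
print(f"Cronbach's alpha = {cronbach_alpha(items):.3f}")
```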