Multimodal human–AI systems typically treat facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To address this issue, we propose the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, providing an interpretable visual synchrony signal that can serve as adaptive feedback in human–AI interaction. The framework's key component is the Consistency Index for Affective Synchrony (CIAS), which maps brief visual segments to scalar synchrony scores through a common latent representation. Facial and gestural signals are projected by modality-specific networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of the resulting latent trajectories. Within the ESEA paradigm, this synchrony index is regarded as an estimate of affective visual coherence.
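The abstract does not specify the projection architectures or the exact form of CIAS, so the following PyTorch sketch is only a hypothetical illustration: it assumes MLP projection heads, frame-wise cosine similarity, and an equally weighted short-term temporal-consistency term; `ProjectionHead`, `cias_score`, and the feature dimensions are illustrative names and values, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Modality-specific projection into a shared latent space (hypothetical architecture)."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) per-frame features -> unit-norm latent trajectory
        return F.normalize(self.net(x), dim=-1)

def cias_score(z_face: torch.Tensor, z_gest: torch.Tensor) -> torch.Tensor:
    """Scalar synchrony score per segment from (a) cross-modal similarity
    and (b) short-term temporal consistency of the latent trajectories."""
    # (a) frame-wise cosine similarity (latents are unit-norm), averaged over time
    sim = (z_face * z_gest).sum(dim=-1).mean(dim=-1)                  # (batch,)
    # (b) agreement of frame-to-frame latent changes across the two modalities
    d_face = z_face[:, 1:] - z_face[:, :-1]
    d_gest = z_gest[:, 1:] - z_gest[:, :-1]
    cons = F.cosine_similarity(d_face, d_gest, dim=-1).mean(dim=-1)   # (batch,)
    # equal weighting of the two terms is an assumption, not the paper's formulation
    return 0.5 * (sim + cons)

# Illustrative usage with random stand-in features:
face_proj, gest_proj = ProjectionHead(136), ProjectionHead(75)
z_f = face_proj(torch.randn(4, 30, 136))   # e.g. facial landmark features
z_g = gest_proj(torch.randn(4, 30, 75))    # e.g. body pose features
scores = cias_score(z_f, z_g)              # one synchrony score per segment
```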
We formalize the UVS/CIAS framework and conduct a comparative experimental evaluation using matched and mismatched face–gesture segments derived from rendered dialog footage. Through ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we demonstrate that CIAS captures structured face–gesture alignment beyond what similarity-based baselines achieve, while also delivering a persistent, time-resolved synchrony signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents.
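As a rough illustration of the matched-versus-mismatched protocol, the minimal sketch below scores CIAS as a binary detector of genuine alignment using scikit-learn's `roc_auc_score`; treating matched pairs as positives and mismatched pairs as negatives is an assumption about the setup, and `evaluate_synchrony` and the placeholder arrays are hypothetical, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_synchrony(matched: np.ndarray, mismatched: np.ndarray) -> float:
    """ROC-AUC of CIAS as a detector of genuine face–gesture alignment:
    matched segments are the positive class, mismatched pairings the negative class."""
    scores = np.concatenate([matched, mismatched])
    labels = np.concatenate([np.ones_like(matched), np.zeros_like(mismatched)])
    return roc_auc_score(labels, scores)

# Placeholder arrays standing in for CIAS scores from the sketch above;
# the actual evaluation uses segments derived from rendered dialog footage.
auc = evaluate_synchrony(np.array([0.8, 0.7, 0.9]), np.array([0.2, 0.4, 0.1]))
```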