Article

Harmonizing Sight and Sound: The Impact of Auditory Emotional Arousal, Visual Variation, and Their Congruence on Consumer Engagement in Short Video Marketing

1 School of Business, Nanjing Audit University, Nanjing 210017, China
2 School of Economics and Management, Southwest Jiaotong University, Chengdu 610032, China
* Author to whom correspondence should be addressed.
J. Theor. Appl. Electron. Commer. Res. 2025, 20(2), 69; https://doi.org/10.3390/jtaer20020069
Submission received: 3 March 2025 / Revised: 1 April 2025 / Accepted: 3 April 2025 / Published: 8 April 2025
(This article belongs to the Topic Interactive Marketing in the Digital Era)

Abstract

Social media influencers strategically design the auditory and visual features of short videos to enhance consumer engagement. Among these, auditory emotional arousal and visual variation play crucial roles, yet their interactive effects remain underexplored. Drawing on multichannel integration theory, this study applies multimodal machine learning to analyze 12,842 short videos from Douyin, integrating text analysis, sound recognition, and image processing. The results reveal an inverted U-shaped relationship between auditory emotional arousal and consumer engagement, where moderate arousal maximizes interaction while excessively high or low arousal reduces engagement. Visual variation, however, exhibits a positive linear effect, with greater variation driving higher engagement. Notably, audiovisual congruence significantly enhances engagement, as high alignment between arousal and visual variation optimizes consumer information processing. These findings advance short video marketing research by uncovering the multisensory interplay in consumer engagement. They also provide practical guidance for influencers in optimizing voice and visual design strategies to enhance content effectiveness.

1. Introduction

In recent years, short videos have rapidly become a central channel for information dissemination and brand marketing [1], driven by their fragmented format, entertainment value, and high interactivity [2]. Businesses are increasingly partnering with social media influencers to produce short videos embedded with product information, simultaneously boosting brand exposure and consumer engagement [3]. As of 2024, TikTok’s global user base has surpassed 1.9 billion, with approximately 33% of users making purchase decisions after interacting with influencer-generated product videos [4]. Short videos offer distinct advantages, including brevity, high information density, and rapid dissemination [5]. Since most short videos last between 15 s and a few minutes, content creators must leverage carefully designed visual and auditory elements to quickly capture viewers’ attention [5].
This has made the effective use of sound and visuals in short video content a critical research topic [6], with sound in particular drawing growing attention [7] for its role in facilitating faster cognitive processing [8]. Among auditory features, emotional arousal is particularly important, referring to the intensity of emotion conveyed through vocal delivery [9]. Influencers frequently use highly emotionally arousing voices to enhance consumer attention and interaction in short video marketing [10]. For example, employing an expressive and engaging tone enhances a video’s appeal [5]. Research suggests that high-arousal content effectively captures attention, strengthens memory encoding, and enhances emotional experiences [11], particularly in time-sensitive decision-making contexts [9]. However, the impact of auditory emotional arousal on marketing effectiveness often follows a nonlinear pattern. While highly arousing vocal delivery can increase emotional engagement and capture attention, excessive intensity may lead to cognitive overload and induce negative emotions such as irritation or anxiety [12]. Cognitive psychology research suggests that stimulus arousal can have both positive and negative effects on consumer behavior, with the relationship between arousal levels and behavioral responses being complex and variable [13]. Despite this, existing studies on short videos have largely neglected the role of influencers’ vocal delivery, and little research has examined how auditory emotional arousal affects consumer engagement. This gap often results in imprecise sound design strategies, limiting influencers’ ability to engage consumers effectively.
While auditory stimuli shape the emotional tone of short video content, visual dynamics are equally central in capturing and sustaining user attention [14,15]. Visual variation, which includes fluctuations in scene transitions, perspective shifts, and graphic effects, introduces a temporal rhythm that can heighten sensory engagement [16]. Prior research suggests that moderate levels of such variation enhance cognitive stimulation, facilitate viewer immersion, and improve message encoding and recall [17]. For instance, fast-paced editing sequences may maintain attention by mimicking users’ habitual browsing tempo on short video platforms [18]. On the other hand, excessive visual variation, such as frequent scene cuts or abrupt perspective changes, may disrupt the viewer’s ability to process information smoothly. This can result in fatigue and reduce the persuasive impact of brand-related messages. Although research on video design has addressed static visual elements such as color, composition, resolution, and lighting [15,19,20], there is comparatively little understanding of how dynamic visual structures affect consumer perception and behavior. Further investigation into how different degrees of visual intensity influence engagement is essential for developing more effective content strategies within the context of short video marketing.
More importantly, in short video content, visual and auditory stimuli do not operate independently; their effective integration plays a key role in enhancing consumer engagement [21]. According to consistency theory, when visual and auditory stimuli are congruent, individuals process information and emotions more efficiently [22]. For instance, pairing a highly emotionally arousing voice with fast-paced visual variation effectively conveys urgency and excitement, whereas a low-arousal voice combined with slow-paced visuals may be more suitable for conveying calmness and relaxation. However, mismatched audiovisual elements—such as high emotional arousal paired with low visual variation or low emotional arousal paired with high visual variation—may create cognitive and emotional dissonance, reducing processing efficiency and weakening consumer engagement. Despite its significance, limited research has investigated how the congruence between auditory and visual features in short videos affects consumer behavior.
This study addresses these gaps by examining the effects of auditory emotional arousal and visual variation on consumer engagement in short videos, with a particular focus on their congruence. Using real-world Douyin platform data, this study applies multimodal audio–visual analysis to systematically analyze 12,842 product recommendation short videos from 170 influencers. By exploring different combinations of auditory emotional arousal and visual variation, this study provides a comprehensive examination of their interactive effects on consumer engagement. The findings contribute to the theoretical understanding of short video marketing by addressing the underexplored interplay between auditory and visual elements. Additionally, the study offers practical insights to help brands refine their content strategies, enhance consumer interaction, and improve purchase conversion rates.

2. Literature Review

2.1. Influencer Marketing in Short Videos

Influencer marketing refers to the strategic use of individuals with strong social presence to shape consumer perceptions and drive brand outcomes, typically through some form of compensation or collaboration [23]. Within short video contexts, this practice involves creators who promote products or services through video content designed to engage specific audience segments [24]. Influencers often attract viewers by producing material that is entertaining, relatable, or perceived as professionally credible, thereby fostering interaction and consumer interest [25].
Research on influencer marketing in short video environments has generally developed along two primary lines. One stream investigates how influencer attributes affect consumer decision-making [26]. Prior studies have identified factors such as role identity [27], personality traits [26], follower base size [28], narrative skills [29], and selling techniques [30] as influential in shaping consumer attitudes and purchase intentions. Additionally, qualities like authenticity, communicative style, and interactivity contribute to consumers’ evaluation of the influencer and their message [5]. A second research stream focuses on the features of influencer-created video content [31]. Aspects including video length [32], content structure [33], title phrasing [34], and source credibility [35] have been shown to influence consumer engagement. However, this body of work has largely concentrated on textual and static visual features, such as image composition, color palette, and the influencer’s appearance [19]. These studies primarily investigate how linguistic and visual cues shape consumer responses. By contrast, the role of dynamic visual sequences and audio-based cues in shaping user responses has received far less empirical attention. Given this gap, dynamic visual elements and emotionally driven auditory cues merit further investigation.
As multimodal research gains prominence, there is growing recognition that content should be understood in terms of integrated sensory experiences rather than in isolated modalities [5]. Simmonds et al. (2020) suggest that synchronized audiovisual cues may foster deeper attention and facilitate more elaborate message processing [36]. Yet the field still lacks systematic investigation into how emotional audio cues and visual dynamism jointly influence user behavior. Addressing this gap may offer both theoretical and practical insights for enhancing the impact of influencer content on short video platforms.

2.2. Auditory Emotional Arousal

Auditory emotional arousal refers to the degree of emotional activation triggered by sound, reflecting individuals’ psychological and physiological responses when exposed to specific auditory stimuli or audio content [9]. This activation can manifest in physiological reactions such as heightened attention, increased heart rate, and altered breathing patterns, as well as psychological experiences like emotional fluctuations [37]. Different sounds elicit varying levels of emotional arousal. For instance, soft sounds may create a low-arousal experience associated with calmness and pleasure, whereas intense and stimulating sounds may induce a high-arousal state, increasing heart rate and alertness [9].
In consumer behavior research, emotional arousal is widely recognized as a factor that enhances consumer experience and brand engagement. Studies suggest that moderate levels of emotional arousal improve consumers’ ability to sustain attention and retain information [38], thereby strengthening brand trust and consumer loyalty [39]. Additionally, higher arousal levels may activate social conformity tendencies, making consumers more inclined to align with prevailing behavioral trends [40,41]. However, excessive emotional arousal can lead to cognitive overload. When arousal intensity is too high, consumers may struggle to allocate their limited cognitive resources effectively, leading to attentional dispersion and reducing their ability to process marketing messages deeply [12]. While prior research has explored the role of emotional valence in marketing communication, studies specifically examining how auditory emotional arousal influences consumer behavior remain relatively scarce, especially in the context of short video marketing.
In short video marketing, influencers frequently use vocal delivery to convey emotional arousal, shaping consumers’ psychological states and behavioral responses. However, most existing research focuses on how consumers subjectively perceive influencers’ vocal expressions, such as how they experience the emotional tone of short videos and its impact on marketing effectiveness [42]. By contrast, limited research has examined auditory emotional arousal from the influencer’s perspective—specifically, how vocal delivery shapes consumers’ information processing and subsequently influences their behavioral responses. Addressing this gap is essential for a more comprehensive understanding of the role of auditory emotional arousal in short video marketing.

2.3. Visual Variation

Visual variation is a key characteristic of short videos [43], referring to the degree of change in elements such as scene transitions, color shifts, and dynamic visual effects [44]. Different levels of visual variation shape the overall viewing experience by influencing the perceived motion and intensity of a video’s visual presentation [19]. Short videos with low visual variation typically feature stable compositions, minimal scene transitions, and limited color adjustments, resulting in a more static and steady information structure. In contrast, high visual variation is characterized by rapid scene transitions, multi-layered visual elements, and complex dynamic effects [44].
Within the growing literature on short video design, visual variation is increasingly recognized as a key component of dynamic visual features [43]. Unlike static elements that remain fixed throughout the duration of a video, dynamic visual features unfold temporally [45] and involve motion-related characteristics such as trajectory patterns [46], visual flow [47], and animated transitions [48]. Prior research has primarily focused on how such elements influence attention capture and facilitate cognitive processing [49]. For instance, recent work by Stuppy et al. (2024) demonstrates that slower video playback speeds can enhance cognitive fluency, allowing viewers to process product-related information more effectively and to assign higher value to the promoted items [49]. Similarly, Yan et al. (2023) highlighted the role of visual flow continuity in shaping user immersion, demonstrating that smooth transitions in visual content appear to improve attentional engagement and contribute to a more fluid and rewarding viewing experience [50].
Despite growing research on the cognitive effects of dynamic visual features, most studies have focused on specific aspects such as video pacing [49], editing techniques [51], or narrative structures [52]. However, fewer studies have explored how the degree of visual variation modulates cognitive load, information fluency, and marketing effectiveness. Compared to research that isolates video playback speed or visual flow, visual variation emphasizes the integrated effects of multiple dynamic elements and how they collectively influence consumer information processing [44]. As a fundamental component of dynamic visual design, visual variation not only shapes users’ attention allocation and cognitive engagement, but also directly impacts the effectiveness of marketing content dissemination [19]. Given the multimodal nature of short videos, further investigation into the cognitive impact of visual variation will not only optimize short video content strategies, but also provide new theoretical perspectives for influencer marketing research.

2.4. Multiple Resource Theory

Multiple resource theory, rooted in cognitive psychology, offers a framework for understanding how individuals distribute attention across concurrent sensory inputs [53]. The theory conceptualizes cognition as comprising several distinct but partially overlapping processing channels, each limited in its capacity to handle information [53]. These channels correspond to different sensory modalities, such as auditory and visual systems. When tasks engage separate channels, they are less likely to interfere with one another, which may allow for more efficient information processing under certain conditions [54].
Multiple resource theory has been widely applied to examine how multisensory inputs influence information processing, particularly the interaction between visual and auditory modalities [54]. Research suggests that when visual and auditory information is well matched and coordinated, individuals experience greater processing fluency, reduced cognitive load, and improved memory and comprehension [55]. Montero (2018) found that synchronized audiovisual inputs enhance attentional focus and immersion, facilitating long-term memory storage [56]. Similarly, Baumgartner (2022) analyzed the effects of different audiovisual combinations and noted that mismatched visual and auditory stimuli may increase cognitive interference, thereby impairing information processing [57].
In the context of short video marketing, multiple resource theory provides a crucial theoretical foundation for understanding how visual and auditory elements work together to shape consumers’ information processing and brand perception [58]. The dissemination of short video content heavily relies on both visual and auditory inputs, with these two sensory systems interacting in cognitive processes to jointly influence users’ attention allocation, information encoding, and engagement behavior [59]. When visual and auditory information is congruent, consumers can integrate information more effectively, leading to enhanced comprehension and memory retention [60]. Holiday et al. (2023) further highlighted that well-coordinated audiovisual inputs not only minimize information conflicts, but also increase audience immersion and emotional resonance, ultimately strengthening brand appeal [61]. Although existing research underscores the importance of audiovisual congruence, most studies provide a broad perspective rather than examining the specific intensities of different stimuli or their interactive effects—such as the relationship between visual variation and auditory emotional arousal. Addressing this gap is essential for a more nuanced understanding of multisensory processing in short video marketing.

3. Hypotheses and Research Model

3.1. Auditory Emotional Arousal and Consumer Engagement

When individuals process external information, both emotional experience and cognitive evaluation influence their information processing mechanisms [62,63]. In the context of short video marketing, the level of auditory emotional arousal significantly affects both dimensions [9]. When auditory emotional arousal is low, auditory stimuli remain relatively flat, allowing consumers to process short video content with ease [42]. However, due to the lack of sufficient emotional stimulation, consumers’ attention span tends to be short, leading to a noticeable decline in focus on video content [49,64]. In this state, deep logical processing becomes difficult, and emotional resonance is unlikely to occur [65]. As a result, consumers’ memory retention and interest in the content are limited, significantly reducing their willingness to engage with the short video [42].
As the level of auditory emotional arousal increases, consumers’ sensory systems become more engaged, prompting them to allocate greater cognitive resources toward understanding and processing the short video content [42]. Studies suggest that moderate emotional stimulation enhances attention and fosters emotional resonance with the content [9]. When both attention and emotional resonance reach high levels, consumers are more likely to engage in deeper cognitive integration, leading to a stronger sense of identification with the video content [42]. Consequently, a moderate level of emotional arousal not only maintains information processing fluency, but also stimulates consumers’ emotional participation, encouraging active engagement behaviors such as liking, commenting, and sharing.
When auditory emotional arousal becomes excessively intense, as reflected in characteristics such as unusually high pitch, rapid tempo, or abrupt tonal fluctuations, it may interfere with consumers’ ability to process information effectively [66]. Strong auditory stimulation can elevate cognitive load, which, in turn, increases the likelihood of message avoidance [62] and may evoke negative emotional reactions, including anxiety or tension [67]. These outcomes are consistent with emotional arousal theory [68], which suggests that moderate levels of arousal are generally more likely to elicit approach-oriented behavior, whereas both low and high extremes tend to produce avoidance responses [69]. Although heightened arousal may capture attention in the short term, it often undermines cognitive clarity and emotional receptivity, ultimately diminishing consumer engagement with short video content. This leads to the following hypothesis:
H1. 
The relationship between influencers’ auditory emotional arousal and consumer engagement in short videos follows an inverted U-shaped pattern: as emotional arousal increases, consumer engagement initially rises, but when emotional arousal exceeds an optimal range, engagement declines.

3.2. Visual Variation and Consumer Engagement

In short video marketing, visual stimuli serve as a primary channel for information transmission, playing a crucial role in consumer attention allocation, depth of information processing, and emotional responses [70]. When the level of visual variation in a short video is low, the overall visual structure remains relatively static and stable, allowing consumers to process information smoothly without experiencing cognitive overload from rapid scene transitions [49]. However, due to the weak visual stimulation, consumers may quickly lose interest in the content if it lacks sufficient novelty and dynamic elements, leading to reduced information engagement and lower interaction intentions [52]. While low visual variation ensures ease of processing, it may fail to generate sufficient visual impact and rhythmic appeal, making it less effective in evoking emotional resonance and encouraging consumer participation.
As visual variation increases, short videos incorporate more frequent scene transitions and richer visual elements [52]. In this process, moderate visual variation enhances sensory stimulation, making the content more dynamic and engaging, thereby strengthening consumer immersion and emotional resonance [71]. At the same time, consumers tend to allocate more cognitive resources to information processing [52]. Thus, an optimal level of visual rhythm ensures that the content is neither too monotonous to be ignored nor too complex to induce cognitive overload [72], making it most likely to drive active consumer engagement.
However, when the degree of visual variation becomes excessive, consumers’ perceived fluency in information processing may be disrupted. From a cognitive perspective, high levels of visual variation can increase the perceived difficulty of processing information. As highlighted in Paas’s (2010) work on cognitive load theory, excessive sensory input can reduce the efficiency of information processing and diminish user experience and perception [73]. Excessively high visual variation can overload consumers’ cognitive capacity, making it difficult for them to focus effectively on the core message of the video [52]. Overly rapid scene transitions may cause visual fatigue and information overload, impairing consumers’ ability to integrate video content [74]. When information becomes too difficult to decode and process, consumers may develop resistance, leading them to disengage from the content as a means of alleviating cognitive discomfort [75]. Consequently, excessive visual variation negatively impacts consumer cognition, emotions, and engagement behavior. Based on this analysis, the following hypothesis is proposed:
H2. 
The relationship between visual variation and consumer engagement in short videos follows an inverted U-shaped pattern: as visual variation increases, consumer engagement initially rises, but when visual variation exceeds an optimal range, engagement declines.

3.3. Auditory Emotional Arousal, Visual Variation, and Consumer Engagement

In the process of multiple resource allocation, when the rhythm, emotional intensity, and dynamic characteristics of visual and auditory stimuli are aligned, information processing becomes more coherent, and consumers allocate cognitive resources more efficiently [76]. Matched audiovisual inputs reduce cognitive conflict during information reception, enhance attentional engagement, and facilitate the smoother comprehension of video content [54]. When information processing is fluid, consumers are more likely to experience positive emotional reactions, thereby increasing their preference for the target content [62].
Building on this cognitive mechanism, this study proposes that when the level of visual variation and the degree of auditory emotional arousal in influencer-generated short videos are congruent, consumers’ information processing becomes more efficient, leading to improved multisensory integration and stronger engagement intentions. Conversely, when visual variation is low while auditory emotional arousal is high, the intense auditory stimulation may create a perceptual mismatch with the stable visual presentation [57], making it difficult for consumers to construct a coherent information framework, thereby leading to processing biases. For example, if highly energetic and fast-paced background music is played over a static image, consumers may experience confusion and a sense of disconnection [77]. Similarly, when visual variation is high but auditory emotional arousal is low, rapid scene transitions may lack sufficient auditory cues to guide consumers’ information organization, making it difficult for them to sustain attention effectively and reducing memory retention of the marketing content [77]. Additionally, the Elaboration Likelihood Model suggests that congruent multisensory stimuli may facilitate central route processing, increasing message involvement and behavioral intention [78]. When audiovisual cues are misaligned, the cognitive dissonance may shift processing to the peripheral route, weakening consumer engagement. Based on this analysis, the following hypotheses are proposed:
H3. 
The congruence between auditory and visual features in influencer-generated short video content influences consumer engagement.
H3a. 
High auditory emotional arousal combined with high visual variation leads to higher consumer engagement.
H3b. 
Low auditory emotional arousal combined with low visual variation leads to higher consumer engagement.
Figure 1 presents the conceptual framework that guides this study. The model outlines three core hypotheses. The first two focus on the independent effects of auditory emotional arousal and visual variation on consumer engagement. The third hypothesis addresses their combined influence, where X in the figure represents the interactive effect of these two variables, emphasizing the role of audio–visual congruence in shaping user responses. This framework builds upon multiple resource theory [57] and draws additional support from emotional arousal theory [79] and cognitive models of information processing [80], as discussed in earlier sections.

4. Method

4.1. Data Source

This study collected data from 12,842 product-related short videos published by 170 influencers on the Douyin platform between May and July 2024. The data were obtained from Huitun (www.huitun.com, accessed on 3 May 2024), a publicly accessible analytics platform that provides detailed Douyin influencer metrics and video-level engagement statistics. A custom Python-based web crawler was developed and employed to systematically extract relevant information, including video URLs, publication dates, engagement metrics (likes, comments, shares), and metadata such as influencer IDs, follower counts, and video categories. As China’s leading short-video-driven social commerce platform, Douyin hosts over 700 million daily active users and demonstrates exceptionally high engagement rates, particularly among younger demographics [76]. Influencers on the platform play a central role in product promotion through visually appealing and emotionally engaging short video content.
To ensure the quality and representativeness of the sample, influencers were selected based on the following criteria: verified account status, consistent posting activity during the data collection period, and the availability of complete audiovisual information. The dataset covers a wide range of product categories, including beauty, fashion, consumer electronics, food, and everyday household items.
Each video record in the dataset includes detailed attributes such as video duration and the number of likes, saves, comments, and shares. In addition, to provide a more comprehensive view of influencer effectiveness in short video marketing, supplementary information was collected on influencer characteristics, including follower count, historical content volume, and content category distribution.

4.2. Data Preprocessing

Before conducting data analysis, we performed preprocessing on the short video data generated by influencers. First, we converted the short videos into audio files using the Freemake 4.1.12 video processing software and extracted textual information from the audio using speech recognition techniques. To separate vocals from background music, we utilized the Ultimate Vocal Remover v5 [81] and the UVR GUI package v5.6 (https://github.com/Anjok07/ultimatevocalremovergui, accessed on 3 May 2024) [82].
Next, we further processed the vocal signals using Praat 6.2, a widely used speech analysis software capable of extracting and analyzing various acoustic features from audio files. We began by applying speech framing, a process that segments continuous speech data into short time frames to ensure the extraction of stable and meaningful features. Following common speech processing practices, we set each frame length to 20–30 milliseconds, using 512 data points per frame with an overlap of 256 data points between consecutive frames. After framing, we applied the Hanning window function to each speech frame to ensure continuity between adjacent frames.
After framing and windowing, we performed Short-Time Fourier Transform (STFT) to extract both time-domain and frequency-domain features from the speech signal. Through these preprocessing steps, Praat generated a spectrogram of the input audio and extracted fundamental vocal characteristics, such as speech rate, pitch, and loudness.
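The framing, windowing, and STFT pipeline described above can be sketched as follows. This is a minimal NumPy illustration of the general procedure, not a reproduction of Praat's internals; the 512-sample frame length and 256-sample hop follow the parameters stated above, and the signal is a synthetic tone rather than real influencer audio.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Frame a mono signal, apply a Hanning window to each frame,
    and return the STFT magnitude spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Overlapping frames: each starts `hop` samples after the previous one
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * window, axis=1)  # one FFT per windowed frame
    return np.abs(spectrum)  # shape: (n_frames, frame_len // 2 + 1)

# Example: 1 s of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
mag = stft_magnitude(tone)
print(mag.shape)  # (61, 257)
```

Each row of the resulting matrix is one time frame and each column one frequency bin, which is the time–frequency representation from which summary acoustic features (pitch, loudness, and so on) can then be derived.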
Figure 2 presents an example of feature extraction and spectrogram analysis generated using Praat. The sample short video is 1 min long, with the upper section displaying the waveform of the input audio over time. The lower section illustrates the spectrogram, where the x-axis represents time, and the y-axis represents frequency (in kHz). The spectrogram records frequency data ranging from 0 to 5000 Hz and loudness levels from 50 to 100 dB. The gray areas in the spectrogram indicate energy density, with brighter regions signifying lower energy densities. If a dark region appears at around the 30 s mark at 4000 Hz, it suggests that the sound carried a significant amount of energy in the high-frequency range at that moment. The blue curve represents pitch variation over time, the yellow curve represents loudness variation over time, and the red dashed lines indicate formant trajectories.
To characterize the overall acoustic profile of each short video, we computed the average values of the acoustic features over the entire duration of the audio.

4.3. Variable Measurement

4.3.1. Independent Variables

For auditory emotional arousal, we utilized the SpeechBrain 1.0 open-source toolkit to identify the emotional arousal and valence of each audio file. SpeechBrain’s built-in model, based on Wav2vec 2.0, quantifies the emotional dimensions of speech with an accuracy of up to 80% and has been widely applied in academic research [83].
Visual variation is a key dimension for assessing the dynamism and content richness of short videos, and scene transition rate serves as a core indicator of visual variation frequency. Scene transition rate refers to the frequency of scene changes per unit time, where a scene change is defined as a transition from one continuous shot to another with significant visual or semantic differences. This metric effectively captures the dynamism and variation of video content. The measurement process for scene transition rate in this study consists of three key steps: video preprocessing, scene transition detection, and transition rate computation and validation.
First, we extracted key attributes from each video—frame rate, resolution, and total duration—using the FFmpeg 4.1.12 multimedia processing tool, which is widely used for handling and converting multimedia files. Next, we decomposed each video into individual frames and applied OpenCV algorithms to extract visual features such as color histograms, edge features, and keypoint distributions. We then calculated the inter-frame feature differences, and when the difference exceeded a predefined dynamic threshold, the frame was marked as a scene transition point. The dynamic threshold was determined using the following formula:
Threshold = μ + k × σ
where μ represents the mean inter-frame difference, σ is the standard deviation, and k is an adjustment coefficient, typically set to 1.5. This dynamic thresholding approach ensures sensitivity to significant scene transitions while reducing false detections caused by lighting changes or rapid motion.
To validate the reliability of the detection method, we conducted manual verification on 10% of randomly sampled videos, focusing on whether the identified scene transitions accurately reflected semantic changes in the video (e.g., changes in scene background, main content, or camera perspective) and whether false detections due to lighting shifts or gradual transitions were avoided. The validation results showed that the automated scene transition detection achieved an accuracy of 93.7%, confirming the high precision of the method.
After identifying scene transition points, we computed the total number of scene transitions per video and used this value to calculate the scene transition rate using the following formula:
R = N / T
where R represents the scene transition rate, N is the total number of scene transitions, and T is the total video duration. The scene transition rate provides a comprehensive measure of the dynamism of and variation in video content, offering a critical variable for analyzing the impact of video pacing on user behavior. All variable processing was conducted using Python 3.13.2, integrating OpenCV and Pandas, and the processed results were stored in a structured database to ensure high-quality data for subsequent statistical analysis and model validation.
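The three-step procedure above (inter-frame feature differences, the dynamic threshold μ + k·σ, and the rate R = N/T) can be illustrated with a NumPy sketch. The grayscale histogram feature and the synthetic frames below are simplifying assumptions; the actual pipeline uses FFmpeg decoding and richer OpenCV features (color histograms, edges, keypoints).

```python
import numpy as np

def histogram_diff(frame_a, frame_b, bins=16):
    """L1 distance between normalized gray-level histograms of two frames."""
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 255), density=True)
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 255), density=True)
    return np.abs(h_a - h_b).sum()

def scene_transition_rate(frames, duration_s, k=1.5):
    """Flag frame i as a cut when its difference from frame i-1 exceeds
    mu + k * sigma, then return cuts per second (R = N / T)."""
    diffs = np.array([histogram_diff(frames[i - 1], frames[i])
                      for i in range(1, len(frames))])
    threshold = diffs.mean() + k * diffs.std()
    n_cuts = int((diffs > threshold).sum())
    return n_cuts / duration_s

# Synthetic 10 s clip: two abrupt scene changes between a dark and a bright scene
rng = np.random.default_rng(0)
dark = rng.integers(0, 60, size=(24, 32))
bright = rng.integers(180, 250, size=(24, 32))
frames = [dark + rng.integers(0, 5, size=(24, 32)) for _ in range(40)]
frames += [bright + rng.integers(0, 5, size=(24, 32)) for _ in range(40)]
frames += [dark + rng.integers(0, 5, size=(24, 32)) for _ in range(40)]
rate = scene_transition_rate(frames, duration_s=10.0)
print(rate)  # 0.2: two detected cuts over ten seconds
```

The dynamic threshold adapts to each clip's baseline motion, so gentle lighting drift within a scene stays below it while true cuts stand out.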

4.3.2. Dependent Variable

The dependent variable in this study is consumer engagement with short videos, measured as the total number of likes, shares, saves, and comments received by each video [84,85]. A distribution analysis of the dataset revealed that the dependent variable exhibited a highly right-skewed distribution. Following prior research practices, we applied a natural logarithm transformation to the dependent variable to ensure the reliability of the analysis. This measurement approach provides a comprehensive representation of consumer engagement behavior after watching short videos.
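A minimal sketch of this transformation (the counts below are invented for illustration; log1p is used here as a common guard for zero-engagement videos, whereas the text applies a plain natural logarithm to the observed counts):

```python
import numpy as np

# Hypothetical per-video engagement = likes + shares + saves + comments
engagement = np.array([0, 3, 12, 45, 180, 2300, 51000])

# log(1 + x) compresses the heavy right tail while keeping zeros defined
log_engagement = np.log1p(engagement)
print(log_engagement.round(2))
```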

4.3.3. Control Variables

To better validate the research hypotheses and control for potential confounding variables, we considered a series of factors that could influence the results.
First, regarding textual content, the linguistic features of short videos may impact consumer engagement. Therefore, we controlled for the textual characteristics of short videos. Using natural language processing (NLP) techniques, we converted the audio into text and analyzed textual characteristics using Linguistic Inquiry and Word Count (LIWC) 2022 software. Specifically, we computed the proportion of function words, cognitive words, emotional words, and social process words in each short video. Additionally, we measured the readability of the text: by calculating the proportion of basic vocabulary words (based on the 2021 International Chinese Proficiency Standards, which include 2245 elementary-level Chinese words) in the text, we derived a readability score for each short video.
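The readability measure can be sketched as a simple vocabulary-coverage ratio. The toy word list and pre-segmented transcript below are stand-ins; the study uses the 2245-word elementary list of the 2021 International Chinese Proficiency Standards, and real Chinese text must first be segmented into words.

```python
# Hypothetical stand-in for the 2,245 elementary-level words
BASIC_VOCAB = {"我", "你", "好", "喜欢", "这个", "很", "买"}

def readability_score(tokens):
    """Share of tokens found in the basic vocabulary (0 = hard, 1 = easy)."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in BASIC_VOCAB) / len(tokens)

# A pre-segmented toy transcript of a short video
tokens = ["我", "很", "喜欢", "这个", "产品", "质量", "好"]
print(round(readability_score(tokens), 3))  # 5 of 7 tokens are basic -> 0.714
```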
We also extracted various audio-related variables for control. To measure the speech rate, we combined speech duration data and speech transcription text, calculating the number of words spoken per second in each short video. Pitch characteristics were determined by analyzing the fundamental frequency (F0) using cepstral analysis in Praat, with the average fundamental frequency of each short video serving as its pitch feature. Loudness characteristics were represented by the sound pressure level (SPL), which was obtained by applying log transformation to the amplitude of the audio signal in each time window, as calculated by Praat. Additionally, we measured background music tempo by extracting the beats per minute (BPM) of non-speech background music using Tunebat 1.03 (http://tunebat.com/, accessed on 28 July 2024), a tool widely used by music professionals due to its high accuracy [86].
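The loudness control can be approximated by the standard windowed log-amplitude computation, sketched below. The 20 ms window and the 2×10⁻⁵ reference pressure are conventional assumptions; Praat's intensity algorithm differs in detail.

```python
import numpy as np

def spl_per_window(signal, sr, win_s=0.02, ref=2e-5):
    """Approximate sound pressure level per window: 20 * log10(RMS / ref)."""
    win = int(sr * win_s)
    n = len(signal) // win
    frames = signal[: n * win].reshape(n, win)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-12) / ref)

sr = 16000
t = np.arange(sr) / sr
tone = 0.1 * np.sin(2 * np.pi * 220 * t)   # constant-amplitude synthetic tone
spl = spl_per_window(tone, sr)
# RMS of a 0.1-amplitude sine is ~0.0707, i.e. roughly 71 dB re 2e-5
print(round(float(spl.mean()), 1))
```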
In terms of image-related controls, we used Google Colab Notebook to identify and control for key visual characteristics in each short video. Specifically, we applied the Laplacian filter to detect the sharpness of frame edges, which serves as an indicator of image quality, closely related to lighting conditions and motion blur.
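The Laplacian sharpness check can be sketched without OpenCV via the variance-of-Laplacian heuristic, a common proxy for blur: sharper frames produce stronger second-derivative responses. The checkerboard frames below are synthetic stand-ins for real video frames.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

def laplacian_sharpness(gray):
    """Variance of the Laplacian response; higher values mean sharper edges."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):                 # 3x3 correlation over the valid region
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

# A sharp checkerboard frame versus a horizontally averaged (blurred) copy
sharp = np.indices((32, 32)).sum(axis=0) % 2 * 255.0
blurred = (sharp + np.roll(sharp, 1, axis=1)) / 2
print(laplacian_sharpness(sharp) > laplacian_sharpness(blurred))  # True
```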
Finally, we incorporated content creator characteristics as control variables to account for the potential influence of different influencers on consumer engagement. We controlled for follower count and gender. Follower count represents the level of influence an influencer holds, as influencers with larger follower bases tend to have greater authority and credibility, which can drive higher consumer engagement [87]. Additionally, gender may influence audience emotional resonance and receptivity, potentially affecting engagement [88]. Therefore, we included it as a control variable to enhance the accuracy of the study. Table 1 presents the descriptive statistics for all variables used.

4.3.4. Testing Methods

To examine the hypothesized inverted U-shaped relationship, we estimated a quadratic regression model, which is commonly used to test curvilinear effects [89]. The quadratic regression model is formulated as follows:
y = m + ax + bx² + ε·Controls
In this equation, x denotes the independent variable and x² its squared term, m represents the intercept, Controls captures a vector of control variables, and ε denotes the coefficients of the control variables. A statistically significant negative coefficient for b would support the presence of an inverted U-shaped effect. We also calculated the turning point, defined as x_max = −a/(2b), and assessed whether it lies within the observed range of the independent variable, as recommended in prior literature [90].
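A minimal sketch of this test on synthetic data (the data-generating coefficients are invented; control variables and standard errors are omitted):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic inverted-U data: engagement peaks at arousal = 0.5
x = rng.uniform(0, 1, 500)
y = 1.0 + 2.0 * x - 2.0 * x ** 2 + rng.normal(0, 0.05, 500)

# Fit y = m + a*x + b*x^2; np.polyfit returns the highest-order term first
b, a, m = np.polyfit(x, y, deg=2)
turning_point = -a / (2 * b)
print(b < 0, round(turning_point, 2))   # negative b, turning point near 0.5
```

A negative and significant quadratic coefficient, together with a turning point inside the observed range, is the pattern reported for Hypothesis H1.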
Furthermore, we examined the matching effect between auditory emotional arousal and visual variation by constructing a matching index, which is defined as follows:
Matching = |x − s|
Here, x represents auditory emotional arousal, and s denotes visual variation. The closer the matching index is to zero, the more aligned the two levels are, indicating a higher degree of congruence between auditory and visual elements.
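As a sketch, the matching index can be computed as the absolute gap between the two standardized variables, so that values near zero indicate high congruence. The five example videos below are illustrative values, not data from the study.

```python
import numpy as np

def matching_index(arousal, variation):
    """Absolute gap between standardized arousal and visual variation;
    values near zero indicate high audiovisual congruence."""
    z = lambda v: (v - v.mean()) / v.std()
    return np.abs(z(arousal) - z(variation))

# Illustrative values for five hypothetical videos
arousal = np.array([0.2, 0.5, 0.8, 0.4, 0.9])
variation = np.array([0.1, 0.6, 0.7, 0.5, 0.2])
idx = matching_index(arousal, variation)
print(idx.round(2))  # the last video (high arousal, low variation) is most mismatched
```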

4.4. Results

Due to differences in the measurement units of the independent variables, we first standardized all independent variables, setting their mean to 0 and standard deviation to 1. Based on the above analysis, we employed Ordinary Least Squares (OLS) regression to examine the linear effects of short video auditory features on consumer engagement. To assess potential multicollinearity issues, we calculated and inspected the Variance Inflation Factor (VIF). In all cases, the VIF values were below 5, indicating that multicollinearity was not a concern.
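The VIF diagnostic can be reproduced from first principles: for each predictor j, regress it on the remaining predictors and compute VIF_j = 1/(1 − R_j²). The sketch below uses synthetic predictors; statsmodels' variance_inflation_factor offers a ready-made equivalent.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + others
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)              # independent of x1 -> VIF near 1
x3 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1 -> large VIF
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

Values below the conventional cutoff of 5 (or the stricter 10), as observed in the study, indicate that multicollinearity is not a concern.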
Using an analytical model that includes all control variables, we employed quadratic regression to assess two focal relationships: first, the potential inverted U-shaped effects of auditory emotional arousal and visual variation on consumer engagement, and second, the role of audiovisual matching in shaping engagement outcomes.
As shown in Table 2, Model 1 includes only the control variables and serves as the baseline specification. Model 2 adds the linear and quadratic terms for auditory emotional arousal. The inclusion of these variables significantly improves model fit (R2 = 0.18, p < 0.01). The linear coefficient for auditory emotional arousal is positive and statistically significant (β = 0.24, p < 0.001), while the quadratic term is negative and also significant (β = –0.12, p < 0.01), providing empirical support for a curvilinear relationship. To further evaluate the inverted U-shaped pattern, we calculated the turning point of the curve, which occurs at a value of 0.54. This value lies within the observed range of the auditory emotional arousal variable, lending additional support to Hypothesis H1.
Model 3 extends Model 1 by incorporating visual variation and its quadratic term. The results indicate a good model fit (R2 = 0.21, p < 0.01). The linear term of visual variation is significant (β = 0.27, p < 0.001), whereas the quadratic term is nonsignificant (β = 0.002, p = 0.528). This suggests that visual variation has a significant positive linear effect on consumer engagement, but does not exhibit an inverted U-shaped relationship; thus, Hypothesis H2 is not supported.
Model 4 extends the previous specification by including both visual variation and its interaction with auditory emotional arousal. The inclusion of these terms improves the model fit (R2 = 0.21, p < 0.01). The coefficient for visual variation remains positive and statistically significant (β = 0.25, p < 0.001), lending further support to its role in enhancing consumer engagement. In addition, the interaction term is also significant (β = 0.18, p < 0.001), suggesting that the effect of auditory emotional arousal on engagement is amplified when it occurs alongside higher levels of visual variation. These results provide empirical support for the proposed interaction effect outlined in Hypothesis H3.
To further test the matching effect, Model 5 builds on Model 2 by incorporating the matching index between auditory emotional arousal and visual variation. The results indicate a good model fit (R2 = 0.24, p < 0.01), with the matching index significantly and positively influencing engagement behavior (β = 0.22, p < 0.001), providing strong support for Hypothesis H3.
Finally, we generated a three-dimensional interaction plot using Python’s Matplotlib 3.10.1 and NumPy libraries, as shown in Figure 3. The 3D interaction plot reveals that auditory emotional arousal exhibits an inverted U-shaped relationship with consumer engagement, with engagement peaking at moderate levels of auditory emotional arousal. Meanwhile, visual variation demonstrates a positive linear relationship with consumer engagement, indicating that higher visual variation leads to greater engagement.
Additionally, the matching effect between visual variation and auditory emotional arousal has a significant impact on engagement behavior. When the two dimensions are aligned—that is, high visual variation is paired with high auditory emotional arousal, and low visual variation is paired with low auditory emotional arousal—engagement levels significantly increase. Conversely, when auditory emotional arousal and visual variation are mismatched, consumer engagement declines noticeably. These results confirm that the matching effect serves as an important moderating factor in influencing consumer engagement.
Regarding the failure to support the hypothesized inverted U-shaped relationship for visual variation, we propose two possible explanations. First, the defining characteristics of short videos include brevity, high information density, and fast pacing. Consumers typically expect to maximize information intake and emotional stimulation within a short time frame [5]. As a result, high visual variation effectively meets this demand, fostering stronger engagement within a limited duration. Second, with the increasing proliferation and evolution of short video platforms, consumers have gradually adapted to highly dynamic and fast-paced content environments. This has led to a higher threshold for information overload and cognitive load tolerance [18]. Consequently, even with high levels of visual variation, consumers can efficiently process information without experiencing significant cognitive difficulties.

4.5. Robustness Test

The robustness test consisted of two parts. First, we examined the stability of the dataset. By assessing the statistical distribution of the dataset, we found that the dependent variable exhibited characteristics of count data and followed a positively skewed distribution. Given this distribution pattern, negative binomial regression was deemed a suitable method for model estimation [91]. Therefore, we re-estimated the models using negative binomial regression, allowing us to effectively account for the high variability in consumer engagement behavior.
Second, we conducted a sensitivity analysis focused on the dependent variable to assess the robustness of the findings. A subset of short videos in the dataset recorded no consumer engagement, raising the possibility that zero-inflated observations could bias the results. We addressed this concern by removing 785 such cases and re-estimating the models using the reduced sample. As reported in Table 3, the key coefficients remained stable in direction and significance across specifications, lending further confidence to the validity of the main findings.

5. Conclusions

5.1. Study Summary

This study examines how auditory emotional arousal, visual variation, and their matching effects in influencer-generated product recommendation videos influence consumer engagement in short video marketing. Leveraging machine learning techniques, we integrated speech recognition algorithms and image analysis methods to conduct an in-depth analysis of multimodal data (including audio, visual, and textual elements) in short videos. By applying empirical analysis to 12,842 videos from Douyin, a leading short video platform in China, this study systematically explores how auditory and visual features interact through multiple resource integration to shape consumer engagement behavior.
The findings suggest a curvilinear relationship between auditory emotional arousal and consumer engagement. Engagement levels are highest when influencers convey a moderate degree of emotional arousal, which appears to foster favorable cognitive evaluations and emotional responses. In contrast, visual variation exhibits a positive linear effect, indicating that higher levels of visual dynamism are generally associated with greater consumer engagement.
In addition to these main effects, the coordination between auditory and visual cues plays an important role in shaping information processing. The effectiveness of short video content is not solely determined by the intensity of individual sensory inputs, but also by how well these inputs are synchronized. When high auditory arousal is paired with high visual variation, consumers are more likely to experience heightened immersion, which can lead to increased engagement. On the other hand, the combination of low auditory and visual intensity may support attention stability in low-stimulation environments, thereby facilitating information absorption and comprehension.

5.2. Theoretical Implications

Situated at the intersection of multimodal communication and interactive marketing [92,93], this study provides a theoretical lens to understand how audiovisual design in influencer-generated content shapes consumer engagement in short video contexts. Building on this foundation, the study offers several key theoretical contributions and innovations. First, it contributes to the growing literature on short video marketing by drawing attention to the understudied role of auditory cues in influencer-generated content. In recent years, short videos have become a crucial medium for brand communication and influencer marketing, with existing studies analyzing the effects of various short video attributes on consumer behavior [31]. Prior research has examined video comments [94], storytelling techniques [95], influencer image [96], color schemes [97], and camera perspectives [19]. However, despite the critical role of auditory stimuli in shaping short video content [5], research has primarily focused on textual and visual characteristics, leaving the impact of sound features underexplored. Even among studies that address auditory characteristics, the focus has typically been limited to physical attributes such as pitch, volume, and timbre and their effects on consumer cognition, emotions, and behavior [9]. This study introduces the concept of auditory emotional arousal to capture the affective impact of vocal expression, offering a new lens through which to understand how sound influences consumer engagement in short video contexts.
Second, the study extends existing research on visual content by shifting the focus from static to dynamic visual features. Although prior studies have explored visual characteristics in short videos [98], they have primarily focused on static visual elements such as color composition, framing, and lighting effects [19], with limited attention to dynamic visual features. By examining how variation in visual stimuli shapes user cognition and response, the findings contribute to a more nuanced understanding of multimodal processing in short-form digital media [44]. By analyzing how visual variation influences consumer information processing and decision-making, this perspective also informs more effective design strategies for marketers seeking to optimize consumer attention and engagement through coordinated audiovisual content.
Furthermore, this study extends the applicability of multiple resource theory (MRT) to short video marketing. MRT suggests that the competition and complementarity between different sensory modalities affect information processing efficiency and consumer decision-making [54]. Existing research has primarily focused on single-modality information processing, such as visually driven ad design [99] or audio-dominant brand communication [5], with limited attention to short videos as a multisensory medium. By adopting the MRT framework, this study explores the interaction between auditory and visual elements in short video content, enhancing the explanatory power of MRT in dynamic digital media environments. Notably, our findings reveal that while low auditory emotional arousal combined with low visual variation reduces sensory stimulation, it stabilizes attention allocation and minimizes cognitive load, thereby improving information absorption. This finding challenges the traditional assumption in MRT that richer multisensory inputs always enhance information processing efficiency [53]. Instead, it suggests that short video content design must balance sensory modality alignment and cognitive resource optimization to maximize marketing effectiveness.
Finally, this study contributes to social media research by integrating machine learning techniques for multimodal data analysis. With the increasing application of big data analytics in marketing research [100], short video marketing presents a unique challenge due to the complex multimodal structure of data (text, images, and audio) [101]. Although multimodal machine learning improves the accuracy of consumer information analysis, challenges remain in determining the interpretability, relative contribution, and structuring of different modalities [102]. Consequently, research on the interactive effects of multimodal data in social media remains limited [70,103], with audio-based studies being particularly scarce in marketing research [104]. To address this gap, this study developed a speech mining framework for short video content, extracting auditory emotional arousal features from influencer speech and integrating speech mining techniques into short video marketing research. Additionally, by applying multimodal machine learning techniques, this study combines textual, visual, and auditory data to quantify short video content characteristics and analyze how different sensory inputs interact to influence consumer behavior. Compared to traditional single-modality analysis, this multimodal interaction approach provides a more accurate representation of consumer information processing in short video environments, offering a new analytical framework for social media research.

5.3. Practical Implications

This study provides practical insights for short video content creators, brands, and short video platforms, offering actionable strategies to optimize content design, enhance consumer interaction, and strengthen marketing effectiveness.
For short video content creators, this study highlights the importance of auditory emotional arousal and audiovisual congruence in driving consumer engagement. Results show that moderate levels of emotional arousal are most effective. Influencers should avoid overly flat or overly intense vocal expressions and instead aim for a balanced tone that fosters both cognitive processing and emotional resonance. Acoustic analysis suggests that a “moderate” level of auditory arousal typically features moderate pitch variation (with average fundamental frequency around 340–360 Hz), steady loudness (approximately 50–60 dB), and a speech rate of about 4–5 words per second. These quantifiable indicators can serve as practical references for content creators to calibrate their vocal delivery and strengthen emotional resonance with viewers. In addition, the match between auditory and visual intensity is key to enhancing user experience. High-intensity visuals (e.g., fast transitions) should be paired with high-arousal audio (e.g., energetic voice or music), while low-intensity visuals (e.g., static shots) pair better with low-arousal audio to maintain sensory harmony and avoid overload. By aligning audiovisual elements, creators can increase immersion, stimulate interactions, and improve content spread.
For brands, the study offers strategic guidance for selecting collaborating influencers. Brands should not only consider influencer follower count or interaction metrics, but also evaluate the alignment between an influencer’s vocal expression style and short video content characteristics. Selecting influencers who maintain auditory emotional arousal within an optimal range and align auditory and visual features effectively can enhance consumer brand perception and engagement, thereby improving marketing effectiveness. Additionally, brands can apply A/B testing to assess different auditory–visual matching strategies in advertisements and analyze which combinations yield higher interaction and conversion rates. For instance, in the fashion and beauty industry, a low emotional arousal–low visual variation design, combined with a soft and approachable vocal style, may foster trust and brand affinity. Conversely, for technology or sports brands, a high emotional arousal–high visual variation approach, featuring energetic and dynamic vocal expression, may reinforce a youthful and high-energy brand image. By implementing auditory–visual matching in precision marketing strategies, brands can increase the appeal and effectiveness of short video advertisements, driving higher consumer engagement and brand loyalty.
For short video platforms, this study provides empirical evidence for optimizing recommendation algorithms and content distribution strategies. Current mainstream short video recommendation systems primarily rely on user browsing history, interaction behaviors, and textual metadata, without fully considering the impact of multimodal information (e.g., auditory–visual matching) on user experience. Our findings indicate that consumer acceptance of short videos is significantly influenced by auditory–visual congruence. Therefore, platforms could integrate speech emotion analysis and visual variation recognition into recommendation algorithms, allowing users to discover videos better aligned with their sensory preferences, thereby increasing watch time and engagement rates. Moreover, platforms can develop intelligent content creation tools to assist creators in optimizing video production. For example, AI-powered tools could automatically analyze an influencer’s vocal characteristics and recommend the most suitable video editing styles and visual effects. Alternatively, platforms could use AI-based visual analysis to suggest the most appropriate background music, speech tempo, or vocal tone based on video dynamics, thereby enhancing the overall quality of short video content. This data-driven content optimization approach not only improves creator efficiency, but also ensures that published videos better align with user preferences, ultimately boosting user retention and increasing overall platform engagement.

5.4. Study Limitations and Future Study Directions

Although the findings provide new insights into audiovisual processing in short video contexts, several limitations should be acknowledged. First, the analysis did not systematically examine whether the relationship between audiovisual congruence and consumer engagement varies across product categories. It is plausible that different types of products may elicit distinct cognitive and emotional responses, which could moderate the effects of sensory alignment. Future research might explore these dynamics more explicitly, particularly by examining how congruence influences attention, emotion, and behavioral outcomes across categories such as hedonic versus utilitarian or durable versus non-durable goods.
Second, despite efforts to account for robustness, the observational nature of the dataset limits the ability to make strong causal inferences. Experimental designs, such as randomized controlled trials, would allow researchers to better isolate the effects of auditory and visual stimuli on engagement outcomes.
Third, this study relied on machine learning models for audio feature analysis. Although these models offer advantages in large-scale data processing, they may struggle with interpreting subtle emotional expressions. For example, existing speech emotion analysis techniques may struggle to accurately differentiate subtle variations in influencer vocal expression, such as sarcasm, humor, or deep emotional resonance. Future research could combine human evaluation with machine learning to improve accuracy.
Additionally, although we analyzed the moderating effects of product categories (e.g., hedonic vs. utilitarian, durable vs. non-durable), no significant moderation was found. This may be due to the categorizations used and the sample’s limitations. Future research could explore more diverse market segments or audience profiles to validate these moderating effects.
Finally, the study focused exclusively on data from Douyin, a dominant short video platform in China. While structurally similar to international platforms such as TikTok, platform-specific user behaviors and cultural norms may limit the generalizability of the results. Nonetheless, the analytical framework developed here is adaptable and could be extended to cross-platform or cross-cultural contexts in future research.

Author Contributions

Conceptualization, Q.Y. and Y.W.; Funding acquisition, Y.J.; Methodology, Q.W. and Y.W.; Supervision, Q.Y. and J.L.; Writing—original draft, Q.Y. and Y.W.; Writing—review and editing, Y.W. and Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported by the National Natural Science Foundation of China Youth Program (72402097), the Key Project of the Sichuan Provincial Philosophy and Social Science Foundation (SCJJ23ND39), and the Jiangsu Provincial Social Science Foundation Youth Program (23GLC025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, K.S. The Impact of Barrage System Fluctuation on User Interaction in Digital Video Platforms: A Perspective from Signaling Theory and Social Impact Theory. J. Res. Interact. Mark. 2023, 17, 602–619. [Google Scholar] [CrossRef]
  2. Liu, X.; Shi, S.W.; Teixeira, T.; Wedel, M. Video Content Marketing: The Making of Clips. J. Mark. 2018, 82, 86–101. [Google Scholar]
  3. Yin, X.; Li, J.; Si, H.; Wu, P. Attention Marketing in Fragmented Entertainment: How Advertising Embedding Influences Purchase Decision in Short-Form Video Apps. J. Retail. Consum. Serv. 2024, 76, 103572. [Google Scholar] [CrossRef]
  4. Oestreicher, G. 50 Tiktok Statistics in 2024 for Social Media Marketing. Available online: https://metricool.com/tiktok-statistics/ (accessed on 27 January 2025).
  5. Yang, Q.; Wang, Y.; Song, M.; Jiang, Y.; Li, Q. Sonic Strategies: Unveiling the Impact of Sound Features in Short Video Ads on Enterprise Market Entry Performance. J. Bus.-Bus. Mark. 2025, 32, 95–116. [Google Scholar] [CrossRef]
  6. Rutten, S.; Santoro, R.; Hervais-Adelman, A.; Formisano, E.; Golestani, N. Cortical Encoding of Speech Enhances Task-Relevant Acoustic Information. Nat. Hum. Behav. 2019, 3, 974–987. [Google Scholar] [CrossRef]
  7. Charest, I.; Pernet, C.R.; Rousselet, G.A.; Quiñones, I.; Latinus, M.; Fillion-Bilodeau, S.; Chartrand, J.; Belin, P. Electrophysiological Evidence for an Early Processing of Human Voices. BMC Neurosci. 2009, 10, 127. [Google Scholar] [CrossRef]
  8. Aeschlimann, M.; Knebel, J.F.O.; Murray, M.M.; Clarke, S. Emotional Pre-Eminence of Human Vocalizations. Brain Topogr. 2008, 20, 239–248. [Google Scholar] [CrossRef]
  9. Miao, M.; Wang, Y.; Li, J.; Jiang, Y.; Yang, Q. Audio Features and Crowdfunding Success: An Empirical Study Using Audio Mining. J. Theor. Appl. Electron. Commer. Res. 2024, 19, 3176–3196. [Google Scholar] [CrossRef]
  10. Shao, Z. Revealing Consumers’ Hedonic Buying in Social Media: The Roles of Social Status Recognition, Perceived Value, Immersive Engagement and Gamified Incentives. J. Res. Interact. Mark. 2024; ahead of print. [Google Scholar]
  11. Mather, M.; Sutherland, M.R. Arousal-Biased Competition in Perception and Memory. Perspect. Psychol. Sci. 2011, 6, 114–133. [Google Scholar]
  12. Buechel, E.C.; Townsend, C.; Fischer, E.; Moreau, P. Buying Beauty for the Long Run: (Mis)Predicting Liking of Product Aesthetics. J. Consum. Res. 2018, 45, 275–297. [Google Scholar]
  13. Berridge, C.W.; Arnsten, A.F.T. Psychostimulants and Motivated Behavior: Arousal and Cognition. Neurosci. Biobehav. Rev. 2013, 37, 1976–1984. [Google Scholar]
  14. Cian, L.; Krishna, A.; Elder, R.S. A Sign of Things to Come: Behavioral Change through Dynamic Iconography. J. Consum. Res. 2015, 41, 1426–1446. [Google Scholar]
  15. Ryu, S. From Pixels to Engagement: Examining the Impact of Image Resolution in Cause-Related Marketing on Instagram. J. Res. Interact. Mark. 2024, 18, 709–730. [Google Scholar] [CrossRef]
  16. Rumpf, C.; Boronczyk, F.; Breuer, C. Predicting Consumer Gaze Hits: A Simulation Model of Visual Attention to Dynamic Marketing Stimuli. J. Bus. Res. 2020, 111, 208–217. [Google Scholar]
  17. Sample, K.L.; Hagtvedt, H.; Brasel, S.A. Components of Visual Perception in Marketing Contexts: A Conceptual Framework and Review. J. Acad. Mark. Sci. 2020, 48, 405–421. [Google Scholar]
  18. Li, C.; Lu, H.; Xiang, Y.; Gao, R. Geo-DMP: A DTN-Based Mobile Prototype for Geospatial Data Retrieval. ISPRS Int. J. Geo-Inf. 2020, 9, 8. [Google Scholar] [CrossRef]
  19. Gan, J.; Shi, S.; Filieri, R.; Leung, W.K.S. Short Video Marketing and Travel Intentions: The Interplay Between Visual Perspective, Visual Content, and Narration Appeal. Tour. Manag. 2023, 99, 104795. [Google Scholar]
  20. Ramezani Nia, M.; Shokouhyar, S. Analyzing the Effects of Visual Aesthetic of Web Pages on Users’ Responses in Online Retailing Using the Visawi Method. J. Res. Interact. Mark. 2020, 14, 357–389. [Google Scholar]
  21. Hagtvedt, H.; Brasel, S.A. Cross-Modal Communication: Sound Frequency Influences Consumer Responses to Color Lightness. J. Mark. Res. 2016, 53, 551–562. [Google Scholar]
  22. Fiez, J.A.; Balota, D.A.; Raichle, M.E.; Petersen, S.E. Effects of Lexicality, Frequency, and Spelling-to-Sound Consistency on the Functional Anatomy of Reading. Neuron 1999, 24, 205–218. [Google Scholar]
  23. Leung, F.F.; Gu, F.F.; Li, Y.; Zhang, J.Z.; Palmatier, R.W. Influencer Marketing Effectiveness. J. Mark. 2022, 86, 93–115. [Google Scholar] [CrossRef]
  24. Monroe Meng, L.; Kou, S.; Duan, S.; Bie, Y. The Impact of Content Characteristics of Short-Form Video Ads on Consumer Purchase Behavior: Evidence from Tiktok. J. Bus. Res. 2024, 183, 114874. [Google Scholar] [CrossRef]
  25. Ki, C.W.C.; Kim, Y.K. The Mechanism by Which Social Media Influencers Persuade Consumers: The Role of Consumers’ Desire to Mimic. Psychol. Mark. 2019, 36, 905–922. [Google Scholar] [CrossRef]
  26. Shao, Z. How the Characteristics of Social Media Influencers and Live Content Influence Consumers’ Impulsive Buying in Live Streaming Commerce? The Role of Congruence and Attachment. J. Res. Interact. Mark. 2024, 18, 506–527. [Google Scholar] [CrossRef]
  27. Shen, X.; Wang, J. How Short Video Marketing Influences Purchase Intention in Social Commerce: The Role of Users’ Persona Perception, Shared Values, and Individual-Level Factors. Hum. Soc. Sci. Commun. 2024, 11, 213–290. [Google Scholar] [CrossRef]
  28. Tian, Z.; Dew, R.; Iyengar, R. Mega or Micro? Influencer Selection Using Follower Elasticity. J. Mark. Res. 2024, 61, 472–495. [Google Scholar] [CrossRef]
  29. Chen, H.; Ren, J.; Salvendy, G.; Wei, J. The Effect of Influencer Persona on Consumer Decision-Making Towards Short-Form Video Ads—From the Angle of Narrative Persuasion; Springer International Publishing: Cham, Switzerland, 2022; pp. 223–234. ISSN 0302-9743. [Google Scholar]
  30. Yang, J.; Zhang, J.; Zhang, Y. Engagement that Sells: Influencer Video Advertising on Tiktok. Mark. Sci. 2024, 44, 247–489. [Google Scholar] [CrossRef]
  31. Zhang, J.; Li, C. The Influence and Configuration Effect of Content Characteristics on Customer Input in the Context of Short Video Platforms. J. Res. Interact. Mark. 2024, 19, 482–497. [Google Scholar] [CrossRef]
  32. Dong, X.; Liu, H.; Xi, N.; Liao, J.; Yang, Z. Short Video Marketing: What, When and How Short-Branded Videos Facilitate Consumer Engagement. Internet Res. 2024, 34, 1104–1128. [Google Scholar] [CrossRef]
  33. Wang, D.; Luo, X.R.; Hua, Y.; Benitez, J. Customers’ Help-Seeking Propensity and Decisions in Brands’ Self-Built Live Streaming E-Commerce: A Mixed-Methods and Fsqca Investigation from a Dual-Process Perspective. J. Bus. Res. 2023, 156, 113540. [Google Scholar] [CrossRef]
  34. Wang, Y. Humor and Camera View on Mobile Short-Form Video Apps Influence User Experience and Technology-Adoption Intent, an Example of Tiktok (Douyin). Comput. Hum. Behav. 2020, 110, 106373. [Google Scholar] [CrossRef]
  35. Al-Emadi, F.A.; Ben Yahia, I. Ordinary Celebrities Related Criteria to Harvest Fame and Influence on Social Media. J. Res. Interact. Mark. 2020, 14, 195–213. [Google Scholar] [CrossRef]
  36. Simmonds, L.; Bogomolova, S.; Kennedy, R.; Nenycz Thiel, M.; Bellman, S. A Dual-Process Model of How Incorporating Audio-Visual Sensory Cues in Video Advertising Promotes Active Attention. Psychol. Mark. 2020, 37, 1057–1067. [Google Scholar] [CrossRef]
  37. Noteboom, J.T.; Fleshner, M.; Enoka, R.M. Activation of the Arousal Response Can Impair Performance on a Simple Motor Task. J. Appl. Physiol. 2001, 91, 821–831. [Google Scholar] [CrossRef]
  38. Pribram, K.H.; Mcguinness, D. Arousal, Activation, and Effort in the Control of Attention. Psychol. Rev. 1975, 82, 116. [Google Scholar] [CrossRef]
  39. Wang, B.; Han, Y.; Kandampully, J.; Lu, X. How Language Arousal Affects Purchase Intentions in Online Retailing? The Role of Virtual Versus Human Influencers, Language Typicality, and Trust. J. Retail. Consum. Serv. 2025, 82, 104106. [Google Scholar] [CrossRef]
  40. Mehrabian, A. Affiliation as a Function of Attitude Discrepancy with Another and Arousal-Seeking Tendency. J. Pers. 1975, 43, 582–592. [Google Scholar] [CrossRef]
  41. Smith, P.C.; Curnow, R. “Arousal Hypothesis” and the Effects of Music on Purchasing Behavior. J. Appl. Psychol. 1966, 50, 255. [Google Scholar] [CrossRef]
  42. Yin, D.; Bond, S.D.; Zhang, H. Keep Your Cool or Let It Out: Nonlinear Effects of Expressed Arousal on Perceptions of Consumer Reviews. J. Mark. Res. 2017, 54, 447–463. [Google Scholar] [CrossRef]
  43. Yoo, J.; Kim, J.; Kim, M.; Park, M. Imagery Evoking Visual and Verbal Information Presentations in Mobile Commerce: The Roles of Augmented Reality and Product Review. J. Res. Interact. Mark. 2024, 18, 182–197. [Google Scholar] [CrossRef]
  44. Li, X.; Shi, M.; Wang, X.S. Video Mining: Measuring Visual Information Using Automatic Methods. Int. J. Res. Mark. 2019, 36, 216–231. [Google Scholar]
  45. Fan, S.; Shen, Z.; Koenig, B.L.; Ng, T.; Kankanhalli, M.S. When and Why Static Images are More Effective than Videos. IEEE Trans. Affect. Comput. 2023, 14, 308–320. [Google Scholar]
  46. Noldus, L.P.J.J.; Spink, A.J.; Tegelenbosch, R.A.J. Computerised Video Tracking, Movement Analysis and Behaviour Recognition in Insects. Comput. Electron. Agric. 2002, 35, 201–227. [Google Scholar] [CrossRef]
  47. Lu, S.; Yu, M.; Wang, H. What Matters for Short Videos’ User Engagement: A Multiblock Model with Variable Screening. Expert Syst. Appl. 2023, 218, 119542. [Google Scholar] [CrossRef]
  48. Rav-Acha, A.; Pritch, Y.; Peleg, S. Making a Long Video Short: Dynamic Video Synopsis. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006. [Google Scholar]
  49. Stuppy, A.; Landwehr, J.R.; Mcgraw, A.P. The Art of Slowness: Slow Motion Enhances Consumer Evaluations by Increasing Processing Fluency. J. Mark. Res. 2024, 61, 185–203. [Google Scholar]
  50. Yan, Y.; He, Y.; Li, L. Why Time Flies? The Role of Immersion in Short Video Usage Behavior. Front. Psychol. 2023, 14, 1127210. [Google Scholar]
  51. Wöllner, C.; Hammerschmidt, D.; Albrecht, H.; Jäncke, L. Slow Motion in Films and Video Clips: Music Influences Perceived Duration and Emotion, Autonomic Physiological Activation and Pupillary Responses. PLoS ONE 2018, 13, e0199161. [Google Scholar]
  52. Yin, Y.; Jia, J.S.; Zheng, W. The Effect of Slow Motion Video on Consumer Inference. J. Mark. Res. 2021, 58, 1007–1024. [Google Scholar]
  53. Wickens, C.D. Multiple Resources and Performance Prediction. Theor. Issues Ergon. Sci. 2002, 3, 159–177. [Google Scholar] [CrossRef]
  54. Wickens, C.D. Multiple Resources and Mental Workload. Hum. Factors 2008, 50, 449–455. [Google Scholar]
  55. Mayer, R.E.; Moreno, R. Nine Ways to Reduce Cognitive Load in Multimedia Learning. Educ. Psychol. 2003, 38, 43–52. [Google Scholar]
  56. Montero Perez, M.; Peters, E.; Desmet, P. Vocabulary Learning through Viewing Video: The Effect of Two Enhancement Techniques. Comput. Assist. Lang. Learn. 2018, 31, 1–26. [Google Scholar]
  57. Baumgartner, S.E.; Wiradhany, W.; Shackleford, K. Not All Media Multitasking is the Same: The Frequency of Media Multitasking Depends on Cognitive and Affective Characteristics of Media Combinations. Psychol. Pop. Media 2022, 11, 1–12. [Google Scholar]
  58. Biswas, D.; Szocs, C. The Smell of Healthy Choices: Cross-Modal Sensory Compensation Effects of Ambient Scent on Food Purchases. J. Mark. Res. 2019, 56, 123–141. [Google Scholar]
  59. Unnava, H.R.; Burnkrant, R.E.; Erevelles, S. Effects of Presentation Order and Communication Modality on Recall and Attitude. J. Consum. Res. 1994, 21, 481–490. [Google Scholar]
  60. Vroomen, J.; de Gelder, B. Sound Enhances Visual Perception: Cross-Modal Effects of Auditory Organization on Vision. J. Exp. Psychol. Hum. Percept. Perform. 2000, 26, 1583–1590. [Google Scholar] [CrossRef]
  61. Holiday, S.; Hayes, J.L.; Park, H.; Lyu, Y.; Zhou, Y. A Multimodal Emotion Perspective on Social Media Influencer Marketing: The Effectiveness of Influencer Emotions, Network Size, and Branding on Consumer Brand Engagement Using Facial Expression and Linguistic Analysis. J. Interact. Mark. 2023, 58, 414–439. [Google Scholar] [CrossRef]
  62. Janiszewski, C.; Meyvis, T. Effects of Brand Logo Complexity, Repetition, and Spacing on Processing Fluency and Judgment. J. Consum. Res. 2001, 28, 18–32. [Google Scholar] [CrossRef]
  63. Zhang, J.; Li, X.; Zhang, J.; Wang, L. Effect of Linguistic Disfluency on Consumer Satisfaction: Evidence from an Online Knowledge Payment Platform. Inf. Manag. 2023, 60, 103725. [Google Scholar]
  64. Frau, M.; Cabiddu, F.; Frigau, L.; Tomczyk, P.; Mola, F. How Emotions Impact the Interactive Value Formation Process During Problematic Social Media Interactions. J. Res. Interact. Mark. 2023, 17, 773–793. [Google Scholar] [CrossRef]
  65. Dhar, R.; Gorlin, M. A Dual-System Framework to Understand Preference Construction Processes in Choice. J. Consum. Psychol. 2013, 23, 528–542. [Google Scholar] [CrossRef]
  66. Douce, L.; Willems, K.; Chaudhuri, A. Bargain Effectiveness in Differentiated Store Environments: The Role of Store Affect, Processing Fluency, and Store Familiarity. J. Retail. Consum. Serv. 2022, 69, 103085. [Google Scholar] [CrossRef]
  67. Reyna, V.F. How People Make Decisions that Involve Risk: A Dual-Processes Approach. Curr. Dir. Psychol. Sci. J. Am. Psychol. Soc. 2004, 13, 60–66. [Google Scholar] [CrossRef]
  68. Mehrabian, A.; Russell, J.A. The Basic Emotional Impact of Environments. Percept. Mot. Skills 1974, 38, 283–301. [Google Scholar] [CrossRef] [PubMed]
  69. Neiss, R. Reconceptualizing Arousal: Psychobiological States in Motor Performance. Psychol. Bull. 1988, 103, 345–366. [Google Scholar] [CrossRef]
  70. Al-Qershi, O.M.; Kwon, J.; Zhao, S.; Li, Z. Predicting Crowdfunding Success with Visuals and Speech in Video Ads and Text Ads. Eur. J. Mark. 2022, 56, 1610–1649. [Google Scholar] [CrossRef]
  71. Wedel, M.; Pieters, R.; Liechty, J.; Rogers, W.A. Attention Switching During Scene Perception: How Goals Influence the Time Course of Eye Movements Across Advertisements. J. Exp. Psychol. Appl. 2008, 14, 129–138. [Google Scholar] [CrossRef]
  72. Waldner, M.; Le Muzic, M.; Bernhard, M.; Purgathofer, W.; Viola, I. Attractive Flicker—Guiding Attention in Dynamic Narrative Visualizations. IEEE Trans. Vis. Comput. Graph. 2014, 20, 2456–2465. [Google Scholar] [CrossRef]
  73. Paas, F.; van Gog, T.; Sweller, J. Cognitive Load Theory: New Conceptualizations, Specifications, and Integrated Research Perspectives. Educ. Psychol. Rev. 2010, 22, 115–121. [Google Scholar] [CrossRef]
  74. Valtchanov, D.; Ellard, C.G. Cognitive and Affective Responses to Natural Scenes: Effects of Low Level Visual Properties on Preference, Cognitive Load and Eye-Movements. J. Environ. Psychol. 2015, 43, 184–195. [Google Scholar] [CrossRef]
  75. Wang, Q.; Yang, S.; Liu, M.; Cao, Z.; Ma, Q. An Eye-Tracking Study of Website Complexity from Cognitive Load Perspective. Decis. Support Syst. 2014, 62, 1–10. [Google Scholar]
  76. Lloyd, D. In Touch with the Future: The Sense of Touch from Cognitive Neuroscience to Virtual Reality. Presence 2014, 23, 226–227. [Google Scholar]
  77. Spence, C.; Squire, S. Multisensory Integration: Maintaining the Perception of Synchrony. Curr. Biol. 2003, 13, R519–R521. [Google Scholar] [PubMed]
  78. Cacioppo, J.T.; Petty, R.E.; Chuan, F.K.; Rodriguez, R. Central and Peripheral Routes to Persuasion: An Individual Difference Perspective. J. Pers. Soc. Psychol. 1986, 51, 1032–1043. [Google Scholar]
  79. Su, L.; Cheng, J.; Swanson, S.R. The Impact of Tourism Activity Type on Emotion and Storytelling: The Moderating Roles of Travel Companion Presence and Relative Ability. Tour. Manag. 2020, 81, 104138. [Google Scholar]
  80. Joseph, J.; Gaba, V. Organizational Structure, Information Processing, and Decision-Making: A Retrospective and Road Map for Research. Acad. Manag. Ann. 2020, 14, 267–302. [Google Scholar] [CrossRef]
  81. Takahashi, N.; Mitsufuji, Y. Multi-Scale Multi-Band Densenets for Audio Source Separation. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 15–18 October 2017. [Google Scholar]
  82. Solovyev, R.; Stempkovskiy, A.; Habruseva, T. Benchmarks and Leaderboards for Sound Demixing Tasks. arXiv 2023, arXiv:2305.07489. [Google Scholar]
  83. Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv 2021, arXiv:2106.04624. [Google Scholar]
  84. de Vries, L.; Gensler, S.; Leeflang, P.S.H. Popularity of Brand Posts on Brand Fan Pages: An Investigation of the Effects of Social Media Marketing. J. Interact. Mark. 2012, 26, 83–91. [Google Scholar]
  85. Lee, D.; Hosanagar, K.; Nair, H.S. Advertising Content and Consumer Engagement on Social Media: Evidence from Facebook. Manag. Sci. 2018, 64, 5105–5131. [Google Scholar]
  86. Barnes, S.J. Smooth Talking and Fast Music: Understanding the Importance of Voice and Music in Travel and Tourism Ads Via Acoustic Analytics. J. Travel Res. 2023, 63, 1070–1085. [Google Scholar]
  87. De Veirman, M.; Cauberghe, V.; Hudders, L. Marketing through Instagram Influencers: The Impact of Number of Followers and Product Divergence on Brand Attitude. Int. J. Advert. 2017, 36, 798–828. [Google Scholar]
  88. Jin, S.V.; Muqaddam, A.; Ryu, E. Instafamous and Social Media Influencer Marketing. Mark. Intell. Plan. 2019, 37, 567–579. [Google Scholar] [CrossRef]
  89. Simonsohn, U. Two Lines: A Valid Alternative to the Invalid Testing of U-Shaped Relationships with Quadratic Regressions. Adv. Methods Pract. Psych. Sci. 2018, 1, 538–555. [Google Scholar]
  90. Haans, R.F.J.; Pieters, C.; He, Z. Thinking about U: Theorizing and Testing U- and Inverted U-Shaped Relationships in Strategy Research. Strateg. Manag. J. 2016, 37, 1177–1195. [Google Scholar]
  91. Yang, Q.; Li, H.; Lin, Y.; Jiang, Y.; Huo, J. Fostering Consumer Engagement with Marketer-Generated Content: The Role of Content-Generating Devices and Content Features. Internet Res. 2022, 32, 307–329. [Google Scholar]
  92. Wang, C.L. Editorial—What is an Interactive Marketing Perspective and What are Emerging Research Areas? J. Res. Interact. Mark. 2024, 18, 161–165. [Google Scholar]
  93. Wang, C.L. Editorial: Demonstrating Contributions through Storytelling. J. Res. Interact. Mark. 2025, 19, 1–4. [Google Scholar]
  94. Zhang, X.; Zhao, Z.; Wang, K. The Effects of Live Comments and Advertisements on Social Media Engagement: Application to Short-Form Online Video. J. Res. Interact. Mark. 2024, 18, 485–505. [Google Scholar]
  95. Wang, X.; Lai, I.K.W.; Lu, Y.; Liu, X. Narrative or Non-Narrative? The Effects of Short Video Content Structure on Mental Simulation and Resort Brand Attitude. J. Hosp. Market. Manag. 2023, 32, 593–614. [Google Scholar] [CrossRef]
  96. Yin, J.; Li, T.; Ni, Y.; Cui, Y. The Power of Appeal: Do Good Looks and Talents of Vloggers in Tourism Short Videos Matter Online Customer Citizenship Behavior? Asia Pac. J. Tour. Res. 2024, 29, 1555–1572. [Google Scholar] [CrossRef]
  97. Deng, D.S.; Seo, S.; Li, Z.; Austin, E.W. What People Tiktok (Douyin) About Influencer-Endorsed Short Videos on Wine? An Exploration of Gender and Generational Differences. J. Hosp. Tour. Technol. 2022, 13, 683–698. [Google Scholar] [CrossRef]
  98. Lin, T.M.; Lu, K.Y.; Wu, J.J. The Effects of Visual Information in Ewom Communication. J. Res. Interact. Mark. 2012, 6, 7–26. [Google Scholar] [CrossRef]
  99. Nordhielm, C.L. The Influence of Level of Processing on Advertising Repetition Effects. J. Consum. Res. 2002, 29, 371–382. [Google Scholar] [CrossRef]
  100. Balducci, B.; Marinova, D. Unstructured Data in Marketing. J. Acad. Mark. Sci. 2018, 46, 557–590. [Google Scholar] [CrossRef]
  101. Ballouli, K.; Heere, B. Sonic Branding in Sport: A Model for Communicating Brand Identity through Musical Fit. Sport Manag. Rev. 2015, 18, 321–330. [Google Scholar] [CrossRef]
  102. Grewal, R.; Gupta, S.; Hamilton, R. Marketing Insights from Multimedia Data: Text, Image, Audio, and Video. J. Mark. Res. 2021, 58, 1025–1033. [Google Scholar] [CrossRef]
  103. Huang, Z.; Zhu, Y.; Hao, A.; Deng, J. How Social Presence Influences Consumer Purchase Intention in Live Video Commerce: The Mediating Role of Immersive Experience and the Moderating Role of Positive Emotions. J. Interact. Mark. 2023, 17, 493–509. [Google Scholar] [CrossRef]
  104. Ngai, E.W.T.; Wu, Y. Machine Learning in Marketing: A Literature Review, Conceptual Framework, and Research Agenda. J. Bus. Res. 2022, 145, 35–48. [Google Scholar] [CrossRef]
Figure 1. Conceptual research model.
Figure 2. Sound spectrum generated by Praat 6.2 software.
Figure 3. Three-dimensional interaction: Auditory emotional arousal, visual variation, and consumer engagement.
Table 1. Descriptive statistics.

Variable                      Mean        SD        Min.      Max.
Like                          601.99      1727.35   0         104,378
Share                         343.84      1047.45   0         12,234
Bookmarks                     175.96      316.66    0         8582
Comment                       237.51      692.82    0         11,563
Emotional arousal             0.52        0.16      0.18      0.89
Visual variation              7.49        3.52      1         34.40
Speech rate                   4.97        3.02      1.74      8.75
Pitch                         341.72      62.77     124.52    878.92
Loudness                      55.62       20.72     31.52     119.78
Emotional valence             0.59        0.14      0.18      0.78
Number of followers           532,981.32  9389.95   11,152    1,382,657
Functional words ratio        0.22        0.81      0.04      0.31
Cognitive words ratio         0.15        0.22      0.07      0.38
Emotional words ratio         0.13        0.12      0.04      0.35
Social process words ratio    0.17        0.14      0.08      0.25
Readability                   0.71        0.25      0.24      0.92
Music rhythm                  110.85      25.52     65.82     179.23
Image quality                 0.87        0.15      0.63      0.92
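Each row of Table 1 is a standard per-variable summary (mean, SD, minimum, maximum) over the 12,842 videos. A minimal sketch of that computation, using invented toy values rather than the study's data:

```python
# Toy sketch of the Table 1 summaries (mean, SD, min, max) per variable.
# The sample values are invented; the study computes these over 12,842 videos.
from statistics import mean, stdev

videos = {
    "Like":  [12, 540, 98, 2200, 7],
    "Share": [3, 210, 45, 880, 0],
}

for name, values in videos.items():
    print(f"{name}: mean={mean(values):.2f}  SD={stdev(values):.2f}  "
          f"min={min(values)}  max={max(values)}")
```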
Table 2. Effects of auditory emotional arousal and visual variation on consumer engagement.

Variables                               Model 1    Model 2    Model 3    Model 4    Model 5
                                        β          β          β          β          β
Emotional arousal                                  0.24 ***              0.21 ***   0.22 ***
Emotional arousal²                                 –0.12 **              –0.10 **   –0.12 **
Visual variation                                              0.27 ***   0.25 ***   0.21 ***
Visual variation²                                             0.002
Emotional arousal × Visual variation                                     0.18 ***
Matching degree                                                                     0.22 ***
Controlled variables
Number of influencer followers          0.12 **    0.10 **    0.11 **    0.09 **    0.08 *
Influencer gender                       –0.05      –0.04      –0.03      –0.03      –0.02
Speech rate                             0.08       0.07       0.06       0.06       0.05
Pitch                                   0.03       0.02       0.03       0.02       0.01
Loudness                                0.07       0.06       0.07       0.05       0.04
Emotional valence                       0.10       0.09       0.07       0.08       0.07
Functional words ratio                  0.04       0.03       0.02       0.02       0.02
Cognitive words ratio                   0.02       0.01       0.05       0.01       0.01
Emotional words ratio                   0.06       0.05       0.02       0.04       0.03
Social process words ratio              0.03       0.02       0.01       0.02       0.01
Readability                             0.05       0.04       0.04       0.03       0.03
Music rhythm                            0.09       0.08       0.06       0.07       0.06
Image quality                           0.07       0.06       0.06       0.06       0.05
R²                                      0.18 **    0.21 **    0.21 **    0.25 **    0.24 **
Max VIF                                 1.6        3.5        2.7        3.5        3.4
AIC                                     2245.8     2132.6     2153.4     2024.2     2085.4
N                                       12,842     12,842     12,842     12,842     12,842
Note: * p < 0.05, ** p < 0.01, *** p < 0.001. The term 'Emotional arousal × Visual variation' denotes the interaction of emotional arousal and visual variation.
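The Model 4 specification behind Table 2 combines a quadratic arousal term (testing the inverted U) with an arousal-by-visual-variation interaction. A sketch of that fit as ordinary least squares follows; every value here is synthetic and chosen only to mirror the reported signs, not to reproduce the paper's Douyin sample.

```python
# Illustrative OLS fit of the Model 4 specification:
# engagement ~ arousal + arousal^2 + visual + arousal*visual.
# Synthetic data; coefficient values are assumptions mirroring Table 2's signs.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
arousal = rng.uniform(0.18, 0.89, n)   # auditory emotional arousal (Table 1 range)
visual = rng.uniform(1.0, 34.4, n)     # visual variation (Table 1 range)

# Data-generating process: negative quadratic arousal term (inverted U),
# positive visual-variation term, positive interaction.
engagement = (0.22 * arousal - 0.12 * arousal**2
              + 0.21 * visual + 0.18 * arousal * visual
              + rng.normal(0.0, 0.05, n))

# Least squares on the design matrix [1, a, a^2, v, a*v].
X = np.column_stack([np.ones(n), arousal, arousal**2, visual, arousal * visual])
beta, *_ = np.linalg.lstsq(X, engagement, rcond=None)
print(beta)  # beta[2] < 0 (inverted U); beta[3] and beta[4] > 0
```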
Table 3. Results of robustness checks.

                                        Robustness of Data                 Robustness of the Dependent Variable
                                        Model 6    Model 7    Model 8     Model 9    Model 10   Model 11
Emotional arousal                       0.13 ***   0.11 ***   0.11 ***    0.27 ***   0.21 ***   0.24 ***
Emotional arousal²                      –0.08 ***  –0.05 ***  –0.07 ***   –0.14 **   –0.11 **   –0.13 **
Visual variation                                   0.12 ***                          0.21 ***
Emotional arousal × Visual variation               0.07 ***                          0.15 **
Matching degree                                               0.15 ***                          0.25 ***
Controls                                Y          Y          Y           Y          Y          Y
AIC                                     5428.2     5246.8     5275.2      1782.6     1525.5     1522.4
Notes: ** p < 0.01, *** p < 0.001. The term 'Emotional arousal × Visual variation' denotes the interaction of emotional arousal and visual variation.
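The 'matching degree' term in Models 5, 8, and 11 captures audiovisual congruence between arousal and visual variation. One common operationalization (an assumed illustration, not necessarily the paper's exact formula) is the negative absolute gap between the two standardized features:

```python
# Hypothetical congruence score: standardize both features, then take the
# negative absolute difference (0 = perfect match, more negative = mismatch).
# The four sample videos below are invented.
import statistics as st

arousal = [0.3, 0.5, 0.7, 0.6]   # per-video auditory emotional arousal
visual = [5.0, 12.0, 20.0, 9.0]  # per-video visual variation

def zscores(xs):
    m, s = st.mean(xs), st.stdev(xs)
    return [(x - m) / s for x in xs]

matching = [-abs(a - v) for a, v in zip(zscores(arousal), zscores(visual))]
print(matching)
```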
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Q.; Wang, Y.; Wang, Q.; Jiang, Y.; Li, J. Harmonizing Sight and Sound: The Impact of Auditory Emotional Arousal, Visual Variation, and Their Congruence on Consumer Engagement in Short Video Marketing. J. Theor. Appl. Electron. Commer. Res. 2025, 20, 69. https://doi.org/10.3390/jtaer20020069

