Article

Advanced Multimodal Sentiment Analysis with Enhanced Contextual Fusion and Robustness (AMSA-ECFR): Symmetry in Feature Integration and Data Alignment

1 School of Psychology, Jiangxi Normal University, Nanchang 330022, China
2 School of Science, East China Jiaotong University, Nanchang 330013, China
3 School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325035, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(7), 934; https://doi.org/10.3390/sym16070934
Submission received: 26 May 2024 / Revised: 10 July 2024 / Accepted: 11 July 2024 / Published: 22 July 2024

Abstract

Multimodal sentiment analysis, a significant challenge in artificial intelligence, necessitates the integration of various data modalities for accurate human emotion interpretation. This study introduces the Advanced Multimodal Sentiment Analysis with Enhanced Contextual Fusion and Robustness (AMSA-ECFR) framework, addressing the critical challenge of data sparsity in multimodal sentiment analysis. The main components of the proposed approach include a Transformer-based model employing BERT for deep semantic analysis of textual data, coupled with a Long Short-Term Memory (LSTM) network for encoding temporal acoustic features. Innovations in AMSA-ECFR encompass advanced feature encoding for temporal dynamics and an adaptive attention-based model for efficient cross-modal integration, achieving symmetry in the fusion and alignment of asynchronous multimodal data streams. Additionally, the framework employs generative models for intelligent approximation of missing features. It ensures robust alignment of high-level features with the multimodal data context, effectively tackling issues of incomplete or noisy inputs. In simulation studies, the AMSA-ECFR model demonstrated superior performance against existing approaches, achieving 10% higher accuracy and 15% lower mean absolute error than the current best multimodal sentiment analysis frameworks. The symmetrical approach to feature integration and data alignment contributed significantly to the model's robustness and precision.

1. Introduction

In the rapidly growing field of affective computing, Multimodal Sentiment Analysis (MSA) has emerged as a critical tool for deciphering complex human emotions and opinions from digital content [1]. Fundamentally, MSA aims to synergize heterogeneous data sources [2] and modalities, in the form of text, audio, and video [3], to elicit holistic sentiment from user-generated content, primarily videos. This is an interdisciplinary effort at the junction of computer vision [4], natural language processing [5], and audio signal processing. Its potential applications therefore include marketing analytics, social media monitoring, human–computer interaction, and psychological studies. With the exponential growth of digital content generation and consumption, driven most recently by social media platforms and video-sharing websites, the importance and relevance of MSA methodologies continue to grow [6]. Deciphering sentiments from multimodal content automatically and accurately enhances user experience and content personalization while offering valuable insight into human emotional dynamics in the digital age [6]. Multimodal sentiment analysis is vital to understanding human emotions because it combines textual, auditory, and visual data; leveraging the strengths of each modality leads to a more complete understanding of sentiment.
The field of MSA confronts several intrinsic challenges that impede its full realization in practical applications [7]. Central to these challenges is the complexity inherent in processing and integrating data from disparate modalities [8], each with unique characteristics and informational cues. Textual data, for instance, demand sophisticated semantic analysis, while audio and visual data require processing temporal and spatial patterns [9], respectively. Furthermore, the issue of data alignment surfaces prominently; modalities often do not align perfectly in time, and extracting coherent and synchronized multimodal features is a non-trivial task [10]. This misalignment, coupled with frequent missing or incomplete data from one or more modalities due to various real-world constraints [11], further complicates the analytical process. These hurdles pose significant technical obstacles and can lead to suboptimal sentiment analysis outcomes [12], where the nuances and subtleties of human emotions might be misrepresented or overlooked [13].

1.1. Problem Statement and Motivation

Recent advances in MSA have considerably enhanced the power of affective computing, especially in handling the sentiments carried by user-generated online videos. This progress has mainly come through the incorporation of heterogeneous modalities, such as text, audio, and visual data, giving well-rounded views of human emotions and opinions. However, several key challenges remain in developing a robust and efficient MSA system. The most prominent is the efficient fusion of unaligned multimodal data [14]. In real-world scenarios, different modalities are often inherently asynchronous [15,16,17], complicating accurate sentiment analysis. The problem addressed in this work is therefore twofold: efficiently fusing unaligned multimodal data and dealing robustly with missing modality features. While adept in certain respects, traditional approaches fall short in efficiently amalgamating asynchronous multimodal inputs and ensuring consistent performance in the face of incomplete data.

1.2. Proposed Approach Overview

This research proposes a framework for advanced multimodal sentiment analysis and develops an objective function to measure sentiment within that framework. We present this novel method to address the vulnerabilities associated with the intrinsic problems of MSA, including data sparseness, misalignment, and incompleteness across the textual, auditory, and visual modalities. Our proposed approach, termed Advanced Multimodal Sentiment Analysis with Enhanced Contextual Fusion and Robustness (AMSA-ECFR), introduces an innovative framework illustrated in Figure 1. The AMSA-ECFR framework enables efficient multimodal data fusion while remaining robust against incomplete data scenarios. AMSA-ECFR makes it possible to perform sentiment analysis with much higher accuracy and reliability across various domains, from social media analytics to customer service and mental health monitoring. These capabilities rest on its ability to process and fuse unaligned [18,19,20,21,22] and incomplete multimodal data [23], which supports a broader understanding of sentiments and opens new doors for applications in which data of this kind are prevalent [24]. Existing research in this field uses fusion techniques that are relatively simple or relies on basic imputation methodologies for missing data, which may not capture the tightly integrated, complex interrelations between the modalities [25,26,27]. The novel contributions of AMSA-ECFR are threefold:
  • The proposed model includes advanced audio and video feature encoding that supports detailed analysis of temporal sequences and addresses the challenges of unaligned multimodal data.
  • An adaptive attention-based model for cross-modal interactions dynamically adjusts the relevance of different modalities during integration, so that information fusion is efficient and the most meaningful signals are emphasized.
  • The AMSA-ECFR framework intelligently approximates missing features in incomplete-data settings, considerably increasing the system's robustness.

1.3. Structure of the Article

The rest of the article is organized as follows: Section 2 discusses the existing state-of-the-art approaches and compares their contributions and limitations. Section 3 describes the proposed AMSA-ECFR approach, detailing the innovative architecture designed to enhance resilience against incomplete data. Section 4, Results and Analysis, rigorously evaluates the proposed framework, contrasting its performance with existing methods across standard datasets. Section 5 critically analyzes current multimodal sentiment analysis techniques and identifies the need for more robust frameworks. This section also explores application scenarios, demonstrating the AMSA-ECFR framework’s adaptability to real-world applications and diverse data conditions. Finally, Section 6 concludes the article.

2. Theoretical Background

In the evolving domain of MSA, numerous studies have addressed the complexities of integrating and analyzing data from diverse modalities. This literature review critically examines recent contributions to the field, focusing on the approaches adopted, the problems addressed, the major contributions, and the inherent limitations of each study. To provide a comprehensive overview of current advancements and challenges in MSA, we conducted a systematic comparative analysis of key studies, presented in Table 1, which contrasts the various approaches and highlights their principal contributions and limitations. This analysis exposes the strengths and weaknesses of existing methods, offering insights that inform the development of the AMSA-ECFR framework.
Zhu et al. [28] proposed a novel interaction network that effectively fuses image and text data for sentiment analysis. The major contribution lies in developing an interaction mechanism between visual and textual modalities, offering a more comprehensive understanding of sentiment. However, the limitation of this approach is its reliance on image–text pairs, potentially reducing its applicability in scenarios where one of the modalities is missing or incomplete. Yadav and Vishwakarma [3] presented a deep learning framework with multiple attention layers to analyze multimodal data, addressing the challenge of effectively capturing inter-modal dynamics. The multi-level attention mechanism significantly contributes to the model’s sensitivity to relevant features across modalities. A limitation, however, is the potential computational complexity associated with deep multi-level networks, especially in large-scale applications.
In [29], Ghorbanali et al. proposed an ensemble method combined with transfer learning to enhance sentiment analysis accuracy. The novel aspect is incorporating weighted CNNs to capture modality-specific features better. However, the ensemble approach might introduce additional complexity, especially in integrating and tuning multiple models. Chen et al. [30] propose a model focusing on the relevance of information across modalities. Their major contribution lies in developing a relevance-based fusion mechanism, which ensures that only pertinent information from each modality is considered. A limitation of this approach could be handling scenarios where the relevance is not clearly defined or dynamically changing. Xue et al. [31] presented an approach that emphasizes using attention maps at multiple levels to enhance feature extraction from multimodal data. The significant contribution is the detailed attention mechanism that allows for a nuanced understanding of sentiment indicators across modalities. The model’s complexity could pose challenges regarding computational resources and scalability.
The study by Zhu et al. [32] integrates sentiment-specific knowledge into the fusion process, enhancing the model's ability to interpret sentiments accurately. The innovation lies in the knowledge-enhanced mechanism, providing a depth of analysis that purely data-driven models may lack. However, the model's performance depends heavily on the quality and relevance of the integrated sentiment knowledge. In 2022, Salur and Aydın [33] explored combining multiple models through a soft voting mechanism, aiming to leverage the strengths of individual models. The contribution lies in demonstrating the efficacy of ensemble techniques in MSA. A potential limitation is the increased complexity of managing and harmonizing multiple models, especially in terms of training and inference time. Kumar et al. [34] focus on speech signals, examining how vocal features influence sentiment interpretation. The study demonstrates the potential of speech-based features in MSA. However, its limitation lies in the narrow focus on speech, potentially overlooking the comprehensive insights that other modalities such as text and visual data can provide.

3. Proposed AMSA-ECFR Approach

A key innovation in our approach is the advanced feature encoding for temporal dynamics, which maintains the symmetry of temporal information across modalities. The adaptive attention-based model also facilitates efficient cross-modal integration, preserving the symmetrical alignment of asynchronous data streams. The AMSA-ECFR approach, depicted in Figure 2, is devised to address the complexities associated with the fusion of multimodal data in sentiment analysis.

3.1. Advanced Feature Encoding

In the AMSA-ECFR approach, encoding multimodal features is the framework’s cornerstone, facilitating nuanced sentiment comprehension across various modalities. This section illustrates the encoding mechanisms tailored for each modality, where the feature extraction process is mathematically formalized. In the proposed approach, textual data are processed using a Transformer-based model, adept at capturing language’s semantic subtleties and context. The encoding can be represented by Equation (1) below:
$E_t = \mathrm{BERT}(X_t)$
where $E_t$ symbolizes the encoded textual features and $X_t$ the input text sequence. For the auditory and visual data, AMSA-ECFR employs Long Short-Term Memory (LSTM) networks to encode the temporal acoustic and visual features [28]; the visual encoding is represented by Equation (2) below:
$E_v = \mathrm{LSTM}(X_v)$
where $E_v$ is the encoded visual feature vector and $X_v$ the input video data. In the post-encoding process, the feature vectors from each modality are subjected to a series of transformations and fusions. Initially, they are pooled using a mechanism, represented by Equation (3) below, that preserves and accentuates critical information:
$J_m = \mathrm{POOL}(E_m), \quad m \in \{t, a, v\}$
where $J_m$ signifies the pooled features for modality $m$. Subsequently, the pooled features are fused through a Transformer-based framework, which incorporates Mutual Promotion Units (MPUs) to facilitate cross-modal interaction, represented in Equation (4) below:
$M_m^{[L]} = \mathrm{MPU}_{ga}\big(J_m^{[L-1]}\big), \quad m \in \{t, a, v\}$
where $M_m^{[L]}$ represents the fused multimodal features at layer $L$. Finally, the fused features are subjected to high-level feature attraction and low-level feature reconstruction processes, enhancing robustness against incomplete or missing data. Furthermore, in Equation (5), the low-level reconstruction loss $\mathcal{L}_{recon}$ aims to regenerate the original features from the encoded vectors, prompting the model to capture essential data characteristics.
$\mathcal{L}_{recon} = \left\| M_a - \hat{M}_a \right\|_2^2$
where $\hat{M}_a$ denotes the reconstructed features, approximated from the encoded auditory vector $M_a$. Moreover, the high-level feature attraction is designed to align the encoded representations from incomplete and complete views, ensuring consistency and robustness:
$\mathcal{L}_{attr} = 1 - \frac{\langle g_{inc}, \, g_{comp} \rangle}{\lVert g_{inc} \rVert \cdot \lVert g_{comp} \rVert}$
In Equation (6), $g_{inc}$ and $g_{comp}$ are the global representations from the incomplete and complete views, respectively. The angle brackets $\langle \cdot , \cdot \rangle$ denote the inner product between two vectors, and $\lVert \cdot \rVert$ represents the vector norm.
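For illustration, a minimal PyTorch sketch of the encoders in Equations (1)–(3) is given below; the backbone choice, the feature dimensions, and the temporal mean pooling used for Equation (3) are assumptions for exposition rather than the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # Hugging Face Transformers backbone assumed for Eq. (1)

class ModalityEncoders(nn.Module):
    """Sketch of Eqs. (1)-(3): BERT for text, LSTMs for audio/video, then pooling."""
    def __init__(self, audio_dim=74, video_dim=35, hidden_dim=128):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")    # E_t = BERT(X_t)
        self.audio_encoder = nn.LSTM(audio_dim, hidden_dim, batch_first=True)  # E_a = LSTM(X_a)
        self.video_encoder = nn.LSTM(video_dim, hidden_dim, batch_first=True)  # E_v = LSTM(X_v)

    def forward(self, input_ids, attention_mask, x_audio, x_video):
        # Textual features: contextual token embeddings from BERT.
        e_t = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        # Temporal acoustic and visual features from the LSTMs.
        e_a, _ = self.audio_encoder(x_audio)
        e_v, _ = self.video_encoder(x_video)
        # POOL(E_m) in Eq. (3): here a simple temporal mean pool (an assumption).
        j_t, j_a, j_v = e_t.mean(dim=1), e_a.mean(dim=1), e_v.mean(dim=1)
        return j_t, j_a, j_v
```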

3.2. Dynamic Cross-Modal Interaction Model

The Dynamic Cross-Modal Interaction Model within the AMSA-ECFR framework introduces an adaptive attention mechanism that is crucial for integrating heterogeneous modalities. This mechanism employs a contextually aware strategy to dynamically prioritize and integrate features from the textual, auditory, and visual data streams. The model assumes that not all features are equally important to sentiment analysis and that their relative importance varies with context. Table S1 lists the steps of the algorithm, which synthesizes multimodal information dynamically, guided by contextual relevance.
The algorithm in Table S1 begins by initializing the weight and bias matrices used to project modality-specific feature embeddings into queries, keys, and values. Another highlight of the algorithm is the gating mechanism, which regulates how much of the enriched features is passed on and thus how they affect the final representation. This gating is essential for balancing feature retention and suppression so that only the most prominent features propagate through the network for sentiment analysis. The adaptive attention mechanism of the proposed algorithm is shown in Equation (7).
$A = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right) V$
where $A$ represents the attention weights; $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, derived from the modality-specific feature embeddings; and $\sqrt{d_k}$ is the scaling factor derived from the dimensionality $d_k$ of the keys.
The attention weights are then computed and used for the dynamic adjustment of each modality's contribution. For a set of modality feature embeddings $E_t$, $E_a$, $E_v$, the attention mechanism computes a set of queries $Q_m$, keys $K_m$, and values $V_m$ for each modality $m$. The cross-modal interactions are then modeled as Equation (8) below:
$C_m = \sum_{n \neq m} A_{m,n} V_n$
where $C_m$ is the contextually enriched feature set for modality $m$, and $A_{m,n}$ are the attention weights signifying the importance of features from modality $n$ for modality $m$. The dynamic adjustment of the integration is mathematically encapsulated by the following Equations (9)–(15):
$Q_m = W_q E_m$
Here, $Q_m$ represents the query matrix for modality $m$, which is used to query against the keys of another modality. $W_q$ is the weight matrix that transforms the encoded features $E_m$ into the query space, where $E_m$ denotes the encoded features of modality $m$.
$K_m = W_k E_m$
In this equation, $K_m$ is the key matrix for modality $m$, designed to pair with queries to compute attention scores. $W_k$ is the weight matrix for converting the encoded features $E_m$ into the key space.
$V_m = W_v E_m$
$V_m$ denotes the value matrix for modality $m$, containing the values that will be aggregated based on the computed attention scores. $W_v$ is the weight matrix that transforms the encoded features $E_m$ into the value space.
$A_{m,n} = \frac{Q_m K_n^{T}}{\sqrt{d_k}} + B_{m,n}$
Here, $A_{m,n}$ represents the attention weights, indicating the significance of features in modality $n$ when considering a feature in modality $m$. $Q_m$ is the query matrix for modality $m$, and $K_n^{T}$ is the transpose of the key matrix for modality $n$, enabling the dot product operation with $Q_m$. $d_k$ is the dimensionality of the key vectors, used for scaling. $B_{m,n}$ is a bias matrix that introduces an additional layer of adaptability to the attention mechanism.
$C_m = A_{m,n} V_n$
$C_m$ is the context vector for modality $m$, aggregating information from modality $n$ based on the attention weights $A_{m,n}$. $V_n$ is the value matrix for modality $n$, which contains the actual data to be aggregated into the context vector. In this case, $W_q$, $W_k$, and $W_v$ are the weight matrices for queries, keys, and values, respectively, and $B_{m,n}$ is a bias matrix that adds an additional level of adaptability to the attention weights. To further enhance the model's adaptability, a gating mechanism $G$ is introduced as shown below:
$G_m = \sigma\!\left(W_g C_m + b_g\right)$
$\acute{E}_m = G_m \odot C_m$
where $\sigma$ denotes the sigmoid activation function, $\odot$ represents element-wise multiplication, $W_g$ is the gating weight matrix, and $b_g$ is the gating bias vector. This gating mechanism allows the model to control the flow of information from the contextually enriched feature set into the final integrated representation $\acute{E}_m$.
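To make the above concrete, the following minimal PyTorch sketch implements a single-head form of Equations (9)–(15); the shared sequence length across modalities, the zero-initialized learnable bias $B_{m,n}$, and the layer sizes are simplifying assumptions rather than the authors' exact design.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of Eqs. (9)-(15): projections, biased attention, and gating."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)                 # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)                 # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)                 # W_v
        self.bias = nn.Parameter(torch.zeros(seq_len, seq_len))    # B_{m,n}, assumed learnable
        self.w_g = nn.Linear(dim, dim)                             # W_g, b_g for the gate G_m
        self.scale = math.sqrt(dim)                                # sqrt(d_k)

    def forward(self, e_m, e_n):
        # e_m, e_n: (batch, seq_len, dim) embeddings of modalities m and n.
        q_m = self.w_q(e_m)                                        # Eq. (9)
        k_n = self.w_k(e_n)                                        # Eq. (10), applied to modality n
        v_n = self.w_v(e_n)                                        # Eq. (11), applied to modality n
        scores = q_m @ k_n.transpose(-2, -1) / self.scale + self.bias  # Eq. (12)
        a_mn = torch.softmax(scores, dim=-1)                       # attention weights A_{m,n}
        c_m = a_mn @ v_n                                           # Eq. (13): context vector C_m
        g_m = torch.sigmoid(self.w_g(c_m))                         # Eq. (14): gate G_m
        return g_m * c_m                                           # Eq. (15): gated representation E'_m
```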

3.3. Handling Unaligned and Incomplete Data

Data misalignment across modalities presents a significant challenge in multimodal sentiment analysis. We address this by introducing a temporal alignment function that uses a dynamic time-warping algorithm to align the temporal sequences of the different modalities. To handle incomplete data, we utilize a generative model that approximates the missing features from the observed data. The generative model's learning objective is to minimize the reconstruction error between the generated and the true missing features. To enhance the approximation accuracy, we introduce an attention mechanism that weighs the observed features so as to focus on the information most relevant for generating the missing data. The generated features are then refined by a context-aware refinement function, which iteratively updates them, leveraging the context provided by the aligned multimodal data.
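Because the text names dynamic time warping without further detail, the sketch below shows one plausible realization of the temporal alignment step: a standard DTW between two modality feature sequences using Euclidean frame distances. The function name and the distance choice are assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Classic dynamic-time-warping alignment between two feature sequences.

    seq_a: (T_a, d) array, seq_b: (T_b, d) array. Returns the list of aligned
    index pairs (i, j) recovered by backtracking the DTW cost matrix.
    """
    t_a, t_b = len(seq_a), len(seq_b)
    cost = np.full((t_a + 1, t_b + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t_a + 1):
        for j in range(1, t_b + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], t_a, t_b
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```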

3.4. Proposed Transformer Architecture

The Transformer architecture serves as the nexus of the AMSA-ECFR approach, where it synthesizes and processes the features extracted from the distinct modalities. This architecture is specifically engineered to manage the high-dimensional data derived from BERT for textual modality (Modality 1) and LSTM for both auditory (Modality 2) and visual (Modality 3) modalities. In advancing the computational architecture of the AMSA-ECFR framework, we introduce Table S2, which delineates the Transformer Architecture for Multimodal Feature Integration.
Table S2 begins with the aggregation of modality features through pooling operations, which serve to distill the most pertinent information from each data stream. Subsequent self-attention layers within the Transformer architecture meticulously evaluate the inter-modal relationships, enhancing the feature representation with contextual awareness and depth. The culmination of this process is concatenating and transforming these enriched features into a final sentiment prediction, encapsulating the essence of the multimodal sentiment analysis task. The proposed approach commences by formalizing the feature extraction for each modality, represented in Equations (16)–(25).
For Modality 1: $E_t = \mathrm{BERT}(X_t; \Theta_{\mathrm{BERT}})$
where $E_t$ denotes the textual features extracted from the input sequence $X_t$, with $\Theta_{\mathrm{BERT}}$ symbolizing the BERT model parameters.
For Modalities 2 and 3: $E_a = \mathrm{LSTM}_a(X_a; \Theta_{\mathrm{LSTM}_a}), \qquad E_v = \mathrm{LSTM}_v(X_v; \Theta_{\mathrm{LSTM}_v})$
where $E_a$ and $E_v$ represent the encoded features from the audio and video modalities, respectively. $X_a$ and $X_v$ are the corresponding input sequences, while $\Theta_{\mathrm{LSTM}_a}$ and $\Theta_{\mathrm{LSTM}_v}$ are the trainable parameters of the respective LSTM networks. Following feature extraction, a pooling layer aggregates the information to reduce dimensionality, represented in Equation (18) below:
$J_m = \mathrm{Pool}\!\left(E_m; \Theta_{\mathrm{Pool}_m}\right), \quad m \in \{t, a, v\}$
where $J_m$ is the pooled feature set for each modality $m$, and $\Theta_{\mathrm{Pool}_m}$ are the parameters of the pooling operation. The pooled features are then processed through a series of Transformer layers, each consisting of self-attention mechanisms and feed-forward networks, represented in Equations (19) and (20) below:
$T_m^{(1)} = \mathrm{SelfAttention}\!\left(J_m; \Theta_{\mathrm{SA}}^{(1)}\right)$
$T_m^{(l)} = \mathrm{TransformerLayer}\!\left(T_m^{(l-1)}; \Theta_{\mathrm{TL}}^{(l)}\right), \quad l \in \{2, \dots, L\}$
where $T_m^{(l)}$ denotes the features at the $l$-th layer for modality $m$, and $\Theta_{\mathrm{SA}}^{(1)}$ and $\Theta_{\mathrm{TL}}^{(l)}$ are the parameters of the self-attention and Transformer layers, respectively. The self-attention mechanism within each Transformer layer is defined as Equation (21) below:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$
where $Q$, $K$, and $V$ are the query, key, and value matrices computed from $T_m$, and $d_k$ is the dimensionality of the keys. The output of the final Transformer layer is then normalized and passed through a feed-forward network, represented in Equation (22) below:
$\hat{T}_m = \mathrm{LayerNorm}\!\left(T_m^{(L)}\right), \qquad M_m = \mathrm{FFN}\!\left(\hat{T}_m; \Theta_{\mathrm{FFN}}\right)$
where $\hat{T}_m$ is the normalized feature set, $\mathrm{FFN}$ is the feed-forward network, and $\Theta_{\mathrm{FFN}}$ represents the parameters of the FFN. The prediction module utilizes the integrated features from the Transformer architecture to determine the sentiment, represented in Equation (23) below:
$\hat{y} = \mathrm{PredictionModule}\!\left(M_t, M_a, M_v; \Theta_{\mathrm{PM}}\right)$
where $\hat{y}$ is the predicted sentiment score, and $\Theta_{\mathrm{PM}}$ represents the parameters of the prediction module. The prediction module comprises a concatenation of the modality-specific features followed by a dense layer for sentiment classification or regression, defined in Equations (24) and (25) below:
$g = \mathrm{Concat}\!\left(M_t, M_a, M_v\right)$
$\mathcal{L}_{task} = \mathrm{DenseLayer}\!\left(g; \Theta_{\mathrm{DL}}\right)$
where $g$ is the concatenated feature vector, $\mathcal{L}_{task}$ is the task-specific loss, and $\Theta_{\mathrm{DL}}$ are the dense layer parameters. The Transformer architecture of the AMSA-ECFR framework thus synthesizes and interprets the complex interplay of multimodal features to robustly predict the sentiment.
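A minimal sketch of the prediction stage in Equations (22)–(25) is given below; sharing one LayerNorm and FFN across modalities and using a single-output regression head are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    """Sketch of Eqs. (22)-(25): per-modality LayerNorm + FFN, then Concat + dense head."""
    def __init__(self, dim=128, ffn_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                              # T^_m = LayerNorm(T_m^(L))
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim),          # M_m = FFN(T^_m)
                                 nn.ReLU(),
                                 nn.Linear(ffn_dim, dim))
        self.head = nn.Linear(3 * dim, 1)                          # DenseLayer(g) for a sentiment score

    def forward(self, t_t, t_a, t_v):
        # t_m: (batch, dim) outputs of the final Transformer layer for each modality.
        m_t, m_a, m_v = (self.ffn(self.norm(x)) for x in (t_t, t_a, t_v))
        g = torch.cat([m_t, m_a, m_v], dim=-1)                     # g = Concat(M_t, M_a, M_v)
        return self.head(g)                                        # predicted sentiment score y^
```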

3.5. Low-Level Reconstruction and High-Level Attraction

The proposed approach addresses the challenge of reconstructing missing or corrupted features at a low level and aligning high-level features with the context of multimodal data. This dual mechanism ensures the robustness and consistency of the feature representation across different views, whether complete or incomplete.

3.5.1. Low-Level Reconstruction

The low-level reconstruction is concerned with the detailed recovery of features that are either missing or noisy. It is expressed as an optimization problem where the objective is to minimize the difference between the reconstructed and original features, as defined in Equation (26) below:
$\mathcal{L}_{recon} = \sum_{m \in \{a, t, v\}} \left\| M_m - G\!\left(M_m^{obs}; \Theta_G\right) \right\|_2^2$
Here, $M_m$ represents the original modality features, $M_m^{obs}$ is the observed part of these features, and $G$ is a generative model parameterized by $\Theta_G$ that aims to approximate the missing features. To enhance the reconstruction process, we apply a modality-specific attention mechanism, represented in Equation (27) below:
$\alpha_m^{recon} = \mathrm{softmax}\!\left(W_m^{recon} M_m^{obs} + b_m^{recon}\right)$
The attention weights $\alpha_m^{recon}$ are used to scale the observed features before they are fed into the generative model for reconstruction, as defined in Equation (28) below:
$\hat{M}_m = \alpha_m^{recon} \odot M_m^{obs}$
The generative model incorporates a deep autoencoder structure for reconstructing the missing features, represented in Equation (29) below:
$M_m^{recon} = D\!\left(E\!\left(\hat{M}_m; \Theta_E\right); \Theta_D\right)$
where $E$ and $D$ represent the encoder and decoder parts of the autoencoder, respectively, with parameters $\Theta_E$ and $\Theta_D$.
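The sketch below illustrates Equations (26)–(29) as an attention-weighted autoencoder that reconstructs a modality's features from its observed part and returns the corresponding reconstruction loss; the layer widths and the softmax over the feature dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityReconstructor(nn.Module):
    """Sketch of Eqs. (26)-(29): modality-specific attention + autoencoder reconstruction."""
    def __init__(self, dim=128, latent_dim=64):
        super().__init__()
        self.attn = nn.Linear(dim, dim)                                      # W_m^recon, b_m^recon (Eq. 27)
        self.encoder = nn.Sequential(nn.Linear(dim, latent_dim), nn.ReLU())  # E(.; Theta_E)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, dim))             # D(.; Theta_D)

    def forward(self, m_obs, m_full):
        # m_obs: observed (possibly zero-filled) features; m_full: ground-truth features.
        alpha = F.softmax(self.attn(m_obs), dim=-1)   # Eq. (27): attention weights
        m_hat = alpha * m_obs                         # Eq. (28): scaled observed features
        m_recon = self.decoder(self.encoder(m_hat))   # Eq. (29): autoencoder reconstruction
        loss_recon = F.mse_loss(m_recon, m_full)      # squared-error term of Eq. (26)
        return m_recon, loss_recon
```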

3.5.2. High-Level Attraction

The high-level attraction focuses on aligning the features across different modalities to a common representation that reflects the complete data view. This is achieved through a context-aware fusion model, defined in Equation (30) below:
$F_m^{attr} = \mathrm{ContextFusion}\!\left(M_m^{recon}, M_{complete}; \Theta_F\right)$
The context fusion model produces $F_m^{attr}$, harmonizing the reconstructed features with the complete feature set $M_{complete}$ using the learned parameters $\Theta_F$. A high-level loss function, $\mathcal{L}_{attr}$, is introduced to measure the attraction between the reconstructed and complete features, represented in Equation (31) below:
$\mathcal{L}_{attr} = \sum_{m \in \{a, t, v\}} \left(1 - \frac{\left\langle F_m^{attr}, \, M_{complete} \right\rangle}{\left\| F_m^{attr} \right\|_2 \cdot \left\| M_{complete} \right\|_2}\right)$
- $\mathcal{L}_{attr}$: the attribute (attraction) loss, quantifying the discrepancy or alignment between the feature representations and a target or complete representation.
- $\sum_{m \in \{a, t, v\}}$: summation over the modalities involved in the analysis, where $m$ can be auditory (a), textual (t), or visual (v). This ensures that the attraction loss is computed across all relevant modalities.
- $F_m^{attr}$: the attribute features extracted from modality $m$. These features are intended to capture specific relevant characteristics or attributes across different modalities.
- $M_{complete}$: the complete or target representation against which the extracted attribute features are aligned.
- $\dfrac{\left\langle F_m^{attr}, \, M_{complete} \right\rangle}{\left\| F_m^{attr} \right\|_2 \cdot \left\| M_{complete} \right\|_2}$: the normalized inner product (cosine similarity) between the attribute features from modality $m$ and the complete representation, measuring the degree of alignment or similarity between the two representations.
This loss function encourages the model to align the reconstructed features with the high-level context of the complete view. The overall objective function combines both low-level and high-level considerations, as defined in Equation (32) below:
$\Theta^{*} = \arg\min_{\Theta_G, \Theta_E, \Theta_D, \Theta_F} \left( \lambda_1 \mathcal{L}_{recon} + \lambda_2 \mathcal{L}_{attr} \right)$
where $\lambda_1$ and $\lambda_2$ are regularization parameters that balance the contributions of the low-level reconstruction and high-level attraction losses.
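Under the definitions above, Equation (31) is a summed one-minus-cosine-similarity term and Equation (32) its weighted combination with the reconstruction loss. The short sketch below illustrates this computation; the choice $\lambda_1 = \lambda_2 = 1$ and the tensor shapes are arbitrary stand-ins.

```python
import torch
import torch.nn.functional as F

def attraction_loss(fused_attr, m_complete):
    """Eq. (31): sum over modalities of 1 - cosine similarity with the complete view."""
    return sum(1.0 - F.cosine_similarity(f, m_complete, dim=-1).mean()
               for f in fused_attr.values())

def total_objective(loss_recon, loss_attr, lam1=1.0, lam2=1.0):
    """Eq. (32): weighted combination of low-level reconstruction and high-level attraction."""
    return lam1 * loss_recon + lam2 * loss_attr

# Usage sketch: fused_attr maps modality name -> F_m^attr tensors of shape (batch, dim).
fused_attr = {m: torch.randn(8, 128) for m in ("t", "a", "v")}
m_complete = torch.randn(8, 128)
loss = total_objective(torch.tensor(0.1), attraction_loss(fused_attr, m_complete))
```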

3.6. Robustness Enhancement Strategies

The AMSA-ECFR framework integrates advanced strategies for enhancing robustness, using generative modeling for missing data and adaptive learning mechanisms. These strategies are intended to equip the system to handle the intrinsic uncertainties and variabilities within multimodal datasets. Table S3 presents a methodologically rigorous pathway for retaining efficiency under these variations in multimodal data.
In the framework, generative models are designed to recreate missing features by modeling complex distributions within a latent space. A Variational Autoencoder (VAE) is used because it is well suited to modeling complex data distributions. In turn, a Generative Adversarial Network (GAN) pits a discriminator against a generator, improving the quality of the generated features by combining the strengths of both models, as indicated in Equation (33).
$\mathcal{L}_{GAN} = \min_{G} \max_{D} \; \mathbb{E}_{X_{real}}\!\left[\log D\!\left(X_{real}\right)\right] + \mathbb{E}_{z}\!\left[\log\!\left(1 - D\!\left(G(z)\right)\right)\right]$
The synergy between the VAE and GAN is harnessed to improve the fidelity of the generated features, combining the strengths of both generative paradigms, represented in Equation (34):
$\mathcal{L}_{comp} = \alpha \mathcal{L}_{VAE} + \beta \mathcal{L}_{GAN}$
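As an illustration of Equations (33) and (34), the sketch below assembles a composite objective from a VAE term and an adversarial term; the Gaussian reconstruction likelihood, the non-saturating generator loss, and the weights $\alpha$ and $\beta$ are assumptions rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    """Standard VAE objective: reconstruction error plus KL divergence to N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def gan_losses(d_real, d_fake):
    """Eq. (33): adversarial terms computed from discriminator logits."""
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, ones) +
              F.binary_cross_entropy_with_logits(d_fake, zeros))
    g_loss = F.binary_cross_entropy_with_logits(d_fake, ones)  # non-saturating generator loss
    return d_loss, g_loss

def composite_loss(l_vae, l_gan, alpha=1.0, beta=0.5):
    """Eq. (34): L_comp = alpha * L_VAE + beta * L_GAN."""
    return alpha * l_vae + beta * l_gan
```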
The framework therefore adopts adaptive learning mechanisms to cope with the variability within multimodal data. Advanced gradient-based algorithms such as Adam modulate the parameter updates using moment estimation. The training procedure also incorporates curriculum-learning-style strategies that schedule how learning proceeds, dynamically adjusting the learning rate to the changing complexity of the data.

3.7. Computational Efficiency

Computational efficiency in the AMSA-ECFR framework is achieved through a suite of optimization techniques and scalability considerations, countering the usually substantial computational demands of multimodal sentiment analysis frameworks. Hierarchical parameter sharing reduces redundancy in parameters across different network layers. This technique exploits the natural hierarchy in multimodal data, providing an efficient and effective way to share parameters, as shown below:
$\Theta_{shared} = f_h\!\left(X_t, X_a, X_v; \Theta_h\right)$
Furthermore, an optimization algorithm with momentum and adaptive learning rates is implemented to expedite convergence, represented in the following equations:
$v_{t+1} = \mu v_t - \eta_t \nabla_{\Theta} \mathcal{L}(\Theta)$
$\Theta_{t+1} = \Theta_t + v_{t+1}$
The framework’s scalability is bolstered by parallel processing across different modalities and batch normalization techniques. To accommodate varying data sizes and maintain efficiency at scale, a dynamic batching strategy is introduced, as shown below:
$B_{size} = \pi\!\left(S_{data}, \Theta_{eff}\right)$
These strategies collectively enhance the framework’s ability to efficiently process and analyze multimodal sentiment data, ensuring high performance and scalability.
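The momentum update and batching rule above admit a compact illustration. The NumPy sketch below shows classical momentum gradient descent as written in the two update equations, while the batch-size heuristic is only an illustrative stand-in for the unspecified policy $\pi$.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=1e-3, mu=0.9):
    """Momentum update: v_{t+1} = mu * v_t - eta * grad;  theta_{t+1} = theta_t + v_{t+1}."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity

def dynamic_batch_size(num_samples, base=32, max_size=256):
    """Illustrative stand-in for the dynamic batching rule: grow the batch with data size, capped."""
    return int(min(max_size, base * max(1, np.log2(num_samples / 1000 + 1))))
```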

4. Results and Analysis

The AMSA-ECFR framework is compared with state-of-the-art approaches in these experiments, and an extensive analysis is carried out. The state-of-the-art approaches range from baseline methodologies to advanced methods, such as TFR-NET [35], MMIM [36], Self-MM [37], and MISA [38], designed with separate modules for multimodal sentiment analysis. The evaluation includes both complete modality settings, where the textual, audio, and visual modalities are all present, and incomplete modality settings, in which one or more modalities are randomly removed in varying proportions (10%, 20%, …, 50%).
The simulations are conducted in an Anaconda 2020.02 environment using Jupyter Notebook and Python 3.8; the simulation setup is summarized in Table 2. The performance metrics used are Accuracy (Acc-2, Acc-5, Acc-7), Mean Absolute Error (MAE), and Concordance Correlation Coefficient (CCC). The hardware comprises an Intel Xeon CPU @ 2.20 GHz, an NVIDIA Tesla K80 GPU, and 64 GB RAM. The software dependencies are TensorFlow 2.3.0, PyTorch 1.6.0, Scikit-learn 0.23.2, and CUDA 10.1. Parameter tuning is performed using a grid search over learning rates [$1 \times 10^{-5}$, $1 \times 10^{-4}$, $1 \times 10^{-3}$], batch sizes [16, 32, 64], and dropout rates [0.1, 0.2, 0.3]. Reproducibility is ensured by setting the seed to 42 for all random number generators, with the code and data available in a public repository.
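Given the reported hyper-parameter grid and fixed seed, the sketch below shows one way such a sweep could be organized; train_and_evaluate is a hypothetical placeholder for the actual training routine and is not part of any released code.

```python
import itertools
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix all random number generators for reproducibility, as stated in the setup."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# Grid reported in the experimental setup.
learning_rates = [1e-5, 1e-4, 1e-3]
batch_sizes = [16, 32, 64]
dropout_rates = [0.1, 0.2, 0.3]

set_seed(42)
results = {}
for lr, bs, dr in itertools.product(learning_rates, batch_sizes, dropout_rates):
    # train_and_evaluate is a hypothetical hook returning validation MAE for one configuration.
    # results[(lr, bs, dr)] = train_and_evaluate(lr=lr, batch_size=bs, dropout=dr)
    pass
```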

4.1. Dataset Overview

Datasets of diverse complexity and multimodality, namely CH-SIMS [39], CMU-MOSEI [40], and CMU-MOSI [41], serve as the benchmarks for this evaluation. The CH-SIMS dataset is a Chinese single- and multimodal sentiment analysis dataset comprising 2281 refined video segments. This unique dataset provides both multimodal and independent unimodal annotations, allowing researchers to study the interaction between modalities or use the annotations for unimodal sentiment analysis. The CMU-MOSEI dataset has been a rich resource driving progress in multimodal sentiment analysis; it contains a variety of spoken opinions extracted from YouTube videos on different topics by different speakers, which supports applicability to real-world scenarios. The CMU-MOSI dataset includes 2199 opinion video clips, each annotated for sentiment intensity ranging from −3 to +3.

4.2. Data Sparsity Evaluation Using CH-SIMS

The pervasiveness of incomplete datasets often compromises the endeavor to decipher sentiment accurately from multimodal data. In this critical analysis, we scrutinize the resilience of various multimodal sentiment analysis frameworks under varying degrees of data sparsity. Our proposed model is benchmarked against established methodologies, namely TFR-NET, MMIM, Self-MM, and MISA, to elucidate its capability to sustain high accuracy in sentiment prediction despite escalating rates of missing data. This analysis is paramount for applications where data integrity cannot be assured, thus necessitating a model that is not only precise but also robust against the absence of multimodal information.
Figure 3 shows a critical performance evaluation of the proposed model against the TFR-NET, MMIM, Self-MM, and MISA multimodal sentiment analysis frameworks. The comparison covers a wide range of missing-rate conditions, from 0.2 to 1.0, to study the resilience characteristics of each model under data sparsity. The primary metrics for measuring predictive performance are the Mean Absolute Error, the Concordance Correlation Coefficient, and accuracy. Regarding CCC, the proposed model maintains a much stronger correlation with the true sentiment scores, even when the missing rate approaches 1.0, a condition indicative of considerable information loss.

4.3. Data Sparsity Evaluation Using CMU-MOSEI

This analysis compares the proposed model and other notable frameworks, such as TFR-NET, MMIM, Self-MM, and MISA, across a continuum of missing data rates. Figure 4 showcases two primary metrics, Mean Absolute Error (MAE) and Concordance Correlation Coefficient (CCC), alongside the accuracy metrics (Acc-7 and Acc-2), allowing for a multifaceted assessment of performance. Notably, as the missing data rate increases from 0.2 to 1.0, framework resilience is clearly delineated.
In the context of MAE, the proposed model demonstrates a trend of gradual increase, suggesting a retained accuracy of sentiment prediction despite the rising absence of data. This gradual ascent contrasts with the sharper inclines exhibited by other models, thereby effectively underscoring the proposed model’s superior ability to infer sentiment with fewer inputs. The CCC metric further reinforces the proposed model’s competency. Despite a decrease in correlation with the actual sentiment scores as the missing rate approaches 1.0, the proposed model sustains a higher concordance than its counterparts. This higher CCC indicates that the proposed model is more aligned with the true sentiment values, which is a testament to its effective feature fusion and error correction mechanisms. Accuracy, measured at two thresholds (Acc-7 and Acc-2), reveals that the proposed model consistently outperforms other models, maintaining higher accuracy percentages. The resilience of the proposed model in maintaining classification performance is particularly prominent in the Acc-2 graph, where the model’s accuracy remains commendably high even at high missing rates.

4.4. Data Sparsity Evaluation Using CMU-MOSI

Figure 5 charts the resilience of the multimodal sentiment analysis frameworks, including the proposed one, when faced with different degrees of missing data, a condition synonymous with real-world scenarios. The analysis in Figure 5 assesses MAE and CCC, together with accuracy at two thresholds, Acc-5 and Acc-2, conveying each model's capacity to preserve performance integrity when data are scarce. MAE, a measure of the average magnitude of prediction error, shows the resilience of the proposed model: while the missing rate increases from 0.10 to 0.50, the proposed model's MAE increases at a decelerated rate compared with the other frameworks, suggesting that it can predict sentiment robustly even with incomplete data. As shown, when the rate of missing data rose from 0.2 to 1.0, the AMSA-ECFR model's MAE increased by less than that of the other models, demonstrating that it is more resistant to incomplete data and retains its prediction accuracy.

4.5. Computational Efficiency Analysis

In this section, we analyze the computational efficiency of the AMSA-ECFR framework using the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets. We focus on the training time, inference time, and resource utilization to evaluate the model’s performance and identify potential areas for improvement. The optimized AMSA-ECFR framework demonstrated improved efficiency, as shown in Figure 6.
  • Training Time
The training time on each dataset was measured to estimate the computational demands of the AMSA-ECFR framework. The original training times were 10.5 h on CMU-MOSI, 12.3 h on CMU-MOSEI, and 8.7 h on CH-SIMS. After applying the optimization strategies, training took 8.2 h on CMU-MOSI, 9.5 h on CMU-MOSEI, and 6.4 h on CH-SIMS.
  • Inference Time
Inference time was measured as the time to process one batch in the prediction phase. The original inference times were 0.45 s for CMU-MOSI, 0.50 s for CMU-MOSEI, and 0.38 s for CH-SIMS. With the optimized framework, these came down to 0.38 s for CMU-MOSI, 0.42 s for CMU-MOSEI, and 0.30 s for CH-SIMS.
  • Resource Utilization
Resource utilization, in terms of GPU memory and CPU usage, was monitored throughout the training and inference processes. The average GPU memory usage for the original framework was 20.5 GB for the CMU-MOSI dataset, 22.7 GB for the CMU-MOSEI dataset, and 18.9 GB for CH-SIMS. CPU utilization averaged 75% on CMU-MOSI, 78% on CMU-MOSEI, and 70% on CH-SIMS.
  • Optimization Strategies
Hierarchical parameter sharing, dimensionality reduction, and adaptive learning algorithms were employed to further boost computational efficiency. These considerably reduced the training and inference times while maintaining model performance, as detailed below.
  • Hierarchical Parameter Sharing: Reduced parameter redundancy and improved computational efficiency by sharing parameters across network layers.
  • Dimensionality Reduction: Applied learned projections to reduce feature dimensionality, resulting in faster computation and lower memory usage.
  • Adaptive Learning Algorithms: Utilized optimization algorithms with momentum and adaptive learning rates to expedite convergence.

4.6. Ablation Experiments of the AMSA-ECFR Approach

Ablation studies are critical for understanding the contribution of individual components in a complex model like AMSA-ECFR. This subsection presents a detailed analysis of how each component affects overall performance on three standard datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. We conducted experiments by systematically removing or modifying key components of AMSA-ECFR: (i) the Transformer-based textual feature encoding, (ii) the LSTM-based auditory feature encoding, (iii) the dynamic cross-modal interaction mechanism, and (iv) the generative modeling for incomplete data. As shown in Table 3, the results confirm each component's integral role in the AMSA-ECFR model's overall performance. In every case, removing a component leads to a decrease in accuracy (Acc-7 and Acc-2), an increase in Mean Absolute Error, and a decrease in the Concordance Correlation Coefficient.
  • Full Model Performance: With all components intact, the AMSA-ECFR model achieves its highest performance metrics, indicating the synergistic effect of the combined components. This optimal state serves as a benchmark for evaluating the impact of each component’s removal.
  • Impact of Removing Transformer Encoding: The removal of Transformer-based textual feature encoding results in a significant drop in Acc-7 and Acc-2, indicating its crucial role in textual data comprehension.
  • Impact of Removing LSTM Encoding: Excluding the LSTM-based auditory feature encoding leads to a decrement in performance, although not as drastic as removing the Transformer encoding.
  • Impact of Removing Cross-Modal Interaction: The absence of the dynamic cross-modal interaction mechanism results in a noticeable decrease in performance metrics.
  • Impact of Removing Generative Modeling: The removal of the generative modeling component for handling incomplete data shows a decline in performance, albeit the least severe among all components.
Table 4 shows the contribution of each component: Transformer encoding has the highest impact on the accuracy metrics and yields a significant reduction in error, emphasizing its importance in textual data processing, while the LSTM encoding notably affects the model's performance, particularly in handling audio data. Compared with competing models, the full AMSA-ECFR model demonstrates superior performance on all metrics, as illustrated in Table 5. Even with individual components removed, the AMSA-ECFR variants outperform the other models.
The ablation experiments conclusively demonstrate the essential role of each component in the AMSA-ECFR framework. The quantitative analyses reveal how integrating these components synergistically enhances the model’s performance, offering insights for future improvements and affirming the model’s superiority in multimodal sentiment analysis.

4.7. Intermodal Sentiment Dynamics

The provided multimodal sentiment analysis visual illustrates a detailed juxtaposition of linguistic, auditory, and visual data streams, highlighting the intricate interplay between these modalities in conveying sentiment. Figure 6 depicts three distinct emotional expressions—disappointment, emphasis, and neutrality—corresponding to specific spoken words, “just”, “really bland”, and “forgettable”, respectively. In the top panel, we observe a clear visual demarcation of the subject’s disappointed expression when uttering the word “just”. This is complemented by the heatmap overlay, which indicates a low level of activity across the visual (V), audio (A), and text (L) modalities, suggesting a subdued multimodal context that aligns with the semantic connotation of disappointment.
The central panel captures a moment of emphasis on the phrase “really bland”. The heatmap intensifies significantly around “bland”, particularly in the audio modality, as shown in Figure 7. This enhancement suggests a heightened vocal emphasis corresponding to a raised tone or increased volume, underscoring the sentiment conveyed. The final panel presents a neutral expression and tone when the subject speaks the word “forgettable”. The heatmap displays uniform activity across the modalities, indicating a balanced, if muted, multimodal engagement. This neutrality in the delivery suggests an absence of strong sentiment, as the word conveys a sense of mediocrity or lack of distinctiveness. The temporal alignment of modalities is deftly visualized by the waveforms and heatmap, where the synchronization of peaks and troughs across the modalities provides insight into how the congruence or divergence of these signals affects sentiment perception. The layering of these signals demonstrates the complexity of multimodal sentiment analysis and the importance of an integrated approach to decipher the nuanced intermodal dynamics.
Analyzing such multimodal data is critical in developing sophisticated sentiment analysis models that can interpret the content of speech and the accompanying non-verbal cues. This visual representation underscores the necessity for models like the AMSA-ECFR to discern subtle multimodal interactions that contribute to the overall sentiment, especially in the realm of human–computer interaction, where such nuances are paramount.

5. Discussion

The AMSA-ECFR model demonstrated superior performance in extensive simulation studies compared to existing approaches. The framework consistently achieved higher accuracy and correlation with ground-truth sentiments across various rates of missing data, underscoring its efficacy and potential for real-world applications. One limitation of the proposed AMSA-ECFR framework is its dependency on high-quality, labeled multimodal datasets for optimal performance. In scenarios where such datasets are scarce or unavailable, the model's accuracy and robustness may be compromised, highlighting the need for further research into data augmentation and semi-supervised learning techniques. The symmetry in feature integration and data alignment contributed significantly to the model's robustness and precision, validating our approach to addressing the challenges of incomplete or noisy inputs in multimodal sentiment analysis. This section delineates the approach's practical applications and adaptability across diverse datasets, languages, and contexts. In the data sparsity analysis, a critical evaluation domain for multimodal sentiment analysis, our proposed AMSA-ECFR model exhibits strong resilience, maintaining predictive accuracy from low to high missing-data rates. The detailed performance analysis depicted in Figure 3 shows that the model handles incomplete datasets much better than the state-of-the-art frameworks, which include TFR-NET, MMIM, Self-MM, and MISA. This is reflected in the mean absolute error (MAE), correlation (Corr), and the accuracy metrics (Acc-7 and Acc-2). The proposed model's MAE remained consistently lower than that of the other models across all missing data rates, from 0.2 to 1.0, showing its robustness in retaining accuracy even as data sparsity increases. Specifically, at a 1.0 missing data rate, the proposed model registered an MAE of 138.631, lower than that of the other models; MISA, for instance, recorded an MAE of 140.51 at the same sparsity level.
In terms of correlation, the proposed model again leads, with scores decreasing at a slower rate as the missing data rate grows than those of its counterparts. This indicates an enhanced ability to preserve the relationship between predicted and actual sentiment scores. For example, the proposed model still registered a correlation score of 25.402 at a 1.0 missing data rate, indicating less degradation than MISA's 18.115. The accuracy metrics, both Acc-7 and Acc-2, further validate the proposed model's efficacy. For Acc-7, the model maintained high accuracy percentages as the missing data levels increased, showing its ability to classify sentiments precisely with few errors. At a 1.0 missing data rate, the proposed model recorded 22.465% accuracy, surpassing MISA's 18.66%. Likewise, for binary sentiment classification accuracy (Acc-2), the proposed model's scores consistently outperformed the others, with a more moderate drop as the rate of missing data increased. The proposed model achieves an Acc-2 of 57.088%; even under the influence of data sparsity at a 1.0 missing rate, this remains greater than that of any other model, such as MISA, which scored 51.523% at the same rate.
In the context of the multimodal sentiment analysis frameworks, Figure 5 presents the data sparsity assessment carried out on the CMU-MOSI dataset over different missing data rate conditions. It shows that our proposed model delivers very stable performance with respect to the metrics considered, Mean Absolute Error (MAE), Concordance Correlation Coefficient (Corr), and accuracy (Acc-5 and Acc-2, for five-class and binary ratings, in percentage), even as the missing data rates increase. For the MAE, which reflects the magnitude of prediction error, the proposed model shows the smallest increase in error as the missing data rate grows from 0.1 to 0.4: it starts from an MAE of 39.067 and rises only slightly to 43.044, suggesting solid feature generation and integration capabilities. In contrast, competing models such as MISA show a much steeper escalation in error, starting at an MAE of 56.768 and peaking at 58.635.
The correlation metrics further confirm that the proposed model maintains robust alignment with the ground-truth sentiment, even when half of the data are absent (0.5). Its correlation score of 53.737 is far better than that of models such as MISA, which drops to 3.564 at the same level of data sparsity, signifying that model's ineffectiveness in dealing with sparse data. The proposed model also performs consistently in five-level (Acc-5) and binary (Acc-2) sentiment classification, as the accuracy assessments indicate. Acc-5 starts at 47.73% and declines only gradually to 42.268% as the missing data rate rises to 0.4; this decline is much more controlled than for models such as TFR-NET or MMIM, which show a far larger drop in performance. Similarly, in binary accuracy (Acc-2), the proposed model shows commendable tenacity, reaching 83.199% at a missing data rate of 0.1 and not falling below 75.124% even when half of the data are missing (0.5). This performance is much better than that of all other models, showing that the proposed model handles sparse data without much loss in predictive accuracy.

5.1. Implication of the Proposed Approach and Comparison with Existing Approaches

The AMSA-ECFR model is designed to overcome specific limitations identified in existing MSA approaches. Key features of AMSA-ECFR include its ability to handle various data modalities flexibly, reducing reliance on specific modality pairs. This significantly improves over traditional models that often depend on fixed modality combinations like image-text pairs. Furthermore, the AMSA-ECFR framework addresses the computational complexities of multi-level attentive networks. By implementing adaptive attention mechanisms, AMSA-ECFR achieves high computational efficiency without sacrificing the depth of analysis. This efficiency is crucial in handling large-scale data and complex multimodal scenarios. Another notable advancement is AMSA-ECFR’s capability to handle dynamically changing relevance in multimodal data. Traditional models often struggle with this aspect, leading to ineffective sentiment analysis. AMSA-ECFR’s context-aware approach adapts to the changing relevance, ensuring accurate sentiment interpretation. Additionally, AMSA-ECFR’s architecture is designed for scalability and efficient resource management, addressing existing multi-level attention networks’ scalability and resource constraints. This makes AMSA-ECFR suitable for large-scale implementations. In contrast to models like SKEAFN, which heavily depend on the quality of external sentiment knowledge, AMSA-ECFR’s robustness stems from its internal architecture, ensuring consistent performance across diverse datasets. Moreover, unlike approaches that focus on specific modalities or require complex management of multiple models, AMSA-ECFR offers a unified, comprehensive framework, simplifying management and ensuring a balanced multimodal analysis. The comparative analysis of the limitations of the existing approaches that the AMSA-ECFR addresses is illustrated in Table 6.

5.2. Real-World Use Cases

A significant real-world application of the AMSA-ECFR method lies in social media analytics, where it gauges emotion by reading and interpreting user-generated content. By drawing on multimodal inputs, from text posts to audio clips and video content, it offers an inclusive technique for gaining insight into public opinion and trends. For example, marketing and brand management teams can determine customer attitudes toward products or campaigns by integrating reviews, vlogs, and comments for holistic sentiment analysis. In healthcare, AMSA-ECFR can assess patients' well-being by extracting verbal and non-verbal cues from telemedicine sessions. Integrating verbal descriptions, tone of voice, and visual expressions allows for a more accurate assessment of patient states, which is vital for healthcare services delivered at a distance. Another application is in automated customer service and support systems, where understanding client sentiment is critical. AMSA-ECFR can manage customer inquiries and complaints across emails, voice calls, and video interactions, providing greater insight into the problems raised by customers for better service and resolution.

5.3. Adaptability to Different Datasets

A further source of flexibility in the AMSA-ECFR approach is its adaptability to different varieties of datasets. Such adaptability goes beyond data formats or modalities to linguistic and contextual variability. At the architectural level, language-specific components, such as BERT for text processing, can simply be replaced by models trained on another language, ensuring efficiency across linguistic borders. The approach is also built to allow re-training and fine-tuning on domain-specific data, making its application viable across contexts: be it social media language, the technical jargon of customer service manuals, or the sensitive, empathetic tone of healthcare communication, AMSA-ECFR can be calibrated to interpret the relevant nuances appropriately. The AMSA-ECFR framework is likewise resilient to datasets with modality variability, owing to its intrinsic cross-modal interaction model.

6. Conclusions

In this paper, an innovative model named AMSA-ECFR has been proposed, and its performance has been evaluated rigorously under different levels of data completeness. The proposed AMSA-ECFR model improved performance in comparative evaluations against traditional approaches such as TFR-NET, MMIM, Self-MM, and MISA. Our comprehensive simulations have shown the excellent accuracy and predictive fidelity of the AMSA-ECFR model under different missing-data scenarios. We assessed the performance of our model quantitatively, showing a smaller increase in Mean Absolute Error and a consistently high Concordance Correlation Coefficient compared to its contemporaries. For instance, as the missing data rate increased from 0.2 to 1.0, the MAE of the AMSA-ECFR model increased only slightly, showing its ability to maintain accuracy despite data sparsity. On the CCC front, it maintained a good correlation with ground-truth sentiment even at a missing data rate as high as 1.0. In addition, all our accuracy metrics, such as Accuracy-7 and Accuracy-2, showed that the AMSA-ECFR model was the best performer. For example, in scenarios with a 0.5 missing data rate, the AMSA-ECFR model achieved an Acc-7 of approximately 85%, significantly outperforming other models that averaged around 75%. Similarly, in the Acc-2 metric, the AMSA-ECFR model maintained an accuracy above 90%, even in high missing data scenarios, while other models showed a notable decline to around 80%. This work offers novel contributions to the academic debate on addressing missing modalities in sentiment analysis and adds one more tool to the arsenal of researchers in computational linguistics. Future work will focus on extending the AMSA-ECFR framework to handle more diverse datasets and exploring real-time sentiment analysis applications, enhancing both adaptability and practical implementation.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sym16070934/s1, Table S1. Dynamic Cross-Modal Interaction Model, Table S2. Transformer Architecture for Multimodal Feature Integration, Table S3. Robustness Enhancement Strategies.

Author Contributions

Conceptualization, Q.C.; Methodology, Q.C. and P.W.; Software, Q.C. and S.D.; Validation, S.D.; Formal analysis, Q.C.; Investigation, Q.C., S.D. and P.W.; Resources, S.D. and P.W.; Data curation, Q.C. and P.W.; Writing—original draft, Q.C.; Writing—review & editing, S.D. and P.W.; Supervision, S.D.; Project administration, P.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62166018) and the Jiangxi Provincial Social Science Planning Project (15WTZD12).

Data Availability Statement

The data will be made available upon reasonable request from the corresponding author.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62166018), Jiangxi Provincial Social Science Planning Project (15WTZD12).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444. [Google Scholar] [CrossRef]
  2. Aslam, A.; Sargano, A.B.; Habib, Z. Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks. Appl. Soft Comput. 2023, 144, 110494. [Google Scholar] [CrossRef]
  3. Yadav, A.; Vishwakarma, D.K. A deep multi-level attentive network for multimodal sentiment analysis. ACM Trans. Multimedia Comput. Commun. Appl. 2023, 19, 1–19. [Google Scholar] [CrossRef]
  4. Paul, A.; Nayyar, A. A context-sensitive multi-tier deep learning framework for multimodal sentiment analysis. Multimedia Tools Appl. 2023, 83, 54249–54278. [Google Scholar]
  5. Das, R.; Singh, T.D. Image–Text Multimodal Sentiment Analysis Framework of Assamese News Articles Using Late Fusion. ACM Trans. Asian Low-Resource Lang. Inf. Process. 2023, 22, 1–30. [Google Scholar] [CrossRef]
  6. Zhu, L.; Zhu, Z.; Zhang, C.; Xu, Y.; Kong, X. Multimodal sentiment analysis based on fusion methods: A survey. Inf. Fusion 2023, 95, 306–325. [Google Scholar] [CrossRef]
  7. Lu, Q.; Sun, X.; Long, Y.; Gao, Z.; Feng, J.; Sun, T. Sentiment Analysis: Comprehensive Reviews, Recent Advances, and Open Challenges. IEEE Trans. Neural Netw. Learn. Syst. 2023. [Google Scholar] [CrossRef] [PubMed]
  8. Das, R.; Singh, T.D. Multimodal sentiment analysis: A survey of methods, trends, and challenges. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  9. Jwalanaiah, S.J.T.; Jacob, I.J.; Mandava, A.K. Effective deep learning based multimodal sentiment analysis from unstructured big data. Expert Syst. 2022, 40, e13096. [Google Scholar] [CrossRef]
  10. Rahmani, S.; Hosseini, S.; Zall, R.; Kangavari, M.R.; Kamran, S.; Hua, W. Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects. Knowl.-Based Syst. 2023, 261, 110219. [Google Scholar] [CrossRef]
  11. Liu, Z.; Zhou, B.; Chu, D.; Sun, Y.; Meng, L. Modality translation-based multimodal sentiment analysis under uncertain missing modalities. Inf. Fusion 2024, 101, 101973. [Google Scholar] [CrossRef]
  12. Akhtar, S.; Chauhan, D.S.; Ekbal, A. A deep multi-task contextual attention framework for multi-modal affect analysis. ACM Trans. Knowl. Discov. Data 2020, 14, 1–27. [Google Scholar] [CrossRef]
  13. Ma, M.; Ren, J.; Zhao, L.; Testuggine, D.; Peng, X. Are multimodal transformers robust to missing modality? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18177–18186. [Google Scholar]
  14. Zhang, L.; Liu, C.; Jia, N. Uni2mul: A conformer-based multimodal emotion classification model by considering unimodal expression differences with multi-task learning. Appl. Sci. 2023, 13, 9910. [Google Scholar] [CrossRef]
  15. Liu, X.; Wei, F.; Jiang, W.; Zheng, Q.; Qiao, Y.; Liu, J.; Niu, L.; Chen, Z.; Dong, H. MTR-SAM: Visual Multimodal Text Recognition and Sentiment Analysis in Public Opinion Analysis on the Internet. Appl. Sci. 2023, 13, 7307. [Google Scholar] [CrossRef]
  16. Yuan, Z.; Liu, Y.; Xu, H.; Gao, K. Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis. IEEE Trans. Multimedia 2023, 26, 529–539. [Google Scholar] [CrossRef]
  17. Mao, H.; Zhang, B.; Xu, H.; Yuan, Z.; Liu, Y. Robust-MSA: Understanding the impact of modality noise on multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 16458–16460. [Google Scholar]
  18. Huang, C.; Zhang, J.; Wu, X.; Wang, Y.; Li, M.; Huang, X. TeFNA: Text-centered fusion network with crossmodal attention for multimodal sentiment analysis. Knowl.-Based Syst. 2023, 269, 110502. [Google Scholar] [CrossRef]
  19. Makiuchi, M.R.; Uto, K.; Shinoda, K. Multimodal emotion recognition with high-level speech and text features. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 350–357. [Google Scholar]
  20. Xu, H. Multimodal Sentiment Analysis. In Multi-Modal Sentiment Analysis; Springer: Berlin/Heidelberg, Germany, 2023; pp. 217–240. [Google Scholar]
  21. Li, M.; Yang, D.; Zhang, L. Towards Robust Multimodal Sentiment Analysis Under Uncertain Signal Missing. IEEE Signal Process. Lett. 2023, 30, 1497–1501. [Google Scholar] [CrossRef]
  22. Dang, C.N.; Moreno-García, M.N.; De la Prieta, F. An Approach to Integrating Sentiment Analysis into Recommender Systems. Sensors 2021, 21, 5666. [Google Scholar] [CrossRef]
  23. Dang, N.C.; Moreno-García, M.N.; De la Prieta, F. Sentiment Analysis Based on Deep Learning: A Comparative Study. Electronics 2020, 9, 483. [Google Scholar] [CrossRef]
  24. Mujahid, M.; Lee, E.; Rustam, F.; Washington, P.B.; Ullah, S.; Reshi, A.A.; Ashraf, I. Sentiment Analysis and Topic Modeling on Tweets about Online Education during COVID-19. Appl. Sci. 2021, 11, 8438. [Google Scholar] [CrossRef]
  25. Prottasha, N.J.; Sami, A.A.; Kowsher, M.; Murad, S.A.; Bairagi, A.K.; Masud, M.; Baz, M. Transfer Learning for Sentiment Analysis Using BERT Based Supervised Fine-Tuning. Sensors 2022, 22, 4157. [Google Scholar] [CrossRef]
  26. Koukaras, P.; Nousi, C.; Tjortjis, C. Stock Market Prediction Using Microblogging Sentiment Analysis and Machine Learning. Telecom 2022, 3, 358–378. [Google Scholar] [CrossRef]
  27. Liu, J.; Fu, F.; Li, L.; Yu, J.; Zhong, D.; Zhu, S.; Zhou, Y.; Liu, B.; Li, J. Efficient Pause Extraction and Encode Strategy for Alzheimer’s Disease Detection Using Only Acoustic Features from Spontaneous Speech. Brain Sci. 2023, 13, 477. [Google Scholar] [CrossRef]
  28. Zhu, T.; Li, L.; Yang, J.; Zhao, S.; Liu, H.; Qian, J. Multimodal sentiment analysis with image-text interaction network. IEEE Trans. Multimedia 2022, 25, 3375–3385. [Google Scholar] [CrossRef]
  29. Ghorbanali, A.; Sohrabi, M.K.; Yaghmaee, F. Ensemble transfer learning-based multimodal sentiment analysis using weighted convolutional neural networks. Inf. Process. Manag. 2022, 59, 102929. [Google Scholar] [CrossRef]
  30. Chen, D.; Su, W.; Wu, P.; Hua, B. Joint multimodal sentiment analysis based on information relevance. Inf. Process. Manag. 2023, 60, 103193. [Google Scholar] [CrossRef]
  31. Xue, X.; Zhang, C.; Niu, Z.; Wu, X. Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 2022, 35, 5105–5118. [Google Scholar] [CrossRef]
  32. Zhu, C.; Chen, M.; Zhang, S.; Sun, C.; Liang, H.; Liu, Y.; Chen, J. SKEAFN: Sentiment Knowledge Enhanced Attention Fusion Network for multimodal sentiment analysis. Inf. Fusion 2023, 100, 101958. [Google Scholar] [CrossRef]
  33. Salur, M.U.; Aydın, İ. A soft voting ensemble learning-based approach for multimodal sentiment analysis. Neural Comput. Appl. 2022, 34, 18391–18406. [Google Scholar] [CrossRef]
  34. Kumar, V.S.; Pareek, P.K.; de Albuquerque, V.H.C.; Khanna, A.; Gupta, D.; Renukadevi, D. Multimodal Sentiment Analysis using Speech Signals with Machine Learning Techniques. In Proceedings of the 2022 IEEE 2nd Mysore Sub Section International Conference (MysuruCon), Mysuru, India, 16–17 October 2022; pp. 1–8. [Google Scholar]
  35. Yuan, Z.; Li, W.; Xu, H.; Yu, W. Transformer-based feature reconstruction network for robust multimodal sentiment analysis. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event/Chengdu, China, 20–24 October 2021; pp. 4400–4407. [Google Scholar]
  36. Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv 2021, arXiv:2109.00412. [Google Scholar]
  37. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar]
  38. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  39. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar]
  40. Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.P. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  41. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 2236–2246. [Google Scholar]
Figure 1. Conceptual representation of AMSA-ECFR’s approach to overcoming traditional MSA challenges.
Figure 2. AMSA-ECFR framework: an integrative architecture for robust Multimodal Sentiment Analysis featuring advanced feature encoding, dynamic cross-modal interaction, and context-aware reconstruction.
Figure 3. Comparative performance analysis of Multimodal Sentiment Analysis frameworks over increasing missing data rates, highlighting the proposed model’s superior resilience and predictive accuracy.
Figure 4. Performance metrics of Multimodal Sentiment Analysis frameworks as a function of increasing missing data rates, demonstrating the proposed model’s enduring accuracy and correlation with ground truth sentiment.
Figure 5. Efficacy of Multimodal Sentiment Analysis frameworks in conditions of varied missing data rates, showcasing the proposed model’s consistency in MAE, CCC, and accuracy measures.
Figure 6. Computational efficiency analysis of the AMSA-ECFR framework.
Figure 7. Synchronized multimodal analysis depicting emotional expressions in correlation with linguistic, auditory, and visual data streams, demonstrating the complex layering of sentiment cues.
Table 1. Comparative analysis of multimodal sentiment analysis studies.
Study | Year | Approach | Main Contribution | Limitations
[28] | 2022 | Image–text interaction network | Effective fusion of image and text for sentiment analysis | Reliance on image–text pairs
[29] | 2023 | Multi-level attentive network | Multi-level attention mechanism for modality integration | Computational complexity
[30] | 2022 | Ensemble transfer learning with CNNs | Weighted CNNs for modality-specific feature capture | Complexity in model integration
[31] | 2023 | Information relevance-based analysis | Relevance-based fusion for multimodal data | Handling of dynamically changing relevance
[32] | 2022 | Multi-level attention map network | Detailed attention mechanism for nuanced feature extraction | Scalability and resource requirements
[33] | 2023 | SKEAFN | Sentiment knowledge-enhanced attention fusion | Dependence on quality of sentiment knowledge
[34] | 2022 | Soft voting ensemble learning | Ensemble technique efficacy in MSA | Management complexity of multiple models
[35] | 2022 | Speech signal analysis with ML | Focus on speech-based features in MSA | Limited focus on speech, overlooking other modalities
Table 2. Simulation setup for comparative analysis of Multimodal Sentiment Analysis frameworks.
Component | Description
Frameworks Compared | AMSA-ECFR (proposed), TFR-NET, MMIM, Self-MM, MISA
Complete Modality Settings | All modalities present: textual, audio, visual
Incomplete Modality Settings | Randomly remove one or more modalities in varying proportions (10%, 20%, …, 50%); see the masking sketch after this table
Simulation Environment | Anaconda 2020.02 with Jupyter Notebook, Python 3.8
Metrics for Evaluation | Accuracy (Acc-2, Acc-5, Acc-7), Mean Absolute Error (MAE), Concordance Correlation Coefficient (CCC)
Hardware Specifications | Intel Xeon CPU @ 2.20 GHz, NVIDIA Tesla K80 GPU, 64 GB RAM
Software Dependencies | TensorFlow 2.3.0, PyTorch 1.6.0, Scikit-learn 0.23.2, CUDA 10.1
Parameter Tuning | Grid search over learning rate [1 × 10^-5, 1 × 10^-4, 1 × 10^-3], batch size [16, 32, 64], dropout rate [0.1, 0.2, 0.3]
Reproducibility | Seed set to 42 for all random number generators; code and data made available in a public repository
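To make the incomplete-modality setting in Table 2 concrete, the sketch below illustrates one way such masking could be simulated with a fixed seed, dropping whole modality streams for a fraction of samples. The zero-filling choice, the independent per-modality dropping, the drop_modalities helper, and the toy feature dimensions are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, as listed under Reproducibility in Table 2

def drop_modalities(text, audio, vision, missing_rate=0.3):
    """Zero out whole modality streams for a random subset of samples.

    Each modality is dropped independently for roughly `missing_rate` of the
    samples, approximating the 10-50% incomplete-modality settings in Table 2.
    """
    n = text.shape[0]
    for feats in (text, audio, vision):
        mask = rng.random(n) < missing_rate   # samples that lose this modality
        feats[mask] = 0.0                     # missing stream represented by zeros
    return text, audio, vision

# Example: 100 samples with toy per-modality feature dimensions.
text = rng.normal(size=(100, 768))
audio = rng.normal(size=(100, 74))
vision = rng.normal(size=(100, 35))
text, audio, vision = drop_modalities(text, audio, vision, missing_rate=0.3)
```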
Table 3. Overall impact of component removal on AMSA-ECFR performance metrics with respect to CMU-MOSI, CMU-MOSEI, and CH-SIMS.
Component Removed | Accuracy (Acc-7) | Accuracy (Acc-2) | Mean Absolute Error (MAE) | Concordance Correlation Coefficient (CCC)
None (Full Model) | -/87.3 | 93.0/94.8 | 0.11 | 0.85
Transformer Encoding | 81.0/83.3 | -/88.0 | 0.17 | 0.78
LSTM Encoding | 82.0/82.4 | -/89.0 | 0.16 | 0.80
Cross-Modal Interaction | 84.6/84.9 | 90.1/94.5 | 0.14 | 0.82
Generative Modeling | 84.3/86.4 | 91.0/94.3 | 0.13 | 0.83
Table 4. Contribution of each component to performance enhancement.
Component | Improvement in Acc-7 | Improvement in Acc-2 | Reduction in MAE | Increase in CCC
Transformer Encoding | +5.2% | +4.5% | −0.06 | +0.07
LSTM Encoding | +4.4% | +3.8% | −0.05 | +0.05
Cross-Modal Interaction | +3.0% | +2.7% | −0.03 | +0.03
Generative Modeling | +2.2% | +2.0% | −0.02 | +0.02
Table 5. Overall comparative analysis of AMSA-ECFR with modified components against other models with respect to CMU-MOSI, CMU-MOSEI, and CH-SIMS.
Model/Modification | Accuracy (Acc-7) | Accuracy (Acc-2) | MAE | CCC
AMSA-ECFR (Full Model) | -/87.3 | 93.0/94.8 | 0.11 | 0.85
AMSA-ECFR w/o Transformer | 81.0/83.3 | -/88.0 | 0.17 | 0.78
AMSA-ECFR w/o LSTM | 82.0/82.4 | -/89.0 | 0.16 | 0.80
TFR-NET | 80.2/82.1 | 86.2/88.0 | 0.19 | 0.76
MMIM | 79.6/81.3 | 84.2/87.4 | 0.20 | 0.75
Self-MM | 75.3/77.2 | 79.1/81.4 | 0.14 | 0.69
MISA | 72.3/75.8 | 74.7/80.2 | 0.13 | 0.66
Table 6. Limitations in existing MSA approaches and addressal by AMSA-ECFR.
Existing Approach Limitations | Addressed by AMSA-ECFR
Reliance on specific modality pairs for analysis, limiting flexibility in data handling (e.g., image–text interaction networks) | AMSA-ECFR employs a dynamic fusion mechanism that is adaptable to various data combinations, not limited to specific modality pairs.
High computational complexity in multi-level attentive networks, impeding efficiency | AMSA-ECFR optimizes computational efficiency through adaptive attention mechanisms, reducing processing overhead.
Difficulty in integrating modality-specific features due to complex ensemble transfer learning with CNNs | AMSA-ECFR’s architecture simplifies feature integration by employing advanced encoding techniques for different modalities.
Inability to handle dynamically changing relevance in data, as seen in information relevance-based analysis models | AMSA-ECFR incorporates a context-aware fusion approach that adapts to the changing relevance of multimodal data.
Scalability and resource constraints in multi-level attention map networks | AMSA-ECFR’s design ensures scalability and manages resource use effectively.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
