1. Introduction
Sentiment analysis can be applied in various domains, including marketing, customer service, brand management, political analysis, and social listening [1]. It has emerged as a highly active research area due to the enormous volume of data generated daily on social media platforms and the World Wide Web. This abundance of data provides a rich source for sentiment analysis research and applications. Conventional sentiment analysis models primarily concentrate on text-based content [2]. Nonetheless, technological advances now allow individuals to express their opinions and emotions through various channels, including text, images, and videos. As a result, sentiment analysis is transitioning from a single modality to multiple modalities, opening novel possibilities driven by the rapid growth of the field. Integrating complementary data streams enables sentiment detection that surpasses the limitations of text-only analysis [3].
Recent advancements in multimodal sentiment analysis architectures can be categorized into ten distinct categories [4]. Different fusion methods have their own strengths and limitations. Although multimodal fusion can overcome the limitations of a single modality, in real open environments multimodal data are usually disturbed by noise, defects, and outliers, making it difficult to satisfy the complementarity and consistency assumptions of multimodality [5]. In the past few years, numerous researchers have studied sentiment analysis based on images and text. However, many existing approaches either rely on a straightforward concatenation of features extracted from the different modalities [6] or capture only coarse-level relationships between images and text [7]. Indeed, in the real open world, the sentiment polarity of text and visual content is not always aligned, which is one of the key challenges that must be addressed for reliable multimodal learning.
In recent years, numerous studies have used the MVSA datasets [8] (MVSA-single, MVSA-multiple) as benchmarks for image–text sentiment analysis. These studies have pointed out that, in the real open world, the sentiment polarity of text and visual content is not entirely consistent. Researchers therefore often preprocess the MVSA datasets by removing samples with opposite sentiment polarities; when one modality expresses a neutral sentiment while the other is positive or negative, the sample is labeled with the non-neutral polarity. Even after this filtering, a considerable number of samples with inconsistent sentiments remain. For clarity, we call samples where one modality is neutral while the other is positive or negative inconsistent samples, and samples where both modalities share the same polarity polarity-consistent samples.

In Figure 1, a detailed analysis of the filtered MVSA datasets shows that a considerable proportion of samples exhibit inconsistent polarities: 42.5% in MVSA-single and 26.0% in MVSA-multiple. By definition, in each inconsistent sample one modality's label is neutral while the other is positive or negative. This means the classifier must learn, for 42.5% or 26.0% of the data, to map an originally neutral modality to a positive or negative class, which poses a significant challenge for any model. Viewed from another perspective, when a modality's sentiment is initially neutral, the classifier needs to learn to associate it with a positive or negative sentiment in conjunction with the other modality. It must effectively handle the high uncertainty that arises during this polarity transformation to achieve consistent and balanced learning across modalities. This entails capturing the nuanced relationships between modalities and understanding how each contributes to the overall sentiment analysis task, while minimizing the ambiguity inherent in polarity conversion to obtain reliable and accurate results.
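To make the partitioning concrete, the following is a minimal Python sketch, assuming each sample carries separate text and image sentiment labels; the field names and the toy data are hypothetical, not the MVSA file format:

```python
def split_by_polarity(samples):
    """Partition (text_label, image_label) pairs from the filtered dataset,
    i.e., after pairs with opposite polarities (positive vs. negative)
    have already been removed."""
    consistent, inconsistent = [], []
    for text_label, image_label in samples:
        if text_label == image_label:
            consistent.append((text_label, image_label))
        else:
            # After filtering, the only remaining disagreement is
            # neutral vs. positive/negative.
            inconsistent.append((text_label, image_label))
    return consistent, inconsistent

# Toy data only; the real proportions come from the MVSA annotations.
samples = [("positive", "positive"), ("neutral", "positive"),
           ("negative", "negative"), ("neutral", "negative")]
consistent, inconsistent = split_by_polarity(samples)
print(f"inconsistent: {len(inconsistent) / len(samples):.1%}")  # 50.0%
```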
In particular, we should note that for the MVSA-multiple dataset, each text–image pair is shown to three annotators, each of whom independently judges the sentiment of the text and of the image. For the same pair, the sentiment polarities given by different annotators often differ, indicating widespread high uncertainty in the model learning process. This uncertainty must be addressed to make the model's classification more robust.
However, it is regrettable that current studies [9,10,11] on multimodal sentiment analysis rarely focus on uncertainty calibration, overlooking its crucial role in improving model performance. Recently, some studies [12,13] have sought to improve model performance from the perspective of uncertainty calibration, but these methods typically discuss the issue from a unimodal perspective, neglecting the challenge of inconsistent sentiment polarities across modalities, which poses a new challenge to uncertainty calibration.
Therefore, it is necessary to conduct further research to explore how to achieve effective uncertainty calibration in multimodal sentiment analysis. By accurately estimating and calibrating the uncertainty of models, we can enhance their reliability and robustness, thereby better addressing the differences in sentiment expression across different modalities and providing more accurate and consistent sentiment analysis results.
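As a concrete example of such calibration measurement, the experiments later report expected uncertainty calibration error (UCE). The following is a minimal sketch of the binned form commonly used in the calibration literature; the exact binning and uncertainty measure used in the experiments are assumptions here:

```python
import numpy as np

def uce(uncertainties, errors, n_bins=10):
    """uncertainties: predictive uncertainty per sample, scaled to [0, 1].
    errors: 1.0 if the prediction was wrong, 0.0 if it was correct."""
    u = np.asarray(uncertainties, dtype=float)
    e = np.asarray(errors, dtype=float)
    # Assign each sample to one of n_bins equal-width uncertainty bins.
    bins = np.minimum((u * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin by its share of samples, accumulating the
            # gap between average error and average uncertainty in the bin.
            total += mask.mean() * abs(e[mask].mean() - u[mask].mean())
    return total

# A roughly calibrated toy case: low uncertainty on correct predictions,
# high uncertainty on wrong ones, yielding a small UCE.
print(uce([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

Intuitively, a well-calibrated model's uncertainty should match its error rate in every bin, so lower UCE indicates better calibration.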
Based on the above analysis, we propose an uncertain-aware late fusion method based on hybrid uncertainty calibration (ULF-HUC) to enhance the calibration and classification of the model. The main contributions of this paper are summarized as follows:
- We propose a hybrid uncertainty calibration (HUC) method, which utilizes the labels of both modalities to impose uncertainty constraints on each modality separately, aiming to reduce the uncertainty in each modality and enhance the calibration ability of the model.
- We propose an uncertain-aware late fusion (ULF) method to enhance the classification ability of the model (a sketch follows this list).
- We add common types of noise, such as Gaussian noise and salt-and-pepper noise, to the test set. Experimental results demonstrate that our proposed model exhibits stronger generalization ability.
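To illustrate the late fusion idea in the second contribution, here is a minimal sketch, assuming per-modality softmax outputs and predictive entropy as the uncertainty estimate; the actual ULF weighting scheme is detailed in Section 3 and may differ:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def late_fuse(p_text, p_image):
    """Weight each modality inversely to its predictive entropy, so the
    more certain modality dominates the fused prediction."""
    w_text = 1.0 / (entropy(p_text) + 1e-12)
    w_image = 1.0 / (entropy(p_image) + 1e-12)
    fused = w_text * np.asarray(p_text) + w_image * np.asarray(p_image)
    return fused / fused.sum()

# A confident text prediction vs. an uncertain image prediction:
print(late_fuse([0.9, 0.05, 0.05], [0.4, 0.3, 0.3]))
```

In this toy case the fused distribution stays close to the confident text prediction, which captures the intuition behind uncertainty-aware fusion: a noisy or ambiguous modality should contribute less to the final decision.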
The rest of the paper is organized as follows. In Section 2, we put our approach in the context of relevant existing work. Then, in Section 3, we present a detailed description of our proposed method. In Section 4, we conduct an experimental evaluation and analysis of our approach. Finally, Section 5 provides a summary of our findings and concludes the paper.
5. Conclusions
To address the inability of traditional multimodal sentiment analysis methods to estimate uncertainty across different modalities effectively, we propose an uncertain-aware late fusion method based on hybrid uncertainty calibration (ULF-HUC). The core idea of this paper is to introduce a late fusion strategy based on uncertainty estimation and then use hybrid uncertainty calibration to learn the sentiment features of the two modalities. To realize this idea, we propose a series of methods. First, we conduct an in-depth analysis of the sentiment polarity distribution in sentiment analysis datasets. Second, to minimize the high uncertainty caused by inconsistent sentiment polarities across modalities, we propose a fusion strategy based on uncertainty estimation. Next, to balance model accuracy and uncertainty, we use a learning method with hybrid uncertainty calibration, effectively reducing uncertainty when the model is accurate and increasing it when the model is inaccurate. Finally, we add different types of noise (namely Gaussian and salt-and-pepper noise) to verify the model's classification and calibration capabilities. Experimental results show that our proposed ULF-HUC method overcomes the limitations of unimodal models and improves performance after fusion. Moreover, our method outperforms the comparison methods in both classification and calibration on three MVSA datasets, improving evaluation metrics such as accuracy, weighted F1, and expected uncertainty calibration error (UCE).
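For reference, the noise perturbations used in the robustness tests can be sketched as follows, assuming images are float arrays scaled to [0, 1]; the noise parameters below are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def add_gaussian_noise(image, std=0.1, rng=None):
    """Add zero-mean Gaussian noise and clip back to the valid range."""
    rng = rng or np.random.default_rng(0)
    return np.clip(image + rng.normal(0.0, std, image.shape), 0.0, 1.0)

def add_salt_pepper_noise(image, amount=0.05, rng=None):
    """Set a random fraction of pixels to pure black (pepper) or white (salt)."""
    rng = rng or np.random.default_rng(0)
    noisy = image.copy()
    mask = rng.random(image.shape)
    noisy[mask < amount / 2] = 0.0       # pepper
    noisy[mask > 1 - amount / 2] = 1.0   # salt
    return noisy

img = np.full((2, 2), 0.5)
print(add_gaussian_noise(img))
print(add_salt_pepper_noise(img, amount=0.5))
```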
This research work has the following limitations: (1) the study focuses only on image–text multimodal sentiment analysis; (2) the impact of noise on the model's performance, and how to mitigate it, remains a relevant topic for further exploration.
In the future, to address the disparity in learning capability among modalities, we will consider methods better suited to calibrating per-modality learning within existing multimodal fusion strategies. We will also explore new approaches to uncertainty calibration and consider the challenges that more complex multimodal fusion poses for accuracy and uncertainty calibration.