To further extract sentiment-related commonalities across the three modalities and capture the feature space differences arising from modality interactions, we designed a cross-modal text augmentation module. This module adopts a Siamese network architecture comprising two components: the text augmentation encoder and the projection layer. Specifically, the text augmentation encoder treats the features from the auxiliary modalities (audio and visual) as supplements to and augmentations of the text features, integrating the auxiliary features into the text features to enrich them while preserving textual contextual information. The projection layer performs spatial mapping on the augmented text features to acquire high-dimensional abstract information. Within this module, we combine two contrastive learning tasks based on instance prediction and sentiment polarity, achieving implicit multimodal fusion of sentiment information and strengthening the model's ability to differentiate between sentiment states, thereby improving overall performance.
3.3.1. Siamese Network Structure
The cross-modal text augmentation module consists mainly of a text augmentation encoder and a projection layer. Due to parameter sharing between branches, we illustrate the network structure and parameter-passing mechanism by using the text and visual modality branches as examples, as depicted in Figure 2.
We first outline the structure of the text augmentation encoder, a pivotal component in enhancing the multimodal sentiment analysis procedure. As illustrated in Figure 2, the text augmentation encoder comprises $K$ stacked multi-head cross-attention units. Cross-attention mechanisms are commonly employed to extract modality interaction information and perform feature fusion. In the TCMCL model, these mechanisms are designed to explore the augmentation effects of audio and visual information on text. Unlike traditional methods that emphasize dependencies between sequence positions, our attention mechanism focuses on the information correlation between the features of different modalities, which better preserves the contextual representation advantages of the text features. In the preceding feature extraction stage, we obtained diverse and rich information from the audio and visual modalities using specialized tools. By leveraging feature-level attention, we can more clearly identify which features of the auxiliary modalities are most advantageous for text representation, thereby effectively augmenting the text features.
To emphasize attention toward the features, we first transpose the text features and the visual features. Subsequently, we use the transposed text features as the query $Q$ and the transposed visual features as the key $K$ and value $V$ for the multi-head attention computation. This transformation facilitates feature-level attention during the subsequent attention computation, and the transposition changes the last dimension of the different modality features from their respective feature dimensions $d$ to the shared sequence length $N$. This enables direct parameter sharing between the two branches of the text augmentation encoder. The specific computation process of the text augmentation encoder is as follows.
Initially, we map $Q$, $K$, and $V$ into distinct subspaces through linear transformations:

$$Q_i = Q W_i^{Q}, \qquad K_i = K W_i^{K}, \qquad V_i = V W_i^{V},$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent the learnable weight matrices and $i$ denotes the $i$-th attention head. After linear mapping, attention is computed for each head:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i.$$
At this stage, $\mathrm{head}_i$ captures feature-level attention when moving from text to visuals, with an attention matrix of size $d_t \times d_v$ instead of the sequence-level attention matrix of size $N \times N$. Here, the denominator $\sqrt{d_k}$ is employed to scale the dot-product results, preventing gradient vanishing issues, and the softmax function is applied to each row to transform scores into probabilities.
Subsequently, the outputs from all heads are concatenated and passed through another linear transformation:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},$$

where $W^{O}$ represents another learnable weight matrix and $h$ denotes the total number of heads. Thus, we obtain the feature-level attention when moving from the text to the visual modality, $\mathrm{MultiHead}(Q, K, V)$.
Finally, we apply a residual connection between $Q$ and $\mathrm{MultiHead}(Q, K, V)$ to maximize the preservation of the textual features' characteristics, forming the output of the current multi-head cross-attention unit:

$$\mathrm{Output} = Q + \mathrm{MultiHead}(Q, K, V).$$

This output serves as the input $Q$ for the next iteration of the cross-attention unit, as illustrated in Figure 2. After iterating $K$ times, we obtain our feature fusion module's final output, the visually augmented text features $T_v$.
Through this feature-level attention mechanism, we integrate the targeted augmentation provided by the visual auxiliary modality's features into the text information. By maintaining the predominant role of textual information, this fusion strategy introduces beneficial feature-level attention to the visual information. We regard the module output as an augmented encoding of the text, which, together with the augmented encoding from the other branch, elevates the sentimental expression of the text and lays a solid foundation for downstream tasks.
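As a concrete illustration of the computation above, the following is a minimal PyTorch-style sketch of our reading of the text augmentation encoder; the class names, layer shapes, and the final transpose back to sequence-major form are our own assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class FeatureLevelCrossAttention(nn.Module):
    """One multi-head cross-attention unit operating on transposed features.

    After transposition, the "token" axis is the feature dimension (d_t or d_v)
    and the "channel" axis is the sequence length N, so the attention matrix has
    shape (d_t, d_v): feature-level rather than sequence-level attention.
    """

    def __init__(self, seq_len: int, num_heads: int):
        super().__init__()
        assert seq_len % num_heads == 0
        self.h, self.d_head = num_heads, seq_len // num_heads
        self.w_q = nn.Linear(seq_len, seq_len, bias=False)
        self.w_k = nn.Linear(seq_len, seq_len, bias=False)
        self.w_v = nn.Linear(seq_len, seq_len, bias=False)
        self.w_o = nn.Linear(seq_len, seq_len, bias=False)

    def forward(self, q, kv):
        # q: (B, d_t, N) transposed text features; kv: (B, d_v, N) transposed auxiliary features
        B, d_t, _ = q.shape
        d_v = kv.shape[1]

        def heads(x, proj, d):  # map into subspaces and split into attention heads
            return proj(x).view(B, d, self.h, self.d_head).transpose(1, 2)

        Q = heads(q, self.w_q, d_t)                            # (B, h, d_t, d_head)
        K = heads(kv, self.w_k, d_v)                           # (B, h, d_v, d_head)
        V = heads(kv, self.w_v, d_v)                           # (B, h, d_v, d_head)
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5  # (B, h, d_t, d_v)
        out = scores.softmax(dim=-1) @ V                       # (B, h, d_t, d_head)
        out = out.transpose(1, 2).reshape(B, d_t, -1)          # concatenate heads -> (B, d_t, N)
        return q + self.w_o(out)                               # residual connection with Q


class TextAugmentationEncoder(nn.Module):
    """K stacked cross-attention units; a single instance can serve both the
    text-visual and text-audio branches because all layers act on the shared
    sequence-length dimension N."""

    def __init__(self, seq_len: int, num_heads: int, K: int):
        super().__init__()
        self.units = nn.ModuleList(
            FeatureLevelCrossAttention(seq_len, num_heads) for _ in range(K))

    def forward(self, text_feats, aux_feats):
        # text_feats: (B, N, d_t); aux_feats: (B, N, d_v) or (B, N, d_a)
        q, kv = text_feats.transpose(1, 2), aux_feats.transpose(1, 2)
        for unit in self.units:
            q = unit(q, kv)          # each unit's output becomes the next Q
        return q.transpose(1, 2)     # augmented text features (T_v or T_a)
```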
Next, we devised a projection layer to map the aforementioned augmented features into a new space, aiming to better capture abstract and robust sentiment features. The operation of the projection layer can be expressed as follows:

$$\mathrm{Projected}T_v = \mathrm{Projection}(T_v).$$

By introducing the projection layer, we obtain a higher-level abstract representation, denoted as $\mathrm{Projected}T_v$. During this process, our objective is to reduce dimensionality and extract key sentiment features from the raw data.
Similarly, through the text-audio branch, we acquire the text features augmented by audio ($T_a$) and the corresponding projected features $\mathrm{Projected}T_a$.
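The internal structure of the projection layer is not detailed in this passage; a two-layer MLP, as commonly used in Siamese contrastive setups, is one plausible instantiation and is sketched below (layer sizes are illustrative assumptions).

```python
import torch.nn as nn


class ProjectionLayer(nn.Module):
    """Maps augmented text features into a new space intended to capture
    abstract, robust sentiment information (assumed two-layer MLP)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, t_aug):
        # t_aug: (pooled) augmented text features, e.g. (B, in_dim)
        return self.net(t_aug)   # ProjectedT_v or ProjectedT_a
```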
3.3.2. Contrastive Learning Task: IPCL and SPCL
In the cross-modal text augmentation module, we introduced two contrastive learning tasks: instance prediction-based contrastive learning (IPCL) and sentiment polarity-based contrastive learning (SPCL). These tasks are employed to overcome the assumption of feature space consistency and effectively utilize feature space information containing inter-sample correlations and modality interactions to enhance the performance of multimodal sentiment analysis.
Firstly, in order to learn more abstract and robust sentiment representations within the augmented text features, we drew inspiration from contrastive learning efforts [35,46] and proposed an instance-based contrastive learning task. Specifically, we predict the augmented features using the projection features, and this prediction process is cross-branched. In other words, we predict $T_v$ using $\mathrm{Projected}T_a$ and $T_a$ using $\mathrm{Projected}T_v$. We treat ($\mathrm{Projected}T_a$, $T_v$) and ($\mathrm{Projected}T_v$, $T_a$) as our (query/key) pairs. During training, the contrastive learning task adheres to the principle of instance discrimination, where the query and key from the same sample form positive pairs, and other instances within the batch are treated as negative samples. The contrastive learning loss is computed using the widely used InfoNCE loss [52]. The specific calculation process is as follows: we first compute the contrastive learning losses for ($\mathrm{Projected}T_a$, $T_v$) and ($\mathrm{Projected}T_v$, $T_a$):

$$\mathcal{L}_{(\mathrm{Projected}T_a,\, T_v)} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{\exp\!\left(\mathrm{sim}(\mathrm{Projected}T_a^{\,i},\, T_v^{\,i})/\tau\right)}{\sum_{j=1}^{n}\exp\!\left(\mathrm{sim}(\mathrm{Projected}T_a^{\,i},\, T_v^{\,j})/\tau\right)},$$

$$\mathcal{L}_{(\mathrm{Projected}T_v,\, T_a)} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{\exp\!\left(\mathrm{sim}(\mathrm{Projected}T_v^{\,i},\, T_a^{\,i})/\tau\right)}{\sum_{j=1}^{n}\exp\!\left(\mathrm{sim}(\mathrm{Projected}T_v^{\,i},\, T_a^{\,j})/\tau\right)},$$

where $n$ denotes the batch size, $\mathrm{sim}(\cdot,\cdot)$ is a similarity calculation function, and $\tau$ is a temperature parameter used to control the scaling of similarity scores. Subsequently, the sum of the two losses above constitutes the loss $\mathcal{L}_{\mathrm{IPCL}}$ for the instance prediction contrastive learning task.
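The following is a minimal sketch of the IPCL loss as we read it: the projected features of one branch act as queries that must retrieve the same sample's augmented features from the other branch, with the rest of the batch as negatives. Using cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and assuming a shared feature dimensionality between projected and augmented features are our own choices for illustration.

```python
import torch
import torch.nn.functional as F


def info_nce(query, key, temperature=0.07):
    """InfoNCE over a batch: query[i] and key[i] form the positive pair;
    every other key in the batch serves as a negative."""
    query = F.normalize(query, dim=-1)            # cosine similarity via dot product
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature        # (n, n) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)       # -log softmax of the diagonal


def ipcl_loss(proj_t_a, t_v, proj_t_v, t_a, temperature=0.07):
    """Cross-branch instance prediction: ProjectedT_a predicts T_v and
    ProjectedT_v predicts T_a; the two InfoNCE terms are summed."""
    return (info_nce(proj_t_a, t_v, temperature)
            + info_nce(proj_t_v, t_a, temperature))
```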
By pulling the selected positive sample pairs closer together and pushing negative sample pairs further apart, the instance prediction-based contrastive learning task primarily achieves the following functions. Taking the contrastive learning of ($\mathrm{Projected}T_v$, $T_a$) as an example, $\mathrm{Projected}T_v$ represents the high-dimensional information of the visually augmented text features, and $T_a$ represents the auditorily augmented text clues. Bringing positive samples closer aligns the spatial features of the text features that contain different modal information; in the process, it integrates information from different modalities, accomplishing implicit multimodal feature fusion while retaining the common sentimental features of the different modalities. Since we use the projected features to predict the augmented features, this process also pushes the augmented features toward higher-dimensional abstract sentiment, enhancing the stability of the sentimental expression in the augmented features. At the same time, contrastive learning pushes negative sample pairs further apart, making the sentimental expressions of different samples more discriminative in space.
Following this, we extended the contrastive learning paradigm to supervised tasks. In the sentiment polarity-based contrastive learning task (SPCL), we leverage label information to guide the model in learning more discriminative sentiment features. Specifically, we categorize the data into three classes, positive, neutral, and negative, based on the sentiment polarity criterion. In selecting sample pairs, samples with the same sentiment polarity as the target sample are considered positive samples, whereas samples with different sentiment polarities are treated as negative samples. We conduct contrastive learning on the projection features; it is noteworthy that, within the same batch, $\mathrm{Projected}T_v$ and $\mathrm{Projected}T_a$ are treated as one set following the above partitioning rules and jointly participate in the contrastive learning loss computation. This process also employs InfoNCE to compute the loss, which is expressed by the following formula:

$$\mathcal{L}_{\mathrm{SPCL}} = -\frac{1}{|I|}\sum_{i \in I}\frac{1}{|P(i)|}\sum_{z_p \in P(i)}\log\frac{\exp\!\left(\mathrm{sim}(z_i, z_p)/\tau\right)}{\exp\!\left(\mathrm{sim}(z_i, z_p)/\tau\right) + \sum_{z_j \in N(i)}\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)},$$

where $z_i \in \{\mathrm{Projected}T_v, \mathrm{Projected}T_a\}$, $I$ denotes the set of all projected features within the batch, $P(i)$ denotes the samples sharing the sentiment polarity of $z_i$, $z_j \in N(i)$ represents a sample with a different sentiment polarity from $z_i$ within the mentioned set, and $\tau$ is a temperature parameter.
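A sketch of SPCL under the formulation above: the projected features of both branches are pooled into one set, samples sharing the anchor's sentiment polarity act as positives, and samples with a different polarity act as negatives. The averaging over positives per anchor and the cosine similarity are assumptions on our part.

```python
import torch
import torch.nn.functional as F


def spcl_loss(proj_t_v, proj_t_a, polarity, temperature=0.07):
    """Sentiment-polarity contrastive loss over the union of both branches'
    projected features. `polarity` holds class ids (e.g. 0 = negative,
    1 = neutral, 2 = positive) for each sample in the batch."""
    z = F.normalize(torch.cat([proj_t_v, proj_t_a], dim=0), dim=-1)  # (2n, d)
    labels = polarity.repeat(2)                                      # (2n,)
    sim = z @ z.t() / temperature                                    # (2n, 2n)

    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)

    # For each anchor i and positive p: -log( exp(s_ip) / (exp(s_ip) + sum over i's negatives) )
    exp_sim = torch.exp(sim)
    neg_sum = (exp_sim * neg_mask.float()).sum(dim=1, keepdim=True)  # (2n, 1)
    log_prob = sim - torch.log(exp_sim + neg_sum)

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)                    # avoid division by zero
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()                          # anchors with >= 1 positive
```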
SPCL continuously utilizes label information to adapt the projected features to the sentiment analysis task. During this process, the feature spaces of the projected features from the two branches are adjusted jointly, which can be regarded as a form of clustering operation. At the same time, this operation effectively strengthens the model's focus on challenging samples.
Through the SPCL task, the projected features learn more sentiment-sensitive representations, and this improvement is also reflected in the IPCL task, where the cross-branch prediction passes it on to the augmented text features. The two tasks complement and reinforce each other, working together in different ways to encourage the augmented text features to exhibit a spatial distribution that is more consistent with sentiment analysis.
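To show how the pieces fit together, here is a hypothetical training-step sketch that reuses the illustrative components defined above; the feature dimensions, the mean pooling over the sequence, the equal weighting of the two losses, and the choice of projection output size are all our own assumptions for illustration.

```python
import torch

# Illustrative shapes: batch of 8 utterances, sequence length 50,
# text / visual / audio feature dimensions 768 / 35 / 74 (assumed values).
B, N, d_t, d_v, d_a = 8, 50, 768, 35, 74
text, visual, audio = torch.randn(B, N, d_t), torch.randn(B, N, d_v), torch.randn(B, N, d_a)
polarity = torch.randint(0, 3, (B,))                   # negative / neutral / positive labels

encoder = TextAugmentationEncoder(seq_len=N, num_heads=5, K=3)        # shared by both branches
projector = ProjectionLayer(in_dim=d_t, hidden_dim=256, out_dim=d_t)  # out_dim == d_t so projected
                                                                       # and pooled augmented features
                                                                       # are directly comparable

t_v = encoder(text, visual).mean(dim=1)                # pooled visually augmented text features T_v
t_a = encoder(text, audio).mean(dim=1)                 # pooled audio-augmented text features T_a
proj_t_v, proj_t_a = projector(t_v), projector(t_a)    # ProjectedT_v, ProjectedT_a

loss = ipcl_loss(proj_t_a, t_v, proj_t_v, t_a) + spcl_loss(proj_t_v, proj_t_a, polarity)
loss.backward()
```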