Article

A Modality-Enhanced Multi-Channel Attention Network for Multi-Modal Dialogue Summarization

1 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
2 School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
3 State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9184; https://doi.org/10.3390/app14209184
Submission received: 31 August 2024 / Revised: 3 October 2024 / Accepted: 6 October 2024 / Published: 10 October 2024

Abstract

Integrating multi-modal data in natural language processing has opened new pathways for the enhancement of dialogue summarization. However, existing models often struggle to effectively synthesize textual, auditory, and visual inputs. This paper introduces a Modality-Enhanced Multi-Channel Attention Network (MEMA), a novel approach designed to optimize the integration and interaction of diverse modalities for dialogue summarization. MEMA leverages symmetrical embedding strategies to balance the integrity and distinctiveness of each modality, ensuring a harmonious interaction within the unified architecture. By maintaining symmetry in the processing flow, MEMA enhances the contextual richness and coherence of the generated summaries. Our model demonstrates superior performance on the Multi-modal Dialogue Summarization (MDS) dataset, particularly in generating contextually enriched abstract summaries. The results underscore MEMA’s potential to transform dialogue summarization by providing a more symmetrical and integrated understanding of multi-modal interactions, bridging the gap in multi-modal data processing, and setting a new standard for future summarization tasks.

1. Introduction

In the rapidly evolving field of natural language processing, integrating multi-modal data—including textual, auditory, and visual inputs—represents a significant frontier for innovation. Existing multi-modal approaches, while incorporating additional data sources such as images and audio, frequently face two critical limitations, namely (1) an imbalance in how they process different modalities and (2) ineffective mechanisms for cross-modal interaction. These limitations result in summaries that are either incomplete or fail to fully capture the rich interdependencies between different types of input.
To address these challenges, we introduce the Modality Enhanced Multi-Channel Attention Network (MEMA), a model specifically designed to optimize the symmetrical synthesis of multi-modal inputs, thereby enhancing the contextual richness and coherence of the generated summaries. MEMA’s architecture is engineered to maintain symmetry across modalities, ensuring that each modality’s distinct characteristics are preserved while fostering deep inter-modality interactions. The model’s symmetrical design is achieved through advanced embedding techniques, including position and type embeddings that maintain the integrity and balance of each modality throughout the summarization process. By incorporating a pre-trained summarizer, MEMA further enhances its ability to generate summaries that are both contextually rich and symmetrical in their representation of multi-modal data.
In this paper, we delve into the architecture of MEMA, introducing a specialized modality-enhanced attention mechanism that significantly strengthens inter-modality interactions. This mechanism ensures a seamless and comprehensive synthesis of information into the final summary. Our evaluations, conducted on the extensive MDS dataset, demonstrate MEMA’s superiority over conventional extractive and abstractive models through its ability to produce abstract summaries that are deeply imbued with a rich multi-modal context. The subsequent sections of this paper detail the components of MEMA, outline our experimental methodology and dataset, and discuss the broader implications of our findings for the future of multi-modal dialogue systems. Through MEMA, we aim to redefine the approach to dialogue summarization, offering a deeper, more integrated understanding of multi-modal data.

2. Related Work

Multi-modal summarization is the task of producing condensed content that summarizes the original multimedia inputs. Existing approaches can be grouped into three categories according to the type of multi-modal interaction.
Interaction in Attention Only. A multi-modal attention model [1] was proposed for how-to videos [2]. The model uses the hierarchical attention approach [3] to combine textual and visual modalities to generate text. Besides textual and visual modalities, MAST [4] incorporates the audio modality into the model and naturally uses trimodal hierarchical attention. MSMO [5] selects an image from the input. VMSMO [6] proposes a local–global attention mechanism to let the video and text interact and select an image as output. A multi-task hierarchical heterogeneous fusion framework [7] was proposed to learn the hierarchical structures and heterogeneous associations existing in multi-modal data.
Interaction in Selected Image. To mitigate the limitations of attention-only interaction, MAtt [8] applies an image filter and utilizes the selected image to optimize summarization. MSE [9] uses a gate to select event highlights via images and distinguishes highlights while encoding. MOF [10] selects an image as guidance to generate the final summary.
Interaction in Other Methods. There are other methods of fusing different modalities. MAHR [11] attends to image captions as a bridge to align text documents and their accompanying images, making interaction easier. For video, if transcripts are already available, they serve as the bridge [12]; otherwise, ASR transcripts are used [13]. CKGM [14] combines multi-modal knowledge graphs of image entity relationships with factual text information to realize cross-modal information interaction and knowledge expansion. This model captures rich, contextual semantic information through pre-training methods and uses a knowledge graph to enrich the text content.

3. Model

3.1. Overview

ViT [15] demonstrates that a transformer can handle multi-modal information in a unified manner. We propose a unified multi-modal model, MEMA, with shallow and computationally light embedding layers, so that most of the computation is concentrated on modeling modality interactions. We apply position embeddings independently to each modality and add type embeddings to indicate the input type, as in ViLT [16]. To utilize prior knowledge, we use a pre-trained summarizer as the backbone architecture. Modality interactions are modeled inside the transformer module rather than by powerful unimodal embedders.
As shown in Figure 1, we use CLIP to extract visual features and a Mel filter for audio. Auditory, visual, and textual features are fed to the model simultaneously. In the text-embedding module, we load the word embeddings of BERT-Base Chinese. The visual features extracted by ViT-B/32 have a dimension of 512, and a linear conversion module maps them from 512 to 768 dimensions. We use the encapsulated Mel-filter implementation from PyTorch to extract auditory features. Next, we add position embeddings and modal-type embeddings to the three modalities and feed them into BART [17], which is initialized with BART-Base Chinese [18]. We employ an Adam optimizer with an initial learning rate of $10^{-4}$. The experiments are conducted on NVIDIA RTX 3090 GPUs.
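As a concrete illustration of this pipeline, the following PyTorch sketch shows how pre-extracted CLIP features and raw audio could be projected and prepared for the model. The module names, Mel-filter parameters, and the use of torchaudio are our own assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchaudio

# Illustrative sketch of the input pipeline described above (names are hypothetical).
# Visual features: 512-d CLIP ViT-B/32 embeddings, projected to the 768-d model space.
visual_proj = nn.Linear(512, 768)

# Auditory features: Mel filterbank features via torchaudio (parameters are assumptions).
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

def embed_inputs(clip_feats, waveform, token_embeds):
    """clip_feats: (num_frames, 512), waveform: (1, num_samples),
    token_embeds: (seq_len, 768) from BERT-Base Chinese word embeddings."""
    vis = visual_proj(clip_feats)               # (num_frames, 768)
    mel = mel_extractor(waveform).squeeze(0).T  # (num_audio_steps, 80)
    # In practice the 80-d Mel features would also be projected to 768 dims
    # before being combined with the other modalities.
    return token_embeds, vis, mel
```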

3.2. Embedding

As described in Section 3.1, CLIP is used to extract visual features and a Mel filter for audio [19,20]; the 512-dimensional ViT-B/32 visual features are mapped to 768 dimensions by a linear conversion module, and the text-embedding module loads the word embeddings of BERT-Base Chinese. Position embeddings and modal-type embeddings are then added to the three modalities before they are fed into BART [17], which is initialized with BART-Base Chinese [18]. Positional encoding is added to the input embedding so the model can exploit the order of the sequence. A type-encoding matrix $T$ with dimensions $3 \times embed\_dim$ is defined to represent the three modalities (text, image, and audio).
$E_{combined} = E_{tokens} + T_{type}$
Modal-type encoding aims to inject distinguishing information into the text, image, and audio data, where $E_{tokens}$ represents the encoding of the original modality, $T_{type}$ is the embedding selected from the matrix $T$ according to the modality being encoded, and $E_{combined}$ is the combined encoding vector used as the final input to the model. This method effectively provides discriminative information for the different modal data and enhances the model's ability to process different data types.
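To make the type-encoding step concrete, the sketch below adds a learned modal-type embedding to token embeddings, following $E_{combined} = E_{tokens} + T_{type}$; the embedding dimension and the modality ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the modal-type encoding E_combined = E_tokens + T_type.
# embed_dim and the modality ids (0=text, 1=image, 2=audio) are assumptions.
embed_dim = 768
type_embedding = nn.Embedding(3, embed_dim)   # the 3 x embed_dim matrix T

def add_type_encoding(tokens: torch.Tensor, modality_id: int) -> torch.Tensor:
    """tokens: (seq_len, embed_dim) embeddings of one modality."""
    type_id = torch.full((tokens.size(0),), modality_id, dtype=torch.long)
    return tokens + type_embedding(type_id)    # E_combined

# Example: tag text, image, and audio segments before feeding them to the model.
text_emb  = add_type_encoding(torch.randn(20, embed_dim), 0)
image_emb = add_type_encoding(torch.randn(10, embed_dim), 1)
audio_emb = add_type_encoding(torch.randn(50, embed_dim), 2)
```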

3.3. Multi-Channel Attention Mechanism

As shown in Figure 2, each modality of the input to the multi-channel attention mechanism is first passed through the BasicConv layer, which contains a standard 2D convolution, batch normalization, and a ReLU activation function.
$X_{conv} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}(X_{input})))$
Before computing the attention mechanism, the features are compressed by the Z-pool operation, which includes max pooling and average pooling, and the results of the two are concatenated to capture different features.
$\mathrm{ZPool}(X) = \mathrm{Concat}[\mathrm{MaxPool}(X), \mathrm{AvgPool}(X)]$
$\mathrm{MaxPool}(X) = \max_i(x_i)$
$\mathrm{AvgPool}(X) = \frac{1}{N}\sum_i x_i$
The results of MaxPool and AvgPool are concatenated in the channel dimension to form an integrated feature representation containing the extreme and average values of the original features. The features are further extracted through a base convolutional layer, which does not use the activation function after the convolutional layer and directly outputs the features used to generate the gate control signal. Subsequently, the sigmoid function is applied to the convolution output to generate a gate control signal between 0 and 1, which adjusts the weight of the original input features. The original input features are multiplied by the gate control signal to obtain the final adjusted features.
$\mathrm{scale} = \sigma(\mathrm{BasicConv}(\mathrm{ZPool}(X)))$
$\mathrm{AttentionGate}(AG) = \mathrm{scale} \cdot X$
where σ represents the Sigmoid activation function. Finally, the channel dimension is removed, and the outputs of all modalities are concatenated along the original dimension to obtain the final multi-modal fusion vector.
$\mathrm{MCAtt} = AG(X_{audio}) \oplus AG(X_{visual}) \oplus AG(X_{text})$
where $\oplus$ denotes concatenation along the original dimension.
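A minimal PyTorch sketch of this multi-channel attention gate is given below; the kernel size, tensor shapes, and module layout are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

# Sketch of the multi-channel attention gate, assuming each modality arrives as a
# (batch, 1, seq_len, feat_dim) feature map.
class BasicConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, relu=True):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU() if relu else nn.Identity()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class AttentionGate(nn.Module):
    def __init__(self):
        super().__init__()
        # No activation after this convolution: its output feeds the sigmoid gate.
        self.conv = BasicConv(2, 1, kernel_size=7, relu=False)

    def z_pool(self, x):
        # Concatenate channel-wise max and mean to keep extreme and average responses.
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

    def forward(self, x):
        scale = torch.sigmoid(self.conv(self.z_pool(x)))  # gate signal in (0, 1)
        return x * scale                                  # re-weighted input features

def multi_channel_attention(gate, x_audio, x_visual, x_text):
    # Gate each modality, drop the channel dimension, and concatenate the results.
    outs = [gate(x).squeeze(1) for x in (x_audio, x_visual, x_text)]
    return torch.cat(outs, dim=1)

# Usage: gate = AttentionGate(); fused = multi_channel_attention(gate, a, v, t)
# where a, v, t are shaped (batch, 1, seq_len, feat_dim).
```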

3.4. Modality-Enhanced Attention Mechanism

One of the challenges the multi-modal summarization task faces is effectively integrating information from different modalities [7]. Some existing methods only include a single text decoder similar to text summarization, which limits the effective utilization of multi-modal input information. This paper proposes a decoder framework based on a modality-enhanced attention mechanism, which aims to strengthen cross-modal interaction and improve the quality and relevance of the summary to process and fuse information from different modalities.
Before computing attention, the cross-modal fusion feature ($G_f$) and the image embedding ($G$) filter the video frames by calculating fusion weights through cascade gates [21]. Specifically, the image selector fuses the text-aware visual embeddings ($G_{enc}$) with $G_f$ and $G$, calculates an image ranking score, and feeds the top-10 images into the attention calculation as follows:
$\lambda = \sigma(\mathrm{FFN}(G_f)), \quad \lambda^* = \sigma(\mathrm{FFN}(G))$
$\hat{I} = \lambda G_f + \lambda^* G + (1 - \lambda - \lambda^*) G_{enc}$
$I_{score} = \mathrm{softmax}(\mathrm{FFN}(\hat{I}))$
$\mathcal{L}_{image} = -y_I^T \log I_{score}$
where $\lambda$ and $\lambda^*$ are the balance weights and $\sigma$ is the Sigmoid activation function. $I_{score}$ is the final image ranking score, and the images with the top-10 scores are considered relevant. $\mathcal{L}_{image}$ is the image selection cross-entropy loss.
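The following sketch illustrates the cascade-gate image selector described by these equations; the feature dimension, single-layer FFNs, and top-k handling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageSelector(nn.Module):
    """Hedged sketch of the cascade-gate image selector; the hidden size and
    single-layer FFNs are assumptions, not the authors' exact configuration."""
    def __init__(self, d_model=768, top_k=10):
        super().__init__()
        self.ffn_lambda = nn.Linear(d_model, 1)       # produces lambda from G_f
        self.ffn_lambda_star = nn.Linear(d_model, 1)  # produces lambda* from G
        self.ffn_score = nn.Linear(d_model, 1)        # produces the ranking logits
        self.top_k = top_k

    def forward(self, G_f, G, G_enc):
        # G_f: cross-modal fusion features, G: image embeddings,
        # G_enc: text-aware visual embeddings; all shaped (num_images, d_model).
        lam = torch.sigmoid(self.ffn_lambda(G_f))          # lambda
        lam_star = torch.sigmoid(self.ffn_lambda_star(G))  # lambda*
        I_hat = lam * G_f + lam_star * G + (1 - lam - lam_star) * G_enc
        I_score = F.softmax(self.ffn_score(I_hat).squeeze(-1), dim=-1)
        top_idx = I_score.topk(min(self.top_k, I_score.size(0))).indices
        return I_hat[top_idx], I_score

# The selection loss is the cross-entropy between I_score and the relevance
# labels y_I, e.g. image_loss = -(y_I * torch.log(I_score + 1e-9)).sum().
```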
$\tilde{H} = \mathrm{MEAtt}(Q, K, V)$
$\mathrm{MEAtt} = \mathrm{softmax}\!\left(\frac{H_{audio} H_{vision}^{T}}{\sqrt{d_k}}\right) H_{text}$
where $H$ is the input vector; MEAtt is the modality-enhanced attention; $Q \in \mathbb{R}^{d_e \times d_h}$, $K \in \mathbb{R}^{d_e \times d_h}$, and $V \in \mathbb{R}^{d_{model} \times d_h}$ are the query, key, and value matrices, respectively; $d_h$ represents the implicit representation dimension; and $d_e$ represents the attention embedding dimension.
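For clarity, a minimal sketch of the modality-enhanced attention computation is shown below, treating the audio features as queries, the vision features as keys, and the text features as values, following the formula above; the shapes and the scaled dot product are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def me_attention(H_audio: torch.Tensor,
                 H_vision: torch.Tensor,
                 H_text: torch.Tensor) -> torch.Tensor:
    """H_audio: (seq_a, d_k), H_vision: (seq_v, d_k), H_text: (seq_v, d_model).
    Returns a (seq_a, d_model) modality-enhanced representation."""
    d_k = H_vision.size(-1)
    scores = H_audio @ H_vision.transpose(-2, -1) / d_k ** 0.5  # (seq_a, seq_v)
    return F.softmax(scores, dim=-1) @ H_text                   # (seq_a, d_model)
```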

4. Experiment

4.1. Setup

4.1.1. Dataset

In response to the need for advanced resources in multi-modal dialogue summarization, we introduce the Multi-modal Dialogue Summarization (MDS) dataset [22], a pioneering collection designed to propel research in this field. The dataset is publicly available and specifically curated for multi-modal tasks, combining audio, visual, and textual elements. It contains over 160,000 min of video distributed across 11,305 dialogue instances, capturing a wide range of real-world daily scenarios and topics. These characteristics make MDS uniquely suited for evaluating models on both extractive and abstractive summarization tasks across modalities. All experiments reported in this paper are conducted using the MDS dataset, ensuring a comprehensive evaluation of our proposed MEMA model.

4.1.2. Metrics

ROUGE-based methods [23] and BLEU-based methods [24] are widely used metrics that involve the measurement of the overlap of n-grams between two texts. Here, we choose ROUGE-1, ROUGE-2, ROUGE-L, BLEU-1, BLEU-2, BLEU-3, and BLEU-4 for comparison. ROUGE and BLEU were chosen as evaluation metrics because they provide a standardized basis for comparison with existing text summarization research. Although they primarily focus on surface-level n-gram matching, they still effectively assess model performance and serve as useful benchmarks in the emerging field of multi-modal summarization.
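As an illustration, the snippet below computes these metrics for a toy hypothesis/reference pair with the rouge-score and NLTK libraries; the whitespace tokenization is a simplification (Chinese summaries are commonly scored at the character level), and the choice of libraries is ours, not the authors'.

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat"
hypothesis = "a cat sat on the mat"

# ROUGE-1 / ROUGE-2 / ROUGE-L F1 scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = {name: score.fmeasure for name, score in scorer.score(reference, hypothesis).items()}

# BLEU-1 to BLEU-4, with smoothing since the toy sequences are short.
ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([ref_tokens], hyp_tokens,
                  weights=tuple([1.0 / n] * n),
                  smoothing_function=smooth)
    for n in range(1, 5)
]
print(rouge, bleu)
```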

4.1.3. Baselines

We compare our model against the following three categories of baselines: text summarization, dialogue summarization, and multi-modal summarization, covering eight models in total. S2S [25] is a standard text summarization model with a sequence-to-sequence architecture using an RNN encoder–decoder and a global attention mechanism. PGN [26] is a text summarization model with an attention mechanism and a pointer network. Transformer [27] is a classic text summarization model, serving as a non-pre-trained baseline. T5 [28] is a universal abstractive text summarization model pre-trained on dozens of languages. MDialBART [29] represents a pre-trained dialogue summarization model. ConDigSum [30] is a dialogue summarization model based on topic-aware contrastive learning. HOW2 [1] is the first multi-modal summarization model proposed to summarize video content. VMSMO [6] proposes a dual-interaction multi-modal summarizer to generate multi-modal output. These models were selected to cover a wide spectrum of summarization techniques, from traditional text-based methods (e.g., S2S and Transformer) to more advanced models that incorporate multi-modal data (e.g., HOW2 and VMSMO). By comparing MEMA with both pre-trained (e.g., T5 and MDialBART) and non-pre-trained models (e.g., Transformer and ConDigSum), we aim to highlight MEMA’s performance in integrating multi-modal inputs and generating contextually enriched summaries. The inclusion of these baselines ensures that MEMA is evaluated comprehensively against a diverse set of summarization models, demonstrating its ability to outperform existing methods in both text-based and multi-modal summarization tasks.

4.2. Overall Performance

We implemented eight baseline models alongside our multi-modal input model, MEMA. The experimental results presented in Table 1 highlight the superior performance of MEMA across various metrics. Specifically, MEMA achieved the highest scores in all evaluated metrics, with a ROUGE-1 of 43.55, ROUGE-2 of 30.58, ROUGE-L of 41.11, BLEU-1 of 28.37, BLEU-2 of 20.98, BLEU-3 of 15.20, and BLEU-4 of 11.95. These results validate the effectiveness of our unified cross-modality learning model, which not only outperforms all baselines across these metrics but also demonstrates strong scalability and robustness in multi-modal summarization tasks.

4.3. Discussion

MEMA outperforms previous dialogue summarization models in all metrics. The average improvement across ROUGE and BLEU scores is substantial, demonstrating that integrating multi-modal information significantly enhances the semantic coverage of the reference summaries. Specifically, the model’s ability to harmonize textual and visual cues appears to be a key factor in its improved performance. The fusion mechanism within MEMA allows for a more nuanced understanding of the dialogue context, which is crucial for generating accurate and coherent summaries. However, adding multi-modal data to conventional abstractive summarization models does not guarantee improved performance: if a model fails to reconcile conflicts between modalities, its performance may drop. Traditional multi-modal summarization models often perform worse than purely text-based models, underscoring the complexity of multi-modal dialogue summarization. Models like HOW2 and VMSMO, which use separate components for modal interaction, show that disjointed modal interactions are inadequate for this task. These findings emphasize the uniqueness and challenges of multi-modal dialogue summarization, which differs significantly from traditional summarization tasks. Previous models struggle with modality conflicts and often yield suboptimal results. Our experiments also reveal that using separate modal interaction components is less effective than a unified approach.
Despite the advantages typically gained from scaling up models, the non-pre-trained dialogue summarization model ConDigSum demonstrates a surprising capability in generating summaries, outperforming the pre-trained MDialBART. This result indicates the challenge of transferring pre-trained models’ prior knowledge to multi-modal dialogue summarization. In contrast, MEMA leverages the strengths of a pre-trained model and effectively handles modality interactions. The ablation study described in Section 4.4 further confirms that the fusion of multi-modal content enhances ROUGE and BLEU scores, affirming MEMA’s superior multi-modal modeling capabilities.
However, it is important to acknowledge the limitations of MEMA. For instance, the model may underperform in scenarios where the quality of the input modalities is low, such as under conditions of noisy audio or blurry video. Additionally, MEMA’s reliance on a pre-trained model might limit its adaptability to domains with linguistic and visual characteristics that differ significantly from the pre-training data. Furthermore, the computational cost of processing multi-modal data can be a barrier to widespread adoption, especially in resource-constrained environments. Addressing these limitations in future work will be crucial for the continued advancement of multi-modal dialogue summarization techniques.
In addition, regarding the problem of modality conflicts, future work could focus on enhancing MEMA’s ability to identify and reconcile modality conflicts. Potential approaches may include the incorporation of conflict detection algorithms that assess the consistency of information across modalities before summarization. Moreover, further research could explore the development of a feedback mechanism that allows MEMA to adjust its summarization strategy based on the quality and coherence of the input data. By systematically addressing modality conflicts, we can significantly enhance the robustness and reliability of multi-modal dialogue summarization systems.

4.4. Ablation Study

We conducted three ablation experiments on MEMA to ascertain the contribution of each modality to the summarization performance. The results, as detailed in Table 2, clearly demonstrate the benefits of a trimodal approach. The complete MEMA model, incorporating audio, visual, and textual modalities, achieved the highest scores across all metrics, with a ROUGE-1 of 43.55, ROUGE-2 of 30.58, ROUGE-L of 41.11, BLEU-1 of 28.37, BLEU-2 of 20.98, BLEU-3 of 15.20, and BLEU-4 of 11.95. When the audio modality was removed, there was a noticeable drop in performance, with the ROUGE-1 reduced to 42.56, ROUGE-2 to 25.71, and BLEU-4 to 9.44. A similar decline was observed when the visual modality was excluded, highlighting its importance. The most significant reduction occurred when audio and visual modalities were removed, underscoring the synergy between the modalities in enhancing summarization quality. These findings confirm that the integrated audio-visual information significantly enhances the textual output, validating MEMA’s design of leveraging cross-modality features. This result emphasizes that, while single modalities contribute to performance, their combined effect is crucial for the achievement of optimal results in multi-modal dialogue summarization.

4.5. Human Evaluation

We further conducted a human evaluation to analyze the model output according to four metrics: informedness (INFOR.), i.e., whether the summary provides enough necessary information from the input; expressiveness (EXP.), i.e., whether the summary is fluently presented; consistency (CONSIS.), i.e., whether the summary is consistent with the reference; and overall evaluation (OVER.), i.e., a global assessment of the summary. We randomly selected 50 examples, and three volunteer annotators scored each on a Likert scale from 1 (worst) to 5 (best) for the four metrics. The results of the human evaluation are shown in Table 3. MEMA shows the best performance in all four appraisals, with margins of +0.18, +1.44, +1.28, and +1.00 for INFOR., CONSIS., EXP., and OVER., respectively. For informedness, MEMA is fed with three modalities of data, presenting more information and achieving a higher score. Moreover, the structure of MEMA alleviates conflicts between different modalities, taking full advantage of the pre-trained model. Compared to INFOR. (3.50) and EXP. (3.94), MEMA has lower scores in CONSIS. (2.66) and OVER. (2.42). However, the margin for CONSIS. (+1.44) is larger than for the other three metrics. We suggest that factual consistency is the primary influencing factor, and we conducted an error analysis of factual consistency, as described in Section 4.6. MDialBART obtains a score of 3.32 for INFOR. but generates summaries that are largely irrelevant to the reference, with a score of 1.22 for CONSIS.; it produces a considerable amount of text that has nothing to do with the dialogue. The human evaluation, especially for informedness and expressiveness, demonstrates MEMA’s ability to exploit interactions among different modalities of information.

4.6. Error Analysis

Multi-modal dialogue is more difficult for models to learn and summarize than textual dialogue, as there are complex correlations among the data across modalities. As shown in Table 1, attaching multi-modal data sometimes even results in performance degradation: different modalities supply additional information but can also conflict. Table 3 shows that the main problem of existing dialogue summarization models is inconsistency. To obtain higher scores on word-overlap metrics, summarization models tend to generate longer summaries to cover more information, and given the lack of constraints on factual consistency, they produce plenty of errors. Therefore, we conducted an error analysis and a case study on MDS to examine these challenges.
Factual inconsistencies are often found in abstractive summarization and dialogue summarization; several generated summaries with high ROUGE and BLEU scores contain considerable factual inconsistencies. We selected 50 of them and conducted a quantitative analysis. There are eight types of factual inconsistency in abstractive summarization [31], and Table 4 shows the six most frequent types and their rates.
Generally, MEMA performs better than T5 and MDialBART, as it makes fewer errors regarding the following four factual inconsistencies: missing information, redundant information, wrong reference, and object error. Specifically, T5 and MDialBART have high rates of missing information (78% and 84%, respectively), while MEMA reduces the error rate to 26%, indicating that MEMA has a greater probability of generating a complete summary than T5 and MDialBART. Furthermore, MEMA significantly decreases object errors compared to T5 (−52%) and MDialBART (−40%), showing that MEMA has a stronger ability to generate correct objects in summaries when audio-visual information accompanies the text. MEMA also reduces redundant information and incorrect references to some extent. As for circumstantial and negation errors, although MEMA is at a slight disadvantage, all three models maintain low error rates and can alleviate these two factual inconsistencies. Overall, the error analysis shows that MEMA understands modality interaction and generates more faithful information than the compared models.

5. Conclusions and Future Work

This paper proposes a unified model that emphasizes symmetrical modality interaction in multi-modal dialogue summarization. Our experimental results demonstrate that MEMA produces summaries that are more fluent, informative, and relevant while maintaining a high level of symmetry across the modalities. By addressing a gap in current dialogue summarization research, which predominantly focuses on textual data and often neglects multi-modal content, our model sets a new standard for balanced and symmetrical multi-modal integration.
While MEMA demonstrates strong performance on the MDS dataset, it also has the potential for various real-world applications. For instance, in customer service, MEMA could be utilized to summarize customer interactions across different modalities (text, audio, and video), allowing for quick resolution and improved service efficiency. In medical consultations, the model could assist healthcare professionals by summarizing patient discussions, ensuring that vital information is captured and easily accessible. In the realm of video-based educational content, MEMA can help summarize lectures and instructional videos, providing concise, context-rich summaries that enhance learning outcomes for students. We leave this for future work.
However, despite these advancements, factual inconsistencies are still a major problem. In the future, we aim to solve this problem from the following perspectives:
  • Implement a multi-modal consistency checking module to align and compare information across modalities, resolving discrepancies in real time.
  • Integrate a fact verification component using external knowledge bases to ensure the accuracy of the generated summaries.
  • Apply contrastive learning to improve the model’s ability to distinguish between coherent and incoherent multi-modal data.
  • Conduct targeted experiments and ablation studies to refine the model’s components and enhance its robustness against misleading inputs.

Author Contributions

Methodology, M.L.; Software, M.L. and Y.L.; Supervision, X.Z.; Validation, M.L.; Writing—original draft, Y.L.; Writing—review and editing, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable; this study did not involve humans or animals.

Informed Consent Statement

Not applicable; this study did not involve humans.

Data Availability Statement

The dataset is available from the authors upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Palaskar, S.; Libovickỳ, J.; Gella, S.; Metze, F. Multimodal Abstractive Summarization for How2 Videos. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6587–6596. [Google Scholar]
  2. Sanabria, R.; Caglayan, O.; Palaskar, S.; Elliott, D.; Barrault, L.; Specia, L.; Metze, F. How2: A Large-scale Dataset for Multimodal Language Understanding. In Proceedings of the NeurIPS, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar] [CrossRef]
  3. Libovickỳ, J.; Helcl, J. Attention strategies for multi-source sequence-to-sequence learning. arXiv 2017, arXiv:1704.06567. [Google Scholar]
  4. Khullar, A.; Arora, U. MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention. arXiv 2020, arXiv:2010.08021. [Google Scholar]
  5. Zhu, J.; Li, H.; Liu, T.; Zhou, Y.; Zhang, J.; Zong, C. MSMO: Multimodal summarization with multimodal output. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4154–4164. [Google Scholar]
  6. Li, M.; Chen, X.; Gao, S.; Chan, Z.; Zhao, D.; Yan, R. VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9360–9369. [Google Scholar]
  7. Zhang, L.; Zhang, X.; Han, L.; Yu, Z.; Liu, Y.; Li, Z. Multi-task Hierarchical Heterogeneous Fusion Framework for multimodal summarization. Inf. Process. Manag. 2024, 61, 103693. [Google Scholar] [CrossRef]
  8. Li, H.; Zhu, J.; Liu, T.; Zhang, J.; Zong, C. Multi-modal Sentence Summarization with Modality Attention and Image Filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 4152–4158. [Google Scholar]
  9. Li, H.; Zhu, J.; Zhang, J.; He, X.; Zong, C. Multimodal sentence summarization via multimodal selective encoding. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 5655–5667. [Google Scholar]
  10. Zhu, J.; Zhou, Y.; Zhang, J.; Li, H.; Zong, C.; Li, C. Multimodal summarization with guidance of multimodal reference. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9749–9756. [Google Scholar]
  11. Chen, J.; Zhuge, H. Abstractive text-image summarization using multi-modal attentional hierarchical rnn. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4046–4056. [Google Scholar]
  12. Fu, X.; Wang, J.; Yang, Z. Multi-modal Summarization for Video-containing Documents. arXiv 2020, arXiv:2009.08018. [Google Scholar]
  13. Liu, N.; Sun, X.; Yu, H.; Zhang, W.; Xu, G. Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1834–1845. [Google Scholar]
  14. Wang, H.; Liu, J.; Duan, M.; Gong, P.; Wu, Z.; Wang, J.; Han, B. Cross-modal knowledge guided model for abstractive summarization. Complex Intell. Syst. 2024, 10, 577–594. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  17. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  18. Shao, Y.; Geng, Z.; Liu, Y.; Dai, J.; Yang, F.; Zhe, L.; Bao, H.; Qiu, X. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation. arXiv 2021, arXiv:2109.05729. [Google Scholar] [CrossRef]
  19. Zhang, L.; Zhang, X.; Pan, J. Hierarchical cross-modality semantic correlation learning model for multimodal summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 11676–11684. [Google Scholar]
  20. Zhang, L.; Zhang, X.; Guo, Z.; Liu, Z. CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 27–29 April 2023; pp. 370–378. [Google Scholar]
  21. Zhang, L.; Zhang, X.; Zhou, Z.; Huang, F.; Li, C. Reinforced adaptive knowledge learning for multimodal fake news detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 16777–16785. [Google Scholar]
  22. Liu, Z.; Zhang, X.; Zhang, L.; Yu, Z. MDS: A Fine-Grained Dataset for Multi-Modal Dialogue Summarization. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 11123–11137. [Google Scholar]
  23. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out 2004, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  24. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  25. Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar]
  26. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  28. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  29. Wang, J.; Meng, F.; Lu, Z.; Zheng, D.; Li, Z.; Qu, J.; Zhou, J. ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization. arXiv 2022, arXiv:2202.05599. [Google Scholar]
  30. Liu, J.; Zou, Y.; Zhang, H.; Chen, H.; Ding, Z.; Yuan, C.; Wang, X. Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Online/Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1229–1243. [Google Scholar]
  31. Tang, X.; Nair, A.; Wang, B.; Wang, B.; Desai, J.; Wade, A.; Li, H.; Celikyilmaz, A.; Mehdad, Y.; Radev, D. Confit: Toward faithful dialogue summarization with linguistically-informed contrastive fine-tuning. arXiv 2021, arXiv:2112.08713. [Google Scholar]
Figure 1. Overview of MEMA.
Figure 2. Multi-channel attention mechanism.
Table 1. ROUGE scores and BLEU scores of summarization baselines on MDS.

Model               ROUGE-1  ROUGE-2  ROUGE-L  BLEU-1  BLEU-2  BLEU-3  BLEU-4
S2S [25]              21.14     6.99    18.22    7.31    3.73    2.51    0.37
PGN [26]              20.97     4.42    18.46   10.84    4.24    1.42    0.19
Transformer [27]      38.18    17.73    32.42   24.02   16.70   11.81    4.93
T5 [28]               40.69    18.62    37.22   12.04    8.39    6.10    3.44
MDialBART [29]        26.27     8.69    20.41   11.64    7.34    4.94    2.41
ConDigSum [30]        37.36    17.77    29.35   24.11   17.11   11.59    3.97
HOW2 [1]              20.71     3.85    18.04   10.38    3.71    1.14    0.14
VMSMO [6]             15.79     4.25    13.24    6.18    2.70    1.37    0.13
MEMA                  43.55    30.58    41.11   28.37   20.98   15.20   11.95
Table 2. Ablation study of MEMA.

Model                   ROUGE-1  ROUGE-2  ROUGE-L  BLEU-1  BLEU-2  BLEU-3  BLEU-4
MEMA                      43.55    30.58    41.11   28.37   20.98   15.20   11.95
 w/o audio modality       42.56    25.71    38.92   26.72   19.03   14.17    9.44
 w/o visual modality      42.77    26.72    37.54   26.12   19.31   13.89    8.22
 w/o both                 41.82    24.99    36.01   25.71   18.89   13.46    7.23
Table 3. Human evaluation scores on four measures of informedness (INFOR.), consistency (CONSIS.), expressiveness (EXP.), and overall evaluation (OVER.).

Model         INFOR.  CONSIS.  EXP.  OVER.
Transformer     1.92     1.70  2.20   1.42
T5              2.84     1.12  1.12   1.08
MDialBART       3.32     1.22  2.66   1.26
VMSMO           1.02     1.30  1.00   1.00
MEMA            3.50     2.66  3.94   2.42
Table 4. Error analysis of three pre-trained models on the six different measures of missing information (Missing Infor.), redundant information (Redundant Infor.), circumstantial error, wrong reference, negation error, and object error.

Error Type              T5   MDialBART   MEMA
Missing Infor.         78%         84%    26%
Redundant Infor.       18%         16%    10%
Circumstantial Error    4%          2%    14%
Wrong Reference        40%         20%    12%
Negation Error          2%          0%     2%
Object Error           74%         62%    22%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
