1. Introduction
Named Entity Recognition (NER) and Relation Extraction (RE) are core tasks in monitoring and evaluating information on social media. Given the rising influence of multimodal posts, multimodal information extraction technologies such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) are better suited to monitoring public opinion on social media. In recent years, food safety issues have received widespread attention on social media, particularly regarding the safety of staple crops such as rice and wheat [1]. The spread of food safety incidents and the attention they attract typically affect consumer confidence and purchasing decisions. Therefore, monitoring and analyzing food safety information on social media, especially information about staple crops, has become increasingly important [2]. To mine this information more effectively, researchers have applied NER and RE techniques to the monitoring and evaluation of food safety incidents, as well as to the analysis of food industry dynamics and food recall information [3,4]. On social media platforms, users comment on, discuss, and share information about food safety issues, providing valuable data sources for researchers and regulatory authorities.
Early NER and RE research focused mainly on the text modality. Traditional approaches concentrated on constructing various efficient NER features, which were then fed into linear classifiers such as Support Vector Machines, Logistic Regression, and Conditional Random Fields (CRFs) [5,6,7,8,9]. With the development of deep learning, traditional one-hot representations were replaced by word embedding methods such as Word2Vec [10] and GloVe [11], which capture the semantic information of words in lower-dimensional spaces. To reduce the burden of feature engineering, researchers explored combining various deep neural networks with CRF layers to obtain word-level predictions, for instance the Recurrent Neural Network (RNN) [12] and the Convolutional Neural Network (CNN) [13]. With the advent of the Transformer [14], many researchers adopted this architecture for NER and RE tasks [15,16,17]. Knowledge-enabled Bidirectional Encoder Representation from Transformers (K-BERT) [18] and Enhanced Language Representation with Informative Entities (ERNIE) [19] combined pre-trained models with knowledge graphs to effectively learn additional entity knowledge.
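To make the combination of a neural encoder with word-level prediction concrete, the following minimal PyTorch sketch tags each token with an independent classifier over BiLSTM states; in the BiLSTM-CRF models cited above, the greedy argmax decoding would be replaced by a CRF layer that scores entire tag sequences. All names and dimensions are illustrative rather than taken from any cited system.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal word-level tagger: embeddings -> BiLSTM -> per-token tag scores.
    In BiLSTM-CRF models, a CRF layer replaces the independent argmax below."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                       # (batch, seq_len, 2*hidden)
        return self.classifier(h)                 # per-token emission scores

# Toy usage: 4 BIO tags over a vocabulary of 1000 word ids.
model = BiLSTMTagger(vocab_size=1000, num_tags=4)
scores = model(torch.randint(0, 1000, (2, 12)))
pred_tags = scores.argmax(dim=-1)                 # greedy decode; a CRF would use Viterbi
```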
Although the above models have achieved good performance in the text modality, they still face challenges on short, noisy data such as Twitter posts, especially when encountering out-of-vocabulary words and complex abbreviations, as this type of text is often ambiguous. Many researchers have proposed coping strategies. Von et al. [20] employed sentence-level features to enhance the recognition accuracy of named entities in short texts. Xie et al. [21] proposed a novel vocabulary substitution strategy based on constructing Entity Type Compatible (ETC) semantic spaces, in which unknown words are replaced with ETC words found through Deep Metric Learning (DML) and Nearest Neighbor (NN) search. Soares et al. [22] fine-tuned pre-trained models on relation instances with blanked entities to automatically determine whether two relation instances share entities. Language Understanding with Knowledge-based Embeddings (LUKE) [23] introduced an entity-aware self-attention strategy and contextualized entity representations. Nevertheless, the aforementioned methods do not exploit the potential of visual information when dealing with multimodal data.
To compensate for the lack of semantic context in a single modality, Moon et al. [24] constructed a deep image network and a generic modality attention module that leverages images to enhance textual information and suppress irrelevant modalities, aiming to uncover correlations within multimodal data. Zhang et al. [25] designed an Adaptive Co-Attention Network (ACN) that combines textual information with visual information processed by the Visual Geometry Group network (VGGNet) [26]. Subsequently, Arshad et al. [27] extended multidimensional self-attention techniques to simultaneously learn the semantics shared between internal textual representations and visual features. Yu et al. [28] proposed an entity span detection technique that jointly considers the word and visual representations of text and image information. To mitigate the visual guidance biases introduced by images, Wu et al. [29] incorporated fine-grained images and designed a dense co-attention layer, introducing a co-attention mechanism between related image objects and entities. To further leverage image information, Chen et al. [30] proposed a novel neural network model that introduces image attributes and image knowledge. Zheng et al. [31] proposed the Adversarial Gated Bilinear Attention Neural Network (AGBAN), which employs adversarial gating strategies to deal with the image-text alignment issue. To address the issue of image noise, Xu et al. [32] proposed a multimodal matching and alignment model that simultaneously handles modality differences and noise. Chen et al. [33] introduced a visual-prefix-based attention strategy (HVPNet) that progressively integrates visual features into each module of Bidirectional Encoder Representations from Transformers (BERT) for attention computation. Although these methods alleviate the noise problem in multimodal information to some extent, a significant modality gap remains between image and text features.
To bridge this gap and enhance the performance of multimodal models in various applications, it is crucial to examine the underlying structures and approaches of existing multimodal BERT models, which can be categorized along two main dimensions. (1) Structure: the main categories are single-stream and cross-stream. Single-stream models treat and fuse information from different modalities equally. Visual and textual inputs are processed by their respective encoders, and the resulting visual and textual features are combined into a unified sequence. This sequence serves as the input to a multimodal Transformer, which autonomously learns the associations between cross-modal features. Representatives of this structure include VisualBERT [34], the Universal Encoder for Vision and Language (Unicoder-VL) [35], and Object-Semantics Aligned Pre-training (Oscar) [36]. Cross-stream structures process visual and language information independently as two streams; the unimodal features are concatenated and fed into a deep feedforward network, with information exchange achieved through cross-modal interaction layers. Typical cross-stream models include Contrastive Language Image Pre-training (CLIP) [37] and Vision-and-Language BERT (ViLBERT) [38]. (2) Pre-training tasks: the pre-training tasks of vision-language models are key to interpreting and linking images and text. However, most existing models are pre-trained for image caption generation or conditional image generation, which emphasize multimodal interaction, such as the Dual-Level Collaborative Transformer (DLCT) [39], Graph Attention Networks (GAT) [40], and the Attentional Generative Adversarial Network (AttnGAN) [41]. Directly applying these models to MNER and MRE tasks may not achieve ideal results, because the core objective of MNER and MRE is to use visual features to enhance text information extraction rather than to predict image content.
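As a rough illustration of the single-stream design, the sketch below projects visual region features into the text embedding space, concatenates the two modalities into one sequence, and lets a shared Transformer encoder attend across them. This is a conceptual outline under assumed dimensions, not the actual architecture of VisualBERT, Unicoder-VL, or Oscar.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Single-stream fusion: one shared Transformer over [visual tokens; text tokens]."""
    def __init__(self, vis_dim=2048, txt_vocab=30522, d_model=768, layers=4, heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)      # map region/patch features to d_model
        self.txt_embed = nn.Embedding(txt_vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, vis_feats, token_ids):
        # vis_feats: (batch, n_regions, vis_dim); token_ids: (batch, seq_len)
        v = self.vis_proj(vis_feats)
        t = self.txt_embed(token_ids)
        fused = torch.cat([v, t], dim=1)                  # one unified multimodal sequence
        return self.encoder(fused)                        # joint cross-modal self-attention

model = SingleStreamFusion()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 16)))
print(out.shape)  # (2, 36 + 16, 768)
```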
To address these challenges, many models have introduced image caption generation techniques. Accurately describing an image is challenging, as it requires capturing and understanding many details within the image. Early image caption generation methods were mainly matching-based, such as DFE [42] and VSE++ [43]; however, because captions were drawn from limited description libraries, caption diversity was difficult to achieve. With the ongoing advancement of deep learning and neural network technologies, numerous scholars have applied these approaches to image caption generation. Vinyals et al. [44] used a CNN to encode images and an LSTM to generate the caption text. Anderson et al. [45] employed Faster R-CNN [46] for staged object detection, improving the accuracy of image caption generation. Following the introduction of the Transformer [14], Luo et al. [39] improved its attention mechanism by developing the Dual-Level Collaborative Transformer (DLCT) to enhance the complementarity between regional and grid features. Wang et al. [47] proposed the Geometry Attention Transformer (GAT) to further capture the geometric relationships of visual objects and extract spatial information within images. As models have grown larger, inference speed has dropped significantly; Jin et al. [48] therefore introduced an early-exit strategy based on layer-wise similarity to save computational resources while retaining state-of-the-art performance.
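The encoder-decoder paradigm behind these captioners can be summarized as follows: a CNN (or other visual backbone) compresses the image into a feature vector that conditions an RNN language model, which then emits the caption word by word. The toy sketch below is written in the spirit of Vinyals et al. [44] but is a simplified assumption, not their exact model.

```python
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    """Toy show-and-tell style captioner: pooled CNN feature -> LSTM word generator."""
    def __init__(self, feat_dim=512, vocab_size=5000, emb_dim=256, hidden=256):
        super().__init__()
        self.img_fc = nn.Linear(feat_dim, hidden)          # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, caption_ids):
        # img_feat: (batch, feat_dim) pooled CNN feature; caption_ids: (batch, seq_len)
        h0 = torch.tanh(self.img_fc(img_feat)).unsqueeze(0)   # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        words = self.embed(caption_ids)
        h, _ = self.lstm(words, (h0, c0))
        return self.out(h)                                    # next-word logits per step

model = SimpleCaptioner()
logits = model(torch.randn(2, 512), torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # (2, 10, 5000)
```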
To effectively integrate visual features with the semantic features generated from image captions, we propose a Visual Description Augmented Integration Network (VDAIN) in this paper. First, we feed the raw image and the object-detection images into the Vision Transformer (ViT) [49] and extract the visual features output by each ViT block. Second, we apply image description generation to the images and use A Lite BERT (ALBERT) [17] to extract semantic features from the generated image descriptions. Finally, we feed the visual features and the text vectors generated from the image descriptions into the Visual Description Augmented Integration Network to predict entities and their relationships (a conceptual sketch of this pipeline is given after the contribution list below). In comparison with the shallow attention-based alignment of MEGA, we apply a BERT method fused with visual knowledge, which enhances the integration of visual and textual information. Compared with HVPNet, we adopt a visual description augmented integration mechanism to extract textual semantic information from images, reducing the modality gap during information fusion. Furthermore, we employ ViT as the visual feature extractor; its architecture is closer to that of BERT, which improves the model's ability to integrate cross-modal information to a certain extent. Therefore, our paper contributes to the existing research in three major aspects.
- (1)
We employ an image description generation approach to extract key information and global summaries from visual content. The introduction of text features generated from image descriptions helps to reduce the modality gap, enhancing the robustness of MNER and MRE tasks.
- (2)
We propose a Visual Description Augmented Integration Network for MNER and MRE. By adaptively integrating semantic features obtained from image descriptions with the visual features, the model aims to mitigate the error sensitivity caused by irrelevant features.
- (3)
VDAIN achieves new state-of-the-art results on three public benchmark datasets for MNER and MRE. Comprehensive evaluations confirm the practicality and efficacy of the proposed approach.
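For illustration, the sketch below shows how the pipeline described above could be wired together with the Hugging Face transformers library: per-block ViT hidden states stand in for the visual features, ALBERT encodes a generated image description, and a single cross-attention layer acts as a placeholder for the VDAIN fusion module. The checkpoints, the example caption, and the fusion step are assumptions for demonstration and do not reproduce our exact experimental configuration.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, AlbertTokenizer, AlbertModel

# 1) Visual features: per-block hidden states from ViT (a random tensor replaces
#    real preprocessed pixels of the raw and object-detection images here).
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
pixel_values = torch.randn(1, 3, 224, 224)
vis_out = vit(pixel_values=pixel_values, output_hidden_states=True)
vis_layers = vis_out.hidden_states            # tuple: one (1, 197, 768) tensor per ViT block

# 2) Semantic features of the generated image description via ALBERT.
tok = AlbertTokenizer.from_pretrained("albert-base-v2")
albert = AlbertModel.from_pretrained("albert-base-v2")
caption = "a bag of rice on a supermarket shelf"   # output of an image captioner (assumed)
cap_feats = albert(**tok(caption, return_tensors="pt")).last_hidden_state  # (1, cap_len, 768)

# 3) Placeholder fusion: the post text attends over visual + caption features.
#    VDAIN's integration network is more elaborate; this is only a conceptual stand-in.
text_feats = albert(**tok("rice recalled after safety check", return_tensors="pt")).last_hidden_state
fusion = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
memory = torch.cat([vis_layers[-1], cap_feats], dim=1)
fused, _ = fusion(text_feats, memory, memory)      # (1, text_len, 768) fed to NER/RE heads
```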