Peer-Review Record

A Multi-View Interactive Approach for Multimodal Sarcasm Detection in Social Internet of Things with Knowledge Enhancement

Appl. Sci. 2024, 14(5), 2146; https://doi.org/10.3390/app14052146
by Hao Liu, Bo Yang * and Zhiwen Yu
Reviewer 1:
Reviewer 2:
Submission received: 1 February 2024 / Revised: 1 March 2024 / Accepted: 2 March 2024 / Published: 4 March 2024
(This article belongs to the Special Issue Future Trends in Intelligent Edge Computing and Networking)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Short Summary

The paper addresses sarcasm recognition by considering two modalities: image and text. The deep learning-based approach combines image features extracted with a pre-trained DenseNet, text features based on BERT, and knowledge features. According to the authors, the knowledge features, which distinguish the model from previous approaches, are attribute labels determined automatically from the images using ResNet and then vectorised with BERT. The focus of the work is the development of a sarcasm recognition model for use on mobile devices, so the approach aims to minimise the number of model parameters given the limited memory available. Experimental results demonstrate that the model outperforms state-of-the-art single-modality approaches and is comparable to multimodal approaches.
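The pipeline summarised above can be sketched as a late-fusion model. The sketch below is illustrative only: random projections stand in for the three pre-trained encoders (DenseNet for images, BERT for text, BERT over ResNet-predicted attribute labels for knowledge), and the feature dimensions, concatenation-based fusion, and linear classifier are assumptions for illustration, not the paper's actual interaction module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the three encoders described in the summary. In the paper
# these would be a pre-trained DenseNet (image), BERT (text), and BERT
# applied to ResNet-derived attribute labels (knowledge); here they are
# random features, purely to illustrate the data flow and shapes.
def encode_image(batch):      # -> (B, 1024): assumed DenseNet feature size
    return rng.standard_normal((len(batch), 1024))

def encode_text(batch):       # -> (B, 768): BERT-base hidden size
    return rng.standard_normal((len(batch), 768))

def encode_knowledge(batch):  # -> (B, 768): attribute labels re-encoded with BERT
    return rng.standard_normal((len(batch), 768))

def sarcasm_scores(images, texts, attributes, w, b=0.0):
    """Fuse the three modality features and score each sample.

    Note: the paper's model uses a multimodal *interaction* module rather
    than plain concatenation; this sketch only shows the overall pipeline.
    """
    fused = np.concatenate(
        [encode_image(images), encode_text(texts), encode_knowledge(attributes)],
        axis=1,
    )                      # shape (B, 1024 + 768 + 768)
    return fused @ w + b   # shape (B,): raw sarcasm scores

B, D = 4, 1024 + 768 + 768
w = rng.standard_normal(D) * 0.01
scores = sarcasm_scores(["img"] * B, ["txt"] * B, ["attr"] * B, w)
print(scores.shape)  # (4,)
```

A real implementation would replace the random encoders with frozen pre-trained backbones and train only the fusion/classification head, which is consistent with the parameter-efficiency goal the summary mentions.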

 

Overall Impression 

The approach presented is very interesting for many fields of application. The more technical parts of the paper are easy to understand. However, both the introductory part of the paper and the literature discussion need to be revised completely. Furthermore, the terminology should be clarified throughout the entire paper. In addition, the experiment designed to show that the model pays attention to contradictions is not very meaningful, as it considers only a single image. More examples should be evaluated and statistically assessed here.

Advice for Improvement

Abstract

The basic idea of the paper becomes apparent in the abstract. However, it should be briefly emphasised why including several modalities is essential when recognising sarcasm. What the model is based on should also be added, e.g., whether it uses deep learning methods. When mentioning the results, the main focus should be on how the developed model performed compared to the baselines, rather than simply stating that it delivers good results.

 

Introduction

First of all, sarcasm is not the same as satire. Both terms are well defined and should be used carefully depending on the context. Furthermore, instead of recounting the development of satire detection, the authors should discuss the background and implications of their work. Indeed, the joint analysis of different modalities has implications for numerous fields of application.

The importance of the multiple modalities is apparent in the introduction, but the role of the Social Internet of Things must be clarified. It would be beneficial to define the term 'Social Internet of Things' at the beginning and to explain in more detail why recognising sarcasm is particularly important in this context. The definition of the Social Internet of Things also needs clarification with regard to why methods such as sentiment analysis and opinion mining are included, as they are, in fact, text mining methods. Additionally, the definition of sarcasm recognition needs to be revised, especially regarding what satirical labels mean. The description of challenge three should also be more precise, particularly regarding the relationship between the separate consideration of modalities and the use of mobile devices.

The use of ‘external knowledge’ is confusing. Typically, external knowledge is introduced from external sources, e.g., databases. In contrast, the authors refer to the internal knowledge of images embedded in the text as ‘external knowledge’. The authors use the terms multi-view and multimodality interchangeably, but from my point of view, these are different things. Within the community, the term ‘multimodality’ is widespread and should, therefore, be preferred.

Related Work 

The literature survey is mainly a list of different approaches and lacks a connection to the authors' own approach. It is recommended to discuss the various approaches in detail, including the results they achieved and their level of success. The literature survey should conclude by highlighting the research gap that the authors' approach attempts to fill.

Some parts of the introduction would be better placed in this section.

Methodology

Figure 1 should clarify which part of the picture refers to which of the described components and where the two parts mentioned in the introduction can be recognised. Furthermore, the caption of this figure should provide more details. It should be made clear which part the Feature Extract Module belongs to.

The methodology description is otherwise clear and easy to understand.

Experiment Setup

The explanation of basic rating measures such as Accuracy is unnecessary and should be omitted. Add the percentage frequency to Table 1.

Experimental Result and Analysis

The results are comprehensibly presented and described. However, the case study in 5.4 needs to be applied to more examples than a single sample to draw meaningful conclusions about what is important for correct classification.

Conclusion

The conclusion provides a suitable summary of the paper. For future work, additional approaches could be mentioned that address handling poor image quality, which can result in classification errors, as discussed in section 5.4. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Multimodal sarcasm detection is a new area of study focusing on understanding sarcasm in social media, which combines text and image data. Sarcasm often reveals people's true feelings about events, showing their emotions and thoughts. However, detecting sarcasm on IoT devices, such as smartphones, is difficult due to limited memory and processing power. This research proposes a lightweight model that combines external knowledge with text and image data to better understand sarcasm. Results show it works well, even with fewer parameters. This is an important topic, and there are some improvements to be made in the manuscript.

 

1. I would like to see an explicit research goal or research question. In addition to the research contributions, the authors need to state the significance of the research as well.

2. Figure 1, please expand MVK in the title.

3. Figure 1, I can understand the blocks image, knowledge, and text. What is the output of DenseNet, BERT, and BERT? Why are the BERTs different colors? Does the color in the figure have significance? What do the 'N layers' blocks represent? Are they neural networks? Very confusing! Please redraw the diagram and ensure it is self-explanatory.

4. Also, I am confused about your methodology. Figure 1 is your proposed approach. Again, I am confused that you have both an experiment and a case study in your paper. I suggest you draw one more diagram that focuses on just the research methodology, including the steps of your research and its validation. Validation of your results is missing in the current version.

5. Which NLP techniques are used? Please explain them in more detail. What are the limitations of your experiment?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Comment 1 – Abstract

 

All comments have been incorporated and the clarity of the abstract has been improved. However, the statement that the model outperforms the compared baselines is not entirely correct. In fact, the results were only comparable to those of other systems based on multiple modalities; better results were achieved only over models based on a single modality (text or image).

 

Comment 3 – Introduction

 

The definition of 'Social Internet of Things' has been established, but the paper requires a reference to the new definition; maybe you can use this one:

 

Atzori L, Iera A, Morabito G, et al (2012) The Social Internet of Things (SIoT) – when social networks meet the Internet of Things: concept, architecture and network characterization. Comput Netw 56(16):3594–3608

 

Additionally, it is still not clear to what extent sentiment analysis and opinion mining are applications of the Social Internet of Things, and to what extent it is important for these applications to recognise sarcasm. Although this was clarified in the authors' response, it should also be addressed in the paper. The definition of sarcasm detection has been revised; however, it is now in the methodology section, whereas it would be more useful in the introduction.

 

Comment 5 – Introduction

 

According to the authors' response, the term "multimodality" was chosen consistently instead of "multi-view". However, the term "multi-view" is still used, e.g. "The proposed multi-view modal interaction method enhances the multimodal information interaction." The other changes have been made.

 

Comment 6 – Related Work

 

The literature discussion still primarily resembles a list, as the approaches are mostly only mentioned. However, the last section is helpful, as it briefly summarises the results of the approaches and points out the gap. The statement that no common knowledge has been included so far is confusing, as the image annotations used by the authors cannot be described as common knowledge either.

 

Comment 7 – Methodology

 

The image has been enhanced, and the individual modules are now easily identifiable. The caption has also been expanded. However, it is still unclear which of the two components mentioned in the introduction, interactive learning or feature fusion, the feature extractor belongs to.

 
Comment 9 – Experimental Result and Analysis

 

The experiment included only one additional non-sarcastic image. To call this a case study, more examples of sarcastic images should be added; otherwise, the examples given are just that, examples. The description of the pictures incorrectly refers to sample (a) twice. It is also not yet clear what the conclusion of the case study is, or to what extent the model can effectively extract text and image features and improve the model's interaction.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed my questions raised in the previous review. However, they need to include the limitations of their research. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
