Article
Peer-Review Record

Topic-Oriented Text Features Can Match Visual Deep Models of Video Memorability

Appl. Sci. 2021, 11(16), 7406; https://doi.org/10.3390/app11167406
by Ricardo Kleinlein *, Cristina Luna-Jiménez, David Arias-Cuadrado, Javier Ferreiros and Fernando Fernández-Martínez
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 June 2021 / Revised: 30 July 2021 / Accepted: 9 August 2021 / Published: 12 August 2021
(This article belongs to the Special Issue Computational Trust and Reputation Models)

Round 1

Reviewer 1 Report

General summary

This is a thought-provoking paper on a novel research niche that is likely to become increasingly important, and it will be of interest to readers of this journal. The introduction and method sections are sufficiently clear. The results and conclusions presented are reasonable given the parameters of the study. The figures are well presented and appropriate for the research aims.

I have no major suggestions, but I do have a number of minor suggestions, all of which are related to language and should be relatively easy to address. I hope these changes will not only improve the readability of the article, but also increase the number of readers who engage with it. For each point, I provide the line reference, the relevant wording, and my comment or suggested change. Please feel free to use alternative wording rather than my suggestions if you prefer.

Language clarity

20 This implies – I cannot understand where the inference that all input stimuli are equally well remembered comes from. Please clarify.

Language brevity

Based on it, Sentence-BERT --> Sentence-BERT

Language formality

3 reckon  --> think/believe

63 don’t seem to  --> do not seem to

314 the  latter’s labels  --> the labels of the latter

Language accuracy

1 Not every visual media production are  --> Not every visual media production is

3 on a second watch or not  --> on a second viewing or not

6 we deepen in the study of short captions as a mean to  --> we deepen the study of short captions as a means to

18 Human brain  --> The human brain

26-27 make their messages to persist  --> make their messages persist

49-50 in full extent in section 5  --> in full in section 5

56 aesthetics experience  --> aesthetic experience

62 Contrarily to intuition  --> Contrary to intuition

100-101 deepen on the particularities of each dataset.  --> discuss the particularities of each dataset in depth.

313 official train  --> official training

Author Response

We sincerely appreciate the reviewer's kind words, and their effort to make the suggested changes easy to apply. We have incorporated most of them, and we thank the reviewer for helping make the article more readable.

Reviewer 2 Report

  1. This paper needs a brief figure showing the overall architecture or concept of the method the authors have proposed.
  2. In this method, how is BERT or SBERT used for unsupervised topic discovery? The explanations in Section 4.1 are not enough to show the role of BERT in this method. A diagram explaining it clearly, with further detailed explanations, is required.
  3. Section 5 shows the predictive models of video memorability. How do the textual SBERT-based models work with the SBERT, PCA, and linear regression components? The authors should improve this section by showing the overall process clearly.
  4. In Section 5.2, the visual baseline pipeline involves the DenseNet-121 model. Can other CNN models, such as ResNet and MobileNet, be applied to the pipeline?
  5. In Figure 8, what is the AVG? There is no explanation for this in Section 5.2.
  6. Are there any additional experimental results besides Pearson's correlation and Spearman's correlation? In particular, could these results be graphed? If there are experimental results that can be represented graphically, it would be good to add them to the paper.

Author Response

We thank the reviewer for their interest in making our work more understandable and interesting to readers. Below we address each of their suggestions in order.

  1. It is true that an image is worth a thousand words. We have therefore replaced the figure that depicted only the visual baseline's architecture (Figure 8) with one that shows all the branches of the model (image, text, and the combination of both). We hope this new diagram clarifies our approach; we have also included the appropriate references in the main text of the article, together with further explanations of the diagram.
  2. It is worth emphasizing that we do not carry out any adaptation procedure on the default SBERT models. In Section 4, our purpose is to explain how our approach must meet two prerequisites for the results to be meaningful. On the one hand, as we explain in Section 4.1, “Out-Of-Vocabulary Words”, our pretrained model must be able to capture the semantics of our set of sentences and embody them in numerical embeddings; otherwise, the automatically extracted embeddings would barely make sense from the point of view of language processing. On the other hand, the semantic units we extract via SBERT must show some degree of alignment with the memorability scores. This is the issue we address in depth in Section 4.2, “Relationship between topics and memorability”. There, we describe how the SBERT embeddings are used as inputs to a UMAP dimensionality reduction aimed at easing the subsequent clustering, which returns groups of embeddings (and therefore sentences) that share semantic content and that, as a group, seem to correlate with memorability. In this way, we identify semantic units that tend to show greater memorability scores than others. Still, we understand it might be clearer for new readers if we extended our explanations as the reviewer suggests, and we have therefore added further comments to the main text.
  3. We have included a diagram in Figure 8, at the beginning of the section devoted to the predictive models of media memorability, that displays all the predictive models, in the hope that it helps readers understand the three modalities we use (visual, text from SBERT, and a late fusion of both).
  4. Indeed, any CNN architecture would work, although the prediction rates achieved would probably differ. We resorted to this particular architecture because in a very recent paper (in fact, the one in which Memento10K, one of the datasets we experiment with, was released), its authors proposed a model whose visual branch followed this very same architecture, achieving state-of-the-art performance. By using it, we are in a position to compare our system against their proposal more directly.
  5. The acronym “AVG” refers to the average. As explained in Section 5.2: “This model goes in order through the frames of a video, computing a memorability score independently for every frame and then averaging over these predictions to compute a video-level final estimation.” However, we realize the diagram could be clearer to new readers, and we have therefore replaced “AVG” with “AVERAGE”.
  6. These two metrics (Pearson's correlation and Spearman's rank index) are the standard metrics in all media memorability studies we are aware of, and therefore they constitute the most straightforward way to compare our approaches against models proposed by other researchers. Regarding a possible visual representation of the results, we thank the reviewer for the suggestion. We opted to plot the dimensionality of the input representation after the PCA transform against both Pearson's coefficient and Spearman's rank index, allowing readers to better see its effect on the capacity of the linear regression models to predict media memorability.
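In outline, the topic-discovery step described in point 2 above can be sketched as follows. This is an illustrative reconstruction, not the code used in the paper: the embeddings and memorability scores are synthetic stand-ins for SBERT caption embeddings, the cluster count is an arbitrary choice, and PCA stands in for the paper's UMAP reduction to keep dependencies minimal.

```python
# Sketch of the topic-discovery pipeline: sentence embeddings ->
# dimensionality reduction -> clustering, then inspecting the mean
# memorability score per cluster ("topic"). All data here is synthetic;
# in the paper, embeddings come from a pretrained Sentence-BERT model
# and the reduction is done with UMAP rather than PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for SBERT embeddings of 200 video captions (384-dim vectors).
embeddings = rng.normal(size=(200, 384))
mem_scores = rng.uniform(0.4, 1.0, size=200)  # toy memorability scores

# Reduce dimensionality before clustering (UMAP in the paper).
reduced = PCA(n_components=5, random_state=0).fit_transform(embeddings)

# Group captions into semantic units.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)

# Compare average memorability across the discovered groups.
for k in range(4):
    print(k, round(mem_scores[labels == k].mean(), 3))
```

With real SBERT embeddings, captions sharing semantic content land in the same cluster, so differences in the per-cluster means indicate topics that are remembered better or worse as a group.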
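The textual predictive branch described in point 3 above, evaluated as in point 6, can likewise be sketched. Again this is an illustrative reconstruction under assumed dimensions, with synthetic data standing in for SBERT embeddings and ground-truth memorability scores; only the overall shape (embedding → PCA → linear regression, scored with Pearson's and Spearman's correlations) follows the paper's description.

```python
# Sketch of the textual prediction branch: caption embeddings -> PCA ->
# linear regression, evaluated with Pearson and Spearman correlation
# between predicted and ground-truth memorability. Data is synthetic.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Stand-in for SBERT embeddings and memorability scores that depend
# (noisily) on the embedding, so there is something to learn.
X = rng.normal(size=(300, 384))
w = rng.normal(size=384)
y = X @ w * 0.01 + rng.normal(scale=0.05, size=300)

X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# PCA compresses the embedding before the linear regressor; the paper
# plots performance against this retained dimensionality.
pca = PCA(n_components=32, random_state=0).fit(X_train)
reg = LinearRegression().fit(pca.transform(X_train), y_train)
pred = reg.predict(pca.transform(X_test))

print("Pearson: ", round(pearsonr(y_test, pred)[0], 3))
print("Spearman:", round(spearmanr(y_test, pred)[0], 3))
```

Sweeping `n_components` and recording both coefficients reproduces the kind of plot added in response to point 6.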

We hope these corrections are in line with the reviewer's original suggestions.

Round 2

Reviewer 2 Report

I have reviewed the authors' response and the corrected contents of the revised manuscript. The points I raised have been carefully addressed by the authors.
