Peer-Review Record

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Appl. Sci. 2021, 11(7), 3214; https://doi.org/10.3390/app11073214
by Huy Manh Nguyen, Tomo Miyazaki *, Yoshihiro Sugaya and Shinichiro Omachi
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 3 March 2021 / Revised: 26 March 2021 / Accepted: 31 March 2021 / Published: 3 April 2021
(This article belongs to the Special Issue Artificial Intelligence for Computer Vision)

Round 1

Reviewer 1 Report

The article presents an interesting solution for video retrieval from a query sentence. The authors use multiple individual embedding spaces to capture multiple relationships between instances, and a weighted-sum strategy to produce a final similarity between instances.
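
To make the fusion idea in the summary above concrete, it can be sketched roughly as follows; the two-space setup, the use of cosine similarity, and all variable names are illustrative assumptions, not the Authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative only: K hypothetical embedding spaces, each producing one video
# vector and one sentence vector; the fused score is a weighted sum of the
# per-space cosine similarities.
def fused_similarity(video_embs, sent_embs, weights):
    """video_embs, sent_embs: lists of (D,) tensors, one per embedding space.
    weights: (K,) tensor of non-negative fusion weights summing to 1."""
    sims = torch.stack([
        F.cosine_similarity(v, s, dim=0) for v, s in zip(video_embs, sent_embs)
    ])                                 # (K,) per-space similarities
    return (weights * sims).sum()      # scalar fused similarity

# Example with K = 2 spaces of dimension 512 and equal (fixed) weights.
v = [torch.randn(512), torch.randn(512)]
s = [torch.randn(512), torch.randn(512)]
print(fused_similarity(v, s, torch.tensor([0.5, 0.5])))
```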

The article has a good formalism, and the results are supported by elaborate and well-explained experiments. The authors may wish to explain why they chose those values for the selected parameters (20 chunks, for instance) and to elaborate a bit on the learnable parameters.

The Related Work chapter is based on a large and recent list of references. Still, the chapter gives only a very short presentation of what visual and language understanding means, as well as of visual and sentence embedding. The suggestion is to elaborate on these parts.

The Conclusion chapter could also be elaborated to better emphasize the results and how and when they can be used.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper proposes an approach for visual-semantic embedding, i.e., to relate videos and sentences.

The manuscript is overall well-written, easy to follow, and pleasant to read. The content is of great interest and can be useful for the research community. The related works are also presented quite methodically, although the subsection dedicated specifically to Video and Sentence Embedding could be expanded.

The problem is well-stated in Section 3.1, and Figure 1 explains the whole approach easily and immediately. The Authors performed sentence-to-video retrieval experiments on a benchmark dataset (MSR-VTT), which allows them to compare their proposal with other methods limited to a single or dual embedding space. The ablation studies are also quite complete and important for this discussion.

Nevertheless, it is this Reviewer's opinion that some improvements and corrections are necessary before acceptance. Specifically:

 

  1. Page 2, the statement “Recent works suffer from extracting visual dynamics in a video [3–6]” is a bit too vague and should be presented in more detail.
  2. The rationale for using a weighted sum of the similarity scores over the two embedding spaces is well explained by the Authors at the end of page 3. If done manually, a weighted average would introduce considerable arbitrariness into the procedure. The Authors bypass this problem with a data-driven approach, the Mixture of Experts fusion strategy. This is a well-reasoned approach, yet the effect of varying the assigned weights should be addressed more methodically, also to support the rather vague statement that "videos and sentences require different attention" (see the sketch after this list for the kind of gating this Reviewer has in mind).
  3. Subsection 3.3/1: it is not clear why N = 20 chunks were selected. Is it possible to link the chunk size to the number of pixels in the frame? In this sense, it would be better to also report the size of the samples from the ImageNet dataset before resizing.
  4. Subsection 3.3/2: is there any specific reason for setting D = 512?
  5. Subsection 3.4: "we [..] take the first frames of each chunk as the input of the sequential visual network." How many frames exactly? Are they consecutive?

  6. Video analysis is a very computationally expensive procedure. The Authors should investigate and discuss how their approach compares to the reviewed state-of-the-art methods in this respect (e.g., by comparing the elapsed time). This is all the more relevant since they use multiple embedding networks to capture various relationships between video and sentence. While this can lead to more compelling video retrieval than existing options that rely on a single embedding space, it comes at the cost of further complicating the whole procedure. Therefore, a cost-benefit analysis will be necessary as well.
  7. The consideration in the previous remark also applies to the optimisation phase described in Subsection 3.6: how long does it take to perform this operation?
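
As referenced in remark 2 above, the following is a minimal sketch of the kind of query-dependent Mixture-of-Experts gating this Reviewer has in mind; the single linear gating layer, D = 512, the number of spaces, and all names are assumptions made for illustration, not the Authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Illustrative Mixture-of-Experts style gating: the sentence embedding
    decides how much each embedding space contributes to the final score."""
    def __init__(self, dim=512, num_spaces=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_spaces)  # gating network (assumed form)

    def forward(self, sent_emb, per_space_sims):
        # sent_emb: (B, dim); per_space_sims: (B, num_spaces) similarity scores
        weights = F.softmax(self.gate(sent_emb), dim=-1)  # query-dependent weights
        return (weights * per_space_sims).sum(dim=-1)     # (B,) fused scores

# Example: a batch of 4 sentences scored against 2 embedding spaces.
fusion = GatedFusion()
scores = fusion(torch.randn(4, 512), torch.randn(4, 2))
print(scores.shape)  # torch.Size([4])
```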

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The Authors have replied to and fully addressed all the remarks and concerns raised by this Reviewer. The reply to remark 2 is particularly appreciated, since it made the intended meaning much clearer and removed a potential ambiguity.

Therefore, it is this Reviewer's opinion that the manuscript can be accepted in its current form. 
