Article
Peer-Review Record

Combining Keyframes and Image Classification for Violent Behavior Recognition

Appl. Sci. 2022, 12(16), 8014; https://doi.org/10.3390/app12168014
by Yanqing Bi *, Dong Li and Yu Luo
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 24 June 2022 / Revised: 3 August 2022 / Accepted: 9 August 2022 / Published: 10 August 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

In this work, the authors propose a system for violence detection from video keyframes. Image pairs are also used to improve performance.

Here are the comments to the authors:

In line 127, the authors stated:

“the background and camera movement will greatly impact 3DCNN-based and LSTM-based models, …”

But as we know, deep models generalize well, so to what extent is this true?

 

In Figure 5 of the dataset, there is a woman in the scene who was not segmented with the fight actors. Since she is part of the foreground, how was she removed by the segmentation?

 

In line 234:

The symbols k_i and k_i' appear to be missing?

 

 

In line 240:

Shouldn't y_ij − x_ij be equal to D_ij by definition?

 

In line 252:

The authors report that training is equivalent to finding the minimum value of formula (1), but it is not clear which one is actually used: Lagrange resolution or training?

A graph or pipeline diagram would be more expressive in this case.

 

Figure 8 is not clear; it is blurred. It also needs a legend for more clarity.

 

In Table IV, which shows the implementation of the model, the Lagrange resolution step described below is not present?

 

Also, the best reported performances should be highlighted; this makes it easier for the reader to compare.

 

Finally, it is not clear why the authors tried to minimize the distance between the feature representations of the original and the segmented image, which are fundamentally different since the original one contains the background. With or without background, a fight frame is a fight frame.

Also, the segmentation algorithm cannot accurately extract only the belligerents, since it is designed to extract general foregrounds. This can be very restrictive for the approach.
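The distance-minimization idea questioned in the two comments above can be sketched generically. This is only the reviewer's reading of the general technique, not the paper's actual formula (1); the feature vectors and the squared-L2 loss here are illustrative assumptions:

```python
import numpy as np

def feature_distance_loss(f_orig, f_seg):
    """Squared L2 distance between two feature vectors.

    A minimal sketch of a consistency objective that pulls the
    embedding of the original frame (f_orig) toward the embedding
    of its segmented counterpart (f_seg). The paper's formula (1)
    may differ; this is an assumed stand-in.
    """
    f_orig = np.asarray(f_orig, dtype=float)
    f_seg = np.asarray(f_seg, dtype=float)
    return float(np.sum((f_orig - f_seg) ** 2))

# One gradient-descent step on the original-frame embedding,
# showing that minimizing the loss drives the two representations
# together (the embeddings themselves are made-up numbers).
f_o = np.array([1.0, 2.0, 3.0])
f_s = np.array([0.0, 2.0, 1.0])
lr = 0.1
grad = 2.0 * (f_o - f_s)          # d(loss)/d(f_o)
f_o_new = f_o - lr * grad
print(feature_distance_loss(f_o, f_s))      # 5.0
print(feature_distance_loss(f_o_new, f_s))  # 3.2 (smaller than 5.0)
```

Under this reading, the reviewer's objection is that when f_orig encodes background and f_seg does not, the minimum of such a loss forces the network to discard background information, which may or may not be desirable.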

 

 

Author Response

Please see the attachment

Author Response File: Author Response.docx

Reviewer 2 Report

This paper deals with an important topic and provides useful information on detecting, from images, the onset of violent acts such as riots or even terrorist actions. Therefore, the merit of the paper lies in this part. The modeling topics, by which the authors show how the interpretation of the images might be improved, refer to the literature in the field of social dynamics.

Known models are developed through a dynamics of "gain" and "loss" terms, as proposed by the authors, who, however, skip over the literature in the field reported, for instance, in the open-access survey paper

https://www.worldscientific.com/doi/pdf/10.1142/S0218202521500408

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have responded to all the inquiries. The manuscript is much clearer now.

 

Another question concerns the visualization in Figure 8. Subfigures (b) and (d) are very similar but represent the disjoint classes of segmented nonviolence and segmented violence, respectively. Is there any particular reason?

 

In the same figure, is there any explanation of the morphological structure of the representations in (b) and (d)?

 

Author Response

Please see the attachment.

Author Response File: Author Response.docx
