Combining Keyframes and Image Classification for Violent Behavior Recognition
Round 1
Reviewer 1 Report
In this work, the authors proposed a system for violence detection from video keyframes. Image pairs were also used to improve the performance.
Here are the comments to the authors:
In line 127, the authors state:
“the background and camera movement will greatly impact 3DCNN-based and LSTM-based models, …”
However, deep models are known to generalize well, so to what extent is this true?
In Figure 5 of the dataset, there is a woman in the scene, but she was not segmented along with the fight actors. Since she is part of the foreground, how was she removed by the segmentation?
In line 234:
k_i and k_i' appear to be missing.
In line 240:
Shouldn't y_ij – x_ij be equal to D_ij by definition?
In line 252:
The authors reported that the training is equivalent to finding the minimum value of formula (1). However, it is not clear which one is used: Lagrange resolution or training?
A graph or a pipeline diagram would be more expressive in this case.
Figure 8 is blurred and unclear. It also needs a legend for clarity.
In Table IV, which shows the implementation of the model, the Lagrange resolution step described below is not present.
Also, the best reported performances should be highlighted so that the reader can compare them more easily.
Finally, it is not clear why the authors tried to minimize the distance between the feature representations of the original and the segmented images, which are fundamentally different since the original contains the background. With or without the background, a fight frame is a fight frame.
Also, the segmentation algorithm cannot accurately extract only the belligerents, since it is designed to extract general foregrounds. This can be very restrictive for the approach.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper deals with an important topic and provides useful information on detecting, from images, the onset of violent acts such as riots or even terrorist actions. The merit of the paper lies in this part. The modeling part, in which the authors show how the images are interpreted, might be improved by referring to the literature in the field of social dynamics.
Known models are developed through dynamics of “gain” and “loss” terms, as proposed by the authors, who, however, skip over the literature in the field, reported for instance in the following open-access survey paper:
https://www.worldscientific.com/doi/pdf/10.1142/S0218202521500408
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
The authors have responded to all the inquiries. The manuscript is much clearer now.
Another question concerns the visualization in Figure 8. Subfigures (b) and (d) are very similar but represent the disjoint classes of segmented nonviolence and segmented violence, respectively. Is there any particular reason for this?
In the same figure, is there any explanation for the morphological structure of the representations in (b) and (d)?
Author Response
Please see the attachment.
Author Response File: Author Response.docx