Article
Peer-Review Record

CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection

Sensors 2023, 23(18), 7734; https://doi.org/10.3390/s23187734
by Md. Haidar Sharif *, Lei Jiao and Christian W. Omlin
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 1 August 2023 / Revised: 1 September 2023 / Accepted: 4 September 2023 / Published: 7 September 2023

Round 1

Reviewer 1 Report

The manuscript proposes weakly-supervised video segment-level anomaly detection based on extracted visual features, where the visual features are extracted using a CNN and a ViT. The overall presentation is logical and clear. Several minor issues:

1. Section 3.1.3 (feature length normalization for training) could be moved into Section 3.3 (training).

2. Please add more details on how C3D is used (e.g., the frame length per clip).

3. Lines 212-213: how the features from the CNN and the ViT are separately converted into a probability score vector Pscore is not clear. We suggest adding more description.

4. Section 4.8 (future work) could be moved into Section 5.

5. Please add some discussion as a new Section 4.8: why not use an LSTM to extract temporal features?

6. In the testing phase, will your method still work if the video lengths are not the same? We would like to know how variable-length videos are handled (one common normalization scheme is sketched below).
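
For context, here is a minimal sketch of one common way MIL-based pipelines handle variable-length videos: averaging per-clip features into a fixed number of temporal segments, as the paper's feature length normalization step suggests. This illustrates the general technique only, not necessarily the authors' exact procedure; the segment count of 32 and the NumPy helper are assumptions.

    import numpy as np

    def normalize_feature_length(features, num_segments=32):
        """Average a (T, D) sequence of per-clip features into a fixed
        (num_segments, D) array using equal-width temporal bins; works
        whether T is larger or smaller than num_segments."""
        T, D = features.shape
        # num_segments + 1 bin edges spread over the temporal axis.
        edges = np.linspace(0, T, num_segments + 1).round().astype(int)
        out = np.empty((num_segments, D), dtype=features.dtype)
        for s in range(num_segments):
            lo, hi = edges[s], edges[s + 1]
            if hi <= lo:                    # short video: empty bin,
                hi = min(lo + 1, T)         # fall back to the nearest feature
                lo = hi - 1
            out[s] = features[lo:hi].mean(axis=0)
        return out

    # Example: a long and a short video both become 32 x D inputs.
    long_feats  = normalize_feature_length(np.random.rand(420, 2048))
    short_feats = normalize_feature_length(np.random.rand(15, 2048))
    assert long_feats.shape == short_feats.shape == (32, 2048)

With such a scheme, test-time videos of any length map to the same fixed-size input, which is why MIL-style detectors typically remain applicable regardless of duration.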

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents a novel approach, called CNN-ViT-TSAN, for weakly supervised video anomaly event detection (WVAED). The proposed method leverages pre-trained feature extractors (C3D, I3D, and CLIP) to extract effective representations, and it incorporates both long-range and short-range temporal dependencies using a temporal self-attention network (TSAN). Experimental results on popular crowd datasets demonstrate the effectiveness of the CNN-ViT-TSAN approach for WVAED. Overall, this paper is well-structured and of great value. Nevertheless, a few issues should be acknowledged and addressed.

1. In the Introduction section, it is crucial for the authors to provide a more specific and targeted explanation of the problem they are addressing. They should clearly articulate the necessity of Weakly-Supervised Video Segment Level Anomaly Detection. What are the limitations or challenges with existing methods that make weakly-supervised approaches necessary?

2. When describing the proposed method, it is crucial to emphasize the key innovations of your work and how they distinguish it from related approaches. Highlight the unique contributions and advancements that your method brings to the field.

3. In application scenarios of anomaly detection, efficiency is an important factor to consider. I therefore suggest conducting a comprehensive running-time comparison to evaluate the efficiency of the proposed method. Such a comparison would demonstrate the computational performance and runtime of the method relative to existing approaches, and would provide valuable insights for practical implementation and adoption in real-world scenarios (a minimal runtime-measurement sketch follows this list).

4. The experimental results should be thoroughly discussed to highlight the specific contribution of the proposed framework. Emphasize how the proposed method outperforms existing methods or addresses limitations in the field. Discuss any notable findings, trends, or patterns observed in the results.

5. The Conclusion section can be improved by presenting the principles demonstrated by the results more comprehensively. Clearly articulate the key findings and their significance for the research field. Discuss the theoretical implications of your work, highlighting the unique contributions made by this article. 
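
As a point of reference for the runtime comparison requested in point 3, here is a minimal measurement sketch assuming a PyTorch-style model. The names `model` and `clips` are hypothetical placeholders; this is illustrative, not the authors' evaluation protocol.

    import time
    import torch

    @torch.no_grad()
    def mean_latency_per_clip(model, clips, warmup=5, runs=50):
        """Mean inference latency per clip, in seconds (illustrative).
        `clips` is a pre-loaded batch of shape (batch, ...)."""
        model.eval()
        for _ in range(warmup):          # exclude one-off startup costs
            model(clips)
        if torch.cuda.is_available():
            torch.cuda.synchronize()     # finish queued GPU work first
        start = time.perf_counter()
        for _ in range(runs):
            model(clips)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return elapsed / (runs * clips.shape[0])

Reporting such per-clip latencies (or the equivalent frames per second) for each compared method on the same hardware would make the efficiency comparison concrete.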

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors designed a multiple-instance-learning (MIL)-based generalized framework, CNN-ViT-TSAN, to specify a series of models for the WVAED problem. The structure of CNN-ViT-TSAN was illustrated. Experiments were carried out on the UMN, UCSD-Ped1, UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets, and detailed results were presented. Several points should be improved.

1. The innovations should be strengthened. Five deep models were obtained by combining the CNN- and ViT-based pre-trained models, and the experiments showed that I3D-CLIP-TSAN gives the best results; however, the reasons why this network is best were not analyzed. Similarly, in Section 3.2, the temporal self-attention network (TSAN) was introduced, but further analysis was neglected (a generic temporal self-attention layer is sketched after this list for orientation).

2. In Section 4.1, the complete introduction of the datasets seems unnecessary.

3. What is the definition of AUC in Section 4? In Section 4.3, the AUC scores obtained by I3D-CLIP-TSAN on the UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets are reported as 0.986, 0.989, and 0.912, respectively; however, we cannot find the corresponding results in Table 1. Please explain this.

4. In Figure 2, the two horizontal axes indicating the number of frames and the number of snippets in every subfigure are confusing.
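
For orientation on point 1, below is a minimal sketch of generic single-head temporal self-attention over snippet features, assuming PyTorch. It illustrates the mechanism by which each snippet attends to all others, and is a plain textbook layer, not the authors' exact TSAN.

    import torch
    import torch.nn as nn

    class TemporalSelfAttention(nn.Module):
        """Single-head self-attention across the temporal (snippet) axis,
        so each snippet's representation becomes a weighted mix of all
        snippets, capturing long-range temporal dependencies."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, x):                  # x: (batch, snippets, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, S, S)
            attn = attn.softmax(dim=-1)        # temporal attention weights
            return attn @ v                    # (B, S, dim)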

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

What is the definition of AUC in Section 4? The authors are recommended to present its calculation formula.
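
For reference, the ROC-AUC that video-anomaly-detection papers usually report at the frame level admits the following standard formulation; this is a textbook definition offered for context, and the authors' exact formula may differ.

    \[
      \mathrm{AUC}
        = \int_{0}^{1} \mathrm{TPR}\,\mathrm{d}\,\mathrm{FPR}
        = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|}
          \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}}
          \Bigl( \mathbb{1}[\, s_i > s_j \,]
               + \tfrac{1}{2}\,\mathbb{1}[\, s_i = s_j \,] \Bigr),
    \]
    where $\mathcal{P}$ and $\mathcal{N}$ are the sets of anomalous (positive)
    and normal (negative) frames, $s_k$ is the predicted anomaly score of
    frame $k$, and $\mathbb{1}[\cdot]$ is the indicator function.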

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
