A Controlled Benchmark of Video Violence Detection Techniques
Abstract
1. Introduction
- Understanding crowd dynamics;
- Allowing the development of crowd control systems;
- Supporting the design and organization of public areas;
- Improving animation models (e.g., for simulations and special-effects creation in video game design).
2. Datasets
2.1. Violent-Flows-Crowd Violence/Non-Violence Dataset
2.2. Hockey Fights Dataset
2.3. UCF 101 Dataset
2.4. Hollywood2
2.5. Movies Dataset
2.6. Behave Dataset
2.7. Caviar Dataset
2.8. UCSD Dataset
3. Reviewed Works and Implementation Details
3.1. ViF
3.2. OViF
3.3. MoSIFT
3.4. KDE, Sparse Coding, and Max Pooling
3.5. HOA, HOP, HOD, and OE
3.6. ConvLSTM Network
3.7. Extraction of OHOF from Candidate Regions
- A multiscale scanning window is used to search for violent events in the candidate regions;
- The OHOF feature is extracted for each area of the image covered by the scanning window, to distinguish violent from non-violent actions (a minimal sketch of the scanning loop is given after the step list).
- Step 1: Scanning windows are built at three scales: 72 × 72, 24 × 24, and 8 × 8.
- Step 2: Slide the windows over the image at each scale, with a step of 8 pixels at a time.
- Step 3: If the scanning window crosses more than half of the candidate regions as violent regions, go to Step 4; otherwise, go back to Step 2.
- Step 4: Update the candidate regions with the regions crossed by the scanning windows, and mark them as violent regions.
- Step 5: Sample the new candidate regions as violent using the method in [10], and go back to Step 2.
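The following is a minimal sketch of the scanning loop above, not the authors' implementation. It assumes the candidate regions are given as a boolean mask, interprets Step 3 as "more than half of the window's pixels lie on candidate pixels", and leaves the OHOF descriptor (and the resampling of Step 5) as placeholders; only the scales and stride come from Steps 1–2.

```python
# Sketch of the multiscale scanning-window search (Steps 1-4).
import numpy as np

SCALES = [72, 24, 8]   # window sizes from Step 1
STRIDE = 8             # step size from Step 2

def scan_candidate_regions(candidate_mask: np.ndarray) -> np.ndarray:
    """Return an updated boolean mask of violent regions."""
    h, w = candidate_mask.shape
    violent_mask = np.zeros_like(candidate_mask, dtype=bool)
    for size in SCALES:
        for y in range(0, h - size + 1, STRIDE):
            for x in range(0, w - size + 1, STRIDE):
                window = candidate_mask[y:y + size, x:x + size]
                # Step 3 (assumed reading): more than half of the window
                # overlaps candidate pixels.
                if window.mean() > 0.5:
                    # Step 4: mark the covered area as a violent region.
                    violent_mask[y:y + size, x:x + size] = True
    return violent_mask

def extract_ohof(flow_window):
    """Placeholder for the OHOF descriptor of a window (see [10])."""
    raise NotImplementedError

# Usage example with a random candidate mask:
# mask = scan_candidate_regions(np.random.rand(240, 320) > 0.7)
```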
3.8. Improved Fisher Vector with Boosting and Spatiotemporal Information
3.9. Haralick Feature
- Second Angular Moment;
- Contrast;
- Homogeneity;
- Correlation;
- Dissimilarity;
- Asymmetry;
- IFU (the standard definitions of the first five measures are recalled below).
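For reference, the textbook GLCM definitions of the first five measures are recalled in the block below, with p(i, j) the normalized grey-level co-occurrence matrix and μ and σ the means and standard deviations of its row and column marginals. The Asymmetry and IFU measures, and the exact formulations adopted in [2], follow the original work and may differ from these standard forms.

```latex
% Standard GLCM texture measures; p(i,j) is the normalized co-occurrence matrix.
\begin{align}
\text{Angular Second Moment} &= \sum_{i}\sum_{j} p(i,j)^{2}\\
\text{Contrast} &= \sum_{i}\sum_{j} (i-j)^{2}\, p(i,j)\\
\text{Homogeneity} &= \sum_{i}\sum_{j} \frac{p(i,j)}{1+(i-j)^{2}}\\
\text{Correlation} &= \sum_{i}\sum_{j} \frac{(i-\mu_i)(j-\mu_j)\, p(i,j)}{\sigma_i \sigma_j}\\
\text{Dissimilarity} &= \sum_{i}\sum_{j} \lvert i-j\rvert\, p(i,j)
\end{align}
```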
3.10. Blobs Motion Features
- (1) Given a sequence of frames, each frame is converted to grayscale, as in Equation (33).
- (2) Let I_{t−1}(x, y) and I_t(x, y) be two consecutive frames at times t−1 and t; their absolute difference is given in Equation (34).
- (3) This difference matrix is transformed into a quantized binary form, using a threshold h, as shown in Equation (35).
- (4) Blobs are then searched for in each binary image F_t(x, y); from each image F_t(x, y), a subset of the blobs is selected, from which further information is extracted. The selection is based on the blobs' area.
- (5) The blob area (A_{a,t}) is defined as in Equation (36).
- (6) The blob centroids are also calculated, as in Equation (37).
- (7) The Euclidean distance between the centroids of two blobs is computed.
- (8) Compactness is used to estimate the shape of the blobs (circular or elliptical); it is defined as in Equation (38). A minimal sketch of these steps is given after this list.
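The sketch below illustrates steps (1)–(8) with OpenCV 4.x and NumPy. The binarization threshold h, the minimum-area selection rule, and the compactness definition 4πA/P² are illustrative assumptions; the exact Equations (33)–(38) of the original work are not reproduced here.

```python
# Sketch of the motion-blob features: frame differencing, thresholding,
# blob detection, and per-blob area, centroid, distance, and compactness.
import cv2
import numpy as np

H_THRESHOLD = 25      # threshold h for binarizing the frame difference (assumed)
MIN_AREA = 50.0       # blobs smaller than this are discarded (assumed)

def blob_features(prev_frame: np.ndarray, frame: np.ndarray):
    """Return (areas, centroids, pairwise centroid distances, compactness)."""
    # Steps (1)-(2): grayscale conversion and absolute frame difference.
    g_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    g_curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g_curr, g_prev)
    # Step (3): binarize with threshold h.
    _, binary = cv2.threshold(diff, H_THRESHOLD, 255, cv2.THRESH_BINARY)
    # Step (4): find blobs as external contours and select them by area.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    areas, centroids, compactness = [], [], []
    for c in contours:
        area = cv2.contourArea(c)                  # Step (5): blob area
        if area < MIN_AREA:
            continue
        m = cv2.moments(c)                         # Step (6): centroid from moments
        cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
        perimeter = cv2.arcLength(c, True)
        areas.append(area)
        centroids.append((cx, cy))
        # Step (8): compactness (circular blobs give values close to 1).
        compactness.append(4.0 * np.pi * area / (perimeter ** 2))
    # Step (7): Euclidean distances between every pair of blob centroids.
    pts = np.array(centroids, dtype=float)
    distances = (np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
                 if len(pts) > 1 else np.zeros((len(pts), len(pts))))
    return areas, centroids, distances, compactness
```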
3.11. Violence Detection with Inception V3
4. Experiments Setup and Results
- ViF;
- Motion Blobs;
- Inception V3;
- Haralick;
- Improved Fisher Vector.
5. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Afiq, A.; Zakariya, M.; Saad, M.; Nurfarzana, A.; Khir, M.; Fadzil, A.; Jale, A.; Witjaksono, G.; Izuddin, Z.; Faizari, M. A review on classifying abnormal behavior in crowd scene. J. Vis. Commun. Image Represent. 2019, 58, 285–303. [Google Scholar] [CrossRef]
- Lloyd, K.; Rosin, P.L.; Marshall, D.; Moore, S.C. Detecting violent and abnormal crowd activity using temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures. Mach. Vis. Appl. 2017, 28, 361–371. [Google Scholar] [CrossRef] [Green Version]
- Wilk, S.; Kopf, S.; Effelsberg, W. Video composition by the crowd: A system to compose user-generated videos in near real-time. In Proceedings of the 6th ACM Multimedia Systems Conference, Portland, OR, USA, 18–20 March 2015; pp. 13–24. [Google Scholar]
- Pujol, F.A.; Mora, H.; Pertegal, M.L. A soft computing approach to violence detection in social media for smart cities. Soft Comput. 2019, 1–11. [Google Scholar] [CrossRef]
- Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 1–6. [Google Scholar]
- Bilinski, P.; Bremond, F. Human violence recognition and detection in surveillance videos. In Proceedings of the 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA, 23–26 August 2016; pp. 30–36. [Google Scholar]
- Deniz, O.; Serrano, I.; Bueno, G.; Kim, T.-K. Fast violence detection in video. In Proceedings of the 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 5–8 January 2014; pp. 478–485. [Google Scholar]
- Ribeiro, P.C.; Audigier, R.; Pham, Q.-C. RIMOC, a feature to discriminate unstructured motions: Application to violence detection for video-surveillance. Comput. Vis. Image Underst. 2016, 144, 121–143. [Google Scholar] [CrossRef]
- Ditsanthia, E.; Pipanmaekaporn, L.; Kamonsantiroj, S. Video representation learning for CCTV-Based violence detection. In Proceedings of the 3rd Technology Innovation Management and Engineering Science International Conference (TIMES-iCON), Bangkok, Thailand, 12–14 December 2018; pp. 1–5. [Google Scholar]
- Zhang, T.; Yang, Z.; Jia, W.; Yang, B.; Yang, J.; He, X. A new method for violence detection in surveillance scenes. Multimed. Tools Appl. 2016, 75, 7327–7349. [Google Scholar] [CrossRef]
- Mousavi, H.; Mohammadi, S.; Perina, A.; Chellali, R.; Murino, V. Analyzing tracklets for the detection of abnormal crowd behavior. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 148–155. [Google Scholar]
- Gao, Y.; Liu, H.; Sun, X.; Wang, C.; Liu, Y. Violence detection using oriented violent flows. Image Vis. Comput. 2016, 48, 37–41. [Google Scholar] [CrossRef]
- Xu, L.; Gong, C.; Yang, J.; Wu, Q.; Yao, L. Violent video detection based on MoSIFT feature and sparse coding. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3538–3542. [Google Scholar]
- Battiato, S.; Gallo, G.; Puglisi, G.; Scellato, S. SIFT features tracking for video stabilization. In Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP 2007), Modena, Italy, 10–14 September 2007; pp. 825–830. [Google Scholar]
- Keerthi, S.S.; Shevade, S. A fast tracking algorithm for generalized lars/lasso. IEEE Trans. Neural Netw. 2007, 18, 1826–1830. [Google Scholar] [CrossRef] [Green Version]
- Podder, P.; Khan, T.Z.; Khan, M.H.; Rahman, M.M. Comparative performance analysis of hamming, hanning and blackman window. Int. J. Comput. Appl. 2014, 96, 18. [Google Scholar] [CrossRef]
- Deans, S.R. The Radon Transform and Some of its Applications; Courier Corporation: North Chelmsford, MA, USA, 2007. [Google Scholar]
- Sudhakaran, S.; Lanz, O. Learning to detect violent videos using convolutional long short-term memory. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
- Chauhan, A.K.; Krishan, P. Moving object tracking using gaussian mixture model and optical flow. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2013, 3. Available online: https://www.semanticscholar.org/paper/Moving-Object-Tracking-using-Gaussian-Mixture-Model-Chauhan-Krishan/8b56b43978543749075d9dee5d9d78f17614ae9b#paper-header (accessed on 8 June 2020).
- Zhou, P.; Ding, Q.; Luo, H.; Hou, X. Violent interaction detection in video based on deep learning. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2017; Volume 844, p. 012044. [Google Scholar]
- Gracia, I.S.; Suárez, O.D.; García, G.B.; Kim, T.-K. Fast fight detection. PLoS ONE 2015, 10. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Dataset | Description | Resolution | Sources |
---|---|---|---|
Violent-Flows-Crowd Violence/Non-Violence Database | Scene: real-world actions in scenarios such as roads, football stadiums, volleyball fields or ice hockey, and schools; Source: YouTube; Videos: 246; Average video length: 3.60 s; Task: Violence Detection | 320 × 240 | [5] |
Hockey Fights Dataset | Scene: hockey games; Source: NHL (National Hockey League); Videos: 1000; Task: Violence Detection | 320 × 288 | [6] |
UCF101 | Scene: simulated actions and actions taken from the real world; Source: YouTube; Number of classes: 101; Videos: 13,320; Task: Action Recognition | 320 × 240 | [7] |
Hollywood2 | Scene: actions taken from films; Source: 69 films; Number of classes: 12 action classes and 10 scene classes; Videos: 3669; Total duration: 20 h; Task: Action Recognition | Several resolutions | [8] |
Movies Dataset | Scene: violent actions taken from movies; non-violent actions taken from the real world | Several resolutions | [9] |
Behave | Scene: simulated actions such as walking, running, chasing, group discussions, moving vehicles or bikes, fighting, etc.; Frames: 200,000; Task: Action Recognition | 640 × 480 | [10] |
Caviar | Scene: simulated and real actions of people walking alone, meeting with others, etc.; Source: INRIA Lab and the street of a shopping center in Portugal; Task: Action Recognition | 348 × 288 | [8] |
UCSD | Scene: real-world scenes; Task: Anomalous pedestrian motion patterns | PED1: 158 × 238; PED2: 240 × 360 | [11] |
Technique | Description | References |
---|---|---|
ViF | A vector of concatenated histograms; each histogram represents the change in optical-flow magnitudes in a given region of the video frames. | [5] |
OViF | A vector of HOOF histograms indicating changes in magnitude and orientation of optical flow vectors in regions of the scene. | [12] |
HOA | A representation of acceleration as a histogram of kurtosis values extracted from the processing of a frame sequence. | [7,17] |
HOP | It supports the estimation of acceleration (HOA) and deceleration (HOD) by considering the average of the image obtained from the ratio of the spectral powers of two consecutive frames. | [17] |
HOD | An estimate of the deceleration within a frame sequence; the extracted kurtosis values are represented by a histogram. | [17] |
Features from ConvLSTM | Features extracted from ConvLSTM, a network made up of CNN Alexnet, pretrained on the ImageNet database, and an LSTM for obtaining space–time features. | [18] |
OHOF | A histogram computed on the regions previously marked as violent, by calculating the optical flow of these regions and adding context information. | [19] |
IFV | In [6], the IFV is represented as a vector of gradients, first normalized with power normalization and then with the L2 norm. | [6] |
HARALICK | After the GLCM is computed over eight directions, the seven previously defined measures are extracted. | [2] |
Blobs Area | Area of blobs detected in a scene. | [21] |
Compactness | Describes the shape (circular or elliptical) of the blobs. | [21] |
Blobs Centroids | Blobs centroids detected in a scene. | [21] |
Distance of the centroids of the blobs | Euclidean distance between the centroids of two blobs. | [21] |
Inception V3 | Violent/non-violent binary classification obtained by averaging the scores of all frames within a video or a time window; if the average score is ≤ 0.5 the video is labeled violent, otherwise non-violent (a minimal sketch follows this table). | [23] |
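The sketch below illustrates the frame-score averaging rule described for Inception V3 in the last row above. It assumes a Keras InceptionV3 backbone with a single sigmoid output trained to estimate the probability of the non-violent class (so that an average score ≤ 0.5 maps to "violent", matching the rule in the table); the fine-tuning step and the time-window variant are omitted.

```python
# Sketch of per-frame scoring with Inception V3 and score averaging per video.
import cv2
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras import layers, Model

def build_classifier() -> Model:
    # ImageNet-pretrained backbone with a single sigmoid head (to be fine-tuned).
    base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                       input_shape=(299, 299, 3))
    out = layers.Dense(1, activation="sigmoid")(base.output)  # P(non-violent), by assumption
    return Model(base.input, out)

def classify_video(path: str, model: Model) -> bool:
    """Return True if the video is labeled violent (average frame score <= 0.5)."""
    cap, scores = cv2.VideoCapture(path), []
    ok, frame = cap.read()
    while ok:
        rgb = cv2.cvtColor(cv2.resize(frame, (299, 299)), cv2.COLOR_BGR2RGB)
        x = preprocess_input(rgb.astype("float32"))[None, ...]
        scores.append(float(model.predict(x, verbose=0)[0, 0]))
        ok, frame = cap.read()
    cap.release()
    return bool(np.mean(scores) <= 0.5)
```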
Dataset | Feature | Classifier | Performance | References |
---|---|---|---|---|
Violence/Non-Violence | ViF | Linear SVM | 81.30 ± 0.21% | [5] |
ASLAN | ViF | Linear SVM | 56.57 ± 25% | [5] |
Hockey Fights | ViF | Linear SVM | 82.90 ± 0.14% | [5] |
Violent-Flows | OViF | SVM | 76.80 ± 3.90% | [12] |
Violent-Flows | OViF | AdaBoost | 74.00 ± 4.90% | [12] |
Hockey Fights | OViF | SVM | 82.20 ± 3.33% | [12] |
Hockey Fights | OViF | AdaBoost | 78.30 ± 1.68% | [12] |
BEHAVE | OHOF | Linear SVM | 85.29 ± 0.16% | [10] |
CAVIAR | OHOF | Linear SVM | 86.75 ± 0.15% | [10] |
Crowd Violence | OHOF | Linear SVM | 82.79 ± 0.19% | [10] |
Behave | OHOF | Linear SVM | 95.00 ± 0.54% | [10] |
CF-Violence Dataset | Haralick | Linear SVM | 99% | [2] |
Violent-Flows | Haralick | Linear SVM | 82% | [2] |
UMN | Haralick | Linear SVM | 86.03 ± 4.25% | [2] |
UCF | Haralick | Linear SVM | 97% | [2] |
Hockey Fights | IFV | Linear SVM | 93% | [6] |
Movies | IFV | Linear SVM | 98% | [6] |
Violent-Flows | IFV | Linear SVM | 94% | [6] |
Hockey Fights | MoSIFT + KDE | SVM (RBF kernel) | 94.3 ± 1.68% | [13] |
Crowd Violence | MoSIFT + KDE | SVM (RBF kernel) | 93.5% (1 × 3 × 1) | [13] |
Hockey Fights | ConvLSTM | ConvLSTM | 94.3 ± 1.68% | [18] |
Movie Dataset | ConvLSTM | ConvLSTM | 100% | [18] |
CF-Violence Dataset | ConvLSTM | ConvLSTM | 94.57 ± 2.34% | [18] |
Movie Dataset | Motion Blobs | SVM | 87.2 ± 0.2% | [21] |
Movie Dataset | Motion Blobs | AdaBoost | 81.7 ± 0.5% | [21] |
Hockey Fights | Motion Blobs | SVM | 72.50% | [21] |
Hockey Fights | Motion Blobs | AdaBoost | 71.7% | [21] |
Dataset | ViF + SVM | ViF + Random Forest | Motion Blobs + SVM | Motion Blobs + Random Forest | Inception V3 | Haralick (5-FOLD) + SVM | Haralick (5-FOLD) + Random Forest | IFV + SVM | IFV + Random Forest |
---|---|---|---|---|---|---|---|---|---|
Movies | 97% ± 0.53 | 95.5% ± 0.53 | 97.5% ± 0.14 | 97% ± 0.21 | 99% ± 0.82 | 85% ± 1.24 | 87% ± 1.87 | 88% ± 0.93 | 56% ± 2.76 |
Hockey | 80.5% ± 0.25 | 80.1% ± 0.34 | 71.2% ± 0.21 | 80.2% ± 0.19 | 96.3% ± 0.97 | 90% ± 0.14 | 94% ± 1.55 | 94% ± 0.11 | 49% ± 3.22 |
Violent/Crowd Dataset | 81.25% ± 0.14 | 79.16% ± 0.28 | 65.41% ± 0.24 | 66.25% ± 0.17 | 91.6% ± 0.21 | 80% ± 1.32 | 86% ± 0.82 | 86% ± 1.65 | 52% ± 2.89 |
Technique | Prediction Time |
---|---|
ViF | ~0.6 s |
Motion Blobs | ~3 s |
Inception V3 | ~1 s |
IFV | ~0.36 s |
Haralick | ~24 s |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).