1. Introduction
With the advancement of artificial intelligence, recent developments in deep learning have significantly improved the accuracy and efficiency of human action recognition (HAR) in closed-circuit television (CCTV) footage [1,2]. Among the many deep learning architectures developed, convolutional neural networks (CNNs) [3,4] are particularly effective at detecting and recognizing patterns in video data [5], making them well-suited for this task. Recurrent neural networks (RNNs) [6] are also used to capture temporal relationships between video frames. Despite the help of CNNs and RNNs, the performance of deep learning models for action recognition is still hindered by several challenges. One of the key challenges in HAR is the large variability in human actions [7], including differences in viewpoint [8,9], clothing [10,11], lighting conditions [12], and occlusion [13]. To overcome this challenge, researchers have developed various techniques, such as data augmentation [14], transfer learning [15], and ensemble methods [16], to improve the robustness and generalizability of HAR models.
Detecting violent actions is of paramount importance within HAR, serving as a crucial tool to alert security officials and law enforcement to potential threats and thereby safeguard the community. Identifying violent behavior in surveillance video—encompassing actions such as physical altercations, aggressive behavior, and vandalism—is a formidable challenge in HAR [17,18,19]. To tackle this challenge, researchers have developed a variety of techniques, ranging from motion-based methodologies [20] and appearance-centric approaches [21] to spatio-temporal strategies that harness cues such as motion patterns [22], color histograms [23], and the interplay between spatial and temporal elements within the scene [24,25].
A fundamental requirement for the efficacy of these methods is an ample corpus of labeled training data. However, the dearth of labeled violent video segments (ViVis) hampers the training of robust violence recognition models tailored for CCTV systems [26]. This paucity impedes the development of models adept at recognizing infrequent or emerging violent actions. To overcome this challenge, researchers have explored various techniques, such as unsupervised learning, which uses tools like generative adversarial networks (GANs) and clustering [27,28] to derive meaningful representations from unlabeled videos. An alternative and promising avenue is transfer learning [29], wherein a model pretrained on a broader dataset [30] is fine-tuned on a smaller labeled violence dataset, enhancing both accuracy and adaptability in scenarios with restricted availability of labeled data.
While the vision Transformer (ViT) architecture [31] has garnered acclaim across various visual tasks, its adoption in violence recognition is not without challenges. Notably, ViT's expansive model size can hinder deployment on resource-constrained devices or in real-time applications [32]. Furthermore, its frame-by-frame feature extraction might compromise temporal context [33]. To mitigate these concerns, innovations like the multilayer perceptron mixer (MLP-Mixer) [34] have emerged. These streamlined models promise expedited inference and reduced computational overhead, making them viable for deployment on embedded systems, smartphones, and energy-efficient cameras. Additionally, the efficacy of the MLP-Mixer in capturing temporal dynamics within HAR has been empirically validated [35].
Fan et al. introduced an innovative approach, termed super image for action recognition (SIFAR) [36], aimed at enhancing temporal feature extraction. SIFAR transforms input video frames into composite images, reducing action recognition to image classification. While this amalgamation of sequential frames into a single, comprehensive image captures temporal nuances, the resizing it requires sacrifices fine-grained detail. To counteract this drawback, this study reintroduces the original video frames, constructing a hybrid dataset to restore the eroded granularity.
In pursuit of computational efficiency, this paper advocates a backbone network centered on the MLP-Mixer architecture for video violence recognition. Confronting the limited scope of video violence datasets, a novel data augmentation technique—sequential image collage (SIC)—is presented in this research. SIC amalgamates the super images of SIFAR [36] with the original video frames, circumventing both the intricate feature design typical of CNNs and the data-intensive nature of Transformers.
Specifically, the primary contributions of this paper are listed as follows.
Proposing an MLP-Mixer-driven framework for violence recognition, distinguished by its reduced computational demands vis-à-vis Transformer-centric models.
Introducing a composite dataset comprising both image collages, which capture the spatial and temporal relationships between video frames, and unmodified video frames. This composite dataset augments the training process, culminating in a spatio-temporal model with superior action recognition capabilities.
The rest of this paper is structured in the following manner: Section 2 traces the development of HAR from handcrafted features and deep learning to Transformers; Section 3 details the SIC architecture proposed in this work; Section 4 demonstrates the efficiency of SIC; and Section 5 concludes the paper.
3. Methodology
The proposed SIC is described in detail in this section. The architecture of the SIC is shown in Figure 1.
3.1. Main Architecture
In this paper, the initial step involves extracting individual frames from the video sequence. These frames are then amalgamated into composite images, creating an image collage that encapsulates both the spatial and temporal dimensions, so that each collage captures a continuum of information over a specific time window. These synthesized images, together with the original frames to compensate for any potential information loss, are fed into the MLP-Mixer backbone. Based on these processed data, the model determines whether the given input depicts violent or non-violent content. The comprehensive architecture is shown in Figure 1.
3.2. Dual-form Dataset
In prior research, training typically relied on individual images, overlooking the interconnectedness of consecutive video frames. Given that adjacent frames in a video sequence are often temporally correlated, neglecting this inherent relationship could compromise training efficacy. To address this, we adopt the approach delineated in [36]. Initially, videos are decomposed into individual frame images; subsequently, sets of nine consecutive frames are amalgamated to form a unified image collage, as illustrated in Figure 2.
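As a concrete sketch of this collage construction (not the authors' exact implementation; frame extraction and resizing are assumed to have happened upstream, e.g., with OpenCV), nine consecutive frames can be tiled into a 3 × 3 grid with NumPy:

```python
import numpy as np

def make_collage(frames):
    """Tile nine consecutive video frames into one 3x3 image collage.

    frames: list of 9 arrays, each of shape (h, w, 3), already resized so
    that the resulting collage matches the original frame resolution.
    """
    assert len(frames) == 9, "a collage is built from nine consecutive frames"
    # Concatenate three frames per row, then stack the three rows vertically.
    rows = [np.concatenate(frames[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
    return np.concatenate(rows, axis=0)  # shape: (3h, 3w, 3)
```

Sliding this operation over a decomposed video produces one collage per window of nine frames, which is then paired with those frames in the dual-form dataset.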
3.3. MLP-Mixer
While both the mechanism of self-attention in Transformers and the operation of convolutions in CNNs have demonstrated their efficacy in numerous studies, they are not indispensable. The MLP-Mixer emerges as a compelling alternative, leveraging only MLPs yet delivering commendable performance across various image recognition tasks due to its inherent simplicity.
The foundational step mirrors that of the vision Transformer: each input is fragmented into contiguous and disjoint patches, initiating the patch embedding process. These embedded features subsequently traverse through a Mixer Layer. This layer houses two distinct MLP categories: channel-mixing MLPs and token-mixing MLPs, with each serving a unique purpose in data fusion.
Token-mixing MLPs are adept at deciphering spatial intricacies within the image, facilitating the extraction of spatial interrelations among the patches. In contrast, channel-mixing MLPs delve into the image’s channel-specific nuances, extracting pertinent features spanning various channels—such as the RGB components in an image.
After passing through the mixer layers, global average pooling is applied to the output, computing the mean value over all spatial positions within each channel. Finally, the resulting vector is routed through a fully connected layer followed by a Softmax activation to classify the input as either violent or non-violent.
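As an illustrative sketch of this pooling-and-classification head (the hidden size and weight shapes below are hypothetical, not the configuration used in the paper):

```python
import numpy as np

def classification_head(tokens, W_fc, b_fc):
    """Global average pooling followed by a fully connected layer and Softmax.

    tokens: (N, C) output of the final mixer layer (N patches, C channels).
    W_fc:   (C, 2) weights of the fully connected layer (two classes).
    b_fc:   (2,)   bias of the fully connected layer.
    """
    pooled = tokens.mean(axis=0)         # mean over spatial positions per channel
    logits = pooled @ W_fc + b_fc        # fully connected projection to 2 logits
    exp = np.exp(logits - logits.max())  # numerically stable Softmax
    return exp / exp.sum()               # class probabilities (violent / non-violent)
```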
A salient advantage of the MLP-Mixer lies in its streamlined design. This design not only simplifies implementation but also strikes a harmonious balance between computational efficiency and performance, positioning it favorably against more complex, state-of-the-art models.
3.4. Sequential Image Collage
The dual-form dataset, denoted as $I$, comprises both the image collages $S$ and the original video frames $F$, where $S \subset I$ and $F \subset I$. Here, $F = \{f_1, \ldots, f_m\}$ and $S = \{s_1, \ldots, s_n\}$, with $m$ and $n$ representing the number of original video frames and image collages, respectively. The dimensions of an original video frame are $H \times W$. For the video frames to fit into a $3 \times 3$ image collage of the same dimensions, nine consecutive frames of the original video are resized to $\frac{H}{3} \times \frac{W}{3}$. Every image collage is then paired with its nine original video frames as the input data, as shown in Figure 2.
$I$ is then patch embedded before being fed into the proposed model. Patch embedding refers to the concept proposed in the original vision Transformer paper [31]: each image is split into non-overlapping patches, which are flattened and projected to a fixed latent vector size $C$, yielding the feature map $X \in \mathbb{R}^{N \times C}$. Here, $C$ is the designated hidden dimension and $N$ is the number of non-overlapping image patches. $N$ is computed as $N = HW/P^2$, given an input image with patch size $P \times P$ and resolution $H \times W$.
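A minimal NumPy sketch of this patch embedding step (with a random matrix standing in for the learned projection; the patch size and hidden dimension in the usage below are hypothetical):

```python
import numpy as np

def patch_embed(image, W_proj, P):
    """Split an image into non-overlapping P x P patches and project them.

    image:  (H, W, 3) input image, with H and W assumed divisible by P.
    W_proj: (P*P*3, C) projection to the hidden dimension C.
    Returns the feature map X of shape (N, C), where N = H*W / P**2.
    """
    H, W, ch = image.shape
    N = (H // P) * (W // P)
    # Rearrange the image into N flattened patches of length P*P*3.
    patches = image.reshape(H // P, P, W // P, P, ch)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, P * P * ch)
    return patches @ W_proj  # X in R^{N x C}
```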
Subsequent to patch embedding, the data pass through a token-mixing operation followed by a channel-mixing operation in a mixer layer. Mathematically, this step unfolds as

$U_{*,i} = X_{*,i} + W_2\,\sigma\left(W_1\,\mathrm{LN}(X)_{*,i}\right)$, for $i = 1, \ldots, C$,

and

$Y_{j,*} = U_{j,*} + W_4\,\sigma\left(W_3\,\mathrm{LN}(U)_{j,*}\right)$, for $j = 1, \ldots, N$,

where layer normalization is denoted as $\mathrm{LN}$, the outputs of token mixing and channel mixing are denoted as $U$ and $Y$, and the indices $i$ and $j$ run over the channel and spatial positions, respectively.
Token mixing commences by normalizing the feature map $X$ with layer normalization, then applying the weight matrices $W_1$ and $W_2$ as linear transformations with a non-linear activation function $\sigma$, such as ReLU [55], between them. $U$ is then obtained by adding this processed output to the original input through a residual connection, and later serves as the input to the channel-mixing operation.
Conversely, the channel-mixing operation commences by applying layer normalization to the token-mixing output $U$. The weight matrices $W_3$ and $W_4$ are then applied as linear transformations, again with an activation function $\sigma$ between them, and the result is added back to $U$ to produce $Y$. With the help of the channel-mixing and token-mixing operations, the model captures both local and global features from the violent video dataset.
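Putting the two operations together, a single mixer layer can be sketched in NumPy as follows. This is a simplified illustration: the hidden widths implied by the weight shapes are hypothetical, and the bias terms of each linear transformation are omitted for brevity.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    """Layer normalization over the channel (last) dimension."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def relu(x):
    """Non-linear activation used between the two linear transformations."""
    return np.maximum(0.0, x)

def mixer_layer(X, W1, W2, W3, W4):
    """One mixer layer: token mixing, then channel mixing, with residuals.

    X:  (N, C) patch-embedded feature map.
    W1: (D_s, N) and W2: (N, D_s) -- token-mixing MLP, acting on columns of X.
    W3: (C, D_c) and W4: (D_c, C) -- channel-mixing MLP, acting on rows of U.
    """
    U = X + W2 @ relu(W1 @ layer_norm(X))   # token mixing + residual connection
    Y = U + relu(layer_norm(U) @ W3) @ W4   # channel mixing + residual connection
    return Y
```

Token mixing operates across the spatial positions of each channel, while channel mixing operates across the channels of each spatial position, mirroring the two equations above.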
5. Conclusions
A novel methodology for discerning violent behaviors within videos is introduced in this paper, leveraging the capabilities of the MLP-Mixer architecture in tandem with the approach of a sequential image collage (SIC) dual-form dataset. Distinctively, an efficient alternative to self-attention mechanisms and conventional convolutional operations is offered by the MLP-based framework. Its inherent computational simplicity renders it particularly apt for deployment in resource-constrained environments.
Central to our approach is the integration of both video frames and image collages (super image) within the SIC dual-form dataset. This comprehensive dataset amalgamation empowers the model to adeptly assimilate spatio-temporal intricacies inherent to violent actions, potentially augmenting its proficiency in violence recognition tasks.
Encouragingly, empirical evaluations showcased the SIC methodology's competitive performance across three benchmark violence datasets, achieving comparable accuracy while requiring fewer parameters and FLOPs than other state-of-the-art models. We envisage that the insights and methodologies delineated herein will pave the way for enhanced real-world applications in the realm of violent video analysis.