3.1. Data Preprocessing
In the data preprocessing stage, sparse sampling, RGB2Gray, and frame-grouping operations are performed successively so that the target video clip is converted into a format suitable for the subsequent 2D CNN.
The purpose of sparse sampling is to reduce redundancy in videos at the frame level. In practice, videos contain a high degree of redundancy, especially at high frame rates, and dense sampling inevitably captures many highly similar consecutive frames. For violence detection in videos, dense temporal sampling is therefore costly in both time and computing resources. In contrast, sparse sampling that preserves the relevant information can model the long-range temporal structure of a video at much lower cost [8,20]. In this work, we adopt the sparse sampling strategy proposed in [8] to remove redundancy. As shown in Figure 1, a video clip of fixed duration is divided into 3T segments, and one frame is sampled from each segment. Thus, 3T RGB frames, F = [F1, F2, …, F3T], are captured from the video clip for use in the following procedure.
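As a concrete illustration, a minimal sketch of the segment-based sparse sampling is given below. The helper name and the random choice within each segment are assumptions for illustration; whether the frame is drawn randomly or deterministically within each segment follows the strategy of [8].

```python
import numpy as np

def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Split a clip of num_frames frames into num_segments equal segments and
    pick one frame index per segment (randomly within each segment)."""
    rng = np.random.default_rng() if rng is None else rng
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return np.array([int(rng.integers(lo, hi)) if hi > lo else int(lo)
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

# Example: draw 3T = 12 frames F1..F12 from a 300-frame clip (T = 4).
indices = sparse_sample_indices(num_frames=300, num_segments=12)
```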
Based on the sparsely sampled RGB frames, the RGB2Gray operation is performed to further reduce redundancy at the channel level. Traditionally, 2D/3D CNNs learn features from RGB images/frames to accomplish tasks such as image classification and object detection and tracking. However, although color information may be useful for violence detection, motion information has been found to be the most crucial [27,28,29,30,39]. Moreover, compacting RGB images into single-channel images helps lower the computational load. Based on this observation, RGB images were squeezed into single-channel images in [15] by averaging along the channel dimension. In this work, we instead convert the sparsely sampled RGB images into grayscale images using the standard weighted averaging method. Mathematically, the obtained grayscale pixel intensity Y is given by

Y = 0.299 R + 0.587 G + 0.114 B,

where R, G, and B represent the intensities of the red, green, and blue channels of the pixel, respectively, and the coefficients 0.299, 0.587, and 0.114 are the standard weights derived from the human luminance perception system. As shown in Figure 1, 3T grayscale frames, X = [X1, X2, …, X3T], are obtained from the video clip.
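A minimal sketch of the RGB2Gray step, assuming the 3T sampled frames are stored as a [3T, 3, H, W] tensor with channels ordered R, G, B (the function name and tensor layout are illustrative):

```python
import torch

def rgb_to_gray(frames: torch.Tensor) -> torch.Tensor:
    """Weighted RGB-to-grayscale conversion for a stack of frames.

    frames: tensor of shape [3T, 3, H, W] with channels ordered R, G, B.
    returns: tensor of shape [3T, 1, H, W].
    """
    weights = torch.tensor([0.299, 0.587, 0.114],
                           dtype=frames.dtype, device=frames.device)
    return (frames * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)
```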
The function of frame-grouping is to reshape the obtained grayscale frames into three-channel images. According to Figure 1, the 3T grayscale frames are divided into T groups, each of which contains three consecutive frames. For each group, a three-channel image is constructed by placing the three grayscale frames into the R, G, and B channels [15]. Therefore, T three-channel images, I = [I1, I2, …, IT], are obtained from a video clip and fed into the following 2D CNN as input. The shape of I is [T, C, H, W], where T is the temporal dimension, C denotes the number of channels, and H and W represent the height and width of the image, respectively. As reported in [15], the merit of frame-grouping is twofold. Firstly, the synthesized three-channel images match the input format of 2D CNNs. Secondly, they enable 2D CNNs to learn short-term motion features. In this work, we take advantage of frame-grouping and propose a new lightweight 2D CNN, i.e., an improved EfficientNet-B0, to extract both short-term and long-term motion features effectively.
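Assuming the 3T grayscale frames are stored consecutively in a [3T, 1, H, W] tensor, the frame-grouping step reduces to a reshape, as sketched below (function name is illustrative):

```python
import torch

def group_frames(gray: torch.Tensor) -> torch.Tensor:
    """Frame-grouping: pack 3T grayscale frames [3T, 1, H, W] into T
    three-channel images [T, 3, H, W]; each group of three consecutive
    frames fills the R, G, and B channels of one synthesized image."""
    three_t, _, h, w = gray.shape
    assert three_t % 3 == 0, "the number of frames must be a multiple of three"
    return gray.reshape(three_t // 3, 3, h, w)
```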
3.2. Improved EfficientNet-B0
The improved EfficientNet-B0 is a modified version of EfficientNet-B0 [18]. Specifically, we propose a new attention module, namely Bi-LTMA, and integrate it into EfficientNet-B0 to enable long-term motion feature extraction. In addition, the TSM module [20] is adopted to realize temporal feature interaction. In what follows, we briefly review EfficientNet-B0 [18] and then present the details of the improved EfficientNet-B0.
EfficientNet-B0 is a state-of-the-art lightweight 2D CNN obtained through a multi-objective neural architecture search that optimizes both accuracy and FLOPs [18]. As shown in Figure 2, it consists of nine stages, and the specific structure of each stage is listed in Table 1. Notably, the second to eighth stages are composed of inverted residual blocks, named mobile inverted bottleneck (MBConv) blocks in Table 1. These MBConvs resemble the basic block of MobileNetV3 [48]. The structure of the inverted residual block is shown in Figure 3. Within the block, depthwise separable convolution (DW Conv) is employed to decrease the computational cost effectively. For further details, please refer to [18,48] and the references therein. Taking advantage of EfficientNet-B0's strengths in both accuracy and FLOPs, we select it as the backbone of the proposed model.
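For reference, the unmodified backbone can be instantiated from a standard library; whether ImageNet-pretrained weights are used is not stated here, so the pretrained weights below are an assumption (requires torchvision 0.13 or later):

```python
import torchvision

# EfficientNet-B0 as provided by torchvision; the improved model in this work
# further inserts TSM and Bi-LTMA into its MBConv (inverted residual) blocks.
backbone = torchvision.models.efficientnet_b0(weights="IMAGENET1K_V1")
```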
The goal of the improved EfficientNet-B0 is to gain the capabilities of long-term motion feature extraction and temporal feature interaction. To this end, we integrate the Bi-LTMA module and the TSM module into the inverted residual blocks, i.e., the second to eighth stages, of EfficientNet-B0. The modified inverted residual block is shown in Figure 4. In what follows, we present the details of our proposed Bi-LTMA and of the adopted TSM.
The Bi-LTMA module is proposed to extract long-term motion information effectively. As reported in the literature [1,2,15,41], long-term motion information plays a key role in violence detection in videos. In [1,2], LSTM and BiLSTM were used to aggregate temporal features. In [15], a temporal attention module, named T-SE, was employed jointly with frame-grouping to recalibrate and aggregate global temporal features. In [41], motion features were defined as the channel-wise difference between adjacent frames, and the motion excitation (ME) module was proposed to excite the motion-sensitive channels. ME focuses on channel attention and discards detailed spatial features through its pooling operation. Our idea of Bi-LTMA is partly inspired by ME. Unlike ME, Bi-LTMA takes both the spatial and channel dimensions into consideration and captures motion features in both the forward and backward directions.
The architecture of the Bi-LTMA module is shown in Figure 5. To meet the lightweight design requirement, the input feature F is downscaled by a 1 × 1 2D convolution K1. The downscaled feature is given by

F_r = K1 ∗ F,

where the number of channels is reduced from C to C/r, r is the reduction ratio, and ∗ denotes the convolution operation. Based on the downscaled feature, the motion information is then derived. Similar to ME [44], the motion information is obtained by computing the feature-level difference between two adjacent frames. A 3 × 3 2D convolution K3 is used as the mapping function to deal with the problem of spatial misalignment. Specifically, given the features of two adjacent frames F_r(t) and F_r(t + 1), we compute the bi-directional motion information as follows:

M_f(t) = K3 ∗ F_r(t + 1) − F_r(t),
M_b(t + 1) = K3 ∗ F_r(t) − F_r(t + 1),

where M_f(t) is the forward difference and M_b(t) is the backward difference. For the boundary time steps, the motion information is set to zero, i.e., M_f(T) = 0 and M_b(1) = 0. Then, another 1 × 1 2D convolution K2 is employed together with a concatenation (concat) layer in each branch to restore the original shape of the input feature F. After that, the forward and backward attention weights, A_f and A_b, are generated separately by using sigmoid functions:

A_f = σ(K2 ∗ Concat(M_f(1), …, M_f(T))),
A_b = σ(K2 ∗ Concat(M_b(1), …, M_b(T))),

where σ(·) is the sigmoid activation function and Concat(·) stands for the concatenation operation along the temporal dimension. The attention map A is computed by averaging the forward and backward attention weights:

A = (A_f + A_b) / 2.

As shown in Figure 5, a weighted feature is obtained by performing the Hadamard product F ⊙ A. DropConnect is an effective method to improve the generalization ability of neural network models [49]; it randomly sets weights to zero with a certain probability. To improve the generalization ability of the model, we place a DropConnect structure, denoted DC(·), after the ⊙ operation. Moreover, to preserve the original feature while making use of the Bi-LTMA attention mechanism, we employ a shortcut connection to fuse the original feature with the weighted one. Mathematically, the output feature of the Bi-LTMA module can be expressed as

F_o = F + DC(F ⊙ A).
The TSM module [20] is adopted in the modified inverted residual block to further enhance the temporal modeling of the improved EfficientNet-B0. As reported in [20], the TSM module enables 2D CNNs to perform joint spatial–temporal modeling by shifting part of the channels along the temporal dimension. Moreover, inserting the TSM module into 2D CNNs imposes no additional computational cost. These merits make TSM attractive for real-time video understanding. It is also worth noting that in-place TSM may lower the spatial feature learning capability of the backbone model to some extent [20]; residual TSM is a promising alternative that tackles this problem. Since most stages of EfficientNet-B0 consist of inverted residual blocks, we insert bi-directional TSM modules into the residual branches of these stages to realize temporal feature interaction while keeping the good spatial feature learning capability of the backbone, as shown in Figure 4.
Given a video clip, the principle of the bi-directional TSM module is demonstrated in Figure 6. As described above, T three-channel images are obtained from the video clip after frame-grouping. For a given three-channel image, the feature extracted by the 2D CNN layer right before the TSM module is denoted with a specific color. Assume that the feature extracted from each three-channel image consists of C feature maps of height H and width W, where C is the number of channels. As shown in Figure 6, the TSM module realizes temporal feature interaction in three steps. Firstly, the features of all T three-channel images are reshaped into a tensor of dimensions H × W, T, and C. Then, a small fraction of the channels is shifted along the temporal dimension of the tensor: some of them are shifted backward by one time step, the others are shifted forward by one time step, and the vacated positions are padded with zeros. Finally, the resultant tensor is reshaped to recover T output features in the same format as the input. Evidently, the feature at an arbitrary time step t is fused with those at time steps t − 1 and t + 1, which means that the temporal receptive field is effectively expanded by the TSM module at no additional computational cost.
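The shift itself can be written in a few lines. The sketch below follows the publicly available TSM formulation; the shifted fraction (fold_div) is a hyperparameter whose value here is only a placeholder.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Bi-directional temporal shift in the spirit of TSM [20].

    x: features of shape [N*T, C, H, W]; n_segments = T.
    1/fold_div of the channels are shifted backward in time, 1/fold_div forward,
    the rest stay in place; the vacated positions are zero-padded.
    """
    nt, c, h, w = x.shape
    t = n_segments
    x = x.view(nt // t, t, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift backward by one step
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift forward by one step
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out.view(nt, c, h, w)
```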
3.3. Auxiliary Loss
We introduce an auxiliary loss to improve the classification capability of the proposed model. As shown in Figure 1, the auxiliary head is inserted at an intermediate stage of the improved EfficientNet-B0. A Softmax classifier is employed in both the main branch and the auxiliary branch. In practice, auxiliary heads are used only during model training and are discarded at inference [42]. Therefore, the auxiliary head imposes no additional computational cost on the model during inference. In what follows, we introduce the auxiliary loss in detail.
An important issue is determining the intermediate layer to which the auxiliary head is attached. In principle, the early layers of CNNs learn low-level features while the later layers learn high-level features, and it is the high-level features in the final layers that matter most for classification [45]. As reported in [44], if a single auxiliary head is inserted at too early a layer of a typical CNN, the accuracy of the final classifier can be badly harmed. The underlying reason is that an auxiliary classifier attached to an early layer favors short-term gains from early features but may collapse the information required to generate high-quality features in later layers. On the other hand, if the auxiliary head is inserted at too deep a layer, the auxiliary loss loses its advantage in combating the vanishing gradient problem [19]. Based on these observations, we attach the auxiliary head right after stage 7 of the improved EfficientNet-B0, as shown in Figure 7.
Another issue is designing the auxiliary branch so as to obtain better final classification accuracy. In some earlier approaches, auxiliary classifiers were connected directly to hidden layers, and experimental results suggested that the final accuracy was hardly improved [43]. As analyzed in [19], a discrepancy between the optimization directions of the main classifier and the auxiliary ones can degrade the model accuracy. To relieve this optimization inconsistency between the auxiliary and main classifiers, the auxiliary branch is designed to resemble the structure of stage 9 of EfficientNet-B0, as shown in Figure 7. Intuitively, the auxiliary branch can thus learn discriminative high-level features in a way similar to the main branch.
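A sketch of such an auxiliary branch is given below. The channel sizes follow the standard EfficientNet-B0 configuration (192 channels after stage 7, 1280 before the classifier) and the two-class output are assumptions here, not values taken from the paper.

```python
import torch.nn as nn

# Auxiliary branch mirroring stage 9 of EfficientNet-B0:
# 1x1 conv -> global average pooling -> dropout -> fully connected layer.
aux_branch = nn.Sequential(
    nn.Conv2d(192, 1280, kernel_size=1, bias=False),
    nn.BatchNorm2d(1280),
    nn.SiLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.2),
    nn.Linear(1280, 2),  # violent / non-violent scores before Softmax
)
```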
For a given video clip, the total loss is obtained as shown in Figure 1. The main classifier branch and the auxiliary classifier branch operate in parallel. For each of the T three-channel images I = [I1, I2, …, IT], the two Softmax classifiers independently produce instant scores, from which the instant cross-entropy losses of the two branches are calculated. The main loss L_main and the auxiliary loss L_aux are obtained by averaging the instant cross-entropy losses over the T images. The total loss is formulated as

L_total = L_main + λ · L_aux,

where λ is a tunable balancing coefficient.
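A minimal sketch of this loss computation is given below; the tensor shapes, helper name, and the default value of λ are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, labels, lam=0.3):
    """Total training loss L_total = L_main + lam * L_aux.

    main_logits, aux_logits: [N, T, num_classes] instant scores (before Softmax)
    from the main and auxiliary classifiers; labels: [N] clip-level labels.
    The instant cross-entropy losses are averaged over the T images and the batch.
    The value of lam is the tunable balancing coefficient (0.3 is a placeholder).
    """
    n, t, k = main_logits.shape
    flat_labels = labels.unsqueeze(1).expand(n, t).reshape(-1)
    l_main = F.cross_entropy(main_logits.reshape(-1, k), flat_labels)
    l_aux = F.cross_entropy(aux_logits.reshape(-1, k), flat_labels)
    return l_main + lam * l_aux
```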