To address the problem that existing methods generalize so strongly that they can reconstruct abnormal samples as well as normal ones, and that they cannot exploit high-level semantics, we propose an encoder–decoder model named SMAMS, which is based on a spatiotemporal masked autoencoder (ST-MAE) and incorporates multiple memory modules and skip connections. The network architecture of our method is shown in
Figure 1. First, video events are represented as spatiotemporal cubes, and a portion of the foreground patches is randomly masked. The unmasked patches are then fed into the spatiotemporal masked autoencoder, which extracts both the spatiotemporal features and the high-level semantics of the video events. Second, to reconstruct normal events well and abnormal events poorly, multiple memory modules are added on top of the spatiotemporal masked autoencoder to store the normal patterns of unmasked video patches from different feature layers; normal event patterns are typically simple and predictable, whereas abnormal event patterns tend to be more complex and irregular. In addition, skip connections are added to compensate for the loss of crucial features caused by the memory modules. Finally, the reconstructed masked video data are obtained from the decoding module, and the normality score for each input is computed from the difference between the reconstructed data and the input data.
3.1. ST-MAE Based Memory Encoder–Decoder
Traditional convolutional autoencoders capture temporal dependencies poorly. To better exploit the high-level semantics and temporal contextual cues of video events, we introduce the spatiotemporal masked autoencoder (ST-MAE) into the field of VAD. ST-MAE reconstructs masked training data, which serves both to denoise the input and to fully learn the high-level semantics and spatiotemporal features of normal samples.
Spatiotemporal cubes (STCs) are used in our proposed method to represent video events; each STC is constructed from temporally contiguous foreground patches. STC extraction is critical for detecting anomalies [33,34], since it allows subsequent modeling to concentrate on relevant foreground information instead of unimportant background in the video clips. We first localize the foreground objects in every video frame, adopting the video event extraction approach of [34,35]. Second, according to each object's position, $T$ foreground patches are taken from the current frame and the $T-1$ subsequent frames. Finally, the $T$ foreground patches are resized to $H \times W$ and stacked to generate an STC $x \in \mathbb{R}^{T \times H \times W}$. Given a video event represented by the STC $x$, we randomly mask 90% of its foreground patches. Let the masked spatiotemporal patches be $x^{m} = \{x^{m}_{i}\}_{i=1}^{N_{m}}$ and the unmasked spatiotemporal patches be $x^{u} = \{x^{u}_{i}\}_{i=1}^{N_{u}}$. We follow the settings of ST-MAE and use ViT as the backbone network for both the encoder and the decoder. The unmasked spatiotemporal patch sequence $x^{u}$ is input to the encoder, resulting in a hidden feature $y$, which is utilized as the query to retrieve and update items within the memory module:

$$y = f_{e}(x^{u}; \theta_{e}), \qquad \hat{y} = f_{m}(y; \theta_{m}),$$

where $\theta_{e}$ and $\theta_{m}$ represent the parameters of the encoder $f_{e}$ and the memory module $f_{m}$, and $\hat{y}$ represents the memory module's output feature representation. The decoder is then used to reconstruct the masked spatiotemporal patches:

$$\hat{x}^{m} = f_{d}(\hat{y}; \theta_{d}),$$

where $\theta_{d}$ represents the parameters of the decoder $f_{d}$.
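As a rough illustration of the masking step, the sketch below (function and variable names are ours, not from the paper) randomly partitions the patch indices of one STC into a masked set and an unmasked set at the 90% ratio used in the paper; only the unmasked indices would be fed to the encoder.

```python
import numpy as np

def random_mask_patches(num_patches, mask_ratio=0.9, seed=0):
    """Split patch indices into masked/unmasked sets, MAE-style.

    Illustrative sketch: the paper masks 90% of the foreground
    patches of each spatiotemporal cube (STC).
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])     # reconstruction targets
    unmasked_idx = np.sort(perm[num_masked:])   # encoder input
    return masked_idx, unmasked_idx

masked, unmasked = random_mask_patches(20)
print(len(masked), len(unmasked))  # 18 2
```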
In particular, we adopt the ST-MAE model combined with the memory module as the backbone of our masked autoencoder. As shown in
Figure 2, the encoder combines the Transformer architecture with memory modules and is composed of multiple self-attention layers, feed-forward layers, and memory modules, while the decoder is mainly responsible for reconstructing the masked input patches, following the original Transformer design. In VAD, ST-MAE effectively models the temporal and spatial correlations in video sequences, leading to a better understanding of the differences between normal and abnormal patterns, and the memory module memorizes the normal patterns of the unmasked patches. Specifically, the unmasked spatiotemporal patches serve as the input to the encoder. Given the sequence of unmasked patches $x^{u}$, it is first flattened into a 1-dimensional token sequence. Each token is then mapped to a lower-dimensional embedding using a trainable linear projection $E$. To preserve the spatiotemporal information of the patches, learnable spatiotemporal embeddings and learnable position embeddings $E_{pos}$ are added to the token embeddings, resulting in the following token sequence:

$$z_{0} = [x^{u}_{1}E;\, x^{u}_{2}E;\, \dots;\, x^{u}_{N_{u}}E] + E_{pos}.$$

Then, we pass the token sequence $z_{0}$ through multiple encoder blocks, where each block performs the following computations to learn the spatiotemporal features $f$ of the unmasked patches:
$$X^{0} = z_{0},$$
$$Q^{l} = X^{l-1}W_{Q}^{l-1}, \qquad K^{l} = [X^{l-1}W_{K}^{l-1};\, M_{K}^{l}], \qquad V^{l} = [X^{l-1}W_{V}^{l-1};\, M_{V}^{l}],$$
$$X^{l} = \mathrm{MHA}(Q^{l}, K^{l}, V^{l}) + X^{l-1}, \qquad l = 1, \dots, L,$$

where $X^{0}$ is the encoder input matrix formed from the unmasked patches, $X^{l-1}$ and $X^{l}$ are the input and output of layer $l$, and $W_{Q}^{l-1}$, $W_{K}^{l-1}$, $W_{V}^{l-1}$ are the linear projections of the encoder's layer $l-1$ for the query, key, and value of the multi-head attention operator, respectively. $M_{K}^{l}$ and $M_{V}^{l}$ are the layer-$l$ learnable memory matrices that are concatenated with $K^{l}$ and $V^{l}$. The multi-head attention operator $\mathrm{MHA}$ adheres to the standard architecture of ViT and the Transformer. $L$ denotes the number of encoder blocks; multiple blocks are stacked to form a deep architecture, increasing the model's representational capacity. Based on our experiments and evaluations, setting $L = 3$ yields the optimal configuration for our model.
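The key idea of the block above, attending over keys and values that are extended by learnable memory matrices, can be sketched as a single-head, single-layer toy (head splitting, feed-forward sublayers, and layer normalization are omitted for brevity; all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_attention_block(X, Wq, Wk, Wv, Mk, Mv):
    """One attention step where memory matrices Mk, Mv are concatenated
    with the projected keys/values, so tokens can also attend to stored
    normal patterns. Simplified sketch of the encoder-block equations."""
    Q = X @ Wq                      # (n, d)
    K = np.vstack([X @ Wk, Mk])     # (n + m, d): keys extended by memory
    V = np.vstack([X @ Wv, Mv])     # (n + m, d)
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V + X                # residual connection, as in ViT blocks

rng = np.random.default_rng(0)
n, m, d = 4, 2, 8                   # 4 tokens, 2 memory items, width 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Mk, Mv = rng.standard_normal((m, d)), rng.standard_normal((m, d))
out = memory_attention_block(X, Wq, Wk, Wv, Mk, Mv)
print(out.shape)  # (4, 8)
```

Note that the memory items only enlarge the key/value set; the output keeps the shape of the token sequence, so blocks can be stacked.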
Subsequently, the decoder is employed to reconstruct the sequence of masked patches. The generated STC $\hat{x}$ is considered the reconstruction of the original sequence $x$, and the ST-MAE is trained by minimizing the reconstruction error.
3.2. Multi-Layer Memory Module with Skip Connections
First, we introduce the memory module, which consists of two main components: a query-and-match scheme between encoded features and stored memory items, and a memory update strategy. By utilizing a memory module, the model can effectively suppress the reconstruction of anomalous samples, significantly improving anomaly detection performance. For the query-and-match phase, the memory module is designed as a matrix $M \in \mathbb{R}^{N \times C}$ consisting of $N$ feature vectors of dimension $C$. The memory matrix $M$ stores the normal patterns observed during the training process. The dimension of the feature representation $z$ is the same as $C$, i.e., $z \in \mathbb{R}^{C}$. Let the row vector $m_{i}$ represent a memory item in the matrix $M$, where $i$ denotes the $i$-th row of $M$. Given a query feature $z$, we retrieve similar memory items and reconstruct the feature $\hat{z}$ as a weighted combination of the rows of $M$:

$$\hat{z} = wM = \sum_{i=1}^{N} w_{i} m_{i},$$

where $w \in \mathbb{R}^{N}$ represents the similarity between the query feature $z$ and the memory items, which can be mathematically expressed as follows:

$$w_{i} = \frac{\exp(d(z, m_{i}))}{\sum_{j=1}^{N} \exp(d(z, m_{j}))},$$

where $d(z, m_{i})$ represents the cosine similarity:

$$d(z, m_{i}) = \frac{z m_{i}^{\top}}{\|z\| \, \|m_{i}\|}.$$
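The query-and-match step can be sketched in a few lines of NumPy (names are illustrative): cosine similarity between the query and every memory item, a softmax to obtain the matching weights $w$, and a weighted sum to form $\hat{z}$.

```python
import numpy as np

def memory_read(z, M, eps=1e-12):
    """Retrieve from memory: cosine similarity -> softmax weights w ->
    weighted combination z_hat of the memory items (rows of M)."""
    d = (M @ z) / (np.linalg.norm(M, axis=1) * np.linalg.norm(z) + eps)
    w = np.exp(d - d.max())
    w = w / w.sum()          # softmax over the N memory items
    z_hat = w @ M            # reconstruct the feature from normal patterns
    return z_hat, w

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))   # N = 5 memory items of dimension C = 3
z = rng.standard_normal(3)
z_hat, w = memory_read(z, M)
print(z_hat.shape, round(float(w.sum()), 6))  # (3,) 1.0
```

Because $\hat{z}$ is constrained to be a convex combination of stored normal patterns, features of anomalous inputs are poorly represented and reconstruct badly downstream.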
For the memory update strategy, the following metric is utilized to gauge the matching degree between memory item $m_{i}$ and the query features $\{z_{k}\}$:

$$v_{k,i} = \frac{\exp(d(z_{k}, m_{i}))}{\sum_{k'} \exp(d(z_{k'}, m_{i}))}.$$

Each memory item is then updated to:

$$m_{i} \leftarrow \frac{m_{i} + \sum_{k} v_{k,i} z_{k}}{\left\| m_{i} + \sum_{k} v_{k,i} z_{k} \right\|_{2}},$$

where $\|\cdot\|_{2}$ represents the $\ell_{2}$ norm; the weighted average of the matched features is computed so that the pattern stored in each memory item becomes more representative of normal data. We adopt the approach proposed in [36] and introduce three memory modules between the encoder and decoder of the spatiotemporal masked autoencoder. These memory modules are designed to capture the features of the unmasked patches, and skip connections are incorporated to ensure the preservation of crucial information.
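A minimal sketch of the update step follows, assuming the MNAD-style rule cited above (dot-product similarity stands in for cosine for brevity; names are ours): each memory item absorbs a weighted average of the queries that match it and is then $\ell_2$-normalized.

```python
import numpy as np

def memory_update(M, Z):
    """Update memory items toward their matched queries, then renormalize.

    Sketch assuming an MNAD-style update: v[k, i] is the matching degree
    of query z_k to item m_i, normalized over the queries for each item.
    """
    sims = Z @ M.T                                # (K, N) similarities
    v = np.exp(sims - sims.max(axis=0, keepdims=True))
    v = v / v.sum(axis=0, keepdims=True)          # normalize over queries
    M_new = M + v.T @ Z                           # add weighted query average
    return M_new / np.linalg.norm(M_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))                   # 5 memory items, C = 3
Z = rng.standard_normal((8, 3))                   # 8 query features
M_new = memory_update(M, Z)
print(np.allclose(np.linalg.norm(M_new, axis=1), 1.0))  # True
```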
3.3. Anomaly Score
After performing the ST-MAE reconstruction, we distinguish anomalies by computing an anomaly score. Following existing reconstruction-based methods, we use the reconstruction error as the anomaly score. Let the reconstruction result for $x$
be denoted as $\hat{x}$; we employ the mean squared error (MSE) loss as the reconstruction loss for each STC:

$$\mathcal{L}_{rec} = \|x - \hat{x}\|_{2}^{2},$$

where $\|\cdot\|_{2}$ represents the $\ell_{2}$ norm. This loss measures the Euclidean distance between the predicted values $\hat{x}$ and the true values $x$; a higher value implies a greater difference between the predicted and true values.
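As a small worked example of the per-STC reconstruction loss (written as the squared $\ell_2$ distance; averaging over elements is an equivalent rescaling), with illustrative toy shapes:

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """Squared l2 distance between an input STC and its reconstruction."""
    return float(np.sum((x - x_hat) ** 2))

x = np.zeros((4, 2, 2))          # toy STC: T = 4 patches of size 2x2
x_hat = np.full_like(x, 0.5)     # every element off by 0.5
print(reconstruction_loss(x, x_hat))  # 16 * 0.25 = 4.0
```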
The matching probability $w$ of each memory module is regularized with an entropy loss, which serves as the memory module loss:

$$\mathcal{L}_{mem} = \sum_{i=1}^{M} \sum_{j=1}^{N} -w_{ij} \log(w_{ij}),$$

where $M$ is the number of memory modules and $w_{ij}$ is the matching probability for the $j$-th memory slot in the $i$-th memory module. Balancing the above two loss functions, we obtain the following loss function to train the model:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{mem},$$

where $\lambda$ is a balancing weight.
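The combined objective can be sketched as follows (the balancing weight `lam` is an illustrative placeholder, not a value from the paper; `W` holds one matching-probability vector per memory module):

```python
import numpy as np

def entropy_loss(W, eps=1e-12):
    """Sum of entropies of the matching probabilities, one vector per
    memory module; low entropy encourages sparse, confident matches."""
    return float(sum(-(w * np.log(w + eps)).sum() for w in W))

def total_loss(rec_loss, W, lam=0.0002):
    # lam: illustrative balancing weight between the two terms
    return rec_loss + lam * entropy_loss(W)

W = [np.array([0.25, 0.25, 0.25, 0.25]),  # uniform: maximal entropy
     np.array([0.9, 0.1])]                # peaked: low entropy
print(total_loss(4.0, W))
```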
In the testing phase, we utilize the trained autoencoder model to reconstruct the test data and calculate the loss error $\ell$ for each sample. For each STC, the anomaly score is computed based on the defined loss function as follows:

$$s = \frac{\ell - \mu}{\sigma},$$

where $\mu$ and $\sigma$ represent the mean and standard deviation of the losses $\ell$. The anomaly score describes where a sample's reconstruction error falls within the distribution of errors for normal behavior: a higher anomaly score indicates a larger deviation of the sample from normal behavior, suggesting the presence of anomalies.
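The scoring step amounts to standardizing the per-sample losses, as in this sketch (names are ours):

```python
import numpy as np

def anomaly_scores(losses, eps=1e-12):
    """Standardize per-sample reconstruction losses by their mean and
    standard deviation; higher score = more anomalous."""
    losses = np.asarray(losses, dtype=float)
    mu, sigma = losses.mean(), losses.std()
    return (losses - mu) / (sigma + eps)

# Three near-normal samples and one with a much larger reconstruction error
scores = anomaly_scores([0.10, 0.12, 0.11, 0.90])
print(scores.argmax())  # 3: the deviating sample gets the highest score
```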