3.1. Target State Prediction
Current multi-target tracking algorithms typically employ a linear motion model, with the target's motion state modeled by a linear Kalman filter. During tracking, a Kalman filter is initialized for each target. Before target association begins, the Kalman filter predicts the target's position in the next frame; after association is completed, the filter is updated with information from the matched detection box. The target state $\hat{x}_{t|t-1}$ of the current frame can be obtained from the target state $\hat{x}_{t-1|t-1}$ of the previous frame, and the process can be expressed as:

$\hat{x}_{t|t-1} = F\hat{x}_{t-1|t-1} + Bu_t$

where $F$ denotes the state transition matrix, $B$ denotes the control matrix, and $u_t$ denotes the control input.
The process of predicting the error covariance $P_{t|t-1}$ of the current frame from the error covariance $P_{t-1|t-1}$ of the target state in the previous frame and the process noise covariance $Q$ can be expressed as:

$P_{t|t-1} = FP_{t-1|t-1}F^{\top} + Q$
In the update phase, the predicted error covariance $P_{t|t-1}$ and the measurement noise covariance $R$ are first used to calculate the Kalman gain $K_t$, which can be expressed as:

$K_t = P_{t|t-1}H^{\top}\left(HP_{t|t-1}H^{\top} + R\right)^{-1}$

where $H$ is the observation matrix that describes the observation model.
Updating the state estimate of the current frame based on the observed and predicted states can be expressed as:

$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t\left(z_t - H\hat{x}_{t|t-1}\right)$

where $\hat{x}_{t|t}$ denotes the updated target state of the current frame and $z_t$ denotes the observed value of the target in the current frame.
Finally, the error covariance $P_{t|t}$ of the current frame is updated based on the Kalman gain $K_t$, which can be expressed as:

$P_{t|t} = \left(I - K_tH\right)P_{t|t-1}$
The process noise covariance matrix $Q$ in the Kalman filtering algorithm reflects the uncertainty of the system model, with a larger $Q$ indicating greater model uncertainty; the measurement noise covariance matrix $R$ reflects the uncertainty of the sensor measurements, with a larger $R$ indicating greater measurement uncertainty. The dependence of the filter on the measured and predicted values can be adjusted by varying $Q$ and $R$. Although the effect of the predicted values on the filter can be mitigated by reducing $R$, this approach cannot exploit the predicted values to correct the estimate when the detected values are highly uncertain, and the settings of $Q$ and $R$ must be found through repeated experiments.
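For concreteness, the following is a minimal NumPy sketch of such a linear Kalman filter for a bounding-box state. The constant-velocity model, the eight-dimensional state layout, and the particular values of $Q$, $R$, and the initial covariance are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

class LinearKalmanBoxTracker:
    """Minimal constant-velocity Kalman filter for a [cx, cy, w, h] box state.

    State x = [cx, cy, w, h, vcx, vcy, vw, vh]^T; the matrices F, H, Q, R
    correspond to the prediction/update equations above. The noise values
    are illustrative placeholders, not tuned.
    """

    def __init__(self, z0):
        self.x = np.zeros(8); self.x[:4] = z0                   # initial state from first detection
        self.P = np.eye(8) * 10.0                               # initial error covariance
        self.F = np.eye(8); self.F[:4, 4:] = np.eye(4)          # state transition: position += velocity
        self.H = np.zeros((4, 8)); self.H[:4, :4] = np.eye(4)   # observation matrix
        self.Q = np.eye(8) * 1e-2                               # process noise covariance
        self.R = np.eye(4) * 1e-1                               # measurement noise covariance

    def predict(self):
        # x_{t|t-1} = F x_{t-1|t-1}   (no control input, B u_t = 0)
        self.x = self.F @ self.x
        # P_{t|t-1} = F P_{t-1|t-1} F^T + Q
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        # K_t = P H^T (H P H^T + R)^{-1}
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        # x_{t|t} = x_{t|t-1} + K (z_t - H x_{t|t-1})
        self.x = self.x + K @ (z - self.H @ self.x)
        # P_{t|t} = (I - K H) P_{t|t-1}
        self.P = (np.eye(8) - K @ self.H) @ self.P
```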
To address the above problems, this paper designs an LSTM-MP module based on the LSTM structure. LSTM-MP uses an LSTM as the encoder and an MLP as the decoder to predict the target's motion, as shown in Figure 3.
The overall structure of LSTM-MP is shown in Figure 4. The encoder uses an LSTM network and the decoder uses an MLP (multilayer perceptron) network, where the LSTM encoder consists of a forget gate, memory gates, a cell state, and an output gate.
The input of the forget gate consists of the target state $x_t$ of the current frame and the hidden state $h_{t-1}$ of the previous frame computed by LSTM-MP, and the output of the forget gate can be expressed as:

$f_t = \sigma\left(W_fx_t + U_fh_{t-1} + b_f\right)$

where $f_t$ denotes the forget gate output at moment $t$, $\sigma$ denotes the sigmoid function, $W_f$ and $U_f$ denote the weight parameters of the forget gate, and $b_f$ denotes the bias parameter of the forget gate.
LSTM-MP has two memory gates, which use the sigmoid and tanh functions to compute the memorization degree of different data, respectively. The data computation of the two memory gates can be expressed as:

$i_t = \sigma\left(W_ix_t + U_ih_{t-1} + b_i\right)$
$\tilde{c}_t = \tanh\left(W_cx_t + U_ch_{t-1} + b_c\right)$

where $i_t$ and $\tilde{c}_t$ denote the memory gate outputs at moment $t$, $\sigma$ denotes the sigmoid function, $\tanh$ denotes the tanh function, $W_i$, $U_i$, $W_c$, and $U_c$ denote the weight parameters of the memory gate input data, and $b_i$ and $b_c$ denote the bias parameters of the memory gates, respectively.
The cell state $c_t$ in the LSTM-MP model denotes the memory state of the encoder for the data at moment $t$, which can be represented by a combination of the forget gate and the memory gates:

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

where $\odot$ denotes the Hadamard product, i.e., element-by-element multiplication.
The output gate $o_t$ in the encoder is computed from the combination of the cell state $c_t$, the output state $h_{t-1}$ of the previous frame, and the input $x_t$ of the current frame, and the operation can be expressed as:

$o_t = \sigma\left(W_ox_t + U_oh_{t-1} + b_o\right)$
$h_t = o_t \odot \tanh\left(c_t\right)$

where $o_t$ denotes the output gate activation at moment $t$, $W_o$ and $U_o$ denote the weight parameters of the output gate, $b_o$ denotes its bias parameter, and $h_t$ denotes the encoder output at moment $t$.
The LSTM-MP model linearly maps the output of the last frame of the LSTM structure and computes the final predicted output through two linear layers. The parameters of the designed linear layers are shown in
Table 1, where the output dimension of Linear#2 is 20, indicating the prediction of the next five frames of target motion data. Each predicted frame has length 4, corresponding to the target's center-point coordinates, width, and height.
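A minimal PyTorch sketch of an encoder-decoder of this form is shown below. The 20-dimensional output of the second linear layer and the 5-frame × 4-parameter reshaping follow the description of Table 1, while the hidden size, the width of the first linear layer, and the history length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMMP(nn.Module):
    """Sketch of an LSTM encoder + MLP decoder motion predictor (LSTM-MP style).

    Input : a sequence of past box states [cx, cy, w, h], shape (B, T, 4).
    Output: the next 5 predicted box states, shape (B, 5, 4),
            matching the 20-dim output of Linear#2 described in Table 1.
    The hidden size of 128 is an assumption, not taken from the paper.
    """

    def __init__(self, in_dim=4, hidden=128, pred_frames=5):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)   # LSTM encoder
        self.decoder = nn.Sequential(                               # MLP decoder
            nn.Linear(hidden, hidden),                  # Linear#1 (size assumed)
            nn.ReLU(),
            nn.Linear(hidden, pred_frames * in_dim),    # Linear#2 -> 20 outputs
        )
        self.pred_frames, self.in_dim = pred_frames, in_dim

    def forward(self, history):
        out, _ = self.encoder(history)                  # (B, T, hidden)
        last = out[:, -1]                               # keep only the last frame's output
        pred = self.decoder(last)                       # (B, 20)
        return pred.view(-1, self.pred_frames, self.in_dim)  # (B, 5, 4)

# Usage: predict 5 future boxes from 10 past boxes of a single target.
model = LSTMMP()
past = torch.randn(1, 10, 4)
future = model(past)   # shape (1, 5, 4)
```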
3.2. Target State Fusion
Traditional methods for fusing appearance features over time assume that more recent target information is more valuable, modeling fusion coefficients as an exponential function of the time interval between the historical and current frames. However, in complex scenarios, the relationship between a target’s appearance features across different time points is more complex than an exponential function can capture. Thus, the fusion coefficient should account not only for the time interval but also for the information extracted from the appearance features. Our approach enhances the appearance features of the target in the current frame, improving the differentiation between similar targets and maintaining feature consistency for the same target during tracking.
To capture the state-implicit correlations of the target’s historical frames, we propose a spatiotemporal attention-based target appearance feature fusion (TSA-FF) algorithm, which computes adaptive fusion coefficients. TSA-FF combines historical state information to enhance the robustness of the current frame states. The algorithm employs an attention mechanism to calculate weighted fusion coefficients for target appearance features, as illustrated in
Figure 5. Given the challenges posed by complex backgrounds and target motion, TSA-FF incorporates features specific to multi-target tracking and introduces self-spatial and interactive spatial attention weights within the attention mechanism.
The self-spatial attention weighting evaluates the fusion weights of appearance features from different historical frames by calculating their similarity. If a historical frame has a higher similarity to all other historical frames, it is considered a key frame and is assigned a greater fusion weight. The process can be represented as:
$A_{\text{self}} = \mathrm{softmax}\!\left(\dfrac{F_h \times F_h^{\top}}{\sqrt{d}}\right), \qquad w_{\text{self}} = \mathrm{mean}\left(A_{\text{self}}\right)$

where $A_{\text{self}} \in \mathbb{R}^{n \times n}$ denotes the self-attention among the target history frame appearance features, $F_h \in \mathbb{R}^{n \times d}$ denotes the appearance features of the history frames, $n$ denotes the number of history frames, $d$ denotes the target appearance feature dimension, and $\times$ denotes matrix multiplication. $w_{\text{self}}$ denotes the self-spatial attention weight, and $\mathrm{mean}(\cdot)$ denotes the mean of the matrix by row.
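For illustration, a minimal NumPy sketch of the self-spatial attention weights follows; the $\sqrt{d}$ scaling, row-wise softmax, and row averaging follow the reconstruction above and are assumptions where the text is not explicit.

```python
import numpy as np

def self_spatial_attention_weights(F_hist):
    """Self-spatial attention weights over n history-frame appearance features.

    F_hist: (n, d) appearance features of the target's history frames.
    Returns an (n,) weight vector; key frames that are similar to all other
    history frames receive larger fusion weights.
    """
    n, d = F_hist.shape
    scores = F_hist @ F_hist.T / np.sqrt(d)                 # pairwise similarity, (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    A_self = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return A_self.mean(axis=0)                              # mean by row -> one weight per frame
```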
To extract the degree of influence that the appearance features of a target's history frames have on the current frame, we propose the interactive spatial attention weight. This method calculates the similarity coefficient between the appearance features of the target's current frame and those of all its history frames.
Interactive spatial attention weighting uses this similarity coefficient to determine the interactive spatial attention fusion weights of the history frames, a process that can be expressed as:

$A_{\text{inter}} = \dfrac{f_c \times F_h^{\top}}{\sqrt{d}}, \qquad w_{\text{inter}} = \mathrm{softmax}\left(A_{\text{inter}}\right)$

where $A_{\text{inter}}$ denotes the interaction attention between the appearance features of the target's current frame and those of the history frames, $f_c$ denotes the appearance feature of the target's current frame, $F_h$ denotes the appearance features of the target's history frames, $n$ denotes the number of history frames, $d$ denotes the dimension of the target's appearance features, $\times$ denotes matrix multiplication, and $w_{\text{inter}}$ denotes the interactive spatial attention weight. If a history frame has a higher similarity to the target's current frame, it is considered to contribute more to feature fusion and is therefore assigned a larger fusion weight.
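Analogously, a minimal NumPy sketch of the interactive spatial attention weights, assuming the scaled dot-product similarity and softmax normalization reconstructed above:

```python
import numpy as np

def interactive_spatial_attention_weights(f_cur, F_hist):
    """Interactive spatial attention weights between the current-frame feature
    and the n history-frame features.

    f_cur : (d,) appearance feature of the current frame.
    F_hist: (n, d) appearance features of the history frames.
    Returns an (n,) weight vector; history frames more similar to the current
    frame receive larger fusion weights.
    """
    d = f_cur.shape[0]
    scores = F_hist @ f_cur / np.sqrt(d)              # similarity to the current frame, (n,)
    scores = scores - scores.max()                    # numerical stability
    return np.exp(scores) / np.exp(scores).sum()      # softmax over the history frames
```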
Instead of modeling time intervals through manual design, the self-spatial attention weights and interactive spatial attention weights extract complex temporal dependencies from the target's historical appearance features in a higher-dimensional space, and thereby determine how the target's appearance features are fused over time in complex scenes. Based on the self-spatial attention weights and the interactive spatial attention weights, the target appearance feature fusion weights are represented as:

$W = \left(w_{\text{self}} + w_{\text{inter}}\right) \odot w_{\text{time}}$

where $\odot$ denotes the Hadamard product, i.e., element-wise multiplication between matrices, and $w_{\text{time}}$ denotes the temporal attention weights. $W$ denotes the fusion weight result, obtained by summing the self-spatial attention weights and the interactive spatial attention weights and then combining the sum with the temporal attention weights.
Further, the fused target appearance feature may be represented as:
$\tilde{f}_c = \mathrm{Norm}\left(W \times F_h\right)$

The fused target appearance feature $\tilde{f}_c$ contains the appearance feature information of the target over the previous $n$ frames. The attention weight $W$ is no longer a traditional weight designed from artificial a priori knowledge, but the result of jointly considering the self-spatial attention weight and the interactive spatial attention weight, and $\mathrm{Norm}(\cdot)$ denotes the feature vector normalization operation.
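Putting the pieces together, a minimal NumPy sketch of the fusion step under the reconstruction above; the temporal attention weights and the L2 normalization are placeholders for the paper's exact definitions.

```python
import numpy as np

def fuse_appearance_features(F_hist, w_self, w_inter, w_time):
    """Fuse history-frame appearance features into one enhanced feature.

    F_hist: (n, d) appearance features of the history frames.
    w_self, w_inter, w_time: (n,) self-spatial, interactive-spatial, and
    temporal attention weights. The fused feature is L2-normalized as a
    stand-in for the feature vector normalization Norm(.).
    """
    W = (w_self + w_inter) * w_time                    # combined fusion weights, (n,)
    fused = W @ F_hist                                 # weighted sum of history features, (d,)
    return fused / (np.linalg.norm(fused) + 1e-12)     # feature vector normalization

# Usage sketch with the two helper functions above (hypothetical data).
# F_hist = np.random.randn(8, 128); f_cur = np.random.randn(128)
# w_s = self_spatial_attention_weights(F_hist)
# w_i = interactive_spatial_attention_weights(f_cur, F_hist)
# w_t = np.ones(8) / 8                                 # placeholder temporal weights
# f_fused = fuse_appearance_features(F_hist, w_s, w_i, w_t)
```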