2.1. Notation and Pixel Model
Throughout this paper, lowercase letters denote scalar variables, bold lowercase letters denote vectors, capital letters denote matrices, and Greek letters denote coefficients.
As shown in Figure 2, the video sequences are organized as three-dimensional arrays. Each element of the array represents the IR intensity value associated with the corresponding pixel. Since IR images are monochromatic, each element carries just one value, instead of the RGB triad of color videos.
Referring to [3], we model the signal carried by a single pixel of spatial coordinates $i$ and $j$ at the quantized time instant $t$ as:

$$x(i, j, t) = l(i, j, t) + s(i, j, t), \quad i = 1, \dots, H, \;\; j = 1, \dots, W, \;\; t = 1, \dots, N \quad (1)$$

where $l(i, j, t)$ is the background signal; $s(i, j, t)$ is the target signal; $H$ and $W$ are the height and the width of each frame, respectively; and $N$ is the number of collected frames. We also introduce the matrices $X$, $L$, and $S \in \mathbb{R}^{P \times N}$, in which the columns $\mathbf{x}_t$, $\mathbf{l}_t$, and $\mathbf{s}_t$ denote the $t$-th frame, the corresponding background, and the target, respectively, reorganized in lexicographic order, while $P = HW$ is the number of pixels. Given such a model, the objective of target detection is to separate the target signal $S$ from the background $L$. In the literature, such a task is commonly referred to as background subtraction.
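As a concrete illustration of this data layout, the reshaping from the video cube to the data matrix can be sketched as follows (a minimal NumPy sketch; the dimensions and variable names are illustrative, not those of the paper's implementation):

```python
import numpy as np

# Illustrative dimensions: frames of H x W pixels, N of them.
H, W, N = 4, 5, 3
P = H * W  # number of pixels per frame

# Monochromatic IR video as a 3-D array indexed by (row, column, time).
video = np.arange(H * W * N, dtype=float).reshape(H, W, N)

# Lexicographic reordering: frame t becomes column t of the P x N matrix X.
X = video.reshape(P, N)

# Column 0 of X is the first frame flattened row by row.
assert np.array_equal(X[:, 0], video[:, :, 0].ravel())
```

The background and target matrices share the same layout, so the decomposition can be carried out column-wise or on the whole matrix.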
2.2. RPCA for Background Subtraction
RPCA is a well-known technique that improves PCA [43] by making it robust against outliers. In fact, while PCA can be used to effectively purge the input matrix of additive white Gaussian noise, it fails in the presence of outliers. In the case of MTD, according to the previously introduced model, the input matrix $X$, which is representative of the input video, can be seen as the sum of a background matrix $L$ and an outlier matrix $S$, which represents the target. The idea behind using RPCA is that $L$ is low rank, while $S$ is sparse. Mathematically, the problem can be reformulated as that of finding the $L$ and $S$ that satisfy Equation (2):

$$\min_{L, S} \; \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad X = L + S \quad (2)$$

where $\|S\|_0$ denotes the $\ell_0$-pseudo-norm, which counts the total number of non-zero elements in the matrix $S$, while $\lambda$ is a regularization parameter. Since both $\operatorname{rank}(\cdot)$ and $\|\cdot\|_0$ are non-convex, the problem is not tractable as it is. For this reason, a convex relaxation is adopted, which makes it possible to find the optimal $L$ and $S$ with high probability. Such relaxation is given in Equation (3):

$$\min_{L, S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S \quad (3)$$

which is further relaxed in:

$$\min_{L, S} \; \frac{1}{2}\|X - L - S\|_F^2 + \lambda_1 \|L\|_* + \lambda_2 \|S\|_1 \quad (4)$$

where $\|L\|_*$ is the nuclear norm of $L$, i.e., the sum of its singular values, which is a convex envelope of the function $\operatorname{rank}(L)$; $\|S\|_1$ is the $\ell_1$-norm of $S$, which is a convex approximation of the $\ell_0$-pseudo-norm that promotes sparsity; and $\lambda_1$ is another regularization parameter which, along with $\lambda_2$, controls the balance of the three terms. The convex problem in Equation (3) is known as principal component pursuit (PCP); it converges to the problem in Equation (2) and can be solved using an augmented Lagrange multiplier (ALM) algorithm [25,44]. The implementation is reported in Algorithm 1.
Algorithm 1: RPCA by ALM
1 | Input: $X$ (observed data); $\lambda$, $\mu$ (regularization parameters)
2 | Initialize: $S = 0$, $Y = 0$
3 | while not converged do
4 |   (1) $L \leftarrow \mathcal{D}_{1/\mu}\left(X - S + \mu^{-1} Y\right)$
5 |   (2) $S \leftarrow \mathcal{S}_{\lambda/\mu}\left(X - L + \mu^{-1} Y\right)$
6 |   (3) $Y \leftarrow Y + \mu\,(X - L - S)$
7 | return: $L$, $S$
In Algorithm 1: $\mathcal{S}_{\varepsilon}[\cdot]$ denotes the shrinkage operator applied on the matrix elements, $\mathcal{S}_{\varepsilon}[x] = \operatorname{sgn}(x)\max(|x| - \varepsilon, 0)$, which is the proximal operator for the $\ell_1$-norm minimization problem [45]; $\mathcal{D}_{\varepsilon}(\cdot)$ denotes the singular value thresholding operator applied on a matrix $M$, whose singular value decomposition (SVD) is $M = U \Sigma V^{\mathsf{T}}$, defined as $\mathcal{D}_{\varepsilon}(M) = U \mathcal{S}_{\varepsilon}[\Sigma] V^{\mathsf{T}}$, which is the proximal operator for the nuclear-norm minimization problem [45].
RPCA is usually implemented in a batch form. In this implementation, the video is divided into batches of a fixed length of $n_w$ frames, and RPCA is applied to each batch. The length of the batches has to be chosen taking into consideration the minimum speed of the targets we are interested in as well as the stationarity of the background. This method is affected by non-causality and, therefore, does not meet real-time requirements: we would need to wait for the collection of the entire batch before obtaining the background and target estimates. A possible solution is to apply a sliding window to the input video, resulting in moving window RPCA (MW-RPCA) [40] which, for each newly collected frame, calculates the batch RPCA on the last $n_w$ frames to provide the background/foreground separation of the last frame. In the analysis of video sequences, this implementation usually carries quite a large computational burden.
2.3. Online Moving Window RPCA
In the literature, there are a few proposals of online RPCA implementations [38,39,40]. For this study, we referred to the online moving window RPCA (OMW-RPCA) proposed by Xiao et al. [40], which is an improvement of online robust PCA via stochastic optimization (RPCA-STOC) proposed by Feng et al. [39]. We hereinafter summarize the ideas behind OMW-RPCA, which, by relaxing (3), solves the following problem:

$$\min_{L, S} \; \frac{1}{2}\|X - L - S\|_F^2 + \tilde{\lambda}_1 \|L\|_* + \tilde{\lambda}_2 \|S\|_1 \quad (5)$$

where $\tilde{\lambda}_1$ and $\tilde{\lambda}_2$ are regularization parameters. It is worth noting that, even though dividing the three terms in (5) by the number of frames $n$ would lead back to a form that is more similar to the one in Equation (4), which is relative to the batch implementation, the online implementation requires a different proportion of the regularization parameters. For this reason, and in order to comply with the notation used in the reference paper, we decided to keep the notations distinct. Therefore, hereinafter, $\lambda_1$ and $\lambda_2$ will refer to batch RPCA, while $\tilde{\lambda}_1$ and $\tilde{\lambda}_2$ will refer to the online implementation.
According to [39], the nuclear norm of $L$ respects the relation in Equation (6):

$$\|L\|_* = \min_{U, V \,:\, L = U V^{\mathsf{T}}} \; \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \quad (6)$$

which means that, given two matrices $U \in \mathbb{R}^{P \times r}$ and $V \in \mathbb{R}^{n \times r}$ such that $L = U V^{\mathsf{T}}$ with $r \geq \operatorname{rank}(L)$, the nuclear norm of $L$ is never higher than $\frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$.

This means that solving the minimization problem in Equation (7), obtained by plugging (6) into (5), also solves the minimization problem in Equation (5):

$$\min_{U, V, S} \; \frac{1}{2}\|X - U V^{\mathsf{T}} - S\|_F^2 + \frac{\tilde{\lambda}_1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) + \tilde{\lambda}_2 \|S\|_1 \quad (7)$$
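The factorization bound can be checked numerically. In this sketch the sizes and the rank are arbitrary; the balanced factorization built from the SVD, $U\Sigma^{1/2}$ and $V\Sigma^{1/2}$, attains the bound with equality:

```python
import numpy as np

rng = np.random.default_rng(1)
r = 3
U = rng.normal(size=(40, r))  # arbitrary factor pair with r >= rank(L)
V = rng.normal(size=(25, r))
L = U @ V.T

nuclear = np.linalg.svd(L, compute_uv=False).sum()
bound = 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)
assert nuclear <= bound  # any factorization upper-bounds the nuclear norm

# The balanced factorization from the SVD attains the minimum of Eq. (6).
Us, sig, Vst = np.linalg.svd(L, full_matrices=False)
Ub = Us[:, :r] * np.sqrt(sig[:r])
Vb = Vst[:r].T * np.sqrt(sig[:r])
balanced = 0.5 * (np.linalg.norm(Ub, "fro") ** 2 + np.linalg.norm(Vb, "fro") ** 2)
assert np.isclose(balanced, nuclear)
```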
The above-depicted nuclear norm factorization is a well-established solution for online optimization problems [39,40,46,47] and is particularly elegant, since $U$ can be seen as the basis of the low-rank subspace, in which case $V$ represents the coefficients of the observations with respect to the basis $U$. Given the input matrix $X$, solving Equation (7) amounts to minimizing the following so-called "empirical cost function":

$$f_n(U) = \frac{1}{n}\sum_{t=1}^{n} \ell(\mathbf{x}_t, U) + \frac{\tilde{\lambda}_1}{2n}\|U\|_F^2 \quad (8)$$

where $\ell(\mathbf{x}_t, U)$ is the empirical loss function for each frame, which is defined as:

$$\ell(\mathbf{x}_t, U) = \min_{\mathbf{v}, \mathbf{s}} \; \frac{1}{2}\|\mathbf{x}_t - U\mathbf{v} - \mathbf{s}\|_2^2 + \frac{\tilde{\lambda}_1}{2}\|\mathbf{v}\|_2^2 + \tilde{\lambda}_2\|\mathbf{s}\|_1 \quad (9)$$
The vectors $\mathbf{v}_t$ and $\mathbf{s}_t$ and the matrix $U$ are updated in two steps. First, Equation (9) is solved in $(\mathbf{v}, \mathbf{s})$ to find $\mathbf{v}_t$ and $\mathbf{s}_t$; then, $U$ is updated by minimizing the following function:

$$g_t(U) = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|\mathbf{x}_i - U\mathbf{v}_i - \mathbf{s}_i\|_2^2 + \frac{\tilde{\lambda}_1}{2}\|\mathbf{v}_i\|_2^2\right) + \frac{\tilde{\lambda}_1}{2t}\|U\|_F^2 \quad (10)$$

whose minimum can be found in closed form:

$$U_t = \left[\sum_{i=1}^{t}(\mathbf{x}_i - \mathbf{s}_i)\,\mathbf{v}_i^{\mathsf{T}}\right]\left[\sum_{i=1}^{t}\mathbf{v}_i\mathbf{v}_i^{\mathsf{T}} + \tilde{\lambda}_1 I\right]^{-1} \quad (11)$$

which means that $U$ can be updated by block-coordinate descent with warm restart.
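A sketch of this closed-form update, using the sufficient statistics $A_t = \sum_i \mathbf{v}_i\mathbf{v}_i^{\mathsf{T}}$ and $B_t = \sum_i (\mathbf{x}_i - \mathbf{s}_i)\mathbf{v}_i^{\mathsf{T}}$ (variable names are ours):

```python
import numpy as np

def update_basis(A, B, lam1):
    """Closed-form minimizer of the quadratic cost in U:
    U_t = B (A + lam1 * I)^{-1}, where
    A = sum_i v_i v_i^T (r x r) and B = sum_i (x_i - s_i) v_i^T (P x r)."""
    r = A.shape[0]
    return B @ np.linalg.inv(A + lam1 * np.eye(r))
```

In practice, $A_t$ and $B_t$ are updated incrementally as frames arrive, so the cost per frame does not grow with $t$; the actual OMW-RPCA implementation updates $U$ column-wise by block-coordinate descent rather than with an explicit inverse.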
The advantage of the online implementation with respect to MW-RPCA lies in the fact that, for each new frame, only Equation (9) must be minimized with respect to two vectors, which requires remarkably less time than the minimization of Equation (4) with respect to two matrices. In addition, the update of $U$ is in closed form and does not have to be accomplished iteratively, therefore adding a very small computational load.
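The per-frame minimization of Equation (9) can itself be sketched by alternating two closed-form steps: a ridge-regression solve for $\mathbf{v}$ and a shrinkage for $\mathbf{s}$ (an illustrative sketch with our own variable names, not the paper's solver):

```python
import numpy as np

def shrink(x, eps):
    """Soft-thresholding, prox of the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def project_sample(x, U, lam1, lam2, n_iter=50):
    """Alternating minimization of the per-frame loss:
    0.5*||x - U v - s||^2 + 0.5*lam1*||v||^2 + lam2*||s||_1."""
    r = U.shape[1]
    G = np.linalg.inv(U.T @ U + lam1 * np.eye(r))  # reusable ridge factor
    s = np.zeros_like(x)
    for _ in range(n_iter):
        v = G @ (U.T @ (x - s))      # exact minimizer over v, s fixed
        s = shrink(x - U @ v, lam2)  # exact minimizer over s, v fixed
    return v, s
```

Minimizing over one $r$-dimensional and one $P$-dimensional vector in this way is far cheaper than the matrix-valued minimization of the batch problem.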
The implementation of OMW-RPCA, unfortunately, needs an initialization which provides both the estimated rank $r$ of the matrix $L$ and the initial basis $U$. Such initialization, called the "burn-in" phase, is accomplished by applying batch RPCA to the first $n_b$ frames of the sequence, where $n_b$ is a user-specified window size that must be higher than the expected rank of the matrix $L$. Although we suggest reading [40] for more details, we report in Algorithm 2 the steps of OMW-RPCA.
Algorithm 2: Online Moving Window RPCA
1 | Input: $\{\mathbf{x}_1, \mathbf{x}_2, \dots\}$ (observed data, revealed sequentially); $\lambda_1$, $\lambda_2$ (burn-in regularization parameters); $\tilde{\lambda}_1$, $\tilde{\lambda}_2$ (online regularization parameters); $n_b$ (burn-in samples)
2 | Initialize: compute batch RPCA on the burn-in samples to get $r$, $L_{n_b}$, and $S_{n_b}$; compute the SVD of $L_{n_b}$ to get $U_{n_b}$ and $\{\mathbf{v}_i\}_{i=1}^{n_b}$; $A = 0$, $B = 0$ (auxiliary matrices)
3 | for $i = 1$ to $n_b$ do
4 |   $A \leftarrow A + \mathbf{v}_i\mathbf{v}_i^{\mathsf{T}}$; $B \leftarrow B + (\mathbf{x}_i - \mathbf{s}_i)\,\mathbf{v}_i^{\mathsf{T}}$
5 | for $t = n_b + 1$ to $N$ do
6 |   (4) Reveal the sample $\mathbf{x}_t$
7 |   (5) Project the new sample: $(\mathbf{v}_t, \mathbf{s}_t) = \arg\min_{\mathbf{v}, \mathbf{s}} \frac{1}{2}\|\mathbf{x}_t - U_{t-1}\mathbf{v} - \mathbf{s}\|_2^2 + \frac{\tilde{\lambda}_1}{2}\|\mathbf{v}\|_2^2 + \tilde{\lambda}_2\|\mathbf{s}\|_1$
8 |   (6) Update the auxiliary matrices over the moving window: $A \leftarrow A + \mathbf{v}_t\mathbf{v}_t^{\mathsf{T}} - \mathbf{v}_{t-n_b}\mathbf{v}_{t-n_b}^{\mathsf{T}}$; $B \leftarrow B + (\mathbf{x}_t - \mathbf{s}_t)\,\mathbf{v}_t^{\mathsf{T}} - (\mathbf{x}_{t-n_b} - \mathbf{s}_{t-n_b})\,\mathbf{v}_{t-n_b}^{\mathsf{T}}$
9 |   (7) Compute $U_t$ through Equation (11) with $U_{t-1}$ as warm restart
10 | return: $\{U_t\}$, $\{\mathbf{v}_t\}$, $\{\mathbf{s}_t\}$
Although OMW-RPCA solves the causality problem, the result is highly affected by the burn-in phase. If, on the one hand, no target is present in the burn-in sequence, the successive iterations effectively isolate the target. If, on the other hand, a target is present in the burn-in sequence, the successive iterations keep considering the initial presence of the target as a part of the background. The result is that the estimated foreground and background contain a ghost of the target in the position it occupied during the burn-in phase. This problem is a sensitive issue since, in an operative context, we do not have any control over the scene during the initialization of the surveillance system.
Figure 3 shows the effect of the burn-in ghosting in a sequence in which the target was present at the beginning of the recording. The upper row shows one of the first frames of the video sequence, which is included in the burn-in sequence, while the lower row shows a later frame, which is outside of the burn-in sequence. Alongside both frames, the corresponding background and foreground estimations are shown. It is worth noting that the presence of the target in the burn-in sequence affects the estimations: even though the target is moving at a constant speed, the ghost remains in the position assumed by the boat during the burn-in sequence and does not follow it towards the successive positions.
A trivial idea to solve the burn-in ghosting problem is to increase the value of the regularization parameter $\tilde{\lambda}_2$, which increases the weight of $\|S\|_1$ in the loss function in Equation (5). In fact, by increasing $\tilde{\lambda}_2$, we would increase the threshold of the proximal operator associated with the $\ell_1$-norm, which is, indeed, the shrinkage operator. By doing this, we would cut the lower-intensity pixels out of the foreground. Such pixels would hopefully belong to the ghost rather than to the actual target. In this way, the background estimation would also be modified, because of the condition $X = L + S$, therefore effectively deleting the ghost.
Increasing $\tilde{\lambda}_2$ is, unfortunately, an unpleasant solution for the following reasons:
The parameter would become much more dependent on the specific input matrix $X$, while, in practice, it is usually set as $\tilde{\lambda}_2 = 1/\sqrt{P}$;
Along with the ghost pixels, a higher $\tilde{\lambda}_2$ would also cause the erosion of target-associated pixels, affecting the detection probability as well.
In order to overcome those problems, we used a saliency-based approach, described in Section 2.4, which consisted of using a saliency map to modulate the regularization parameter associated with $\|S\|_1$.
2.4. Saliency-Aided RPCA
The saliency-based approach in RPCA is not new in the literature [41,48,49]. Our approach was inspired by the one proposed by Oreifej et al. in [41], which modified the minimization problem in Equation (3) as follows:

$$\min_{L, S} \; \|L\|_* + \lambda \|f(Q) \circ S\|_0 \quad \text{s.t.} \quad X = L + S \quad (12)$$

which is then relaxed to the form:

$$\min_{L, S} \; \|L\|_* + \lambda \|f(Q) \circ S\|_1 \quad \text{s.t.} \quad X = L + S \quad (13)$$

where $Q$ is a matrix whose $i$-th column $\mathbf{q}_i$ is the saliency map of the $i$-th frame, scaled in the range between 0 and 1 and organized in lexicographic order. The operator $\circ$ indicates the element-wise multiplication, while the operator $f(\cdot)$ denotes any function that:
inverts the polarity of each element of $Q$, in the sense that a low value should address a high objectness confidence, and vice versa;
scales the resulting matrix to a wider modulation range (e.g., between 0 and 20).
We use $f(Q) = \beta e^{-\alpha Q}$, where $\alpha$ and $\beta$ are tuning parameters controlling the slope of the negative exponential and the dynamic of the resulting matrix, respectively. For each new frame, the saliency map is calculated through one of the many saliency filters presented in the literature. In this work, we refer to the SR and the FG algorithms because of their very small execution times. In particular, SR takes advantage of the property of natural images known as the 1/f law, which states that the amplitude of the averaged Fourier spectrum of the ensemble of natural images obeys a distribution of the type $\mathbb{E}\{A(f)\} \propto 1/f$.
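For reference, the SR map and the modulation function can be sketched as follows (a compact re-implementation of the spectral residual idea with NumPy only; the smoothing and resizing steps of the original algorithm are omitted for brevity):

```python
import numpy as np

def sr_saliency(frame):
    """Spectral residual saliency sketch: back-transform the residual of
    the log-amplitude spectrum while keeping the original phase."""
    F = np.fft.fft2(frame)
    log_amp = np.log(np.abs(F) + 1e-12)
    phase = np.angle(F)
    # 3x3 local average of the log-amplitude spectrum (circular padding).
    avg = sum(np.roll(np.roll(log_amp, di, 0), dj, 1)
              for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0
    sal = np.abs(np.fft.ifft2(np.exp((log_amp - avg) + 1j * phase))) ** 2
    sal -= sal.min()
    return sal / (sal.max() + 1e-12)  # saliency map scaled to [0, 1]

def modulation(Q, alpha, beta):
    """f(Q) = beta * exp(-alpha * Q): salient (high-Q) pixels get a low
    l1 weight, so they survive the shrinkage; flat regions get a high one."""
    return beta * np.exp(-alpha * Q)
```

A bright point target on a flat background is mapped to the top of the saliency scale, and `modulation` then assigns it the smallest threshold weight.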
FG is an implementation of the well-known visual attention model, which emulates the behavior of the retina of the human eye to highlight the spots within the image characterized by the highest center–surround contrast. After calculating the saliency maps, the problem in Equation (13) can be solved, again, using ALM. Referring to [41] for the details, the steps of the saliency-aided RPCA are reported in Algorithm 3.
Algorithm 3: Saliency-Aided RPCA
1 | Input: $X$ (observed data); $\lambda$, $\mu$ (regularization parameters); $\alpha$, $\beta$ (parameters of $f(\cdot)$)
2 | Initialize: $S = 0$, $Y = 0$, $Q = [\,]$ (empty matrix of size $P \times 0$)
3 | for $t = 1$ to $n$ do
4 |   Reshape $\mathbf{x}_t$ in the frame form to get the matrix $X_t$ of size $H \times W$
5 |   Compute the saliency algorithm on the frame $X_t$ to get $Q_t$
6 |   Put $Q_t$ in lexicographic order to get $\mathbf{q}_t$ and update $Q = [Q, \mathbf{q}_t]$
7 | while not converged do
8 |   (1) $L \leftarrow \mathcal{D}_{1/\mu}\left(X - S + \mu^{-1} Y\right)$
9 |   (2) $S \leftarrow \mathcal{S}_{\lambda f(Q)/\mu}\left(X - L + \mu^{-1} Y\right)$ (element-wise thresholds)
10 |  (3) $Y \leftarrow Y + \mu\,(X - L - S)$
11 | return: $L$, $S$
11 | return: |