**1. Introduction**

As a key technology in computer vision, multi-object tracking (MOT) has received growing attention from researchers all over the world. In recent years, with the improvements in object detection techniques [1–3], tracking-by-detection (TBD) has become one of the most successful strategies. It applies an object detector to produce detection responses in each frame, which are then used to generate complete trajectories. The data association process mainly depends on object features including appearance, motion, and other factors. It is often solved by the Hungarian algorithm [4,5], network flows [6–8], minimum energy models [9,10], conditional random field approaches [11,12], hyper-graph models [13], deep learning methods [14–17], and so on.

Object feature expression is the basis of data association. Handcrafted features, such as the histogram of oriented gradients (HOG) [18], local binary patterns (LBP) [19], and the histogram of color (HOC), are widely used in computer vision research [8,11,13,20]. These features were originally designed to distinguish objects from various backgrounds. Although a combination of different handcrafted features [11,13] is often used to improve discrimination, it is still not robust enough. Meanwhile, detection responses given by object detectors are not always accurate and sometimes


even false due to complex backgrounds, poor image quality, complicated movements, or occlusions of objects. Thus, how to better distinguish targets by online detection responses, how to deal with noise due to detection inaccuracy, and how to combine various cues of a target to enhance discrimination remain key issues that limit tracking performance.

With the development of deep learning in image classification, segmentation, and other applications, researchers have used deep architectures to learn discriminative features for multi-object tracking and achieved good results. In [12,15,17,21–23], deep Siamese networks were adopted instead of traditional handcrafted methods [11,13]. A contrastive loss function was used with the aim of decreasing the feature distances of the same object pairs while increasing the distances of different pairs. Due to the shortage of online samples, training such deep neural networks mainly depends on offline learning. Although online fine-tuning measures are often adopted, the online data are too limited to train a deep network effectively.

In this paper, a Siamese network with an auto-encoding constraint (SNAC) is proposed, which is able to work well with a small-sized sample set. Different from previous deep Siamese networks, the SNAC has a simple structure with two fully-connected layers: an auto-encoder layer and a code-mix layer. This simple network can easily be learned from limited online samples to extract discriminative features that distinguish objects in the scene. Inspired by stacked auto-encoder methods [24,25], the output of the encoder layer tries to represent the input detection response as accurately as possible. This is done by adding a constraint term to the loss function, called the auto-encoding constraint, which effectively prevents the network from overfitting while training with limited samples. To deal with inaccurate detection responses (red bounding box in Figure 1a), Gaussian-distributed training samples are generated around detection responses to suppress noise. For each detection response, one SNAC is trained to distinguish it from the others in adjacent frames. Meanwhile, in order to enhance robustness, following [22], the HOC is used as the input instead of raw pixels. With the discriminative detection features extracted by SNACs, reliable tracklets are generated.

To better distinguish tracklets, the SNAC is improved to extract a composite previous-appearance-next (PAN) feature for each tracklet, which combines the previous- and next-step motions with the appearance of a tracklet element. Following [11,26], elements in the same tracklet can be treated as positive samples, and the negative samples are obtained from temporally overlapping tracklets. A distribution description is proposed to express motion; it can suppress motion noise and is also compatible with the appearance for joint learning of the PAN feature.

In order to solve the MOT problem by the proposed SNAC, an online incremental learned tracking framework is established. First, one SNAC is trained for each detection response online, and reliable tracklets are generated mainly by the extracted features. Then, the PAN features are learned from tracklets by improved SNACs. To improve the training efficiency, SNACs are trained by incremental learning. During tracklet generation, the parameters of SNAC for detection in the new frame are inherited from the predecessor tracklet element, and the training samples are updated frame by frame. To extract PAN, the parameters are initialized by the SNAC of the related detection response. A tracklet growing process is used to deal with missing and partial detections (Figure 1b,c) before tracklet association. With the discriminative PAN feature, complete trajectories are solved efficiently by an iterative greedy algorithm. The main contributions of this paper are summarized as follows:


(**a**) Inaccurate detection

(**b**) Missing detection

(**c**) Partial detection

**Figure 1.** Illustrations of detection failures in three consecutive frames. The solid yellow bounding boxes represent the correct detection responses, and the red boxes are error cases. (**a**) The red bounding box is a deviation detection that does not exactly match the target. (**b**) The red dashed bounding box indicates a missing detection. (**c**) The detection response only includes the upper body of the target.

## **2. Related Works**

Tracking by detection (TBD) has been one of the most promising methods developed to solve the multi-object tracking (MOT) problem in recent years. It generates object trajectories based on detection responses given by pre-designed detectors. For reliable data association, most recent research has been based on tracklets. In [26], a dual-threshold method was proposed to generate reliable tracklets and utilize them to get the final trajectories hierarchically. In [27], a prototype of a three-frame triplet, which is a type of three-member tracklet, was designed to extract high-level features. The Hungarian algorithm was also used to generate reliable tracklets in [12,15]. On the basis of tracklets, [11] built an online-learned conditional random field (CRF) model focused on distinguishing difficult pairs of objects. In [13], a hyper-graph model was developed to explore more complex relations among objects. The latest MOT methods [12,21] also focused on using tracklets. In these studies, tracklet building and feature expression are important to achieve reliable data association. In this section, MOT object feature extraction methods are mainly introduced.

From handcrafted methods to deep learning techniques, many studies have achieved significant improvements in extracting appropriate object features for MOT. In [11,13], a combination of multiple handcrafted features was proposed to distinguish objects by appearance, and their sample collection schemes were used in many following studies. The development of deep learning has introduced new ideas for feature description in the tracking field. In [24,28–30], deep neural networks were adopted for single object tracking (SOT) and achieved significant improvements. In SOT problems, object features are used to distinguish the target from the background. Different from SOT, MOT mainly distinguishes objects from each other. Due to this difference, the deep learning schemes of SOT cannot provide good results for MOT problems.

The deep learning methods for MOT can be summarized into two categories. The first builds a deep learning based tracking model to form the whole MOT system. Milan et al. [31] proposed a tracking model based on recurrent neural networks (RNN). The proposed RNN model described the whole tracking system including motion prediction, updating, object state judgment, and data association. It was trained online in an end-to-end manner to track various objects. Schulter et al. [14] proposed a deep network flow model for MOT, which instead of empirically hand-crafting costs, learned the parameterized costs of the network flow model by end-to-end training. This dynamic parameter setting method improved the robustness and accuracy of tracking. Zhou et al. [12] proposed a deep continuous conditional random field (DCCRF) model for solving online MOT problems. The unary term was used to provide a deep discriminative appearance feature for tracklet association, and a pairwise term was used to deal with inter-object relations. In [16], a deep neural network consisting of an encoder and a decoder was proposed. In their method, an encoder was a fully-connected network and a decoder was a bidirectional long short-term memory (LSTM). This network was able to learn the association matrix to solve MOT.

The second group uses a deep neural network to extract a discriminative feature for each object. Unlike the previous kind, this method deals with the object feature extraction problem directly, and many researchers have followed this idea. Sadeghian et al. [32] proposed an RNN model that jointly used the appearance, motion, and interactions of an object to encode a discriminative long-term temporal relationship over these cues. Their discriminative appearance features were extracted by a deep CNN. Son et al. [33] designed a quadruplet CNN (QCNN) to learn the affinities among objects based on appearance and motion. The proposed quadruplet loss function guided the network to learn a temporally-smooth appearance model with motion-aware constraints. Features extracted from the QCNN included time continuity, which enhanced discrimination. In addition, Siamese networks, first defined and used for signature verification, have played an important role and achieved good results in face identification [34], person re-identification [35], and many other computer vision applications. Siamese networks are well suited for distinguishing objects due to their symmetrical structures. Wang et al. [15] applied a Siamese CNN (SCNN) to construct an appearance affinity model for tracklets and embedded a temporally-constrained multi-task mechanism in the training process. Leal-Taixé et al. [22] used an SCNN to estimate the likelihood of two objects belonging together using multi-modal inputs including image and optical flow. Following [22], Yoon et al. [23] proposed a historical appearance matching method and trained a Siamese network by a two-step process to deal with noisy detections. In [17], a speed-up method was proposed to remove redundant appearance matchings of the SCNN for real-time tracking. In the DCCRF model [12], an SCNN was also used to extract discriminative features. Based on the SCNN, Bae et al. [21] proposed a confidence-based data association method for MOT; they utilized the SCNN to learn a discriminative appearance model from offline training datasets.

#### **3. Online Learned Siamese Network with Auto-Encoding Constraint**

In this section, a new Siamese network with an auto-encoding constraint (SNAC) is proposed. It is better at distinguishing objects in MOT. Benefiting from the simple structure of two fully-connected layers, an auto-encoder layer and a code-mix layer, the SNAC can be learned effectively. Meanwhile, with an auto-encoding constraint in the loss function, SNAC can prevent overfitting while training with limited online samples. In order to suppress detection noises, Gaussian distribution samples were generated around detection responses to make up the training set and HOC was used as the input instead of raw pixels. Then, an incremental learning algorithm was proposed to train the SNAC to generate reliable tracklets. Mathematical notations are listed in Table 1.


**Table 1.** Notations.

#### *3.1. The Structure of SNAC*

The two-layer structure of SNAC is shown in Figure 2a. Bounding boxes of detection responses were first resized to 48 × 32 as the inputs of the Siamese network. The two sub-networks (dashed boxes in Figure 2a) were identical in structure and shared parameters, including weights and biases. A contrastive loss function was employed to learn the Siamese network.

As shown in Figure 2b, each sub-network consisted of an auto-encoder layer and a code-mix layer. The first layer contained three parallel auto-encoders corresponding to the red, green, and blue channels of the input RGB image, respectively. Similar to [22], the inputs were the R, G, and B histograms, not pixel values, and they were denoted as 256-dimensional vectors: **x**<sup>0</sup>, **x**<sup>1</sup>, and **x**<sup>2</sup>. Because of the limited samples, training based on pixel values may lead to overfitting, and the histogram can also suppress detection noise. Each auto-encoder contained a forward encoder, a backward decoder, and an auto-encoding error evaluator. The encoder and decoder were fully-connected networks. The output of the encoder was a 100-dimensional vector, and the output of the decoder was a reproduction of the corresponding input. The code-mix layer was fully connected and combined the three code vectors of the first layer to produce a 100-dimensional feature vector as the final output. Mathematically, the sub-network can be written as:

$$\begin{cases} \mathbf{y}\_m^k = \sigma(\mathbf{W}\_E^k \mathbf{x}\_m^k + \mathbf{b}\_E^k), m = p, q, k = 0, 1, 2\\ \hat{\mathbf{x}}\_m^k = \sigma(\mathbf{W}\_D^k \mathbf{y}\_m^k + \mathbf{b}\_D^k), m = p, q, k = 0, 1, 2\\ \mathbf{z}\_m = \sigma(\mathbf{W}\_M(\mathbf{y}\_{m'}^0 \mathbf{y}\_{m'}^1 \mathbf{y}\_m^2) + \mathbf{b}\_M), m = p, q \end{cases} \tag{1}$$

where the subscript *m* indexes the upper *p* or lower *q* sub-network, the superscript *k* indexes the channel, **y** is the code vector from an encoder, **x̂** is the reproduction of **x** produced by the decoder from **y**, and **z** is the final feature vector. **W**, **b**, and *σ* are the weights, biases, and activation functions of the neural networks, with the subscripts *E*, *D*, and *M* indicating the encoder, decoder, and code-mix layer, respectively.
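To make the data flow of Equation (1) concrete, a minimal NumPy sketch of one sub-network is shown below; the layer sizes follow the text (256-dimensional channel histograms, 100-dimensional codes, a 300-to-100 code-mix layer), while the weight initialization and the choice of a sigmoid activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical parameter shapes following Equation (1): three per-channel
# auto-encoders (256 -> 100 -> 256) and one code-mix layer (300 -> 100).
W_E = [rng.normal(0.0, 0.01, (100, 256)) for _ in range(3)]
b_E = [np.zeros(100) for _ in range(3)]
W_D = [rng.normal(0.0, 0.01, (256, 100)) for _ in range(3)]
b_D = [np.zeros(256) for _ in range(3)]
W_M = rng.normal(0.0, 0.01, (100, 300))
b_M = np.zeros(100)

def subnetwork(x):
    """One SNAC sub-network pass: x is a list of three 256-d channel
    histograms; returns the final code z, the per-channel codes y^k,
    and the decoder reproductions x_hat^k."""
    codes, recons = [], []
    for k in range(3):
        y = sigmoid(W_E[k] @ x[k] + b_E[k])          # encoder: y^k
        recons.append(sigmoid(W_D[k] @ y + b_D[k]))  # decoder: x_hat^k
        codes.append(y)
    z = sigmoid(W_M @ np.concatenate(codes) + b_M)   # code-mix output: z
    return z, codes, recons
```

The same function serves both sub-networks, since they share all parameters.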

(**a**) The overall framework

(**b**) Internal details

**Figure 2.** Structure of SNAC: (**a**) shows the overall structure of SNAC, including its symmetrical structure and parameter sharing. Here, AEL stands for auto-encoder layer, superscripts 0, 1, and 2 indicate image channel numbers, and ML stands for the code-mix layer. (**b**) is the internal anatomical diagram of the SNAC structure, showing its auto-encoder layer and code-mix layer.

#### *3.2. Loss Function and Auto-Encoding Constraint*

To learn a Siamese network, a contrastive loss function was formulated based on similarity or difference measurements between input pairs. The objective was to train the network to sufficiently reduce the feature differences between pairs of the same inputs and to increase the feature distances of different ones. The distance between a training input pair is denoted as:

$$D(\mathbf{x}\_p, \mathbf{x}\_q) = ||\mathbf{x}\_p - \mathbf{x}\_q||\_2^2 \tag{2}$$

where **x***p* and **x***q* are the feature vectors from the two sub-networks of SNAC. Instead of the Euclidean distance used here, other measures, such as the Mahalanobis and Bhattacharyya distances, could be adopted.

Given a group of training samples, the loss function of SNAC to be minimized consists of three terms, L1, L2, and L3, as follows:

$$\begin{split} L &= \alpha L1 + \beta L2 + \gamma L3 \\ &= \alpha \sum\_{p,q} \max(0, \delta - l\_{pq} [1 - ||\mathbf{z}\_p - \mathbf{z}\_q||\_2^2]) \\ &+ \beta \sum\_{m=p,q} \sum\_{k=0,1,2} ||\mathbf{x}\_m^k - \hat{\mathbf{x}}\_m^k||\_2^2 \\ &+ \gamma \sum\_{k=0,1,2} (||\mathbf{W}\_k||\_2^2 + ||\mathbf{b}\_k||\_2^2) \end{split} \tag{3}$$

where *α*, *β*, and *γ* are weight coefficients between zero and one. The first term, L1, is a margin-based loss of difference of sample pairs; *δ* is the decision margin, which satisfies (0 ≤ *δ* ≤ 1); *lpq* is the sample indicator; *lpq* = 1 denotes a positive pair; and *lpq* = 0 denotes a negative pair. The L3 term is the regularization constraint.

However, deep neural networks contain large numbers of parameters and require huge sample sets for training. With limited online samples, a deep model often overfits after training and fails to work: it pays too much attention to local details of the training samples and does not capture general features. Therefore, inspired by the stacked auto-encoders in [24,25], the L2 term, an auto-encoding constraint (AC), was added to the loss function in Equation (3) to prevent overfitting even when training with limited online samples.
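Under the same notation, the loss of Equation (3) can be sketched as follows; the coefficient values are placeholders, and the sample-indicator convention (*lpq* = 1 for positive pairs, 0 for negative pairs) follows the text.

```python
import numpy as np

def snac_loss(z_p, z_q, l_pq, x, x_hat, weights, biases,
              alpha=1.0, beta=0.5, gamma=1e-4, delta=0.3):
    """Loss of Equation (3): margin-based pair term L1, auto-encoding
    constraint L2, and regularization term L3 (coefficients illustrative)."""
    d = np.sum((z_p - z_q) ** 2)                         # squared code distance
    L1 = max(0.0, delta - l_pq * (1.0 - d))              # margin-based term
    L2 = sum(np.sum((xk - xhk) ** 2)                     # auto-encoding constraint
             for xk, xhk in zip(x, x_hat))
    L3 = (sum(np.sum(W ** 2) for W in weights)           # weight/bias regularizer
          + sum(np.sum(b ** 2) for b in biases))
    return alpha * L1 + beta * L2 + gamma * L3
```

The AC term is what penalizes decoder reproductions that drift away from their inputs, which is the overfitting guard described above.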

#### *3.3. Denoising through the Collection of Training Samples*

Let *D*<sup>*t*</sup> = {*d*<sup>*t*</sup><sub>*i*</sub>, *i* = 1, 2, ..., *N*<sub>*t*</sub>} be the detection set at frame *t*. Each detection response *d*<sup>*t*</sup><sub>*i*</sub> is associated with a network SNAC(*d*<sup>*t*</sup><sub>*i*</sub>), whose training samples are collected around *d*<sup>*t*</sup><sub>*i*</sub>. The purpose of SNAC(*d*<sup>*t*</sup><sub>*i*</sub>) is to distinguish *d*<sup>*t*</sup><sub>*i*</sub> from the other detection responses in adjacent frames, not over a longer time period, so its training samples are collected online. Inspired by [11], *d*<sup>*t*</sup><sub>*i*</sub> is the only positive sample, and the remaining detection responses at frame *t* constitute the negative sample set. Although SNAC(*d*<sup>*t*</sup><sub>*i*</sub>) can be trained on a small sample set, such an unbalanced set with a single positive sample cannot drive it. To solve this problem, more samples are needed, which means generating additional detection responses around *d*<sup>*t*</sup><sub>*i*</sub>.

A further issue is that detection responses are not always accurate: their bounding boxes often deviate from the target, as shown in Figure 1a. When a noisy detection is used as a training sample, it impairs the parameters of SNAC. Since detection noise is inevitable, its effect can instead be suppressed by generating more copies of *d*<sup>*t*</sup><sub>*i*</sub> with random noise. This noise processing simultaneously solves the positive sample shortage.

Detection noise was assumed to be modeled as additive noise as follows:

$$\mathbf{p}\_n = \mathbf{p} + \mathbf{n}\_p, \ \mathbf{s}\_n = \mathbf{s} + \mathbf{n}\_s \tag{4}$$

where **p** = (*x*, *y*) is the center position of the detection response, **s** = (*w*, *h*) is the size vector of width and height, and **n**<sub>*p*</sub> and **n**<sub>*s*</sub> are additive noises on position and size, respectively. **n**<sub>*p*</sub> and **n**<sub>*s*</sub> are assumed to follow Gaussian distributions *G*(0, *σ*<sub>*p*</sub>) and *G*(0, *σ*<sub>*s*</sub>), where *σ*<sub>*p*</sub> and *σ*<sub>*s*</sub> are the corresponding covariances obtained by prior analysis.

A group of random bounding boxes Ψ({*d*<sup>*t*</sup><sub>*i*</sub>}) was generated around *d*<sup>*t*</sup><sub>*i*</sub> according to Equation (4) with the distributions of **n**<sub>*p*</sub> and **n**<sub>*s*</sub>. In the same way, Ψ(*D*<sup>*t*</sup> − {*d*<sup>*t*</sup><sub>*i*</sub>}) was obtained. Ψ({*d*<sup>*t*</sup><sub>*i*</sub>}) and Ψ(*D*<sup>*t*</sup> − {*d*<sup>*t*</sup><sub>*i*</sub>}) are the positive and negative sample sets, respectively. Using these online collected samples, SNAC(*d*<sup>*t*</sup><sub>*i*</sub>) can not only extract discriminative features for *d*<sup>*t*</sup><sub>*i*</sub>, but also suppress detection noises.
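A minimal sketch of this sample generation is given below; the noise standard deviations and the (x, y, w, h) box format are illustrative assumptions.

```python
import random

def noisy_samples(box, n, sigma_p=2.0, sigma_s=1.5, seed=0):
    """Generate the sample set Psi({d}) of Equation (4): n bounding boxes
    whose centre and size are perturbed by zero-mean Gaussian noise
    (the sigmas are illustrative; the paper obtains them by prior analysis)."""
    x, y, w, h = box
    rnd = random.Random(seed)
    return [(x + rnd.gauss(0, sigma_p), y + rnd.gauss(0, sigma_p),
             max(1.0, w + rnd.gauss(0, sigma_s)),   # clamp size to stay valid
             max(1.0, h + rnd.gauss(0, sigma_s)))
            for _ in range(n)]
```

Applying the same function to every other detection in the frame yields the negative set Ψ(*D*<sup>*t*</sup> − {*d*<sup>*t*</sup><sub>*i*</sub>}).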

#### *3.4. Iterative Tracklet Generation with SNAC by Incremental Learning*

The above sections discussed the establishment and training of SNAC. Each detection response *d*<sup>*t*</sup><sub>*i*</sub> is associated with SNAC(*d*<sup>*t*</sup><sub>*i*</sub>), which extracts discriminative features to better distinguish *d*<sup>*t*</sup><sub>*i*</sub> from the other detections in *D*<sup>*t*+1</sup>. Moreover, connecting these originally independent networks not only increases the number of samples, but also improves training efficiency. On the one hand, SNAC(*d*<sup>*t*</sup><sub>*i*</sub>) can obtain more training samples from *d*<sup>*t*−1</sup><sub>*j*</sub> in the adjacent frame *t* − 1 through a linking relationship. On the other hand, with this relationship, SNAC(*d*<sup>*t*</sup><sub>*i*</sub>) does not need randomly initialized parameters for training, but inherits them from SNAC(*d*<sup>*t*−1</sup><sub>*j*</sub>), which reduces the training time. This relationship is the principle of tracklet linking, namely that the two detection responses in adjacent frames belong to the same object. Incremental learning of SNACs through this inheritance can effectively match detection responses across adjacent frames. To generate reliable tracklets, an iterative algorithm with SNAC by incremental learning is proposed, as shown in Algorithm 1.

**Algorithm 1** Iterative tracklet building with SNAC by incremental learning.

**Input:** D = {*D*<sup>1</sup>, *D*<sup>2</sup>, ..., *D*<sup>*t*</sup>}, the detection set of each frame
**Output:** T<sup>*t*</sup> = {*T*<sup>*t*</sup><sub>*k*</sub>}, the tracklet set up to frame *t*
1: Initialization: *t* = 1, T<sup>1</sup> = ∅
2: **for** each *d* ∈ *D*<sup>1</sup> **do**
3: *T*<sup>1</sup><sub>*k*</sub> = *d*
4: Initialize **F**<sup>1</sup><sub>*k*</sub> with random parameters
5: Set *P* = Ψ(*d*), *N* = Ψ(*D*<sup>1</sup> − *d*)
6: Train **F**<sup>1</sup><sub>*k*</sub> with *P* and *N*
7: **end for**
8: **while** *t* ≥ 2 **do**
9: **for** each *T*<sup>*t*−1</sup><sub>*k*</sub> ∈ T<sup>*t*−1</sup> and each *d* ∈ *D*<sup>*t*</sup> **do**
10: Compute Λ<sub>*a*</sub>(*T*<sup>*t*−1</sup><sub>*k*</sub>, *d*) as in Equation (6)
11: Compute Λ(*T*<sup>*t*−1</sup><sub>*k*</sub>, *d*) as in Equation (5)
12: **end for**
13: For all Λ(*T*<sup>*t*−1</sup><sub>*k*</sub>, *d*) meeting the link requirement, select
14: pairs of *T*<sup>*t*−1</sup><sub>*k*</sub> and *d* by the Hungarian algorithm.
15: T<sup>*t*</sup> = T<sup>*t*−1</sup> renewed by linking the selected pairs.
16: *D*<sup>*t*</sup><sub>*R*</sub> = *D*<sup>*t*</sup>
17: **for** each *T*<sup>*t*</sup><sub>*k*</sub> ∈ T<sup>*t*</sup> having a new detection added **do**
18: *d* = the new detection of *T*<sup>*t*</sup><sub>*k*</sub>
19: Set *P* = Ψ(*d*), *N* = Ψ(*D*<sup>*t*</sup> − *d*)
20: **F**<sup>*t*</sup><sub>*k*</sub> = **F**<sup>*t*−1</sup><sub>*k*</sub> incrementally trained with *P* and *N*
21: *D*<sup>*t*</sup><sub>*R*</sub> = *D*<sup>*t*</sup><sub>*R*</sub> − *d*
22: **end for**
23: **for** each *d* ∈ *D*<sup>*t*</sup><sub>*R*</sub> **do**
24: Add a new single-member tracklet *T*<sup>*t*</sup><sub>*k*</sub> = *d*,
25: and set its **F**<sup>*t*</sup><sub>*k*</sub> as above.
26: **end for**
27: **end while**

At the first frame, *t* = 1, a new tracklet *T*<sup>1</sup><sub>*i*</sub> was established from each single detection *d*<sup>1</sup><sub>*i*</sub> in *D*<sup>1</sup>, giving *N*<sup>1</sup> tracklets in total. To match the detection response belonging to the same object (if any) in the next frame, a randomly initialized network, SNAC(*d*<sup>1</sup><sub>*i*</sub>), was associated with *d*<sup>1</sup><sub>*i*</sub>. After training SNAC(*d*<sup>1</sup><sub>*i*</sub>), the appearance similarity Λ<sub>*a*</sub>(*T*<sup>1</sup><sub>*i*</sub>, *d*<sup>2</sup><sub>*j*</sub>) between *T*<sup>1</sup><sub>*i*</sub>, which equals *d*<sup>1</sup><sub>*i*</sub>, and *d*<sup>2</sup><sub>*j*</sub> can be calculated. Together with the overlap similarity Λ<sub>*o*</sub>(*T*<sup>1</sup><sub>*i*</sub>, *d*<sup>2</sup><sub>*j*</sub>) based on position and size, the total similarity Λ(*T*<sup>1</sup><sub>*i*</sub>, *d*<sup>2</sup><sub>*j*</sub>) is obtained. Once the similarities of all detection responses have been calculated, the Hungarian algorithm determines whether there is a *d*<sup>2</sup><sub>*j*</sub> that can be combined with *T*<sup>1</sup><sub>*i*</sub>. If *d*<sup>1</sup><sub>*i*</sub> and *d*<sup>2</sup><sub>*j*</sub> belong to the same object, *d*<sup>2</sup><sub>*j*</sub> joins *T*<sup>1</sup><sub>*i*</sub>, and the tracklet is updated to *T*<sup>2</sup><sub>*i*</sub>; otherwise, a new tracklet *T*<sup>2</sup><sub>*N*<sup>1</sup>+1</sub> is generated for *d*<sup>2</sup><sub>*j*</sub>. Processing then moves to Frame 2, and the tracklets containing detection responses in Frame 2 need to be trained. Taking *T*<sup>2</sup><sub>*i*</sub> with last element *d*<sup>2</sup><sub>*j*</sub> as an example: if *d*<sup>1</sup><sub>*i*</sub> exists as the former element of *d*<sup>2</sup><sub>*j*</sub> in tracklet *T*<sup>2</sup><sub>*i*</sub>, the initial parameters of SNAC(*T*<sup>2</sup><sub>*i*</sub>), which equals SNAC(*d*<sup>2</sup><sub>*j*</sub>), are inherited from the trained SNAC(*d*<sup>1</sup><sub>*i*</sub>). In addition, the positive and negative training sets can be expanded with the samples of SNAC(*d*<sup>1</sup><sub>*i*</sub>), so training SNAC(*T*<sup>2</sup><sub>*i*</sub>) requires fewer iterations in this incremental manner. If *T*<sup>2</sup><sub>*i*</sub> is a newly added tracklet that only contains *d*<sup>2</sup><sub>*j*</sub>, SNAC(*T*<sup>2</sup><sub>*i*</sub>) is trained in the same way as SNAC(*d*<sup>1</sup><sub>*i*</sub>). Finally, all reliable tracklets T are produced frame by frame.
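As a concrete illustration of the linking step, the sketch below replaces the Hungarian assignment of Algorithm 1 with a simple greedy best-first matcher over a similarity matrix; the matrix layout, threshold value, and function name are illustrative assumptions, not the paper's implementation.

```python
def link_tracklets(sim, threshold=0.5):
    """Greedy stand-in for the assignment step of Algorithm 1 (a sketch):
    sim[k][j] is the total similarity Lambda(T_k, d_j); pairs above the
    link threshold are matched best-first, one detection per tracklet."""
    pairs = sorted(((sim[k][j], k, j)
                    for k in range(len(sim))
                    for j in range(len(sim[0]))), reverse=True)
    used_k, used_j, links = set(), set(), []
    for s, k, j in pairs:
        if s >= threshold and k not in used_k and j not in used_j:
            used_k.add(k)
            used_j.add(j)
            links.append((k, j))   # tracklet k absorbs detection j
    return links
```

Unmatched detections would then start new single-member tracklets, exactly as in lines 23–26 of Algorithm 1.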

Now, the calculation of the similarity between a tracklet and a detection response is explained. Λ(*T*<sup>*t*−1</sup><sub>*k*</sub>, *d*<sup>*t*</sup><sub>*j*</sub>) is given as follows:

$$
\Lambda(T\_k^{t-1}, d\_j^t) = \Lambda\_a(T\_k^{t-1}, d\_j^t) \Lambda\_o(T\_k^{t-1}, d\_j^t). \tag{5}
$$

The appearance similarity is computed from the distance between the feature vectors output by SNAC(*T*<sup>*t*−1</sup><sub>*k*</sub>). It is given by:

$$\Lambda\_a(T\_k^{t-1}, d\_j^t) = g\{ \|\mathbf{F}\_k^{t-1}(T\_k^{t-1}(e)) - \mathbf{F}\_k^{t-1}(d\_j^t)\|\_2^2 \}\tag{6}$$

where *T*<sup>*t*−1</sup><sub>*k*</sub>(*e*) denotes the end element of tracklet *T*<sup>*t*−1</sup><sub>*k*</sub>, **F**<sup>*t*−1</sup><sub>*k*</sub> denotes the feature extraction of the SNAC for tracklet *T*<sup>*t*−1</sup><sub>*k*</sub>, and *g* is a probability function on the squared distance of the feature vectors. Because of the margin-based loss of SNAC, the function *g* is defined as follows:

$$g(x) = \begin{cases} 1 & x < 1 - \delta \\ 0 & x > 1 + \delta \\ (1 + \delta - x) / 2\delta & \text{otherwise} \end{cases} \tag{7}$$

where *δ* is the decision margin given in the loss function of Equation (3).
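Equation (7) translates directly into code; the sketch below assumes a decision margin δ = 0.3 purely for illustration.

```python
def g(x, delta=0.3):
    """Probability function of Equation (7): maps a squared feature distance
    to an appearance similarity; delta is the decision margin of Eq. (3)."""
    if x < 1 - delta:
        return 1.0                        # clearly the same object
    if x > 1 + delta:
        return 0.0                        # clearly different objects
    return (1 + delta - x) / (2 * delta)  # linear ramp inside the margin
```

The two thresholds mirror the margin-based loss: distances pushed below 1 − δ during training map to similarity 1, and those pushed above 1 + δ map to 0.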

Overlapping is widely used to describe the positional relationship between detections, taking both coordinates and size into account. The overlap Λ<sub>*o*</sub>(*T*<sup>*t*−1</sup><sub>*k*</sub>, *d*<sup>*t*</sup><sub>*j*</sub>) is given as:

$$\Lambda\_o(T\_k^{t-1}, d\_j^t) = \frac{A\_{\cap}(T\_k^{t-1}(e), d\_j^t)}{\min[A(T\_k^{t-1}(e)), A(d\_j^t)]} \tag{8}$$

where *A* is the area function on a detection response and *A*∩ is the area function on the intersection of two detection responses.
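Equation (8) can be sketched as follows; the corner-based box format (x1, y1, x2, y2) is an assumption for illustration.

```python
def overlap(a, b):
    """Overlap ratio of Equation (8): intersection area divided by the
    smaller of the two box areas. Boxes are (x1, y1, x2, y2) corners."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return (ix * iy) / min(area(a), area(b))
```

Dividing by the smaller area (rather than the union, as in IoU) keeps the ratio high when one box is a partial detection contained in the other, which suits the failure case of Figure 1c.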

#### **4. Multi-Object Tracking Framework**

## *4.1. Overall Framework*

Based on SNAC, a tracking framework following TBD was established to solve the MOT problem. A TBD scheme can be described as solving a maximum a posteriori (MAP) problem:

$$\mathcal{T}' = \underset{\mathcal{T}}{\text{arg}\,\text{max}}\, P\left(\mathcal{T}|\mathcal{D}\right) \tag{9}$$

where D is the set of given detection responses and T is the set of trajectories. In the framework, tracklets were first generated. Because a tracklet is an ordered combination of detection responses, it is able to extract higher order features to better describe relations between objects. Then, the problem can be converted into a more reliable tracklet association as follows:

$$\mathcal{T}' = \arg\max\_{\mathcal{T}} P\left(\mathcal{T}|\mathbb{T}\right) \tag{10}$$

where T is the set of all tracklets.

The whole framework is shown in Figure 3. First of all, the inputs were checked, and malformed detection responses, such as overly large or small bounding boxes, were deleted. SNAC was proposed to extract discriminative appearance features for detection responses, and the online SNAC incremental learning method described above was used to generate reliable tracklets. The next step was to generate the tracking results through tracklet association. Similar to the learning-based detection association, SNAC was improved to extract a new discriminative composite feature, PAN, for each tracklet instead of using traditional handcrafted methods. To enhance tracklet association, a tracklet growing module was embedded to extend tracklets as far as possible. With the discriminative PAN feature, tracklet association was converted into a linear programming problem solved by an efficient greedy iterative algorithm, and the final trajectories were obtained. For real-time tracking, the whole tracking process was carried out in sliding time windows.

**Figure 3.** Illustration of the overall online tracking by detection (TBD) framework. In addition to standard inputs and outputs, an online tracking framework is established with new facilities, including an iterative Siamese network with an auto-encoding constraint (SNAC) to learn the detection responses, previous-appearance-next (PAN) to represent the composite features of tracklets, and pre-processing of tracklet growth to cope with short-time detection failures. Finally, a greedy iterative algorithm is used to output robust trajectories in sliding windows.

#### *4.2. Previous-Appearance-Next Feature of the Tracklet*

A tracklet *T*<sup>*t*2</sup><sub>*m*</sub> = {*d*<sup>*t*1</sup><sub>*i*</sub>, *d*<sup>*t*1+1</sup><sub>*j*</sub>, ..., *d*<sup>*t*2</sup><sub>*k*</sub>} is an ordered sequence of detection responses that represents a moving object over a short time, from frame *t*1 to *t*2. To describe *T*<sup>*t*2</sup><sub>*m*</sub>, appearance and motion are indispensable. In several studies [12,21,36], they are assumed to be independent of each other and can express the similarity between two tracklets only through a weighted summation. To increase flexibility and discrimination, a composite previous-appearance-next (PAN) feature is proposed. The new feature combines appearance and motion for the tracklet, and it is extracted jointly by an improved SNAC.

Taking *T*<sup>*t*2</sup><sub>*m*</sub> and *T*<sup>*t*4</sup><sub>*n*</sub> as examples, as shown in Figure 4b, *T*<sup>*t*4</sup><sub>*n*</sub> spans frames *t*3 to *t*4 with *t*2 < *t*3. To calculate the similarity between *T*<sup>*t*2</sup><sub>*m*</sub> and *T*<sup>*t*4</sup><sub>*n*</sub>, it is better to use the tail part of *T*<sup>*t*2</sup><sub>*m*</sub> and the head part of *T*<sup>*t*4</sup><sub>*n*</sub> rather than their whole information. *T*<sup>*t*2</sup><sub>*m*</sub>(*e*) is the last element of tracklet *T*<sup>*t*2</sup><sub>*m*</sub>, and *T*<sup>*t*4</sup><sub>*n*</sub>(*s*) is the first element of *T*<sup>*t*4</sup><sub>*n*</sub>. The PAN(*T*<sup>*t*2</sup><sub>*m*</sub>(*e*)) vector integrates the appearance and the previous- and next-stage motions of *T*<sup>*t*2</sup><sub>*m*</sub>(*e*) to express the tail-part composite feature of tracklet *T*<sup>*t*2</sup><sub>*m*</sub>. Correspondingly, the PAN(*T*<sup>*t*4</sup><sub>*n*</sub>(*s*)) vector is defined as the head-part composite feature of *T*<sup>*t*4</sup><sub>*n*</sub>. The next-stage motion of the tail of *T*<sup>*t*2</sup><sub>*m*</sub> and the previous-stage motion of the head of *T*<sup>*t*4</sup><sub>*n*</sub> were computed by estimation.

The SNAC for detection responses was revised to extract the PAN(·) vectors of tracklets. The new structure is shown in Figure 4a. The previous- and next-stage motions are used as additional inputs to the code-mix layer, while the first layer of the new SNAC is the same as before. **Δ**<sub>*p*</sub> = (*x*<sub>*p*</sub>, *y*<sub>*p*</sub>) and **Δ**<sub>*n*</sub> = (*x*<sub>*n*</sub>, *y*<sub>*n*</sub>) are the previous and next motion vectors of *T*<sup>*t*2</sup><sub>*m*</sub>(*e*), respectively. As shown in Figure 4b, **Δ**<sub>*p*</sub> represents the *x*- and *y*-axis displacements of *T*<sup>*t*2</sup><sub>*m*</sub> from *t*2 − 1 to *t*2. For the next-stage motion vector, *T*<sup>*t*2</sup><sub>*m*</sub>(*e* + 1), the estimated position of *T*<sup>*t*2</sup><sub>*m*</sub> in frame *t*2 + 1, was computed first, and then **Δ**<sub>*n*</sub> of *T*<sup>*t*2</sup><sub>*m*</sub>(*e*) was calculated.

Since **Δ**<sub>*p*</sub> and **Δ**<sub>*n*</sub> are two-dimensional displacement vectors while the output of each auto-encoder in the first layer of SNAC is a 100-dimensional feature vector, the two are entirely different in type and cannot simply be combined. Meanwhile, the presence of detection noise makes deterministic motion descriptions inaccurate. A distribution description method is therefore proposed to represent motion instead of specific values. Assuming a Gaussian distribution, the *x*-axis displacement, e.g., *x*<sub>*p*</sub> of **Δ**<sub>*p*</sub>, is described by *G*(*x*<sub>*p*</sub>, *σ*<sub>*x*</sub>), where *σ*<sub>*x*</sub> is set by pre-training; *G*(*y*<sub>*p*</sub>, *σ*<sub>*y*</sub>) likewise describes the *y* displacement. The distribution description is given by sample vectors of *G*(*x*<sub>*p*</sub>, *σ*<sub>*x*</sub>) and *G*(*y*<sub>*p*</sub>, *σ*<sub>*y*</sub>), whose length is taken equal to that of the appearance vector, because in MOT, the motion feature is as important as appearance. The distribution description for **Δ**<sub>*n*</sub> is obtained in the same way. These vectors are then merged with the three outputs of the first layer to form one mixed vector for second-layer training.
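The distribution description can be sketched as below: each displacement component is replaced by a vector of Gaussian samples whose length matches the appearance code; the standard deviations, the vector length, and the function name are illustrative assumptions.

```python
import random

def motion_distribution(dx, dy, length=100, sigma_x=1.0, sigma_y=1.0, seed=0):
    """Distribution description of a motion vector (dx, dy): sample vectors
    drawn from G(dx, sigma_x) and G(dy, sigma_y), each the same length as
    the 100-d appearance code so motion and appearance carry equal weight."""
    rnd = random.Random(seed)
    vx = [rnd.gauss(dx, sigma_x) for _ in range(length)]
    vy = [rnd.gauss(dy, sigma_y) for _ in range(length)]
    return vx, vy
```

The resulting vectors can be concatenated with the three auto-encoder codes to form the mixed input of the second layer.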

Then, SNAC(*T*<sup>*t*2</sup><sub>*m*</sub>(*e*)) was trained to extract the tail feature PAN(*T*<sup>*t*2</sup><sub>*m*</sub>(*e*)). Its training samples were also collected online. Similar to [11], elements in *T*<sup>*t*2</sup><sub>*m*</sub> are positive samples, while tracklets that overlap with *T*<sup>*t*2</sup><sub>*m*</sub> in time provide the negative samples. The parameters of the first layer were inherited from the corresponding detection SNAC. After training SNAC(*T*<sup>*t*2</sup><sub>*m*</sub>), discriminative local composite features can be extracted to distinguish *T*<sup>*t*2</sup><sub>*m*</sub> from other subsequent tracklets.

As shown in Figure 4b, the similarities between tracklets *T*<sup>*t*2</sup><sub>*m*</sub> and *T*<sup>*t*4</sup><sub>*n*</sub> were computed. After training, PAN(*T*<sup>*t*2</sup><sub>*m*</sub>(*e*)) and PAN(*T*<sup>*t*4</sup><sub>*n*</sub>(*s* + 1)), shown by the blue dashed circle areas in the figure, were extracted. The forward similarity is then obtained as follows:

$$S\_{m,n}^F = ||\text{PAN}(T\_m^{t2}(e)) - \text{PAN}(T\_n^{t4}(s+1))||\_2^2 \tag{11}$$

To get a reliable similarity, the backward relationship was also computed, as shown in Equation (12).

$$S\_{m,n}^{B} = ||\text{PAN}(T\_m^{t2}(e-1)) - \text{PAN}(T\_n^{t4}(s))||\_2^2 \tag{12}$$

The final similarity was given by:

$$
\Lambda\_{\text{PAN}}(T\_m, T\_n) = g(\min(S\_{m,n}^{F}, S\_{m,n}^{B})) \tag{13}
$$

where *g* is the probability function for the distance of feature vectors, as defined in Equation (7).
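Putting Equations (11)–(13) together, a minimal sketch of the tracklet similarity is given below; PAN vectors are represented as plain lists, and δ = 0.3 is assumed for illustration.

```python
def pan_similarity(pan_tail_e, pan_tail_prev, pan_head_s, pan_head_next,
                   delta=0.3):
    """Tracklet similarity of Equations (11)-(13): squared distances of the
    forward and backward PAN pairs, with the probability function g of
    Equation (7) applied to their minimum."""
    sq = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    s_f = sq(pan_tail_e, pan_head_next)    # forward:  tail(e)   vs head(s+1)
    s_b = sq(pan_tail_prev, pan_head_s)    # backward: tail(e-1) vs head(s)

    def g(x):  # probability function of Equation (7)
        if x < 1 - delta:
            return 1.0
        if x > 1 + delta:
            return 0.0
        return (1 + delta - x) / (2 * delta)

    return g(min(s_f, s_b))
```

Taking the minimum of the forward and backward distances means a pair of tracklets only needs one well-matched direction to score highly, which tolerates a single noisy endpoint.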

(**a**) SNAC for PAN

(**b**) Similarity computing by PAN

**Figure 4.** The generation and application of PAN. (**a**) SNAC is revised so that two pieces of motion information of a tracklet member, together with the appearance codes from the auto-encoder layer, serve as the inputs of the code-mix layer. During the online training process, the PAN feature is the final output of the code-mix layer. (**b**) Similarities of tracklets are determined by calculating the forward and backward PAN affinities.
