4.1.2. Settings

In the proposed STAR method, SPLOMO is used to represent the still images and Cross-view Quadratic Discriminant Analysis (XQDA) [8] is used for metric learning. On iLIDS-VID and PRID 2011, the performance of all the methods is measured by the average Cumulative Matching Characteristics (CMC) curves over 10 trials. On MARS, the performance is evaluated by CMC with the fixed partition of [49], and by Mean Average Precision (mAP). We now give the parameters used in our implementation. The width and height of the images are *W* = 64 and *H* = 128, respectively. We perform superpixel segmentation using SLIC [75], with the maximal number of superpixels set to 100, both in motion information extraction and in the superpixel-based representation. The threshold distance to the horizontal center line is set to *λ* = 17, according to the datasets. We use average pooling for temporal pooling and divide a walking cycle into *M* = 8 segments in our STAR method, according to the analysis in Section 4.4.
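For concreteness, the following is a minimal sketch of these superpixel settings using the SLIC implementation in scikit-image. The paper specifies SLIC [75], the 64 × 128 image size, and a maximum of 100 superpixels; the choice of library and the `compactness` value are our assumptions.

```python
from skimage.segmentation import slic
from skimage.transform import resize

W, H = 64, 128          # image width and height used in the experiments
N_SUPERPIXELS = 100     # maximal number of superpixels per frame

def segment_frame(frame):
    """Resize an RGB frame to W x H and segment it into superpixels."""
    frame = resize(frame, (H, W), anti_aliasing=True)
    # labels is an (H, W) integer map assigning each pixel to a superpixel
    labels = slic(frame, n_segments=N_SUPERPIXELS, compactness=10.0)
    return frame, labels
```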

In SPLOMO, we extract person masks using DDN [78] (the code is available at http://mmlab.ie.cuhk.edu.hk/projects/luoWTiccv2013DDN/index.html). Figure 9 shows some examples of person segmentation by DDN; the segmentation result used in this paper indicates the person region (see [78] for more details of DDN).

**Figure 9.** Some examples of person segmentation on the iLIDS-VID dataset.
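To illustrate how such a mask can interact with the superpixel segmentation, the sketch below keeps only the superpixels that lie mostly inside the person region. The overlap rule and the `min_overlap` threshold are illustrative assumptions, not the exact rule used in SPLOMO.

```python
import numpy as np

def person_superpixels(labels, mask, min_overlap=0.5):
    """Return ids of superpixels lying mostly inside the person mask.

    labels : (H, W) int array from superpixel segmentation
    mask   : (H, W) bool array, True on the person region (e.g., from DDN)
    """
    ids = []
    for sp in np.unique(labels):
        region = labels == sp
        # fraction of the superpixel's area covered by the person mask
        overlap = np.count_nonzero(mask & region) / np.count_nonzero(region)
        if overlap >= min_overlap:
            ids.append(sp)
    return ids
```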

#### *4.2. Comparison with the State-of-the-Art Methods*

In this section, we report the comparison results of STAR with existing state-of-the-art video-based person re-id approaches on the iLIDS-VID, PRID 2011 and MARS datasets. Three groups of approaches are compared, as shown in Table 1: (1) traditional methods: GEI + RSVM [59], HOG3D + DVR [30], Color + LFDA [82], STFV3D + KISSME [31], CS-FAST3D + RMLLC [57], SRID [55], TDL [56]; (2) deep network based methods: RNN [70], CNN + XQDA + MQ [49], SPRNN [53], ASTPN [28], DSAN [52]; and (3) TAPR [32], the preliminary version of our method. More specifically, GEI + RSVM [59] is a gait-based approach that is not specially designed for person re-id. HOG3D + DVR [30], Color + LFDA [82], STFV3D + KISSME [31] and CS-FAST3D + RMLLC [57] focus on appearance-based representations for video. SRID [55] formulates the re-id problem as a block sparse recovery problem. TDL [56] mainly focuses on metric learning under the top-push constraint. CNN + XQDA + MQ [49] uses a CNN to represent each frame. RNN [70], SPRNN [53], ASTPN [28] and DSAN [52] are end-to-end deep architectures that incorporate feature learning and metric learning together.

**Table 1.** Quantitative comparison of the proposed method and the state-of-the-art methods on the iLIDS-VID, PRID 2011 and MARS datasets. Bold and underlined values indicate the best and the second-best performance, respectively.


Table 1 shows that the STAR approach outperforms the other methods in general, especially on the iLIDS-VID and MARS datasets. In particular, on the iLIDS-VID dataset, the proposed method performs significantly better than the other methods, including the deep learning based ones: the rank-1 and rank-5 identification rates of our method are 5.5% and 4.1% higher than the second-best scores, respectively. On the PRID 2011 dataset, although the deep learning based methods perform better, STAR obtains comparable results and significantly outperforms the traditional methods. On the MARS dataset, although SPRNN [53] obtains the highest rank-5 and rank-20 scores, we achieve comparable results; moreover, the rank-1 score of STAR is 6.5% higher than the second-best score, obtained by DSAN [52]. More importantly, the mAP of STAR is 70.0%, about 20% higher than that of SPRNN [53]. The comparison results on the three datasets demonstrate that the proposed STAR performs favorably against the state-of-the-art methods, even the deep learning-based ones. The comparison of STAR with [30,31] shows that superpixel-level motion information is more robust than pixel-level motion information.

We also report the results of TAPR, the preliminary version of our method. The main difference between TAPR and STAR is that TAPR describes each frame with the LOMO feature, while STAR describes it with the SPLOMO feature. As mentioned above, SPLOMO improves LOMO by introducing superpixel and person-mask constraints for better spatial alignment. The comparison of these two methods shows that the proposed superpixel-based representation improves the performance of LOMO distinctly, especially on the iLIDS-VID dataset: the rank-1 identification rate is 67.5% for STAR on iLIDS-VID, compared with 55.0% for TAPR. This validates the effectiveness of the SPLOMO representation for spatial alignment.
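To make the distinction concrete, the sketch below computes a simple color descriptor only over pixels inside the person-region superpixels (using `person_superpixels` from above). The HSV histogram is merely a stand-in for the full LOMO descriptor; the binning and normalization are our assumptions.

```python
import numpy as np
from skimage.color import rgb2hsv

def masked_hsv_histogram(frame, labels, person_ids, bins=8):
    """HSV histogram restricted to person-region superpixels of one frame."""
    hsv = rgb2hsv(frame)                         # channel values in [0, 1]
    keep = np.isin(labels, person_ids)           # (H, W) person-region mask
    pixels = hsv[keep]                           # (n_pixels, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-12)           # L1-normalized descriptor
```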

It is worth noting that we achieve the best performance on both the iLIDS-VID and MARS datasets, and a comparable result on the PRID 2011 dataset. We believe this is because the iLIDS-VID and MARS datasets contain more occlusions and cluttered backgrounds than the PRID 2011 dataset, and the proposed STAR can select the "best" walking cycle to reduce their effect and achieve more accurate temporal alignment. Moreover, STAR uses the person masks to suppress the effect of the background, and uses the superpixel-based representation to achieve better spatial alignment. This indicates that (1) accurate spatial and temporal alignment is essential in video-based re-identification, especially when the scenes are complex; and (2) our method achieves robust temporal alignment even in complex scenes.
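As a hedged sketch of the temporal pooling step described in Section 4.1.2, the code below divides the frame descriptors of a selected walking cycle into *M* = 8 equal segments and average-pools within each. How the "best" walking cycle is selected is part of STAR and is not reproduced here; the equal-length segmentation below is an assumption.

```python
import numpy as np

def temporal_pool(frame_features, M=8):
    """Average-pool per-frame features over M temporal segments.

    frame_features : (T, D) array, one descriptor per frame of a walking
                     cycle (assumes T >= M so every segment is non-empty)
    returns        : (M * D,) sequence-level descriptor
    """
    T = frame_features.shape[0]
    bounds = np.linspace(0, T, M + 1).astype(int)    # segment boundaries
    pooled = [frame_features[s:e].mean(axis=0)       # average pooling
              for s, e in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(pooled)
```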
