**1. Introduction**

Person re-identification (re-id), an important technique for automatically matching a specific person across a non-overlapping multi-camera network, has been widely used in many applications, such as surveillance [1,2], forensic search [3], and multimedia analysis [4]. It is a challenging problem because of large variations in a person's appearance caused by illumination, pose or viewpoint changes, as well as occlusion. Two fundamental problems in person re-identification that have been intensively studied in the literature [3,5–7] are feature representation [8–14] and metric learning [15–25]; this work is mainly concerned with the former.

Existing works on feature representation mostly rely on still person images across non-overlapping camera views. These works can be further divided into two groups according to the number of images they use: single-shot re-id and multiple-shot re-id. Single-shot methods model a person with a single image, which is evidently limited because, in real-world applications such as surveillance, a sequence of images is usually available for a person in each camera view. As a consequence, single-shot methods often suffer from practical challenging factors such as occlusion and pose, viewpoint, or lighting changes. Multiple-shot re-id methods [26–29] can alleviate these issues to some extent by utilizing more images in their feature representations. However, they still have two typical limitations: (1) they treat a video sequence as an unordered set of images, so the temporal information is totally lost; (2) they may be computationally costly, since appearance features need to be calculated on a large number of frames.

To tackle the limitations mentioned above, some authors have recently advocated video-based re-id, which can usually achieve better performance by exploiting the abundant space–time information in feature representation. Compared to still-image-based re-id, where only spatial alignment needs to be considered since the primary challenge in feature representation is robustness to viewpoint changes, one fundamental but challenging problem in video-based re-id is temporal alignment. Considering its significance, temporal alignment has been studied in very recent literature [30,31], both works using the Flow Energy Profile (FEP) to align the video sequences temporally. The FEP extracts the motion information from the optic flow field. Wang et al. extract fixed-length fragments around the local maxima/minima of the FEP [30], and Zhang et al. use the dominant frequency in the discrete Fourier transform domain of the FEP to extract a more stable walking cycle [31]. However, these methods still suffer from the following problems: (1) the FEP, computed from optic flow, is based on individual pixels, and all the pixels of the lower body are considered. That is, the FEP is pixel-level motion information, which is less accurate and less robust due to the heavy noise caused by background clutter or occlusion. (2) These works represent the video using all the motion information [30,31], i.e., around all the local maxima/minima of the FEP or over all the walking cycles. This, to some extent, introduces redundancy and noise caused by cluttered background and occlusion, which is harmful to robust representations. Thus, we argue that representing a person using only one walking cycle may be a more effective strategy to address the problem of redundancy and noise. However, the chosen walking cycle should be representative, i.e., contain less redundancy and less noise.
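To illustrate the two FEP-based alignment strategies discussed above, the following sketch treats the FEP as a 1-D motion-energy signal (one value per frame) and shows (a) estimating the walking-cycle period from the dominant frequency of its discrete Fourier transform, in the spirit of [31], and (b) locating the local maxima/minima used as fragment anchors in [30]. The signal here is synthetic; the actual FEP in [30,31] is computed from the optic flow of the lower body, which is omitted.

```python
import numpy as np

def dominant_cycle_length(fep):
    """Estimate the walking-cycle length (in frames) from a 1-D FEP
    signal via the dominant frequency of its discrete Fourier transform."""
    fep = fep - fep.mean()                     # remove the DC component
    spectrum = np.abs(np.fft.rfft(fep))
    freqs = np.fft.rfftfreq(len(fep), d=1.0)   # in cycles per frame
    k = 1 + np.argmax(spectrum[1:])            # skip the zero-frequency bin
    return int(round(1.0 / freqs[k]))          # period in frames

def local_extrema(fep):
    """Indices of strict local maxima/minima of the FEP (fragment anchors)."""
    d = np.diff(fep)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

# Synthetic FEP: a 10-frame walking period over 100 frames, plus mild noise.
rng = np.random.default_rng(0)
t = np.arange(100)
fep = np.sin(2 * np.pi * t / 10 + 0.3) + 0.05 * rng.standard_normal(100)
print(dominant_cycle_length(fep))   # → 10 (the true period)
```

Note that the frequency-domain estimate is global and therefore more stable under local noise than individual extrema, which is precisely the motivation given in [31].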
Although some deep learning-based methods have been proposed recently, they still demand far more computation, memory, and training data than traditional methods, which makes them unsuitable for some resource-limited applications.

To this end, we present a Superpixel-based Temporally Aligned Representation (STAR) method to address the temporal alignment problem for video-based person re-identification. More precisely, we first extract motion information from the input sequence by tracking the superpixels of the lowest portion of the human body, in order to build a candidate set of walking cycles. Second, under the assumption that each person's video contains a whole walking cycle, we select the "best" walking cycle to perform temporal alignment across videos. Finally, to build the representation of the selected walking cycle of a video, we propose a superpixel-based representation for each single image and a walking-cycle-based temporally aligned pooling method.
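The pipeline above can be sketched at a high level as follows. Note that this is an illustrative skeleton only: the cycle-scoring criterion (`cycle_regularity` below) and the simple mean pooling are hypothetical placeholders, not the actual components of STAR, which are defined in Section 3.

```python
import numpy as np

def cycle_regularity(fep, start, length):
    """Hypothetical score for a candidate walking cycle: deviation of the
    normalized FEP segment from a smooth periodic template (lower = better)."""
    seg = fep[start:start + length]
    seg = (seg - seg.mean()) / (seg.std() + 1e-8)
    template = np.sin(2 * np.pi * np.arange(length) / length)
    template = (template - template.mean()) / (template.std() + 1e-8)
    return float(np.mean((seg - template) ** 2))

def star_descriptor(frame_features, fep, candidates):
    """Select the best candidate cycle (start, length) and pool the
    per-frame features over it; mean pooling stands in for the
    temporally aligned pooling of Section 3."""
    best = min(candidates, key=lambda c: cycle_regularity(fep, *c))
    s, n = best
    return frame_features[s:s + n].mean(axis=0)   # one descriptor per video
```

The key design choice this sketch reflects is that only the single best-scoring cycle contributes to the final descriptor, so redundant or noisy frames outside that cycle are discarded entirely.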

A preliminary version of portions of this paper was published in [32]. Compared to [32], this manuscript (1) proposes a superpixel-based representation for still images, termed SPLOMO, and compares it with the original LOMO; (2) expands the paper from 5 pages to more than 17 pages; (3) expands or rewrites all the sections and adds the "Related work" section; (4) in the Experimental Results section, adds some recent works to the comparison results and adds the "Evaluation of Components and Parameters" subsection, analyzing the results in more detail.

To sum up, the contributions of this paper are as follows:

(1) We propose a robust temporal alignment method for video-based person re-id, featuring superpixel-based motion information extraction, an effective criterion for candidate walking cycles, and the use of only the best walking cycle to build the appearance representation.


The rest of this paper is organized as follows. Related work is reviewed in Section 2. We introduce the proposed STAR video representation in detail in Section 3. In Section 4, we present extensive experimental results, and we conclude this paper in Section 5.
