#### *3.1. Motion Information Extraction*

We propose a superpixel-based motion information extraction method, which is more robust than pixel-based methods for two reasons: (1) because it operates on superpixels, the extracted dynamic information is robust to noise from individual pixels, from which pixel-based methods such as FEP suffer; (2) to alleviate the effect of occlusions and cluttered background, we extract motion information from local superpixels and then select the "best" cycle among the curves of all the superpixels. Note that although some superpixels may lie on the background, their motion information curves are easy to distinguish from those on the person, as discussed in Sections 3.2 and 4.3.

Given a video sequence $V = \{I_t\}_{t=1,\dots,T}$ with $T$ frames, we extract the motion information as illustrated in Figure 2. In our implementation, we only consider the motion information of the lowest portion of the human body, because the amplitude of its motion during walking is the most significant. Specifically, we first perform superpixel segmentation on the lowest portion of the first frame using the SLIC method [75], obtaining $N$ superpixels $\{S_1^j\}_{j=1,\dots,N}$. Figure 2b shows an example, where the superpixel labeled in red is on the right foot of the person.
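
As a concrete illustration, the following is a minimal sketch of this segmentation step using scikit-image's SLIC implementation; the crop ratio (lowest third) and the number of superpixels are illustrative assumptions, not values reported above.

```python
# Sketch of the superpixel extraction step on the lowest portion of a frame.
import numpy as np
from skimage.segmentation import slic

def lowest_portion_superpixels(frame, portion=1/3, n_segments=50):
    """Segment the lowest portion of a frame into superpixels.

    frame: HxWx3 RGB image as a numpy array.
    Returns the cropped region and its superpixel label map.
    """
    h = frame.shape[0]
    crop = frame[int(h * (1 - portion)):, :, :]  # keep only the lowest portion
    labels = slic(crop, n_segments=n_segments, start_label=0)
    return crop, labels
```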


Then we track all $N$ superpixels to extract the motion information. For the $j$th superpixel $S_1^j$, we track it throughout the video sequence $V$, resulting in a set of superpixels $\{ST_t^j\}_{t=1,\dots,T}$ containing $T$ elements, where $ST_t^j$ is from the $t$th frame. Note that although we use a set to describe the superpixels, it is ordered: the superpixels in the set are in the same order as the frames in the corresponding video.

At frame $t$, we extract $N_t$ superpixels using SLIC, denoted $\{S_t^k\}_{k=1,\dots,N_t}$. For simplicity, we obtain the tracking result $ST_t^j$ as the best match in $\{S_t^k\}_{k=1,\dots,N_t}$ to the initial superpixel $S_1^j$, i.e., the one with the smallest distance:

$$ST_t^j = S_t^{k^*} = \underset{k}{\text{arg min}} \; \big\|f(S_1^j) - f(S_t^k)\big\|_2^2, \tag{1}$$

where $f(x)$ denotes the representation of superpixel $x$. In this paper, we use a color feature, i.e., the HSV histogram, to represent the superpixels. Figure 2c shows the superpixel tracking results corresponding to the initial superpixel in Figure 2b. We denote by $\{L_t^j\}_{t=1,\dots,T,\,j=1,\dots,N}$ the horizontal positions of the centers of the superpixel sets $\{ST_t^j\}_{t=1,\dots,T,\,j=1,\dots,N}$. The final motion information of the region corresponding to the $j$th superpixel $S_1^j$ can then be described as $\{L_t^j\}_{t=1,\dots,T}$, as shown in Figure 2d. We can see that the tracked superpixels stay on the right foot throughout the entire video sequence. With high probability they belong to a part of the person carrying some semantic information, which makes the motion information extraction algorithm very robust.
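
The tracking rule in Equation (1) can be sketched as follows, with an HSV histogram as the feature $f(\cdot)$ as described above; the bin count is an assumption of this sketch.

```python
# Sketch of Equation (1): represent each superpixel by an HSV histogram and
# match the initial superpixel to the closest one in frame t.
import numpy as np
from skimage.color import rgb2hsv

def hsv_histogram(crop_rgb, labels, sp_id, bins=8):
    """f(x): concatenated H, S, V histograms of one superpixel."""
    hsv = rgb2hsv(crop_rgb)
    mask = labels == sp_id
    feats = [np.histogram(hsv[..., c][mask], bins=bins,
                          range=(0, 1), density=True)[0]
             for c in range(3)]
    return np.concatenate(feats)

def track_superpixel(f_init, crop_t, labels_t, bins=8):
    """Return the superpixel id in frame t minimizing ||f(S_1^j) - f(S_t^k)||^2."""
    ids = np.unique(labels_t)
    dists = [np.sum((f_init - hsv_histogram(crop_t, labels_t, k, bins)) ** 2)
             for k in ids]
    return ids[int(np.argmin(dists))]
```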


**Figure 2.** Motion information extraction based on superpixel tracking. (**a**) The image of frame #1; (**b**) One of the superpixels in frame #1, labeled in red color; (**c**) The superpixel tracking results in the images of frame #2 to #70, labeled in red color; (**d**) Horizontal positions (red dots) of the superpixel in all frames. The figure is best viewed in color.

It is worth noting that we track all the superpixels in the lowest portion of an image, without segmenting the lowest portion of the human out of the background. The main reasons are two-fold: (1) foreground segmentation is not dependable, since it is difficult under noise, occlusions, and low resolution (see the last row in Figure 9); (2) the motion information of the background superpixels is actually easy to eliminate, since in the lowest portion of the frame it differs markedly from that of the foreground superpixels. Figure 3 presents the motion information of some superpixels on the background. We can see that their motion information is quite different from that of the superpixels on the person shown in Figure 2, and does not match the intrinsic periodicity of walking persons. This observation indicates that we can easily select "good" walking cycles from noisy and redundant motion information, as described in Section 3.2.

**Figure 3.** Motion information of four superpixels on the background, from the same sequence in Figure 2. The figure is best viewed in color.

#### *3.2. Walking Cycle Selection*

To address the redundancy and noise caused by cluttered background and occlusion, we use only one walking cycle of frames to represent a video sequence. Therefore, after obtaining the redundant motion information (i.e., horizontal positions) $\{L_t^j\}_{j=1,\dots,N,\,t=1,\dots,T}$ of the $N$ superpixels, we select the "best" walking cycle $(t^*_{start}, t^*_{end})$. The motion information of a superpixel in the lowest portion of the human body is described as its horizontal displacement over time, as in Section 3.1. Fragments of the motion information can be considered as candidate walking cycles. A natural question then arises: what is the "best" walking cycle for person representation? The key to this question is to mathematically model the motion information of a walking cycle, for which we adopt the sinusoid function, based on two observations: (1) Many bipedal robots use an intuitive walking method with sinusoidal foot trajectories [76], based on the hypothesis that the trajectory of the feet follows sine waves in the $x$, $y$, and $z$ directions; this is consistent with the biomechanics of gait [77]. (2) The motion information of the feet (horizontal positions) annotated on the iLIDS-VID dataset is almost exactly sinusoidal, as shown in Figure 4. Therefore, we model the horizontal displacement of the feet as a sinusoid.

**Figure 4.** Illustration of the motion information of the feet, with some examples from the iLIDS-VID dataset. We annotate the mean horizontal positions of the right foot in a person's image sequence, and then plot the horizontal positions (red dots) against the frame index.

By modeling the horizontal displacements of the feet as a sinusoid, we propose an effective criterion to select the "best" walking cycle. We expect the selected cycle to have two characteristics: (1) it should be a complete walking cycle (i.e., a full sinusoidal period), so that it covers the entire dynamic information and the variety of poses and shapes; and (2) it should closely resemble a sinusoid, i.e., contain little noise caused by cluttered background and occlusion. That is, we search for complete candidate walking cycles in the curves of horizontal displacements according to the prior on walking persons, and then evaluate how good each candidate walking cycle is by measuring its fit error to a sinusoid.

Specifically, we first find candidate walking cycles from the motion information curves based on extreme points, as shown in Figure 5. Ideally, an extreme point corresponds to the posture where the distance between the two legs is maximal. However, in the presence of noise and occlusions, an extreme point with a small distance to the horizontal center line might be a false alarm. To obtain more accurate walking cycles, we proceed in two ways: (1) we smooth the curve by least-squares polynomial fitting to extract more accurate extreme points, and (2) we set an upper bound $y_{up}$ and a lower bound $y_{low}$ to eliminate false alarms. The second step is based on an observation of the public datasets: the walking person is roughly cropped out in each frame and is approximately at the center of the frame, which means the horizontal center line is the symmetry axis of the two legs in a frame. Thus, we set the upper bound $y_{up}$ and the lower bound $y_{low}$ at the same distance from the horizontal center line:

$$\begin{cases} y_{up} = c + \lambda \\ y_{low} = c - \lambda \end{cases} \tag{2}$$

where $\lambda$ is the threshold distance to the horizontal center line, and $c$ is the location of the horizontal center line, i.e., $c = W/2$, with $W$ the width of the image. We use these bounds to filter the extreme points on the smoothed curve and thereby alleviate the influence of noise and occlusions.
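
A minimal sketch of the smoothing and bound-filtering steps, assuming NumPy/SciPy; the polynomial degree and the value of $\lambda$ are illustrative assumptions.

```python
# Sketch of extreme-point extraction with least-squares polynomial smoothing
# and the bounds of Equation (2).
import numpy as np
from scipy.signal import argrelextrema

def filtered_extreme_points(L, W, lam, degree=8):
    """L: horizontal positions over time; W: image width; lam: band half-width."""
    t = np.arange(len(L))
    coeffs = np.polyfit(t, L, degree)        # least-squares polynomial fit
    smooth = np.polyval(coeffs, t)
    maxima = argrelextrema(smooth, np.greater)[0]
    minima = argrelextrema(smooth, np.less)[0]
    extremes = np.sort(np.concatenate([maxima, minima]))
    c = W / 2.0                              # horizontal center line, Eq. (2)
    y_up, y_low = c + lam, c - lam
    # keep only extreme points outside the band [y_low, y_up]
    keep = [k for k in extremes if smooth[k] > y_up or smooth[k] < y_low]
    return smooth, np.array(keep)
```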

**Figure 5.** Walking cycle extraction. (**a**) The motion curve of a superpixel; (**b**) the smoothed curve, with the extreme points indicated as red dots; (**c**) four candidate cycles, (5,27), (14,37), (27,50), and (37,58), with their scores; (**d**) the final selected walking cycle (5,27).

We denote by $(P_1, P_2, \dots, P_K)$ the $K$ extreme points and by $t_k$ the frame number of the $k$th extreme point $P_k$. We then define a candidate walking cycle $(t_{start}, t_{end})$ based on a group of three consecutive extreme points $(P_k, P_{k+1}, P_{k+2})$. If each of the three lies outside the bounds, i.e., above $y_{up}$ or below $y_{low}$, the group is considered a candidate cycle $(t_{start} = t_k, t_{end} = t_{k+2})$.
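
This grouping rule can be sketched as follows, taking each triple of consecutive surviving extreme points as one candidate cycle; the function name is hypothetical.

```python
# Sketch: every three consecutive filtered extreme points define a candidate
# walking cycle (t_k, t_{k+2}).
def candidate_cycles(extreme_frames):
    """extreme_frames: sorted frame indices of the filtered extreme points."""
    return [(extreme_frames[k], extreme_frames[k + 2])
            for k in range(len(extreme_frames) - 2)]
```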

To evaluate how "good" a candidate cycle $(t_{start}, t_{end})$ of the $j$th superpixel is, we fit its positions $\{L_t^j\}_{t=t_{start},\dots,t_{end}}$ to a sinusoid $\{Q_t^j\}_{t=t_{start},\dots,t_{end}}$ in a least-squares sense, and compute the score $R^j(t_{start}, t_{end})$ from its fit error to the sinusoid:

$$R^j(t_{start}, t_{end}) = \log\left(1 - \frac{\sum_{t=t_{start}}^{t_{end}} \big|L_t^j - Q_t^j\big|^2}{(t_{end} - t_{start} + 1) \cdot W}\right), \tag{3}$$

where $W$ is the width of the image. The final selected walking cycle $(t^*_{start}, t^*_{end})$ is the one with the highest score among all the candidate cycles on the motion curves of all superpixels:

$$(t^*_{start}, t^*_{end}) = \underset{j,\,(t_{start}, t_{end})}{\text{arg max}} \; R^j(t_{start}, t_{end}). \tag{4}$$
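
A hedged sketch of the scoring and selection in Equations (3) and (4), using a SciPy least-squares sinusoid fit; the particular parameterization of the sinusoid (amplitude, phase, offset, one period per candidate) is an illustrative choice.

```python
# Sketch of Equations (3)-(4): fit each candidate cycle to a sinusoid in a
# least-squares sense, score by the log-transformed normalized fit error, and
# pick the best cycle over all superpixels.
import numpy as np
from scipy.optimize import curve_fit

def cycle_score(L, t_start, t_end, W):
    t = np.arange(t_start, t_end + 1)
    y = np.asarray(L[t_start:t_end + 1], dtype=float)
    period = t_end - t_start                 # one full cycle spans the candidate

    def sinusoid(t, a, phi, c):
        return a * np.sin(2 * np.pi * (t - t_start) / period + phi) + c

    p0 = [(y.max() - y.min()) / 2, 0.0, y.mean()]
    params, _ = curve_fit(sinusoid, t, y, p0=p0)
    Q = sinusoid(t, *params)
    err = np.sum((y - Q) ** 2) / (len(t) * W)  # normalized fit error, Eq. (3)
    return np.log(1 - err)

def select_best_cycle(curves, candidates_per_curve, W):
    """Pick the (t_start, t_end) with the highest score, as in Eq. (4)."""
    best, best_score = None, -np.inf
    for L, cands in zip(curves, candidates_per_curve):
        for (ts, te) in cands:
            s = cycle_score(L, ts, te, W)
            if s > best_score:
                best, best_score = (ts, te), s
    return best, best_score
```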

We also present an example in Figure 5 to visually illustrate our algorithm, corresponding to the scenario in Figure 2. For the $j$th superpixel $S_1^j$ shown in Figure 2b, we first smooth its motion information curve (shown in Figure 5a); the smoothed curve is shown in Figure 5b, where the extreme points are indicated with red dots. Next we evaluate the candidate cycles, computing their scores from the fit error to a sinusoid, as shown in Figure 5c. The cycle with the highest score is selected as the best one, as shown in Figure 5d. Note that although the proposed superpixel-based method is more robust than pixel-based methods, it may still suffer from background clutter and occlusions; selecting the best walking cycle alleviates this problem to some extent. Two more examples and a more detailed discussion are given in Section 4.3.

#### *3.3. Superpixel-Based Representation*

As mentioned in Section 1, a temporally aligned representation is proposed for video-based re-identification. In this paper, we build on the local maximal occurrence representation (LOMO) [8] to represent individual frames. However, to further improve robustness, we enhance the original LOMO in two aspects: (1) The original LOMO uses fixed-size patches. Although patches are more robust to noise than aggregated pixel-level information, they can span multiple distinct image regions, which degrades robustness. Superpixels are known to be superior to patches in many tasks because they can be considered semantic visual primitives that aggregate visually homogeneous pixels. Thus, we propose a superpixel-based LOMO to describe the still image. (2) The inclusion of background is another factor that may degrade the robustness of the representation. To address this problem, person segmentation is employed to extract person masks, and only the superpixels on the person masks are considered for the still image representation.

To sum up, we propose a superpixel-based LOMO (SPLOMO) representation for still images, as shown in Figure 6. Specifically, we first perform superpixel segmentation and person segmentation, using the SLIC method [75] and the Deep Decompositional Network (DDN) [78], respectively. Note that the person mask is a binary map; the semantic information of different body parts is not used. Then, for each strip, only the superpixels lying in the strip and on the person mask are used to compute the histogram-based representation. In our implementation, a superpixel is considered for a given strip when it meets two conditions: (1) its overlap $O$ with the corresponding person mask is larger than the threshold $T_O = 0.8$; (2) its center lies in the strip. The final feature is obtained by a max operation as in [8].
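
The selection rule can be sketched as follows; `superpixels_for_strip` and its arguments are hypothetical names for illustration, and the strip layout is assumed to be a horizontal band of rows.

```python
# Sketch of the SPLOMO superpixel selection rule: a superpixel contributes to
# a strip only if (1) its overlap with the person mask exceeds T_O = 0.8 and
# (2) its center falls inside the strip.
import numpy as np

def superpixels_for_strip(labels, person_mask, strip_rows, T_O=0.8):
    """labels: superpixel label map; person_mask: binary person map;
    strip_rows: (row_start, row_end) of the strip. Returns selected ids."""
    selected = []
    for sp_id in np.unique(labels):
        sp = labels == sp_id
        overlap = np.logical_and(sp, person_mask).sum() / sp.sum()
        center_row = np.mean(np.nonzero(sp)[0])
        if overlap > T_O and strip_rows[0] <= center_row < strip_rows[1]:
            selected.append(sp_id)
    return selected
```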

**Figure 6.** Illustration of the superpixel based representation.

#### *3.4. Temporally Aligned Representation*

Given the selected walking cycle $(t^*_{start}, t^*_{end})$, the next problems for video-based re-identification are how to represent the 3D spatio-temporal video data and how to learn to match person videos; this paper focuses on the former. We propose to represent a walking cycle by temporally aligned pooling of the descriptors of all individual frames (or key frames) in the cycle. The still images are represented using SPLOMO, introduced in Section 3.3.

Note that the numbers of frames in the walking cycles of different video sequences usually differ, which is inconvenient for metric learning. An alternative modality is multi-versus-multi (MvsM), in which there is a group of multiple exemplars for each person in the gallery and a group of multiple images for each person in the probe set. However, the temporal information of a video sequence, which is quite important for video-based re-id, is missing in the MvsM method. To address this problem, we propose a temporally aligned pooling method that normalizes the representations of all the frames according to the intrinsic periodicity property. That is, we perform temporally aligned pooling according to the sinusoid corresponding to the walking cycle. Specifically, we equally divide the sinusoid into $M$ segments $\{\Phi_m\}_{m=1,\dots,M}$, and determine the corresponding phases $\{\Psi_m\}_{m=1,\dots,M}$ of the $M$ segments in the walking cycle. We describe the phase $m$ as $F_m$ by temporally aligned pooling of the features of the images in $\Psi_m$. We finally concatenate $\{F_m\}_{m=1,\dots,M}$ to form the final representation, termed the superpixel-based temporally aligned representation (STAR). Three pooling manners are proposed for temporal alignment: average pooling, max pooling, and key frame pooling, as shown in Figure 7 and sketched below. It is worth noting that in key frame pooling, $F_m$ is the feature of the first frame in $\Psi_m$.
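
A minimal sketch of the three pooling manners, assuming pre-computed per-frame SPLOMO features and approximating the equal division of the sinusoid by an equal split of the cycle's frame indices.

```python
# Sketch of temporally aligned pooling: assign the cycle's frames to M phase
# segments, pool each segment, and concatenate into the STAR descriptor.
import numpy as np

def temporally_aligned_pooling(features, M=4, manner="average"):
    """features: T' x D array of per-frame features in the selected cycle."""
    segments = np.array_split(np.arange(len(features)), M)  # phases Psi_1..Psi_M
    pooled = []
    for seg in segments:
        F = features[seg]
        if manner == "average":
            pooled.append(F.mean(axis=0))
        elif manner == "max":
            pooled.append(F.max(axis=0))
        elif manner == "key_frame":
            pooled.append(F[0])            # first frame of the phase
    return np.concatenate(pooled)          # the final STAR descriptor
```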

**Figure 7.** Illustration of the temporally aligned pooling representation, with $M = 4$. Three pooling manners for temporal alignment are presented: (**a**) average pooling, (**b**) max pooling, and (**c**) key frame pooling.
