Article

Pedestrian Localization in a Video Sequence Using Motion Detection and Active Shape Models

by Juan Alberto Antonio Velázquez 1, Marcelo Romero Huertas 2, Roberto Alejo Eleuterio 3,*, Everardo Efrén Granda Gutiérrez 4, Federico Del Razo López 3 and Eréndira Rendón Lara 3

1 Technological Institute of Higher Studies of Jocotitlan, Jocotitlan 50700, Mexico
2 Faculty of Engineering, Autonomous University of the State of Mexico, Toluca 50110, Mexico
3 Division of Postgraduate Studies and Research, National Technological of Mexico, Campus Toluca, Metepec 52149, Mexico
4 UAEM University Center at Atlacomulco, Autonomous University of the State of Mexico, Atlacomulco 50450, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(11), 5371; https://doi.org/10.3390/app12115371
Submission received: 25 April 2022 / Revised: 21 May 2022 / Accepted: 23 May 2022 / Published: 26 May 2022
(This article belongs to the Special Issue Emerging Feature Engineering Trends for Machine Learning)

Abstract: There is increasing interest in video object detection for many situations, such as industrial processes, surveillance systems, and nature exploration. In this work, we were concerned with the detection of pedestrians in video sequences. The aim was to deal with issues associated with the background, scale, contrast, or resolution of the video frames, which cause inaccurate detection of pedestrians. The proposed method is based on the combination of two techniques: motion detection by background subtraction (MDBS) and active shape models (ASM). The MDBS technique aids in the identification of a moving region of interest in the video sequence, which potentially includes a pedestrian; then, the ASM algorithm actively finds and adjusts the silhouette of the pedestrian. We tested the proposed MDBS + ASM method with video sequences from open repositories, and the results were favorable in scenes where pedestrians were in a well-illuminated environment; the mean fit error was up to 4.5 pixels. In contrast, in scenes where reflections, occlusions, or pronounced movement were present, the identification was slightly affected, with a mean fit error of 8.3 pixels in the worst case. The main contribution of this work was exploring the potential of the combination of MDBS and ASM for performance improvements in the contour-based detection of a moving pedestrian walking in a controlled environment. We present a straightforward method based on classical algorithms which have proven effective for pedestrian detection. In addition, since we were looking for a practical process that could work in real-time applications (for example, closed-circuit television or surveillance systems), we based our approach on simple techniques.

1. Introduction

Pedestrian detection is an essential field of study for video surveillance and similar applications [1]. It consists of detecting mobile subjects in a 2D digital image or video sequence. Factors such as variability in appearance, clothing, occlusions, lighting levels, background scene, and weather conditions represent challenges for accurate detection [2]. Furthermore, the detection of pedestrians with video surveillance cameras requires methods that capture their different walking postures, which vary throughout the video sequence [3]. Currently, pedestrian detection is carried out with different algorithms, for example, deformable techniques such as statistical shape models (SSM), active contour models (ACM), and active appearance models (AAM) [4]. A drawback of these methods is that they must capture the texture in addition to the variations in shape [5].
In this work, pedestrian detection is investigated by combining two techniques: motion detection using background subtraction (MDBS), which identifies the region of interest where a possible pedestrian is located, and active shape models (ASM), a deformable technique that detects the contour of the pedestrian. MDBS has been successfully applied to the detection of vehicles [6]; it is also characterized by a low computational cost, which makes it appropriate in this context. The use of ASM is a natural continuation of the research reported in reference [7]. Our approach seeks to detect the contour of a person who appears in the scene despite variations in the background. The detection process begins with the automatic calculation of the necessary translation parameters obtained with the MDBS technique; afterwards, ASM adjusts the silhouette of the pedestrian.
The main contribution of this work is the combination of MDBS and ASM to explore the potential improvement in detecting the contour of a moving pedestrian walking in a controlled environment. In summary, the MDBS technique finds the movement region through a rectangle that automatically provides the translation parameters; these parameters are then converted into coordinates used by one of the stages of the ASM algorithm to find and fit the contour. In this sense, this work presents a technique that differs from the state of the art in how the parameters that ASM uses to locate the contour are obtained. Moreover, the results obtained in this work are similar to those reported in references [8,9]; however, MDBS had not previously been used together with the ASM algorithm, especially in the adaptive form of adjustment in each frame t + 1 through the updated coordinates provided by MDBS.

2. Related Work

Background subtraction is one of the main methods used to address the problem of motion detection [10]. It separates the background from the objects of interest, but it is sensitive to dynamic changes due to lighting and occlusions. There are different techniques to perform motion detection by background subtraction (basic background modeling and statistical background modeling, among others [11]). Khalifa et al. [12] proposed a method that looks for a dynamic background region to reduce false positives in video obtained by a CCTV (closed-circuit television) camera. This method was evaluated with the CDnet 2012/2014 datasets (CDNET, http://jacarini.dinf.usherbrooke.ca, accessed on 8 January 2020); average precisions of 86.50% and 76.68% were obtained on the CDnet2012 and CDnet2014 databases, respectively. Camplani and Salgado [13] used a combination of classifiers that improves background subtraction, reducing false detections; although the images had variability in color, shadows, and lighting, an average precision of 70% was obtained in their work. Ramya and Rajeswari [14] presented a technique that improves the frame-difference method by classifying, at the pixel level, the first frames as background and comparing their planes with a correlation coefficient. They compared their results on different original images, and the best precision was 0.9984 for a camouflaged image. Sehairi et al. [15] compared 12 motion-detection techniques for finding the object of interest. They used the CDnet2014 dataset, obtaining the best result with the GMM (Gaussian mixture model) technique, with a specificity of 0.99593.
Regarding ASM, there is a wide body of literature on fitting shapes of organs and parts of the human body, such as hands or faces [16,17,18]. Esfandiarkhani and Foruzan [19] describe the construction of a point distribution model (PDM) in which the silhouette of the human body and its anatomical joints are combined to build the ASM model, which can subsequently segment the modeled body in new images. The use of deformable methods in pedestrian detection is limited because ASM is considered time-consuming due to its computational load [20]. For this reason, most investigations are based on the detection of static people (standing and facing the camera). It should be mentioned that when walking people are detected with ASM, this is usually done with gait recognition techniques [21].
Baumberg and Hogg [22] built a detector that works for various poses and views, even with two or three rigid pedestrians. Pedestrian detection is performed by means of an ellipsoidal silhouette trained with 40 characteristic points of the image (landmarks) using the Kalman filter. Koschan et al. [23] applied an active model to detect pedestrians in a video sequence using different numbers of landmarks (10, 14, 21, and 42) and three sets of silhouettes of a single person. Contour detection was performed, and the normalized error between the ground-truth landmarks and the estimated landmarks was calculated: the mean distance between such points was 2.74 pixels with 21 landmarks and 2.42 pixels with 42 landmarks.
Jang and Jung [24] proposed a model based on an exoskeleton to detect human poses. They used background subtraction and a pattern-matching algorithm in the fitting process; the accuracy in three poses was 97.32%, 93.3%, and 80.62%. Kim et al. [25] identified the way pedestrians walk. Pedestrians were detected with ASM, using 32 landmarks for the silhouette of a human in images of 720 × 480 pixels, reaching up to 90% accuracy. Lee and Choi [26] detected pedestrians using active models with infrared (IR) and visible-light cameras, which allow pedestrians to be followed in degraded and low-light environments. Ma and Ren [27] proposed a model-based method to recognize pedestrians; however, they did not specify the number of landmarks used to train the model. Images with a resolution of 320 × 240 pixels were obtained from an omnidirectional RGB (red-green-blue) camera, and a 94% success rate in detecting pedestrians was achieved. Ide [28] proposed a technique to segment pedestrians using a model trained with silhouettes identified with the grab-cut tool. This method compares the segmentation results of grab-cut against the results of the model; error rates of 2.32% and 2.15% were reported with the foreground (FG) and background (BG) models, respectively.

3. Experimental Set-Up

This section presents the experimental methodology used in this work. The process is summarized in four stages: (a) input video, from which the video sequences to be studied were selected; (b) motion detection, in which the proposed MDBS technique for motion detection by background subtraction was applied; (c) adjustment of the model (using ASM); and (d) evaluation of the effectiveness of the proposal.

3.1. Experimental Data

In the first stage of the experimental process, two datasets of video sequences commonly used in pedestrian detection were selected: CDnet2014 [29] and CASIA Gait dataset [30,31]. CDnet2014 contains six recording categories: baseline, dynamic background, camera jitter, shadow, intermittent object motion, and thermal. CASIA Gait is a database used to investigate the walking patterns of various pedestrians.
The CDnet2014 and CASIA Gait datasets contain scenes in which videotaped subjects appear. They present three critical problems: angle of vision, clothing, and changes in conditions while walking; thus, their use is appropriate for the objectives of this study because they involve situations with these underlying problems and allow the evaluation of the effectiveness of the proposed model. In this sense, only scenes in which pedestrians present problems with background variations were selected.
Three video sequences with pedestrians were chosen from the CDnet2014 database: office, PETS2006, and sofa. The first two recordings belong to the baseline category, and the third belongs to the intermittent object motion category (see Figure 1). The three selected video sequences (Figure 1) contain changes, mainly in lighting and background; they were recorded at a speed of 0.17 frames per second and have a resolution of 360 × 240 pixels in the case of the office and sofa scenes, whereas the PETS2006 scene has a resolution of 720 × 576 pixels. The CASIA Gait dataset contains encoded video sequences (320 × 240 pixels) with information on 124 pedestrians walking in different directions. For experimentation, four scenes of pedestrians walking in four directions (0, 36, 54, and 90 degrees) were selected (see Figure 2).
It is essential to notice that our approach converts the color frames of the selected video sequences into grayscale images, so a single processing pipeline handles all inputs. Thus, it allows the processing not only of color frames but also of frames from closed-circuit television (CCTV) or video surveillance systems.
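As a minimal illustration of this preprocessing step, the snippet below converts one frame to grayscale with OpenCV; the video file name is a placeholder, and any CCTV or dataset sequence could be read the same way.

```python
import cv2

# Hypothetical video path; replace with a CDnet2014 or CASIA Gait sequence.
capture = cv2.VideoCapture("pedestrian_scene.avi")
ok, frame = capture.read()                           # frame is a BGR color image
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # single-channel grayscale frame
capture.release()
```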

3.2. Motion Detection by Background Subtraction (MDBS)

In this stage, the challenge is to detect regions of movement. Motion detection by background subtraction is proposed to separate the object of interest (foreground) from the background. Algorithm 1 shows the main steps for the detection of motion regions. The detailed sequence for detecting the movement is as follows (a code sketch summarizing these steps is given after the list):
Algorithm 1 The basic sequence of motion detection with the MDBS technique.
1: Convert the first frame I to grayscale and set it as the background F
2: while remaining frames in the video sequence do
3:     Read the next frame I and convert it to grayscale
4:     B(x, y) ← I(x, y) − F(x, y)
5:     Binarize B given the threshold Th
6:     Erode B
7:     Eliminate residuals in B with 4 neighbors
8:     Fill holes in B
9:     Calculate the regression line y = a_0 + a_1 x          ▹ See Equations (3)–(5)
10:    Calculate the principal components                      ▹ See Equations (8) and (10)
11:    Delimit the region of movement
12: end while
  • Convert the first frame (I) of the video sequence to grayscale and set it as the background frame (F).
  • Read the next frame in the video sequence ( I ) and convert it to grayscale.
  • Subtract the background F from the current frame I using Equation (1), applied to every pixel (x, y) in the image:
    B(x, y) = I(x, y) − F(x, y)    (1)
  • Binarize B, considering a threshold Th = 61. Values greater than Th are set to 1 and values less than Th are set to 0; consequently, the possible pedestrian contour is segmented. Th is computed by Otsu's method [32], which is considered one of the best and most stable methods for choosing threshold values automatically [33]. Otsu's method is characterized by the simplicity of its procedure; it is a non-parametric and unsupervised technique for threshold selection. It minimizes the intraclass variance of black and white pixels to obtain an optimal threshold (Th), following a sequential search and using simple cumulative quantities as follows:
    Consider that G = [0, L − 1] is the range of gray levels of an image f(x, y) and P_i is the probability of gray level i. The threshold value t splits the image into two classes, C_0 = [0, t] and C_1 = [t + 1, L − 1], with probabilities α_0 = Σ_{i=0}^{t} P_i and α_1 = 1 − α_0, respectively. The means (μ) of the gray values of C_0 and C_1 are μ_0 = Σ_{i=0}^{t} i P_i / α_0 = μ_t / α_0 and μ_1 = Σ_{i=t+1}^{L−1} i P_i / α_1 = (μ − μ_t) / (1 − α_0), respectively, where μ = Σ_{i=0}^{L−1} i P_i and μ_t = Σ_{i=0}^{t} i P_i. Thus, the criterion function is defined in Equation (2) as the variance between the two classes (C_0 and C_1), where the optimal threshold Th is the value of t that maximizes η.
    η²(t) = α_0 (μ_0 − μ)² + α_1 (μ_1 − μ)² = α_0 α_1 (μ_0 − μ_1)²    (2)
    In this case, the calculation of T h was carried out under controlled conditions in environments with artificial light.
  • Erode B using a circular structuring element E of radius R = 5, which helps to round the contour. R = 5 was obtained through a trial-and-error strategy (a greedy solution); this value is a trade-off between the frame resolution and the person's scale within the frame.
  • Eliminate residuals in B by applying the 4-neighbor technique, treating pixels disconnected from the person's silhouette as residuals.
  • Fill gaps in B, which contain 0 values inside the outline of the person’s silhouette.
  • Calculate the equation of the regression line, which is used to delimit the region of interest where the moving object was detected (Equation (3)).
    y = a_0 + a_1 x    (3)
    where y is the estimated ordinate of the n points (x, y) corresponding to the pixels in B whose values are 1; a_0 and a_1 are constants obtained from the normal equations of the least-squares line, which form a system of two equations whose solutions are defined by Equations (4) and (5):
    Σ_{i=1}^{n} y_i = n a_0 + a_1 Σ_{i=1}^{n} x_i    (4)
    Σ_{i=1}^{n} x_i y_i = a_0 Σ_{i=1}^{n} x_i + a_1 Σ_{i=1}^{n} x_i²    (5)
  • Calculate the principal components considering x (of dimension n), which represents the M points with coordinates (x, y) corresponding to the pixels in B whose value is 1. The mean value m_x is given by Equation (6):
    m_x = (1/M) Σ_{i=1}^{M} x_i    (6)
    Similarly, the covariance matrix C_x, of size n × n, can be approximated by Equation (7):
    C_x ≈ (1/M) Σ_{i=1}^{M} (x_i − m_x)(x_i − m_x)^T    (7)
    The principal component transform (also called the Hotelling transform [34]) is obtained by Equation (8):
    y = A (x − m_x)    (8)
    where A is the transformation matrix that maps x onto y, and the mean of y is m_y = 0. The rows of A are the eigenvectors of C_x, sorted from largest to smallest eigenvalue. The covariance matrix of y is obtained from A and C_x as follows:
    C_y = A C_x A^T    (9)
    Therefore, the values of x can be recovered through the inverse transformation in Equation (10):
    x = A^T y + m_x    (10)
  • Finally, to delimit the region of movement (see Figure 3), the extreme points P_1(x_1, y_1), P_2(x_2, y_2), P_3(x_3, y_3), and P_4(x_4, y_4) of the lines L_1 and L_2 are estimated; the direction vectors of these lines are the eigenvectors [v_1, v_2]. The orthogonal lines intersect at the midpoint O(x̄, ȳ) of the pixels in B whose value is 1. Therefore, the region of movement is bounded by the points E_1(x_4, y_1), E_2(x_2, y_1), E_3(x_2, y_3), and E_4(x_4, y_3), as shown in Figure 3. The points that form the line L_1 can also be calculated using the regression line of Equation (3).
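To make the sequence above concrete, the following sketch approximates the MDBS stage with OpenCV and NumPy. It is only an illustrative implementation under several assumptions: the subtraction uses an absolute difference, residual elimination and hole filling are approximated by connected-component filtering and a morphological closing, and the movement region is bounded by the extreme foreground points along the principal axes; the authors' exact routines may differ.

```python
import cv2
import numpy as np

def mdbs_region(gray_frame, background, min_blob_area=50):
    """Approximate the MDBS stage: subtraction, Otsu binarization, morphology,
    and a PCA-based delimitation of the movement region (Section 3.2)."""
    # Step 1: background subtraction (absolute difference used here as a robust stand-in).
    diff = cv2.absdiff(gray_frame, background)

    # Step 2: binarize with a threshold chosen automatically by Otsu's method.
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Step 3: erode with a circular structuring element of radius 5 (diameter 11).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))
    mask = cv2.erode(binary, kernel)

    # Step 4: eliminate small residual blobs using 4-connected components.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=4)
    mask = np.zeros_like(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_blob_area:
            mask[labels == i] = 255

    # Step 5: fill holes inside the silhouette (approximated by a morphological closing).
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

    # Step 6: principal components of the foreground pixel coordinates.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None, mask
    pts = np.column_stack([xs, ys]).astype(float)
    center = pts.mean(axis=0)                       # midpoint O
    cov = np.cov(pts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)                # columns play the role of v1, v2

    # Step 7: extreme foreground points along each principal axis bound the movement region.
    proj = (pts - center) @ eigvecs
    extremes = pts[[proj[:, 0].argmin(), proj[:, 0].argmax(),
                    proj[:, 1].argmin(), proj[:, 1].argmax()]]
    x_min, y_min = extremes.min(axis=0)
    x_max, y_max = extremes.max(axis=0)
    return (int(x_min), int(y_min), int(x_max), int(y_max)), mask
```

In this sketch, the background would be the grayscale version of the first frame, and the returned rectangle provides the translation coordinates later consumed by the ASM fitting stage.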

3.3. Active Shape Model (ASM)

ASM is a deformable model, which is generated from a statistical model containing variations created from a training set of labeled images. The proposed ASM is detailed in Figure 4.
ASM consists of two main stages: (a) training, or generation of the point distribution model (PDM), and (b) an adjustment stage. The PDM is trained on a set of annotated images; a perpendicular gray-level image model, or gradient profile, is extracted for each PDM landmark, usually from the same set of annotated images, called the training set. In each image of the training set, several reference points (landmarks) are annotated manually along the contour of the object of interest, as illustrated in Figure 5, where an image with 50 landmarks can be observed.
Afterward, a non-aligned set of images of the pedestrian marked with 50 landmarks in each one of the frames is obtained, as exemplified in Figure 6.
The set of annotated contours is aligned to remove differences in pose (translation X_t and Y_t, scaling s, and rotation θ). Then, principal component analysis is performed to identify the main modes of shape variation in the aligned set of annotated contours. Next, image models are built by obtaining the gray or gradient levels for each point of the PDM, sampling the image profile perpendicular to the contour of the object at each landmark (l) for each of the training images, as illustrated in Figure 7. The mean image profile ḡ_l and the corresponding profile covariance matrix are calculated for each point l of the PDM. In our example, it is necessary to calculate 50 average perpendicular profiles and thus obtain the covariance matrices of each image [35].
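As an illustration of this profile-sampling step, the sketch below extracts a gray-level profile along the normal direction at one landmark and accumulates the mean profile and covariance over the training images; the nearest-pixel sampling and the 41-sample segment (20 pixels on each side, matching the 40-pixel profiles used later) are simplifying assumptions.

```python
import numpy as np

def sample_profile(gray, point, normal, half_length=20):
    """Sample gray levels along the normal direction at a landmark.
    `gray` is a 2-D image array, `point` = (x, y), `normal` a unit vector."""
    offsets = np.arange(-half_length, half_length + 1)   # 41 samples along the normal
    xs = np.clip(np.round(point[0] + offsets * normal[0]).astype(int), 0, gray.shape[1] - 1)
    ys = np.clip(np.round(point[1] + offsets * normal[1]).astype(int), 0, gray.shape[0] - 1)
    return gray[ys, xs].astype(float)

def profile_statistics(profiles):
    """Mean profile and profile covariance over the training images for one landmark."""
    profiles = np.asarray(profiles, dtype=float)          # shape: (num_images, profile_len)
    return profiles.mean(axis=0), np.cov(profiles, rowvar=False)
```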
In summary, the ASM-based algorithm consists of the following steps: (1) assignment of reference points or landmarks, (2) application of principal component analysis, (3) adjustment of the PDM, and (4) modeling of the local structure. Figure 8 shows a person with 50 manually selected landmarks on the contour, taken from the training set. The transformations necessary for the alignment are determined in an iterative process.

3.3.1. Landmarks

When a video frame is captured, suitable reference points must be assigned to the object's outline. The landmarks must be located on the contour of the pedestrian and must be updated from one image to another. In a two-dimensional image, the shape is represented by the landmarks as a vector given by Equation (11). The configuration of this system consists of 50 manually assigned reference points (L = 50). The main purpose of the reference points, or landmarks, is to control the shape of the model contours. Specifically, the initially assigned landmarks are updated by minimizing the deviation from the original profile, which is normal to the boundary at each landmark.
x = [x_1, …, x_L, y_1, …, y_L]^T    (11)

3.3.2. Principal Component Analysis (PCA)

A set of L reference points (landmarks) represents the shape of the object. These points are used to train the ASM model, which is the human contour detection method. Figure 8 shows a set of M misaligned shapes of the same subject belonging to the training set. Although each shape m in the training set lies in two-dimensional space, it can be modeled with a reduced number of parameters using the PCA technique. Suppose we have M shapes in the training set, represented by x_m for m = 1, …, M. The PCA algorithm works as follows:
  • Calculate the mean of the M sample shapes in the training set (Equation (12)):
    x̄ = (1/M) Σ_{m=1}^{M} x_m    (12)
  • Compute the covariance matrix S of the training set (Equation (13)):
    S = (1/M) Σ_{m=1}^{M} (x_m − x̄)(x_m − x̄)^T    (13)
  • Build the matrix of eigenvectors (Equation (14)):
    Φ = [φ_1 φ_2 … φ_q]    (14)
    where φ_j, j = 1, …, q, are the eigenvectors of S corresponding to the q largest eigenvalues.
  • Given Φ and x̄, each shape can be approximated as described by Equation (15):
    x_m ≈ x̄ + Φ b_m    (15)
    where:
    b_m = Φ^T (x_m − x̄)    (16)
  • Calculate q such that the sum of the q largest eigenvalues is greater than 98% of the sum of all eigenvalues.
To generate plausible shapes, the distribution of b is evaluated. To constrain the shapes to acceptable values, hard limits can be applied to each element of b_m, or b can be constrained to lie within a hyper-ellipsoid.
The PCA algorithm reduces the amount of data by simplifying the initial image to a set of eigenvalues and eigenvectors while preserving enough information to perform the training stage. The matrix of eigenvectors Φ (Equation (14)) contains the principal components of the training subset, which are used to update the model parameters during the fitting stage explained in the following section.
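A sketch of this shape-PCA stage, following Equations (12)–(16), is shown below; shape vectors are assumed to be stacked as in Equation (11), and q is selected with the 98% variance criterion.

```python
import numpy as np

def build_shape_model(shapes, variance_kept=0.98):
    """shapes: array of M aligned shape vectors, each [x1..xL, y1..yL] (Equation (11))."""
    X = np.asarray(shapes, dtype=float)          # shape: (M, 2L)
    mean_shape = X.mean(axis=0)                  # Equation (12)
    centered = X - mean_shape
    S = centered.T @ centered / X.shape[0]       # Equation (13)
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    q = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept) + 1
    Phi = eigvecs[:, :q]                         # Equation (14)
    return mean_shape, Phi, eigvals[:q]

def shape_parameters(shape, mean_shape, Phi):
    """b_m = Phi^T (x_m - mean), Equation (16)."""
    return Phi.T @ (shape - mean_shape)

def reconstruct_shape(b, mean_shape, Phi):
    """x ~ mean + Phi b, Equation (15)."""
    return mean_shape + Phi @ b
```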

3.3.3. Fitting the Model

To fit the model, the best shape and pose parameters are sought to match a shape in the model coordinate frame, x, to a new shape in the image coordinate frame, y, while minimizing the error function shown in Equation (17).
E = (y − T x)^T W (y − T x)    (17)
where W is a diagonal matrix whose elements are weight factors for each landmark, and T represents the geometric transformation of rotation, translation, and scaling. The weight factors are set according to the offset between the computed positions of the old and new landmarks in the profile: if the offset is large, the corresponding weight is set low, and if the offset is small, the weight is set high. Given a single point denoted by [x_0, y_0]^T, the geometric transformation is defined by Equation (18).
M [x_0, y_0]^T = s [ cos(θ)  −sin(θ) ; sin(θ)  cos(θ) ] [x_0, y_0]^T + [x_t, y_t]^T    (18)
After computing the set of pose parameters ( θ , rotation; t, translation; and s, scaling), the projection of y onto the model’s coordinate frame is obtained from Equation (19).
x_m = M^{−1} y    (19)
Finally, the model parameters are updated as depicted in Equation (20).
b_m = Φ^T (x_m − x̄)    (20)
The optimal displacement of each landmark is obtained from the search procedure along the detected contours. The combination of optimally updated landmarks generates a new shape in the image coordinate frame. This new shape is then used to find the closest model shape using Equation (17). After computing the best pose, denoted by M, this new shape is projected onto Φ, which contains the principal components of the given training set. This process updates the parameter b of the model. As a result, only variation corresponding to the principal components can affect the model parameters. After computing the model parameters, the new shape, denoted by x_m, can be generated by Equation (15), and this new shape is used in subsequent iterations, as in Equation (17). After a suitable number of iterations, the final shape obtained is x_new.
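The sketch below illustrates one iteration of this fitting loop under simplifying assumptions: the pose is estimated with an unweighted least-squares similarity fit rather than the weighted form of Equation (17), and the profile-based landmark search is represented by a placeholder function search_landmarks that is not defined here.

```python
import numpy as np

def fit_similarity(source, target):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping `source` onto `target`; both are (L, 2) arrays of landmarks."""
    sc, tc = source.mean(axis=0), target.mean(axis=0)
    A, B = source - sc, target - tc
    denom = (A ** 2).sum()
    a = (A * B).sum() / denom
    b = (A[:, 0] * B[:, 1] - A[:, 1] * B[:, 0]).sum() / denom
    s = np.hypot(a, b)
    theta = np.arctan2(b, a)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    t = tc - s * (sc @ R.T)
    return s, R, t

def asm_iteration(gray, b, mean_shape, Phi, search_landmarks):
    """One ASM update. `b` is the current parameter vector, `mean_shape` and `Phi`
    come from the PCA stage, and `search_landmarks` is a placeholder for the
    profile search that suggests better landmark positions in the image."""
    x_model = mean_shape + Phi @ b                       # model-frame shape (Equation (15))
    model_2d = x_model.reshape(2, -1).T                  # (L, 2) points
    y = search_landmarks(gray, model_2d)                 # suggested image-frame landmarks
    s, R, t = fit_similarity(model_2d, y)                # pose parameters (Equations (17)-(18))
    y_model = ((y - t) @ R) / s                          # project y into the model frame (Equation (19))
    b_new = Phi.T @ (y_model.T.flatten() - mean_shape)   # update parameters (Equation (20))
    x_new = mean_shape + Phi @ b_new                     # constrained shape
    image_shape = s * (x_new.reshape(2, -1).T @ R.T) + t
    return b_new, image_shape
```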
In this work, the experimental parameters M = 50, L = 50, and a profile segment length of 40 pixels were set according to the state of the art and the frame resolution [9,36,37]; additionally, they are based on the scale of the person's contour. A programming function could be used to analyze the relationship of those parameters and estimate optimal values; nevertheless, we considered that outside the scope of this study, which focused on the analysis of the MDBS and ASM methods for pedestrian localization in a video sequence.

4. Experiments

This section shows the results of the moving pedestrian detection process with the combination of MDBS and ASM (MDBS + ASM). MDBS identifies motion regions through the background subtraction motion detection method (Section 3.2); then, ASM (Section 3.3) adjusts the silhouette of the pedestrian.
All video sequences have different numbers of frames, but we standardized them to 50 frames per video, using only those that contained moving pedestrians in the scene. For example, the office video has 2700 frames, but only frames from 620 to 700 (50 in total) were chosen because those are the frames in which the pedestrian appears to be walking. The PDM (see Section 3.3) then used 50 landmarks on the pedestrian contour, considering some anthropometric points of the body silhouette. Figure 5 shows an example of the initial marking (which is done manually). Subsequently, the training set was aligned, originating the average shape described in Section 3.3. To capture gray profiles in the CDnet2014 and CASIA Gait scenes, a 40-pixel line (20 pixels up and 20 pixels down) perpendicular to the contour at each landmark was drawn (see Figure 7).

4.1. Qualitative Results

The results obtained with the CDnet2014 dataset are shown in Figure 9 and Figure 10. They exhibit the original marking (GT, ground truth, in green) used for training and the estimated contour (red points calculated by our method). These points mark the silhouettes of the pedestrians detected with the ASM algorithm. As can be seen in the scene exhibited in Figure 9, the algorithm fits the contour of the pedestrian very closely in the office scene because the lighting, and especially the background, allow precise detection. However, there are several problems in the PETS2006 scene (Figure 10a): the reduced scale of the pedestrian, occlusion from the train, and reflections from the floor (background problems); these problems prevent ASM from fitting correctly. In the sofa scene (Figure 10b), there are also occlusion problems and, especially, reflections from the floor, which create contrast problems and make it more difficult for ASM to perform a close fit. Despite these problems, the pedestrian's silhouette was correctly identified in all cases.
The results from the CASIA Gait dataset, where the pedestrian is walking in different orientations with respect to the camera, are shown in Figure 11. The pedestrian is shown walking at 0 degrees, i.e., directly toward the camera, in Figure 11a. The red points (approximate result) are very close to the landmarks (green points), which implies good performance of the combination of MDBS + ASM. These were the best results of all orientations because there was not as much variation in the pedestrian's arms and legs. However, when the pedestrian is walking at 36 degrees (Figure 11b), the red points on the arms (approximate contour) are moderately far from the true landmarks (green dots). In this scene, the pedestrian has variations in the movement of his arms, the opening of his legs, and his angle of walking (movement factors), which means that the ASM adjustment does not match the reference landmarks. Next, Figure 11c exhibits the adjustment results for a pedestrian walking at 54 degrees. The approximate points tend to be slightly off the landmarks on the legs, but the contour is more precise where the arms are closer to the body. Finally, the results of the adjustment for a pedestrian walking at 90 degrees are presented in Figure 11d, where the pedestrian walks side-on to the camera. The fit results shown by the (approximate) red dots tend to be far from the landmarks on the legs. The adjustment by ASM in this scene brings the dots close to the contour of the pedestrian at all points except for the gap between the legs due to the movement.
From a qualitative point of view, Figure 9, Figure 10 and Figure 11 show a good approximation of pedestrian detection due to the combination of MDBS and ASM. The contour adjustment of a moving pedestrian is better when background problems (occlusion or illumination) are not evident and the movement factors associated with the pedestrian are not too pronounced (Figure 9). In contrast, in situations with background problems (Figure 10) or many movement factors (Figure 11), the detection with MDBS and the adjustment of the contour with ASM showed limited (but acceptable) performance; i.e., the closeness between the real landmarks and those obtained by the approximation is not as good as expected, but the pedestrian was correctly identified and the contour is well defined.

4.2. Quantitative Results

A quantitative analysis of the results obtained by the combination of MDBS + ASM was performed to deepen the study of this proposal. Leave-one-out cross-validation [38] was selected to verify the accuracy of the fit achieved by the model. The ASM model was trained with M − 1 frames, and the values obtained (or approximated) were evaluated with respect to the frame that was left out of the training. This process was repeated for the M frames. Leave-one-out cross-validation is very useful for strengthening the validity of experimental results when the available dataset is relatively small [39]; in this study, M = 50. The fit error for each frame m was obtained with Equation (21), which uses the Euclidean distances between the landmarks (x_l, y_l) and the points approximated by ASM (x′_l, y′_l) for the total of L landmarks.
e_m = (1/L) Σ_{l=1}^{L} √((x_l − x′_l)² + (y_l − y′_l)²)    (21)
Despite its simplicity, the Euclidean distance was used in this work because it has proven effective as part of the ASM procedure; even recent state-of-the-art research continues to apply it for this reason [16,40]. Moreover, we consider this measure appropriate for the aim of this work because it captures the distances between the landmarks and the approximate points obtained by ASM during contour detection.
Then, the mean fit error of the ASM is given by Equation (22), where M = 50 corresponds to the number of cross-validation iterations (the number of frames) in the training set.
ē = (1/M) Σ_{m=1}^{M} e_m    (22)
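These two equations translate directly into code; the sketch below assumes landmark arrays of shape (L, 2), and the leave-one-out loop shown in the comments relies on hypothetical train_asm and fit helpers that stand in for the training and fitting stages described in Section 3.3.

```python
import numpy as np

def frame_fit_error(true_landmarks, estimated_landmarks):
    """Equation (21): mean Euclidean distance (in pixels) over the L landmarks of one frame."""
    diffs = np.asarray(true_landmarks, float) - np.asarray(estimated_landmarks, float)
    return np.linalg.norm(diffs, axis=1).mean()

def mean_fit_error(per_frame_errors):
    """Equation (22): average of the per-frame errors over the M evaluated frames."""
    return float(np.mean(per_frame_errors))

# Leave-one-out cross-validation skeleton over M frames (train_asm and model.fit are
# hypothetical placeholders for the ASM training and fitting routines):
# errors = []
# for m in range(len(frames)):
#     model = train_asm([f for i, f in enumerate(annotated_frames) if i != m])
#     errors.append(frame_fit_error(ground_truth[m], model.fit(frames[m])))
# print(mean_fit_error(errors))
```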
In this paper, we use the box-plot data analysis method to explain the behavior of the experimental results obtained with the MDBS + ASM approach. This tool (also called the box-and-whisker plot) allows easy comparison of different datasets. It uses the median, the approximate quartiles (Q_f, where f = {1, 2, 3}), and the lowest (Min) and highest (Max) data points to convey the level, spread, and symmetry of a distribution of data values; it can also easily be refined to identify outliers [41]. Figure 12 and Figure 13 visually summarize and compare the groups of data corresponding to the studied scenes. The data correspond to the accuracy of the fit achieved by the model and are expressed in terms of the mean fit error in pixels (Equation (22)). Details of the data shown in Figure 12 and Figure 13 are presented in Table 1, which reports, for each scene, the maximum and minimum values, range, IQR (interquartile range), and quartiles in pixels.
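For reference, the quantities reported in Table 1 (minimum, maximum, median, range, IQR, and quartiles) can be computed from the per-frame errors as in the generic NumPy sketch below; this is not the authors' script.

```python
import numpy as np

def boxplot_summary(errors):
    """Summary statistics of per-frame fit errors, matching the columns of Table 1."""
    e = np.asarray(errors, dtype=float)
    q1, q2, q3 = np.percentile(e, [25, 50, 75])
    return {"Min": e.min(), "Max": e.max(), "Median": q2,
            "Range": e.max() - e.min(), "IQR": q3 - q1,
            "Q1": q1, "Q2": q2, "Q3": q3}
```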
The box plot in Figure 12 shows the mean fit error for the CDnet2014 dataset. Each frame in this dataset has a resolution of 360 × 240 pixels for the office and sofa scenes and 720 × 576 pixels for the PETS2006 scene. The number of evaluated frames was set to M = 50, the number of landmarks was L = 50, and the length of the gray-level profiles at the landmarks was 40 pixels. The analysis was done twice. The office scene resulted in a mean fit error of 7.2 pixels, which is a satisfactory result considering the level of occlusion of the lower extremities of the pedestrian in the scene. In the PETS2006 scene, a mean fit error of 8.3 pixels was obtained, despite the challenges of floor reflections, shadows, and occlusion. Similarly, in the sofa scene, a mean fit error of 7.1 pixels was achieved. The best performance of the MDBS + ASM proposal was observed for the sofa and office scenes, and it was slightly worse for PETS2006 because of the reduced size of the captured pedestrian, but also because of the background problems and the occlusion by the train in the scene. In all cases, the error is small compared to the size of the original frames.
From a general point of view, Figure 12 shows that the proposed approach performed better in the Office scene than in the PETS2006 and Sofa scenes (in that order). This is explained by the Office scene having less movement and lower illumination variations (only when the pedestrian enters the office and is near a piece of furniture, or when the pedestrian is reading a book). In the PETS2006 scene, the MDBS + ASM method presents the most significant errors when the camera is far from the subject (the pedestrian is at the edge of the stage); the camera also makes adjustments due to the brightness of the floor and the occlusion caused by background objects (such as a train passing behind the pedestrian). Similarly, in the Sofa scene, the natural and artificial lighting affects the brightness of the floor, and occlusion occurs when the pedestrian passes near the sofa. Furthermore, the opening and closing of the feet and hands (movement variations) contributes to worsening the algorithm's performance compared to the other scenes.
For the evaluation of scenes in the CASIA Gait dataset, where pedestrians walk at different orientations relative to the camera, the results are shown in Figure 13. In this collection of video scenes, all frames have a resolution of 320 × 240 pixels. The number of evaluated frames was set to M = 50, L = 50 landmarks were used, the length of each profile was 40 pixels, and the analysis was done twice, as for CDnet2014. In the scene where the pedestrian walks at zero degrees, a mean fit error of 4.5 pixels was calculated (the overall best result in our experiments). Regarding the pedestrian walking at 36 degrees, the mean fit error was 6.4 pixels; in the case of the scene at 54 degrees, the error was 6.1 pixels; lastly, in the scene with a pedestrian walking at 90 degrees, the error was 5.6 pixels. Higher error was noted when the pedestrian opened his legs or when variations in the movement of the arms were detected. However, excess or lack of illumination and occlusions (background problems) also affect the performance of the MDBS + ASM technique because they prevent correct detection of the pedestrian silhouette.
The best performance observed in Figure 13 was found for scenes where the position of the camera showed less movement variation. In the scene of the pedestrian walking at zero degrees, the shot is from the front, and the movement of hands and feet is barely perceptible by the camera, so there was reduced error at the time of detection. However, in the scene with a pedestrian walking at 36 degrees, the camera angle is a factor, since it captures the pedestrian moving his arms and legs while walking. In the scene of a pedestrian walking at 54 degrees, as in the previous one, the camera's position influences the detection, although there is less variation in the movement of the arms and legs. Finally, in the scene of the pedestrian walking at 90 degrees, the pedestrian is seen walking sideways, from left to right, and some errors occurred at the time of detection between the legs. Moreover, the pedestrian's clothing also affects the detection process, especially when the pants are occluded by the floor squares (which can appear almost the same as the pants, especially in grayscale). To summarize, the complexity intrinsic to changes in the camera position causes varied performance of the MDBS + ASM model. For these reasons, we consider it necessary to study the method further to improve its performance in complex scenarios.
To analyze whether the mean fit error evaluated on the selected datasets has a normal distribution, the Jarque–Bera test was performed. It is a goodness-of-fit test of departure from normality based on the sample skewness and kurtosis. If the data follow a normal distribution, then the skewness (W) will be close to zero and the kurtosis will ideally be K = 3. For a collection of N observations, the test statistic JB is defined as follows [42]:
JB = (N/6) (W² + (K − 3)²/4)    (23)
In our experiments, the number of observations (groups) can be taken as the total number of frames used for each scene, N = 50. With L = 50 landmarks per frame, 2500 data points per scene were analyzed to estimate the JB significance value. The null hypothesis H_0: W = 0 and K = 3 is rejected if high values of JB are obtained; i.e., the observed data do not follow a normal distribution. Moreover, the probability (p-value) of the null hypothesis relies on the fact that JB is asymptotically distributed as a chi-square with two degrees of freedom [43], and the p-value can be estimated from classical probability tables [44] or standard spreadsheet software. Considering the right tail of the chi-square distribution, the null hypothesis can be rejected if the p-value < 0.05 [42]. Table 2 summarizes the results of the Jarque–Bera normality test.
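The JB statistic of Equation (23) and its chi-square p-value (two degrees of freedom) can be computed as in the sketch below; SciPy also provides a ready-made scipy.stats.jarque_bera function, but the explicit form follows the equation above.

```python
import numpy as np
from scipy.stats import chi2, skew, kurtosis

def jarque_bera_test(samples):
    """Equation (23): JB = N/6 * (W^2 + (K - 3)^2 / 4), p-value from a chi-square with 2 dof."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    W = skew(x)
    K = kurtosis(x, fisher=False)          # Pearson kurtosis, so a normal sample gives K close to 3
    JB = n / 6.0 * (W ** 2 + (K - 3.0) ** 2 / 4.0)
    p_value = chi2.sf(JB, df=2)            # right tail of the chi-square distribution
    return JB, p_value
```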
From the normality test, it can be concluded that the fit error in all datasets does not follow a normal distribution, which in this case implies that each landmark in each frame has the same probability of error and that there was no trend or bias in the experiments. This can be seen visually in the graphs in Figure 14, which show the mean fit error by frame for the Office, PETS2006, and Sofa scenes. Similarly, Figure 15 exhibits the behavior of the mean error by frame for the CASIA Gait collection for each pedestrian walking orientation. In all cases, the error varies from one frame to another within the intervals specified in Table 1. The variations in the error are explained by the different conditions encountered in each frame: lighting, occlusions, pose, and so on. However, the error is always relatively low (in the range of 2.8 to 11 pixels) compared to the size of the frame (hundreds of pixels).

5. Conclusions

The performance of a combination of MDBS with ASM for detecting moving pedestrians was studied in this work. In previous works, ASM has been reported as an appropriate technique for adjusting people's silhouettes in static poses [9]; for this reason, it is relevant to explore its possibilities in the detection of people in motion. Our approach uses MDBS to locate pedestrians in motion; then, the silhouette of the located pedestrian is fitted with ASM. Background subtraction allows the detection of a moving region in a video sequence; then, image transformations, namely translation, rotation, and scaling, and especially the translation coordinates, are necessary to adjust the silhouette of the pedestrian via the active shape model.
The primary contribution of this work was exploring the improvement potential of the combination of MDBS and ASM in detecting the contour of a moving pedestrian walking in a controlled environment. MDBS + ASM is a straightforward method based on classical algorithms which have shown effectiveness in pedestrian detection. We are looking for a practical process that could work in real applications, which is why we established our approach based on simple techniques.
Experimental results showed that the combination of MDBS and ASM achieves good performance in detecting moving pedestrians; however, it is slightly affected in situations where evident background problems exist (occlusion or lighting) or when pronounced movements of the arms or legs are present. Future work will search for new mechanisms to improve the capacity of the MDBS + ASM proposal for detecting moving pedestrians in challenging scenarios, such as those with background problems or movement factors, leading to its application in diverse environments.

Author Contributions

Conceptualization, J.A.A.V.; formal analysis, R.A.E.; investigation, J.A.A.V.; methodology, R.A.E. and E.R.L.; project administration, M.R.H.; software, F.D.R.L.; supervision, R.A.E. and E.R.L.; writing—review and editing, E.E.G.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jordao, A.; Schwartz, W.R. The Good, The Fast and The Better Pedestrian Detector. Master’s Thesis, Universidade Federal de Minas Gerais-Departamento de Ciência da Computação, Minas Gerais, Brazil, 2016. Volume 1. pp. 1–51. [Google Scholar]
  2. Angonese, A.T.; Rosa, P.F.F. Multiple people detection and identification system integrated with a dynamic simultaneous localization and mapping system for an autonomous mobile robotic platform. In Proceedings of the 2017 International Conference on Military Technologies (ICMT), Brno, Czech Republic, 31 May–2 June 2017; pp. 779–786. [Google Scholar]
  3. Boillot, F.; Vinant, P.; Pierrelée, J.C. On-field experiment of the traffic-responsive co-ordinated control strategy CRONOS-2 for under-and over-saturated traffic. Transp. Res. Part A Policy Pract. 2019, 124, 189–202. [Google Scholar] [CrossRef]
  4. Mesejo, P.; Ibáñez, O.; Cordón, O.; Cagnoni, S. A survey on image segmentation using metaheuristic-based deformable models: State of the art and critical analysis. Appl. Soft Comput. 2016, 44, 1–29. [Google Scholar] [CrossRef] [Green Version]
  5. Li, Y.J.; Luo, Z.; Weng, X.; Kitani, K.M. Learning shape representations for clothing variations in person re-identification. arXiv 2020, arXiv:2003.07340. [Google Scholar]
  6. Nine, J.; Anapunje, A.K. Dataset Evaluation for Multi Vehicle Detection using Vision Based Techniques. Embed. Selforganising Syst. 2021, 8, 8–14. [Google Scholar] [CrossRef]
  7. Antonio, J.A.; Romero, M. Detección de peatones con variaciones de forma al caminar con Modelos de Forma Activa. CIENCIA Ergo-Sum 2020, 27, 426–440. [Google Scholar] [CrossRef]
  8. Rouai-Abidi, B.; Kang, S.; Abidi, M. A Fully Automated Active Shape Model for Segmentation and Tracking of Unknown Objects in a Cluttered Environment. In Advances in Image and Video Segmentation; IGI Global: Hershey, PA, USA, 2006; pp. 161–187. [Google Scholar]
  9. Vasconcelos, M.J.M.; Tavares, J.M.R. Human motion segmentation using active shape models. In Computational and Experimental Biomedical Sciences: Methods and Applications; Springer: Berlin/Heidelberg, Germany, 2015; pp. 237–246. [Google Scholar]
  10. Babu, P.; Parthasarathy, E. FPGA implementation of multi-dimensional Kalman filter for object tracking and motion detection. Eng. Sci. Technol. Int. J. 2022, 33, 101084. [Google Scholar] [CrossRef]
  11. Xu, Z.; Min, B.; Cheung, R.C. A robust background initialization algorithm with superpixel motion detection. Signal Process. Image Commun. 2019, 71, 1–12. [Google Scholar] [CrossRef] [Green Version]
  12. Lee, S.H.; Lee, G.C.; Yoo, J.; Kwon, S. Wisenetmd: Motion detection using dynamic background region analysis. Symmetry 2019, 11, 621. [Google Scholar] [CrossRef] [Green Version]
  13. Camplani, M.; Salgado, L. Background foreground segmentation with RGB-D Kinect data: An efficient combination of classifiers. J. Vis. Commun. Image Represent. 2014, 25, 122–136. [Google Scholar] [CrossRef] [Green Version]
  14. Ramya, P.; Rajeswari, R. A modified frame difference method using correlation coefficient for background subtraction. Procedia Comput. Sci. 2016, 93, 478–485. [Google Scholar] [CrossRef] [Green Version]
  15. Sehairi, K.; Fatima, C.; Meunier, J. A Benchmark of Motion Detection Algorithms for Static Camera: Application on CDnet 2012 Dataset. In Proceedings of the International Conference on Computer Science and its Applications, Melbourne, Australia, 2–5 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 235–245. [Google Scholar]
  16. Nguyen, D.H.; Nguyen, D.M.; Truong, M.T.; Nguyen, T.; Tran, K.T.; Triet, N.A.; Bao, P.T.; Nguyen, B.T. ASMCNN: An Efficient Brain Extraction Using Active Shape Model and Convolutional Neural Networks. Inf. Sci. 2022, 591, 25–48. [Google Scholar] [CrossRef]
  17. Bi, H.; Jiang, Y.; Tang, H.; Yang, G.; Shu, H.; Dillenseger, J.L. Fast and accurate segmentation method of active shape model with Rayleigh mixture model clustering for prostate ultrasound images. Comput. Methods Programs Biomed. 2020, 184, 105097. [Google Scholar] [CrossRef] [Green Version]
  18. Montúfar, J.; Romero, M.; Scougall-Vilchis, R.J. Automatic 3-dimensional cephalometric landmarking based on active shape models in related projections. Am. J. Orthod. Dentofac. Orthop. 2018, 153, 449–458. [Google Scholar] [CrossRef] [Green Version]
  19. Esfandiarkhani, M.; Foruzan, A.H. A generalized active shape model for segmentation of liver in low-contrast CT volumes. Comput. Biol. Med. 2017, 82, 59–70. [Google Scholar] [CrossRef] [PubMed]
  20. El-Rewaidy, H.; Fahmy, A.S.; Khalifa, A.M.; Ibrahim, E.S.H. Multiple two-dimensional active shape model framework for right ventricular segmentation. Magn. Reson. Imaging 2022, 85, 177–185. [Google Scholar] [CrossRef]
  21. Choudhury, S.D.; Tjahjadi, T. Robust view-invariant multiscale gait recognition. Pattern Recognit. 2015, 48, 798–811. [Google Scholar] [CrossRef] [Green Version]
  22. Baumberg, A.; Hogg, D. An efficient method for contour tracking using active shape models. In Proceedings of the 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects, Austin, TX, USA, 11–12 November 1994; pp. 194–199. [Google Scholar]
  23. Koschan, A.; Kang, S.; Paik, J.; Abidi, B.; Abidi, M. Color active shape models for tracking non-rigid objects. Pattern Recognit. Lett. 2003, 24, 1751–1765. [Google Scholar] [CrossRef]
  24. Jang, C.; Jung, K. Human pose estimation using Active Shape Models. Proc. World Acad. Sci. Eng. Technol. 2008, 46, 7. [Google Scholar]
  25. Kim, D.; Lee, S.; Paik, J. Active shape model-based gait recognition using infrared images. In Proceedings of the International Conference on Signal Processing, Image Processing, and Pattern Recognition, Macau, China, 8–10 December 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 275–281. [Google Scholar]
  26. Sadoghi Yazdi, H.; Fariman, H.J.; Roohi, J. Gait recognition based on invariant leg classification using a neuro-fuzzy algorithm as the fusion method. Int. Sch. Res. Not. 2012, 2012, 289721. [Google Scholar] [CrossRef] [Green Version]
  27. Ma, J.; Ren, F. Detect and track the dynamic deformation human body with the active shape model modified by motion vectors. In Proceedings of the 2011 IEEE International Conference on Cloud Computing and Intelligence Systems, Beijing, China, 15–17 September 2011; pp. 587–591. [Google Scholar]
  28. Pourjam, E.; Deguchi, D.; Ide, I.; Murase, H. Statistical shape feedback for human subject segmentation. IEEJ Trans. Electron. Inf. Syst. 2015, 135, 1000–1008. [Google Scholar] [CrossRef] [Green Version]
  29. Vijayan, M.; Raguraman, P.; Mohan, R. A Fully Residual Convolutional Neural Network for Background Subtraction. Pattern Recognit. Lett. 2021, 146, 63–69. [Google Scholar] [CrossRef]
  30. Han, F.; Li, X.; Zhao, J.; Shen, F. A Unified Perspective of Classification-Based Loss and Distance-Based Loss for Cross-View Gait Recognition. Pattern Recognit. 2022, 125, 108519. [Google Scholar] [CrossRef]
  31. Gul, S.; Malik, M.I.; Khan, G.M.; Shafait, F. Multi-view gait recognition system using spatio-temporal features and deep learning. Expert Syst. Appl. 2021, 179, 115057. [Google Scholar] [CrossRef]
  32. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef] [Green Version]
  33. Lin, Y.; Diao, Y.; Du, Y.; Zhang, J.; Li, L.; Liu, P. Automatic cell counting for phase-contrast microscopic images based on a combination of Otsu and watershed segmentation method. Microsc. Res. Tech. 2022, 85, 169–180. [Google Scholar] [CrossRef]
  34. Gonzalez, R.C.; Woods, R.E. Digital Image Processing; Prentice Hall: Upper Saddle River, NJ, USA, 2008. [Google Scholar]
  35. Inthiyaz, S.; Madhav, B.; Kishore, P. Flower image segmentation with PCA fused colored covariance and gabor texture features based level sets. Ain Shams Eng. J. 2018, 9, 3277–3291. [Google Scholar] [CrossRef]
  36. Van Ginneken, B.; Frangi, A.; Staal, J.; ter Haar Romeny, B.; Viergever, M. Active shape model segmentation with optimal features. IEEE Trans. Med. Imaging 2002, 21, 924–933. [Google Scholar] [CrossRef]
  37. Zhou, X.; Leonardos, S.; Hu, X.; Daniilidis, K. 3D Shape Estimation from 2D Landmarks: A Convex Relaxation Approach. arXiv 2014, arXiv:1411.2942. [Google Scholar]
  38. Li, X.; Tripe, D.; Malone, C.; Smith, D. Measuring systemic risk contribution: The leave-one-out z-score method. Financ. Res. Lett. 2020, 36, 101316. [Google Scholar] [CrossRef]
  39. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  40. Naveen, H.M.; Naveena, C.; Aradhya, V.N.M. Segmentation of Lung Region: Hybrid Approach. In Book of the ICT with Intelligent Applications; Senjyu, T., Mahalle, P.N., Perumal, T., Joshi, A., Eds.; Springer: Singapore, 2022; pp. 383–391. [Google Scholar]
  41. Williamson, D.F.; Parker, R.A.; Kendrick, J.S. The box plot: A simple visual method to interpret data. Ann. Intern. Med. 1989, 110, 916–921. [Google Scholar] [CrossRef]
  42. Jarque, C.M. Jarque-Bera Test. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 701–702. [Google Scholar] [CrossRef]
  43. Huzak, M. Chi-Square Distribution. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 245–246. [Google Scholar] [CrossRef]
  44. Beyer, W. Handbook of Tables for Probability and Statistics; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar] [CrossRef]
Figure 1. Scenes selected from the CDnet2014 database: (a) office, (b) PETS2006, and (c) sofa.
Figure 2. Scenes selected from the CASIA Gait dataset: (a) pedestrian walking at 0 degrees, (b) at 36 degrees, (c) at 54 degrees, and (d) at 90 degrees.
Figure 3. Movement region (green rectangle) bounded by the extreme points E_1(x_4, y_1), E_2(x_2, y_1), E_3(x_2, y_3), and E_4(x_4, y_3) of the lines L_1 and L_2, whose direction vectors are the eigenvectors [v_1, v_2].
Figure 4. The tweaking of the model was carried out in two phases: training and adjustment.
Figure 5. Example of marking with 50 characteristic points (landmarks) in the PDM stage of the office scene (frame 23) of CDnet 2014.
Figure 6. Representation of the misaligned shapes of the training set with 50 points marked.
Figure 7. Representation of the misaligned shapes of the training set.
Figure 8. Representation of the aligned shapes of the training set.
Figure 9. Detection of a pedestrian in selected frames of the CDnet2014 office scene. “GT” represents the manually marked landmarks, and “Approx” shows the estimated landmarks. (a) Frame 630; (b) Frame 634.
Figure 10. Detection of a pedestrian in selected scenes of the CDnet2014 collection. (a) PETS2006 scene, frame 1128. (b) Sofa scene, frame 108.
Figure 11. Detection of a pedestrian walking at different orientations with respect to the camera in the CASIA Gait dataset. (a) Pedestrian walking at 0 degrees. (b) Pedestrian walking at 36 degrees. (c) Pedestrian walking at 54 degrees. (d) Pedestrian walking at 90 degrees.
Figure 12. Mean fit error between the landmarks and the estimated contour in the office, PETS2006, and sofa scenes of CDnet2014 collection.
Figure 13. Mean fit error between landmarks and the estimated contour for the CASIA Gait collection when the pedestrian walks at 0, 36, 54, or 90 degrees, relative to the camera.
Figure 14. Mean fit error (in pixels) for each frame in the (a) Office, (b) PETS2006, and (c) Sofa scenes.
Figure 15. Mean fit error for each frame in the CASIA Gait collection for pedestrians walking at (a) 0 degrees, (b) 36 degrees, (c) 54 degrees, and (d) 90 degrees.
Table 1. Summary of the main characteristics of data shown in the box-plot diagrams of Figure 12 and Figure 13.
Scene | Min | Max | Median | Range | IQR | Q1 | Q2 | Q3
Office | 3.0 | 6.5 | 4.5 | 3.5 | 1.5 | 3.8 | 4.5 | 5.3
PETS 2006 | 3.7 | 10 | 5.9 | 6.2 | 3.2 | 4.8 | 5.9 | 8.0
Sofa | 2.8 | 8.9 | 6.1 | 6.1 | 3.4 | 3.9 | 6.1 | 7.3
Pedestrian 0 degrees | 3.0 | 6.5 | 4.5 | 3.5 | 1.4 | 3.8 | 4.5 | 5.2
Pedestrian 36 degrees | 3.7 | 10.1 | 5.9 | 6.4 | 3.4 | 4.6 | 5.9 | 8.0
Pedestrian 54 degrees | 3.0 | 9.9 | 6.2 | 6.9 | 3.3 | 3.9 | 6.2 | 7.2
Pedestrian 90 degrees | 2.9 | 11 | 5.1 | 8.0 | 3.9 | 3.3 | 5.1 | 7.2
Table 2. Results of the Jarque–Bera test for the selected datasets.
Scene | Mean Error | Skewness | Kurtosis | JB | p-Value
Office | 4.5 | 1.220 | 2.375 | 13.212 | 0.00135
PETS 2006 | 8.3 | 0.769 | 0.928 | 13.868 | 0.00097
Sofa | 7.2 | 0.916 | 0.498 | 20.031 | 0.00004
Pedestrian 0 degrees | 4.5 | 1.268 | 2.492 | 13.940 | 0.00094
Pedestrian 36 degrees | 6.4 | 1.260 | 3.200 | 13.486 | 0.00118
Pedestrian 54 degrees | 6.2 | 1.390 | 2.680 | 16.39 | 0.00028
Pedestrian 90 degrees | 5.6 | 1.090 | 1.170 | 17.03 | 0.00020
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
