#### *3.1. Pre-Processing*

During image pre-processing, color frames were extracted from a static video camera as a sequence *E* = [*f*1, *f*2, *f*3, ... , *fZ*], where *Z* is the total number of frames. These color frames were then passed through a Laplacian filter, given in Equation (1), to reduce noise and sharpen edges:

$$
\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \tag{1}
$$

where ∇<sup>2</sup> *f* is the second-order derivative used to obtain the filter mask. However, a pure Laplacian filter alone did not produce an enhanced image; thus, to obtain a sharpened, enhanced image, we subtracted the Laplacian outcome from the original image using Equation (2):

$$g(x, y) = f(x, y) - \left[\nabla^2 f\right] \tag{2}$$

where *g*(*x*, *y*) is the sharpened image and *f*(*x*, *y*) is the input image. After obtaining the sharpened image *g*(*x*, *y*), histogram equalization was performed on it to adjust the contrast of the image using Equation (3):

$$s_k = T(r_k) = (L-1) \sum_{j=0}^{k} p_r(r_j), \quad k = 0, 1, 2, \dots, L-1 \tag{3}$$

where the variable *r* denotes the intensities of the input image to be processed. As usual, we assumed that *r* is in the range [0, *L* − 1], with *r* = 0 representing black and *r* = *L* − 1 representing white, while *s* represents the output intensity level after intensity mapping for every pixel of the input image with intensity *r*. Here, *pr*(*r*) is the probability density function (PDF) of *r*, where the subscript on *p* indicates that it is the PDF of *r*. Thus, the processed (output) image was obtained using Equation (3) by mapping each pixel of the input image with intensity *rk* into a corresponding pixel with level *sk* in the output image, as shown in Figure 2.

**Figure 2.** Preprocessing steps. (**a**) Original color frame of a video, (**b**) histogram of original image, (**c**) histogram of enhanced image, and (**d**) enhanced image.
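The pre-processing chain of Equations (1)–(3) can be sketched in MATLAB as follows; this is a minimal illustration assuming the Image Processing Toolbox, and the frame file name and Laplacian parameter are placeholders rather than the authors' settings.

```matlab
% Minimal pre-processing sketch (Equations (1)-(3)); assumes the Image
% Processing Toolbox. File name and filter parameter are illustrative.
frame = imread('frame_001.png');            % one color frame f_z of the sequence E
gray  = rgb2gray(frame);

h         = fspecial('laplacian', 0.2);     % 3x3 Laplacian kernel (Equation (1))
lap       = imfilter(double(gray), h, 'replicate');
sharpened = uint8(double(gray) - lap);      % g(x,y) = f(x,y) - Laplacian (Equation (2))

enhanced = histeq(sharpened);               % histogram equalization (Equation (3))

figure;
subplot(1,2,1); imhist(sharpened); title('Histogram of sharpened frame');
subplot(1,2,2); imshow(enhanced);  title('Enhanced frame');
```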

#### *3.2. Human Silhouettes Extraction*

After obtaining the preprocessed frames, we performed human/non-human detection by applying multi-level thresholding using Equation (4), as depicted in Figure 3c.

$$th(x,y) = \begin{cases} 1 & \text{if } I(x,y) > t_1, t_2, t_3 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where *th*(*x*, *y*) is the thresholded image and *t*1, *t*2, *t*3 are the applied thresholds defined by Otsu's method. In order to extract more useful information, the resultant binary image was inverted using a point-processing operation that subtracts every pixel of the image from the maximum level of the image, as shown in Equation (5).

$$C(x, y) = 1 - th(x, y) \tag{5}$$

where *C*(*x*, *y*) is the inverted image, as shown in Figure 3d, and *th*(*x*, *y*) is the binary image with a maximum level of 1. After obtaining the human/non-human binary foreground frames, we performed morphological operations to remove imperfections in the inverted image *C*. Erosion was performed to remove small unwanted objects, and morphological closing was then performed to fill small holes while preserving the size and shape of objects. Every object in image *C* was first eroded, as represented in Equation (6), and then dilated using Equation (7), after which the dilated image was eroded again using the disk-shaped structuring element, as shown in Equation (8).

$$m(x, y) = \begin{cases} 1 & \text{if } S \text{ fits } C \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

$$m(x, y) = \begin{cases} 1 & \text{if } S \text{ hits } C \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

$$M_o = ((C \ominus S) \oplus S) \ominus S \tag{8}$$

where *C* represents the input inverted image and *S* is the disk-shaped structuring element used for erosion and dilation, while *Mo* is the resultant image. The erosion of *C* by *S* is denoted as (*C* ⊖ *S*), while the dilation of *C* by *S* is denoted as (*C* ⊕ *S*). After the morphological operations, all the objects in the image were grouped and labeled, which helped in extracting and uniquely analyzing every object required for human silhouette extraction.

**Figure 3.** Object detection steps. (**a**) Original color frame of a video, (**b**) enhanced image, (**c**) binary image after multi-level thresholding, and (**d**) inverse of a threshold image.
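The silhouette extraction steps of Equations (4)–(8) can be sketched as follows; the number of Otsu thresholds follows the text, while the structuring-element radius is an assumption.

```matlab
% Foreground extraction sketch (Equations (4)-(8)); assumes the Image
% Processing Toolbox and the enhanced frame from the pre-processing step.
t  = multithresh(enhanced, 3);              % Otsu thresholds t1, t2, t3
q  = imquantize(enhanced, t);               % quantize into four levels
th = q == max(q(:));                        % keep pixels above all thresholds (Equation (4))
C  = imcomplement(th);                      % point-wise inversion (Equation (5))

S  = strel('disk', 3);                      % disk structuring element S (radius is illustrative)
Ce = imerode(C, S);                         % erosion removes small unwanted objects (Equation (6))
Mo = imclose(Ce, S);                        % dilation followed by erosion (Equations (7)-(8))

[L, num] = bwlabel(Mo);                     % group and label the remaining objects
```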

After human/non-human detection, for human silhouette extraction, we calculated the center and the extreme points of each labeled object of *Mo*; we then extracted each object one by one and calculated the distance from the center to two extreme points of every object for non-human filtering, as shown in Figure 4. The same procedure was applied to all frames, from frame 1 to frame *Z*.

**Figure 4.** Human silhouette extraction. (**a**) Distance algorithm from the center to two extreme points for every object, (**b**) single silhouette extracted uniquely through labeling, along with its distance graph, and (**c**) a single non-silhouette, along with its distance graph.

After calculating the distances, those objects whose distances were greater than the set threshold were discarded using Equation (9), and only silhouettes resembling humans were retained.

$$E_h = \begin{cases} 0 & \text{if } d_1 > T \wedge d_2 > T \\ 1 & \text{otherwise} \end{cases} \tag{9}$$

where the distance from the center to one extreme point is denoted by *d*1, the distance from the center to the other extreme point is denoted by *d*2, *T* is the set threshold, and *Eh* is the resultant image. After human silhouette extraction, most non-human objects were discarded by the distance algorithm; however, some non-human objects that resembled humans remained.
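A minimal sketch of the non-human filtering of Equation (9) is given below; the pixel threshold *T* and the choice of the two opposite extrema are illustrative assumptions.

```matlab
% Non-human filtering sketch (Equation (9)); Mo is the labeled binary image
% from the morphological step. The threshold T is an illustrative value.
T     = 120;                                 % distance threshold in pixels (assumption)
stats = regionprops(Mo, 'Centroid', 'Extrema', 'PixelIdxList');
Eh    = false(size(Mo));                     % resultant image of retained silhouettes

for k = 1:numel(stats)
    c  = stats(k).Centroid;                  % object center
    ex = stats(k).Extrema;                   % eight extreme points of the object
    d1 = norm(ex(1,:) - c);                  % center to one extreme point
    d2 = norm(ex(5,:) - c);                  % center to the opposite extreme point
    if ~(d1 > T && d2 > T)                   % Equation (9): keep human-like objects only
        Eh(stats(k).PixelIdxList) = true;
    end
end
```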

#### *3.3. Multi-Person Tracking*

For accurate human tracking, the extraction of the true foreground, i.e., human pixels only, is a primary step. Thus, after applying the distance algorithm (described in Section 3.2), we performed a human silhouette verification step using the particles force model, and then the multi-person counting and tracking steps were executed.

#### 3.3.1. Human Silhouettes Verification: Particles Force Model

We present a robust particles force model for human silhouette verification. First of all, every extracted labeled silhouette was converted into particles, as shown in Figure 5a. We treated all pixels as fluid particles, thus, every extracted silhouette was a collection of many particles, as depicted in the magnified view in Figure 5b. Therefore, in our designed method, each silhouette was represented by a set of particles *Q* = [*p*1, *p*2, *p*3, ... , *pN*], where *N* is the total number of particles in one silhouette.

**Figure 5.** The particles force model. (**a**) Particle conversion of every extracted silhouette and (**b**) magnified view of particle conversion.

We know from physics that, in solids, particles do not have enough kinetic energy to overcome the strong forces of attraction, called bonds, which attract the particles toward each other. Using this physical phenomenon, we computed the force of attraction between the particles of every extracted silhouette, as shown in Figure 6:

**Figure 6.** Particles force model. (**a**) Interacting force between two particles (**b**) for non-human silhouettes and (**c**) human silhouettes.

For simplicity, we found the force of attraction between only two mutually interacting particles using Equation (10) in all frames from 1 to *Z*.

$$F_i = \frac{p_1 p_2}{r^2} \tag{10}$$

where *i* is in the range [1, *E*], with *E* representing the maximum number of silhouettes per frame, *Fi* is the force of attraction between particles *p*1 and *p*2 of the *i*th silhouette, and *r*<sup>2</sup> is the square of the Euclidean distance between particles *p*1 and *p*2. After calculating the force between the particles of every silhouette in all video frames, we discarded those silhouettes whose force of attraction was static between frame *t* and frame *t* + 1 using Equation (11):

$$H_s = \begin{cases} 1 & \text{if } \frac{dF_i}{dt} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where *dFi*/*dt* represents the change in the attraction force between the particles of every *i*th silhouette with respect to time, between frames *t* and *t* + 1. After applying the particles force model, we retained only human silhouettes in each frame, as depicted in Figure 7:

**Figure 7.** A few examples of verified multi-human silhouettes.
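The verification step of Equations (10) and (11) can be sketched as below; sampling the two interacting particles from the silhouette contour and using their gray values as *p*1 and *p*2 are our assumptions, not a prescription from the text.

```matlab
% Particles force sketch (Equations (10)-(11)); save as silhouetteForce.m.
% The two particles are taken from the silhouette contour (assumption).
function F = silhouetteForce(silhMask, grayFrame)
    [r, c] = find(bwperim(silhMask));        % contour particles of the silhouette
    p1 = double(grayFrame(r(1),   c(1)));    % first particle p1
    p2 = double(grayFrame(r(end), c(end)));  % second particle p2
    d2 = (r(1) - r(end))^2 + (c(1) - c(end))^2;   % squared Euclidean distance r^2
    F  = p1 * p2 / max(d2, 1);               % Equation (10): F = p1*p2 / r^2
end
```

A silhouette is then kept only if its force changes between frame *t* and frame *t* + 1 (Equation (11)), e.g., `isHuman = abs(silhouetteForce(mask_t1, gray_t1) - silhouetteForce(mask_t, gray_t)) > 0;`.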

#### 3.3.2. Multi-Person Counting

After extraction of the verified human silhouettes, to count these detected human silhouettes, each of which consists of a set of particles, we performed cluster estimation. Since every silhouette is a collection of particles, the group of particles that makes up one silhouette was treated as one cluster, and cluster estimation was performed on every frame using the K-nearest neighbor search algorithm, as depicted in Figure 8:

**Figure 8.** Human contours for cluster estimations.

After that, we labeled the clusters in all frames, as shown in Equation (12), and, to make them visually apparent, we drew green bounding boxes around each cluster. Thus, by performing cluster estimation and labeling, we counted all the extracted human silhouettes, as shown in Figure 9:

$$I_c = L_m\, p_N \tag{12}$$

where *pN* is the total number of particles in one cluster (the total number of particles in each cluster varies from cluster to cluster, and the number of clusters in each frame varies from frame to frame), *Lm* represents the label of cluster *m*, and *Ic* is the resultant extracted labeled cluster, which was treated as one silhouette and included in the count.

**Figure 9.** Sample frames of multi-person counts at different time intervals.
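For counting, a simplified sketch is shown below; connected components stand in for the K-nearest-neighbor cluster estimation, since the particle group of each silhouette forms one spatially connected cluster, and the drawing commands are illustrative.

```matlab
% Counting sketch; Eh is the binary mask of verified human silhouettes and
% frame is the current color frame. Connected components stand in for the
% K-NN cluster estimation described above.
CC    = bwconncomp(Eh);                      % one component = one cluster/silhouette
boxes = regionprops(CC, 'BoundingBox');
count = CC.NumObjects;                       % multi-person count for this frame

imshow(frame); hold on;
for m = 1:count                              % label each cluster and draw a green box
    bb = boxes(m).BoundingBox;
    rectangle('Position', bb, 'EdgeColor', 'g', 'LineWidth', 2);
    text(bb(1), bb(2) - 5, num2str(m), 'Color', 'g');
end
hold off;
```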

#### 3.3.3. Multi-Person Tracking

The goal of person tracking is to establish correspondence between individuals across frames. Thus, to establish correspondence between persons in frame *t* and frame *t* + 1, we calculated the position and velocity of every detected human silhouette in all frames. In our model, we assumed that people can enter or leave the scene; thus, to temporally fix all humans across frames, the position of each human silhouette was located and locked by assigning a unique integer ID, which remained fixed to that particular silhouette in all frames. The states of all the predicted persons in frame *Ft* were stored in a structure and matched with the states of frame *Ft*+1, while the detected fixed human silhouettes were tracked using the Jaccard similarity index.

$$S_t = \sum_{i=1}^{n} I_{Li} \tag{13}$$

Using data association and cross-correlation as a cost function, detected and predicted persons were associated in consecutive frames, as represented in Figure 10. The key steps involved in multi-person tracking are illustrated in Figure 11.

**Figure 11.** Key steps involved in multi-person tracking.
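The association score mentioned above can be illustrated with the Jaccard similarity (intersection over union) between a track's predicted silhouette mask and a detected mask; the matching threshold in the usage note is an assumption.

```matlab
% Jaccard similarity between two binary silhouette masks of the same size;
% used as the overlap score when associating detections with tracks.
function J = jaccardIndex(maskA, maskB)
    inter = nnz(maskA & maskB);              % pixels common to both regions
    uni   = nnz(maskA | maskB);              % pixels covered by either region
    J     = inter / max(uni, 1);             % 1 = identical, 0 = disjoint
end
```

In use, a detection in frame *t* + 1 inherits the ID of the track whose predicted mask gives the highest Jaccard score, provided the score exceeds a minimum overlap (e.g., 0.3; the exact value used by the authors is not stated).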

#### *3.4. Crowd Behavior Detection*

Accurate crowd behavior detection requires robust global and local feature extraction [101–103], along with a potent decision-making classifier. Thus, for crowd behavior detection, after applying the distance algorithm (mentioned in Section 3.3), the extracted silhouettes were passed through the feature extraction step, and multiple distinguishable global and local features were extracted for every frame. Next, bat optimization was applied to obtain the optimal features, and decisions were made by the improved entropy classifier.

#### 3.4.1. Global-Local Descriptors

For the global-local descriptor, we used a fusion of global and local image properties. Global features describe the visual content of the whole image and can represent an image with a single vector; here, we extracted the crowd contour as a global feature. For local features, we used our newly proposed particles gradient motion features, geometric features, and speeded-up robust features (SURF) [104]. For the local features, we extracted interest points and represented them as a set of vectors that are more robust to clutter and occlusion.

Initially, for the global features, we found the center of each human and considered every human in the scene as a vertex; this can be denoted as *P* = { *P*1, *P*2, ... , *Pn* | *Pi* = (*Xi*, *Yi*)}, where *P* represents the whole human crowd in the scene, considered as a set of vertices, and (*Xi*, *Yi*) are the coordinates of the *i*th human. We considered only those humans that were at the extreme points and joined them with lines, forming the largest graph covering all extreme vertices, as shown in Figure 12. The graph represents the human crowd contour, and thus, variations in the shape of the graph shed light on variations in the outer area of the human crowd, i.e., on global changes. To measure the variations in the crowd contour, we compared the contour temporally by integrating over all of the pixels of the contour. In general, we defined the (*p*, *q*) moment of a contour as in Equation (14):

$$m_{p,q} = \sum_{x,y}^{n} I(x, y)\, x^{p} y^{q} \tag{14}$$

where *I*(*x*, *y*) is the intensity of the pixel at coordinate (*x*, *y*). Here, *p* is the *x*-order and *q* is the *y*-order, whereby order means the power to which the corresponding component is raised in the sum. The summation is over all of the pixels of the contour boundary (denoted by *n* in the equation). It then follows immediately that, if *p* and *q* are both equal to 0, the *m*0,0 moment is simply the length of the contour in pixels. The moment computation just described gives some rudimentary characteristics of a contour that can be used to compare two contours.

**Figure 12.** Extraction of human crowd contour as a global feature.
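The (*p*, *q*) contour moment of Equation (14) can be computed directly from the boundary pixels of the crowd contour; the sketch below assumes a binary contour mask obtained from the graph in Figure 12.

```matlab
% Contour moment sketch (Equation (14)); save as contourMoment.m.
% I is the grayscale frame and contourMask a logical mask of the crowd
% contour boundary (assumed to come from the graph in Figure 12).
function m = contourMoment(I, contourMask, p, q)
    [y, x] = find(contourMask);              % boundary pixel coordinates
    vals   = double(I(contourMask));         % intensities I(x, y) in matching order
    m      = sum(vals .* (double(x).^p) .* (double(y).^q));
end
```

For a binary contour (intensity 1 on the boundary), `contourMoment(double(contourMask), contourMask, 0, 0)` returns the contour length in pixels, as noted above.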

With the SURF descriptor [105], we computed distinctive, invariant local features that detect interest points and describe them in a way that is largely invariant to image noise, rotation, direction, scaling, and changes in illumination. Using SURF, we computed 75 local points for every human silhouette in an image; thus, for every frame, we had 1050 SURF descriptors in a set of vectors, as shown in Figure 13:

**Figure 13.** (**a**) SURF features for all human silhouettes and (**b**) magnified view of SURF features for two human silhouettes.
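A minimal SURF extraction sketch is given below, assuming the Computer Vision Toolbox; it operates on the whole frame for brevity, whereas the text extracts 75 points per silhouette region.

```matlab
% SURF feature sketch; assumes the Computer Vision Toolbox. Selecting the 75
% strongest points follows the per-silhouette count stated above.
grayFrame = rgb2gray(frame);
points    = detectSURFFeatures(grayFrame);                  % interest-point detection
strongest = selectStrongest(points, 75);                    % 75 strongest local points
[descr, validPts] = extractFeatures(grayFrame, strongest);  % 64-D SURF descriptors

imshow(grayFrame); hold on;
plot(validPts, 'ShowOrientation', true);                    % visualize scale and orientation
hold off;
```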

For the geometric local features, we first identified the skeleton joints of every human silhouette in each frame using a skeleton model, and then, by considering the skeleton joints as vertices, we drew polygonal shapes and triangles with three or four vertices. Using the left arm, neck, left shoulder, and torso, a left polygon wing was drawn and filled with a color. Similarly, a right polygon wing was drawn and filled with a different color using the right arm, neck, torso, and right shoulder. Additionally, the torso area, lower area, left shoulder triangle, and right shoulder triangle were drawn, as depicted in Figure 14. The areas enclosed by these polygons were analyzed frame by frame, and on the basis of angle differences and area size, normal and abnormal behaviors of human crowds were detected. Algorithm 1 depicts the overall procedure used for extracting the strongest body points of the human silhouettes.

**Figure 14.** (**a**) Geometric features for all human silhouettes. (**b**) Magnified view of geometric features for two human silhouettes.
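The area of one polygon wing follows directly from the joint coordinates; the coordinates below are hypothetical placeholders for the skeleton points returned by Algorithm 1.

```matlab
% Illustrative area of the left polygon wing; joint coordinates are
% hypothetical placeholders for the points returned by Algorithm 1.
neck         = [150, 160];   leftShoulder = [135, 170];
leftArm      = [120, 210];   torso        = [150, 230];
wing = [neck; leftShoulder; leftArm; torso];          % vertices of the left polygon wing
leftWingArea = polyarea(wing(:,1), wing(:,2));        % enclosed area, analyzed frame by frame
```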

In particles gradient motion (PGM), we first converted every human silhouette into particles; then, only those particles lying on the human contour were considered, and their interaction force was calculated. Generally, every pedestrian in a crowd has a desired direction and velocity *vid*, calculated using Equation (16). However, in crowded scenes, because of the presence of multiple persons, individual movements are limited, and the actual velocity *vi* of each pedestrian differs from the expected motion. The actual velocity of the particles is calculated using Equation (15):

$$v_i = F_{avg}(x_i, y_i) \tag{15}$$
where *Favg*(*xi*, *yi*) is the average optical flow of the *i*th particle at coordinate (*xi*, *yi*). We calculated the desired velocity *vid* of the particles as:

$$v_i^d = (1 - w_i)\, F(x_i, y_i) + w_i\, F_{avg}(x_i, y_i) \tag{16}$$

where *F*(*xi*, *yi*) represents the optical flow of the *i*th particle at coordinates (*xi*, *yi*) and *wi* is the panic weight parameter. Pedestrian *i* displays individualistic behavior as *wi* → 0 and collective behavior as *wi* → 1. Linear interpolation was used to compute an efficient optical flow and an adequate average flow field for the particles. Thus, on the basis of the actual velocity and the desired velocity, we can calculate the interaction force using Equation (17):

$$F_{int} = \frac{1}{T} \left( v_i^{d} - v_i \right) - \frac{dv_i}{dt} \tag{17}$$

where *F*int is the resultant interaction force, as represented in Figure 15 and *T* is the relaxation parameter. When the interaction force of particles was greater than the set threshold, it was detected as an abnormal event; otherwise, it was considered to be normal.
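A sketch of the interaction-force computation (Equations (15)–(17)) is given below, assuming the Computer Vision Toolbox for the optical flow; the panic weight, relaxation parameter, averaging window, and decision threshold are illustrative values.

```matlab
% Interaction force sketch (Equations (15)-(17)) for the contour particles of
% one silhouette. grayPrev/grayCurr are consecutive grayscale frames and
% silhMask is that silhouette's binary mask; parameter values are illustrative.
opticFlow = opticalFlowFarneback;
flowPrev  = estimateFlow(opticFlow, grayPrev);         % flow up to frame t-1
flowCurr  = estimateFlow(opticFlow, grayCurr);         % flow at frame t

w  = 0.6;                                              % panic weight w_i (assumption)
Tr = 0.5;                                              % relaxation parameter T (assumption)
avgKernel = fspecial('average', 15);                   % local averaging for F_avg
Favg_x = imfilter(flowCurr.Vx, avgKernel, 'replicate');
Favg_y = imfilter(flowCurr.Vy, avgKernel, 'replicate');

contour = bwperim(silhMask);                           % particles on the human contour
vi   = hypot(Favg_x(contour), Favg_y(contour));        % actual velocity (Equation (15))
F    = hypot(flowCurr.Vx(contour), flowCurr.Vy(contour));       % particle optical flow F(x,y)
vid  = (1 - w) .* F + w .* vi;                         % desired velocity (Equation (16))
dvdt = vi - hypot(flowPrev.Vx(contour), flowPrev.Vy(contour));  % change in actual velocity
Fint = (1/Tr) .* (vid - vi) - dvdt;                    % interaction force (Equation (17))

forceThreshold = 2;                                    % decision threshold (assumption)
abnormal = mean(abs(Fint)) > forceThreshold;           % abnormal-event flag for this silhouette
```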

**Algorithm 1** Extract strongest body points for human silhouettes

**Input:** I: extracted human silhouettes
**Output:** Strongest body points, i.e., head, shoulders, legs, arms, hips
/\* for each connected component, extract body points \*/
B = bwboundaries(binary\_image);
lbl = bwlabel(binary\_image);
CC2 = bwconncomp(lbl);
L52 = labelmatrix(CC2);
**for** objectidx2 = 1:CC2.NumObjects **do**
 individualsilheouts2 = bsxfun(@times, closezn, L52 == objectidx2);
 [labeledImage2, numberofBlobs2] = bwlabel(individualsilheouts2, 4);
**end**
Aa = individualsilheouts2;
/\* define an upper, middle, and lower portion for each individual silhouette \*/
th = threshold;
rps = regionprops(Aa, 'BoundingBox', 'Area');
**for** k = 1 to length(rps) **do**
 w = rps(k).BoundingBox;
 **if** height > th **and** width > th **then**
  upper\_region = struct('x', w(1), 'y', w(2), 'width', w(3), 'height', w(4)/5); /\* head \*/
  middle\_region = struct('x', w(1), 'y', w(2) + w(4)/4, 'width', w(3), 'height', w(4)/4); /\* arms \*/
  lower\_region = struct('x', w(1), 'y', w(2) + w(4)/2, 'width', w(3), 'height', w(4)/2); /\* legs \*/
  j = j + 1;
  s(j) = w;
 **end**
**end**
top = [x, max\_y]; left = [min\_x, y]; bottom = [x, min\_y]; right = [max\_x, y];
/\* label the head region \*/
Head = top pixels of upper region
Right shoulder = bottom-right pixels of upper region
Left shoulder = bottom-left pixels of upper region
Right arm = right pixels of middle region
Left arm = left pixels of middle region
Right foot = bottom-right pixels of lower region
Left foot = bottom-left pixels of lower region
**return** Head, shoulders, arms, feet

**Figure 15.** (**a**) Particles gradient motion descriptors for all human silhouettes and (**b**) magnified view of PGM for two human silhouettes.

#### 3.4.2. Event Optimization: Bat Optimization

Optimization is a process by which the optimal solutions of a problem satisfying an objective function are obtained [106–109]. Yang [110] introduced an optimization algorithm inspired by a property of bats known as echolocation. Echolocation is a type of sonar that enables bats to fly and hunt in the dark. The bat optimization (BO) algorithm is composed of multiple variables of a given problem. Using their echolocation capability, bats can detect obstacles in their path and determine the distance, orientation, type, size, and even the speed of their prey.

Like any other metaheuristic mechanism, BO has multiple agents representing the parameters of the problem layout. The initial population of *N* real-valued vectors of dimension *d* is randomly generated within lower and upper boundaries using Equation (18):

$$X_{ij} = X_{min} + \varphi\, (X_{max} - X_{min}) \tag{18}$$

where *Xmax* and *Xmin* are the upper and lower boundaries for dimension *j*, respectively, *j* = 1, 2, ..., *d*, *i* = 1, 2, ..., *N*, and *ϕ* is a randomly generated value in the range [0, 1]. After population initialization, we calculated the fitness function and stored the local and global best solutions. We evaluated the fitness values of all humans, and, on the basis of their movements, new local and global best solutions were obtained; every human had a velocity *Vi<sup>t</sup>* affected by a predefined frequency *fi*, and finally, their new position *Xi<sup>t</sup>* was located temporally, as described in the following equations:

$$f_i = f_{min} + \beta\, (f_{max} - f_{min}) \tag{19}$$

$$V_i^t = V_i^{t-1} + (X_i^t - X^*)\, f_i \tag{20}$$

$$X_i^t = X_i^{t-1} + V_i^t \tag{21}$$

where *fi* is the frequency of the *i*th human, *fmin* and *fmax* are the lower and upper frequency values, respectively, *β* is a randomly generated value, and *X*∗ is the global best location (solution) obtained after comparing all solutions. Figure 16 depicts the flow chart of the algorithm and Figure 17 presents the optimization results.

**Figure 16.** Bat optimization flow chart.

**Figure 17.** Bat optimization results. (**a**) Normal optimal features; (**b**) abnormal optimal features.
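Equations (18)–(21) translate into the compact loop below; the population size, frequency bounds, boundaries, and the placeholder fitness function are illustrative and not the authors' settings.

```matlab
% Bat optimization sketch (Equations (18)-(21)); the objective function and
% all parameter values are placeholders for illustration.
N = 20; d = 2;                                % population size and dimension
Xmin = 0; Xmax = 1; fmin = 0; fmax = 2;
fitness = @(x) sum(x.^2, 2);                  % placeholder objective to minimize

X = Xmin + rand(N, d) .* (Xmax - Xmin);       % Equation (18): random initial population
V = zeros(N, d);
[bestFit, bestIdx] = min(fitness(X));
Xstar = X(bestIdx, :);                        % global best solution X*

for t = 1:100
    beta = rand(N, 1);
    f = fmin + beta .* (fmax - fmin);         % Equation (19): frequency of each bat
    V = V + (X - Xstar) .* f;                 % Equation (20): velocity update
    X = X + V;                                % Equation (21): position update
    X = min(max(X, Xmin), Xmax);              % keep solutions inside the boundaries

    fit = fitness(X);
    [curBest, idx] = min(fit);
    if curBest < bestFit
        bestFit = curBest;
        Xstar   = X(idx, :);                  % update the global best
    end
end
```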

#### 3.4.3. Improved Entropy Classifier

Using Shannon's information entropy theory [53] to describe the degree of uncertainty, we proposed an improved entropy classifier for the detection of human crowd behavior. First of all, we standardized all the features using Equation (22):

$$X\_{ij}" = \frac{X\_{ij} - \min\{X\_j\}}{\max\{X\_j\} - \min\{X\_j\}} \tag{22}$$

where *Xij*∗ is the standardized value of the *j*-th feature for the *i*-th human, *j* = 1, 2, ... , *m*, *i* = 1, 2, ... , *n*, where *n* is the number of humans and *m* is the number of features. After that, the weight of the *j*-th feature for the *i*-th human was calculated using Equation (23):

$$q_{ij} = \frac{X_{ij}^{*}}{\sum_{i=1}^{n} X_{ij}^{*}} \tag{23}$$

Thus, the information entropy of each feature was calculated using Equation (24):

$$e_j = -k \sum_{i=1}^{n} \left( q_{ij} \ln q_{ij} \right) \tag{24}$$

where *k* = 1/ln *m*. After calculating the information entropy, we calculated the difference coefficient and the maximum ratio of the difference coefficients using Equations (25) and (26):

$$d_j = 1 - e_j \tag{25}$$

$$D = \frac{\max\left(d_j\right)}{\min\left(d_j\right)}, \quad (j = 1, 2, \dots, m) \tag{26}$$

After calculating *D*, we built the 1–9 scale ratio chart using Equation (27):

$$R = \sqrt[a-1]{\frac{D}{a}}\tag{27}$$

where *a* denotes the highest scale value, which acts as an adjustment coefficient through the (*a* − 1)-th root. In the above equation, *D* is allocated to the mapped values from 1 to 9. After that, the mapped values were calculated from the 1–9 scale, and the judgment matrix *R* was established with elements *rij* using Equation (28):

$$r_{ij} = \frac{d_i}{d_j}, \quad (d_i > d_j) \tag{28}$$

The obtained judgment matrix satisfied the consistency test because the elements *rij* represent the ratio of the difference coefficients of two features.

Thus, the consistent weights *Wj* for each feature were calculated using an analytical hierarchy process. After that, the information entropy was calculated again for each feature using these weights, by utilizing Equation (24). The crowd behavior entropy of the whole system is the sum of all these entropies. In this way, the entropy value was calculated for every frame and used as a template. For an entropy value smaller than the defined threshold, the behavior was predicted as normal; for entropy values higher than the set threshold, the behavior was presumed to be abnormal. A flow chart of the proposed improved entropy classifier is shown in Figure 18, and Figure 19 depicts results over the event classes.

**Figure 18.** Flow chart of the improved entropy classifier.

**Figure 19.** Crowd behavior detection. (**<sup>a</sup>**,**<sup>c</sup>**) Normal frames and (**b**,**d**) abnormal frames.
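The entropy computation of Equations (22)–(27) reduces to a few matrix operations per frame, as sketched below; the feature matrix, the normalizing constant (taken here as 1/ln *n* so that each entropy lies in [0, 1]), and the decision threshold are illustrative choices rather than the authors' exact settings.

```matlab
% Improved-entropy sketch (Equations (22)-(27)); X is a placeholder feature
% matrix with n humans (rows) and m features (columns).
X = rand(12, 6);                               % placeholder feature matrix (n x m)
[n, m] = size(X);

Xn = (X - min(X)) ./ (max(X) - min(X) + eps);  % Equation (22): standardization
q  = Xn ./ (sum(Xn) + eps);                    % Equation (23): weight of each feature value
k  = 1 / log(n);                               % normalizing constant so each e_j is in [0, 1]
e  = -k * sum(q .* log(q + eps));              % Equation (24): information entropy per feature

dcoef = 1 - e;                                 % Equation (25): difference coefficients
D = max(dcoef) / min(dcoef);                   % Equation (26): maximum ratio
a = 9;                                         % highest value of the 1-9 scale
R = (D / a)^(1 / (a - 1));                     % Equation (27): scale ratio

frameEntropy     = sum(e);                     % crowd behavior entropy of the whole frame
entropyThreshold = 3;                          % decision threshold (assumption)
isAbnormal       = frameEntropy > entropyThreshold;
```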

#### **4. Performance Evaluation**

In this section, we evaluated the performance of our proposed system. We conducted experiments on two publicly available benchmark datasets to evaluate the accuracy and performance of our proposed model. The PETS2009 dataset was used to evaluate the accuracy of multi-person tracking and the UMN dataset was used to evaluate the accuracy of crowd behavior detection. We started by briefly describing the datasets used, and then the experimental results were discussed. Finally, we showed the mean accuracy of our proposed system. We also compared our proposed model with other state-of-the-art multi-person tracking and crowd behavior detection systems.

#### *4.1. Datasets Description*

#### 4.1.1. PETS2009 Dataset

To evaluate different video surveillance challenges, we used PETS2009, one of the publicly available benchmark datasets. The challenges included the S1 dataset for counting persons in a low-density crowd, the S2 dataset for detecting and tracking persons in medium-density crowds, and the S3 dataset for tracking and estimating the number of persons in a high-density crowd. Some sample frames of different synchronized views from PETS2009 dataset are depicted in Figure 20.

**Figure 20.** Sample frames of different synchronized views from the PETS2009 dataset.

#### 4.1.2. UMN Dataset

The UMN dataset is a publicly available benchmark for evaluating crowd behavior detection in video surveillance. It consists of three different scenes, specifically, two outdoor and one indoor, with videos of 11 different panic scenarios. The two outdoor scenes were the lawn scene, consisting of two scenarios with 1453 frames, and the plaza scene, with three scenarios comprising 2142 frames. The indoor scene contained six scenarios with 4144 frames. Sample frames of different scenarios of the UMN dataset are shown in Figure 21.

**Figure 21.** Sample frames of different scenarios of the UMN dataset.

#### *4.2. Experimental Settings and Results*

We performed all the experiments in MATLAB on a hardware system with a 64-bit Intel Core i3 2.5 GHz CPU and 6 GB of RAM. Three experimental measures were used to evaluate the performance of the system: (1) the mean accuracy of multi-person tracking, (2) the accuracy of human crowd behavior detection, and (3) comparisons of our proposed system with other current, well-known systems. Experimental results showed that our proposed system produces a higher accuracy rate than existing systems.

#### 4.2.1. Experiment 1: Multi-Person Tracking over the PETS2009 Dataset

Experimental results and mean accuracy of our proposed multi-person counting and tracking model on a publicly available PETS2009 dataset are shown in Tables 1 and 2. The ground truth was obtained by counting the number of persons in every sequence, where one sequence contained 20 frames. Table 1 depicts the mean accuracy of our proposed multi-person counting system on the first 30 sequences. As shown, the mean accuracy of our proposed model was 89.80%.


**Table 1.** Multi-person counting accuracy over the PETS2009 dataset.


**Table 2.** Multi-person tracking accuracy over PETS2009 dataset.

Table 2 presents the mean accuracy of our proposed multi-person tracking system. The actual number of humans is the same as in Table 1, while column 2 represents the successful tracking rate of our proposed particles force model and column 3 depicts the failure cases. The mean accuracy of our proposed model for multi-person tracking was 86.95%.

#### 4.2.2. Experiment 2: Human Crowd Behavior Detection over the UMN Dataset

Experimental results using the confusion matrix and the mean accuracy of our proposed HCB model on the publicly available UMN dataset are shown in Table 3. The algorithms were evaluated by running them throughout a test sequence, with initialization from the ground truth position in the first frame.

**Table 3.** Confusion matrix, showing mean accuracy for human crowd behavior detection on the UMN dataset.


#### 4.2.3. Experiment 3: Multi-Person Tracking and HCB Detection Comparisons with State-of-the-Art Methods

We compared our proposed system with other well-known multi-person tracking and human crowd behavior detection methods. As depicted, our system performed better compared to existing well-known state-of-the-art methods. Table 4 shows that, in comparison to other state-of-the-art methods, our proposed system achieved an admirable accuracy rate of 86.06% for crowd behavior detection, which is higher than the accuracy of the force field model (FF) (81.04%) and the social force model (SF) (85.09%). The accuracy of other methods under the same evaluation settings was taken from [77,79].

**Table 4.** Comparison of the proposed approach with other state-of-the-art methods for human crowd behavior detection on the UMN dataset.


Table 5 presents the comparison of our proposed system with other state-of-the-art systems for multi-person counting. Experimental results show that our proposed system achieved a higher accuracy rate (89.8%) than existing methods.

**Table 5.** Comparison of proposed approach with state-of-the-art multi-person counting methods.


In Table 6, comparisons of multi-person tracking with other state-of-the-art methods show that our proposed system achieved a higher accuracy rate (86.9%) than existing methods.

**Table 6.** Comparison of the proposed approach with state-of-the-art multi-person tracking methods.


## **5. Conclusions**

In this paper, we proposed a new, robust approach for crowd counting, tracking, and human behavior detection based on the idea of a mutually interacting particles force model and an improved entropy classifier with spatio-temporal and particles gradient motion descriptors. Through detailed experiments, we demonstrated the ability of the method to efficiently count, track, and detect the behavior of multiple persons in crowded scenes. The performance of our new tracking system decreases marginally with an increasing number of persons in the scene, mainly due to the full occlusions that occur in the test videos. We achieved promising results on the publicly available benchmark PETS2009 dataset, with an accuracy of 89.80% for multi-person counting and 86.95% for person tracking, as shown in Tables 1 and 2. For HCB detection, we achieved promising results on the publicly available benchmark UMN dataset, with an accuracy of 86.06%, as shown in Table 3. Our future work will focus on occlusion reasoning methods to further tackle the occlusion problem. We will also extend our work to multiple scene detection; we are interested in recognizing different scenes, such as sport scenes, fight scenes, robbery scenes, traffic scenes, and action scenes.

**Author Contributions:** Conceptualization, F.A., Y.Y.G. and K.K.; methodology, F.A. and A.J.; software, F.A.; validation, F.A., Y.Y.G. and K.K.; formal analysis, M.G. and K.K.; resources, Y.Y.G., M.G. and K.K.; writing—review and editing, F.A., A.J. and K.K.; funding acquisition, Y.Y.G. and K.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) under Grant 2018R1D1A1A02085645, by the Korea Medical Device Development Fund grant funded by the Korean government (the Ministry of Science and ICT; the Ministry of Trade, Industry and Energy; the Ministry of Health and Welfare; and the Ministry of Food and Drug Safety) (202012D05-02), and by Hanyang University Grant No. 201800000000647.

**Data Availability Statement:** Data sharing not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
