**Citation:** Abdullah, F.; Ghadi, Y.Y.; Gochoo, M.; Jalal, A.; Kim, K. Multi-Person Tracking and Crowd Behavior Detection via Particles Gradient Motion Descriptor and Improved Entropy Classifier. *Entropy* **2021**, *23*, 628. https://doi.org/10.3390/e23050628

Academic Editor: Amelia Carolina Sparavigna

Received: 1 April 2021; Accepted: 12 May 2021; Published: 18 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Multi-person tracking is currently one of the most essential and challenging research topics in the computer vision community [1–9]. Because of the common availability of high-quality low-cost video cameras and considering the inefficiency of manual surveillance and monitoring systems, automated video surveillance is now essential for today's crowded and complex environments. To monitor, control, and protect crowds, accurate information about numbers plays a vital role in operational and security efficiencies [10–16]. The counting and tracking of many persons is a challenging problem [17–25] due to occlusions, the constant displacement of people, different perspectives and behaviors, varying illumination levels, and because, as the crowd gets bigger, the allocation of pixels per person decreases.

A primary concern in surveillance and monitoring systems is to identify human crowd behaviors and supervise the crowd to prevent disasters and unforeseen events [26–34]. The analysis of human behavior in crowded scenes is one of the most important and challenging areas in current research [35–43]. Traditional visual surveillance systems that depend purely on manpower to analyze videos are inefficient because of the enormous number of cameras and screens that require monitoring, human fatigue from lengthy monitoring periods, a lack of essential prior knowledge and training in what to look for, and the colossal amount of video data that is generated per day. Such issues necessitate an automated visual surveillance system that can reliably detect, isolate, analyze, identify, and alert responders to unusual events in real time. Automated surveillance systems seek to detect human behaviors automatically in crowded scenes, and they have many potential applications, such as security, care of the elderly and infirm, traffic monitoring, inspection tasks, military applications, robotic vision, sports analysis, video surveillance, and pedestrian traffic monitoring [44–52].

In this research article, we propose a robust new particles-based approach for multi-person counting and tracking, which addresses the problematic fact that, as the density of a crowd increases, the number of pixels allocated per human decreases. By using our particles-based approach, we were able to count and track multiple persons in crowded scenes and efficiently deal with occlusions, arbitrary movements, and overlaps. We also propose a new approach for crowd behavior detection using an improved entropy classifier based on the fusion of extracted global and local descriptors. First, we applied pre-processing steps to the extracted video frames for noise removal, edge detection, and contrast adjustment; human/non-human detection was then performed using multi-level thresholding and morphological operations. We applied a distance algorithm for human silhouette extraction. After that, our work involved two facets: (i) multi-person tracking and (ii) crowd behavior detection. In the multi-person tracking phase, we first verified the extracted silhouettes with a particles force model: we converted the extracted foreground objects into particles and, using the physics of the mutually interacting particles force model, discarded non-human objects. As every extracted human silhouette is a collection of particles, we treated each group of particles that makes up one silhouette as a cluster and performed labeling and cluster estimation using a K-nearest neighbors search algorithm to count the persons. We then assigned each human silhouette a unique integer ID and performed multi-person tracking using normalized cross correlation as a cost function together with the Jaccard similarity index. For crowd behavior detection, we used a fusion of global and local descriptors; that is, after foreground extraction, we extracted a human crowd contour as a global descriptor and a particles gradient motion (PGM) descriptor, along with geometric and speeded up robust features (SURF), as local descriptors. Bat optimization was then applied to this fusion of global and local descriptors to select the optimal descriptors. Finally, by using Shannon's information entropy theory [53], we introduced an improved entropy classifier to detect crowd behavior.
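As a point of reference for the entropy-based classification mentioned above, the following minimal sketch computes plain Shannon entropy, H = -Σ p·log2(p), over a set of discrete labels. It is only an illustration of the underlying quantity and is not the improved entropy classifier proposed in this paper; the function name `shannon_entropy` and the quantized direction labels are hypothetical and chosen purely for the example.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention: an empty observation carries no information
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical quantized motion directions of particles in one frame.
orderly = ["east"] * 8                             # everyone moves the same way
chaotic = ["east", "west", "north", "south"] * 2   # directions evenly spread

print(shannon_entropy(orderly))  # low entropy: motion is fully predictable
print(shannon_entropy(chaotic))  # higher entropy: motion is disordered
```

Intuitively, a scene in which all motion labels agree yields zero entropy, whereas evenly spread labels yield the maximum for that number of classes, which is why entropy is a natural measure of disorder in crowd motion.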

Experimental results show that our proposed system performs better than existing well-known state-of-the-art methods. The proposed system has many potential applications, such as crowd density estimation, security, care of the elderly and vulnerable, sports analysis, inspection tasks, military applications, robotic vision, video surveillance, and pedestrian traffic monitoring. The major contributions of this paper can be highlighted as follows:


The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 gives a detailed overview of the proposed model for multi-person tracking and crowd behavior detection, which includes preprocessing, human silhouette extraction, the particles force model, multi-person counting, multi-person tracking, global and local feature extraction, bat optimization, and an improved entropy classifier. In Section 4, we evaluate the performance of our proposed approach on a publicly available benchmark dataset and give a detailed comparison of our proposed approach with other state-of-the-art methods. Lastly, in Section 5, we sum up the paper and outline future directions.

#### **2. Related Work**

During the last few years, several algorithms and systems have been developed by different researchers for crowd counting, tracking, and human behavior detection [54–62]. Here, we divide the related work into two parts, namely, human crowd behavior detection systems and multi-person counting and tracking systems.

#### *2.1. Crowd Behavior Detection Systems*

Many contributions have been proposed to describe crowd behavior using various models [63–69]. Crowd behavior detection is a challenging problem due to the arbitrary movements of individuals and groups, partial or full occlusions, different outlooks and behaviors, posture changes, and composite backgrounds [70–76]. To detect human behaviors automatically in crowded areas, S. Wu et al. in [77] constructed a density function of optical flow based on class-conditional probability and described the motion of crowds using divergent centers and potential destinations so that anomalies can be detected on the basis of a Bayesian framework. However, the system is not effective for arbitrary movements or overlaps. S. Choudhary et al. in [78] proposed a SIFT feature extraction technique, along with a genetic algorithm for optimal feature extraction; anomalies were detected by checking feature set movement behaviors. Their proposed system has a very high computational cost. Direkoglu et al. in [79] used a one-class SVM, along with features based on optical flow, to detect crowd behavior; their system is limited by the accuracy of optical flow estimation. W. G. Aguilar et al. in [80] introduced a moved-pixels density-based statistical modeling approach for detecting abnormal crowd behavior. This system has low computational cost, but its efficiency decreases with increasing complexity of the situation being monitored, e.g., serious occlusions. A. Shehzed et al. in [81] first detected humans and then used a Gaussian smoothing technique to detect anomalous behavior; however, the accuracy of the system decreases with illumination changes and occlusions because thresholding is used for detection. W. Ren et al. in [82] introduced a behavior entropy model for detecting abnormal crowd behavior using spatio-temporal information, along with behavior certainty of pixels, but the system is vulnerable to certain misclassifications due to interclass similarities. G. Wang et al. in [83] addressed the crowd behavior detection problem by using the pyramid Lucas-Kanade optical flow [84] method based on location estimation of adjacent flow; however, the proposed method is not effective for an unstructured crowd. R. Mehran et al. in [85] placed a grid of particles on the image and introduced a social force model for detecting crowd behavior. N. Bellomo et al. in [86] pursued two specific objectives: the derivation of a general mathematical structure, based on appropriate developments of the kinetic theory, suitable for capturing the main features of crowd dynamics, and the derivation of macroscopic equations from the underlying mesoscopic description. R.M. Colombo et al. in [87] dealt with macroscopic modelling of crowd movements, particularly how non-local interactions are influenced by walls, obstacles, and exits. An ad hoc numerical algorithm, along with a heuristic evaluation of its convergence, was also provided. S.D. Khan et al. in [88] proposed a scale estimation network (SENet) and a head detection network. SENet takes the input image and predicts the distribution of scales (as a histogram) of all heads in the input image, which are later classified by the detection network.

#### *2.2. Multi-Person Counting and Tracking Systems*

Extraction of the true foreground, i.e., human pixels, is one of the primary steps for accurate counting and tracking of humans in crowded scenes [89–93]. Several approaches and systems have been introduced by many researchers for multi-person counting and tracking. In [94], S. Choudri et al. proposed a pixels-based people counting model using the fusion of a pixel map-based algorithm along with human detection to count only human classified pixels. They applied a depth map, image segmentation, and a human presence map that was updated with a human mask for the purpose of counting people; however, the system has misclassification problems due to interclass similarities. H. Chen et al. in [95] proposed a new color and intensity patch segmentation approach for detecting and tracking human body parts and the full body. They applied a fusion of color space segmentations for the detection of body parts and the full body. For tracking, they adaptively selected the track gate size based on the velocity of a target. A target's likely forward position was predicted based on the target's previous velocity and direction. The proposed algorithm achieved satisfactory results only when the number of people in the view was limited, i.e., efficiency decreases as the crowd increases. In [96], J. Garcia et al. introduced a head tracking-based directional people counter. Using several circular patterns and preprocessing steps, people's heads were detected. For tracking, a Kalman filter was used, and counting was achieved on the basis of head detection and tracking. The effectiveness of the proposed algorithm decreases during serious occlusions, arbitrary movements, and overlaps. M. Vinod et al. in [97] introduced object tracking and counting using new morphological techniques. The frame-difference technique, followed by morphological processing and region growing, was used for counting people. Moving objects were extracted by determining their movements, and then tracking was performed using color features. As the illumination of the scene changed, the efficiency of the proposed algorithm decreased. G. Liu et al. in [98] proposed a tracker based on a correlation filter. Kalman filters were used for tracking. They designed a tracker that detects numerous positions and alternate templates. However, the system was not efficient in dealing with complex situations, such as occlusions and random movements. E. Ristani et al. in [99] used deep learning to track multiple persons. Using a CNN, they extracted features and then introduced a weighted triplet loss strategy to assign weights during training. Their system was computationally complex, and a huge dataset was essential for training. H. Xu et al. in [100] located humans by their shoulders and heads, and, for tracking, they used trajectory analysis and the Kalman filter, but the system was not effective for arbitrary movements or overlaps.

#### **3. Proposed System Methodology**

This section elaborates our proposed methodology for multi-person tracking and crowd behavior detection. We propose a robust multi-person tracking system based on a particles force model and a human crowd behavior detection system using an improved entropy classifier with spatio-temporal and particles gradient motion descriptors. In the proposed system, the first step is the preprocessing of video frames extracted from a static camera. Secondly, object detection is performed using multi-level thresholding, morphological operations, and labeling. Thirdly, for human silhouette extraction, a distance algorithm is applied, and non-human filtering is performed on all extracted labeled objects. At this stage, we divided our work into two streams. The first stream is multi-person counting and tracking. Here, we first performed a human silhouette verification step by converting the extracted objects into particles and introducing a robust particles force model to verify the silhouettes. After verification, as every verified human silhouette is a collection of particles, we treated each group of particles as a cluster and performed labeling and cluster estimation using a K-nearest neighbors search algorithm for multi-person counting. For multi-person tracking, the position of each detected human silhouette was then located and locked by assigning an integer ID that fixes the silhouette temporally across the full video, and the fixed humans were tracked using the Jaccard similarity index. In the second stream, crowd behavior detection, the extracted foreground objects were passed through a feature extraction step in which multiple distinguishable global and local features were extracted from every frame. After that, all the extracted features were optimized using the bat optimization algorithm. Lastly, in the classification phase, an improved entropy classifier was proposed for the detection of crowd behavior. Figure 1 depicts the synoptic schematics of our proposed system.

**Figure 1.** Synoptic schematics of the proposed Multi-Person Tracking and Crowd Behavior Detection system.
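To make the ID-based tracking step more concrete, the sketch below shows one simple way that integer IDs could be carried from one frame to the next using the Jaccard similarity index, |A ∩ B| / |A ∪ B|, between silhouette pixel sets. This is an illustrative assumption rather than the exact procedure of the proposed system: the greedy matching strategy, the `threshold` value, and the helper names `jaccard_index`, `propagate_ids`, and `box` are all hypothetical, and the normalized cross correlation cost is omitted for brevity.

```python
def jaccard_index(a, b):
    """Jaccard similarity |A & B| / |A | B| of two pixel sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def propagate_ids(previous, current, threshold=0.3):
    """Greedy ID association between consecutive frames (illustrative only).

    previous: dict mapping integer ID -> set of (row, col) silhouette pixels
    current:  list of silhouette pixel sets detected in the new frame
    Returns a dict mapping integer ID -> pixel set for the new frame.
    """
    # Score every (previous ID, current silhouette) pair, best overlap first.
    pairs = sorted(
        ((jaccard_index(pixels, cur), pid, idx)
         for pid, pixels in previous.items()
         for idx, cur in enumerate(current)),
        key=lambda t: t[0],
        reverse=True,
    )
    assigned, used_ids, used_idx = {}, set(), set()
    for score, pid, idx in pairs:
        if score < threshold:
            break  # remaining pairs overlap too little to be the same person
        if pid in used_ids or idx in used_idx:
            continue
        assigned[pid] = current[idx]
        used_ids.add(pid)
        used_idx.add(idx)
    # Unmatched silhouettes are treated as new persons and get fresh IDs.
    next_id = max(previous, default=0) + 1
    for idx, cur in enumerate(current):
        if idx not in used_idx:
            assigned[next_id] = cur
            next_id += 1
    return assigned


def box(r0, c0, r1, c1):
    """Hypothetical helper: a filled rectangle standing in for a silhouette."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


previous = {1: box(0, 0, 4, 4), 2: box(10, 10, 14, 14)}
current = [box(10, 11, 14, 15), box(0, 1, 4, 5), box(20, 20, 24, 24)]
tracked = propagate_ids(previous, current)
print(sorted(tracked))  # IDs present in the new frame
```

In this toy example, the two existing persons each shift by one pixel and keep their IDs, while the third silhouette has no overlap with any previous one and is therefore given a new ID. A greedy match is used here only for simplicity; a globally optimal assignment would be an equally valid choice.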
