Remote Sensing
  • Article
  • Open Access

15 July 2021

A Wide Area Multiview Static Crowd Estimation System Using UAV and 3D Training Simulator

Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
* Author to whom correspondence should be addressed.
S.S. wrote the software, conducted the experiments and wrote the paper. B.T. and H.C.M. advised and supervised S.S., helped with the experimental design, and proofread the paper.
This article belongs to the Special Issue Advances in Object and Activity Detection in Remote Sensing Imagery

Abstract

Crowd size estimation is a challenging problem, especially when the crowd is spread over a significant geographical area. It has applications in monitoring rallies and demonstrations and in calculating assistance requirements in humanitarian disasters, so building a crowd surveillance system that scales to large crowds is an important goal. UAV-based techniques are an appealing choice for crowd estimation over a large region, but they present a variety of interesting challenges, such as integrating per-frame estimates through a video without counting individuals twice. Large quantities of annotated training data are required to design, train, and test such a system. In this paper, we first review several crowd estimation techniques, existing crowd simulators, and data sets available for crowd analysis. We then describe a simulation system that provides such data, avoiding the need for tedious and error-prone manual annotation, and evaluate synthetic video from the simulator using various existing single-frame crowd estimation techniques. Our findings show that the simulated data can be used to train and test crowd estimation methods, providing a suitable platform on which to develop such techniques. We also propose an automated UAV-based 3D crowd estimation system that can be used for approximately static or slow-moving crowds, such as public events, political rallies, and natural or man-made disasters. We evaluate the system by applying our new framework to a variety of scenarios with varying crowd sizes, and it gives promising results on widely accepted metrics, including MAE, RMSE, Precision, Recall, and F1 score.

1. Introduction

Crowd estimation refers to the practice of calculating the total number of people present in a crowd. Manual crowd estimation and automated crowd estimation are the two broad approaches to measuring crowd size, and the appropriate method varies with the crowd size. Manually monitoring and estimating a small crowd by splitting people into groups is a traditional approach that is still in use. However, manual estimation of a large crowd is impractical, expensive, and time-consuming. This has prompted scientists and researchers from various disciplines across the globe to develop automated crowd estimation systems that calculate the number of people in a large crowd. In the last five years, the domain has expanded rapidly. The introduction of deep learning methods, coupled with the easy availability of powerful GPU-based systems, has provided a step change in computer vision algorithms across a range of problem domains, starting with classification but quickly moving on to other areas such as crowd estimation. A number of well-publicized crowd-related incidents and gatherings have also drawn the attention of researchers and the computer vision community, prompting them to develop accurate crowd surveillance systems.
An example application is for use in major disasters. In such a scenario, a crowd estimation system would give a more accurate picture of the crowd and number of affected people and their geographical spread. This would enable proper coordination of the disaster teams, leading to more efficient relief-aid work.
Crowd estimation using Unmanned Aerial Vehicles (UAVs) is an emerging research area, due to their potential to cover a wide area in a short period. However, it introduces new estimation problems: both the camera and the crowd are likely to be moving, so there is a risk of counting the same person multiple times. Most existing automated methods focus on individual frames from a single static camera. Recently, there has been some promising research on multiple-view systems and UAV-based cameras. UAVs can cover a large area, but they pose the problems of (1) a moving camera, (2) a crowd that may move during the capture time, and (3) different viewpoints, which require extensive additional training and testing data.
Given the challenges of gathering and annotating data, our paper explores the use of a simulator to generate the training images and annotated ground-truth data. Furthermore, we have introduced a novel automated 3D crowd estimation system using a UAV, that was trained and tested with our simulator. In the initial phase, we have focused on the problem of static crowds and intend to move towards more dynamic crowds in the future. The motivation for developing our system is for crowd flow management, large-scale public gathering monitoring, public event security and relief-aid work by welfare organisations in disaster-hit areas.
This paper covers the following contributions:
  • We have extensively studied the existing crowd estimation methods, data sets, and open-source crowd simulators, along with an assessment of their shortcomings. We have focused on the intended use to identify the need to develop a new simulator for estimating the crowd using UAV.
  • We have explored in detail the development of a new 3D crowd simulation system that can generate the required training images and annotated ground truth data. Furthermore, we have generated various 3D models along with accompanying camera locations and orientations.
  • We have trained, tested and validated the simulation system against real-crowd data, where we have tested synthetic data against real crowd data sets using various state-of-the-art methods. Furthermore, we have trained a new model based on our aerial synthetic data and tested it against the real-crowd data.
  • We have introduced a novel 3D crowd estimation technique using a UAV for a robust and accurate estimation of a crowd spread over a large geographical area. Our proposed solution overcomes the issue of counting the same individual multiple times from a moving UAV.
  • We also discuss the remaining challenges for wide-area crowd estimation and suggest future directions for research. Additionally, we have covered significant issues for aerial crowd data collection and put forward some promising research challenges that need to be explored.
The remainder of this paper is organized as follows:
Section 2 provides an up-to-date review of the most relevant recent literature, including recently introduced crowd estimation methods. Section 3 provides a detailed step-by-step discussion of our approach and describes the benchmarks used for evaluation. In Section 4, we discuss the implementation and setup of our system. In Section 5, we present and analyze the results of our experimental evaluation. Section 6 summarizes the work and discusses the results and their interpretation. Finally, we present conclusions and future directions in Section 7.

3. 3D Crowd Estimation Using UAV

In this section, the techniques and tools used to develop our 3D crowd estimation technique with UAVs are covered in detail, along with the way these tools are used in conjunction with one another. We also give an overview of the development of a crowd simulation for generating training and testing data. Unreal Engine is the main simulation tool, and Make Human and Anima are employed to design and import random crowds that mimic real-life settings. Furthermore, we discuss the process used for training, testing and validation of synthetic data against real crowd data and vice versa. Finally, we introduce our novel method for 3D crowd estimation using UAVs in real time.

3.1. 3D Simulation and Modeling

Unreal Engine [81] is a game engine developed by Epic Games, originally focused on first-person shooter games. Version 4 (UE4) uses Blueprint and C++ as its main languages. With features such as the Blueprint interface, game modes, simulation, real-time output and automatic annotation, it is a good fit for building a 3D framework, especially for simulating real-life scenarios that rarely occur.
Make Human [82] is an open-source 3D computer graphics application used to create realistic humanoids. We use it to design crowd members of different genders, ages and features. Supported by a large community of programmers, artists and people with academic interests in 3D character modeling, the tool is written in Python and is compatible with almost all common operating systems. Make Human is easy to use and can export a skeleton or a static mesh as required by other simulation tools such as Unreal Engine.
Anima [83] is a 3D people animation application developed specifically for architects and designers, intended for creating animated 3D people quickly and easily. We used it to create many 3D animated people and realistic scenarios. The crowd flow and movement directions are plotted so as to avoid collisions and maintain a realistic flow. Many realistic 3D models, such as stairs, escalators, tracks, and moving sidewalks, are pre-designed and easy to import into UE4, which not only helps to design and simulate complex scenarios quickly but also saves time when creating new realistic settings.
COLMAP [84] is a 3D reconstruction tool that uses patch-based stereo to reconstruct dense 3D point clouds. In our proposed method, it is used to generate 3D models from images extracted from Unreal Engine. COLMAP provides intrinsic parameters (e.g., the camera model) and extrinsic parameters (e.g., camera location and rotation). Several studies [85,86] have critically compared the results of popular multiview stereo (MVS) techniques and concluded that COLMAP achieves the best completeness and, on average, produces promising results for most individual categories.

3.2. Why 3D Simulation?

Building a 3D simulator of a crowd is a less time-consuming and more cost-effective way to train and test the system, and it provides accurate ground truth information about people and their locations. In addition, a 3D simulator is useful for creating and designing simulations of seldom-occurring events and understanding their real-world outcomes. It can also be used to train the system on such rare possibilities, such as stampedes and large public gatherings, within the crowd data sets. Another factor that attracts computer vision researchers toward 3D simulation is the virtual data set: it becomes possible to construct a virtual data set by creating various scenarios, events, and their outcomes in real time, which can help to train and test the system [87].

3.3. Overview of the Proposed Testbed

In a short time, UAVs have gained enormous prominence due to their ability to address major problems. However, obtaining a licence and permission to fly a UAV near a crowd is hard, expensive, and time-consuming in most countries due to rigorous restrictions and regulatory limits. Navigating and coping with a variety of precise settings and unforeseen situations can also be difficult. Handling a UAV in a gusty environment with a short flying time and range, for example, makes mapping a large area inefficient and potentially dangerous in real life. All of these factors make 3D simulation an ideal solution, as it has no negative implications or ethical issues.
Considering the challenges of gathering and annotating real data, we built the crowd simulation system using Unreal Engine version 4 (UE4). The first step in creating the virtual environment was the design of basic prototypes and reusable meshes such as houses and trees, as shown in Figure 1. We then placed these meshes within the environment to give it a real-life look, using smooth, linear, and spherical sculpting features to flatten the terrain and reduce surface noise. Animation and wind effects were incorporated to make the virtual environment more realistic. These models can be imported and utilized in a variety of settings, making the process of building and generating scenarios quick, easy, and adaptable to requirements.
Figure 1. The figure demonstrates the various steps involved before simulating the crowd: (a) shows the skeleton design sample of a person containing in-depth details, which can be exported and further used in UE4; (b) shows one of the samples designed to be part of the crowd; (c) contains the basic template for the first person, used as a map to place various objects; (d) demonstrates the design and animation of a mesh (a tree sample is presented in this image); (e) shows a sample Blueprint command line to calibrate and establish the interaction between different objects; (f) shows an initial output map designed before placing the crowd in the environment; (g) shows how different crowd samples look when they are ready for simulation; (h) depicts the final image after starting the simulation, captured from the top view and showing our UAV prototype used in UE4; (i–l) demonstrate various scenarios where the crowd was randomly distributed in diverse settings.
Before simulating the environment, it is necessary to create a synthetic crowd prototype comprising different genders. Hence, we used Make Human and Anima to design and generate random crowds, using random sets of features for different random variables to mimic the real world. The random crowd consists of individuals of different genders, ages, weights, heights, ethnicities, proportions, outfits, poses, colors, and geometries. For proper representation, we used different geometric shapes and topologies for the eyes, hair, teeth, eyebrows, eyelashes, etc. of each individual.
Manually annotating the crowd in a densely crowded image is an extremely laborious and time-consuming task with a high chance of false annotations or multiple counts of the same individual. While captured 2D data can have good image resolution, the inherent limitations of 2D mean it cannot provide every detail required for estimating the crowd in 3D. In contrast, data collection within the 3D simulation system makes it relatively easy to generate large amounts of accurate data, especially when using a moving camera over a large crowd. Our proposed 3D simulation system generates automatic annotations and can provide 3D world and relative locations to estimate the crowd in any static or dynamic event. The simulation system is also able to generate virtual data sets that could be beneficial in future research within the domain, and it addresses existing issues such as the limited availability of massive crowd data sets. The flowchart in Figure 2 depicts the steps taken to capture frames while storing ground truth (GT) positions at the same time. The collected data were used in Section 3.4 for the training, testing and validation of the simulation system and the generation of synthetic data. Furthermore, the 3D annotations collected by flying the UAV were extracted from the simulator and used in the final 3D method introduced in Section 3.5.
Figure 2. The flowchart presents the whole pipeline for capturing synthetic data with a 3D simulation system.

3.4. Training, Testing and Validating the Simulation System against Real-Crowd Data

Given the requirements of UAV crowd estimation and the limitations of flying a UAV in the real world, we have introduced a novel way to estimate the crowd using synthetic images extracted from our simulation system. We started by implementing our initial idea of building and testing the simulation system. Using aerial photos gathered from a drone as the basis for assessment was very challenging, so we prototyped photo-realistic humanoids of various sizes and integrated all of their meshes and skeletons into the simulator to make it as realistic as possible.
Among the methods discussed in Section 2.3, we evaluated the advantages and disadvantages of the broad approaches. Most recent studies [34,38,39,40] suggest that the multicolumn CNN [10] method achieves the best results on the ShanghaiTech data set and is efficient enough to train, test and validate the simulation system against real-crowd data. ShanghaiTech is a good fit as it has been one of the largest large-scale crowd counting data sets of the previous few years. It consists of 1198 images with 330,165 annotations. According to the density distributions, the data set is divided into two parts: Part A (SHA) and Part B (SHB). SHA contains images randomly selected from the internet, whereas Part B includes images taken from a busy street in a metropolitan area of Shanghai. The density in Part A is much higher than that in Part B, which makes SHA a more challenging data set and an ideal fit for large-crowd testing.
To test and validate the simulation system, we extracted the aerial video captured by the UAV within the simulator and split it into frames. Initially, we set up the ShanghaiTech data set for testing and validation against the synthetic images (Figure 3). To test the system, we prepared the data and created the training and validation sets along with ground truth files. We calculated the errors using mean absolute error (MAE) and root mean square error (RMSE), and produced the output in the form of density maps.
Figure 3. The pipeline shows the steps involved in testing synthetic data against a publicly available crowd data set.
After obtaining high throughput and validating the simulated data, we trained a model on the synthetic data using the multicolumn convolutional neural network (MCNN). Three parallel CNNs, whose filters have local receptive fields of different sizes, were used as shown in Figure 4. We used the same network structure for all columns (i.e., conv–pooling–conv–pooling) except for the sizes and numbers of filters. Max pooling was applied to each 2 × 2 region, and the Rectified Linear Unit (ReLU) was adopted as the activation function. We used fewer filters to minimise computation time.
Figure 4. The figure depicts the network architecture design and overview of single image crowd counting via multi-column network.
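To illustrate the multi-column design described above, the following is a minimal PyTorch sketch of a three-column network with different receptive-field sizes. The exact filter counts, kernel sizes and input resolution here are illustrative assumptions, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class MiniMCNN(nn.Module):
    """Simplified multi-column CNN: three parallel columns with large, medium
    and small receptive fields, fused into a single density map."""
    def __init__(self):
        super().__init__()
        def column(k, c):
            # conv-pooling-conv-pooling structure, as described in the text
            p = k // 2
            return nn.Sequential(
                nn.Conv2d(1, c, k, padding=p), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
                nn.Conv2d(c, 2 * c, k, padding=p), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )
        self.col_large = column(9, 8)    # large receptive field
        self.col_medium = column(7, 10)  # medium receptive field
        self.col_small = column(5, 12)   # small receptive field
        # 1x1 convolution fuses the concatenated column outputs into a density map
        self.fuse = nn.Conv2d(2 * (8 + 10 + 12), 1, kernel_size=1)

    def forward(self, x):
        features = torch.cat(
            [self.col_large(x), self.col_medium(x), self.col_small(x)], dim=1)
        return self.fuse(features)

# Example: a grayscale crowd image produces a (downsampled) density map whose
# integral approximates the crowd count.
model = MiniMCNN()
density = model(torch.randn(1, 1, 384, 512))
print(density.shape, density.sum().item())
```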

3.5. Our Approach to Crowd Estimation Using UAV

In this section, the tools and techniques used to develop the 3D crowd estimation technique using a UAV are discussed, highlighting step by step how these tools are interlinked. Make Human and Anima, used for designing and importing random crowds that mimic real-life settings, have already been introduced briefly.
In most recent studies, counting the same individual multiple times from a moving camera has been a major issue. We have attempted to overcome this by introducing a novel 3D crowd estimation technique using a UAV, enabling robust and accurate estimation of a crowd spread over a large geographical area. Figure 5 shows the step-by-step process of our method. First, basic prototypes and meshes were designed to set up a simulation environment in Unreal Engine, and Anima and Make Human were used to generate random crowds using random sets of features for different random variables to mimic real-life settings. After preparing the simulation environment, we flew a virtual UAV around the crowd and captured the ground truth 3D locations, which are used at the end to map the estimated 3D crowd locations; frames showing the crowd were also captured to train, test and validate the system. After extracting the captured data from Unreal Engine, we tested the captured virtual data using the state-of-the-art MCNN method. A Laplacian of Gaussian (LoG) blob detector was then applied to the density map produced by the MCNN to identify the possible 2D crowd locations, which were later ray traced to obtain the possible crowd locations in 3D. In the third step, we reconstructed a 3D model from the frames captured in UE4 and collected in-depth details of the model, such as camera location, quaternion matrix, camera translation, and points such as screen points and 3D points, for every 2D input image. Finally, we initiated ray hit testing, traced the possible 2D crowd locations extracted from the blob detector, and stored the intersection points between each ray and the model plane, treating them as the possible crowd locations in the 3D model. Because the traced 3D locations from successive frames overlapped, we set up an averaging method and discarded most of the overlapping points from each frame. To map the estimated output points to the ground truth points captured from UE4, we used the ICP algorithm to register both point sets. Once it converged, we mapped the ground truth points to the estimated points using a nearest-neighbour search and extracted the matched pairs between the two sets, where p_i is considered a match to q_j if the closest point in Q to p_i is q_j and the closest point in P to q_j is p_i. The result was then evaluated against various universally agreed and popularly adopted measures for crowd counting model evaluation, discussed further in Section 3.6.
Figure 5. The system architecture diagram provides a detailed representation of the steps involved in our approach to crowd estimation using a UAV.
To present the method more clearly, we divided the process into several steps and describe the working of each step, with visualisations of how the output looks. The steps involved in the presented method are as follows:
Step 1: Make Human and Anima were used to design and generate random crowds using random sets of features for different random variables, giving a random crowd that mimics real-life settings. The generated people cover different genders, ages, weights, heights, ethnicities, proportions, clothes, poses, colours, geometries, etc. Different geometries for each person, including eyes, hair, teeth, topologies, eyebrows and eyelashes, were used to make the synthetic crowd a closer match to a real crowd. These individuals were then used in the simulation system's estimation process.
Step 2: Unreal Engine (UE4) is primarily used as a platform to simulate various real-life scenarios that rarely occur. To make the setup more practical and closer to real-life situations, we used random crowd distribution (Figure 6); because of the random distribution, the crowd size for each simulated scenario is unknown before testing. Algorithm 1 covers Steps 1 and 2, giving a detailed overview of how the simulation scenario was created and how 3D locations were extracted for the simulated crowd within the system.
Algorithm 1 Algorithm for 3D simulation and data collection.
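Algorithm 1 is reproduced as an image in the published article, so its body is not shown here. The following is a minimal, hypothetical Python sketch of the bookkeeping it describes: spawning a random crowd, flying a circular UAV path, and storing per-frame camera poses together with 3D ground-truth locations. The numbers (130 people, 127 frames, 60 m radius) and file names are illustrative assumptions; in our system this logic runs inside UE4 rather than in standalone Python.

```python
import json
import math
import random

def make_circular_path(center, radius, height, num_frames):
    """Camera waypoints on a circular flight path above the scene."""
    return [
        (center[0] + radius * math.cos(2 * math.pi * i / num_frames),
         center[1] + radius * math.sin(2 * math.pi * i / num_frames),
         height)
        for i in range(num_frames)
    ]

def collect_ground_truth(num_people=130, num_frames=127, out_path="ground_truth.json"):
    """Spawn a random static crowd and record, per frame, the camera pose and
    the 3D ground-truth positions of every person (a static crowd keeps the
    same positions in every frame)."""
    people = [
        {"id": pid,
         "location": [random.uniform(-50.0, 50.0), random.uniform(-50.0, 50.0), 0.0],
         "gender": random.choice(["female", "male"]),
         "height_cm": round(random.uniform(150, 195), 1)}
        for pid in range(num_people)
    ]
    path = make_circular_path(center=(0.0, 0.0), radius=60.0, height=40.0,
                              num_frames=num_frames)
    records = [{"frame": i,
                "camera_position": pose,
                "gt_locations": [p["location"] for p in people]}
               for i, pose in enumerate(path)]
    with open(out_path, "w") as f:
        json.dump({"people": people, "frames": records}, f, indent=2)
    return records

if __name__ == "__main__":
    collect_ground_truth()
```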
Figure 6. The figure shows a demonstration of Steps 1–2, where a synthetic image has been captured from the UAV.
Step 3: Density estimation and blob detection were used for projection and verification. To gather all the information and evaluate the output images, the system was trained using real images and tested on synthetic images provided by the simulator. A few state-of-the-art pre-trained models were considered for checking against both the synthetic and real data to train and test the system, and we incorporated a multicolumn convolutional neural network for single-image crowd counting. This process was repeated for all the data. The density heat maps generated by the person detector (Figure 7) for all the 2D images were used for mapping the 3D data. We then used a Gaussian blob detector to extract each individual's 2D location from the density maps. These coordinates were later ray traced to obtain the 3D locations and were crucial for filtering and determining whether or not an estimated point in the 3D model belonged to a person. Algorithm 2 covers Step 3 and highlights the procedure followed to extract the 2D coordinates of each person from every extracted image.
Algorithm 2 Algorithm for density estimation and Blob detection.
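Algorithm 2 is shown as an image in the published version. As an illustrative sketch of the same idea, the snippet below runs a Laplacian-of-Gaussian blob detector over a predicted density map to recover candidate 2D person locations. It uses scikit-image's blob_log; the sigma range and threshold are assumptions that would need tuning to the actual density maps.

```python
import numpy as np
from skimage.feature import blob_log

def density_map_to_points(density_map, threshold=0.05):
    """Extract candidate 2D person locations from a crowd density map using
    Laplacian-of-Gaussian blob detection. Returns (x, y) pixel coordinates."""
    # Normalise the map so the threshold behaves consistently across images.
    dm = density_map.astype(np.float64)
    if dm.max() > 0:
        dm = dm / dm.max()
    # blob_log returns rows of (row, col, sigma); sigma reflects blob size.
    blobs = blob_log(dm, min_sigma=1, max_sigma=8, num_sigma=8, threshold=threshold)
    return [(float(col), float(row)) for row, col, _sigma in blobs]

# Example with a synthetic density map containing two Gaussian bumps.
yy, xx = np.mgrid[0:64, 0:64]
fake_map = (np.exp(-((xx - 20) ** 2 + (yy - 30) ** 2) / 8.0) +
            np.exp(-((xx - 45) ** 2 + (yy - 12) ** 2) / 8.0))
print(density_map_to_points(fake_map))  # approximately [(20, 30), (45, 12)], order may vary
```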
Figure 7. The figure demonstrates Step 3: (a) shows the network output from MCNN in the form of a density map and (b) shows the blobs detected from the density map, which are further used for mapping and tracing the crowd.
Step 4: COLMAP is used to generate 3D models from the synthetic data (Figure 8) gathered from the simulator. Various simulated images were captured by flying the UAV over the randomly distributed crowd. The gathered data were merged into a realistic model using structure-from-motion (SfM) and multiview stereo (MVS). The whole pipeline returns 3D parameters such as camera location, quaternion matrix and camera translation, and points such as screen points and 3D points, for every 2D image provided as input.
Figure 8. The image presents the first step of the COLMAP 3D reconstruction, where a set of overlapping simulated images is provided as input.
This approach uses a set of multiview images captured by RGB cameras to reconstruct a 3D model of the object of interest; this kind of 3D reconstruction is often referred to as SfM-MVS. SfM (structure-from-motion) creates a sparse point cloud model from the input images.
First, the SfM technique determines the intrinsic (distortion, focal length, etc.) and extrinsic (position and orientation) camera parameters (Figure 9), putting the multiview images into a common context by identifying local features/keypoints in the images. The corresponding points are then used to build the 3D model and find the relationships between images. Algorithm 3 shows how the 3D model (Figure 10) is reconstructed, as explained in Step 4.
Algorithm 3 Algorithm for 3D model reconstruction using COLMAP.
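Algorithm 3 appears as an image in the published version. For readers who want to reproduce the reconstruction step, the following sketch drives COLMAP's standard command-line pipeline (feature extraction, matching, incremental mapping, model export) from Python via subprocess. The paths are illustrative, and the flags assume a reasonably recent COLMAP release is installed and on the PATH.

```python
import subprocess
from pathlib import Path

def run_colmap_sparse(image_dir: str, work_dir: str) -> Path:
    """Run COLMAP's SfM pipeline on the frames captured from the simulator:
    feature extraction -> exhaustive matching -> incremental mapping,
    then export the sparse model (cameras, images, points3D) as text files."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    database = work / "database.db"
    sparse = work / "sparse"
    sparse.mkdir(exist_ok=True)

    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(database),
                    "--image_path", image_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", str(database)], check=True)
    subprocess.run(["colmap", "mapper",
                    "--database_path", str(database),
                    "--image_path", image_dir,
                    "--output_path", str(sparse)], check=True)
    # Convert the first reconstructed model to readable text: cameras.txt,
    # images.txt (quaternion + translation per image) and points3D.txt.
    text_model = work / "sparse_txt"
    text_model.mkdir(exist_ok=True)
    subprocess.run(["colmap", "model_converter",
                    "--input_path", str(sparse / "0"),
                    "--output_path", str(text_model),
                    "--output_type", "TXT"], check=True)
    return text_model

# Example (assumes the 127 simulator frames live in ./frames):
# run_colmap_sparse("frames", "colmap_workspace")
```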
Figure 9. The figure shows the UAV path trajectory. The data was captured by following a circular path to store every crowd detail from the scene.
Figure 10. The figure shows the final step of the COLMAP reconstruction, where a 3D model is produced as output.
Step 5: A ray hit test was set up, using a starting point and direction, to find the intersection point between each ray and the 3D model plane. It was used to track down and estimate the crowd size in 3D while addressing the challenge of a moving camera and crowd. It would be possible to ray trace every point in each 2D image, but this would be very expensive and time-consuming. To overcome this problem, unrelated points were filtered out and rays were cast only for the points extracted by the blob detector, yielding the corresponding 3D location points. The returned ray intersection points, together with the relevant frame numbers, were stored and used in the next step to overcome the issue of counting the same individual multiple times.
Structure-from-Motion (SfM) is the process of reconstructing 3D structure from its projections in a series of images. The input is a set of overlapping images of the same object taken from different viewpoints; the output is a 3D reconstruction of the object, together with the reconstructed intrinsic and extrinsic camera parameters of all images. Typically, Structure-from-Motion systems divide this process into three stages: feature detection and extraction; feature matching and geometric verification; and structure and motion reconstruction. Multiview stereo (MVS) then takes the SfM output and computes depth and normal information for every pixel in an image. Fusing the depth and normal maps of multiple images in 3D produces a dense point cloud of the scene. Using the depth and normal information of the fused point cloud, algorithms such as Poisson surface reconstruction [88] can then recover the 3D surface geometry of the scene.
Figure 11 depicts the original model reconstructed from the overlapping images provided as input. Before moving forward, plotting the traced points back is an efficient way of checking accuracy. For this, we used the reconstructed intrinsic and extrinsic camera parameters of all images stored in a database and plotted the traced points back to recreate the same model, double-checking the data accuracy. Figure 12 shows the traced points back-projected onto the point cloud, which recreates an accurate model and confirms the reconstructed model's accuracy. The procedure followed in Step 5 is presented in Algorithm 4, which demonstrates how the intersection points were extracted using a ray hit test.
Algorithm 4 Algorithm for Ray Hit Testing.
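Algorithm 4 is included as an image in the published version. The snippet below is a minimal numpy sketch of the core operation it relies on: casting a ray from the camera centre through a detected 2D blob and intersecting it with a ground plane of the reconstructed model. The pinhole back-projection and the plane-based intersection are simplifying assumptions for illustration; the full system intersects rays with the reconstructed 3D model.

```python
import numpy as np

def pixel_to_ray(pixel, K, R, t):
    """Back-project a pixel into a world-space ray for a pinhole camera with
    intrinsics K and world-to-camera pose (R, t). Returns (origin, direction)."""
    # Camera centre in world coordinates: C = -R^T t
    origin = -R.T @ t
    # Direction of the ray through the pixel, rotated into world coordinates.
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    direction = R.T @ (np.linalg.inv(K) @ uv1)
    return origin, direction / np.linalg.norm(direction)

def ray_plane_intersection(origin, direction, plane_point, plane_normal):
    """Return the intersection of a ray with a plane, or None if the ray is
    parallel to the plane or the hit lies behind the camera."""
    denom = float(np.dot(plane_normal, direction))
    if abs(denom) < 1e-9:
        return None
    s = float(np.dot(plane_normal, plane_point - origin)) / denom
    return origin + s * direction if s > 0 else None

# Example: intersect the ray through a detected blob with the ground plane z = 0.
K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 40.0])  # camera centre C = -R^T t = (0, 0, -40), looking toward z = 0
origin, direction = pixel_to_ray((350.0, 260.0), K, R, t)
print(ray_plane_intersection(origin, direction, np.array([0.0, 0.0, 0.0]),
                             np.array([0.0, 0.0, 1.0])))   # approx. [1.5, 1.0, 0.0]
```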
Figure 11. The figure shows the original 3D Model reconstructed using five humanoid prototypes.
Figure 12. The figure shows the traced points back-projected onto the point cloud, recreating the original model.
Step 6: A merging algorithm was developed to average the points hit by the ray tracer. A list of intersection points for each ID and a distance threshold were taken as input, and the closest point to the threshold was selected. Each point of a given frame number (from ID 1 to N−1) was checked against all the neighbouring points with the same ID; if the difference between a point P and an intersection point Q was greater than the threshold, the point was appended to a new point set, while the rest were discarded. The detailed steps of the algorithm are given in Algorithm 5.
Algorithm 5 Merging of Ray traced intersection points.
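Algorithm 5 is reproduced as an image in the published article. As a rough sketch of the merging idea under the stated assumption that intersection points closer than a distance threshold belong to the same individual and should be averaged, one possible numpy implementation is shown below; the threshold value and function names are ours.

```python
import numpy as np

def merge_intersection_points(points, threshold=0.5):
    """Greedy merge of ray/plane intersection points gathered over many frames.
    Points within `threshold` of an existing cluster are averaged into it;
    anything further away starts a new cluster (a new estimated person)."""
    clusters = []  # list of lists of member points
    for p in np.asarray(points, dtype=float):
        centres = [np.mean(c, axis=0) for c in clusters]
        if centres:
            dists = [np.linalg.norm(p - c) for c in centres]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:
                clusters[best].append(p)   # same person seen in another frame
                continue
        clusters.append([p])               # a new person
    return np.array([np.mean(c, axis=0) for c in clusters])

# Example: five rays from different frames collapse to two estimated people.
pts = [[0.0, 0.0, 0.0], [0.1, 0.05, 0.0], [0.05, -0.02, 0.0],
       [3.0, 3.0, 0.0], [3.1, 2.95, 0.0]]
print(merge_intersection_points(pts, threshold=0.5))   # two merged locations
```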
Step 7: Point matching for evaluation was carried out in the final step. To evaluate our detections, we matched the ground truth 3D locations to the locations estimated by our system. The Iterative Closest Point (ICP) [89,90] algorithm was used to find the best-fit transform and to validate the estimated points against the ground truth points, with the Fast Library for Approximate Nearest Neighbors (FLANN) [91] used for the nearest-neighbour search. A two-way matching of points was carried out and cross-checked between the two 3D point sets. Algorithm 6 shows the procedure followed for the two-way matching of the two point sets, where Q represents the estimated average points and P represents the ground truth points extracted from the simulation system in Step 1 of the presented method.
Algorithm 6 Algorithm for 2-way point matching using ICP.
Input: P = {p_0, p_1, ..., p_N}; Q = {q_0, q_1, ..., q_M}
Output: Matched pairs
 Initialize the transform M to the identity;
 repeat until converged:
  R = find 2-way closest points between P and MQ, where MQ = {Mq_0, Mq_1, ..., Mq_M};
  update M based on the matches in R;
return pairs (P → MQ) and (MQ → P)
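As an illustrative sketch of Algorithm 6 under simplifying assumptions (a basic nearest-neighbour ICP loop using scipy's cKDTree in place of FLANN, with the rigid alignment estimated by the Kabsch/SVD method), the following Python code registers the estimated points onto the ground truth and returns the mutually closest pairs:

```python
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(A, B):
    """Rigid transform (R, t) aligning point set A onto B via SVD (Kabsch)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

def icp_two_way_match(P, Q, iterations=20, tol=1e-6):
    """Register estimated points Q onto ground truth P with a basic ICP loop,
    then return index pairs (i, j) that are mutually nearest neighbours."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    MQ = Q.copy()
    prev_err = np.inf
    for _ in range(iterations):
        dist, idx = cKDTree(P).query(MQ)    # closest GT point for each estimate
        R, t = best_fit_transform(MQ, P[idx])
        MQ = MQ @ R.T + t
        err = dist.mean()
        if abs(prev_err - err) < tol:       # converged
            break
        prev_err = err
    # Two-way (mutual) nearest-neighbour matching between P and MQ.
    _, q_for_p = cKDTree(MQ).query(P)       # closest estimate for each GT point
    _, p_for_q = cKDTree(P).query(MQ)       # closest GT point for each estimate
    pairs = [(i, int(q_for_p[i])) for i in range(len(P))
             if p_for_q[q_for_p[i]] == i]
    return MQ, pairs

# Example: a noisy, shifted copy of the ground truth is matched back to it.
gt = np.random.rand(50, 3) * 10
est = gt + 0.3 + np.random.normal(scale=0.02, size=gt.shape)
_, matched = icp_two_way_match(gt, est)
print(f"{len(matched)} of {len(gt)} points matched")
```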

3.6. Evaluation Metrics

Many evaluation metrics are available for comparing estimates against ground truth. They are universally agreed and popularly adopted measures for crowd counting model evaluation, and can be classified as image-level metrics for evaluating counting performance, pixel-level metrics for measuring density map quality, and point-level metrics for assessing the precision of localisation.
The most commonly used metrics are the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE), which are defined as follows:
$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|C_{I_i}^{pred} - C_{I_i}^{gt}\right|$
where $N$ is the number of test images, and $C_{I_i}^{pred}$ and $C_{I_i}^{gt}$ denote the predicted count and the ground truth count for image $I_i$, respectively.
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_{I_i}^{pred} - C_{I_i}^{gt}\right)^{2}}$
Roughly speaking, MAE determines the accuracy of the estimates, whereas RMSE indicates the robustness of the estimates.
Precision is a useful measure when the cost of a false positive is high. In the current crowd estimation approach, a false positive means that a point hit by the model is not a correct point (an actual negative) but has been identified as a person (a predicted positive). If the precision of the crowd estimation model is not high, the system may count points that do not correspond to actual individuals in the crowd.
$Precision = \frac{True\ Positives}{True\ Positives + False\ Positives}$
Recall measures how many of the actual positives the model captures by labeling them as positive (true positives). In the current system, if an individual (an actual positive) is not predicted and is therefore not counted (a predicted negative), the cost associated with that false negative is extremely high and could undermine the whole estimation model.
$Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives}$
The F1 score may be a better measure when we need to strike a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).
$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
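As a small illustration of how these metrics are computed from per-image counts (MAE/RMSE) and from matched and unmatched 3D points (precision, recall, F1), a minimal numpy sketch is given below; the variable names and example numbers are ours, not part of the published pipeline.

```python
import numpy as np

def count_errors(pred_counts, gt_counts):
    """Image-level MAE and RMSE over per-image predicted and ground truth counts."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    mae = np.mean(np.abs(pred - gt))
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, rmse

def detection_scores(true_positives, false_positives, false_negatives):
    """Point-level precision, recall and F1 from matched/unmatched 3D points."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with illustrative numbers only: 116 estimates matched to ground truth,
# 7 spurious estimates, 5 ground-truth people missed.
print(count_errors([118, 97, 140], [120, 95, 138]))
print(detection_scores(true_positives=116, false_positives=7, false_negatives=5))
```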

4. Implementation Details

In our experiments, we used PyTorch for training and testing on synthetic data. The training was carried out on a 64-bit computer with a 32-core Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz, 48 GB RAM and two Tesla P100-PCIE-16GB GPU devices. To augment the training set for MCNN, we cropped 9 patches from each image at different locations; each patch was 1/4 the size of the original image. We trained on 133 images containing 1197 patches using the MCNN model. The 2D detector model was trained on a shared network of 2 convolutional layers with a Parametric Rectified Linear Unit (PReLU) activation function after every layer to enhance the accuracy of the traced blob points. For the CMTL training, we cropped 16 patches from each image at different locations; each patch was compressed to 1/4 the size of the original image. We trained on 133 images containing 2128 patches using the CMTL model.
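As an illustration of the patch-based augmentation described above, the following sketch crops quarter-size patches at random locations from an image and its density map, keeping them aligned. The interpretation of "1/4 size" as half the height and half the width, and the function and variable names, are our assumptions.

```python
import numpy as np

def crop_patches(image, density_map, num_patches=9, rng=None):
    """Crop `num_patches` aligned patches, each 1/4 the area of the original
    (half the height and half the width), from an image and its density map."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = h // 2, w // 2                 # quarter-size patches
    patches = []
    for _ in range(num_patches):
        top = rng.integers(0, h - ph + 1)
        left = rng.integers(0, w - pw + 1)
        patches.append((image[top:top + ph, left:left + pw],
                        density_map[top:top + ph, left:left + pw]))
    return patches

# Example: nine aligned patches from a 480x640 image and its density map.
img = np.zeros((480, 640), dtype=np.float32)
dmap = np.zeros((480, 640), dtype=np.float32)
pairs = crop_patches(img, dmap)
print(len(pairs), pairs[0][0].shape)
```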
The 3D crowd estimation itself was performed using a ray caster on the reconstructed 3D model. The model was reconstructed from the 127 images captured from our 3D simulator, on an Intel Core i7-8750H processor (6 cores/12 threads @ 4.1 GHz), Windows 10, 16 GB RAM, and an NVIDIA GeForce GTX 1060 Max-Q graphics card with 6 GB of dedicated memory. We used a simple radial camera model to capture the data while flying the UAV above the crowd. The data was captured by following a circular path, with the capturing angle varying from 45° to 90° while keeping the height and speed constant. The crowd was placed randomly, reflecting the fact that no ground truth is available in real-world deployments.

5. Experimental Results

Blob detection was used to detect regions in both real and synthetic images. The synthetic data were tested on the following pre-existing state-of-the-art methods: From Open Set to Closed Set: Supervised Spatial Divide-and-Conquer for Object Counting (S-DCNet) [42]; Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection (LSC-CNN) [92]; CNN-based Cascaded Multitask Learning of High-level Prior and Density Estimation for Crowd Counting (CMTL) [93]; and Single-Image Crowd Counting via Multicolumn Convolutional Neural Network (MCNN) [10]. The estimated counts on our data set against the ground truth were promising and are presented in the form of MAE and RMSE. Moreover, we demonstrated that the simulator data are compatible with, and work appropriately alongside, real-world crowd data.
The simulated images we used demonstrated a degree of realism and quality sufficient for crowd estimation algorithms trained on real images. As demonstrated in Table 5, S-DCNet, MCNN and CMTL showed promising results on our data set against SHA; CMTL performed best, with an MAE of 27.6 and an RMSE of 34.6.
Table 5. Testing of our data against ShanghaiTech Part A (SHA) using state-of-the-art methods, where the highlighted text indicates the methods that performed better on our data set.
Comparing the publicly available aerial crowd data sets using individual state-of-the-art methods (Figure 13), our synthetic data set performed comparatively better than the other two data sets (Table 6). A similar number of images, chosen randomly, was used for testing. The VisDrone2020-CC data performed better than our data set on S-DCNet, with an MAE of 71.39 and an RMSE of 123.5; however, our data set performed better than the other two data sets with the remaining methods, as shown in Table 6, with the lowest MAE of 27.6 and RMSE of 34.6. For an accurate estimation, the original model was trained on a source domain and can be easily transferred to a target domain by fine-tuning only the last two layers of the trained model, which demonstrates good generalisability. To augment the training sample set for training the MCNN, we cropped 16 patches from each image at various locations, and each patch was compressed to 1/4 the size of the original image. The pre-training crowd density was very high, so geometry-adaptive kernels were used to generate the density maps, and the density in overlapping regions was calculated by averaging the generated maps to assist in more accurate estimation.
Figure 13. This graph shows the estimated and ground truth count of the CMTL method tested using ShanghaiTech data set.
Table 6. Comparison of aerial crowd data set against state-of-the-art methods.
For any method, data augmentation is important. The S-DCNet results suggest that the S-DCNet method is able to adapt to crowded scenes. The method crops the original image into 9 sub-images at 1/4 resolution. Mirroring and random scaling do not work well on our data: due to the random crowd distribution, the first 4 cropped 224 × 224 sub-images, which correspond to the four corners of the image, did not fit well and failed to identify the crowded regions in some images, which degraded the performance on our data set. On the other hand, the randomly cropped images recovered this performance and identified the crowded regions, eventually delivering a better result. However, the VisDrone2020-CC data contain a higher-density crowd than ours, where the sub-images or cropped patches locate the crowd easily. S-DCNet performed comparatively better on high-density images, which suggests that it generalises effectively to large crowd data and makes accurate predictions.
After analysing the methods and their best results, we chose CMTL and MCNN for training the model on synthetic data. We selected CMTL's and MCNN's best models using the error on the validation set during training, with 10% of the training data set aside for validation. We then obtained the ground truth density maps using simple Gaussian maps and compared them against the network output (Figure 14). The method performed better when the system was trained using synthetic data and tested against the ShanghaiTech data set.
Figure 14. The graph shows the comparison between the ground truth (GT) and the estimated count (ET) that were tested against the CNN-based Cascaded Multitask Learning of High-level Prior and Density Estimation for Crowd Counting (CMTL) [93] method using the aerial synthetic images. We randomly selected 140 images from the synthetic data for testing and compared them against the ground truth.
Table 7 shows the comparison between the CMTL model, the MCNN model and the model trained on our synthetic data set. Our model demonstrates better performance than the original CMTL model on the same data set, as evidenced by a lower MAE of 98.08 and RMSE of 131.22. To show the advantage of using our simulator for training with various scenarios, we additionally trained a multicolumn convolutional neural network (MCNN) on synthetic data and tested it against SHA.
Table 7. The table presents the results of CMTL model, MCNN model and our synthetic data trained model that were tested against the ShanghaiTech data set.
Finally, we tested our own data set, as shown in Table 8, using the model trained on synthetic images. CMTL performed better, with results showing a lower MAE of 8.58 and RMSE of 10.39. This model offered an accurate estimation of the synthetic data and significantly improved the accuracy of the 3D crowd estimation method.
Table 8. The table shows the output of synthetic data model tested against our synthetic data set.
To the best of our knowledge, this is the first UAV-based system for crowd estimation. The developed system efficiently captures and counts large crowds spread over a large geographical area. To determine the system's robustness, the results have been compared using standard metrics such as accuracy, recall, RMSE, and MAE. Our proposed method performs well on a randomly distributed static crowd viewed from a moving camera in 3D, achieving an accuracy of 89.23%: it correctly estimated 116 people out of 130, which highlights the robustness of the proposed method, with the possibility of improving the detection rate in further testing. With a precision of 94.30% and a recall of 95.86%, shown in Table 9, and an RMSE of 0.0002748, the proposed method can capture and estimate a crowd over a large geographical area and produce an accurate count in minimal time. The method was also validated using two-way mapping, where the output was matched with the ground truth points to cross-check the initial performance.
Table 9. The table shows the results for 3D crowd estimation using UAV method.
Figure 15 illustrates the final output from the ICP [94], where the ground truth points (P) are plotted against the 3D estimated points (MQ). In the ICP, we provided the point sets P and Q as input and initialized the transform M to the identity, iterating until convergence. ICP converged in one iteration, highlighting the agreement between the estimated and ground truth locations. FLANN [91] was used for the nearest-neighbour search. Two-way matching of points was carried out and cross-checked between P and MQ. The final result is a list of closest points, with 116 pairs matched between P and MQ out of 128 pairs. A wider comparison of our results with state-of-the-art methods is not possible, however, as no similar method exists against which a comparison with this ground truth would be meaningful.
Figure 15. The figure shows the output from the ICP. The plot shows the GT points as P and possible estimated points as MQ.

6. Discussion

The virtual crowd data set generated by the simulation system was initially tested in conjunction with four well-known state-of-the-art approaches. In the experiments with the virtual crowd data set, we encountered fewer errors, as evidenced by the low MAE and RMSE; the CMTL method performed best, with an MAE of 27.6 and an RMSE of 34.6. During testing, we noted that annotating the position accurately is the most important factor in the accurate computation and generation of density maps. Crowds with distinct features and geometries are important for obtaining better results from the virtual data: they not only reduce the chance of overlapping but also help to create a robust reconstruction model.
In the aerial data set comparison, the data generated by our simulation system outperforms the DLR_ACD and VisDrone2020-CC data sets when tested against S-DCNet. Due to the sparse crowd distribution in our data, methods that evaluate the entire image as input, such as MCNN and CMTL, perform better than approaches like S-DCNet that divide the whole image into patches, where the accuracy depends on the image density. It should also be noted that the number of patches that lie in empty regions exceeds those in crowded regions, so such patches contribute little to the estimation and the error rate will be high.
For a better evaluation of crowd counting method performance under practical conditions, we have simulated and labeled our new data set. Furthermore, our model was trained on a source domain and can be easily transferred to a target domain by fine-tuning only the last few layers of the trained model. To enhance the training sample set for training the MCNN, we cropped 16 patches from each image at various locations, and each patch was compressed to 1/4 the size of the original image. The model trained on our data set outperforms the state-of-the-art CMTL model, with a higher throughput and a lower MAE of 98.08 and RMSE of 131.22. We also tested our data against the model trained on the same set of data, which gives an MAE of 8.58 and an RMSE of 10.39. This trained model is especially helpful with the same synthetic data and provides higher accuracy than the other methods, but it is limited to that set of data. Further testing needs to be done on existing publicly available data sets to see how models trained on synthetic data behave with new data.
The proposed 3D crowd estimation method has been tested on various scenes using random crowd distribution; further testing is needed to improve the consistency of the method. The initial test with a moving camera and a static crowd achieved an accuracy of 89.23%, which needs to be improved and tested at a larger scale. The problem of a moving crowd with a moving UAV is still being worked on. Here, the reconstruction of the 3D model needs to be considered carefully, because points that are not aligned properly lead to false estimates or output with lower accuracy. The overlap of the data and the stability of the moving camera are crucial and need to be considered while capturing the crowd.
An important remaining issue is optimising the flight path over a wide area to obtain the most accurate estimate of the crowd. For example, crowd density may be higher along roads or may spill out radially from a town centre, which needs to be dealt with in the near future for more accurate estimation. We have captured the data and gathered information for future analysis of different crowd distributions. This data needs to be studied in terms of how synthetic data differs from real data, considering domain randomisation, transfer learning and domain adaptation.

7. Conclusions

Crowd estimation in the 3D domain has attracted the attention of the computer vision community, as it provides more reliable and comprehensive information about the crowd. In this article, we have presented an up-to-date review of open-source simulators and relevant crowd data sets, along with their shortcomings; this primarily justifies the need for a 3D simulator and explains the type of data the simulator should generate. The paper describes the initial issues of crowd estimation from a moving camera and proposes a solution by developing a 3D crowd simulator for training and testing. It also covers the testing of 3D simulator data with pre-existing techniques such as LSC-CNN, S-DCNet, CMTL and MCNN. Moreover, it highlights an approach to training precisely on the synthetic data and validating it using state-of-the-art methods, showing that virtual data can be as effective as existing data captured in reality. This will contribute to future development by enabling the generation of more virtual data sets, which could be useful for training deep learning models. In addition, it identifies three significant crowd estimation issues and introduces a method for 3D crowd estimation using a UAV. The presented method can estimate a large crowd spread over a large geographical area. Lastly, it explores the limitations that the current model does not address, what needs to be addressed in the future, and how the current state will assist in addressing future problems.
In the future, our presented approach could be extended to various potential 3D applications, including tourist attractions [95], using video information to attract and maintain tourist flow; suspicious action detection [96], by monitoring crowded areas and alerting authorities to any suspicious activities; and safety monitoring [97] in facilities such as religious gatherings, airports, and public areas.

Author Contributions

Conceptualization, S.S., B.T. and H.C.M.; methodology, S.S. and B.T.; validation, S.S., B.T. and H.C.M.; formal analysis, B.T. and H.C.M.; resources, S.S. and B.T.; data curation, S.S. and B.T.; writing—original draft preparation, S.S.; writing—review and editing, B.T. and H.C.M.; visualization, S.S.; supervision, B.T. and H.C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data supporting the results reported in this paper are available from the corresponding author.


Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
2D    2-Dimensional
3D    3-Dimensional
CMTL    Cascaded Multitask Learning
CPU    Central Processing Unit
FLANN    Fast Library for Approximate Nearest Neighbors
GPU    Graphics Processing Unit
GT    Ground Truth
ICP    Iterative Closest Point
LSC-CNN    Locate, Size and Count
MAE    Mean Absolute Error
MCNN    Multicolumn Convolutional Neural Network
MVS    Multiview Stereo
PReLU    Parametric Rectified Linear Unit
RMSE    Root Mean Square Error
S-DCNet    Spatial Divide-and-Conquer
SfM    Structure-from-Motion
UE4    Unreal Engine 4
UAV    Unmanned Aerial Vehicle

References

  1. Jacobs, H. To count a crowd. Columbia J. Rev. 1967, 6, 37. [Google Scholar]
  2. Marsden, M.; McGuinness, K.; Little, S.; O’Connor, N.E. ResnetCrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–7. [Google Scholar]
  3. Loy, C.C.; Chen, K.; Gong, S.; Xiang, T. Crowd counting and profiling: Methodology and evaluation. In Modeling, Simulation and Visual Analysis of Crowds; Springer: Berlin/Heidelberg, Germany, 2013; pp. 347–382. [Google Scholar]
  4. Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 743–761. [Google Scholar] [CrossRef]
  5. Li, M.; Zhang, Z.; Huang, K.; Tan, T. Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proceedings of the IEEE 2008 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
  6. Arteta, C.; Lempitsky, V.; Zisserman, A. Counting in the wild. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 483–498. [Google Scholar]
  7. Ryan, D.; Denman, S.; Fookes, C.; Sridharan, S. Crowd counting using multiple local features. In Proceedings of the IEEE Digital Image Computing: Techniques and Applications (DICTA’09), Melbourne, Australia, 1–3 December 2009; pp. 81–88. [Google Scholar]
  8. Ma, R.; Li, L.; Huang, W.; Tian, Q. On pixel count based crowd density estimation for visual surveillance. In Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent Systems, Singapore, 1–3 December 2004; Volume 1, pp. 170–173. [Google Scholar]
  9. Idrees, H.; Soomro, K.; Shah, M. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1986–1998. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  11. Zhang, Q.; Chan, A.B. 3d crowd counting via multi-view fusion with 3d gaussian kernels. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12837–12844. [Google Scholar] [CrossRef]
  12. Zhao, Z.; Shi, M.; Zhao, X.; Li, L. Active Crowd Counting with Limited Supervision. In Proceedings of the Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  13. Wang, M.; Cai, H.; Han, X.; Zhou, J.; Gong, M. STNet: Scale Tree Network with Multi-level Auxiliator for Crowd Counting. arXiv 2020, arXiv:2012.10189. [Google Scholar]
  14. Ranjan, V.; Wang, B.; Shah, M.; Hoai, M. Uncertainty estimation and sample selection for crowd counting. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  15. Mustapha, S.; Kassir, A.; Hassoun, K.; Dawy, Z.; Abi-Rached, H. Estimation of crowd flow and load on pedestrian bridges using machine learning with sensor fusion. Autom. Constr. 2020, 112, 103092. [Google Scholar] [CrossRef]
  16. Almeida, I.; Jung, C. Crowd flow estimation from calibrated cameras. Mach. Vis. Appl. 2021, 32, 1–12. [Google Scholar] [CrossRef]
  17. Choi, H.; Moon, G.; Park, J.; Lee, K.M. 3DCrowdNet: 2D Human Pose-Guided3D Crowd Human Pose and Shape Estimation in the Wild. arXiv 2021, arXiv:2104.07300. [Google Scholar]
  18. Fahad, M.S.; Deepak, A. Crowd Estimation of Real-Life Images with Different View-Points. In Proceedings of the International Conference on Innovative Computing and Communications, Delhi, India, 21–23 February 2021; pp. 1053–1062. [Google Scholar]
  19. Zhan, B.; Monekosso, D.N.; Remagnino, P.; Velastin, S.A.; Xu, L.Q. Crowd analysis: A survey. Mach. Vis. Appl. 2008, 19, 345–357. [Google Scholar] [CrossRef]
  20. Junior, J.C.S.J.; Musse, S.R.; Jung, C.R. Crowd analysis using computer vision techniques. IEEE Signal Process. Mag. 2010, 27, 66–77. [Google Scholar]
  21. Teixeira, T.; Dublon, G.; Savvides, A. A survey of human-sensing: Methods for detecting presence, count, location, track, and identity. ACM Comput. Surv. 2010, 5, 59–69. [Google Scholar]
  22. Ferryman, J.; Ellis, A.L. Performance evaluation of crowd image analysis using the PETS2009 dataset. Pattern Recognit. Lett. 2014, 44, 3–15. [Google Scholar] [CrossRef]
  23. Li, T.; Chang, H.; Wang, M.; Ni, B.; Hong, R.; Yan, S. Crowded scene analysis: A survey. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 367–386. [Google Scholar] [CrossRef] [Green Version]
  24. Hu, W.; Tan, T.; Wang, L.; Maybank, S. A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2004, 34, 334–352. [Google Scholar] [CrossRef]
  25. Ryan, D.; Denman, S.; Sridharan, S.; Fookes, C. An evaluation of crowd counting methods, features and regression models. Comput. Vis. Image Underst. 2015, 130, 1–17. [Google Scholar] [CrossRef] [Green Version]
  26. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
  27. Ferryman, J.; Shahrokni, A. Pets2009: Dataset and challenge. In Proceedings of the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Snowbird, UT, USA, 7–9 December 2009; pp. 1–6. [Google Scholar]
  28. Tan, B.; Zhang, J.; Wang, L. Semi-supervised elastic net for pedestrian counting. Pattern Recognit. 2011, 44, 2297–2304. [Google Scholar] [CrossRef]
  29. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. BMVC 2012, 1, 3. [Google Scholar]
  30. Zhou, B.; Wang, X.; Tang, X. Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2871–2878. [Google Scholar]
  31. Saleh, S.A.M.; Suandi, S.A.; Ibrahim, H. Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 2015, 41, 103–114. [Google Scholar] [CrossRef]
  32. Zitouni, M.S.; Bhaskar, H.; Dias, J.; Al-Mualla, M.E. Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques. Neurocomputing 2016, 186, 139–159. [Google Scholar] [CrossRef]
  33. Grant, J.M.; Flynn, P.J. Crowd scene understanding from video: A survey. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2017, 13, 1–23. [Google Scholar] [CrossRef]
  34. Sindagi, V.A.; Patel, V.M. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognit. Lett. 2018, 107, 3–16. [Google Scholar] [CrossRef] [Green Version]
  35. Walach, E.; Wolf, L. Learning to count with cnn boosting. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 660–676. [Google Scholar]
  36. Shang, C.; Ai, H.; Bai, B. End-to-end crowd counting via joint learning local and global count. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 1215–1219. [Google Scholar]
  37. Onoro-Rubio, D.; López-Sastre, R.J. Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 615–629. [Google Scholar]
  38. Kang, D.; Ma, Z.; Chan, A.B. Beyond counting: Comparisons of density maps for crowd analysis tasks—Counting, detection, and tracking. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1408–1422. [Google Scholar] [CrossRef]
  39. Tripathi, G.; Singh, K.; Vishwakarma, D.K. Convolutional neural networks for crowd behaviour analysis: A survey. Vis. Comput. 2019, 35, 753–776. [Google Scholar] [CrossRef]
  40. Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; Wang, Y. Cnn-based density estimation and crowd counting: A survey. arXiv 2020, arXiv:2003.12783. [Google Scholar]
  41. Yan, Z.; Yuan, Y.; Zuo, W.; Tan, X.; Wang, Y.; Wen, S.; Ding, E. Perspective-guided convolution networks for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 952–961. [Google Scholar]
  42. Xiong, H.; Lu, H.; Liu, C.; Liang, L.; Cao, Z.; Shen, C. From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 8362–8371. [Google Scholar]
  43. Tian, Y.; Lei, Y.; Zhang, J.; Wang, J.Z. Padnet: Pan-density crowd counting. IEEE Trans. Image Process. 2019, 29, 2714–2727. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  44. Kleinmeier, B.; Zönnchen, B.; Gödel, M.; Köster, G. Vadere: An open-source simulation framework to promote interdisciplinary understanding. arXiv 2019, arXiv:1907.09520. [Google Scholar] [CrossRef] [Green Version]
  45. Maury, B.; Faure, S. Crowds in Equations: An Introduction to the Microscopic Modeling of Crowds; World Scientific: Singapore, 2018. [Google Scholar]
  46. Curtis, S.; Best, A.; Manocha, D. Menge: A modular framework for simulating crowd movement. Collect. Dyn. 2016, 1, 1–40. [Google Scholar] [CrossRef]
  47. Wagoum, A.K.; Chraibi, M.; Zhang, J.; Lämmel, G. JuPedSim: An open framework for simulating and analyzing the dynamics of pedestrians. In Proceedings of the 3rd Conference of Transportation Research Group of India, Kolkata, India, 17–20 December 2015; Volume 12. [Google Scholar]
  48. Grimm, V.; Revilla, E.; Berger, U.; Jeltsch, F.; Mooij, W.M.; Railsback, S.F.; Thulke, H.H.; Weiner, J.; Wiegand, T.; DeAngelis, D.L. Pattern-oriented modeling of agent-based complex systems: Lessons from ecology. Science 2005, 310, 987–991. [Google Scholar] [CrossRef] [Green Version]
  49. Van Den Berg, J.; Patil, S.; Sewall, J.; Manocha, D.; Lin, M. Interactive navigation of multiple agents in crowded environments. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, Redwood City, CA, USA, 15–17 February 2008; pp. 139–147. [Google Scholar]
  50. McGrattan, K.; Hostikka, S.; McDermott, R.; Floyd, J.; Weinschenk, C.; Overholt, K. Fire dynamics simulator user’s guide. NIST Spec. Publ. 2013, 1019, 1–339. [Google Scholar]
  51. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2547–2554. [Google Scholar]
  52. Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1879–1888. [Google Scholar]
  53. Wang, Q.; Gao, J.; Lin, W.; Li, X. NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting and Localization. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef]
  54. Sindagi, V.A.; Yasarla, R.; Patel, V.M. JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef]
  55. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  56. Hu, D.; Mou, L.; Wang, Q.; Gao, J.; Hua, Y.; Dou, D.; Zhu, X.X. Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions. arXiv 2020, arXiv:2005.07097. [Google Scholar]
  57. Lian, D.; Li, J.; Zheng, J.; Luo, W.; Gao, S. Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  58. Fang, Y.; Zhan, B.; Cai, W.; Gao, S.; Hu, B. Locality-constrained Spatial Transformer Network for Video Crowd Counting. arXiv 2019, arXiv:1907.07911. [Google Scholar]
  59. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from Synthetic Data for Crowd Counting in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 8198–8207. [Google Scholar]
  60. Liu, W.; Lis, K.M.; Salzmann, M.; Fua, P. Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019. [Google Scholar]
  61. Zhang, Q.; Chan, A.B. Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8297–8306. [Google Scholar]
  62. Deng, L.; Wang, S.H.; Zhang, Y.D. Fully Optimized Convolutional Neural Network Based on Small-Scale Crowd. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5. [Google Scholar]
  63. Mallapuram, S.; Ngwum, N.; Yuan, F.; Lu, C.; Yu, W. Smart city: The state of the art, datasets, and evaluation platforms. In Proceedings of the 2017 IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS), Wuhan, China, 24–26 May 2017; pp. 447–452. [Google Scholar]
  64. Lim, M.K.; Kok, V.J.; Loy, C.C.; Chan, C.S. Crowd saliency detection via global similarity structure. In Proceedings of the IEEE 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 3957–3962. [Google Scholar]
  65. Zhang, C.; Kang, K.; Li, H.; Wang, X.; Xie, R.; Yang, X. Data-driven crowd understanding: A baseline for a large-scale crowd dataset. IEEE Trans. Multimed. 2016, 18, 1048–1061. [Google Scholar] [CrossRef]
  66. Bahmanyar, R.; Vig, E.; Reinartz, P. MRCNet: Crowd counting and density map estimation in aerial and ground imagery. arXiv 2019, arXiv:1909.12743. [Google Scholar]
  67. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Hu, Q.; Ling, H. Vision Meets Drones: Past, Present and Future. arXiv 2020, arXiv:2001.06303. [Google Scholar]
  68. Zhu, P.; Sun, Y.; Wen, L.; Feng, Y.; Hu, Q. Drone Based RGBT Vehicle Detection and Counting: A Challenge. arXiv 2020, arXiv:2003.02437. [Google Scholar]
  69. Chen, L.; Wang, G.; Hou, G. Multi-scale and multi-column convolutional neural network for crowd density estimation. Multimed. Tools Appl. 2021, 80, 6661–6674. [Google Scholar] [CrossRef]
  70. Guo, L.; Zhou, W. Crowd Density Estimation Based on Multi-Column Hybrid Convolutional Network. J. Phys. Conf. Ser. 2021, 1828, 012025. [Google Scholar] [CrossRef]
  71. Jingying, W. A Survey on Crowd Counting Methods and Datasets. In Advances in Computer, Communication and Computational Sciences; Springer: Berlin/Heidelberg, Germany, 2021; pp. 851–863. [Google Scholar]
  72. Ma, Y.J.; Shuai, H.H.; Cheng, W.H. Spatiotemporal Dilated Convolution with Uncertain Matching for Video-based Crowd Estimation. IEEE Trans. Multimed. 2021. [Google Scholar] [CrossRef]
  73. Chen, H.; Guo, P.; Li, P.; Lee, G.H.; Chirikjian, G. Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-view Geometry. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 541–557. [Google Scholar]
  74. Benzine, A.; Chabot, F.; Luvison, B.; Pham, Q.C.; Achard, C. Pandanet: Anchor-based single-shot multi-person 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6856–6865. [Google Scholar]
  75. Song, J.Y.; Chung, J.J.Y.; Fouhey, D.F.; Lasecki, W.S. C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation. Proc. ACM Hum.-Comput. Interact. 2020, 4, 1–28. [Google Scholar] [CrossRef]
  76. Chen, H.; Guo, P.; Li, P.; Lee, G.H.; Chirikjian, G. Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry—Supplementary Material. Available online: https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123480545.pdf (accessed on 10 January 2021).
  77. Hashmi, M.F.; Ashish, B.K.K.; Keskar, A.G. GAIT analysis: 3D pose estimation and prediction in defence applications using pattern recognition. In Proceedings of the Twelfth International Conference on Machine Vision (ICMV 2019), Amsterdam, The Netherlands, 16–18 November 2019; International Society for Optics and Photonics: Bellingham, WA, USA, 2020; Volume 11433, p. 114330S. [Google Scholar]
  78. Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. arXiv 2021, arXiv:2103.10455. [Google Scholar]
  79. Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P. Lifting Transformer for 3D Human Pose Estimation in Video. arXiv 2021, arXiv:2103.14304. [Google Scholar]
  80. Kumarapu, L.; Mukherjee, P. Animepose: Multi-person 3d pose estimation and animation. Pattern Recognit. Lett. 2021, 147, 16–24. [Google Scholar] [CrossRef]
  81. Epic Games. Unreal Engine: The Most Powerful Real-Time 3D Creation Platform. Available online: https://www.unrealengine.com (accessed on 10 January 2021).
  82. MakeHuman Community. MakeHuman. Available online: http://www.makehumancommunity.org/ (accessed on 22 February 2021).
  83. AXYZ Design. Anima. Available online: https://secure.axyz-design.com/ (accessed on 5 March 2021).
  84. Schönberger, J.L.; Frahm, J.M. COLMAP: Structure-from-Motion Revisited. Available online: https://colmap.github.io/ (accessed on 21 April 2021).
  85. Schops, T.; Schonberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3260–3269. [Google Scholar]
  86. Stathopoulou, E.K.; Remondino, F. Open-source image-based 3D reconstruction pipelines: Review, comparison and evaluation. In Proceedings of the 6th International Workshop LowCost 3D—Sensors, Algorithms, Applications, Strasbourg, France, 2–3 December 2019; ISPRS: Strasbourg, France, 2019; pp. 331–338. [Google Scholar]
  87. Leudet, J.; Mikkonen, T.; Christophe, F.; Männistö, T. Virtual Environment for Training Autonomous Vehicles. In Proceedings of the Annual Conference Towards Autonomous Robotic Systems, Bristol, UK, 25–27 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 159–169. [Google Scholar]
  88. Kazhdan, M.; Hoppe, H. Screened poisson surface reconstruction. ACM Trans. Graph. (ToG) 2013, 32, 1–13. [Google Scholar] [CrossRef] [Green Version]
  89. Wikipedia Contributors. Iterative Closest Point. Available online: https://en.wikipedia.org/wiki/Iterative_Closest_Point (accessed on 24 June 2021).
  90. Marden, S.; Guivant, J. Improving the performance of ICP for real-time applications using an approximate nearest neighbour search. In Proceedings of the Australasian Conference on Robotics and Automation, Wellington, New Zealand, 3–5 December 2012; pp. 1–6. [Google Scholar]
  91. Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1) 2009, 2, 2. [Google Scholar]
  92. Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Radhakrishnan, V.B. Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef]
  93. Sindagi, V.A.; Patel, V.M. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
  94. Liu, H.; Wang, S.; Zhao, D. Initial alignment for point cloud registration by improved differential evolution algorithm. Optik 2021, 243, 166856. [Google Scholar] [CrossRef]
  95. Li, L. A Crowd Density Detection Algorithm for Tourist Attractions Based on Monitoring Video Dynamic Information Analysis. Complexity 2020, 2020, 6635446. [Google Scholar] [CrossRef]
  96. Penmetsa, S.; Minhuj, F.; Singh, A.; Omkar, S. Autonomous UAV for suspicious action detection using pictorial human pose estimation and classification. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2014, 13, 0018–0032. [Google Scholar] [CrossRef] [Green Version]
  97. Zhou, B.; Tang, X.; Wang, X. Learning collective crowd behaviors with dynamic pedestrian-agents. Int. J. Comput. Vis. 2015, 111, 50–68. [Google Scholar] [CrossRef] [Green Version]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
