Article

People Detection Using Artificial Intelligence with Panchromatic Satellite Images

Peter Golej, Pavel Kukuliač, Jiří Horák, Lucie Orlíková and Pavol Partila

1 Department of Geoinformatics, Faculty of Mining and Geology, VSB-Technical University of Ostrava, 708 00 Ostrava, Czech Republic
2 Department of Telecommunications, Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, 708 00 Ostrava, Czech Republic
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8555; https://doi.org/10.3390/app14188555
Submission received: 31 July 2024 / Revised: 14 September 2024 / Accepted: 17 September 2024 / Published: 23 September 2024

Abstract

The detection of people in urban environments from satellite imagery can be employed in a variety of applications, such as urban planning, business management, crisis management, military operations, and security. A WorldView-3 satellite image of Prague was processed. Several variants of feature-extracting networks, referred to as backbone networks, were tested alongside the Faster R-CNN model. This model combines region proposal networks with object detection, offering a balance between speed and accuracy that is well suited for dense and varied urban environments. Data augmentation was used to increase the robustness of the models, which contributed to the improvement of classification results. Achieving a high level of accuracy is an ongoing challenge due to the low spatial resolution of available imagery. An F1 score of 54% was achieved using data augmentation, a 15 cm buffer, and a maximum distance limit of 60 cm.

1. Introduction

Currently, the detection of people and the quantification of human flows are mainly based on processing street camera imagery. However, due to their limited range of view, street cameras are not sufficient for monitoring large areas. Furthermore, the detection of people from satellite images proves to be more difficult than the detection of other target objects, such as planes or ships. Despite this challenge, satellite images promise many advantages, such as providing an overview of large areas within one time frame and monitoring closed or restricted areas as well as internal parts of residential or industrial complexes. People detection is of importance to security and defense, especially when tracking the movement and gathering of people in potentially dangerous areas. Economic and sociological studies evaluating the appearance of people in public spaces over a longer period of time can also be significant. Other uses include humanitarian monitoring, where the detection and counting of people in refugee camps or affected areas can improve the efficacy of humanitarian aid, border monitoring, and control of illegal immigration [1,2,3,4,5,6,7].
The advent of new satellite missions and enhanced sensing capabilities enables exploration of this new area of application. The spatial resolution of satellite imagery can reach as much as 30 cm, and new mission concepts, especially satellite swarms and constellations, are becoming clear trends in Earth observation [8]. Single satellites simply cannot achieve the necessary image frequency with the spatial or temporal resolution needed for analysis or provide the data needed for decision making in near real time [9].
The proliferation of high-resolution images and the development of machine learning (ML) algorithms have greatly enhanced the analysis of these data. Artificial intelligence has unlocked a new perspective on solving various problems, such as the detection of target objects, like people [10,11]. Our study employs a convolutional neural network (CNN), a deep learning model capable of directly extracting features from satellite imagery. It is effective at exploiting the characteristics of realistic imagery data and has been shown to be highly successful [7,11]. Existing applications in satellite remote sensing focus on the detection of large, persistent objects, such as buildings or swimming pools [12]; applications for the detection of smaller objects, such as people, are lacking. When detecting people, various limitations exist, such as the resolution of satellite images, the cost of the images themselves, and the viewing geometry (bird's-eye view).
Our objective is to assess the current possibilities for people detection in an urban environment from satellite imagery with the highest available spatial resolution, and then to shed light on the potential future exploitation of satellite imagery for such applications. The biggest challenge to this end is the low spatial resolution of current satellite imagery, in which the overhead view of a person is reflected in only a few pixels (typically four pixels in our case). We propose using a combination of several techniques, such as Faster R-CNN, the extension of data augmentation by buffer use, and a specific validation process, to achieve the most accurate results.
The paper is organized as follows: an introduction, a background that provides an overview of state-of-the-art CNN method technology, a description of the field of study and data sources, an explanation of CNN processing, the results and calculated quality parameters describing the success of the human detection methods, and, finally, a discussion of the results and limitations of these methods.

2. Materials and Methods

2.1. Background

Among the various potential ML methods, rapid progress in selected convolutional neural network (CNN) methods has attracted the most attention recently. CNN methods are able to efficiently learn distinctive high-level features for object detection in remote sensing [7,13]. CNN detectors can be divided into two categories: two-stage detectors (the R-CNN family) and single-stage detectors (e.g., SSD, YOLO, and RetinaNet) [14,15].
Faster R-CNN is used for generic object detection and has been successfully adapted from its two predecessors, R-CNN and Fast R-CNN [16,17], to solve many recognition problems. Faster R-CNN consists of two modules: a region proposal network (RPN) and a Fast R-CNN detector [14,16,17]. Faster R-CNN improves the object detection architecture by replacing the selective search algorithm in Fast R-CNN with a region proposal network (RPN) [14,18]. An RPN is a fully convolutional network for proposal generation [14]. The rest of the model architecture remains the same as in Fast R-CNN. A system overview of Faster R-CNN is given in Figure 1.
Duporge and Isupova [20] applied a CNN model to automatically detect and count African elephants in a woodland savanna ecosystem in South Africa. They used WorldView-3 and -4 satellite data. Dumitrescu, Boiangiu, and Voncilă [21] focused on creating a fast and reliable object detection algorithm that is trained on scenes depicting people in an indoor environment. This method combines YOLOv4 and Faster R-CNN. Additionally, Li and Ma as well as Fu and Xu [22,23] used Faster R-CNN to detect people in a sequence of video images. On the other hand, Wang et al. [4] used Mask R-CNN, a representative extension of Faster R-CNN, which produces segmentation masks of objects. Ren, Zhu, and Xiao [24] used a modified Faster R-CNN to detect ships and planes in optical remote sensing images (NWPU VHR-10).
When compared to other popular object detection methods, like YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), or RetinaNet, Faster R-CNN offers distinct advantages that make it particularly effective for this specific application. Faster R-CNN's RPN generates high-quality region proposals that are more accurate and reliable than the single-stage approaches used in YOLO, SSD, and RetinaNet. The RPN is designed to propose regions that are likely to contain objects of interest, which is crucial for detecting small target objects like people in urban satellite imagery. This two-stage approach, in which the RPN first proposes regions and a classifier then refines the detection, ensures higher precision and recall, which are critical for applications where accurate human detection is paramount [14,16,17,18,19,25]. Although YOLO and SSD are known for their speed, they often sacrifice accuracy, particularly in complex and cluttered scenes. YOLO's single-stage detector processes the entire image in a single pass, providing a faster detection time, but at the cost of reduced accuracy for small objects and increased false positives. SSD faces similar issues. RetinaNet introduces the focal loss to handle the class imbalance problem (i.e., many more background examples than objects) in object detection [15,26]. In contrast, Faster R-CNN's two-stage process allows for more precise localization and classification of people, which is essential when dealing with the varied and detailed backgrounds of urban environments [25,27].
Deep ML methods require a high volume of data, namely for their training [28,29]. The number of training samples for object detection using CNN algorithms can vary widely depending on several factors, including the complexity of the task, the diversity of the objects, the sizes of the objects in the images, and the architecture of the CNN being used. CNN algorithms enable the extraction of image features, which can be effectively processed to reduce the dimensionality of the task [30].
Additional data manipulation methods (data augmentations) are applied to achieve improved results. These techniques help models generalize better by exposing them to a wider variety of image conditions. The following data augmentation techniques for object detection from satellite images are frequently employed: rotating, flipping, shifting, translating, and clipping [11,31,32].

2.2. Study Areas and Data

Prague was selected as the area of interest for analyzing the occurrence of people on streets. The study area in Prague consists of three subsets: Prague Castle, Charles Bridge, and the Old Town Square, including its surroundings (Figure 2). These areas were selected for two reasons: they contain various surface types (to verify the ability to detect people in various urban conditions), and people occur frequently in these locations at the given time (Figure 3). One full WorldView-3 scene from the morning of 23 July 2019, covering 25 km², was obtained. The spatial resolution of the WorldView-3 panchromatic and multispectral images is 0.3 m and 1.6 m, respectively.
In these study areas, ground truth data were obtained by visual vectorization. Ground truth data (Table 1) are needed for training and accuracy assessment. After vectorization, an exploratory data analysis was performed, which showed that people in the satellite images appear as four adjacent pixels (Figure 3). The digital numbers (DNs) of these pixels were significantly higher than the DNs of the surrounding surface pixels.
The surface types were as follows:
  • Location 1—interlocking pavement.
  • Location 2—small square light cobbles, forming large square areas, bordered by small square dark cobbles.
  • Location 3—small square light cobbles, forming small square areas, bordered by small square dark cobbles.
  • Location 4—small square light cobbles, forming large rectangular areas, bordered by small square dark cobbles.
  • Location 5—interlocking pavement.
  • Location 6—square light cobbles, forming large square areas, bordered by square dark cobbles.
  • Location 7—interlocking pavement.
  • Location 8—marble pavement.
Vector digital data for buildings and greenery were obtained from the digital technical maps of IPR Prague (the Prague Institute of Planning and Development) as auxiliary data. These data enabled the creation of masks that improve the quality of people detection by hiding or blocking certain parts of the input imagery, allowing the neural network to focus on a specific area of interest. By excluding irrelevant areas using the mask, computation can be reduced, leading to a potential increase in detection speed.
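As an illustration, such a mask can be derived from the building and greenery layers with standard geoprocessing steps. The following is a minimal sketch using arcpy; the paths and layer names are placeholders, and the tool sequence is one possible workflow rather than necessarily the exact one used:

```python
# One possible way to build the mask from the IPR Prague building and greenery layers
# (placeholder paths; assumes a Spatial Analyst licence for ExtractByMask).
import arcpy

arcpy.CheckOutExtension("Spatial")

buildings = r"C:\data\ipr_prague.gdb\buildings"     # placeholder layer
greenery = r"C:\data\ipr_prague.gdb\greenery"       # placeholder layer
study_area = r"C:\data\ipr_prague.gdb\study_area"   # placeholder layer

# Merge the irrelevant classes and erase them from the study area,
# leaving only the open surfaces where people can occur.
arcpy.management.Merge([buildings, greenery], r"memory\excluded")
arcpy.analysis.Erase(study_area, r"memory\excluded", r"C:\data\ipr_prague.gdb\mask_area")

# Clip the input image to the mask so the detector only sees the area of interest.
masked = arcpy.sa.ExtractByMask(r"C:\data\prague_rgb.tif", r"C:\data\ipr_prague.gdb\mask_area")
masked.save(r"C:\data\prague_rgb_masked.tif")
```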

2.3. Methodology

Upon review, Faster R-CNN and YOLO were selected for data processing, using their implementations in ArcGIS Pro version 3.3.1. The selection of these most promising models was confirmed by the "AutoDL" function in ArcGIS. The Python programming language and the ArcGIS API for Python were used to automate various subtasks (available at https://github.com/PavelKVSB/people_detection_AI_Satellite_Images, accessed on 2 September 2024). The hardware used was an Intel Core i7-8700 CPU at 3.20 GHz, an NVIDIA GeForce GTX 1070 GPU, and 16 GB of RAM.
To begin, an image containing three RGB bands was created from the panchromatic image because neural network algorithms perform better with this image type [33,34]. Next, training data had to be created from the ground truth data for the Faster R-CNN model. To generate the training data, the following parameters were set: a tile size (size of the image chips) of 128, a stride size (distance to move in X and Y when creating the next image chip) of 64, and the PASCAL Visual Object Classes metadata format (.xml). This part of the processing was carried out in ArcGIS Pro. After the training data were prepared, the model was trained using Jupyter Notebooks (version 7.2.2) with the GPU enabled in ArcGIS Pro. The size of the validation portion (20%) and the batch size were set. Typical mini-batch sizes are 16, 32, 64, 128, 256, and 512 [35], depending on CPU/GPU availability. The Faster R-CNN algorithm was imported and the backbone model was configured. In addition, the "learning rate find" function was used to select the most suitable learning rate for model training. After these settings were fixed (Table 2), the given model was trained. Variants 13 and 14 achieved the highest average precision score (AP) of almost 42% (Table 2). The AP is the precision averaged across all recall values between 0 and 1 [36]. The average precision is high when both precision and recall are high, and low when either of them is low across a range of confidence threshold values [37]. Table 2 also shows that smaller batch sizes tend to yield lower average precision scores. However, even a large increase in the batch size does not necessarily lead to better results; the best batch size value must be found, which in our case was 16. After testing the various backbones available for Faster R-CNN, Resnet152 was found to provide the best results. Variants 13 and 14 have a higher AP than all other variants because, in these cases, the training data were buffered by 15 cm (variant 13) and 30 cm (variant 14). All testing was performed in the ArcGIS Pro environment.
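For illustration, the training workflow described above can be reproduced with the arcgis.learn API roughly as follows. This is a minimal sketch rather than the exact notebook used; the chip folder path and output model name are placeholders, and the parameter values follow variant 13 in Table 2:

```python
# Minimal sketch of the Faster R-CNN training workflow (arcgis.learn API).
# Assumes image chips exported from ArcGIS Pro with "Export Training Data For Deep Learning"
# (tile size 128, stride 64, PASCAL VOC metadata), ground truth points buffered by 15 cm.
from arcgis.learn import prepare_data, FasterRCNN

data = prepare_data(
    r"C:\data\training_chips_buffer15",      # placeholder path to the exported chips
    dataset_type="PASCAL_VOC_rectangles",
    chip_size=128,
    batch_size=16,
    val_split_pct=0.2,                       # 20% validation portion
)

model = FasterRCNN(data, backbone="resnet152")

lr = model.lr_find()                          # "learning rate find" function
model.fit(epochs=150, lr=lr)

print(model.average_precision_score())       # AP on the validation split
model.save("faster_rcnn_resnet152_buffer15")  # placeholder model name
```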
Utilizing the buffer gave the models a significant boost in AP, as indicated in Table 2. Furthermore, the best outcomes were obtained when the backbone was Resnet152, the batch size was 16, and the number of epochs was 150. Therefore, these settings were used for further optimization with data augmentation. The following data augmentation functions were used:
  • Horizontal flip—The model learns to recognize objects regardless of whether they are oriented to the left or right. This is especially beneficial in tasks where the orientation of objects is not fixed, such as object detection. By enhancing the diversity of the input data, this technique reduces the risk of overfitting and improves the model’s ability to generalize to new, unseen data.
  • Vertical flip—This transformation can introduce variability that helps the model generalize better to different orientations.
  • Rotation—Random rotation of the image by a maximum of 45° in either direction (clockwise or counterclockwise). This transformation allows the model to become more robust to slight variations in the orientation of objects.
  • Zoom—Images were randomly zoomed up to 50%. Zooming helps create a variety of scales within the dataset.
  • Warp—Random generation of different deformations of images (including skewing, stretching, and shifting) within the specified 0.3 range. Warping simulates natural distortions and irregularities that might be present in real images.
The "get_transforms" function is essential for data augmentation. It applies a sequence of transformation operations to images during both the training and validation processes. These transformations are random and are applied every time the images pass through the model, meaning that each image may appear differently in each batch. This variability is key to increasing data diversity, which leads to improved model generalization and a reduced risk of overfitting. As a result, the "get_transforms" function enhances the effectiveness of the training process, improving robustness and performance on new, unseen data.
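A minimal sketch of this augmentation setup is shown below, assuming the fastai get_transforms() helper accepted by the arcgis.learn data pipeline; the path is a placeholder and the parameter values mirror the list above:

```python
# Augmentation sketch: horizontal and vertical flips, rotation up to 45 degrees,
# zoom up to 50%, and warping within a 0.3 range, applied via prepare_data().
from fastai.vision.transform import get_transforms
from arcgis.learn import prepare_data

train_tfms, val_tfms = get_transforms(
    do_flip=True,        # random horizontal flip
    flip_vert=True,      # random vertical flip
    max_rotate=45.0,     # rotation up to 45 degrees in either direction
    max_zoom=1.5,        # zoom up to 50%
    max_warp=0.3,        # random warping (skewing, stretching, shifting)
)

data = prepare_data(
    r"C:\data\training_chips_buffer15",      # placeholder path
    dataset_type="PASCAL_VOC_rectangles",
    chip_size=128,
    batch_size=16,
    val_split_pct=0.2,
    transforms=(train_tfms, val_tfms),       # applied to training and validation batches
)
```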
Variants 15 and 16 represent the outputs of training with data augmentation. Variant 15 uses the training data without a buffer, and its AP reached 42% (Table 3). When the training data were buffered by 15 cm (variant 16), the AP increased to almost 49%. Based on the results obtained with the Resnet152 backbone, data augmentation was also applied to other backbones, and various additional backbone variants were tested, as shown in Table 3. None reached better results than Resnet152 with a 15 cm buffer. The best result among the other tested backbones was achieved by MobileNet V3, with an AP of 42.65%. Slightly lower results were recorded for the VGG16 and VGG19 backbones, with APs of around 37%, and InceptionV3 reached 31.55%, while InceptionV4 reached only 0.29%. The DenseNet backbones (121, 161, 169, and 201) did not provide useful results, with APs of approximately 1%.
The training process and validation are depicted by plotting the validation and training losses after fitting the model (Figure 4). In both cases, the loss on the training data decreases, i.e., the model is learning and its predictions are improving. When the loss on the validation data also decreases, the model not only predicts well on the training data but also generalizes to new data. This is important because the model must perform well on varied data. The difference between the training and validation losses is also important: if it is large, it may indicate that the model is overfitted, meaning that it has learned the specifics of the training data too well and cannot generalize to new data. If the curves are close together, it is a good sign that the model generalizes well [38]. The curves for variant 9 show larger differences (Figure 4a) than the curves for the variant trained with data augmentation (Figure 4b).
The trained model then needed to be tested on a sample of data that was not used during training. ArcGIS Pro with its “Detect Objects Using Deep Learning” function was employed to test the model. Four models were tested: variants 9 and 13 without data augmentation, and variants 15 and 16 with data augmentation. All four models were tested on an image without a mask and one with a mask (Figure 5). The predictions that had a confidence score higher than 50% were included in the result as predicted people.
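For illustration, this detection step can be scripted with the corresponding arcpy geoprocessing tool roughly as sketched below; the paths are placeholders, and the arguments string should be checked against the tool's documentation for the exported model:

```python
# Sketch of running the trained model over the test image with the ArcGIS Pro
# "Detect Objects Using Deep Learning" tool (Image Analyst); placeholder paths.
import arcpy

arcpy.CheckOutExtension("ImageAnalyst")

arcpy.ia.DetectObjectsUsingDeepLearning(
    in_raster=r"C:\data\prague_rgb_masked.tif",                       # masked or unmasked test image
    out_detected_objects=r"C:\data\results.gdb\detected_people",
    in_model_definition=r"C:\models\faster_rcnn_resnet152_buffer15.dlpk",
    arguments="padding 0;threshold 0.5;batch_size 4",                 # keep detections with confidence > 50%
)
```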
In parallel, the YOLOv3 model was trained with different settings (Table 4). Based on the results of the Faster R-CNN testing, this testing focused on models with data augmentation and a 15 cm buffer. The batch size was set to 32, 300 epochs were used, and Darknet53 served as the backbone. The YOLO model reached a maximum AP of 36.32%, much lower than the AP achieved by the Faster R-CNN model with data augmentation and a 15 cm buffer.
In addition, other widely used object detectors, such as RetinaNet and the Single Shot Detector (SSD), were evaluated using their ArcGIS Pro (version 3.3.1) implementations. The class arcgis.learn.SingleShotDetector() creates an SSD object detector with the specified parameters, including the grid sizes used for creating anchor boxes, the zoom scales and aspect ratios of the anchor boxes, the backbone model for feature extraction, and the dropout probability. The SSD implementation supports backbones from the ResNet, DenseNet, and VGG families. The class arcgis.learn.RetinaNet() creates a RetinaNet object detector with the specified zoom scales and aspect ratios of anchor boxes, as well as the backbone model; only backbones from the ResNet family are supported. The aspect ratio parameter was set to 1:1 to define the square shape of the labels. Multiple zoom parameters were tested to define the appropriate scale of the anchor boxes to cover the sizes of the detected people in the image. Table 5 summarizes the parameter settings used for the SSD and RetinaNet models.
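A sketch of how these detectors can be configured to match Table 5 is given below; "data" is the prepared chip set from the earlier sketch, the grid, zoom, and scale values are single examples from the tested ranges, and the epoch count is an assumption (parameter names should be checked against the arcgis.learn documentation):

```python
# Example SSD and RetinaNet configurations following Table 5 (arcgis.learn classes).
from arcgis.learn import SingleShotDetector, RetinaNet

ssd = SingleShotDetector(
    data,
    grids=[8],               # one of the tested grid sizes (4, 8, 16, 32)
    zooms=[0.5],             # anchor-box zoom from the tested 0.1-0.5 range
    ratios=[[1.0, 1.0]],     # square anchors to match the square labels
    backbone="vgg19",
    drop=0.2,                # dropout probability
)
ssd.fit(epochs=150, lr=ssd.lr_find())               # epoch count assumed

retinanet = RetinaNet(
    data,
    scales=[2, 4, 8],        # anchor scales from the tested 2-8 range
    ratios=[1.0],            # square anchors
    backbone="resnet152",
)
retinanet.fit(epochs=150, lr=retinanet.lr_find())   # epoch count assumed
```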
The above testing in ArcGIS Pro implementation yielded very poor results, with a maximal average precision value of only 0.1% for the SSD model employing the VGG19 backbone, a batch size of 16, a zoom of 0.5, and a grid of 8.
Based on the poor results of the YOLO, SSD, and RetinaNet implementations, only the Faster R-CNN model was tested further.

2.4. Validation Process

After applying the methods, it is necessary to determine how accurate each of them is. The schema of the validation process is shown in Figure 6. First, the center of gravity of each predicted polygon is determined; this approximates the most probable location of a predicted person (PP). Next, the shortest distances between the PPs and the ground truth data (GT) are compared. This is calculated as follows: the Euclidean distance for each pair of GT and PP locations is determined, and the nearest distance for each GT is found. Both points are then excluded from the list to allow a subsequent search, in which the next shortest distance between points is found, until the shortest distances for all points have been determined. A new script providing this function was prepared in Python, since the available software lacked a suitable function. The results then undergo statistical evaluation to determine the quality of the methods. The precision, recall, and F1 scores are calculated, for which it is necessary to evaluate TPs (true positives), FPs (false positives), TNs (true negatives), and FNs (false negatives). When evaluating the precision, recall, and F1 scores, distance limits between the ground truth points and the detected points must be established. The following distance limits were set: 15 cm (corresponding to half a pixel), 30 cm, 45 cm, and 60 cm.
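A minimal, self-contained sketch of this matching and scoring procedure (not the authors' published script) is given below:

```python
# Greedy nearest-distance matching between ground-truth points (GT) and centroids of
# predicted polygons (PP), followed by precision/recall/F1 at a chosen distance limit.
import numpy as np

def match_and_score(gt_xy, pp_xy, limit_m=0.60):
    gt = np.asarray(gt_xy, dtype=float)      # (n_gt, 2) ground-truth coordinates in metres
    pp = np.asarray(pp_xy, dtype=float)      # (n_pp, 2) predicted-person centroids in metres
    # Euclidean distance for every GT-PP pair.
    dist = np.linalg.norm(gt[:, None, :] - pp[None, :, :], axis=2)
    tp = 0
    while dist.size and np.isfinite(dist).any():
        i, j = np.unravel_index(np.argmin(dist), dist.shape)   # current shortest pair
        if dist[i, j] > limit_m:
            break                            # all remaining pairs exceed the limit
        tp += 1
        dist[i, :] = np.inf                  # exclude both points from further matching
        dist[:, j] = np.inf
    fn = len(gt) - tp                        # unmatched ground truth
    fp = len(pp) - tp                        # unmatched predictions
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return tp, fp, fn, precision, recall, f1

# Example: evaluate the detections at the 60 cm limit.
# tp, fp, fn, p, r, f1 = match_and_score(gt_points, predicted_centroids, limit_m=0.60)
```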

3. Results

Table 6 shows the results of people detection without using data augmentation or a 15 cm buffer. The F1 score at 60 cm corresponds approximately to the AP when training the given model (31%). When using the mask, the number of TPs decreased and the number of FNs increased. There was also a lower number of FPs, caused by a smaller number of predicted people; therefore, the value of the F1 score reached only 28%. Table 7 shows the results obtained using a 15 cm buffer without data augmentation. When the buffer is used, an increase in TPs can be seen in both cases, which corresponds to a decrease in FNs. The F1 score for the model without a mask achieved an approximately 3% improvement compared to the model without a buffer. When using the mask, the model utilizing the buffer achieved an F1 score approximately 12% higher.
Table 8 shows the results when using data augmentation. Regardless of whether a mask is used, the F1 score is higher when data augmentation is employed. In the case of using a mask, it is possible to see an increase of more than 20% in the F1 score (F1 score reaches more than 51%) compared to when not using data augmentation. When using data augmentation, it was possible to detect more than 50% of GT within 60 cm. Table 9 shows the results obtained using data augmentation and a buffer of 15 cm. In this case, more than 60% of GT were detected. For the model not utilizing a mask, the value of the F1 score is smaller than in Table 8, which is due to a larger number of FPs. Meanwhile, in the case of employing a mask, the model with data augmentation reached more than 54%, which is approximately 3% higher than when not using a buffer.
Figure 7 shows the output of people detection using the Faster R-CNN method. Red squares indicate real people (ground truth), while blue squares indicate people detected by the Faster R-CNN algorithm. The Faster R-CNN algorithm was able to detect people in this image with relatively high accuracy. It can be seen that in some cases the squares overlap completely and in others they overlap partially. Nevertheless, there is still room for improvement, especially in terms of the size of the area that the algorithm identifies as human.
Figure 8 documents the results of Table 7, where no mask was used. False positives occurred in areas where the surface appeared lighter due to specular reflection. Additionally, there were various transitions between surfaces where pixels could be misinterpreted as representing people. This issue was particularly evident on rooftops and sidewalks (Figure 8), where the transitions between different surface textures and lighting conditions led to pixel patterns resembling those of human figures, resulting in incorrect detections in the satellite images. This can also be seen in Table 10, where precisely such surface types have a higher density of predicted data than of ground truth data (Locations 6 and 8).
Additionally, areas where individuals were in close proximity (Location 7 in Table 10) exhibited an increased incidence of false negatives. In these situations, the model struggled to accurately classify individuals, resulting in missed detections.
Figure 9 clearly demonstrates that the combination of data augmentation and a buffer with no mask resulted in the highest number of true positive people, with almost 64% detection. Data augmentation and buffers significantly increase the number of TPs, while masking has a less significant effect.

4. Discussion

The results in Table 6, Table 7, Table 8 and Table 9 show the success of detecting people from satellite images. The F1 score results are approximately 30% when the model is trained without the use of data augmentation. The utilization of a 15 cm buffer leads to better results in terms of the F1 score, and models trained on the dataset that employed data augmentation achieve an F1 score exceeding 50%.
Figure 10 shows people detection without using data augmentation compared with using data augmentation. It can also be seen from Figure 9 that data augmentation helped to detect more people. However, without the use of a mask, Faster R-CNN was able to detect more people than with the use of a mask.
Figure 11b shows Faster R-CNN detection using a 15 cm buffer and data augmentation. More people were detected with the buffer than without (Figure 9). With the mask, two additional people were falsely detected, but without the mask two people were missing. It can be seen from Figure 10b that, again, data augmentation improves the results of people detection, and that when using a buffer and data augmentation all the people in Figure 11b were detected with no FPs.
The maximum achieved F1 score was 54.5%, which is not satisfactory. Duporge and Isupova [20] also applied Faster R-CNN for processing WV3 and WV4 images, and they obtained an F2 score of 77.8% in heterogeneous areas and 73% in homogeneous areas. Their higher accuracy may be due to the larger pixel representation of an elephant than a human. Gaszczak, Breckon, and Han [39] achieved up to a 70% detection rate for people detection in thermal imagery based on a cascaded classification technique combining additional multivariate Gaussian shape matching. Sirmacek and Reinartz [1] used GeoEye-1 satellite images to detect people using an adaptive kernel density estimation method and estimated the corresponding probability density function; their results reached an approximately 75% detection success rate. Wang et al. [4] used Mask R-CNN to detect people from cameras in offices and shops and achieved an AP of 56–76%. Using a modified Faster R-CNN method, Y. Ren, Zhu, and Xiao [24] detected small objects, such as planes and ships; the AP for planes was approximately 83% and for ships it was 71%. Yan et al. [40] proposed a method for extracting tailing pond information from high-resolution images (Google Earth images) using Faster R-CNN, and their AP reached 80.1%. Testing models with other architectures (YOLOv3, SSD, and RetinaNet) using ArcGIS Pro software implementation did not achieve higher accuracy than Faster R-CNN.

5. Conclusions

The detection of people from satellite images remains a challenge. Faster R-CNN, YOLOv3, SSD, and RetinaNet in ArcGIS Pro implementation were tested for people detection. A validation method based on the calculation of distances between point-based locations of people (ground truth data) and gravity centers of polygons for predicted people was employed. Variants with buffers, masks, and data augmentation were analyzed. The most promising results were obtained with Faster R-CNN using a 15 cm buffer, no mask, the application of data augmentation, and a 60 cm validation distance, resulting in an F1 score of almost 55%. In the future, results may be improved with the implementation of selected textural features and other CNN models, using image super-resolution methods such as fast Fourier convolutions [41], or using deblurring techniques to increase image quality [42].

Author Contributions

Software, P.G.; Resources, L.O. and P.P.; Writing—original draft, P.G.; Writing—review & editing, P.K.; Supervision, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by grant SP2024/076 from VSB—Technical University of Ostrava.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the WorldView-3 data must be purchased. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sirmacek, B.; Reinartz, P. Feature Analysis for Detecting People from Remotely Sensed Images. J. Appl. Remote Sens. 2013, 7, 073594. [Google Scholar] [CrossRef]
  2. Kaiser, M.S.; Lwin, K.T.; Mahmud, M.; Hajializadeh, D.; Chaipimonplin, T.; Sarhan, A.; Hossain, M.A. Advances in Crowd Analysis for Urban Applications through Urban Event Detection. IEEE Trans. Intell. Transp. Syst. 2018, 19, 3092–3112. [Google Scholar] [CrossRef]
  3. Hinz, S. Density and motion estimation of people in crowded environments based on aerial image sequences. In Proceedings of the ISPRS Hannover Workshop on High-Resolution Earth Imaging for Geospatial Information, Hannover, Germany, 2–5 June 2009. [Google Scholar]
  4. Wang, T.; Hsieh, Y.-Y.; Wong, F.-W.; Chen, Y.-F. Mask-RCNN Based People Detection Using a Top-View Fisheye Camera. In Proceedings of the 2019 International Conference on Technologies and Applications of Artificial Intelligence (Taai), Kaohsiung, Taiwan, 21–23 November 2019; IEEE: New York, NY, USA, 2019. [Google Scholar]
  5. Garcia, J.; Gardel, A.; Bravo, I.; Luis Lazaro, J.; Martinez, M.; Rodriguez, D. People Detection and Tracking Based on Stereovision and Kalman Filter. Rev. Iberoam. Autom. Inform. Ind. 2012, 9, 453–461. [Google Scholar] [CrossRef]
  6. Coniglio, C.; Meurie, C.; Lezoray, O.; Berbineau, M. People Silhouette Extraction from People Detection Bounding Boxes in Images. Pattern Recognit. Lett. 2017, 93, 182–191. [Google Scholar] [CrossRef]
  7. Pazhani, A.A.J.; Vasanthanayaki, C. Object Detection in Satellite Images by Faster R-CNN Incorporated with Enhanced ROI Pooling (FrRNet-ERoI) Framework. Earth Sci. Inf. 2022, 15, 553–561. [Google Scholar] [CrossRef]
  8. Wu, X.; Yang, Y.; Sun, Y.; Xie, Y.; Song, X.; Huang, B. Dynamic Regional Splitting Planning of Remote Sensing Satellite Swarm Using Parallel Genetic PSO Algorithm. Acta Astronaut. 2023, 204, 531–551. [Google Scholar] [CrossRef]
  9. Farrag, A.; Othman, S.; Mahmoud, T.; ELRaffiei, A.Y. Satellite Swarm Survey and New Conceptual Design for Earth Observation Applications. Egypt. J. Remote Sens. Space Sci. 2021, 24, 47–54. [Google Scholar] [CrossRef]
  10. Dakir, A.; Barramou, F.; Alami, O.B. Opportunities for Artificial Intelligence in Precision Agriculture Using Satellite Remote Sensing. In Geospatial Intelligence: Applications and Future Trends; Barramou, F., El Brirchi, E.H., Mansouri, K., Dehbi, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 107–117. ISBN 978-3-030-80458-9. [Google Scholar]
  11. Lalitha, V.; Latha, B. A Review on Remote Sensing Imagery Augmentation Using Deep Learning. Mater. Today Proc. 2022, 62, 4772–4778. [Google Scholar] [CrossRef]
  12. Adegun, A.A.; Fonou Dombeu, J.V.; Viriri, S.; Odindi, J. State-of-the-Art Deep Learning Methods for Objects Detection in Remote Sensing Satellite Images. Sensors 2023, 23, 5849. [Google Scholar] [CrossRef]
  13. Song, G.; Wang, Z.; Bai, L.; Zhang, J.; Chen, L. Detection of Oil Wells Based on Faster R-CNN in Optical Satellite Remote Sensing Images. In Proceedings of the Image and Signal Processing for Remote Sensing XXVI, Online, 21–25 September 2020; Volume 11533, pp. 114–121. [Google Scholar]
  14. Li, M.; Zhang, Z.; Lei, L.; Wang, X.; Guo, X. Agricultural Greenhouses Detection in High-Resolution Satellite Images Based on Convolutional Neural Networks: Comparison of Faster R-CNN, YOLO v3 and SSD. Sensors 2020, 20, 4938. [Google Scholar] [CrossRef]
  15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2980–2988. [Google Scholar]
  16. Jiang, H.; Learned-Miller, E. Face Detection with the Faster R-CNN. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 650–657. [Google Scholar]
  17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  18. Benjdira, B.; Khursheed, T.; Koubaa, A.; Ammar, A.; Ouni, K. Car Detection Using Unmanned Aerial Vehicles: Comparison between Faster R-CNN and YOLOv3. In Proceedings of the 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS), Muscat, Oman, 5–7 February 2019; pp. 1–6. [Google Scholar]
  19. Maity, M.; Banerjee, S.; Sinha Chaudhuri, S. Faster R-CNN and YOLO Based Vehicle Detection: A Survey. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 1442–1447. [Google Scholar]
  20. Duporge, I.; Isupova, O. Using Very-High-Resolution Satellite Imagery and Deep Learning to Detect and Count African Elephants in Heterogeneous Landscapes. Remote Sens. Ecol. Conserv. 2021. Available online: https://zslpublications.onlinelibrary.wiley.com/doi/full/10.1002/rse2.195 (accessed on 8 December 2021).
  21. Dumitrescu, F.; Boiangiu, C.-A.; Voncilă, M.-L. Fast and Robust People Detection in RGB Images. Appl. Sci. 2022, 12, 1225. [Google Scholar] [CrossRef]
  22. Li, L.; Ma, J. Zenithal People Detection Based on Improved Faster R-CNN. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 1503–1508. [Google Scholar]
  23. Fu, Z.; Xu, D. Exploiting Context for People Detection in Crowded Scenes. J. Electron. Imaging 2018, 27, 043028. [Google Scholar] [CrossRef]
  24. Ren, Y.; Zhu, C.; Xiao, S. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef]
  25. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
  26. Ahmad, M.; Abdullah, M.; Han, D. Small Object Detection in Aerial Imagery Using RetinaNet with Anchor Optimization. In Proceedings of the 2020 International Conference on Electronics, Information, and Communication (ICEIC), Barcelona, Spain, 19–22 January 2020; pp. 1–3. [Google Scholar]
  27. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  28. Ding, J.; Li, X.; Gudivada, V.N. Augmentation and Evaluation of Training Data for Deep Learning. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 2603–2611. [Google Scholar]
  29. Luca, A.R.; Ursuleanu, T.F.; Gheorghe, L.; Grigorovici, R.; Iancu, S.; Hlusneac, M.; Grigorovici, A. Impact of Quality, Type and Volume of Data Used by Deep Learning Models in the Analysis of Medical Images. Inform. Med. Unlocked 2022, 29, 100911. [Google Scholar] [CrossRef]
  30. Krupa, K.S.; Kiran, Y.C.; Kavana, S.R.; Gaganakumari, M.; Meghana, R.; Varshana, R. Deep Learning-Based Image Extraction. Artif. Intell. Appl. 2022. [Google Scholar] [CrossRef]
  31. Adedeji, O.; Owoade, P.; Ajayi, O.; Arowolo, O. Image Augmentation for Satellite Images. arXiv 2022, arXiv:2207.14580. [Google Scholar]
  32. Ghaffar, M.A.A.; McKinstry, A.; Maul, T.; Vu, T.T. Data augmentation approaches for satellite image super-resolution. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, IV-2-W7, 47–54. [Google Scholar] [CrossRef]
  33. Golej, P.; Horak, J.; Kukuliac, P.; Orlikova, L. Vehicle Detection Using Panchromatic High-Resolution Satellite Images as a Support for Urban Planning. Case Study of Prague’s Centre. GeoScape 2022, 16, 108–119. [Google Scholar] [CrossRef]
  34. Kumar, J.; Huan, T.; Li, X.; Yuan, Y. Panchromatic and Multispectral Remote Sensing Image Fusion Using Particle Swarm Optimization of Convolutional Neural Network for Effective Comparison of Bucolic and Farming Region. In Earth Science and Remote Sensing Applications; Series of Remote Sensing/Photogrammetry; Springer: Cham, Switzerland, 2018. [Google Scholar]
  35. Bengio, Y. Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  36. How Compute Accuracy for Object Detection Works—ArcGIS Pro|Documentation. Available online: https://pro.arcgis.com/en/pro-app/latest/tool-reference/image-analyst/how-compute-accuracy-for-object-detection-works.htm (accessed on 8 July 2024).
  37. What Is Average Precision in Object Detection & Localization Algorithms and How to Calculate It?|by Aqeel Anwar|Towards Data Science. Available online: https://towardsdatascience.com/what-is-average-precision-in-object-detection-localization-algorithms-and-how-to-calculate-it-3f330efe697b (accessed on 8 July 2024).
  38. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  39. Gaszczak, A.; Breckon, T.P.; Han, J. Real-Time People and Vehicle Detection from UAV Imagery. In Proceedings of the Intelligent Robots and Computer Vision XXVIII: Algorithms and Techniques, San Francisco, CA, USA, 24–25 January 2011; Volume 7878, pp. 71–83. [Google Scholar]
  40. Yan, D.; Li, G.; Li, X.; Zhang, H.; Lei, H.; Lu, K.; Cheng, M.; Zhu, F. An Improved Faster R-CNN Method to Detect Tailings Ponds from High-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 2052. [Google Scholar] [CrossRef]
  41. Ramesha, V.; Kadambi, Y.; Aditya, B.S.A.; Prashant, T.V.; Shylaja, S.S. Towards Faster and Efficient Lightweight Image Super Resolution Using Transformers and Fourier Convolutions. Artif. Intell. Appl. 2022. [Google Scholar] [CrossRef]
  42. Chaverot, M.; Carré, M.; Jourlin, M.; Bensrhair, A.; Grisel, R. Improvement of Small Objects Detection in Thermal Images. Integr. Comput.-Aided Eng. 2023, 30, 311–325. [Google Scholar] [CrossRef]
Figure 1. System overview of Faster R-CNN [19].
Figure 2. Study areas in Prague (pixel resolution is 30 × 30 cm). The numbers represent locations mentioned in Table 1.
Figure 3. Details of Location 7 (pixel resolution is 30 × 30 cm).
Figure 4. Validation and training losses after fitting the model. (a) Variant 9; (b) variant 16.
Figure 5. Image with mask (pixel resolution is 30 × 30 cm).
Figure 6. Validation process.
Figure 7. Results of the model including data augmentation and a buffer. Red indicates pixels with people (ground truth), and blue the most probable detected locations with a mask. Location 7 (pixel resolution is 30 × 30 cm).
Figure 8. Results of detection with data augmentation and a 15 cm buffer but without a mask. Red indicates pixels with people (ground truth), and blue the most probable detected locations (pixel resolution is 30 × 30 cm).
Figure 9. Correctly detected people (recall) with various methods and preprocessing.
Figure 10. Detection of people using Faster R-CNN: without data augmentation or a buffer (a); with data augmentation but without buffer (b). Red—pixels with people (ground truth); green—the most probable detected location without a mask; and blue—the most probable detected location with a mask. Location 7 (pixel resolution is 30 × 30 cm).
Figure 11. Detection of people using Faster R-CNN: without data augmentation but with a buffer (a); with data augmentation and a buffer (b). Red—ground truth; green—without a mask; and blue—with a mask. Location 7 (pixel resolution is 30 × 30 cm).
Table 1. Ground truth data in Prague.

Location | Number of People
1—Old Town Square—surface 1 | 214
2—Old Town Square—surface 2 | 15
3—Old Town Square—surface 3 | 19
4—Old Town Square—surface 4 | 9
5—Charles Bridge | 479
6—Prague Castle—surface 1 | 63
7—Prague Castle—surface 2 | 116
8—Prague Castle—surface 3 | 35
Total | 950
Table 2. Settings and results of Faster R-CNN model training.

Variant | Buffer (cm) | Backbone | Batch Size | Epochs | Learning Rate | Average Precision Score (%)
1 | 0 | Resnet101 | 8 | 50 | 0.000479 | 18.82
2 | 0 | Resnet101 | 2 | 50 | 0.000575 | 14.15
3 | 0 | Resnet101 | 32 | 50 | 0.000132 | 15.09
4 | 0 | Resnet101 | 32 | 150 | 0.000158 | 24.67
5 | 0 | Resnet101 | 64 | 300 | 0.000229 | 22.96
6 | 0 | Resnet101 | 16 | 150 | 0.000158 | 21.24
7 | 0 | Resnet152 | 8 | 100 | 0.000331 | 27.10
8 | 0 | Resnet152 | 32 | 150 | 0.000191 | 23.57
9 | 0 | Resnet152 | 16 | 150 | 0.000158 | 31.45
10 | 0 | Resnet152 | 24 | 150 | 0.000229 | 24.58
11 | 0 | Resnet152 | 20 | 150 | 0.000275 | 23.15
12 | 0 | Resnet50 | 16 | 150 | 0.000275 | 28.32
13 | 15 | Resnet152 | 16 | 150 | 0.000191 | 41.14
14 | 30 | Resnet152 | 16 | 150 | 0.000132 | 41.51
Table 3. Settings for training of the Faster R-CNN model using data augmentation.

Variant | Buffer (cm) | Backbone | Batch Size | Epochs | Learning Rate | Average Precision Score (%)
15 | 0 | Resnet152 | 16 | 150 | 0.000092 | 42.15
16 | 15 | Resnet152 | 16 | 150 | 0.000109 | 48.65
17 | 15 | VGG16 | 16 | 150 | 0.000275 | 37.61
18 | 15 | VGG19 | 16 | 150 | 0.000132 | 37.24
19 | 15 | InceptionV3 | 16 | 150 | 0.000091 | 31.55
20 | 15 | InceptionV4 | 16 | 150 | 0.000229 | 0.29
21 | 15 | DenseNet121 | 16 | 150 | 0.000331 | 0.91
22 | 15 | DenseNet161 | 16 | 150 | 0.000229 | 1.24
23 | 15 | DenseNet169 | 16 | 150 | 0.000191 | 0.47
24 | 15 | DenseNet201 | 16 | 150 | 0.000190 | 0.97
25 | 15 | MobileNet V3 | 16 | 150 | 0.000132 | 42.65
Table 4. Settings for the training of the YOLOv3 model using data augmentation.

Variant | Buffer (cm) | Backbone | Batch Size | Epochs | Learning Rate | Average Precision Score (%)
26 | 15 | Darknet53 | 8 | 300 | 0.002512 | 2.26
27 | 15 | Darknet53 | 16 | 300 | 0.002089 | 4.04
28 | 15 | Darknet53 | 32 | 300 | 0.003631 | 36.32
29 | 15 | Darknet53 | 64 | 300 | 0.000832 | 4.27
Table 5. Settings for the training of the SSD and RetinaNet models using data augmentation.

Model Type | Ratios | Backbone | Batch Size | Learning Rate | Zoom/Scales | Grids | Dropout
SSD | [1.0, 1.0] | VGG16, VGG19, ResNet152, and DenseNet201 | 2, 4, 8, 16, 32, and 64 | Defined by lr_find() function | [0.1–0.5] | 4, 8, 16, 32 | 0.2
RetinaNet | [1.0, 1.0] | ResNet50, ResNet101, and ResNet152 | 2, 4, 8, 16, 32, and 64 | Defined by lr_find() function | [2–8] | NA | NA
Table 6. Results of people detection without data augmentation and without a buffer.

Method | Limit (cm) | TP | FP | FN | Precision (%) | Recall (%) | F1 (%)
Faster R-CNN without mask | 15 | 7 | 107 | 207 | 6.1 | 3.3 | 4.3
Faster R-CNN without mask | 30 | 33 | 81 | 181 | 28.9 | 15.4 | 20.1
Faster R-CNN without mask | 45 | 46 | 68 | 168 | 40.4 | 21.5 | 28
Faster R-CNN without mask | 60 | 51 | 63 | 163 | 44.7 | 23.8 | 31.1
Faster R-CNN with mask | 15 | 6 | 48 | 208 | 11.1 | 2.8 | 4.5
Faster R-CNN with mask | 30 | 16 | 38 | 198 | 29.6 | 7.5 | 11.9
Faster R-CNN with mask | 45 | 30 | 24 | 184 | 55.6 | 14 | 22.4
Faster R-CNN with mask | 60 | 38 | 16 | 176 | 70.4 | 17.8 | 28.4
Table 7. Results of people detection without data augmentation but with a 15 cm buffer.

Method | Limit (cm) | TP | FP | FN | Precision (%) | Recall (%) | F1 (%)
Faster R-CNN without mask | 15 | 10 | 177 | 204 | 5.3 | 4.7 | 5
Faster R-CNN without mask | 30 | 40 | 147 | 174 | 21.4 | 18.7 | 20
Faster R-CNN without mask | 45 | 58 | 129 | 156 | 31 | 27.1 | 28.9
Faster R-CNN without mask | 60 | 68 | 119 | 146 | 36.4 | 31.8 | 33.9
Faster R-CNN with mask | 15 | 11 | 110 | 203 | 9.1 | 5.1 | 6.6
Faster R-CNN with mask | 30 | 34 | 87 | 180 | 28.1 | 15.9 | 20.3
Faster R-CNN with mask | 45 | 54 | 67 | 160 | 44.6 | 25.2 | 32.3
Faster R-CNN with mask | 60 | 67 | 54 | 147 | 55.4 | 31.3 | 40
Table 8. Detection with data augmentation but without a 15 cm buffer.

Method | Limit (cm) | TP | FP | FN | Precision (%) | Recall (%) | F1 (%)
Faster R-CNN without mask | 15 | 20 | 279 | 194 | 6.7 | 9.3 | 7.8
Faster R-CNN without mask | 30 | 72 | 227 | 142 | 24.1 | 33.6 | 28.1
Faster R-CNN without mask | 45 | 107 | 192 | 107 | 35.8 | 50 | 41.7
Faster R-CNN without mask | 60 | 121 | 178 | 93 | 40.5 | 56.5 | 47.2
Faster R-CNN with mask | 15 | 17 | 219 | 197 | 7.2 | 7.9 | 7.6
Faster R-CNN with mask | 30 | 70 | 166 | 144 | 29.7 | 32.7 | 31.1
Faster R-CNN with mask | 45 | 101 | 135 | 113 | 42.8 | 47.2 | 44.9
Faster R-CNN with mask | 60 | 116 | 120 | 98 | 49.2 | 54.2 | 51.6
Table 9. Detection with data augmentation and with a 15 cm buffer.

Method | Limit (cm) | TP | FP | FN | Precision (%) | Recall (%) | F1 (%)
Faster R-CNN without mask | 15 | 21 | 380 | 193 | 5.2 | 9.8 | 6.8
Faster R-CNN without mask | 30 | 79 | 322 | 135 | 19.7 | 36.9 | 25.7
Faster R-CNN without mask | 45 | 123 | 278 | 91 | 30.7 | 57.5 | 40
Faster R-CNN without mask | 60 | 136 | 265 | 78 | 33.9 | 63.6 | 44.2
Faster R-CNN with mask | 15 | 18 | 256 | 196 | 6.6 | 8.4 | 7.4
Faster R-CNN with mask | 30 | 75 | 199 | 139 | 27.4 | 35 | 30.7
Faster R-CNN with mask | 45 | 114 | 160 | 100 | 41.6 | 53.3 | 46.7
Faster R-CNN with mask | 60 | 133 | 141 | 81 | 48.5 | 62.1 | 54.5
Table 10. The population density calculated for ground truth and predicted data per 1 m².

Location | Ground Truth | Predicted
6—Prague Castle—surface 1 | 0.014 | 0.037
7—Prague Castle—surface 2 | 0.034 | 0.027
8—Prague Castle—surface 3 | 0.009 | 0.014
