A Human-Detection Method Based on YOLOv5 and Transfer Learning Using Thermal Image Data from UAV Perspective for Surveillance System

Mantau, Aprinaldi Jasa; Widayat, Irawan Widi; Leu, Jenq-Shiou; Köppen, Mario

doi:10.3390/drones6100290


by Aprinaldi Jasa Mantau ¹,*, Irawan Widi Widayat ¹, Jenq-Shiou Leu ² and Mario Köppen ¹

¹ Department of Computer Science and System Engineering (CSSE), Graduate School of Computer Science and System Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka-shi, Fukuoka 820-8502, Japan

² Department of Electronic and Computer Engineering (ECE), National Taiwan University of Science and Technology, Taipei City 106, Taiwan

* Author to whom correspondence should be addressed.

Drones 2022, 6(10), 290; https://doi.org/10.3390/drones6100290

Submission received: 7 September 2022 / Revised: 28 September 2022 / Accepted: 30 September 2022 / Published: 4 October 2022

(This article belongs to the Special Issue Drones, Artificial Intelligence and Advanced Analytics for the Conservation of Threatened Species)


Abstract

At this time, many illegal activities are being carried out, such as illegal mining, hunting, logging, and forest burning, and they can have a substantial negative impact on the environment. These illegal activities are increasingly rampant because of the limited number of officers and the high cost required to monitor them. One possible solution is to create a surveillance system that utilizes artificial intelligence to monitor the area. Unmanned aerial vehicles (UAVs) and NVIDIA Jetson modules (general-purpose GPUs) can be inexpensive and efficient because they use few resources. The problem in object detection from a drone’s perspective is that the objects are relatively small compared to the observation space, and there are also illumination and environmental challenges. In this study, we demonstrate the use of the state-of-the-art object-detection method you only look once (YOLO) v5 on a dataset of visual (RGB) images taken from a UAV, along with thermal infrared (TIR) images, to find poachers. We employed seven training scenarios with RGB and thermal infrared data to find the best model, which we later deploy on the Jetson Nano module. The experimental results show that transfer learning from a model pre-trained on the MS COCO dataset improves the ability of YOLOv5 to detect humans in the RGBT image dataset.

Keywords:

human detection; Jetson; surveillance; thermal imaging; UAV; YOLO

1. Introduction

Since 2012, the United Nations has proclaimed 21 March as World Forest Day [1]. The goal is to make people aware of the importance of forest sustainability. Based on data from the Food and Agriculture Organization of the United Nations [2], Indonesia is the eighth most forested country, with a total forest area of more than 50% of the total land area or about 93 million hectares. However, many illegal activities occur in large forest areas in Indonesia, such as land clearing without a permit, forest fires, and illegal hunting [3]. Illegal activities carried out in the forest environment may cause natural disasters such as landslides and floods, as well as the loss of habitat for many animals [4]. Various efforts have been made to maintain Indonesia’s forests and their biological species, yet many illegal activities still occur. Due to a lack of personnel and incomplete area coverage, the traditional technique of patrolling and monitoring these areas has not been able to resolve this issue.

One answer to this problem is to develop a surveillance system that monitors the area using artificial intelligence. Unmanned aerial vehicles (UAVs, often known as drones) and NVIDIA Jetson modules, which are embedded computers with general-purpose GPUs, offer an affordable and effective solution because they need few resources. The proposed surveillance system using drones and Jetson modules is shown in Figure 1.

UAV technology is currently very advanced and is the most realistic solution today because it is flexible, fast, relatively inexpensive, lightweight, and easy to use [5]. In several fields of study, UAVs have been employed as tools for area and target coverage, path and trajectory planning, image analysis and vision-based techniques, networking, and flight control [6]. Despite the widespread use of UAVs in these fields, many challenges remain to be solved, including weather conditions, shadows, illumination, and other variations. To overcome these challenges, RGBT images, that is, red, green, and blue images combined with thermal infrared information, are utilized.

Conceptually, a thermal infrared (TIR) image captures information outside the spectrum visible to the human eye, that is, wavelengths outside the visible light range, as shown in Figure 2. This makes TIR robust to the changes in light intensity that affect the colors seen in visible images. However, TIR also has weaknesses: it is sensitive to temperature changes and does not contain as much detail as visual RGB images [7].

Detecting poachers in UAV video, particularly thermal infrared footage, is an important research topic, and it is difficult because of the small size of humans in UAV videos, the UAV’s motion, and the low resolution. In the present study, several scenarios were used to enhance the you only look once (YOLO) [8] object-detection method, with a focus on small human–object detection from a UAV perspective. The targets are challenging for object detection because of their varied shapes and dense crowds. Therefore, the YOLOv5 model was trained using the RGB and TIR image datasets in order to evaluate how well it identifies humans in aerial-perspective data.

The main contributions of this paper are as follows:

Optimizing the YOLOv5s algorithm for small human–object detection dataset via the transfer learning method.
Developing a method to handle different environmental issues, including illumination and mobility change using thermal infrared (TIR) images in addition to RGB (RGBT) images.
The original dataset has been manually annotated to be YOLO-format-compatible, and the annotation will be made available to the public.
Proposing a surveillance system for wildlife conservation using NVIDIA Jetson Nano module.

This paper is organized as follows: Section 2 describes the object detection for surveillance and provides a brief overview of the NVIDIA Jetson Modules. Section 3 consists of the methodology and necessary background information, as well as the evaluation method. Section 4 and Section 5 consist of the experiment’s results and the conclusion, respectively.

2. Related Work

The technique of object detection in UAVs or drones has been developed for use in a variety of contexts, including aerial image analysis, monitoring agents, delivery routing agents, intelligent surveillance, and air force security. Hengstler et al. [9] introduce a new approach to the distribution model of the surveillance camera by using a low-resolution stereo camera that calculates all the captured images for the position, range, and dimension that UAVs use, called MeshEye. Widiyanto et al. [10] introduced a PSO algorithm for the odor-source localization model of automatic robotic movement by reconstructing two different points of robotic sensing. Zhao et al. [11] proposed a new mixed YOLOv3-LITE for image detection precision and speed, which can be used on a non-GPU computing system such as a mobile or portable device.

Several studies have been conducted in the field of object detection, especially with the availability of large datasets online and the increasing computing power, which have made extraordinary achievements in the field of computer vision [12]. It has been observed that object detection has been able to solve general and specific problems. The two examples of single-stage detection include you only look once (YOLO) and single-shot multi-box detector (SSD) [13]. Meanwhile, the RCNN family, which includes RCNN [14], Fast RCNN [15], and Faster RCNN [16], is categorized as being composed of two-stage detectors. These two categories of deep-learning-based detectors are divided based on accuracy and processing time.

2.1. You Only Look Once (YOLO)

The first YOLO method was introduced by Redmon et al. in 2016 [8]. This single convolutional network for object detection can predict object categories and locations at up to 45 fps. The YOLO algorithm takes the whole image in one pass and divides it into an $S \times S$ grid. Each grid cell is responsible for detecting the object whose center falls inside it, predicting its bounding box and class probabilities. The YOLO architecture has 24 convolutional layers for performing feature extraction and two fully connected layers for predicting the bounding box of the predicted object. In addition, YOLO is renowned for combining high performance with a small model, which makes it an ideal candidate for real-time object detection in on-device deployment.
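As a rough illustration of the grid assignment described above, the following sketch maps an object's center point to the grid cell that would be responsible for it. The function name `responsible_cell` and the example values (a 448 × 448 image and S = 7) are illustrative assumptions, not values taken from this study.

```python
def responsible_cell(cx, cy, img_w, img_h, s):
    """Return (row, col) of the S x S grid cell that contains the point (cx, cy).

    cx, cy are pixel coordinates of the object's center; s is the grid size S.
    """
    if not (0 <= cx < img_w and 0 <= cy < img_h):
        raise ValueError("object center lies outside the image")
    col = int(cx * s / img_w)  # horizontal cell index
    row = int(cy * s / img_h)  # vertical cell index
    return row, col


if __name__ == "__main__":
    # Hypothetical object centered at (300, 100) in a 448 x 448 image, S = 7.
    print(responsible_cell(300, 100, 448, 448, 7))
```

Only the cell index is computed here; the bounding-box and class predictions attached to each cell are produced by the network itself.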

By late 2021, YOLO had been upgraded to version 5. The first three YOLO versions were released in 2016, 2017, and 2018, respectively, and two further versions, YOLOv4 and YOLOv5, were released within a few months of each other in 2020. YOLO version 2 (YOLOv2) replaced the original architecture with a 19-layer feature extractor called Darknet-19 [17]. In the third version (YOLOv3), the network architecture was updated again to a deeper architecture known as Darknet-53 [18]. YOLO version 4 (YOLOv4) uses CSPDarknet-53, that is, the same Darknet-53 backbone with additional cross stage partial (CSP) connections [19], together with several additional features that were shown to enhance accuracy.

2.2. NVIDIA Jetson Modules

Embedded machine learning is evolving rapidly. NVIDIA is recognized as a manufacturer of graphics-processing units for gaming and professional markets, and of system-on-chip units for mobile computing. It also produces the NVIDIA Jetson modules, a family of embedded computers with integrated GPUs designed for high-performance computing, which make it easy to create embedded AI systems [20]. Jetson Nano is the cheapest of all NVIDIA Jetson modules, and with its 128 CUDA cores, it can handle a real-time video feed. The main technical parameters of the Jetson Nano module are summarized in Table 1.

NVIDIA Jetson Nano uses the compute unified device architecture (CUDA) as its parallel computing platform. CUDA is a development and execution platform designed by NVIDIA for general-purpose computing on graphics processing units (GPUs) [22]. It offloads to the GPU those tasks that can run in parallel and therefore do not need to be executed sequentially. Furthermore, it supports many programming languages, such as C, C++, Fortran, and Python. CUDA is useful in domains that require a lot of computing power or in situations where parallelization is possible and high performance is required. NVIDIA Jetson modules have been widely used in computer vision research because their general-purpose GPUs are a viable platform for the efficient execution of some computational models [23].

In this current study, the NVIDIA Jetson Nano was used to detect human appearance from a UAV perspective for the surveillance system. Additionally, the best YOLOv5 model was deployed from the RGBT dataset on Jetson Nano. An overview design of the Jetson Nano utilized is seen in Figure 3.

3. Methodology

3.1. Object Detector

YOLOv5 [24] is the latest major version of YOLO to date. Jocher released YOLOv5 publicly on 9 June 2020, and it is still being updated. The release includes four main model sizes: YOLOv5s, the smallest; YOLOv5m, medium; YOLOv5l, large; and YOLOv5x, the largest. When it was released, YOLOv5 was intended only for an image size of 640 pixels, but it now also offers 1280 pixels.

Furthermore, the architecture of YOLOv5 has a cross stage partial (CSP) backbone and a PANet neck, just like YOLOv4. However, YOLOv5 is implemented in PyTorch instead of the original Darknet framework. The significant improvements in YOLOv5 include mosaic data augmentation and auto-learning bounding box anchors. The architecture of YOLOv5 is shown in Figure 4.

3.2. Dataset

During the experimental design, the VisDrone 2021 RGBT dataset was used [25]. This dataset was originally part of the VisDrone 2021 Crowd Counting Challenge, which is a challenge for counting people in each frame. This challenge aims to estimate the number of people in an image. VisDrone 2021 provides a dataset with pairs of RGB and TIR images. It is important to note that the VisDrone 2021 RGBT dataset was collected by the AISKYEYE team from the Lab of Machine Learning and Data Mining at Tianjin University, China.

The dataset consists of 1807 pairs of RGB and TIR images; an example pair is shown in Figure 5. The team collected the data with actual UAVs under several different scenarios as well as various lighting and weather conditions. The ground truth of the dataset is given as target points for each object in XML format. Before using the data in the experiment, some preprocessing was performed to make it compatible with the YOLO format. In this study, the data were divided into training and test sets in a ratio of 80:20.
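To make the data preparation more concrete, the sketch below shows the general shape of a YOLO-format label line (class index followed by the normalized box center, width, and height) and a reproducible 80:20 split. It assumes that pixel bounding boxes are already available from annotation; the function names, the example box, the image size, and the random seed are hypothetical and are not taken from the authors' pipeline.

```python
import random


def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel box to a YOLO label line: 'class cx cy w h', normalized to [0, 1]."""
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_pairs(pair_ids, train_ratio=0.8, seed=0):
    """Shuffle image-pair IDs and split them into training and test lists."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)  # fixed seed (assumption) keeps the split reproducible
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    # Hypothetical person box in a hypothetical 640 x 512 image; class 0 = human.
    print(to_yolo_line(0, 100, 200, 120, 240, 640, 512))
    train_ids, test_ids = split_pairs(range(1807))
    print(len(train_ids), len(test_ids))
```

Splitting by pair ID rather than by individual file keeps each RGB image and its TIR counterpart in the same subset, which is one reasonable design choice for paired data.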

3.3. Experiment Setup

In this research, the YOLOv5 model was trained on a host machine with an NVIDIA RTX 3060 GPU with 12 GB of VRAM, an Intel Core i9-10900K processor (3.70 GHz, 20 MB cache), and 32 GB of memory. The best model from the training stage was converted to a TensorRT model and deployed to the Jetson Nano. Finally, the model was tested with the NVIDIA DeepStream SDK. The training parameters used can be seen in Table 2. Since this study aims to deploy the inference model on the Jetson Nano module, the smallest model version of YOLOv5 (YOLOv5s) was chosen.

A total of seven training-testing dataset scenarios considered in this study based on YOLOv5 are as follows:

Original YOLOv5 model (MS COCO RGB Dataset);
VisDrone RGB image data + transfer learning YOLOv5s model (YOLO-RGB-TL);
VisDrone RGB image data (YOLO-RGB);
VisDrone TIR image data + transfer learning YOLOv5s model (YOLO-TIR-TL);
VisDrone TIR image data (YOLO-TIR);
VisDrone RGB and TIR Image + transfer learning YOLOv5s model (YOLO-RGBT-TL);
VisDrone RGB and TIR image data (YOLO-RGBT).

The seven aforementioned scenarios are intended to investigate the impacts of the combination transfer-learning approach and dataset utilized so that the best scenario may be selected and applied to the Jetson Nano device.
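The six trained scenarios differ only in the training data and in whether training starts from pre-trained weights. The sketch below encodes them as a small table and builds a training command for each one, assuming the `train.py` command-line interface of the Ultralytics YOLOv5 repository [24]. The dataset configuration file names are hypothetical, and the paper does not state the exact commands that were used, so this is an illustration rather than the authors' procedure.

```python
# Hypothetical dataset config names; the paper does not list its file names.
SCENARIOS = {
    "YOLO-RGB-TL": {"data": "visdrone_rgb.yaml", "pretrained": True},
    "YOLO-RGB": {"data": "visdrone_rgb.yaml", "pretrained": False},
    "YOLO-TIR-TL": {"data": "visdrone_tir.yaml", "pretrained": True},
    "YOLO-TIR": {"data": "visdrone_tir.yaml", "pretrained": False},
    "YOLO-RGBT-TL": {"data": "visdrone_rgbt.yaml", "pretrained": True},
    "YOLO-RGBT": {"data": "visdrone_rgbt.yaml", "pretrained": False},
}


def build_train_command(name, epochs=100, batch_size=16):
    """Build an argument list for YOLOv5's train.py (epochs and batch size as in Table 2)."""
    cfg = SCENARIOS[name]
    cmd = [
        "python", "train.py",
        "--data", cfg["data"],
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
    ]
    if cfg["pretrained"]:
        # Transfer learning: start from the COCO-pretrained YOLOv5s checkpoint.
        cmd += ["--weights", "yolov5s.pt"]
    else:
        # Training from scratch: empty weights plus the YOLOv5s model definition.
        cmd += ["--weights", "", "--cfg", "yolov5s.yaml"]
    return cmd


if __name__ == "__main__":
    for scenario in SCENARIOS:
        print(scenario, build_train_command(scenario))
```

The first scenario, the original YOLOv5 model, needs no training and is therefore not listed. The learning rate from Table 2 would be set in a hyperparameter file rather than on the command line and is omitted here.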

3.4. Evaluation

The training scenarios for VisDrone RGB, TIR, and RGBT images were evaluated in both RGB and TIR test sets. The evaluation measurements utilized include precision (P), recall (R), and average precision (AP). The AP measures a combination of recall and precision for ranked retrieval results and is the average precision at various recall values [26]. The formula to calculate P and R is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

where:

TP denotes true positive;
FP denotes false positive;
FN denotes false negative.
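Equations (1) and (2) and the ranked-retrieval form of average precision can be sketched in a few lines. The code assumes that each detection has already been marked as a true or false positive (for example, by an overlap threshold that is not specified here), and it implements the non-interpolated definition of AP; the evaluation script of YOLOv5 may differ in interpolation details, so the numbers are not directly comparable with the tables in this paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(scored_detections, num_ground_truth):
    """Non-interpolated AP for ranked detections.

    scored_detections: list of (confidence, is_true_positive) tuples.
    num_ground_truth: total number of ground-truth objects.
    """
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    tp = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            precision_sum += tp / rank  # precision at the rank where recall increases
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


if __name__ == "__main__":
    # Hypothetical counts and detections, for illustration only.
    print(precision_recall(tp=8, fp=2, fn=4))
    print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2))
```

Ground-truth objects that are never detected contribute zero precision, which is why the sum is divided by the number of ground-truth objects rather than by the number of true positives.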

4. Experiments and Results

The experimental pipeline consists of two main stages: the first one is a model search or training process to find the best model to perform the human-detection task from a UAV perspective. This model search was performed on the computer host machine mentioned earlier. The second stage is the execution or inference in the Jetson Nano module. The flow of this experiment can be seen in Figure 6.

4.1. Model Search

The original YOLOv5 provided by Ultralytics can detect small objects in both the RGB and TIR images, but it often classifies them incorrectly. For example, in Figure 7 the original YOLOv5 classifies small humans as birds and kites. Intuitively, this occurs because the original YOLOv5 is a model trained on the COCO dataset, which has 80 classes and different perspectives. This indicates that the model trained on the MS COCO dataset is insufficient to solve the human classification problem from the standpoint of an unmanned aerial vehicle (UAV).

After training on the RGB, TIR, and RGBT images from the VisDrone 2021 dataset, the models were tested on the RGB and TIR test sets. The experimental results for the seven scenarios can be seen in Table 3 and Table 4.

Table 3 compares the seven scenarios, that is, the original YOLOv5 model and the six trained models, on the RGB test set. The models trained with RGB data (YOLO-RGB and YOLO-RGBT, with and without transfer learning) performed better than the original YOLOv5, whereas the models trained only on TIR images did not.

The best model in this scenario was the YOLO-RGB-TL model, with an average precision of 79.8%; meanwhile, the YOLO-TIR model failed on the RGB test set, with an AP of only 4.01%. Table 3 also shows that the performance of both YOLO-RGB and YOLO-RGBT improved when transfer learning from weights pre-trained on the MS COCO dataset was employed: the AP increased from 70% to 79.8% for YOLO-RGB and from 71.4% to 79.1% for YOLO-RGBT.

Furthermore, Table 4 shows the comparison results for the same seven scenarios on the TIR test set. The YOLO-TIR and YOLO-RGBT models trained with transfer learning both achieved an AP of 88.8% on the TIR test set. In Table 4, neither YOLO-RGB nor YOLO-RGB-TL matched the YOLO-TIR and YOLO-RGBT models because the information in the TIR image is not as detailed as that in the RGB image. This limited information makes it difficult for a model that has not been trained with TIR images to detect the objects. Detection results for each scenario on the RGB and TIR images are shown in Figures 7–10.
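The effect of transfer learning can be read off Table 3 and Table 4 by subtracting the AP of each model trained from scratch from the AP of its transfer-learning counterpart. The short sketch below performs this arithmetic on the AP values copied from the two tables; the variable and function names are illustrative.

```python
# AP (%) from Table 3 and Table 4 as (trained from scratch, with transfer learning).
AP = {
    "RGB test set": {
        "YOLO-RGB": (70.0, 79.8),
        "YOLO-TIR": (4.01, 4.94),
        "YOLO-RGBT": (71.4, 79.1),
    },
    "TIR test set": {
        "YOLO-RGB": (64.0, 71.3),
        "YOLO-TIR": (84.6, 88.8),
        "YOLO-RGBT": (85.7, 88.8),
    },
}


def transfer_learning_gain(ap_table):
    """Return the AP gain, in percentage points, from transfer learning for each model."""
    return {
        test_set: {
            model: round(with_tl - scratch, 2)
            for model, (scratch, with_tl) in models.items()
        }
        for test_set, models in ap_table.items()
    }


if __name__ == "__main__":
    for test_set, gains in transfer_learning_gain(AP).items():
        print(test_set, gains)
```

Under these values, transfer learning improves every model on both test sets; the largest gain is 9.8 percentage points, for YOLO-RGB on the RGB test set.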

4.2. Inference on the Jetson GPUs

The best model obtained from the previous step was then executed on the Jetson Nano module. Deploying the model on the Jetson Nano involves converting the model to TensorRT and cloning the TensorRT project on the Jetson Nano. During deployment, the NVIDIA DeepStream SDK was installed, and the model was then executed on the Jetson Nano module. The best model was run on the platform using the Keras API with TensorFlow v2. Jetson Nano modules were switched to the highest performance mode (nvpmodel 0), and the model processed images from the test set.

4.3. The Limitation of This Study

The limitation of this study is that the model we used as a foundation is YOLOv5s, a simple version of YOLOv5. Because we propose a method for applying the model to Jetson Nano, we consider resource constraints such as memory, time, and energy consumption. Additionally, the proportion of the RGB and TIR images has not been studied further to determine the optimal combination of these images for the most accurate object-detection method.

5. Conclusions

The detection ability of the state-of-the-art deep-learning-based algorithm you only look once (YOLO) has been investigated for small human–object detection from an unmanned aerial vehicle perspective using NVIDIA Jetson modules. In the model search, the YOLOv5 model trained with RGB and thermal infrared images produced good results on the small-object-detection problem. The RGB and TIR image dataset from VisDrone boosted the performance of the YOLOv5 model in detecting small objects from a UAV perspective, with AP values of up to 79.8% and 88.8% for RGB and TIR images, respectively. Future studies need to consider more complex methods for the training process, including new YOLO architectures and the most effective way to combine RGB and thermal infrared images. Finally, a complex surveillance system can be implemented in a multi-agent UAV with an edge AI concept using the NVIDIA Jetson module in order to investigate the cost performance of this solution.

Author Contributions

Conceptualization, A.J.M., I.W.W., M.K. and J.-S.L.; data curation, A.J.M.; formal analysis, A.J.M.; investigation, A.J.M. and I.W.W.; resources, A.J.M.; supervision, M.K. and J.-S.L.; visualization, A.J.M.; writing—original draft, A.J.M. and I.W.W.; and writing—review and editing, A.J.M., I.W.W., M.K. and J.-S.L. All authors read and agreed to the published version of the manuscript.

Funding

This study was supported by a collaborative research project between the Kyushu Institute of Technology (Kyutech) and the National Taiwan University of Science and Technology (Taiwan-Tech).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors gratefully acknowledge the support by the Kyushu Institute of Technology—National Taiwan University of Science and Technology Joint Research Program, under Grant Kyutech-NTUST-111-04.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

References

1. United Nations. International Day of Forests, 21 March. Available online: https://www.un.org/en/observances/forests-and-trees-day (accessed on 1 September 2021).
2. Food and Agriculture Organization of the United Nations. Global Forest Resources Assessment 2020: Main Report; FAO: Rome, Italy, 2020.
3. Assifa, F. Setiap Tahun, HUTAN INDONESIA HILANG 684.000 Hektar. Available online: https://regional.kompas.com/read/2016/08/30/15362721/setiap.tahun.hutan.indonesia.hilang.684.000.hektar (accessed on 24 April 2021).
4. Nugroho, W.; Eko Prasetyo, M.S. Forest Management and Environmental Law Enforcement Policy against Illegal Logging in Indonesia. Int. J. Manag. 2019, 10, 317–323.
5. Mantau, A.J.; Widayat, I.W.; Köppen, M. A Genetic Algorithm for Parallel Unmanned Aerial Vehicle Scheduling: A Cost Minimization Approach. In Proceedings of the International Conference on Intelligent Networking and Collaborative Systems, Taichung, Taiwan, 1–3 September 2021; Springer: Cham, Switzerland, 2021; pp. 125–135.
6. Shakeri, R.; Al-Garadi, M.A.; Badawy, A.; Mohamed, A.; Khattab, T.; Al-Ali, A.; Harras, K.A.; Guizani, M. Design Challenges of Multi-UAV Systems in Cyber-Physical Applications: A Comprehensive Survey, and Future Directions. arXiv 2018, arXiv:1810.09729.
7. Bokolonga, E.; Hauhana, M.; Rollings, N.; Aitchison, D.; Assaf, M.H.; Das, S.R.; Biswas, S.N.; Groza, V.; Petriu, E.M. A compact multispectral image capture unit for deployment on drones. In Proceedings of the 2016 IEEE International Instrumentation and Measurement Technology Conference Proceedings, Taipei, Taiwan, 23–26 May 2016; pp. 1–5.
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
9. Hengstler, S.; Prashanth, D.; Fong, S.; Aghajan, H. Mesheye: A hybrid-resolution smart camera mote for applications in distributed intelligent surveillance. In Proceedings of the 6th International Conference on Information Processing in Sensor Networks, Cambridge, MA, USA, 25–27 April 2007; pp. 360–369.
10. Widiyanto, D.; Purnomo, D.; Jati, G.; Mantau, A.; Jatmiko, W. Modification of particle swarm optimization by reforming global best term to accelerate the searching of odor sources. Int. J. Smart Sens. Intell. Syst. 2016, 9, 1410–1430.
11. Zhao, H.; Zhou, Y.; Zhang, L.; Peng, Y.; Hu, X.; Peng, H.; Cai, X. Mixed YOLOv3-LITE: A lightweight real-time object-detection method. Sensors 2020, 20, 1861.
12. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small object detection on unmanned aerial vehicle perspective. Sensors 2020, 20, 2238.
13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
15. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
20. Cass, S. Nvidia makes it easy to embed AI: The Jetson nano packs a lot of machine-learning power into DIY projects—[Hands on]. IEEE Spectr. 2020, 57, 14–16.
21. Jetson Modules. 2021. Available online: https://developer.nvidia.com/embedded/jetson-modules (accessed on 12 January 2022).
22. Kirk, D. NVIDIA Cuda Software and Gpu Parallel Computing Architecture. In Proceedings of the 6th International Symposium on Memory Management, ISMM’07, Montreal, QC, Canada, 21–22 October; Association for Computing Machinery: New York, NY, USA, 2007; pp. 103–104.
23. Krömer, P.; Nowaková, J. Medical Image Analysis with NVIDIA Jetson GPU Modules. In Proceedings of the Advances in Intelligent Networking and Collaborative Systems; Barolli, L., Chen, H.C., Miwa, H., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 233–242.
24. Jocher, G. yolov5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 31 January 2022).
25. University, T. Crowd Counting. Available online: http://aiskyeye.com/download/crowd-counting_/ (accessed on 6 June 2022).
26. Zhang, E.; Zhang, Y. Average Precision. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009; pp. 192–193.

Figure 1. UAVs with NVIDIA Jetson Nano for surveillance system.

Figure 2. Wavelength of light.

Figure 3. Jetson Nano module.

Figure 4. YOLOv5 architecture. Backbone: CSPD; neck: PANet; and head: YOLO layer detection results (class, score, location, and size).

Figure 5. RGBT VisDrone crowd-counting dataset [25].

Figure 6. Model search and human–object detection on Jetson Nano workflow.

Figure 7. YOLOv5s original model detection result. (a) TIR Image. (b) RGB Image.

Figure 8. YOLOv5-RGB model detection result. (a) TIR Image. (b) RGB Image.

Figure 9. YOLOv5-TIR model detection result. (a) TIR Image. (b) RGB Image.

Figure 10. YOLOv5-RGBT detection result. (a) TIR Image. (b) RGB Image.

Table 1. Jetson Nano Technical Specification [21].

Parameter	Technical Specifications
AI Performance	472 GFLOPs
GPU	NVIDIA Maxwell architecture
CPU	Quad-core ARM Cortex-A57 MPCore processor
CUDA Cores	128
Memory	4 GB 64-bit LPDDR4, 25.6 GB/s
Power	5 W | 10 W

Table 2. Training parameter.

Training Parameter	Value
Class	1
Batch size	16
Epoch	100
Learning rate	$1 \times 10^{-3}$

Table 3. Performance result on RGB test-set image.

Model	Precision (%)	Recall (%)	AP (%)
YOLOv5	24.6	7.3	12.36
YOLO-RGB-TL	80.8	75.4	79.8
YOLO-RGB	71.5	68	70
YOLO-TIR-TL	12.2	13.3	4.94
YOLO-TIR	9.89	12	4.01
YOLO-RGBT-TL	80.1	75.1	79.1
YOLO-RGBT	76.5	66.8	71.4

Table 4. Performance result on TIR test-set image.

Model	Precision (%)	Recall (%)	AP (%)
YOLOv5	21.2	4.2	8.3
YOLO-RGB-TL	76.3	63.3	71.3
YOLO-RGB	66	61.4	64
YOLO-TIR-TL	86.6	84.2	88.8
YOLO-TIR	81.7	80.1	84.6
YOLO-RGBT-TL	86.3	83.9	88.8
YOLO-RGBT	82.5	81.5	85.7

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

