As the results of the detection rely heavily on the performance of the cameras during the inspections, the specifications of the cameras and some image parameters are worth mentioning. One such parameter is camera sensitivity, commonly known as ISO speed, which reflects the camera’s sensitivity to light. Lower ISO settings generally result in images with less noise or grain, particularly in well-lit conditions, producing more detailed images with smoother transitions between tones. However, higher ISO values can introduce image noise or graininess, especially in low-light conditions [
18]. For instance, the performance of neural networks in detecting defects is highly influenced by the lighting conditions under which the inspection is conducted. Research indicates that optimal defect recognition is achieved at an illumination level of approximately 200 lux. In contrast, insufficient lighting (below 150 lux) can obscure critical features of the defects, leading to missed detections or incomplete analysis. Similarly, excessive lighting (above 250 lux) can cause glare or overexposure, which may distort the visual features of the defects and result in false positives or incorrect predictions [
19]. The ISO range of the detecting devices can partially mitigate the problems posed by varying lighting conditions. By adjusting the ISO settings, the sensor’s sensitivity to light can be optimized to capture clear and detailed images, even in suboptimal lighting environments [
20]. Resolution describes how finely a camera captures and records detail and is measured by the number of horizontal and vertical pixels in each frame. Higher video resolution provides more detail and clarity, which is essential for playback on large screens and for professional video production [
21]. The aperture, on the other hand, controls the amount of light that reaches the sensor; the larger the aperture, the more light enters. In this respect, the iPhone 13 has a larger aperture (f/1.6) than the UAV camera (f/2.8).
Table 1 briefly compares the specifications of the three cameras used in this study.
2.2. Algorithm Training
In this study, the primary approach was to utilize an image detection algorithm, namely “You Only Look Once” (YOLO), to detect trained objects [
27]. In this case, the model was specifically trained to detect damage types, including cracks, scratches, dents, paint-off, and missing head nails on aircraft surfaces. The equipment used for algorithm training in this study included a computer running Windows 11, equipped with an AMD Ryzen 9 7950X 16-core processor (AMD, Santa Clara, CA, USA), an NVIDIA GeForce RTX 4070 Ti GPU (NVIDIA, Santa Clara, CA, USA) with 6 GB of VRAM, and 128 GB of RAM. Hardware from ASUS and MSI supported all model training tasks. Training large-scale models such as YOLO requires substantial computational power. CPUs, while powerful, are not optimized for parallel processing, making GPUs a necessity for efficient deep-learning tasks. The introduction of CUDA and CUDNN into the training process allowed the YOLO model to fully utilize GPU resources, significantly reducing training time and enhancing performance. These technologies, combined with PyTorch’s neural network operations, enabled the efficient handling of complex computations necessary for YOLO’s architecture.
2.2.1. Training Environment
In this study, the programming for algorithm training was conducted primarily using Visual Studio Code software, version 1.96 [
28]. Python was installed together with the Visual Studio Code Python extension and served as the interpreter used to run the “YAML” training files provided by the official YOLO sites [
29,
30]. YAML, a human-readable data serialization format, is widely used for configuration files and data interchange. It represents nested data structures with indentation, offering a concise and user-friendly alternative to formats such as JSON [
31].
In the context of deep learning and machine learning, particularly with frameworks such as YOLO, a data.yaml file serves as a configuration file written in YAML format. This file specifies dataset paths, class names, and other necessary parameters required for training the model [
31,
32]. The data.yaml file used in this study defined the location of the data for training, validation, and testing as well as the classes included in the dataset.
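For illustration, a data.yaml file of this kind, with hypothetical dataset paths and the five classes used in this study, could look as follows (a sketch, not the exact file used by the authors):

# Dataset locations (illustrative paths)
train: ../dataset/images/train
val: ../dataset/images/val
test: ../dataset/images/test

# Number of classes and class names
nc: 5
names: ["crack", "dent", "missing-head", "paint-off", "scratch"]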
The training process relied heavily on YAML files to define how the algorithm would learn from the dataset. Ultralytics (La Jolla, CA, USA) provides pre-configured training materials for various YOLO versions, including YOLOv5 Nano (yolov5n.yaml), YOLOv8 Nano (yolov8n.yaml), YOLOv9 Compact (yolov9c.yaml), and YOLOv9 Enhanced (yolov9e.yaml). These configuration files were adapted to meet the specific requirements of this study [
29].
2.2.2. Code and Computation
The code for training was originally provided by Ultralytics [
29]; it enabled custom training of the YOLO model, either from scratch or using pre-trained weights. Weights represent the parameters in a neural network that transform input data through the network’s layers. During training, the model adjusts these weights to encode the knowledge learned from the dataset [
33,
34]. However, training with a large-scale database is time-intensive and demands significant computational resources [
35,
36]. To address this, GPUs and CUDA were employed to leverage parallel processing capabilities, reducing training time and computational load compared to CPUs [
31,
37,
38]. After installing Python 3.10, essential libraries including PyTorch [
39], TorchVision, OpenCV [
40], NVIDIA CUDA, and CUDNN were configured to enable GPU acceleration during training [
36,
41]. These tools significantly enhanced model performance by optimizing computations on the GPU [
42].
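Before launching training, a short check of this kind, assuming a standard PyTorch installation with CUDA support (a sketch, not the authors’ code), can confirm that GPU acceleration is actually available:

import torch

# Verify that PyTorch can see the CUDA-capable GPU before launching training.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)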
Figure 4 displays the YOLOv9 training code used for the training session in this study.
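As a point of reference, a minimal sketch of such a training call, assuming the standard Ultralytics Python API and illustrative file names and parameter values, is shown below; the code actually used in this study is the one displayed in Figure 4.

from ultralytics import YOLO

# Build a YOLOv9 Compact model from its configuration file (illustrative path).
model = YOLO("yolov9c.yaml")

# Train on the custom dataset described by data.yaml; parameter values are examples only.
model.train(
    data="data.yaml",   # dataset paths and class names
    epochs=500,         # number of training epochs
    batch=16,           # training samples per iteration
    imgsz=640,          # input resolution
    device=0,           # index of the CUDA GPU to use
)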
2.2.3. Training Parameters and Loss Functions
Several parameters were crucial in defining the training process. The “epochs” parameter determined the number of complete passes over the training data, allowing the model to improve with each cycle, provided adequate resources were available. The “batch size” parameter controlled the number of training samples processed per iteration. Smaller batch sizes can result in noisy gradient updates but may improve generalization, while larger batch sizes make training more computationally efficient but require more memory and may generalize less well [
32,
34,
39,
43].
Loss functions quantified the model’s performance during training. Three loss components—box loss, class loss, and DFL loss—were employed. Box loss measured the difference between predicted and ground truth bounding boxes using metrics such as Intersection over Union (IoU), defined in Equation (1) [
44,
45]:
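For reference, the commonly cited form of IoU, consistent with the description above, is

$$\mathrm{IoU} = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|},$$

where $B_{p}$ is the predicted bounding box, $B_{gt}$ is the ground-truth bounding box, and $|\cdot|$ denotes area.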
Class loss quantified errors in object classification within bounding boxes. Cross-entropy loss, the most commonly used metric, is expressed in Equation (2) [
33]:
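For reference, the common multi-class form of cross-entropy loss is

$$L_{cls} = -\sum_{i=1}^{C} y_{i} \log(\hat{p}_{i}),$$

where $C$ is the number of classes, $y_{i}$ is the ground-truth indicator for class $i$, and $\hat{p}_{i}$ is the predicted probability for that class.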
DFL loss (Distribution Focal Loss) emphasized hard-to-classify samples by assigning them higher weights, improving the model’s learning on challenging cases. Its formula is given in Equation (3) [
46]:
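For reference, the Distribution Focal Loss is commonly written as

$$\mathrm{DFL}(S_{i}, S_{i+1}) = -\big((y_{i+1} - y)\log(S_{i}) + (y - y_{i})\log(S_{i+1})\big),$$

where $y$ is the continuous regression target, $y_{i}$ and $y_{i+1}$ are the two nearest discrete values bracketing $y$, and $S_{i}$ and $S_{i+1}$ are the probabilities the model assigns to them.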
The objective during training was to minimize these loss values, as lower losses indicate better model performance. A lower loss value indicates that the model’s predictions are closer to the actual values, signifying better performance.
2.2.4. Training Outputs
Upon completing training, two output files in PyTorch format (“.pt”)—“best.pt” and “last.pt”—were generated. These files store the model’s architecture, parameters, and additional information required to recreate the model’s state [
46]. The “best.pt” file, which represents the model with the highest performance, was selected for subsequent processes.
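As an example of how such a file can be reused, a minimal sketch, assuming the Ultralytics Python API and an illustrative output path and image name, is the following:

from ultralytics import YOLO

# Load the best-performing weights produced by the training run (illustrative path).
model = YOLO("runs/detect/train/weights/best.pt")

# Run inference on a sample image; the result contains predicted boxes and classes.
results = model("sample_aircraft_surface.jpg")
results[0].show()  # display the annotated detection result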
Figure 5 illustrates the terminal output during model training in Visual Studio Code, which provides information on the loss values, GPU memory usage, and the calculated mAP values.
2.3. YOLO Model
Ultralytics provided several YOLO model options, each designed to address specific detection needs. For instance, YOLOv8 Nano is optimized for lightweight hardware environments, offering reduced computational requirements with only 3.2 M parameters and 8.7 FLOPs while achieving an mAPval50–95 of 37.3. YOLOv9 Compact (YOLOv9c) balances performance and efficiency, with 25.5 M parameters and 102.8 FLOPs at an mAPval50–95 of 53.0, making it suitable for constrained resources. The YOLOv9 Enhanced model provides the highest accuracy (mAPval50–95 of 55.6) but at the cost of significantly increased parameters (58.1 M) and FLOPs (192.5), making it ideal for high-performance tasks. Meanwhile, YOLOv5 Nano, the least demanding model, achieves an mAPval50–95 of 28.0 with only 1.9 M parameters and 4.5 FLOPs, suitable for resource-limited real-time applications.
Table 3 summarizes these trade-offs, allowing users to select a model based on hardware constraints and performance needs [
29].
The mean average precision (mAP) value is a metric used to evaluate the performance of object detection models, along with metrics such as precision, recall, and average precision [47,48]. The calculations used to evaluate the model’s performance are shown in Equations (4)–(7) [48].
- (1) Precision (P): The number of true positive detections divided by the total number of detections,
$$P = \frac{TP}{TP + FP},$$
where TP is true positives and FP is false positives.
- (2) Recall (R): The number of true positive detections divided by the total number of ground-truth instances,
$$R = \frac{TP}{TP + FN},$$
where FN is false negatives.
- (3) Average Precision (AP): The area under the precision–recall curve for a single class,
$$AP = \sum_{n} (R_{n} - R_{n-1})\, P_{n},$$
where $R_{n}$ and $P_{n}$ are the recall and precision at the nth threshold.
- (4) Mean Average Precision (mAP): The mAP is calculated as the average of the average precision (AP) values across all classes,
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i},$$
where N is the number of classes and $AP_{i}$ is the average precision for the ith class.
The mAP@0.5 metric evaluates the model’s performance at a fixed IoU threshold of 0.5. A detection is considered correct if the IoU between the predicted bounding box and the ground truth bounding box exceeds this threshold. This metric reflects the average precision across all classes under the specified IoU constraint [
48].
The models underwent an experimental session using a custom database comprising five classes: crack, dent, missing-head, paint-off, and scratch. This experiment aimed to identify the most suitable models for the training material in this study, using the test devices listed in
Table 1 and the training equipment described in
Section 2.2. Given our limited GPU resources, the experiment also served to verify the practical performance and compatibility of each model within the constraints of our hardware.
Figure 6 and
Figure 7 demonstrate the performance scores, in terms of mAP@0.5 and recall, of the different YOLO models tested under the described environment for 500 epochs.
As shown in
Figure 6, the training results demonstrate the differences in mAP@0.5 performance among various YOLO models. YOLOv9 Compact (Red Line) scored the highest mAP@0.5, achieving over 0.7 and remaining relatively stable after 200 epochs, suggesting the best and most reliable performance among the tested models. YOLOv8 Nano and YOLOv5 Nano also remained very stable but at lower mAP values compared to the YOLOv9 variants. The lowest score was from YOLOv5 Nano (Orange Line), which had the lowest mAP@0.5 among the compared models, reaching around 0.55, suggesting it is less accurate than the others.
In
Figure 7, the chart illustrates the recall performance of different YOLO models. While all models show significant growth in the early epochs, YOLOv9 Compact (red line) consistently outperforms other models in recall throughout the training period, resulting in approximately a 0.7 recall value. YOLOv9 Enhanced (blue line) and YOLOv8 Nano (green line) show similar performance, with YOLOv9 Enhanced slightly edging out YOLOv8 Nano in the later epochs, though they still scored somewhat lower in this custom training.
Table 4 summarizes the evaluation metrics across all the models.
Table 4 presents data collected at epoch 300. This decision was made because the training process for YOLOv9 Enhanced was set to conclude at epoch 500, while the others were set to finish at epoch 1000. Collecting data at epoch 300 ensured that the early completion of the training did not affect the metric results, as shown in
Figure 6 and
Figure 7. According to the official data regarding the performance of the models [
49], YOLOv9 Enhanced was expected to outperform YOLOv9 Compact. However, several factors can affect performance, with the processing unit used during training being crucial, especially since YOLOv9 Enhanced requires extremely high FLOPs [
49].
As shown in
Figure 6 and
Figure 7 and
Table 4, the YOLOv9 Compact (red line) performed the best among the tested models. In the next phase of the experiment, YOLOv9 Compact will undergo further training with custom data.
This experiment was conducted using yolov9-c.yaml, the most suitable model for this study, provided by Ultralytics [
50]. The model introduces powerful new capabilities and efficiencies through techniques such as Programmable Gradient Information (PGI), which optimizes gradient flow during the training of deep neural networks to improve convergence, prevent vanishing or exploding gradients, and enhance training efficiency. It also incorporates the Generalized Efficient Layer Aggregation Network (GELAN), which increases efficiency and effectiveness through efficient layer aggregation, generalization across tasks, and a reduction in computational complexity [
49].
According to yolov9.yaml, the YOLOv9 architecture is divided into two main parts: the “backbone” and the combination of the traditional “neck” and “head” of the neural network, simply defined as “head” [
49]. Several modules are found in the source code, including “Conv Module”, “RepConvN Module”, “RepNBottleneck Module”, “RepNCSP Module”, “RepNCSpELAN4 Module”, “CBLinear Module”, and “CBFuse Module.” Each module in this architecture plays a specific role in processing and transforming the input data, enhancing the model’s performance in tasks such as object detection and recognition. The functions of these modules are as follows:
Conv Module: This module comprises three components: nn.Conv2d, nn.BatchNorm2d, and an activation function (Act). The nn.Conv2d layer applies convolutional operations to the input data, nn.BatchNorm2d normalizes the output to speed up training and improve stability, and the activation function introduces non-linearity to the model [51,52], as shown in Figure 8a.
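A minimal PyTorch sketch of such a block, written here for illustration rather than taken from the YOLOv9 source code, and assuming SiLU as the activation (as is common in YOLO implementations), is the following:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative Conv module: Conv2d -> BatchNorm2d -> activation."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)  # normalizes activations for stable training
        self.act = nn.SiLU()  # non-linearity; assumed here for illustration

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: halve a 640x640 input, as the early backbone layers do.
x = torch.randn(1, 3, 640, 640)
y = ConvBlock(3, 64, kernel_size=3, stride=2)(x)  # -> (1, 64, 320, 320)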
RepConvN Module: The module includes two convolutional layers and one batch normalization layer (bn), followed by an activation function (Act). The structure improves the efficiency and performance of the model by combining multiple convolutional operations [
53,
54], as shown in
Figure 8b.
RepNBottleneck Module: This module makes use of a bottleneck architecture with a RepConvN block followed by a convolutional layer (Conv). This design helps to reduce the number of parameters and computational complexity while maintaining or improving the model’s accuracy [
55], as shown in
Figure 9a.
RepNCSP Module: This module consists of multiple RepNBottleneck blocks and a concatenation operation (Concat). The structure is designed to merge feature maps from different layers, enhancing the model’s ability to capture complex patterns [
56], as shown in
Figure 9b.
RepNCSpELAN4 Module: This module integrates several components: a convolutional layer (Conv), a chunk operation, two RepNCSP + Conv blocks, and a concatenation operation (Concat). This module outputs the processed data to subsequent layers. It is designed for efficient layer aggregation and improved feature learning [
57], as shown in
Figure 10a.
CBLinear Module: This module includes a convolutional layer (Conv), a split operation, and an element named A0 that outputs tensors. The structure is tailored to split the input data into multiple tensors for further processing [
58], as shown in
Figure 10b.
CBfuse Module: The CBfuse Module combines multiple interpolated inputs. Each input is first interpolated to a common size, then stacked together and finally summed. This approach helps in fusing features from different scales or resolutions, improving the overall feature representation [
59], as shown in
Figure 11.
Figure 12 represents the architecture of a neural network backbone for YOLOv9 object detection, explicitly showing how different layers process the input data and pass the information through the network. The input to the network has a resolution of 640 × 640. In the first stage of the process, Layer 0 processes the input and directly outputs to Layer 1 and Layer 26, the “head” section, without altering the resolution. Layers in pink are convolutional layers that process and reduce the resolution by half, resulting in 320 × 320 pixels in Layer 1, which then outputs to Layer 2. Similarly, Layer 2 further reduces the resolution to 160 × 160 pixels and outputs to Layer 3 [
10,
60]. Layers in blue utilize the RepNCSpELAN4 block, maintaining the resolution while merging feature maps from different layers, enhancing the model’s ability [
56]. In Layer 5, another RepNCSpELAN4 block with parameters [512, 256, 128, 1] keeps the resolution unchanged and outputs to Layer 6 and directly to Layer 24. It also concatenates with the upsampled Layer 13 at a further level. Resolution reduction and up-scaling procedures are fundamental to object detection models in neural networks, allowing the models to balance detail and semantics and ensuring robustness and efficiency [
51,
52].
Figure 13 and
Figure 14 show that the input undergoes several stages with different modules to further enhance the model’s ability. At the end of the architecture, the final layer, “DualDDetect”, takes inputs from previous layers: 31, 34, 37, 16, 19, and 22. It combines features extracted at different resolutions and stages of the network, improving detection performance.
2.4. RTMP Server Construction
There are many options available for servers that allow for real-time image transfer. However, since the DJI application compatible with the UAV model used in this study only supports RTMP and RTSP transmission, the focus was on developing an RTMP server.
The RTMP server was constructed using NGINX (1.22.0) software, which allows users to create a virtual server on a device, running on a Raspberry Pi 4 [61] on which Ubuntu 20.04, a Debian-based operating system [62], was installed. The installation procedure was as follows:
Install Ubuntu 20.04 on Raspberry Pi 4:
- Download the Ubuntu 20.04 image from the official Ubuntu website [62].
- Use Balena Etcher to flash the Ubuntu image onto a microSD card.
- Insert the microSD card into the Raspberry Pi 4 and power it on.
- Follow the on-screen instructions and complete the Ubuntu installation.
After successfully installing Ubuntu, the system was ready for the NGINX installation. Since the updated version of NGINX no longer supports RTMP servers as a built-in function, additional modules were used to create the server.
Install and configure the RTMP module for NGINX:

sudo apt update
sudo apt install libnginx-mod-rtmp
The following configuration was added:
rtmp {
    server {
        listen 1935;
        chunk_size 4096;
        application live {
            live on;
        }
    }
}
The listen 1935 directive allows the RTMP server to be accessed via listening port 1935. The NGINX service was then reloaded to apply the configuration:
sudo systemctl reload nginx.service
Figure 15 shows the terminal output confirming that the server was successfully activated.
After the configuration, the server was ready for internal use. The server URL is “rtmp://IP:1935/live/streamkey” (accessed on 18 May 2024), where the device’s IP address can be checked using the command “ifconfig”. The stream key can be configured according to the user’s needs. The next step required allowing external access to port 1935 by enabling port forwarding on the router [
63].
Port Forwarding:
Access your router’s web interface.
Navigate to the port forwarding section.
Create a new port forwarding rule:
- Service Name: RTMP Server;
- Protocol: TCP;
- External Port: 1935;
- Internal IP Address: (IP address of your Raspberry Pi);
- Internal Port: 1935.
Save the settings.
At this point, the RTMP server can be accessed without needing to be on the same network.
The flowchart in Figure 16 explains how the connection network of the system works.
In the setup shown in
Figure 16, the images captured by the UAV were transferred to the controller and then pushed to the RTMP server (1) with the pre-written Python code on the ground station that utilizes OpenCV’s video capture capability [
29,
64] using the following command:
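A minimal sketch of what such a capture call can look like, assuming OpenCV’s standard VideoCapture interface and a placeholder server address and stream key (not the authors’ exact code), is:

import cv2

# Placeholder address of RTMP server (1); "streamkey" is the user-defined stream key.
SOURCE_URL = "rtmp://IP:1935/live/streamkey"

cap = cv2.VideoCapture(SOURCE_URL)  # open the incoming RTMP stream
while cap.isOpened():
    ret, frame = cap.read()  # grab one frame from the stream
    if not ret:
        break
    # frame is then passed to the YOLO model for detection
cap.release()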
The image can be streamed and used in the detection process in real time. The processed image is then pushed to the RTMP server (2), from which it can be streamed to multiple IoT devices. The code requires the installation of FFmpeg (version 0.4.9-pre1) software [
65], and the code is given by the following:
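A sketch of what such an FFmpeg push can look like, assuming annotated frames are piped from Python to an ffmpeg subprocess and using placeholder resolution, frame rate, and output URL (not the authors’ exact code), is:

import subprocess

# Hypothetical output URL for RTMP server (2); adjust to the actual stream key.
OUTPUT_URL = "rtmp://IP:1935/live/detected"
WIDTH, HEIGHT, FPS = 1280, 720, 30

# FFmpeg process that reads raw BGR frames from stdin and pushes them as an FLV/RTMP stream.
ffmpeg_cmd = [
    "ffmpeg",
    "-f", "rawvideo",
    "-pix_fmt", "bgr24",
    "-s", f"{WIDTH}x{HEIGHT}",
    "-r", str(FPS),
    "-i", "-",                 # read frames from stdin
    "-c:v", "libx264",
    "-preset", "ultrafast",
    "-tune", "zerolatency",
    "-f", "flv",
    OUTPUT_URL,
]
ffmpeg_proc = subprocess.Popen(ffmpeg_cmd, stdin=subprocess.PIPE)

# Inside the detection loop, each annotated BGR frame is written to FFmpeg's stdin:
# ffmpeg_proc.stdin.write(frame.tobytes())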
The ffmpeg command in
Figure 17 enables the Python code to push the detection results to the RTMP server (2) within the same session in which the incoming images are received and processed [
65].