1. Introduction
Logistics hubs, including ports, are key links in modern supply chains. In these hubs, logistics service providers collaborate to transship goods among multiple transport modes, store them and offer value-added services through asset and operations sharing, taking advantage of economies of scale. One significant aspect of logistics hubs is security against a wide array of threats, including unauthorized entry, cargo theft, illicit trade and terrorism. Cargo owners and logistics providers have long recognized that guarding against security threats is fundamental to business continuity, operational efficiency and cost-effectiveness [1]. As a result, logistics hubs have developed strong practices, processes and systems that are security-certified under international standards, such as the ISPS Code [2] or TAPA [3]. Among others, these standards focus on access processes and restrictions, and on the means to raise alarms against illegal activities through surveillance.
To effectively address these areas, it is beneficial to include advanced technology systems in the logistics hub security strategy [1,4]. The industry status with respect to surveillance requirements, operations and systems has been reviewed in [5], which indicates that logistics facility operators mostly use traditional surveillance systems, such as Closed-Circuit Television (CCTV). In this case, surveillance cameras are installed throughout the facility and provide a real-time video feed to the operator(s) responsible for monitoring for potentially suspicious activity. The requirement of human supervision, however, may pose challenges related to reliability and costs [6]. Furthermore, the efficiency and efficacy of the CCTV set-up are not ideal due to the large physical area that such facilities cover [7]. Finally, CCTV cameras have a limited field of view and, as a result, blind spots cannot be avoided [8].
Thus, users agree that more advanced systems are necessary to eliminate the weaknesses of CCTV-based systems [5]. In response to this need, the industry is shifting towards integrating Unmanned Aerial Vehicles (UAVs) into surveillance operations [9]. Drone-based surveillance systems offer superior coverage and efficiency. Integrating Artificial Intelligence (AI) in such systems enables drones to accurately track objects and autonomously identify potential threats with minimal human intervention, enhancing the reliability of surveillance operations [10].
The YOLOv3 model, a one-stage AI object detector, has been shown to be highly effective for drone-based surveillance systems due to its fast and precise object detection and recognition capabilities. It can locate an object and determine its class in real time, which is essential in dynamic aerial surveillance situations [11]. Furthermore, YOLOv3 can detect small objects, an important requirement for drones that operate in varied and demanding conditions, such as diverse and oftentimes high altitudes and variable lighting [12].
However, training YOLO AI models is a complex process that requires extensive annotated data, in many cases from several sources [13]. It also requires significant computational resources [14]. There are additional training challenges for drone-based object detection: UAV photos have low resolution, which frequently results in blurred objects. Several object categories may also be underrepresented in drone images, and object sizes may vary over a broad range due to the angle and altitude of the drone [15]. These variations necessitate the use of highly robust algorithms. Finally, it is worth noting the scarcity of drone-specific datasets, which may affect the effectiveness of training.
The purpose of this study is to optimize the training of the YOLOv3 object detection model to effectively support drone surveillance systems operating in logistics hubs and facilities. Specifically, we aim to tune the model's hyperparameters to achieve superior training for high-altitude object detection in outdoor environments typical of logistics hubs. The main challenge is the computational time and resources required for experimenting with a large number of hyperparameter combinations during training. To accelerate this process, we explored two levers: (a) how to systematically search the hyperparameter space and (b) how to minimize the training time required to assess each hyperparameter combination. For the first lever, we employ a full factorial design of experiments (DOE) to explore the hyperparameter space, as sketched below. For the second lever, we estimate model performance for each hyperparameter combination early in the training process, thus considerably shortening the hyperparameter tuning process.
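As a concrete sketch of the first lever, the following Python snippet enumerates a full factorial design over the five factors examined in Section 4 (the dictionary keys and level labels are illustrative shorthand, not identifiers from our code):

```python
from itertools import product

# Factor levels examined in Section 4 (3 x 2 x 2 x 2 x 2 = 48 combinations).
factors = {
    "image_size":  [352, 416, 832],
    "anchor_size": ["default", "new"],
    "box_loss":    ["iou", "giou"],
    "depth":       ["fpn", "fpn_spp"],
    "activation":  ["leaky", "mish"],
}

# A full factorial design trains one model per combination of factor levels.
runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(runs))  # 48; with two repetitions, 96 training runs in total
```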
The structure of the remainder of this paper is as follows: Section 2 overviews important research on drone surveillance systems that incorporate AI. Section 3 outlines the method for hyperparameter tuning. Section 4 presents the analysis of the factorial experiments and estimates the effects of the hyperparameters on model performance. Section 5 discusses an early termination strategy to reduce the computational time of hyperparameter space exploration. In Section 6, we validate the performance of the tuned YOLOv3 model and the proposed early termination strategy. Finally, Section 7 presents the conclusions of the study, discusses key aspects of the proposed approaches and proposes directions for future research.
4. Experimental Results and Analysis
The computer system used for the experiments was equipped with an AMD Ryzen 9 3900X 12-core processor with 32 GB RAM (Advanced Micro Devices, Inc., Santa Clara, CA, USA) and an Nvidia RTX 3090 24 GB graphics card (Nvidia Corporation, Santa Clara, CA, USA) with the CUDA and cuDNN drivers installed. Additionally, the OpenCV toolkit, an open-source computer vision library providing a wide range of tools for image and video processing, was installed and used throughout the YOLOv3 pipeline. Finally, to automatically adjust all the aforementioned hyperparameters, a Python script (Python 3.10) was developed that generates the appropriate configuration of the YOLOv3 model for each experimental run.
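A minimal sketch of such a configuration script, assuming the standard Darknet `.cfg` format (the `width`, `height` and `activation` keys are Darknet's; the file names and helper function are illustrative, and only a subset of the tuned hyperparameters is shown):

```python
import re
from pathlib import Path

def write_cfg(base_cfg: str, out_path: str, image_size: int, activation: str) -> None:
    """Derive a run-specific YOLOv3 .cfg from a base template (sketch).

    Anchor sizes and the box-loss/architecture variants were handled
    analogously in our runs.
    """
    text = Path(base_cfg).read_text()
    # Network input resolution in the [net] section.
    text = re.sub(r"^width=\d+", f"width={image_size}", text, flags=re.M)
    text = re.sub(r"^height=\d+", f"height={image_size}", text, flags=re.M)
    # Activation of the convolutional layers (e.g., leaky -> mish).
    text = re.sub(r"^activation=leaky", f"activation={activation}", text, flags=re.M)
    Path(out_path).write_text(text)

write_cfg("yolov3.cfg", "run_001.cfg", image_size=832, activation="leaky")
```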
It is noted that YOLOv3 training requires significant computational resources, which leads to long computational times, especially on a less powerful system such as the one above.
Table 3 presents the effect of image dimensions on the computational time required to complete each training run using the above system. The computational time increases almost linearly with the image dimensions.
4.1. Results and Analysis
The analysis of the experimental results (96 mAP values) was performed using Minitab software.
Table 4 presents the significance of the effects and their interactions (p-value ≤ 0.05) on validation mAP. All main effects are statistically significant, as expected. Some two-way and three-way interactions are also significant, suggesting complex dependencies between the hyperparameters that affect the detection capabilities of the model. Only the significant interactions are shown in Table 4.
Figure 5 quantifies the effects of the main factors on the mAP:
Image resolution affects mAP significantly since downsizing the image resolution from 832 × 832 to 416 × 416 and 352 × 352 yields mAP values of 45%, 30% and 25%, respectively. This is an expected finding for drone surveillance operating at high altitudes. In this case, objects generally occupy a few pixels in the image, and therefore, increased resolutions help in object detection by providing a larger number of pixels per object.
The box loss function also affects model performance significantly. Using Intersection over Union (IoU) instead of Generalized IoU (GIoU) yields an increase in mAP of about 6%. This improvement indicates that IoU provides better alignment between predicted and actual object boundaries than GIoU in this setting (a minimal IoU/GIoU sketch follows this list).
The size of the anchor boxes has a smaller (but still statistically significant) impact on mAP, with a decrease of about 2% when changing from the default to the new anchor size.
Network depth and activation functions show an even smaller (but still statistically significant) effect on mAP. Switching from a Feature Pyramid Network (FPN) to an FPN with an added Spatial Pyramid Pooling (SPP) block, as well as changing the activation function from Leaky ReLU to Mish (both activations are sketched below), yields a change in mAP of about 1%.
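For reference, the following minimal sketch computes IoU and GIoU for two axis-aligned boxes given as (x1, y1, x2, y2); GIoU subtracts from IoU the fraction of the smallest enclosing box not covered by the union, so GIoU ≤ IoU always holds:

```python
def iou_giou(a, b):
    """IoU and Generalized IoU for boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box; GIoU penalizes its "wasted" area fraction.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c_area - union) / c_area

print(iou_giou((0, 0, 10, 10), (5, 5, 15, 15)))  # (~0.143, ~-0.079)
```

Similarly, the two activation variants compared above are standard functions; a NumPy sketch (the 0.1 leak coefficient is Darknet's default):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Linear for x > 0, small negative slope otherwise.
    return np.where(x > 0, x, alpha * x)

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic near zero.
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]
print(mish(x))        # approx. [-0.252 -0.221  0.     1.403]
```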
Figure 6 shows the two-way interaction effects on validation mAP. Consider, for example, the interaction between “Image resize” and “Anchor size”. The top left plot shows the mAP values at three image resolutions, 352 × 352, 416 × 416 and 832 × 832, with two anchor size settings: the YOLOv3 default and an updated “new” anchor size. As the image resolution increases, mAP improves for both anchor settings, but the rate of improvement is steeper with the new anchor size than with the default. Therefore, at higher image resolutions, the effect of anchor size on mAP becomes less pronounced. The contrary is true for the interaction between image resize and box loss: the effect of box loss is more pronounced at high image resize levels. In general, two non-parallel curves in a plot indicate a significant interaction.
The optimal hyperparameter combination, which reached the highest mAP of 50.33%, consists of resizing the image to 832 × 832, using the default anchor sizes, Intersection over Union (IoU) for the box loss, a Feature Pyramid Network (FPN) for the network depth, and Leaky ReLU as the activation function (summarized below). This configuration excels because the 832 × 832 resolution enhances feature detail, while the default anchor sizes probably match well with the typical object scales in the UAV dataset. The IoU box loss effectively balances precision and recall, the FPN incorporates semantic information at all scales, improving detection capabilities, and Leaky ReLU keeps neurons active, ensuring consistent learning throughout the network.
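For reference, the winning combination can be summarized as follows (the dictionary is an illustrative summary, not a configuration format used by Darknet):

```python
best_config = {
    "image_size":  832,        # network input resolution (width = height)
    "anchor_size": "default",  # YOLOv3's original anchor set
    "box_loss":    "iou",      # plain IoU rather than GIoU
    "depth":       "fpn",      # FPN without the extra SPP block
    "activation":  "leaky",    # Leaky ReLU throughout the network
}
# Validation mAP achieved by this combination: 50.33%
```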
4.2. Robustness of the Method with Respect to Datasets and Network Architecture
To verify the robustness of our hyperparameter tuning approach with respect to the input data, we conducted related experiments and analysis (ANOVA) on the mAP results produced from two datasets: (a) the training/validation dataset and (b) the testing dataset. That is, the effects of the hyperparameters and their interactions on mAP were estimated twice, based on two independent (but related) datasets. The analyses of these two datasets indicated very similar effects of the main factors and factor interactions on mAP. For example, the total effect of image resize was estimated at 20% from the validation dataset analysis and at 15% from the test dataset analysis. Similar results were obtained from the two analyses regarding anchor size (2% vs. 1.4%), box loss (6% vs. 4%), network depth (1% vs. 0.4%) and activation function (1% vs. 0.4%), as well as for the factor interactions.
A second investigation of robustness across input datasets included verification experiments using the DeOPSys dataset, a custom and totally independent dataset related to logistics hub surveillance. The results were very encouraging (see Table 7 in Section 6 below), confirming once more the applicability and robustness of the method across datasets.
In terms of robustness to network architectural changes, the YOLOv3 architecture was one of the factors tested in our work. In particular, we tested two model architectures: YOLOv3 with a Feature Pyramid Network (FPN), and YOLOv3 with FPN plus a Spatial Pyramid Pooling (SPP) module. The effect (difference) of these two architectures on mAP is limited but statistically significant (see Table 4 and Figure 4, under network depth). Similar outcomes hold for the interaction of network depth with other factors. ANOVA of the FPN vs. the FPN+SPP outcomes (two separate analyses, one per architecture) confirmed that the main effects of image resize (19.5% vs. 19.8%), anchor size (1.8% vs. 2.2%), box loss (6.8% vs. 6.1%) and activation function (1.4% vs. 0.5%) are statistically significant and similar in both architectures. The same holds for the factor interactions, indicating the robustness of the analysis across network architectures.
Regarding the early termination training technique, in the validation case, the best FPN configuration identified by the early termination method is 0.8% away from the best FPN configuration obtained through exhaustive training, while the best FPN + SPP configuration is 1.7% away. Similar results were obtained from the VisDrone and DeOPSys test datasets. These differences are limited, indicating that the early termination method effectively identifies high-performing configurations under both architectures tested.
4.3. A Note on Overfitting Avoidance and Computational Effort
In this study, overfitting avoidance was tested in two ways. First, through the classical validation and testing approach of the training process: the performance of the model was quite similar on the validation dataset and on the independent testing dataset (both independent subsets of the VisDrone dataset), indicating the absence of overfitting during training. Secondly, we performed experiments using a UAV image dataset developed in our lab (the DeOPSys dataset). Through these experiments, we validated that the model trained under the most efficient combination of hyperparameters was equally effective on a dataset totally independent of the one used for training. This is described further in Section 6.2 below.
In terms of computational effort, as indicated in the introduction of this section, training run time varies significantly depending on the image resolution (see also Table 3).
Figure 7 presents the progression of a training run for a 352 × 352 image resolution that requires 20,000 iterations. The red curve represents mAP, which increases as the model’s accuracy improves during training, while the blue curve corresponds to the training loss, which decreases as the model optimizes its parameters progressively. In this case, the 20,000 iterations of complete training require 378 min of computational time.
Thus, an important aspect to consider is the time required to perform full training for the 48 different hyperparameter combinations in two repetitions (a total of 40 days of computational time using the above computer system). This significant computational effort poses a practical limitation in exploring the space spanned by multiple hyperparameters. Certainly, the factorial design offers a robust understanding of how hyperparameters affect model performance. However, the significant investment in time and resources limits its applicability in situations in which rapid model creation and development are important (such as the logistics hub case). This highlights the need for more efficient approaches that can reduce the time required for comprehensive testing.
5. Reducing the Computational Time of Hyperparameter Space Exploration
The basic idea in reducing the computational time required for training is to predict the performance of a fully trained model much earlier in the training process. Thus, we investigated the potential of using a value of mAP that results from only a few hundred training iterations instead of the 20,000 iterations required for full model training. The metric used in this analysis is the highest validation mAP value observed between the 400th iteration and the early termination point.
We determined the lower threshold of 400 iterations by conducting preliminary experiments in the 0–1500 iteration range. We observed that in early iterations (indicatively, up to 200), mAP values were close to 0% due to the random initialization of weights and early-stage parameter adjustments. In the 0–600 iteration range, the maximum mAP reached slightly more than 5%, and in the 0–1500 range, it reached 14.3%. Based on these observations, we selected an iteration threshold of 400, within the 200–600 range.
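The early termination metric can be stated compactly; a minimal sketch, assuming the training loop logs (iteration, validation mAP) pairs:

```python
def early_map(history, start=400, stop=1500):
    """Maximum validation mAP observed between `start` and the termination point.

    `history` is a list of (iteration, mAP) pairs logged during training.
    Iterations below `start` are ignored: with randomly initialized weights,
    mAP there is close to 0% and carries no signal.
    """
    window = [m for it, m in history if start <= it <= stop]
    return max(window) if window else 0.0
```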
To identify the appropriate early training termination points, we used a two-step approach:
1. We performed a correlation analysis to assess the relationship between the best validation mAP results obtained from exhaustive training and those obtained by early termination. This analysis was performed for each of the three image sizes {352 × 352, 416 × 416, 832 × 832}.
2. For the promising image sizes, we ranked the best validation mAP results provided by exhaustive training from best to worst. We compared this ranking against the rankings based on the maximum mAP at the different termination points to assess possible divergence in the ranking sequences.
Regarding step 1, for each of the three image resolutions, we determined the correlation between (a) the maximum validation mAP achieved for each combination of the remaining parameters (16 combinations) through exhaustive training and (b) the maximum validation mAP achieved in the training iteration ranges of 400–600, 400–1000 and 400–1500 for the same hyperparameter combinations. For example, for image resolution 832 × 832 (see Figure 8), a close correlation was observed between the best validation mAP from exhaustive training (x-axis) and the maximum validation mAP achieved in the interval between 400 and 1000 iterations (y-axis) for the 16 experiments; R² is equal to 0.964.
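Step 1 amounts to a least-squares fit between the two mAP vectors; a sketch with NumPy (the arrays below are synthetic stand-ins, not the measured values):

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination R^2 of the linear fit y ~ a*x + b."""
    a, b = np.polyfit(x, y, deg=1)
    ss_res = np.sum((y - (a * x + b)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(35, 50, size=16)            # stand-in: exhaustive-training mAPs
y = 0.4 * x + rng.normal(0, 0.8, size=16)   # stand-in: early-termination mAPs
print(round(r_squared(x, y), 3))
```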
Table 5 presents the results of the entire correlation analysis. For the 352 × 352 resolution case, the coefficient of determination (R²) increases significantly with longer training durations, from 0.6722 in the 400–600 iteration case to 0.9114 in the 400–1500 iteration case. This validates the intuitive hypothesis that the mAP value produced by more extensive (but still brief) training sessions predicts the final model performance more closely. The 416 × 416 and 832 × 832 image resolution cases result in improved R² values. Here, R² also improves for the increased iteration cases. A notable exception is the 400–600 iteration case for the 416 × 416 resolution.
In the 832 × 832 resolution case, the correlation between the maximum mAP achieved during early iterations and in exhaustive training strengthens significantly as the number of iterations increases, reaching an R² value of 0.9714 in the 400–1500 iteration range. This indicates that longer (though still early-terminated) training periods can be highly predictive of the final model's performance. In the 400–600 iteration range, the low R² value is due to the high volatility of the model's weights during the early stages of training.
In step 2 of our approach, we listed all 16 hyperparameter combinations in decreasing order of mAP values for the 832 × 832 image resolution case.
Figure 9 presents four mAP curves, corresponding to exhaustive training and the three early termination cases. The x-axis represents the model ID (each model corresponds to a different hyperparameter combination), sorted by decreasing mAP of the exhaustive training case; the other three curves follow the same experiment sequence. The y-axis represents the mAP values.
Focusing on the top two curves of Figure 9, corresponding to the exhaustive training and 400–1500 cases, it appears that the early termination mAP values align well with the exhaustive training mAP values. For example, the top-performing combination of the 400–1500 case corresponds to model 9; in the exhaustive training case, the performance of this model is within 0.8% of the best combination. A similar result is obtained for the 400–1000 iteration range. However, the 400–600 iteration range does not exhibit similar reliability. These observations indicate that early performance can be a useful predictor of the performance of the fully trained model, provided that the iteration range is adequate (a termination point of 1000 iterations or more in this example).
Note that the curves for the 400–1000 and 400–1500 iteration ranges exhibit fluctuations in the mAP value (see Figure 9). This is not unexpected, given that early termination has not allowed the final convergence of the model parameter values. However, as Figure 10 below shows, there is a declining trend in the values of the 400–1000 and 400–1500 iteration cases, quite similar to that of the exhaustive training case (the slopes of the three lines are presented in Figure 10). This is quite encouraging and supports our argument. In addition, the early-terminated models predict the significant drop in performance observed after the model with ID 14, caused by the change in the box loss function. Specifically, models 1, 2, 9, 10, 5, 6, 13 and 14 use IoU as their box loss function, while models 11, 12, 16, 7, 3, 8, 4 and 15 use GIoU.
Similar results are obtained for the 416 × 416 resolution images (see Figure 11). In this case, the mAP deterioration in the exhaustive training results is more gradual (no discernible step change). For both the 400–1500 and the 400–1000 early termination cases, the deterioration trend is preserved. Furthermore, the top-performing combination of exhaustive training attains the top mAP values in both early termination cases, possibly a favorable coincidence. As expected from the corresponding R² value in Table 5, the 400–600 early termination case does not yield useful results.
For the 352 × 352 image resolution, the distinction between high- and low-performing hyperparameter combinations in the early termination cases deteriorates further. Thus, the maximum mAP resulting from early termination of training (at least up to 1500 iterations) is not considered a reliable performance predictor of the fully trained model.
The analysis of the mAP curves in Figure 9, Figure 10 and Figure 11 indicates that for the 832 × 832 and 416 × 416 image resolutions, a good predictor of model performance is the maximum mAP value obtained from training sessions terminated early at 1500 or 1000 iterations. In this example, a good trade-off between computational time and prediction ability is achieved by using the 416 × 416 image resolution, training for up to 1500 iterations and using the maximum mAP found in the range of 400 to 1500 iterations. For this case, the computational time is about 38 min, which is an order of magnitude less than the 492 min required for complete training of the model (20,000 iterations).
In step 2 of our approach (see above), for the lowest resolution (352 × 352) and shortest training duration (400–600 iterations), the early-terminated models failed to predict the performance of the fully trained models. However, at the higher resolutions (416 × 416 and 832 × 832), the ability to correctly differentiate the performance of the various hyperparameter combinations improves. For the 416 × 416 resolution, increasing the iteration range from 400–600 to 400–1500 improves the success rate in detecting the top-performing models from 20% to 80% and in identifying the worst-performing models from 40% to 80%. At the 832 × 832 resolution, the approach achieves an 80% success rate in detecting the top-performing configurations across all iteration ranges. Also, the ability to identify the worst-performing configurations improves significantly with longer training, rising from 60% (400–600) to 100% (400–1500). This suggests that at higher resolutions, additional training iterations enable better differentiation between all hyperparameter configurations, leading to more reliable predictions.
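The success rates above correspond to the overlap between the top-k (or bottom-k) models under the two rankings; a sketch of this measure (the inputs are the two per-configuration mAP vectors of step 1):

```python
def top_k_overlap(full_map, early_map, k=5):
    """Fraction of the k best models by exhaustive-training mAP that also
    rank among the k best by early-termination mAP."""
    by_full = sorted(range(len(full_map)), key=full_map.__getitem__, reverse=True)
    by_early = sorted(range(len(early_map)), key=early_map.__getitem__, reverse=True)
    return len(set(by_full[:k]) & set(by_early[:k])) / k

# The bottom-k rate is obtained analogously with reverse=False.
```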
The above analysis indicates the following:
The maximum mAP value obtained by early termination of the training process is a good indicator of the relative performance of the fully trained model;
The hyperparameter combination corresponding to the highest achieved mAP in early termination of training will result in a near-optimal model when exhaustive training is completed.
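Putting the two findings together, the tuning loop can be sketched as follows; `train` is a hypothetical stand-in for the Darknet training call and must return the logged (iteration, mAP) history:

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float]]  # (iteration, validation mAP) pairs

def tune(configs: List[dict],
         train: Callable[[dict, int], History],
         short_iters: int = 1500,
         full_iters: int = 20000) -> dict:
    """Early termination search: score every configuration with a short run,
    then fully train only the most promising one."""
    def score(cfg: dict) -> float:
        history = train(cfg, short_iters)
        window = [m for it, m in history if 400 <= it <= short_iters]
        return max(window) if window else 0.0

    best = max(configs, key=score)
    train(best, full_iters)  # exhaustive training only for the winner
    return best
```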
6. Validation Experiments
We conducted an experimental study to validate the performance of the YOLOv3 models in port surveillance operations. For this study, we used the custom experimental set-up of Appendix A.
To assess and compare the models, we used two different datasets: (a) the VisDrone test dataset and (b) the DeOPSys test dataset developed in our lab. Each dataset has characteristics that help validate the YOLOv3 model's ability to generalize across different environments and scenarios.
6.1. Validation Using the VisDrone Test Dataset
Table 6 presents the performance of the models corresponding to the 16 hyperparameter combinations when tested on the VisDrone test dataset. The models are listed in descending order of performance with respect to the early termination case with image resolution 416 × 416 and iteration range 400–1500. The first two columns of Table 6 indicate the model ID and the mAP values obtained in early terminated training (the same values as in the green curve of Figure 11, but in descending order of mAP). The third column presents the mAP values obtained when these models are fully trained at the 832 × 832 image resolution and applied to the images of the VisDrone test dataset, while the fourth column presents the difference between each of these mAP values and the best mAP value obtained.
Referring to the data of Table 6, the highest mAP value achieved by a fully trained model is 38.82%, corresponding to model 5. Model 9, which had the highest mAP value during the early termination process, achieved a mAP value of 37.43%, only 1.38% lower than the highest mAP score. Furthermore, model 1, which displayed the highest training performance during the factorial experiments (a mAP value of 50.3% in the 832 × 832 case; see Figure 8) and the third-best performance in early terminated training, achieved a mAP value of 38.68% on the VisDrone test dataset, that is, 0.14% less than the highest value of model 5. Thus, the following conclusions can be drawn:
The model corresponding to the hyperparameter combination identified as best in full training had an excellent performance in the independent VisDrone test subset;
The model corresponding to the hyperparameter combination identified as best in early terminated training also had a very good performance in the independent VisDrone test subset.
This indicates that the factorial method using full and, most importantly, early terminated training identifies near-optimal combinations of training hyperparameters.
Four of the top five hyperparameter combinations identified by the early termination method (corresponding to models 5, 1, 2 and 13) were also ranked among the top five performing models in the 832 × 832 VisDrone test. This also supports the value of the proposed method in identifying high-performance hyperparameter combinations using lower-resolution experiments with early terminated training.
To test the applicability and lack of overfitting on datasets with different but relevant image characteristics, we performed experiments using a UAV image set developed in our lab (the DeOPSys dataset). The performance of the best model identified by the proposed method was consistent with the performance of the model on the VisDrone dataset and actually surpassed the latter. This provides additional strong evidence that model training during hyperparameter tuning did not lead to dataset-specific overfitting.
6.2. Validation Using the DeOPSys Test Dataset
We employed an independent dataset generated in our lab to (a) validate the robustness of the proposed method with respect to the data (in addition to the work described in Section 4.2) and (b) examine the applicability of the results obtained in practical situations.
The dataset generated by our DeOPSys lab [58] consists of 722 high-resolution images taken from UAVs over two locations on the island of Chios, Greece: (a) the Tholos port/shipyard (Figure 12a) and (b) a rural area in the northern part of the island (Figure 12b). The images were captured under varying light conditions (daylight and dusk) and from different altitudes (15 m, 30 m and 50 m) [59]. They were annotated using LabelImg 1.8.6 [60] to identify and mark objects, such as persons and cars, resulting in a total of 1218 annotated objects.
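LabelImg's YOLO export stores one `class x_center y_center width height` line per object, with coordinates normalized to [0, 1]; a minimal reader converting them to pixel-space boxes (function and argument names are illustrative):

```python
def read_yolo_labels(path, img_w, img_h):
    """Parse a LabelImg YOLO-format .txt file into pixel-space boxes."""
    boxes = []
    with open(path) as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            x1 = (xc - w / 2) * img_w                 # top-left corner
            y1 = (yc - h / 2) * img_h
            boxes.append((int(cls), x1, y1, x1 + w * img_w, y1 + h * img_h))
    return boxes
```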
The analysis applied to the VisDrone Test-Development dataset was also applied to the DeOPSys dataset. Table 7, structured similarly to Table 6, presents the performance results of the 16 models in descending order of early termination performance.
Table 7. The performance of the 16 models ranked in decreasing mAP resulting from early termination of training (DeOPSys test dataset).
| Model ID | Validation mAP (Early Termination) | Test mAP (Fully Trained at 832 × 832) | Difference from Best mAP |
|---|---|---|---|
| 9 | 19.7% | 73.36% | 1.58% |
| 13 | 15.7% | 72.40% | 2.54% |
| 1 | 15.6% | 73.87% | 1.07% |
| 5 | 14.2% | 74.94% | - |
| 2 | 14.1% | 74.66% | 0.28% |
| 10 | 12.8% | 74.72% | 0.22% |
| 14 | 12.3% | 71.25% | 3.69% |
| 6 | 11.7% | 74.44% | 0.50% |
| 11 | 11.2% | 72.12% | 2.82% |
| 15 | 9.5% | 67.13% | 7.81% |
| 16 | 8.6% | 69.07% | 5.87% |
| 12 | 8.5% | 74.19% | 0.75% |
| 3 | 8.5% | 66.94% | 8.00% |
| 7 | 8.1% | 73.12% | 1.82% |
| 8 | 7.5% | 67.55% | 7.39% |
| 4 | 7.0% | 70.27% | 4.67% |
The highest test mAP achieved by a fully trained model (model 5) was 74.94%, a particularly high value. Model 1, which displayed the best performance in the factorial experiments, achieved a test mAP of 73.87%, only 1.07% lower than model 5. Similarly, the model that achieved the best performance in early terminated training (model 9) reached a test mAP of 73.36%, a difference of 1.58% from the performance of model 5 and 0.51% from that of model 1.
These test results indicate the following:
Superior performance is obtained by the top models of the factorial experiments, as well as by the top models of the early termination approach. In fact, the performance on the independent DeOPSys dataset was significantly better than that on the VisDrone dataset used in the original model training.
In this case, the early termination approach was also able to identify the high-performing hyperparameter combinations.
The proposed method is robust across datasets and may identify hyperparameter combinations resulting in high-performing models.
Model training during hyperparameter tuning did not lead to dataset-specific overfitting.
7. Conclusions and Discussion
7.1. Hyperparameter Tuning and Early Termination Strategy
In this study, we focused on optimizing the training process of the YOLOv3 algorithm for drone surveillance. This involved two aspects: (a) tuning of key training hyperparameters through exploration of the hyperparameter space and (b) use of an early termination strategy to rationalize the long computational times involved.
Factorial experiments demonstrated the importance of hyperparameter selection for model performance. During training/validation, the best-performing model achieved a mAP value of 50.33%, while the worst-performing model achieved a mAP value of 36.8%, a considerable difference in performance. Factors such as image resolution, box loss function and anchor size were found to have the most significant effects on mAP. Furthermore, ANOVA can systematically quantify the effects of all hyperparameters and all their interactions on model performance.
However, the scalability of factorial design is limited as the number of hyperparameters increases. In our case, the factorial design required approximately 40 days of computational time on the (relatively modest) system we used, highlighting the need for more efficient tuning methods.
To address this challenge, an early termination strategy was introduced, which reduced processing time by up to 92%. Despite the reduced computational cost, the methodology still identified high-performing hyperparameter combinations. The largest observed mAP difference between the best-performing model trained exhaustively and the best-performing model identified during the early termination process (and then trained exhaustively) was only 0.8%. Given this limited mAP difference, the best-performing model identified through early termination remains effective in the object detection case under study, ensuring that it can still identify potential threats or targets in logistics surveillance. In practical applications, this limited accuracy trade-off is acceptable given the significant reduction in computational cost, making the approach well-suited for real-time drone-based surveillance. It is also noted that the proposed early termination approach is highly scalable.
Testing of the best-performing model using a specially developed dataset that matches the characteristics of logistics hubs further supported the effectiveness of the proposed method; that is, factorial exploration of the hyperparameter space with early training termination may be used to develop high-performing models.
7.2. Future Research Directions and Potential of Advanced Methods
Further research may explore other training hyperparameters related to structural changes in neural network architecture. Additionally, integrating more intelligent search methods, such as Bayesian optimization, genetic algorithms or reinforcement learning, could enhance the efficiency of hyperparameter tuning.
These techniques may provide the potential to study a greater number of hyperparameters (beyond the five investigated in the current study), addressing the scalability limitation of factorial designs. Exploring a wider hyperparameter space intelligently (as opposed to the brute-force approach of the current study) may reveal configurations that achieve higher accuracy and detection efficiency at reasonable computational cost.
From our experience in ongoing work beyond the scope of this manuscript, we have observed that genetic algorithms (GAs) significantly reduce computational costs compared to a full factorial approach. Specifically, in an experiment tuning 16 (two-level) hyperparameters instead of the 5 used in this study, a full factorial design would require 65,536 models. However, by applying a GA with a population size of 30, 20 generations and five parents, the algorithm converges after evaluating slightly more than 500 models, significantly reducing computational requirements. This highlights the potential of GAs as an efficient alternative to exhaustive hyperparameter tuning methods. However, it should be noted that genetic algorithms are limited in identifying the effects of hyperparameters and their interactions on model performance.
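For illustration, a skeleton of such a GA loop under the settings quoted above (population 30, 20 generations, five parents); the search space shown reuses the five factors of this study, and `fitness` is a stand-in for training a configuration and returning its validation mAP:

```python
import random
from typing import Callable

SPACE = {                            # illustrative 5-factor space; a 16-factor
    "image_size":  [352, 416, 832],  # space is handled identically
    "anchor_size": ["default", "new"],
    "box_loss":    ["iou", "giou"],
    "depth":       ["fpn", "fpn_spp"],
    "activation":  ["leaky", "mish"],
}

def ga_search(fitness: Callable[[dict], float],
              pop_size: int = 30, generations: int = 20,
              n_parents: int = 5, p_mutation: float = 0.1) -> dict:
    """Genetic-algorithm hyperparameter search (sketch).

    In practice, fitness values are cached so that each configuration is
    trained at most once; the cache is omitted here for brevity.
    """
    pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            a, b = random.sample(parents, 2)
            child = {k: random.choice((a[k], b[k])) for k in SPACE}   # crossover
            if random.random() < p_mutation:                          # mutation
                gene = random.choice(list(SPACE))
                child[gene] = random.choice(SPACE[gene])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```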