Article
Peer-Review Record

Intelligent Structural Health Monitoring and Noncontact Measurement Method of Small Reservoir Dams Using UAV Photogrammetry and Anomaly Detection

Appl. Sci. 2024, 14(20), 9156; https://doi.org/10.3390/app14209156
by Sizeng Zhao 1, Fei Kang 2,*, Lina He 3, Junjie Li 1,2,*, Yiqing Si 4 and Yiping Xu 5
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 12 August 2024 / Revised: 2 October 2024 / Accepted: 4 October 2024 / Published: 10 October 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article is interesting and well prepared. However, some minor mistakes should be corrected to improve the scientific soundness.

 - lines 333-346 - Why was the comparison YOLOv8 vs YOLOv5 presented? Not v6 or v7?

 - lines 444-456 - The presented steps of processing are clear, but where is the computing performed - on UAV? on Windows workstation?

- Obtained results comparison to state-of-the-art results is missing.

Author Response

Thank you for your valuable comments and thorough review. The paper has been revised according to your comments as follows.

1. Comment: - lines 333-346 - Why was the comparison YOLOv8 vs YOLOv5 presented? Not v6 or v7?

Response: Thank you for the professional comment; this comparison indeed requires clarification. YOLOv8 is a recent object detection algorithm whose strengths lie not only in detection speed and accuracy but also in its adaptability across operating environments and platforms, allowing it to run on a wide range of devices. YOLOv8 is developed and maintained by Ultralytics, a company that continues to update the network, further enhancing its practicality.

Similarly, YOLOv5 was also created by Ultralytics, which indicates a certain level of similarity between the two networks. Both architectures are composed of the Backbone, PANet, and Head structures, with PANet incorporating an upsampling fusion followed by a downsampling fusion. YOLOv8 introduced improvements in some modules, such as C2f and SPPF, which build upon this shared structure and make the two networks directly comparable.

In contrast, YOLOv6, YOLOv7, and YOLOX were developed by different authors, and their structures vary significantly from YOLOv5 and YOLOv8. Comparing these networks directly with YOLOv5 or YOLOv8 might not adequately highlight the advantages of the network improvements presented in this study. For this reason, this study compares only YOLOv8 with YOLOv5.

However, as pointed out by the reviewer, if the rationale behind this choice is not explicitly stated in the study, it can easily lead to misunderstandings. Therefore, we have added this explanation to clarify the reasoning behind the comparison.

The YOLO network is one of the most widely adopted object detection algorithms for UAV image detection, and is suitable for deployment on UAV platforms to achieve real-time detection [47, 48]. YOLOv8, like YOLOv5, was developed and continues to be updated by Ultralytics, and it brings notable enhancements in detection speed and accuracy [49].

YOLOv8 adopts a network architecture similar to that of YOLOv5, making it possible to directly compare their performance when using similar parameters. This also allows for an effective evaluation of how newly added modules enhance the network's capabilities. Furthermore, compared with YOLOv5, several improvements are adopted in YOLOv8: a C2f module replaces the C3 module to obtain richer gradient-flow information while maintaining a lightweight architecture.

This is revised in Section 3.1.

2. Comment: - lines 444-456 - The presented steps of processing are clear, but where is the computing performed - on UAV? on Windows workstation?

Response: Thank you for your comment. The entire computation is performed on a workstation running the Windows operating system, with the UAV connected to the local system via network communication. The UAV transmits its position, camera angles, and other relevant information using the MQTT protocol. The local workstation processes the images and received data, leveraging its computational power and combining them with pre-defined spatial coordinates of the inspection area. This ensures that the final inspection results are produced directly, streamlining the process. We have incorporated this explanation into the revised paper.

The main steps are processed on a Windows workstation; details are as follows and summarized in Fig. 11:

A workstation running Windows with an Intel i7-13700F CPU was used; it integrates its computational power with the pre-defined spatial coordinates of the inspection area and produces the inspection results directly.

This is revised in Section 3.4 and 4.1.
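As an illustration of the workstation-side processing described above, a telemetry message received over MQTT could be decoded as in the sketch below. The JSON field names (`lat`, `lon`, `alt`, `yaw`, `pitch`, `roll`) are hypothetical placeholders, not the authors' actual message schema.

```python
import json

def parse_telemetry(payload):
    """Decode one UAV telemetry message (hypothetical JSON schema)
    received over MQTT into position and gimbal-angle tuples."""
    msg = json.loads(payload)
    return {
        "position": (msg["lat"], msg["lon"], msg["alt"]),
        "gimbal": (msg["yaw"], msg["pitch"], msg["roll"]),
    }

# Example payload as it might arrive on an MQTT topic:
sample = '{"lat": 30.0, "lon": 120.0, "alt": 50.0, "yaw": 90.0, "pitch": -45.0, "roll": 0.0}'
telemetry = parse_telemetry(sample)
```

In a real deployment, a callback registered with an MQTT client would invoke such a parser for each incoming message before the workstation fuses the data with the pre-defined inspection coordinates.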

3. Comment: Obtained results comparison to state-of-the-art results is missing.

Response: Thank you for your valuable comments. We agree that comparing the research results with state-of-the-art methods is essential to highlight the advantages of the proposed approach. Currently, no studies have employed identical methods for inspecting small reservoir dams; therefore, this study compares similar inspection techniques.

From a methodological perspective, commonly used 3D reconstruction methods based on UAV images [25] aim to capture the 3D coordinates of complex surface structures. However, these methods are time-consuming and unsuitable for the real-time demands of small reservoir dam inspections. Other methods estimate the target's position using the dimensions of specific objects in the images (e.g., windows), but these techniques are not applicable to the small reservoir dam environment. In contrast, our approach requires only a coordinate transformation, which allows for both target detection and safety assessment of individuals at identified locations, making it more suitable for small reservoir dam safety inspections.

In terms of accuracy, the proposed method integrates a laser rangefinder, effectively converting 2D pixels into 3D coordinates. Without the laser rangefinder, measurement methods based on the spatial relationship between the images and the structure could result in an error of around 1 m. In this study, however, the positioning error is within 0.2 m, which meets the inspection requirements of small reservoir dams. This is especially beneficial during tracking, where it can accurately assess whether individuals have exited hazardous areas.

From a practical standpoint, the method developed in this study requires minimal additional equipment. The coordinate system of the small reservoir dam does not need to be pre-acquired; the UAV coordinates can be transformed based solely on ground control point coordinates. This is crucial for meeting the low-cost requirements of small reservoir dam inspections. Moreover, the proposed method enables the creation of virtual electronic fences for small reservoir dams, as shown in Figure 24. Therefore, the proposed method, compared to current state-of-the-art techniques, is more suitable for small reservoir dam safety management and can help guide disaster prevention strategies [65, 66].

Compared to current UAV inspection approaches, the method proposed in this study is suitable for small reservoir dams. From a methodological perspective, unlike 3D reconstruction [25], the proposed method does not require excessive time to capture 3D coordinates in complex structural areas or rely on other structural features for coordinate transformation. The proposed approach requires only a single coordinate transformation, making it more suitable for evaluating the safety of identified individuals in small reservoir dam inspections. In terms of accuracy, the proposed method incorporates a laser rangefinder, effectively converting 2D pixels into 3D coordinates. When relying solely on measurement methods based on the spatial relationship between the image and the structure, the error could be around 1 meter. However, in this study, the positioning error is within 0.2 meters, which meets the inspection positioning requirements for small reservoir dams. For practicality, the method developed in this research requires minimal additional equipment. The coordinate system of the small reservoir dam does not need to be pre-established; the UAV coordinates can be transformed using ground control point coordinates alone. This is particularly important for meeting the low-cost demands of small reservoir dam inspections. Additionally, the proposed method enables the creation of a virtual electronic fence for small reservoir dams, as illustrated in Figure 24. Thus, compared to the current state-of-the-art methods, the proposed approach is more suitable for small reservoir dams, supporting safety management and guiding disaster prevention strategies [65, 66].

This is revised in Conclusion.
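The 2D-pixel-to-3D-coordinate conversion discussed above can be sketched with a standard pinhole-camera back-projection scaled by the laser-measured range. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the camera-frame convention here are illustrative assumptions only; the paper's actual transformation additionally involves gimbal angles and ground control points.

```python
import math

def pixel_to_world(u, v, fx, fy, cx, cy, laser_range, uav_pos):
    """Back-project pixel (u, v) to a unit ray in the camera frame,
    scale it by the laser-measured distance, and offset by the UAV
    position. A simplified sketch of pixel-to-3D conversion."""
    # Normalized image-plane coordinates (pinhole model).
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    ray = (x / norm, y / norm, 1.0 / norm)  # unit direction
    # Point at the laser-measured distance along the ray.
    return tuple(p + laser_range * r for p, r in zip(uav_pos, ray))

# A pixel at the principal point maps to a point straight ahead:
point = pixel_to_world(320, 240, 800, 800, 320, 240, 10.0, (0.0, 0.0, 0.0))
```

A full implementation would also rotate the ray from the camera frame into the dam's coordinate system using the gimbal attitude before adding the UAV position.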

Thank you again for your time and effort, and for helping us improve the manuscript. The supplementary details mentioned by the reviewer have been added to the revised paper.

Reviewer 2 Report

Comments and Suggestions for Authors

Overall interesting paper with reasonable merit for publication. Some areas for improvement:

How did the authors optimize hyperparameters? Why not use automated search algorithms (e.g., grid search, Bayesian optimization) given the data?
I would suggest employing multi-modal input data to test the approach (RGB+LiDAR).
Authors should add cross-validation to ensure model generalizability.
Why not incorporate newer data augmentation techniques for better generalization of the model? Rather dated approaches are used, considering the quite small dataset.
Add an ablation study to isolate the effects of each model component in Fig. 11 and at different stages (data preprocessing, model architecture, post-processing). Use sensitivity analysis to quantify the impact of removing specific layers or features to catch when it fails to detect objects in dams.
Use random noise injection to test model resiliency and conduct out-of-distribution tests to assess the model's robustness on unseen data.
Also test model performance across different scales of input data (e.g., subsampling) to ensure scalability, considering variations in distance to the dams in your dataset, and utilize few-shot learning to assess how well the model adapts with limited data.

 

I think the above would very much increase the validity of the experimental part.

Thanks, and looking forward to the revised paper.

Author Response

Thank you for your valuable comments and thorough review. The paper has been revised according to your comments as follows.

1. Comment: How did the authors optimize hyperparameters? Why not use automated search algorithms (e.g., grid search, Bayesian optimization) given the data?

Response: Thank you for your professional comment; hyperparameters are indeed important for network training. In this study, we initially adopted a manual approach to fine-tune the hyperparameters of the YOLOv8 network, comparing training results in terms of accuracy, loss, and other metrics. However, this method proved inefficient, and the network's accuracy decreased after parameter modification. Subsequently, we used the 'Ray Tune' method, which is the default hyperparameter optimization approach in YOLOv8. Despite this, the optimized hyperparameters did not show significant improvements during training and performed poorly during validation. Therefore, the default hyperparameters of the network were used in this study.

We have conducted an analysis of this phenomenon. First, YOLOv8 default hyperparameters have been fine-tuned through extensive experimentation and optimization, demonstrating strong initial performance across various datasets. As a result, even without additional hyperparameter tuning, the default settings are robust enough to deliver good performance across many datasets. Furthermore, part of the dataset used in this study consists of small targets filtered from publicly available datasets, where the original hyperparameters already produce satisfactory results. Additionally, for small-object detection in UAV imagery, improving the network architecture, such as introducing different modules or detection paths, plays a more crucial role than optimizing hyperparameters, which may only provide marginal improvements. Finally, hyperparameter optimization is a resource-intensive process that requires substantial computational power for continuous parameter refinement. Given that the focus of this study lies in coordinate transformation and non-contact measurement, hyperparameter optimization was not a primary focus.

Thus, the default network parameters were used for training and testing, yielding relatively satisfactory results. Nonetheless, hyperparameter optimization should indeed be considered as one of the future directions of this study, which we mention in the outlook section.

To evaluate the effectiveness of the proposed method, all compared networks were kept consistent in terms of parameters. Additionally, since the Ray Tune method did not significantly improve the network's accuracy during hyperparameter optimization, this study employed the default hyperparameters of YOLOv8.

For network optimization, significant computational resources should be invested to fine-tune the network architecture and hyperparameters, while also increasing the dataset size to further enhance pedestrian detection accuracy.

This is revised in Section 4.2 and Conclusion.
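For context on the automated search the reviewer mentions, an exhaustive grid search (the simplest such strategy, not the authors' Ray Tune setup) can be sketched as follows; `train_fn` is a stand-in for an actual training-and-validation run that returns a validation score.

```python
import itertools

def grid_search(train_fn, grid):
    """Evaluate every hyperparameter combination in `grid` (a dict of
    name -> list of candidate values) and return the best config and
    its score. `train_fn` maps a config dict to a validation score."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective: validation score peaks at lr = 0.01.
grid = {"lr": [0.001, 0.01, 0.1], "momentum": [0.9, 0.95]}
best, score = grid_search(lambda c: -abs(c["lr"] - 0.01), grid)
```

In practice the cost is the number of combinations times one full training run each, which is why the response above notes that such optimization is resource-intensive; Bayesian methods reduce the number of evaluations.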

2. Comment: I would suggest employing multi-modal input data to test the approach (RGB+LiDAR).

Response: Thank you for your professional comment and valuable suggestion. Multi-modal data has a wide range of applications in engineering; in particular, as the reviewer notes, RGB+LiDAR data can provide both the textural information of targets and detailed dimensional coordinates. This combination is frequently used in areas such as autonomous driving, smart construction, and structural health monitoring. The object detection method proposed in this study primarily focuses on detecting small pedestrian targets in UAV images. We therefore tested the effects of different network structures and loss functions on detection performance, analyzed the roles of different modules within the network, and compared them with traditional detection methods. From the perspective of image processing, the proposed method shows a certain improvement in accuracy.

As the reviewer mentioned, multi-modal data can test the robustness of the proposed method and further enhance its generalizability. However, we did not test the method with multimodal data in this study for the following reasons:

  1. Lack of appropriate datasets: Testing a network requires high-quality RGB images and LiDAR datasets, particularly for the inspection of small reservoir dams. While the image dataset can be organized from existing images, it is challenging to obtain high-quality LiDAR data for small reservoirs, and the available point clouds do not provide sufficient data quality.
  2. The method focuses on image processing performance: The general applicability of the network needs to be verified across various data types, including both images and LiDAR data. In this study, we primarily focus on pedestrian detection accuracy, so the ability to detect multimodal data does not play a crucial role in the inspection of small reservoir dams.
  3. LiDAR data acquisition is slower: Unlike images, acquiring LiDAR data typically takes time and is not real-time, which affects the network's detection speed. UAVs often perform inspections at a fast pace, and if LiDAR data is acquired too slowly, it may hinder the inspection process. Therefore, this study did not consider the effectiveness of the proposed method in detecting LiDAR data.

In summary, as the reviewer pointed out, multi-modal data can be a valuable tool for testing the network performance. However, due to the focus of this study on detecting pedestrian targets in images and the need for real-time detection, we did not employ multi-modal data to test the proposed method. Future research should consider combining RGB images with LiDAR data for a more comprehensive analysis of small reservoir dams.

Additionally, methods that integrate image and LiDAR data should be considered to quickly detect targets while simultaneously determining the corresponding point-cloud locations, thereby enabling the detection of multiple targets on complex geometric surfaces.

This is revised in Conclusion.

3. Comment: Authors should add cross-validation to ensure model generalizability.

Response: Thank you for your professional comment; we agree that cross-validation can be used to validate the model's generalizability. Cross-validation evaluates the performance of a network by dividing the dataset into multiple subsets and repeatedly testing the model to reliably estimate its generalization capability. In this study, we applied the K-Fold Cross-Validation method, splitting the data into 5 parts and iteratively training on 4 parts while using the remaining part for validation. As noted by the reviewer, this approach effectively demonstrates the generalization capability of the proposed method.

Moreover, to objectively assess the network's generalization ability, the K-Fold Cross-Validation method was utilized, dividing the 15,680 images into 5 parts. The proposed method was then iteratively trained on 4 parts and validated on the remaining part. The results of the 5 training iterations are presented in Table 3, with an average mAP of 78.2% and F1 of 70.5%. Additionally, the training results showed no significant differences across the different folds, indicating that the proposed model exhibits a strong level of generalization.

Table 3. Cross-Validation results with k = 5 (%).

| Evaluation metrics | Division 1 | Division 2 | Division 3 | Division 4 | Division 5 | Average |
|--------------------|------------|------------|------------|------------|------------|---------|
| F1                 | 70.6       | 71.0       | 70.2       | 70.3       | 70.2       | 70.5    |
| mAP                | 78.1       | 78.6       | 77.9       | 78.0       | 78.4       | 78.2    |

This is revised in Section 4.2.
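The 5-fold protocol described above can be sketched as follows; this is a generic contiguous split over image indices, not the authors' exact partitioning of the 15,680 images.

```python
def kfold_indices(n, k=5):
    """Split range(n) into k near-equal contiguous folds and yield
    (train_indices, val_indices) pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        # Training set is every other fold concatenated.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# Each of the 5 iterations trains on 4/5 of the data and validates on 1/5.
splits = list(kfold_indices(10, k=5))
```

Metrics such as mAP and F1 are then computed per fold and averaged, as in Table 3; shuffling the indices before splitting is common when the dataset is ordered.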

4. Comment: Why not incorporate newer data augmentation techniques for better generalization of the model? Rather dated approaches are used, considering the quite small dataset.

Response: Thank you for your professional comment and valuable suggestion. Data augmentation is a valuable technique for enhancing network performance, particularly in the field of structural health monitoring. Researchers often employ data augmentation to improve network performance, especially in scenarios with small datasets; for multi-scale targets, it can also enhance detection accuracy. This study focuses on detecting pedestrian targets in small reservoir dams, which presents unique challenges. The reasons for not using more advanced data augmentation methods are as follows:

  1. Use of mosaic data augmentation: During training, we applied the Mosaic data augmentation method built into the YOLOv8 framework. This method randomly crops and combines four images into one, enriching the background diversity of detection targets. It also normalizes the calculation by considering the data from four images in one composite image. This approach helps improve the robustness of the model without introducing excessive complexity.
  2. Target-specific detection: As mentioned earlier, this study focuses on detecting small pedestrian targets in UAV images. The network was trained with default parameters, which effectively handles common targets. We also modified the network structure to enhance feature extraction for small pedestrian targets. Despite this, we were concerned that excessive data augmentation, such as scaling, might reduce the detection ability for small targets. Overuse of augmentation could also blur the targets, impacting detection accuracy. Therefore, we avoided additional augmentation methods to maintain focus on small-target detection.
  3. Focus on non-contact measurement: One of the core aspects of this study is the coordinate transformation method and non-contact measurement theory, which occupied a significant portion of the research effort. Although we explored various options, due to limited resources, we prioritized solving the challenges related to coordinate transformation and non-contact measurement over dedicating computational resources to complex data augmentation techniques.

However, as the reviewer pointed out, data augmentation is undeniably a crucial aspect of target detection. To ensure the general applicability of the proposed network, incorporating advanced data augmentation techniques will be necessary to strengthen its performance. Therefore, in future research, we plan to explore and integrate advanced data augmentation methods to enhance the detection of various targets in small reservoir dams.

5. Comment: Add an ablation study to isolate the effects of each model component in Fig. 11 and at different stages (data preprocessing, model architecture, post-processing). Use sensitivity analysis to quantify the impact of removing specific layers or features to catch when it fails to detect objects in dams.

Response: Thank you for your comment. Ablation study is an important method for analyzing the capabilities of detection networks. However, in the original research, Figure 11 represents the entire non-contact measurement process, and we apologize for any ambiguity this may have caused the reviewers.

Specifically, Figure 11 encompasses data acquisition, UAV coordinate transformation, target detection, target 3D localization and result analysis. It is not feasible to ablate a specific module within this process. The UAV coordinate transformation provides critical reference points for self-localization, ensuring its positioning within the coordinate system of a small reservoir dam. Target detection provides the precise locations of pedestrians within the images, enabling the UAV inspection process to focus on specific targets. Meanwhile, target 3D localization determines the 3D coordinates of pedestrians based on the UAV position, laser ranging data and recognized targets. Ablating any single process in this entire workflow would hinder the achievement of the target non-contact measurement.

Furthermore, this study conducted ablation experiments on the detection network, with results presented in Table 4. This process was primarily aimed at analyzing the roles of different modules within the network and their direct impact on accuracy. The results clearly demonstrate that the employed models enhance the network's capability to extract small targets. More importantly, the findings validate the role of the tiny head within the entire network, indicating the feasibility of the proposed improvement pathway.

Lastly, the reviewers requested an analysis of the impact of different network layers on feature extraction until complete recognition failure. We conducted several investigations but found it challenging to quantitatively describe this process. The textural features in different images vary significantly, resulting in differing sensitivities of the same feature layer across diverse images. Additionally, when utilizing the original network structure alone, it is evident from Table 2 and Figure 15 that some targets can still be detected.

To address this, we replaced certain structures from Figure 10, such as C2f, with 1×1 convolutional modules, allowing for validation while maintaining structural integrity. Although the effects vary across different images, the general trend indicates that if all layers from 0 to 6 in the Backbone are replaced, and the same data is used for training, the detection performance deteriorates significantly, resulting in the inability to detect many targets within the images. We have revised the description of Figure 11 to eliminate ambiguity and have detailed the sensitivity analysis process.

The entire procedure effectively translates 2D pixels into 3D coordinates through coordinate transformation, target detection, and localization.

Finally, to analyze the network sensitivity, some layers are replaced with 1×1 convolutional modules. Despite the variations in texture among different images, the results from a substantial number of validation images indicate that replacing all layers from 0 to 6 in the backbone significantly hampers the detection of small pedestrian targets within the images.

This is revised in Section 3.4 and 4.2.
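The layer-replacement probe described above relies on the fact that a 1×1 convolution is simply a per-pixel linear map over channels; swapping a richer module (such as C2f) for it removes spatial feature mixing while preserving tensor shapes. A dependency-light sketch of that operation:

```python
import numpy as np

def conv1x1(x, w):
    """Apply a 1x1 convolution: an independent linear map over the
    channel dimension at every spatial location.
    x: (H, W, C_in) feature map; w: (C_in, C_out) weight matrix."""
    return np.tensordot(x, w, axes=([2], [0]))  # -> (H, W, C_out)

# With identity weights, the feature map passes through unchanged,
# illustrating that a 1x1 conv adds no spatial context of its own.
x = np.arange(24, dtype=float).reshape(2, 3, 4)
y = conv1x1(x, np.eye(4))
```

In the actual sensitivity study, such replacements would be made inside the backbone (e.g., with a deep-learning framework) while keeping strides identical, and the resulting mAP drop quantifies each original module's contribution.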

6. Comment: Use random noise injection to test model resiliency and conduct out-of-distribution tests to assess the model's robustness on unseen data.

Response: Thank you for your professional suggestion; this is indeed a valuable validation method. We employed two methods for introducing noise into the images. The first involved adding Gaussian noise to create varying degrees of blur in the images. Testing revealed that the detection accuracy did not significantly decline; moreover, considering that images in practical situations are unlikely to contain excessive noise, this approach was not adopted in our study. The second involved incorporating background noise, as illustrated in Figure 16, where no pedestrians are present but other disturbances of similar size exist. The testing results indicate that the proposed network did not produce any false positives, demonstrating its strong robustness.

To further analyze the robustness of the proposed method, the trained model was utilized to detect background images, as shown in Fig. 16. Although these background images do not contain pedestrians, they feature other disturbances of similar size. The test results confirm that the proposed network did not generate any false detections, further indicating its robust performance.

This is revised in Section 4.2.
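The Gaussian noise injection used in the first test can be sketched as follows, assuming 8-bit images; the sigma value and seeded generator are illustrative choices, not the authors' exact settings.

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise (std = sigma) to a uint8 image,
    clipping back to the valid [0, 255] range."""
    rng = rng or np.random.default_rng(0)  # seeded for reproducibility
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Corrupt a dummy gray image and feed it to the detector under test.
img = np.full((8, 8, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(img, sigma=5.0)
```

Sweeping sigma upward gives a robustness curve: detection accuracy versus noise level, which is how the "did not significantly decline" observation above would be quantified.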

7. Comment: Also test model performance across different scales of input data (e.g., subsampling) to ensure scalability, considering variations of distance till dams in your dataset and utilize few-shot learning to assess how well the model adapts with limited data.

Response: Thank you for your professional suggestion. Multi-scale data plays a critical role in improving the effectiveness of network training, and we agree that it is essential to test the robustness of the proposed network to accommodate targets of varying sizes. Considering that in real-world scenarios, the distance between the UAV and the target may vary, causing the pedestrian size in the image to differ, the detection network must adapt to multi-scale targets. During the network training process, the following aspects were considered to achieve multi-scale data adaptability:

  1. We ensured that the dataset included pedestrian targets of various scales and images of different sizes. Thus, the entire dataset does not consist only of small pedestrian targets like those shown in Figure 15, but also includes larger pedestrian targets from datasets like COCO. This mix of images helped enhance the robustness of the network.
  2. More importantly, we filtered out extremely small pedestrian images. Tiny targets, such as pedestrians as small as 5×5 pixels, are nearly impossible to detect and are unlikely to appear in real inspections. Therefore, we excluded such small targets from the dataset. If these small targets were downsampled further, they would become even smaller, potentially disrupting network training and not being encountered during actual inspections.
  3. The network architecture includes both large and tiny heads. Unlike traditional three-channel detection, our approach considers the coexistence of both large and small targets. The network structure is designed to ensure that targets of different scales can be accurately detected.
  4. The test dataset also includes both large-scale targets and images of varying sizes, providing a diverse evaluation scenario. As a result, the test outcomes reflect the general applicability of the network and its adaptability to various target sizes.

However, these considerations were not explicitly described in the original text, leading to potential misunderstandings. As the reviewer rightly pointed out, this is an important aspect for validating the network's robustness. We have added these details to the manuscript to clarify:

The dataset includes multi-scale targets, ranging from small pedestrians captured from a distance to larger ones photographed up close. The image sizes used during training and testing vary, further enhancing the robustness of the network.

This is revised in Section 4.2.
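The tiny-target filtering in point 2 above can be sketched as a simple box-size threshold; the `min_side` of 8 pixels is an illustrative cutoff (the response mentions excluding targets near 5×5 pixels), not the authors' exact value.

```python
def filter_tiny_boxes(boxes, min_side=8):
    """Drop bounding boxes (x1, y1, x2, y2) whose width or height is
    below min_side pixels, since such targets are nearly undetectable."""
    return [
        b for b in boxes
        if (b[2] - b[0]) >= min_side and (b[3] - b[1]) >= min_side
    ]

# A 5x5 pedestrian box is removed; a 20x30 one is kept.
kept = filter_tiny_boxes([(0, 0, 5, 5), (0, 0, 20, 30)])
```

Applying such a filter during dataset preparation prevents near-invisible annotations from destabilizing training, as described in the response.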

Thank you again for your time and effort, and for helping us improve the manuscript. The modifications and supplementary network details have been incorporated into the revised paper.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Authors have clarified key points. The paper can be accepted, though I would still recommend a deeper ablation study (minor revision).

Author Response

1. Comment: Authors have clarified key points. The paper can be accepted, though I would still recommend a deeper ablation study (minor revision).

Response: Thank you for your recognition and professional evaluation. We have reconsidered your comment and agree that a deeper ablation study should indeed be conducted. The main approach is to gradually replace the modules in the backbone with ordinary 1×1 convolutions while keeping the stride consistent with the original network, thereby analyzing how much each added module contributes to the improved network. More importantly, this allows us to analyze whether the improved small-object detection head has a significant impact on recognition performance. Through this analysis, we found that the shallow part of the network is very important, mainly reflected in its ability to extract features of small targets. The specific analysis is as follows:

Finally, to analyze the importance of different modules, deeper ablation experiments were conducted. Some modules in the backbone were replaced with 1×1 convolutional layers while maintaining the same stride as the original modules. After modules 6 and 8 were replaced with convolutional modules, mAP decreased by 0.77% and 0.63%, respectively. However, when modules 2 and 4 were replaced, mAP decreased by 1.21% and 1.45%, respectively. This indicates that the effectiveness of the shallow network in extracting small-target features is crucial for the detection results. Despite the variations in texture among different images, the results from a substantial number of validation images indicate that replacing all layers from 0 to 6 in the backbone significantly hampers the detection of small pedestrian targets within the images.

This is revised in Section 4.2.

Thank you again for your time and effort. The supplementary ablation study mentioned by the reviewer has been added to the revised paper, and we hope it satisfies the requirement.
