Article
Peer-Review Record

Light-YOLO: A Study of a Lightweight YOLOv8n-Based Method for Underwater Fishing Net Detection

by Nuo Chen, Jin Zhu * and Linhan Zheng
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2024, 14(15), 6461; https://doi.org/10.3390/app14156461
Submission received: 6 July 2024 / Revised: 19 July 2024 / Accepted: 23 July 2024 / Published: 24 July 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript doesn't clearly explain how their YOLOv8n variant is "lightweight," lacking specifics on how it reduces size or speeds up computations compared to standard YOLO models.

The new attention method (sparse connectivity and deformable convolution) is said to improve accuracy, but the paper doesn't clarify exactly how it helps detect fishing nets better in different underwater conditions.

They report good precision (89.3%), recall (80.7%), and mAP (86.7%), but don't compare their model enough with other top models or test it across a wide range of underwater scenarios.

Using GANs to add more images is good, but they should explain more about how diverse and realistic these new images are and how they improve the model's performance.

Comparing their model with YOLOv8n is useful, but they should give more specific details on how their model is better in terms of performance and computational efficiency (like model size and speed) to fully understand its advantages.

Author Response

1. The manuscript doesn't clearly explain how their YOLOv8n variant is "lightweight," lacking specifics on how it reduces size or speeds up computations compared to standard YOLO models.

Light-yolo is a lightweight variant of YOLOv8n. Our paper aims to build a powerful underwater detector that balances accuracy and speed. We have addressed this point by adding the following specifics:

Lightweight attention mechanism: we introduce an attention mechanism based on sparse connectivity and deformable convolution, which helps the model focus on critical features and improves overall efficiency by pruning unnecessary computational paths. The network also uses a modularized structure, allowing model depth and width to be adjusted flexibly according to task complexity, so that unnecessary computational load is avoided while high accuracy is maintained.

Light-yolo uses a lightweight attention mechanism, improving detection efficiency and enhancing detection accuracy in complex and changing underwater environments, providing strong technical support for safe and efficient underwater operations.

Article rewrite part (yellow background marks the added text): (Page 7, line 227)

3.4. Block diagram of Light-yolo

This part introduces the Light-yolo architecture, depicted in Fig. 5, which enhances yolov8n by substituting its SPPF module with a custom-designed DA2D module. Unlike the SPPF's parallel pooling layers, the DA2D module adapts its sampling locations more flexibly, better capturing target shapes and details, which is ideal for complex geometrical transformations and occlusions. It also incorporates the CoT and Seam modules to boost interaction between the detection head and feature extraction, markedly enhancing performance and efficiency over yolov8n.

Fig. 5 Block diagram of Light-yolo

In summary, the proposed model uses the lightweight attention mechanism introduced in this paper, based on sparse connectivity and deformable convolution, which helps focus on critical features and improves overall efficiency by reducing unnecessary computational paths. A modularized network structure allows flexible adjustment of model depth and width according to task complexity, ensuring that unnecessary computational load is avoided while maintaining high accuracy.
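To illustrate the modular-substitution idea, the following is a minimal sketch in generic PyTorch; this is our own illustrative code, not the authors' build script, and `replace_module` and the referenced module types are hypothetical:

```python
# A generic helper that swaps every occurrence of one module type (e.g. an
# SPPF-style block) for a custom replacement, anywhere in a PyTorch model.
import torch.nn as nn

def replace_module(model: nn.Module, target: type, make_new) -> int:
    """Swap every `target` submodule for `make_new(old)`; return swap count."""
    count = 0
    for name, child in model.named_children():
        if isinstance(child, target):
            setattr(model, name, make_new(child))
            count += 1
        else:
            count += replace_module(child, target, make_new)
    return count

# Hypothetical usage: replace_module(model, SPPF, lambda old: DA2D(old.in_channels))
```

With a helper like this, a DA2D-style block could be dropped in wherever SPPF appears without rebuilding the rest of the network.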

2. The new attention method (sparse connectivity and deformable convolution) is said to improve accuracy, but the paper doesn't clarify exactly how it helps detect fishing nets better in different underwater conditions.

The new attention method mentioned in our paper, based on sparse connectivity and deformable convolution, plays a vital role in improving the accuracy of detecting fishing nets under different underwater conditions.

Sparse connectivity: This mechanism allows the model to focus only on the critical visual features for target detection while ignoring background noise. In underwater environments, where lighting conditions, turbidity, and color are highly variable, sparse connectivity helps the model focus on the nets' critical structural features: the grid pattern and texture, which can be effectively identified even in low visibility conditions.

Deformable convolution: it enhances the model's ability to handle changes in viewing angle and deformation, which is critical for detecting nets whose shape may change due to currents, marine life, or physical damage. By learning offsets, deformable convolution dynamically adjusts the receptive field to capture a more accurate target position, even in complex underwater environments.

We also employ COTAttention and SEAM, additional modules that further enhance the interaction between features and enable the model to understand different levels of contextual information, which is particularly important for recognizing fishing nets that may behave differently under different light and water quality conditions.

These points are also supported by the ablation experiments (discussed later).

Article rewrite part: (Page 4, line 145)

The DA2D module draws on the deformable convolution[20] concept to give the model the ability to adaptively adjust the locations of its attention sampling points according to the input content. This innovation eschews the traditional approach of employing a uniform or fixed sampling strategy over the global feature map. In a specific implementation, the module predicts a set of offsets for each query location and applies these offsets to a predetermined grid of reference coordinates, generating a series of new, contextually relevant sampling locations. This mechanism dynamically focuses on and effectively extracts crucial information, avoids the computational redundancy caused by processing all pixels indiscriminately, and thus improves computational efficiency. Furthermore, the DA2D module adopts a sparse connectivity strategy[21] that incorporates only a small number of parameters into each forward propagation instead of considering all of them; this strategy may increase the parameter count but reduces the number of floating-point operations (FLOPs). With this approach, the study effectively improves broken-region detection performance while keeping the computational effort low, achieving a substantial performance improvement at a small computational cost.

In summary, the DA2D module offers several advantages that make it particularly suitable for underwater target detection. Through its sparse connectivity mechanism, the module lets the model focus on the visual features critical for target detection while ignoring background noise. Given the wide variation in lighting conditions, turbidity, and color in underwater environments, sparse connectivity helps the model concentrate on critical structural features of fishing nets, such as mesh patterns and textures, which can be identified even in low-visibility conditions. In addition, deformable convolution enhances the model's ability to deal with perspective changes and shape variations, which is crucial for detecting nets deformed by water currents, marine organisms, or physical damage. By learning offsets, deformable convolution dynamically adjusts the receptive field to capture more accurate target positions, ensuring precise localization in complex underwater environments.
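To make the offset-and-sample mechanism concrete, here is a minimal PyTorch sketch of the idea; this is our own illustrative code, not the authors' DA2D implementation, and all names and hyperparameters are hypothetical:

```python
# Predict per-query offsets, shift a fixed reference grid, and sample features
# at the shifted locations -- the core of deformable attention sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    def __init__(self, channels: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        # One (dx, dy) pair per sampling point, predicted from the content itself.
        self.offset_head = nn.Conv2d(channels, 2 * n_points, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(channels * n_points, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        offsets = self.offset_head(x).tanh()            # (B, 2*P, H, W), in [-1, 1]
        # Fixed reference grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack((ref_x, ref_y), dim=-1)       # (H, W, 2), (x, y) order
        sampled = []
        for p in range(self.n_points):
            off = offsets[:, 2 * p:2 * p + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = (ref.unsqueeze(0) + off).clamp(-1, 1)
            sampled.append(F.grid_sample(x, grid, align_corners=True))
        return self.proj(torch.cat(sampled, dim=1))     # fuse the P sampled maps

out = DeformableSampling(64)(torch.randn(2, 64, 32, 32))  # -> (2, 64, 32, 32)
```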


3. Using GANs to add more images is good, but they should explain more about how diverse and realistic these new images are and how they improve the model's performance.

Our paper uses the GAN method to expand the dataset, obtaining sufficient data to train the model and improve its capability. We have added the GAN's working principle and training parameters, together with a content analysis of the dataset, so that the realism and diversity of the generated images can be judged more intuitively.

Article rewrite part: (Page 8, line 240)

4 Experimental

4.1 Fishing net vulnerability dataset collection and construction

4.1.1 Expansion of the fishing net vulnerability dataset

This study used a dataset obtained from underwater camera shots taken in a laboratory pool. If not properly addressed, the small number of samples in the annotated dataset significantly limits the detection performance the model can achieve. In addition, with so few images to learn from, the model is prone to overfitting, widening the generalization gap between training and validation loss and limiting effective scaling of the model and further performance improvement.

To solve this problem, a Generative Adversarial Network (GAN) [25] is introduced. A GAN consists of two sub-modules, a generator and a discriminator, which improve their respective performance by competing with each other during training. A schematic diagram of the GAN principle is shown in Figure 6 below.

Fig.6 Schematic diagram of GAN network principle

Generator G receives random noise vectors and, after adversarial training, outputs realistic fishing-net hole images; discriminator D takes the original images and the generated fishing-net hole images as input and judges their authenticity. During training, random noise is first fed into generator G to produce candidate image samples. These generated samples are then fed into discriminator D together with real images for verification. The entire system operates in an adversarial training framework: as training progresses, generator G learns to produce samples ever closer to the real ones, while discriminator D sharpens its ability to distinguish real from fake samples. Training proceeds by minimizing the sum of the loss functions of G and D while continuously updating the parameters of both networks. The detailed parameters of our trained GAN are shown in Table 1:

Table 1. GAN network training parameters

| Parameters | Generator / Discriminator |
|---|---|
| Learning rate | 0.002 |
| Batchsize | 32 |
| Epoch | 300 |
| Optimizer | Adam |
| Loss | BCELoss |
| Generator input noise | 100 |

The model can generate high-quality images of broken fishing nets without destroying the original background, significantly improving the usefulness of the net-hole detection model. Using this method, this paper generated 1756 additional images from the 359 originally labeled images with the GAN, expanding the dataset to 2124 images. A comparison between GAN-generated images and the original images is shown in Figure 7 below.


Fig. 7 Comparison of GAN-generated images and original images
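To make the adversarial training step above concrete, the following is a minimal PyTorch sketch using the Table 1 settings (Adam, learning rate 0.002, BCE loss, 100-dimensional noise, batch size 32); the tiny fully connected G and D are placeholder stand-ins, not the architectures used in the paper:

```python
# One GAN training step: D learns to separate real from generated samples,
# then G learns to fool D; both use Adam (lr 0.002) and BCE loss as in Table 1.
import torch
import torch.nn as nn

noise_dim = 100
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=0.002)
opt_d = torch.optim.Adam(D.parameters(), lr=0.002)
bce = nn.BCELoss()

def train_step(real: torch.Tensor) -> None:
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator: real images -> 1, generated images -> 0.
    fake = G(torch.randn(b, noise_dim)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator into predicting 1 on generated images.
    fake = G(torch.randn(b, noise_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

train_step(torch.rand(32, 64 * 64) * 2 - 1)  # one step on a batch of 32 dummy "images"
```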

4.1.2 Fishing Net Vulnerability Dataset

In this experiment, we used a GoPro8 underwater camera, recording video several times, slicing it into frames, and labeling them manually. The annotation target is the vulnerability class, labeled as "hole"; the holes take various shapes, some narrow, some large voids, and some minor defects, covering the variety of forms seen in the natural environment of the laboratory pool.

Fig. 8 Example image from the fishing net vulnerability dataset

The normalized target size map is shown in Fig. 9a, and Fig. 9b shows the normalized target location map; most targets are small or medium-sized (the darker the color in panels (a) and (b), the more targets). Finally, we divided the data into training and validation sets at a ratio of 9:1, using 1944 frames for training and 180 frames for validation.

Fig. 9 Statistical results for this dataset: (a) normalized target size plot; (b) normalized target location plot

 

4. They report good precision (89.3%), recall (80.7%), and mAP (86.7%), but don't compare their model enough with other top models or test it across a wide range of underwater scenarios.

We added three state-of-the-art models for comparison: RTDETR (CVPR 2024), DDQ (CVPR 2023), and yolov10 (2024). The experimental results show that Light-yolo is a powerful model for underwater fishing net detection that balances model size and detection accuracy.

Article rewrite part: (Tables on page 14, line 356 and page 15, line 378) — the added comparison results are given in Tables 6 and 7, reproduced in full in our response to point 5 below.

5. Comparing their model with YOLOv8n is useful, but they should give more specific details on how their model is better in terms of performance and computational efficiency (like model size and speed) to fully understand its advantages.

To demonstrate the models' runtime behavior, we added FPS measurements for all models. Although Light-yolo is at a disadvantage in the FPS test, roughly 121 frames per second is more than sufficient for detecting fishing net vulnerabilities.

Article rewrite part: (Page 14, line 353 to page 15, line 392)

4.5 Comparison of Conventional Lightweight yolos

To highlight the excellent performance of the Light-yolo, we start with a traditional lightweight yolo family comparison.

Table 6. Comparative experiments of the traditional yolo family

| Models | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| yolov3-tiny[26] | 91.4 | 77.1 | 83.5 | 43.3 | 17.4 | 12.9 | 434.78 |
| yolov7-tiny[27] | 91.1 | 81.3 | 85.5 | 47.8 | 12.3 | 13.2 | 123.46 |
| yolov5n[28] | 91.0 | 77.6 | 85.6 | 49.6 | 9.5 | 7.1 | 833.33 |
| yolov5s | 90.0 | 77.0 | 84.1 | 49.2 | 34.7 | 23.8 | 263.15 |
| yolov8n | 87.6 | 78.1 | 83.7 | 47.6 | 3.0 | 8.2 | 322.58 |
| yolov10n[29] | 90.2 | 78 | 84.5 | 48.7 | 8.06 | 24.8 | 230.49 |
| Light-yolo | 89.3 | 80.7 | 86.7 | 51.8 | 4.4 | 6.1 | 121.95 |

In the comparative analysis of the traditional yolo family, the Light-yolo model demonstrates exceptional performance under a lightweight design, achieving a precision of 89.3% and a recall of 80.7%. Although its precision is slightly lower than yolov3-tiny's and yolov7-tiny's, its recall surpasses all other comparison models, including yolov7-tiny. This result indicates a clear advantage of Light-yolo in correctly identifying targets. On the more detailed metrics, Light-yolo reaches 86.7% mAP@0.5 and scores a high 51.8% under the more stringent mAP@0.5:0.95 standard, leading the other models and showcasing its precision and reliability across object detection scenarios.

In terms of computational efficiency, Light-yolo has only 4.4 million parameters, far fewer than yolov5s (34.7 million) and yolov10n (8.06 million), and its GFLOPs count is only 6.1, versus 23.8 for yolov5s and 24.8 for yolov10n. This low computational complexity makes Light-yolo especially suitable for devices with limited computational resources, such as mobile devices and embedded systems.

Although Light-yolo's frame rate of 121.95 FPS is below that of some other models, such as yolov5n's 833.33 FPS, it is sufficient for most real-time application scenarios, especially video surveillance and mobile devices, while maintaining high accuracy and low computational requirements. In summary, Light-yolo's balance between performance and efficiency makes it an ideal choice for efficient, high-precision target detection in resource-constrained environments, an outstanding performer among lightweight detection models, and a useful reference for research and applications in related fields.
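For transparency about how figures like 121.95 FPS are typically obtained, the following is a rough sketch of an FPS measurement loop; this is our own illustrative protocol (input size, warm-up, and iteration counts are hypothetical), not the authors' exact benchmark, and on GPU torch.cuda.synchronize() calls would be needed around the timers:

```python
# Time repeated forward passes after a warm-up phase and report frames/second.
import time
import torch
import torch.nn as nn

def measure_fps(model: nn.Module, size: int = 640, iters: int = 100) -> float:
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        for _ in range(10):          # warm-up runs, excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return iters / (time.perf_counter() - start)

print(f"{measure_fps(nn.Conv2d(3, 16, 3, padding=1)):.2f} FPS")  # toy stand-in model
```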

4.6 Comparison of other families

After comparing the traditional lightweight yolo family, we move on to comparing other yolo-based improvements and other target detection solutions.

Table 7. Other family comparison experiments

| Models | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPs | FPS |
|---|---|---|---|---|---|
| yoloxs[30] | 87 | 50.7 | 8.05 | 21.8 | 227.69 |
| Gold_yolo[31] | 85.3 | 48.9 | 46 | 21.5 | 148.6 |
| Cascade RCNN[32] | 79.6 | 42.0 | 69.152 | 209.92 | 50.8 |
| RTDETR[33] | 84.6 | 47.8 | 32.80 | 108.0 | 116.32 |
| DDQ[34] | 83.9 | 48.7 | 48.31 | 236.2 | 46.15 |
| Light-yolo | 86.7 | 51.8 | 4.4 | 6.1 | 121.95 |

Compared with the other models, Light-yolo stands out for its balance of accuracy and efficiency: it achieves the highest mAP@0.5:0.95 at 51.8% while having the fewest parameters (only 4.4M) and a GFLOPs count of just 6.1, demonstrating exceptional computational efficiency.

Compared to yoloxs, although yoloxs has the advantage in mAP@0.5 at 87%, Light-yolo outperforms it in mAP@0.5:0.95 with fewer parameters and less computation. Against Gold_yolo, Light-yolo not only exceeds Gold_yolo's 48.9% mAP@0.5:0.95 but is also more efficient in parameters and computation. Similarly, while RTDETR and Cascade RCNN perform well in certain respects, their complexity and higher computational requirements make them less efficient than Light-yolo.

In addition, Light-yolo employs the COTAttention and SEAM modules, which enhance the interaction between features and help the model understand contextual information at different levels. Together with the DA2D module's deformable convolution and sparse connectivity, Light-yolo not only outperforms many competitors in accuracy but also achieves an inference speed of 121.95 FPS, below yoloxs's 227.69 FPS yet highly computationally efficient while maintaining high accuracy, making it an ideal choice when both accuracy and efficiency matter.


Thank you for your constructive comments on our manuscript and for helping us improve the quality of our research. Your time and effort in reviewing our work are greatly appreciated.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors



A lightweight model based on YOLOv8 is proposed in the study "Light-YOLO: A Study of a Lightweight YOLOv8n-Based Method for Underwater Fishing Net Detection" as a method for identifying fishing nets in underwater settings. While the research addresses an important issue in underwater robotics, several aspects could be improved for clarity, rigor, and overall impact.

·         Considering the increasing need for effective underwater inspection techniques, the study's emphasis on a lightweight model for underwater fishing net detection is both contemporary and pertinent.

·         The introduction of the DA2D, COTAttention, and SEAM modules to enhance the model's performance in complex underwater environments is innovative. For real-time applications, the designs of these modules are intended to increase detection accuracy and decrease computing overhead.

·        The paper contains significant experimental data that includes measurements of accuracy, recall, and mAP. The efficacy and efficiency of the model are shown by the ablation study and comparisons with other YOLO variants.

·         The equation of precision presented in (9) is not correct.

·         There are several gaps in the literature review. It does not provide a thorough study of the state-of-the-art techniques for underwater object identification, although mentioning a few relevant papers. A more thorough comparison with current methods might improve the paper's quality.

·        The training and testing dataset is addressed in passing but not in great depth. It would be helpful to include additional details about the attributes of the dataset, including the variety of photos, the kinds of fishing nets, and the difficulties posed by the environment.

·         Some methodological aspects are not fully explained. For instance, the specifics of how the GAN was used to expand the dataset are not thoroughly discussed. More details on the training process, parameter settings, and the rationale behind certain design choices would enhance the paper's transparency and reproducibility.

·         Even though the article provides a range of performance indicators, it would be beneficial to provide additional qualitative findings, including illustrations of fishing net detections in diverse underwater environments. This would help to clarify how well the model performs in real-world scenarios.

·         Although the comparison with other lightweight YOLO models is interesting, it would be more compelling if it also included comparisons with approaches that are not based on YOLO. This would provide a more comprehensive viewpoint on how well the model performs in comparison to other methods in the industry.

·         Although the paper emphasizes the model's computational efficiency, it lacks a detailed analysis of the model's runtime performance on different hardware setups. Providing such details would help assess the model's applicability in various real-world scenarios.

·         Training curves of precision, recall, mAP0.5 and mAP0.5:0.95 are important to see the evolution of all models during each epoch; you must add them.

***check the references, respect template standards

·         With the creation of the Light-YOLO model, the study makes a substantial addition to the area of underwater object identification. However, it may gain from a more detailed methodological explanation, a more extensive comparative study, and a more thorough review of the literature. By resolving these concerns, the paper's overall impact on the discipline would be enhanced, along with its lucidity and meticulousness.


Author Response

1. The equation of precision presented in (9) is not correct.

 

Changed; thank you for pointing it out.
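For reference, the standard definitions that the corrected Equation (9) should match are (with TP, FP, and FN denoting true positives, false positives, and false negatives):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$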

 

2. There are several gaps in the literature review. It does not provide a thorough study of the state-of-the-art techniques for underwater object identification, although mentioning a few relevant papers. A more thorough comparison with current methods might improve the paper's quality.

We added two papers from the 2024 literature, one of which presents AMSP-UOD, an excellent underwater detector from the recent top conference AAAI 2024.

Article rewrite part (yellow background marks the added text): (Page 3, line 87)

In the field of underwater image processing and object detection, Wang Yudong's research[12] evaluates how various underwater image enhancement techniques affect object detection performance. Although it offers a detailed analysis of these techniques' effects on detection, it does not consider their effectiveness in extreme conditions such as very low-light or highly turbid waters, limiting its broader applicability. Jian Zhang's research[13] developed an RTMDet[14] framework with a CSPLayer[15] that incorporated the BoT3[16] module and the MHSA mechanism, enhancing the network's capture of contextual information and its detection accuracy; this approach achieved an impressive mAP@0.5 of 0.879 on the URPC2019 and URPC2020 datasets, but it notably increases computational complexity, potentially limiting its use in real-time or resource-constrained scenarios. The study of Chen, J[17] replaced the yolo backbone with Deformable Convolution Network (DCN) version 3 and added fused spatial, scale, and channel features, finally achieving an mAP@0.5 of 0.867 on the DUO dataset; however, the increased computational complexity might likewise limit real-time or resource-constrained applications. The study of Zhou, J[18] proposed AMSP-UOD, a single-stage underwater object detection network combining eddy convolution (AMSP-VConv) and feature association decoupling (FAD-CSP) modules, which significantly improved detection accuracy and noise immunity in complex underwater environments by optimizing feature extraction and noise reduction, reaching an AP50 of 86.1% on the RUOD dataset.


3. Some methodological aspects are not fully explained. For instance, the specifics of how the GAN was used to expand the dataset are not thoroughly discussed. More details on the training process, parameter settings, and the rationale behind certain design choices would enhance the paper's transparency and reproducibility.

 

Our paper uses the GAN method to expand the dataset, obtaining sufficient data to train the model and improve its capability. We have added a schematic diagram of the GAN principle and the training parameters to show the dataset expansion process more intuitively.

Article rewrite part: (Page 8, line 240) — the revised Section 4.1.1 text, Table 1, and Figures 6 and 7 are identical to those quoted in our response to Reviewer 1, point 3 above.

4. The training and testing dataset is addressed in passing but not in great depth. It would be helpful to include additional details about the attributes of the dataset, including the variety of photos, the kinds of fishing nets, and the difficulties posed by the environment.

We enriched the presentation of the dataset by adding an analysis of the underwater fishing net vulnerability dataset, including specific information such as the positions and sizes of the annotated boxes.

Article rewrite part: (Page 9, line 265) — the revised Section 4.1.2 text and Figures 8 and 9 are identical to those quoted in our response to Reviewer 1, point 3 above.

5. Even though the article provides a range of performance indicators, it would be beneficial to provide additional qualitative findings, including illustrations of fishing net detections in diverse underwater environments. This would help to clarify how well the model performs in real-world scenarios.

We added specific photos of underwater fishing net vulnerabilities to demonstrate the difficulty of underwater identification more visually.

Article rewrite part: (Page 10, line 270) — see Fig. 8, quoted in our response to Reviewer 1, point 3 above.

6. Although the comparison with other lightweight YOLO models is interesting, it would be more compelling if it also included comparisons with approaches that are not based on YOLO. This would provide a more comprehensive viewpoint on how well the model performs in comparison to other methods in the industry.

We additionally added two state-of-the-art non-yolo models for comparison, RTDETR (CVPR 2024) and DDQ (CVPR 2023), as well as the recently released yolov10 (2024). The experimental results show that Light-yolo is a powerful model for underwater fishing net detection that balances model size and detection accuracy.

Article rewrite part: (Tables on page 14, line 356 and page 15, line 378) — identical to Tables 6 and 7 quoted in our response to Reviewer 1, point 5 above.

7. Although the paper emphasizes the model's computational efficiency, it lacks a detailed analysis of the model's runtime performance on different hardware setups. Providing such details would help assess the model's applicability in various real-world scenarios.

To demonstrate the models' runtime behavior, we added FPS measurements for all models. Although Light-yolo is at a disadvantage in the FPS test, roughly 121 frames per second is more than sufficient for detecting fishing net vulnerabilities.

Article rewrite part: (Page 14, line 353 to page 15, line 392) — the revised Sections 4.5 and 4.6, including Tables 6 and 7, are identical to those quoted in our response to Reviewer 1, point 5 above.

8. Training curves of precision, recall, mAP0.5 and mAP0.5:0.95 are important to see the evolution of all models during each epoch; you must add them.

All the training curves are already provided in Section 4.4, Experimental Results (Page 12, line 323); thank you for your comment.

 

 

9. Check the references, respect template standards.

I have rechecked all the literature citations; thanks for pointing this out!


Thank you for your constructive comments on our manuscript and for helping us improve the quality of our research. Your time and effort in reviewing our work are greatly appreciated.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This revised version can be accepted.

Reviewer 2 Report

Comments and Suggestions for Authors

I have no further comments, the authors have responded to my previous remarks. I recommend a minor revision of the layout.
