3.1. An Improved Approach to Feature Fusion
In underwater environment monitoring, fishing nets are damaged by the impact of water currents, and the resulting broken regions have a complex, variable morphology that makes them difficult to distinguish. Traditional methods introduce attention mechanisms to improve recognition accuracy [19]. However, the global attention mechanism requires computation-intensive interactions, which conflicts with the limited computational capacity of edge devices.
Therefore, this paper proposes DA2D, a low-computation attention module optimized for underwater scenes. DA2D is a flexible and efficient self-attention mechanism that dynamically adapts the positions of keys and values according to the input data, allowing the model to prioritize potential hotspot regions, focus on relevant areas, and capture more informative features. The flowchart of the DA2D sub-network module is shown in
Figure 1.
Building on the self-attention mechanism, DA2D first generates a set of uniformly distributed reference points on the feature map and then uses an offset network to learn offsets for these points from the query features. The offset network module diagram is shown in
Figure 2.
For each query q, the offset vectors of all reference points are computed from the query features by the offset network O, as shown in Equation (1):

$$\Delta p = O(q) \quad (1)$$

where $\Delta p$ denotes the new offset of each reference point relative to its original position p. In the feature sampling and deformation phase, the offset network takes the query features as inputs and samples the feature map x by bilinear interpolation at the offset positions to obtain the deformed keys and values, as shown in Equations (2) and (3):

$$\tilde{k}^{(m)} = F(x;\, p + \Delta p) \quad (2)$$

$$\tilde{v}^{(m)} = G(x;\, p + \Delta p) \quad (3)$$

where (m) denotes the m-th attention head, and F and G are the functions that map the deformed positions back to the feature map to extract the keys and values, respectively. In this way, the corresponding offsets are calculated for each reference point. A relative positional bias is also calculated from the deformed points to enhance the multi-head attention mechanism and to output the transformed feature representation. The multi-head attention computation follows the principle of the standard transformer module, except that the keys and values have already undergone the deformation operations described above, and the relative position bias B computed from the deformed points is added to the attention logits. The formula is provided as follows:

$$z^{(m)} = \sigma\!\left(\frac{q^{(m)}\,\tilde{k}^{(m)\top}}{\sqrt{d}} + B\right)\tilde{v}^{(m)}$$
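The offset prediction and bilinear sampling steps described by Equations (1)-(3) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the layer sizes, the `tanh` offset scaling, and the use of adaptive pooling to align the query features with the reference grid are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampler(nn.Module):
    """Sketch of Eqs. (1)-(3): predict per-reference-point offsets from the
    query features, then bilinearly sample the deformed key/value features."""
    def __init__(self, channels, n_ref=7):
        super().__init__()
        self.n_ref = n_ref  # reference grid is n_ref x n_ref
        # offset network O(q): depthwise conv -> GELU -> 1x1 conv -> (dx, dy)
        self.offset_net = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.GELU(),
            nn.Conv2d(channels, 2, 1),
        )

    def forward(self, q_feat, feat):
        # uniform reference points in normalized coordinates [-1, 1]
        ys = torch.linspace(-1, 1, self.n_ref)
        xs = torch.linspace(-1, 1, self.n_ref)
        ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
        ref = ref.flip(-1)  # grid_sample expects (x, y) order
        # align query features with the reference grid, then predict offsets
        q_small = F.adaptive_avg_pool2d(q_feat, self.n_ref)
        delta = self.offset_net(q_small)                 # (B, 2, n, n)
        delta = delta.permute(0, 2, 3, 1).tanh() / self.n_ref  # keep offsets small
        grid = ref.unsqueeze(0).to(feat) + delta         # deformed positions p + Δp
        # bilinear sampling at the deformed positions (the role of F / G)
        return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```

The sampled output then feeds the key and value projections of the attention step.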
where σ is a softmax function that normalizes the attention weights at each position, and d is half of the feature dimension of each attention head. By projecting queries, keys, and values into separate subspaces and performing independent attention computations, the model can capture different kinds of dependencies simultaneously. DA2D further introduces the ability to dynamically adjust the key and value locations, making the attention more focused on essential regions.
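The multi-head computation over the deformed keys and values can be sketched as below. This is an illustrative PyTorch function, not the paper's code: the projection matrices are passed in as plain tensors, `d` is taken here as the per-head dimension, and the relative positional bias term is noted but omitted.

```python
import math
import torch

def deformable_mha(q_feat, sampled, w_q, w_k, w_v, num_heads):
    """Sketch of the attention step: keys/values are projected from the
    already-deformed sampled features; sigma is the softmax."""
    B, N, C = q_feat.shape
    Ns = sampled.shape[1]
    d = C // num_heads  # per-head feature dimension
    q = (q_feat @ w_q).view(B, N, num_heads, d).transpose(1, 2)    # (B, M, N, d)
    k = (sampled @ w_k).view(B, Ns, num_heads, d).transpose(1, 2)  # (B, M, Ns, d)
    v = (sampled @ w_v).view(B, Ns, num_heads, d).transpose(1, 2)
    # a relative position bias computed from the deformed points would be
    # added to these logits before the softmax; omitted for brevity
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, N, C)
```

Each head attends independently within its own subspace, then the heads are concatenated back to the full feature dimension.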
Drawing on the concept of deformable convolution [20], the DA2D module gives the model the ability to adaptively adjust the locations of its attention sampling points according to the input content. This innovation eschews the traditional approach of employing a uniform or fixed sampling strategy over the global feature map. In the implementation, the module predicts a set of offsets for each query location and applies these offsets to a predetermined grid of reference coordinates, generating a series of new, contextually relevant sampling locations. This mechanism dynamically focuses on and extracts crucial information, avoids the computational redundancy of processing all pixels indiscriminately, and thus improves computational efficiency. Further, the DA2D module adopts a sparse connectivity strategy [21] that involves only a small subset of parameters in each forward pass rather than all of them; this may increase the parameter count but reduces the number of floating point operations (FLOPs). With this approach, the study effectively improves broken-region detection performance while keeping the computational cost low, achieving a substantial performance gain from a small computational budget.
In summary, the DA2D module offers several advantages that make it particularly suitable for underwater target detection. Through its sparse connectivity mechanism, the module lets the model focus on the visual features critical for target detection while ignoring background noise. Given the wide variations in lighting, turbidity, and color in underwater environments, sparse connectivity helps the model concentrate on key structural features of fishing nets, such as mesh patterns and textures, which can then be identified even in low-visibility conditions. In addition, deformable convolution enhances the model's ability to handle perspective and morphology changes, which is crucial for detecting nets whose shape has been altered by water currents, marine organisms, or physical damage. By learning the offsets, deformable convolution dynamically adjusts the receptive field to capture more accurate target positions, ensuring precise localization in complex underwater environments.
3.2. Introducing an Attention Mechanism
In the field of underwater environmental monitoring, objects such as fishing nets often present multiple elongated slit features. Traditional convolutional neural networks (CNNs) may fail to recognize and model such features effectively because their modelling is inherently local: CNNs lack the ability to model long-range dependencies, which is crucial for accurately recognizing and processing elongated features. To address this problem, we introduce the transformer module, which has powerful global information modelling capabilities. However, the self-attention mechanism in the standard transformer architecture computes attention weights mainly through the interaction between queries and keys, without fully considering the interconnections between keys. For this reason, we adopt the CoT (contextual transformer) block [22], which is designed to mine the contextual information of the keys and use it to guide the computation of dynamic attention weights. In this way, the CoT block effectively enhances the model's ability to process visual representations, combining the advantages of contextual information mining and the self-attention mechanism. As shown in
Figure 3, the implementation of this structure significantly improves the model’s recognition and modelling effect on slender gap features.
In the initial stage of processing the input features, we set three key variables, K, Q, and V, and let K and Q initially share the same input value, X. To capture and represent features with local context, we perform a k × k grouped convolution on K, which yields an augmented key (denoted as K*). This step statically models the local information. Further, we merge K* with Q, integrating the local contextual information with the original query information. Immediately after that, we perform two successive rounds of convolution on the merged result; these steps further refine and optimize the fused feature representation, enhancing the model's ability to represent the input data.
Unlike traditional self-attention mechanisms, the construction of the attention matrix A in our approach relies not only on the direct relationship between query and key but is realized through the interaction between Q and the locally context-augmented K*. This design allows the attention mechanism to incorporate local contextual information in addition to the direct query-key connection, significantly improving the performance of the self-attention mechanism. Next, by multiplying this dynamically generated attention map with the value vector V, we implement dynamic context-based modelling. This process allows the model to dynamically adjust its emphasis on information, which in turn significantly improves the quality of the feature representation.
Ultimately, the contextual transformer (CoT) module effectively fuses local and global contexts by integrating features obtained from local static context modelling with those obtained based on dynamic context modelling. The CoT module is able to better distinguish fishing nets from the background through its ability to efficiently integrate contextual information and maintain a high recognition accuracy even in the case of poor visibility. In addition, dynamic context modelling allows the model to adapt to the changes of the fishing nets in different scenarios, such as the changes in the state of the net under the action of water currents.
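The static/dynamic context pipeline described above can be sketched as a small PyTorch module. This is a hedged illustration of the idea, not the CoT authors' implementation: the group count, the ReLU between the two 1×1 convolutions, and the additive fusion at the end are illustrative choices.

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    """Sketch of the CoT idea: K and Q start from the input X; a k x k grouped
    convolution on K gives the static context K*; [K*, Q] passes through two
    successive convolutions to form the attention map A, which weights V."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.key_conv = nn.Conv2d(channels, channels, k, padding=k // 2,
                                  groups=4, bias=False)          # K* (static context)
        self.value_conv = nn.Conv2d(channels, channels, 1, bias=False)  # V
        self.attn = nn.Sequential(                               # two successive convs
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        k_star = self.key_conv(x)                     # local static context
        v = self.value_conv(x)
        a = self.attn(torch.cat([k_star, x], dim=1))  # A from [K*, Q], with Q = X
        a = torch.softmax(a.flatten(2), dim=-1).view_as(a)
        dynamic = a * v                               # dynamic context modelling
        return k_star + dynamic                       # fuse static + dynamic contexts
```

The returned feature thus combines the grouped-convolution static context with the attention-weighted dynamic context.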
3.3. Integration of the SEAM Module
In studying the detection of underwater fishing net vulnerabilities, the recognition system may miss breaks when water currents superimpose a vulnerability on the intact net behind it; we therefore developed techniques to enhance detection accuracy. We incorporate the SEAM module [23] into the YOLO detection framework to enhance the spatial transformation invariance of the model. SEAM effectively bridges the gap between fully supervised and weakly supervised semantic segmentation by constructing a twin-network structure with shared weights to achieve covariant regularization. In this structure, one branch directly processes the original input image, while the other branch first applies an affine transformation (e.g., scaling, rotation, or flipping) to the input image before forward propagation. This design ensures that the generated class activation maps (CAMs) remain consistent when the input image is transformed, mimicking the way pixel-level labels transform with the image under fully supervised conditions. SEAM introduces covariant regularization so that the CAMs predicted from the variously transformed images [24] provide self-supervision for network learning:

$$F(A(I)) \approx A(F(I))$$

where F(·) denotes the network, A(·) denotes the affine transformation, and I is the input image.
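The covariant-regularization constraint on F(·) and A(·) can be sketched as a consistency loss. This is a minimal illustration under assumed names (`net` returns a CAM tensor, a horizontal flip stands in for the affine transform A); the actual SEAM loss and transform set may differ.

```python
import torch

def equivariance_loss(net, image, affine):
    """Sketch of covariant regularization: the CAM of a transformed image
    should match the transformed CAM of the original, F(A(I)) ~= A(F(I))."""
    transformed_cam = affine(net(image))  # A(F(I))
    cam_of_transformed = net(affine(image))  # F(A(I))
    return (transformed_cam - cam_of_transformed).abs().mean()  # L1 consistency

# usage: a horizontal flip as the affine transform A(.)
flip = lambda t: torch.flip(t, dims=[-1])
```

For a perfectly equivariant network the loss is zero; minimizing it pushes the CAMs toward behaving like pixel-level labels under transformation.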
Further, SEAM introduces a pixel correlation module (PCM) at the end of the network to refine the CAM by capturing the contextual information of each pixel through a self-attention mechanism. The PCM measures feature similarity between pixels using the cosine distance and computes affinity by normalizing the inner product in the feature space, optimizing the CAM to fit object boundaries more accurately. This approach not only ensures the consistency of the CAM under different transformations but also effectively improves detection performance by integrating low-level features and adjusting inter-pixel similarity. The block diagram of the PCM module is shown in
Figure 4.
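The PCM refinement step can be sketched as follows: an inter-pixel affinity matrix is built from cosine similarity of the features, row-normalized, and used to propagate CAM scores between similar pixels. This is an illustrative sketch, not the reference implementation; the ReLU on the similarity and the normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def pcm_refine(feat, cam):
    """Sketch of the pixel correlation module: cosine-similarity affinity
    between pixels, normalized and applied to the CAM so that activations
    better fit object boundaries."""
    B, C, H, W = feat.shape
    f = F.normalize(feat.flatten(2), dim=1)        # (B, C, HW), unit-norm per pixel
    affinity = torch.relu(f.transpose(1, 2) @ f)   # (B, HW, HW) cosine similarity
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    cam_flat = cam.flatten(2)                      # (B, K, HW) class activation maps
    refined = cam_flat @ affinity.transpose(1, 2)  # each pixel pools similar pixels
    return refined.view_as(cam)
```

Each output pixel is a weighted average of the CAM values at feature-similar pixels, which smooths activations within objects while respecting boundaries.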
When integrating SEAM and PCM into the YOLO framework, the PCM offers the following advantages. By removing unnecessary skip connections, reducing parameters, and using the ReLU activation function instead of the sigmoid, the model structure is streamlined, overfitting is avoided, and the model's sensitivity to contextual information is improved. These improvements enable the model to identify and localize vulnerabilities more efficiently, so that broken regions of underwater fishing nets are accurately detected even under different spatial transformations, enhancing the accuracy of underwater fishing net detection.
3.4. Block Diagram of Light-YOLO
This part introduces the Light-YOLO architecture, depicted in
Figure 5, which enhances YOLOv8n by substituting its SPPF module with the custom-designed DA2D module. Unlike the SPPF's parallel pooling layers, the DA2D module adjusts its sampling more flexibly, better capturing target shapes and details, which makes it well suited to complex geometric transformations and occlusions. It also incorporates the CoT and SEAM modules to strengthen the interaction between the detection head and feature extraction, markedly enhancing performance and efficiency over YOLOv8n.
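The module-substitution pattern can be sketched generically as below. This is purely illustrative: `DA2DBlock` is a placeholder stand-in (not the paper's module), and the toy `nn.Sequential` backbone stands in for the real YOLOv8n graph, where the SPPF stage would be located by its position in the model definition.

```python
import torch
import torch.nn as nn

class DA2DBlock(nn.Module):
    """Placeholder for a DA2D-style attention stage with an unchanged
    channel interface (same in/out width as the stage it replaces)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in op
    def forward(self, x):
        return self.body(x)

def replace_stage(backbone: nn.Sequential, index: int, channels: int):
    """Swap the stage at `index` (e.g., where SPPF sits) for a DA2D-style
    block, keeping the surrounding channel widths intact."""
    backbone[index] = DA2DBlock(channels)
    return backbone
```

Because the replacement preserves the channel interface, the rest of the backbone and the detection head need no changes.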
In summary, the proposed model uses the lightweight attention mechanism introduced in this paper, based on sparse connectivity and deformable convolution, which not only helps the network focus on critical features but also improves overall efficiency by reducing unnecessary computational paths. A modularized network structure allows flexible adjustment of model depth and width according to task complexity, reducing unnecessary computational load while maintaining high accuracy.