RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection

Feng, Guanbo; Xiong, Zhixin; Pang, Hongshuai; Gao, Yunlei; Zhang, Zhiqiang; Yang, Jiapeng; Ma, Zhihong

doi:10.3390/fishes9080294

Open AccessArticle

RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection

by

Guanbo Feng

¹

,

Zhixin Xiong

²,

Hongshuai Pang

³,

Yunlei Gao

¹,

Zhiqiang Zhang

¹,

Jiapeng Yang

⁴ and

Zhihong Ma

^5,*

¹

YuHang Smart Aquaculture Research Center, Hangzhou 311100, China

²

College of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian 116600, China

³

College of Information Engineering, Dalian Ocean University, Dalian 116023, China

⁴

Suzhou Jiean Information Technology Co., Ltd., Suzhou 215125, China

⁵

College of Agriculture and Biotechnology, Zhejiang University, Hangzhou 310058, China

^*

Author to whom correspondence should be addressed.

Fishes 2024, 9(8), 294; https://doi.org/10.3390/fishes9080294

Submission received: 22 June 2024 / Revised: 13 July 2024 / Accepted: 17 July 2024 / Published: 24 July 2024

(This article belongs to the Special Issue Intelligent Recognition Research for Fish Behavior)

Download

Browse Figures

Versions Notes

Abstract

:

Underwater object detection is essential for the advancement of automated aquaculture operations. Addressing the challenges of low detection accuracy and insufficient generalization capabilities for underwater targets, this paper focuses on the development of a novel detection method tailored to such environments. We introduce the RTL-YOLOv8n model, specifically designed to enhance the precision and efficiency of detecting objects underwater. This model incorporates advanced feature-extraction mechanisms—RetBlock and triplet attention—that significantly improve its ability to discern fine details amidst complex underwater scenes. Additionally, the model employs a lightweight coupled detection head (LCD-Head), which reduces its computational requirements by 31.6% compared to the conventional YOLOv8n, without sacrificing performance. Enhanced by the Focaler–MPDIoU loss function, RTL-YOLOv8n demonstrates superior capability in detecting challenging targets, showing a 1.5% increase in [email protected] and a 5.2% improvement in precision over previous models. These results not only confirm the effectiveness of RTL-YOLOv8n in complex underwater environments but also highlight its potential applicability in other settings requiring efficient and precise object detection. This research provides valuable insights into the development of aquatic life detection and contributes to the field of smart aquatic monitoring systems.

Keywords:

artificial intelligence; underwater object detection; automated aquaculture; RTL-YOLOv8n model; feature extraction; balancing precision and light-weighting

Key Contribution: The model incorporates advanced feature-extraction mechanisms and a lightweight coupled detection head, resulting in a 31.6% reduction in computational requirements while improving detection precision by 5.2% and increasing [email protected] by 1.5% compared to conventional YOLOv8n models.

1. Introduction

As the global demand for marine resources continues to grow, mariculture has demonstrated its importance in food production, while also bringing about an urgent need for efficient technologies [1,2]. Traditional mariculture refers to the cultivation of marine organisms in large-scale facilities under artificially controlled conditions [3,4]. These methods are characterized by high labor intensity, inability to monitor around the clock, and low management efficiency [5]. Moreover, as the scale of farming continues to expand, relying on manual monitoring becomes increasingly difficult and inefficient, failing to meet the demands of modern mariculture for precise and efficient management [3]. With the development of deep-learning techniques, the introduction of underwater target detection technology has provided innovative means for managing and monitoring mariculture. This technology can efficiently and accurately identify biological targets in farming in complex and variable underwater environments, which is crucial for disease warning, biological size measurement, and biomass assessment by farmers [6,7,8]. Underwater target detection technology, through real-time and precise underwater monitoring, not only significantly enhances farming efficiency and biological safety but also reduces labor costs and provides strong support for marine ecological protection [1,9]. Traditional target detection technologies mainly rely on several typical handcrafted feature extraction and classification techniques: edge detection algorithms [10], shape matching techniques [11], texture analysis [12], support vector machines (SVM) [13], decision trees [14], k-nearest neighbors (k-NN) [15], etc. However, these methods are limited in complex and variable underwater environments by factors such as light, turbidity, and underwater suspended solids [16], and it is difficult for the handcrafted feature extraction to capture the comprehensive features of targets with insufficient generalization ability [17], resulting in significant limitations in accuracy and robustness [18]. Therefore, although traditional technologies are effective in stable and simple scenarios, they are difficult to meet the accuracy requirements of detection in the complexity of the marine environment [19,20,21].

In recent years, deep-learning technology has gradually become an effective means of solving these problems due to its ability to automatically learn features [22]. Deep learning, through the construction of deep neural networks, can automatically learn and extract high-level features from data, overcoming the limitations of manual feature extraction. Deep-learning models can identify complex patterns and features in large datasets, significantly enhancing the ability to recognize and detect targets in complex environments. However, despite the significant progress made in target detection, the balance between model accuracy and the number of parameters remains a pressing issue. As the depth and complexity of networks increase, the number of parameters and computational demands also rise sharply, making the deployment of models in practical applications challenging. Furthermore, deep-learning models with a large number of parameters require substantial hardware resources for training and inference, limiting their applicability in resource-constrained environments. Therefore, optimizing network structures, introducing new lightweight techniques, and employing efficient feature-extraction methods are crucial for the widespread application and implementation of deep-learning technologies in real-world scenarios [23].

In this study, we propose the RTL-YOLOv8n structure and effectively solve the balance problem between model light-weighting and high accuracy by combining it with the YOLOv8n framework. Specifically, to address the issue of difficult identification of underwater targets, the RetBlock and triplet attention structures effectively enhance the model’s ability to extract local and global features, making detection results more accurate, even for smaller targets. While ensuring detection accuracy, this study significantly reduces the computational burden of the model and improves the problem of model complexity by removing or simplifying large kernel-size convolution layers and unnecessary high-resolution feature layers. Finally, we adjust the accuracy of the detection boxes through the Focaler–MPDIoU loss function, resulting in better detection results. The following is a list of our contributions:

Improved the feature-extraction method in the backbone part—using the spatial-decay matrix and Manhattan self-attention mechanism (MaSA), which reduces the computational burden while improving accuracy, addressing the efficiency challenges in global information modeling.
Enhanced the feature-rich representation and improved the model’s ability to capture definitive features through a parameter-free attention mechanism. Focusing on capturing cross-dimensional interactions addresses the limitations of existing methods, achieving significant performance improvement with minimal computational overhead, and enhancing the model’s network feature-extraction capabilities.
Designed a lightweight coupled detection head that significantly reduces the number of parameters by using shared convolutions. By using the scale layer to scale different features, the detection head ensures a balance between model accuracy and complexity with reduced computational cost.
Combined Focaler–MPDIoU to enhance detector performance by focusing on the detection box accuracy in object detection tasks and addressing the distribution of easy and hard samples, effectively improving detection accuracy.

2. Related Work

The development of convolutional neural networks (CNNs) has shown outstanding performance in addressing the inherent challenges of underwater environments, such as poor visual visibility, changing lighting conditions, and the dynamic nature of aquatic organisms [24]. This is particularly evident in the work of Wang et al. [25]: the improved YOLOv7 model integrates the AcmixBlock module, skip connections, 1 × 1 convolution architecture, ResNet-ACMIX module, and global attention mechanism (GAM) to improve detection accuracy and adapt to underwater targets. However, this model still exhibits false detections and missed detections in complex underwater environments, and the large number of parameters makes it unsuitable for industrial deployment. The GGS lightweight feature-extraction module marks an important contribution towards reducing model complexity while maintaining attention to target information [26]. In the backbone part, traditional convolution, regularization, and activation functions are used to retain good features. The original PANet convolution was replaced with DWConv, and then multiple GGS feature extractions were used, significantly saving computational costs. This method effectively reduces model complexity but results in a significant decrease in accuracy compared to the original YOLOv5. Similarly, the FFT_YOLOX model proposed by Wang Peng et al. [27] uses the FFT_Filter module for global feature extraction. By replacing the standard 3 × 3 kernel in the original YOLOX model with the FFT_Filter, performing 2D FFT transforms the feature map into the spectral domain. The FFT_Filter captures global features by operating in the frequency domain and increases the receptive field of FFT by learning the global interaction of spatial positions. Although this model ensures a balance between accuracy and complexity, the model parameters are reduced by only about 5%, and [email protected] increases by 0.3%, which is not significant. In the field of underwater target detection, to enhance detection accuracy, models often need to introduce more parameters and deeper network structures, which increases model complexity and computational requirements [28,29]. On the other hand, model light-weighting aims to reduce computational resource consumption and improve execution efficiency, usually achieved by simplifying network structures and reducing the number of parameters, but this can lead to a decrease in detection accuracy. Therefore, balancing model complexity and light-weighting to achieve both fast processing and high detection accuracy is an ongoing challenge [30,31]. In summary, current research has made some progress in improving the accuracy of underwater target detection and reducing model complexity, but many challenges remain. Our proposed RTL-YOLOv8n model, by integrating more advanced modules and optimization strategies, better addresses the complexities of underwater environments. It significantly reduces model complexity while improving detection accuracy. Specifically, RTL-YOLOv8n maintains a lightweight design and, through improved feature extraction and a global attention mechanism, effectively reduces computational resource consumption while enhancing the capture of target information. This balance between model complexity and detection accuracy demonstrates superior performance. Therefore, we believe that RTL-YOLOv8n can better address the challenges faced by existing models in underwater target detection, thereby advancing the field further.

3. Materials and Methods

3.1. Data Source

The two datasets in this study are sourced from the Roboflow public dataset, published by Yifan Wu and g18L5754, respectively. To verify the generalization of the model for detecting difficult underwater targets, we selected sea cucumbers, shells, sea urchins, starfish, and fish datasets for model validation. Due to some inaccurate annotations in the first dataset, we manually verified all images before the experiment, used auto-orientation, and cropped the image size to 640 × 640, removing invalid annotations and images. Finally, the first dataset contains a total of 7599 images, from which we randomly selected 5319 images as the training set, 760 images as the test set, and 1520 images as the validation set. The first dataset showcases a variety of underwater environments. The primary color tones of the images range from blue-green to deep green, reflecting different water qualities and lighting conditions. Some images are brightly colored, showing shallower underwater areas, while others are darker, possibly from deeper waters. Influenced by the lighting conditions and water quality of the underwater environment, some images may be blurred or affected by noise due to suspended particles or underwater microparticles. Overall, these characteristics enable the dataset to cover a wide range of underwater scenes. The second dataset contains a total of 1879 images, from which we randomly selected 1315 images as the training set, 376 images as the validation set, and 188 images as the test set. And it primarily presents the unique blue-green hue of underwater environments. The color is relatively consistent, with the intensity and brightness of the blue varying due to the shooting depth and lighting conditions. There are also differences in the size and density of the target objects in the images.

3.2. Method

In underwater environments, light propagation is affected by multi-media refraction, and information about the appearance and position of objects is easily blurred by the water medium. Additionally, issues such as the presence of planktonic interference make target detection more challenging. Therefore, the model needs to have a strong ability to capture spatial information and employ effective detection methods in complex environments. To this end, we improved YOLOv8n, as shown in Figure 1. The star symbol represents RetBlock, the heart symbol represents triplet attention, the moon shape represents the Lightweight Coupled Detection head, and the diamond represents the Focaler–MPDIoU loss function. First, we improved the feature-extraction method in the Backbone part with RetBlock to reduce computational complexity and enhance global and local feature-extraction capabilities. Then, we used triplet attention to further enhance the feature fusion method in the neck part, improving the model’s ability to detect multi-scale targets in complex environments. Third, considering the issues of class imbalance, positive sample occlusion, and close proximity in the underwater dataset, we proposed the Focaler–MPDIoU method to give higher attention to difficult targets. Finally, to ensure the balance between model complexity and accuracy, we designed a lightweight coupled detection head that reduces parameters and ensures accuracy through shared convolutions.

3.2.1. YOLOv8n Baseline

Compared to previous versions, YOLOv8n introduces the C2f module in place of the original C3 module. The C2f module is inspired by the design concepts of the C3 and ELAN modules and enhances the model performance and richness of the gradient streams by adding more jump connections and additional diversion operations [32,33]. YOLOv8n uses a novel unanchored and decoupled head structure to separate the classification and regression tasks, which helps the model to focus more on their respective tasks and improve the detection accuracy [32,34]. This enables YOLOv8n to further improve detection accuracy while maintaining high detection speed.

3.2.2. RetBlock

RetBlock enables the model to perceive the relative positional relationships between tokens by utilizing a spatial-decay matrix, thereby better considering spatial information when processing images, as shown in Figure 2. This distance-based weight adjustment helps the model focus more effectively on important areas of the image, reducing the interference of common visual noise in underwater environments, such as turbidity and suspended particles, on detection performance. The core of RetBlock is to improve image recognition performance by introducing a spatial-decay matrix and an attention decomposition form. The spatial-decay matrix is defined based on the Manhattan distance, and it is used to introduce explicit spatial priors into the self-attention mechanism. First, a two-dimensional decay matrix is defined as

D_{2 d}

whose elements

D_{2 d}^{n m}

represent decay values based on the Manhattan distance. For any two tokens in the image,

X n

and

X m

, the Manhattan distance between them is defined as follows:

D_{2 d}^{n m} = γ^{|X n - X m| + |Y n - Y m|}

(1)

Here,

X n

and

X m

are the horizontal coordinates of the nth and mth tokens in the image, respectively, and

Y n

and

Y m

are their corresponding vertical coordinates. γ is a decay coefficient used to control the rate of attenuation of the attention scores with increasing distance. MaSA is a self-attention mechanism inspired by RetNet that combines the spatial-decay matrix

D_{2 d}

to introduce explicit spatial priors. The calculation of MaSA can be expressed as follows:

M a S A (X) = (S o f t m a x ({Q K}^{T}) \otimes D_{2 d}) V

(2)

Here, Q, K, and V are the attention matrices obtained through linear transformations, ⊗ denotes the Hadamard product (i.e., element-wise multiplication), and the

S o f t m a x

function is used to normalize the attention weights.

To reduce computational complexity while retaining the prior information in the spatial-decay matrix, RMT uses an attention decomposition form, calculating attention scores separately from the vertical and horizontal directions, and applying a one-dimensional bidirectional decay matrix to these attention weights. The one-dimensional decay matrix representing the horizontal and vertical distances between tokens can be expressed as follows:

{A t t n}_{H} = S o f t m a x (Q_{H} K_{H}^{T}) \otimes D_{H}

(3)

{A t t n}_{W} = S o f t m a x (Q_{W} K_{W}^{T}) \otimes D_{W}

(4)

M a S A (X) = {{A t t n}_{H} ({A t t n}_{W} V)}^{T}

(5)

Here,

Q_{H}

,

K_{H}

,

V_{H}

are the query, key, and value matrices in the horizontal direction, respectively.

Q_{W}

,

K_{W}

,

V_{W}

are the corresponding matrices in the vertical direction.

D_{H}

and

D_{W}

are the horizontal and vertical decay matrices, respectively, representing the distance decay of tokens in the horizontal and vertical directions. Through this decomposition, Ret can maintain spatial priors while reducing the computational complexity of self-attention from quadratic to linear, significantly improving the efficiency of the model. To further enhance the local expressive ability of MaSA, the DWConv contextual local enhancement module was introduced:

X_{o u t} = M a S A (X) + L C E (V)

(6)

The spatial-decay matrix ensures that the model does not overlook local features that are important for the task but occupy a small area in the global image when handling small underwater targets. Specifically, the spatial-decay matrix adjusts the attention scores by measuring the distance between tokens using the Manhattan distance. For tokens that are farther apart, their attention scores are attenuated, while for tokens that are closer together, especially those within small target areas, their attention scores remain high, making the model more sensitive when processing these areas. Additionally, the attention decomposition form reduces computational complexity and improves model efficiency by breaking down complex attention calculations into more manageable sub-tasks. This dual strategy not only enhances the robustness and accuracy of underwater target detection but also provides new ideas for target detection in other constrained environments.

3.2.3. Triplet Attention

After feature extraction by RetBlock, triplet attention captures the interactions between the spatial dimensions (H and W) and the channel dimension (C) in the input tensor through three parallel branches, with the feature map generated by the Z-pool operation highlighting difficult targets in complex underwater environments, as shown in Figure 3. The Z-pool layer is a key component of triplet attention, which reduces dimensionality through MaxPool and AvgPool operations. This enhanced feature representation helps improve target recognition and localization accuracy, especially when visual information becomes unclear due to underwater conditions. Specifically, this step applies MaxPool and AvgPool to each channel of the feature, then fuses the results of these two pooling operations, thus capturing background information while maintaining feature significance, which is crucial for target detection in complex backgrounds. For an input tensor of shape (C × H × W), the Z-pool operation can be expressed as follows:

Z - p o o l (x) = [{M a x P o o l}_{0}^{d} (x), {A v g P o o l}_{0}^{d} (x)]

(7)

Here,

O^{d}

indicates pooling operations performed along the 0th dimension (channel dimension). The result of the Z-pool is a tensor of shape (2 × H × W), which contains the maximum and average features of the original tensor.

The triplet-attention module receives an input tensor

x

of shape (C × H × W) and outputs a refined tensor y of the same shape. The processing of each branch is as follows:

First branch: Processes the interaction between the height dimension (H) and the channel dimension (C). The input tensor $x \in R^{W \times H \times C}$ is rotated 90° counterclockwise (anti-clockwise) about the H axis to obtain a tensor $\hat{x_{1}}$ of shape (W × H × C). Then, $\hat{x_{1}}$ passes through a standard convolution layer with kernel size k = 7 and a batch normalization layer, providing an intermediate dimension output (1 × H × C). The tensor then passes through a sigmoid activation layer ( $σ$ ) to generate the attention weights $φ_{1}$ applied to the input tensor $\hat{x_{1}}$ . Finally, it is rotated 90° clockwise along the H axis to maintain the original input shape $x$ .
Second branch: Processes the interaction between the width dimension (W) and the channel dimension (C). The input tensor $x$ is rotated 90° counterclockwise about the W axis to obtain a tensor $\hat{x_{2}}$ of shape (H × C × W). Similar to the processing in the first branch, $\hat{x_{2}}$ passes through a Z-pool and a convolution layer to generate the attention weights, $φ_{2}$ , and finally rotates 90° clockwise along the W axis to maintain the same shape as the input tensor $x$ .
Third branch: Constructs spatial attention, handling the dependency between the height and width dimensions (H and W). The input tensor $x$ is reduced in dimensionality by Z-pool to two channels, obtaining a tensor $\hat{x_{3}}$ of shape (2 × H × W). Then, $\hat{x_{3}}$ is processed through a convolution layer to generate attention weights $φ_{3}$ of shape (1 × H × W) and applied to the input tensor $\hat{x_{3}}$ . Finally, the refined tensors of shape (C × H × W) generated by the three branches are aggregated by averaging.

In summary, the process of obtaining the refined attention-applied tensor y from the triple attention on the input tensor

x \in R^{C \times H \times W}

can be expressed by the following equation:

y = \frac{1}{3} (\bar{\hat{x_{1}} σ (φ_{1} (\hat{x_{1} *}))} + \bar{\hat{x_{2}} σ (φ_{2} (\hat{x_{2} *}))} + x σ (φ_{3} (\hat{x_{3}}))

(8)

Here,

σ

represents the sigmoid activation function.

φ_{1}, φ_{2}, φ_{3}

denote the two-dimensional convolution layers with k = 7 in the respective branches, and the overline symbols are mainly used to represent the rotation operation in the triplet-attention module. This equation can be simplified as follows:

y = \frac{1}{3} (\bar{\hat{x_{1}} ω_{1}} + \bar{\hat{x_{2}} ω_{2}} + x ω_{3}) = \frac{1}{3} (\bar{y_{1}} + \bar{y_{2}} + y_{3})

(9)

ω_{1}

,

ω_{2}

,

ω_{3}

represent the attention weights for the three cross dimensions.

\bar{y_{1}}

and

\bar{y_{2}}

, respectively, represent a 90° clockwise rotation to maintain the original input shape (C × H × W). These tensors are rotated back to their original shapes to capture dependencies across different dimensions, thereby enhancing the effectiveness of the attention mechanism.

In the model, the neck is the critical area for fusing the low-level features extracted by the backbone with high-level semantic information. The efficiency of triplet attention as a lightweight module makes it particularly suitable for the neck part, as it can enhance performance without significantly increasing the computational burden. The unclear recognition of objects is often due to a lack of sufficient spatial details in the feature representation. Triplet attention, by establishing interactions between the channel dimension and the spatial dimensions, can better capture and emphasize detailed features in the image, including the texture and shape information of smaller objects.

3.2.4. LCD-Head

To achieve the goal of model light-weighting, we analyzed the parameter distribution of YOLOv8n, where the detection head accounts for 0.75 million parameters, occupying one-quarter of the entire model’s parameters. To maximize light-weighting while maintaining good performance, we designed the LCD detection head, which has only 0.1 million parameters. The specific structure is shown in Figure 4 and Table 1. As shown in Table 1, sourced from experimental results, the FLOPS and parameters of the LCD occupy only 47% and 13.3% of the YOLOv8n-head, fully achieving light-weighting. Furthermore, the LCD detection head accounts for only 4.9% of the entire model, significantly reducing the overall computational burden. Lightweight coupled detection (LCD) shares parameters between different tasks (e.g., bounding box regression, classification), allowing the output of the same convolution layer to be used simultaneously for predicting the location of the bounding box and the category of the object, thereby greatly reducing the number of parameters the model needs to learn independently. LCD does not forcefully reconstruct the feature network in each forward pass and can dynamically adjust the feature network during model inference as needed, thereby improving the model’s adaptability and efficiency. To capture multi-scale feature information, ensuring the model focuses more on useful feature representations and can consider both the overall characteristics and local details of the target object, we use 1 × 1 convolution layers and group normalization to reduce the dimensionality of feature maps, reducing computational load and over-fitting risk while retaining the most important features. In anchor design, the anchor sizes are predefined based on prior knowledge and statistical information for different datasets, with the strides variable storing the steps of anchors on different feature maps, thus mapping predictions back to the original image size. The design of the shared convolution layer considers spatial and semantic information; the first 3 × 3 convolution layer’s output dimension feature vector focuses more on spatial information, while the second 3 × 3 convolution layer’s output dimension feature focuses more on semantic information. The concatenated vector contains sufficient information to describe the object’s position, size, and category. Finally, the scale layer is introduced to scale features. A ‘scale’ instance with a scaling factor of 1 is created for each element in the channel, and ultimately, boundary box and category probability predictions are generated through two parallel convolution layers, cv2 and cv3, addressing the inconsistency of detecting targets at multiple scales. These structural designs enable the model to make more accurate predictions of bounding boxes and categories while significantly reducing the number of parameters. When regMax is greater than 1, we use the DFL (distribution focal loss) strategy for more precise bounding box regression. The decoding formula for the bounding box is as follows:

b o x = T (d) \otimes a \times S

(10)

Here, d represents the output of the DFL layer, corresponding to the distribution parameters converted from the predicted values. a is the anchor size, and S is the stride vector. After converting d to the bounding box through the transformation T, it can be expressed as T(d).

3.2.5. Focaler–MPDIoU

To address the accuracy loss caused by class imbalance and varying target sizes in underwater datasets, we combined the Focaler and MPDIoU loss functions to simultaneously focus on the difficulty of samples and localization accuracy. The core idea of MPDIoU is to improve the accuracy and efficiency of regression by minimizing the distance between the top-left and bottom-right points of the predicted bounding box and the ground truth bounding box. This method not only considers all relevant factors in existing loss functions (such as overlapping or non-overlapping areas, center point distance, width, and height deviations) but also simplifies the calculation process, thereby improving computational efficiency and stability.

Targets in underwater environments are often affected by light scattering and reflection, leading to reduced image contrast and clarity, which increases the difficulty of detection and localization. Traditional loss functions often overlook details when handling these complex situations, making it difficult to provide precise regression results. MPDIoU, by focusing on the four corner points of the bounding box, provides a more detailed evaluation standard, giving the model an advantage in handling subtle changes. Additionally, the introduction of Focaler addresses the issue of sample difficulty. In underwater images, targets vary greatly in size and shape, with some being difficult to recognize and others relatively easy. Focaler dynamically adjusts the weights of the loss function, allowing the model to focus more on difficult samples and less on easy ones, thereby enhancing the overall performance of the detector. Through this approach, the model can better identify smaller or blurrier targets and maintain high detection and localization accuracy in complex backgrounds. Therefore, the Focaler–MPDIoU loss function improves the model’s robustness and accuracy, enabling efficient target detection and localization even in complex underwater environments.

For any two convex shapes A,B ⊆ S

\in R^{n}

with the input image’s width and height being w and h, respectively, the calculation process of MPDIoU is as follows:

d_{1}^{2} = {({x_{1}}^{B} - {x_{1}}^{A})}^{2} + {({y_{1}}^{B} - {y_{1}}^{A})}^{2}

(11)

d_{2}^{2} = {({x_{2}}^{B} - {x_{2}}^{A})}^{2} + {({y_{2}}^{B} - {y_{2}}^{A})}^{2}

(12)

For A and B,

({x_{1}}^{A}, {y_{1}}^{A})

and

({x_{2}}^{A}, {y_{2}}^{A})

represent the coordinates of the top-left and bottom-right points of A, respectively, and

({x_{1}}^{B}, {y_{1}}^{B})

and

({x_{2}}^{B}, {y_{2}}^{B})

represent the coordinates of the top-left and bottom-right points of B.

Calculate the intersection area of A and B:

I = m a x (0, m i n ({x_{1}}^{A}, {x_{1}}^{B}) - m a x ({x_{1}}^{A}, {x_{1}}^{B})) × m a x (0, m i n ({y_{1}}^{A}, {y_{1}}^{B}) - m a x ({y_{1}}^{A}, {y_{1}}^{B}))

(13)

If the bottom-right corner coordinates of rectangle B

({x_{2}}^{B}, {y_{2}}^{B})

are greater than the top-left corner coordinates of rectangle A

({x_{1}}^{A}, {y_{1}}^{A})

, then the two rectangles have an overlapping area, and the intersection area I is greater than 0. Otherwise, the intersection area I is 0.

Calculate the union area of A and B:

U = A \cup B

(14)

Calculate minimum point distance intersection over union (MPDIoU):

M P D I o U = \frac{I}{U} - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}}

(15)

MPDIoU as a loss function can be defined as follows:

L_{M P D I o U} = 1 - M P D I o U

(16)

In this algorithm, we first determine the coordinates of the top-left and bottom-right points of the two bounding boxes, then calculate the Euclidean distance between these points. Next, we calculate the intersection area and the union area of the two bounding boxes. Finally, by subtracting the ratio of the squared sum of the distances between the top-left and bottom-right points from the squared sum of the image dimensions from the intersection area, we obtain the MPDIoU value.

The core idea of Focaler–IoU is to dynamically adjust the weights of the loss function based on the difficulty of the samples (e.g., the size of the target or the complexity of detection), making the model focus more on samples that are difficult to detect correctly (hard samples) while reducing the focus on samples that are easy to recognize (easy samples). This method can improve the detector’s performance by focusing on different regression samples in various detection tasks.

The formula for Focaler–IoU is as follows:

{I o U}_{f o c a l e r} \{\begin{array}{c} 0, i f I o U < d \\ I o U - \frac{d}{u - d} i f d \leq I o U \leq u \\ 1, i f I o U > u \end{array}

(17)

Here,

I o U

is the original intersection over union value, and d and u are two thresholds that define the range of interval mapping. By adjusting the values of d and u, we can control the focus of

{I o U}_{f o c a l e r}

on different regression samples. The value of

{I o U}_{f o c a l e r}

is designed to vary linearly between d and u, When

I o U

is less than d,

{I o U}_{f o c a l e r}

is 0, indicating that these almost non-overlapping samples are ignored. When

I o U

is greater than u,

{I o U}_{f o c a l e r}

is 1, indicating that these samples are already very close to the ground truth bounding box and do not need further attention. Between d and u,

{I o U}_{f o c a l e r}

increases linearly, thereby paying more attention to those partially overlapping samples. Combining the MPDIoU loss function, we get the following:

L_{F o c a l e r - M P D I o U} = 1 - {I o U}_{f o c a l e r}

(18)

The combination of Focaler–IoU and MPDIoU loss functions effectively improves the model’s training efficiency, sample difficulty handling, and localization accuracy. Through multiple experiments, we chose d = 0, u = 0.99, achieving the best performance. D = 0 means the loss function pays attention to all samples starting from IoU = 0, which is beneficial for difficult-to-distinguish targets in underwater datasets. Even very small overlapping parts with the target are considered in the loss, helping the model learn to recognize blurred or indistinct target boundaries. U = 0.99 sets a high threshold for the Focaler–IoU loss function, meaning the loss is 0 only when the predicted box and the ground truth box have a very high IoU (close to 1). This encourages the model to focus on nearly perfectly aligned predicted boxes and imposes a smaller penalty on these predictions. Underwater images contain a lot of noise and interference, such as particles in the water, suspended matter, or other organisms in the background. The high value of u ensures the model focuses on high-quality predicted boxes rather than low-quality predictions that may result from image noise or other interferences, which is also beneficial for handling subtle underwater targets.

4. Experiments

4.1. Experiment Platform

The experiments are conducted on a computer with 13th Gen Intel(R) Core(TM) i9-13900K 3.00 GHz (64G RAM) and NVIDIA GeForce RTX 4080 16G. The software configuration environments are CUDA 11.3, Python 3.9.18, and Torch 1.12.1. The training parameters are listed in Table 2. In our preliminary experiments, we observed a rapid increase in the model’s performance up to 250 epochs, beyond which the performance plateaued and posed a risk of overfitting. Consequently, to optimize computational resources, we set the number of epochs at 250. We selected a batch size of 8 to ensure sufficient sample size for gradient estimation at each update without excessive resource consumption. This batch size was determined based on multiple experimental trials. Based on our experience, we set the initial learning rate to 0.01. Our choice of optimizer was stochastic gradient descent (SGD), favored for its fast convergence on large-scale datasets, superior test data generalization, simplicity, effectiveness, and minimal memory usage, aligning with our experiment’s lightweight objective.

4.2. Experiment Result

We used dataset one to train the RTL model and YOLOv8n and obtained results. The model evaluation criteria included [email protected], parameters, FLOPS, recall, and precision, compared with the original YOLOv8n, as shown in Figure 5. Figure 5a shows the change in [email protected] (mean average precision at IoU threshold 0.5) over epochs. It can be seen that our RTL model, with a 30% reduction in parameters, still has a higher accuracy than YOLOv8n. Figure 5b shows the change in precision over training epochs. As the epochs increase, the precision of RTL gradually surpasses that of YOLOv8n. Figure 5c shows the change in recall over training epochs. Similar to Figure 5a, RTL shows a faster convergence rate, indicating that RTL can learn useful features more quickly, making it more efficient and better suited for scenarios with limited training time and resources. In Figure 5d, it can be seen that the training loss of the RTL model decreases very rapidly in the initial stage, indicating that the RTL model can quickly adapt to the training data in the early training stages. Fast loss reduction usually means high learning efficiency, allowing the model to start optimizing its parameters more quickly. During the training process, the loss of the RTL model not only decreases quickly but also reaches a lower steady-state value than YOLOv8n in the later stages of training. The loss curve of the RTL model is smoother than that of YOLOv8n, indicating that the learning process is more stable during training, which is especially important in complex underwater scenes. From Figure 6, it is more intuitive that the RTL model performs well in complex underwater environments, capturing targets with occlusion and cluttered backgrounds. Despite the uneven distribution of different categories of targets, the RTL model can still stably detect various targets. Additionally, the overall confidence level is high, indicating that the RTL model is reliable and stable in recognizing targets. Finally, the position and size of the detection boxes are relatively accurate, implying that the model is optimized for feature extraction and target localization, reducing false positives and missed detections.

4.3. Ablation Experiments

Compared to the baseline, our proposed RTL-YOLOv8 includes improvements in four areas. To verify the effectiveness of these improvements, we used parameters GFLOPS, [email protected], and [email protected] as evaluation metrics and conducted ablation experiments on each module, as shown in Table 3. The data in Table 3 comes from ablation experiments that were conducted to analyze the performance of different models and modules in the RTL-YOLOv8 framework. According to the results in the table, compared to the baseline, these four improvements show different enhancements in underwater target detection.

RetBlock, by introducing explicit spatial priors and a decomposed self-attention mechanism, effectively reduces the model’s parameter count and computational complexity while capturing feature information efficiently. This leads to a 6% reduction in parameters, an 11% reduction in GFLOPS, a 0.5% increase in [email protected], and a 0.3% increase in [email protected]. The decomposed self-attention mechanism in RetBlock retains the advantages of traditional self-attention in capturing long-distance dependencies while reducing computational complexity and memory consumption through decomposition. This ensures the model remains efficient and accurate when processing high-dimensional features. By combining these techniques, RetBlock significantly improves the model’s efficiency and performance while retaining strong feature-extraction capabilities. Triplet attention enhances the model’s ability to perceive different regions of the image by capturing cross-dimensional interactions between the channel and spatial dimensions of the input tensor. This improves [email protected] and [email protected] by 0.2% and 0.3%, respectively, without changing the model’s complexity. Triplet attention, with its three-branch structure, further enhances the feature fusion method in the neck part. Each branch captures cross-dimensional interactions between the channel and spatial dimensions (height or width). Through input tensor rotation operations and residual transformations, it effectively captures dependencies between height-channel, width-channel, and height-width, achieving richer feature representation without increasing model complexity. This cross-dimensional interaction capture mechanism allows the model to detect multi-scale targets more accurately in complex environments, significantly improving target detection performance. LCD removes redundant convolutions, reducing the overall model size by 21.3%, and improves [email protected] by 0.2% through scale layer extraction. Shared parameters significantly reduce parameters and computational consumption while simultaneously predicting detection boxes and targets. The 1 × 1 group normalization ensures that the downscaled feature maps retain effective information. Focaler–MPDIoU optimizes the classification and localization of difficult targets and bounding box regression by focusing on the distribution of difficult and easy samples combined with distance metrics, improving [email protected] and [email protected] by 0.6% and 0.2%, respectively, without changing the model size. Additionally, Focaler–MPDIoU uses an adaptive weight allocation strategy during training, allowing the model to handle difficult samples more effectively and achieve higher accuracy and stability in regression tasks. This improvement not only enhances target detection accuracy but also reduces training bias caused by sample imbalance.

Combining these four improvements, we obtain the best-performing RTL model. Compared to the baseline, it reduces parameters by 31.6%, GFLOPS by 29.3%, increases [email protected] by 1%, and increases [email protected] by 0.7%, proving the effectiveness of these four modules. From the figure, we can see that when RetBlock, Triplet, LCD, and FMIoU are added to the model in different combinations, the performance varies. For example, in Model 5, the combined use of RetBlock and Triplet does not show as significant an improvement as when used individually. We believe that during training, these two modules inevitably extract similar features. Therefore, handling overlapping information does not lead to cumulative performance improvement. Furthermore, the interaction between modules may require re-tuning hyperparameters [35]. Modules that perform well individually may not achieve optimal results when combined, as the shared hyperparameter space changes.

4.4. Comparative Experiments

We compared our model with the original YOLOv8, YOLOv8-DBBNCSPELAN [36], YOLOv8-DynamicHGNet [37], YOLOv8-FasterEMA [38], and YOLOv8-GhostHGNet [37], and other innovative lightweight models. The results presented in Figure 7 detail the performance of these models in underwater object detection. In key metrics such as [email protected], precision, and recall, the RTL model showed significant advantages, proving its superiority in accuracy and recall rate. To ensure the accuracy of the experiments, we switched to dataset two. Figure 7 shows the performance of the aforementioned models on the second dataset. Despite the relatively simple structure of the RTL model, the test results still indicate that it achieved the highest level of accuracy, highlighting its excellent performance and efficient design optimization. Using heatmaps to compare different models offers a clear visualization of their performance differences, highlighting areas of excellence and underperformance. This method facilitates a quick understanding of each model’s strengths and weaknesses. Therefore, we used the high-resolution HiResCAM method to generate heatmaps to evaluate whether the model could effectively identify the key features of the targets and the model’s attention focus, as shown in Figure 8. Compared with traditional CAM and Grad-CAM, HiResCAM provides a more detailed visual feature display, enhancing our understanding of the model’s performance. From left to right, Figure 8 and Figure 9 show the original picture(a), YOLOv8n(b), YOLOv8-DBBESPLEAN(c), YOLOv8-DynamicHGNet(d), YOLOv8-FasterEMA(e), YOLOv8-GhostHGNet(f), and RTL-YOLOv8n(g) models. The comparison shows that the heatmap of the RTL model displays more concentrated and clear response areas, with a stronger ability to capture smaller and blurrier objects. This indicates that the RTL model has a higher focus when identifying targets, effectively concentrating on the main targets in the image. In Table 4, as a detailed description of Figure 9, Model RTL demonstrates significant advantages over the preceding five models in several aspects. Firstly, RTL excels in detection accuracy, where its accuracy is notably higher than that of the other models. Secondly, RTL exhibits a lower false positive rate, showcasing greater robustness and stability. For instance, in the third line, RTL accurately detects holothurian with a minimal false positive rate. Additionally, RTL excels in multi-object detection capabilities, as evidenced in line 4, where it successfully detects multiple target objects (dojou and dojou2), whereas other models perform poorly. Overall, RTL maintains consistently high performance across various environments (such as green and blue backgrounds), highlighting its superiority in diverse detection tasks. Additionally, it can be seen that the RTL model more effectively suppresses background interference compared to other models, reducing false detection in complex environments.

In Figure 10, The RTL model achieves the highest precision ([email protected]) compared to the other five models, despite having significantly fewer parameters. This demonstrates its efficiency and effectiveness in achieving high accuracy with a reduced computational load. The chart shows that while the parameters for RTL are only 2.06 million, it reaches a [email protected] of 0.808, outperforming models like YOLOv8-GhosthgNet and YOLOv8n, which have higher parameter counts but lower accuracy.

5. Discussion and Future Development

5.1. Discussion

Underwater target detection plays a crucial role in aquaculture, not only helping to improve farming efficiency but also promoting the protection and sustainability of the underwater ecological environment. To address the issues of high accuracy and computational efficiency in underwater target detection, we proposed several key modules in the RTL-YOLOv8n model and ultimately achieved significant results.

The RetBlock module allows better consideration of spatial information in the model, but its complexity and high computational requirements still have substantial limitations for model training. The intricate structure of RetBlock escalates the difficulty of model tuning, as the numerous components and their complex interactions necessitate extensive hyperparameter adjustments and experimentation to identify the optimal configuration. Additionally, RetBlock’s complexity prolongs the model training duration, as each training step becomes more time-consuming due to the intricate structure and computational demand. These limitations introduce challenges to model development and application. Due to the complexity and variability of underwater environments, the model is trained on relatively limited underwater data. For more complex and variable real underwater environments, there is still room for improvement in the model. We plan to explore more realistic underwater target detection environments in subsequent experimental research, thereby enhancing the applicability and generalization ability of the model.

We deployed the model in a laboratory environment using the same configuration as training, and the model can still achieve processing results at 80 fps while maintaining high performance. Therefore, in large-scale aquaculture operations, we believe that the model can also be applied to more cost-effective deployment devices, maintain relatively efficient work efficiency, and meet the real-time monitoring performance required by users.

In this research, we introduce RTL-YOLOv8n, a novel model that significantly enhances the accuracy and efficiency of underwater target detection. A reduction in computational load by 31.6% allows the model to operate on devices with limited computational resources, thereby increasing its flexibility in real-world applications. Furthermore, an increase of 1.5% in mean average precision (MAP) at an intersection over union (IoU) of 0.5 and a 5.2% improvement in precision demonstrate a substantial enhancement in RTL-YOLOv8n’s capability to identify and locate underwater targets. These advancements have far-reaching implications for the domain of underwater target detection, particularly for applications necessitating high-precision target detection in complex underwater environments, such as underwater exploration and marine biological research. Moreover, the efficiency and precision of RTL-YOLOv8n extend its applicability to other environments requiring efficient and accurate target detection. In the context of intelligent aquaculture systems, the implementation of RTL-YOLOv8n can facilitate more accurate monitoring and tracking of aquatic organisms, thereby improving breeding efficiency and the management of biological health. For instance, the health status of fish schools can be evaluated by detecting the quantity and behavior of fish, and biodiversity can be preserved by identifying and tracking specific species. The enhancements in RTL-YOLOv8n offer a potent tool for underwater target detection, aiding in our understanding and protection of marine environments. Simultaneously, they pave the way for potential advancements in the development of intelligent aquaculture systems

5.2. Future Development

To further contribute to industrialization, in future work, we plan to optimize the entire computational process based on multiply-accumulate (MAC) operations, from data processing to GPU computation, inference, and non-maximum suppression (NMS), to achieve overall efficiency improvement. First, we will optimize data loading and preprocessing steps to reduce data transfer time and preprocessing costs to the GPU. Second, we will fully utilize the parallel processing capabilities of GPUs, and optimize the implementation of MAC operations by improving algorithms and adjusting memory access patterns to enhance the utilization and computational efficiency of GPU execution cores. In the inference phase, we will design simpler and more efficient network architectures, reduce unnecessary MAC operations, and explore techniques such as model quantization and pruning to reduce the model’s parameter count and computational requirements. Additionally, for the NMS algorithm, we will speed up the computation process through algorithm optimization or parallel processing techniques, effectively reducing post-processing time in object detection. Combining these strategies, we expect to significantly improve overall inference performance, accelerate processing speed, and enhance system efficiency, especially in real-time applications requiring quick responses. This comprehensive optimization strategy will be the focus of our future research and development aimed at achieving more efficient and practical visual identity systems in more complex underwater environments.

6. Conclusions

Traditional aquaculture management methods are labor-intensive, and the prerequisite for achieving automated and intelligent management is the accurate identification of underwater targets. We explored the application of deep-learning methods for recognizing targets in complex underwater environments. Considering the complexity of real underwater environments and the small size, irregular shape, and camouflage of targets, this paper proposes a new method based on YOLOv8n to address the challenge of recognizing difficult targets in such environments. By adding layers for recognizing difficult targets, improving feature fusion methods, and integrating attention mechanisms, our model improved [email protected] by 1% and precision by 5.2% while reducing the parameter count by 31.6% and GFLOPS by 29.3%. Experimental results show that the model can more accurately recognize targets in complex underwater environments, with stronger generalization ability and robustness, meeting the requirements for real-time underwater recognition. This study provides new insights for recognizing difficult targets in complex underwater environments, has theoretical significance and practical application value for developing intelligent aquaculture equipment, and offers guidance for the intelligent upgrading and development of the marine industry.

Author Contributions

Conceptualization, G.F.; Development of RTL-YOLOv8n Model, G.F.; Writing—Original Draft, G.F.; Formal Analysis, G.F.; Data Collection and Annotation, Z.X.; Software Development, Z.X. and Z.Z.; Experimentation, Z.X.; Investigation, H.P.; Data Curation, H.P.; Writing—Review and Editing, H.P.; Project Administration, Y.G.; Resources, Y.G.; Supervision, Y.G.; Data Preprocessing, Z.Z.; Validation, Z.Z.; Software Implementation, J.Y.; Data Analysis, J.Y.; Technical Support, J.Y.; Methodology Development, Z.M.; Validation, Z.M.; Funding Acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Science and Technology Program of Zhejiang Province (grant number 2022C02040) and supported by the Dalian Key Laboratory of Intelligent Detection and Diagnostic Technology for Equipment (grant number DL-KL-201904). The authors declare that we have no competing interests. We would like to acknowledge the use of ChatGPT, a language model developed by OpenAI, for assisting in the language editing and refinement of this manuscript. The use of this tool has facilitated the clarity and coherence of our research presentation.

Data Availability Statement

The dataset titled “UnderwaterObjectDetection Dataset” is an open-source resource created by Yifan Wu. It is available on Roboflow Universe, published by Roboflow in April 2023. The “Fish4Knowledge Dataset” is an open-source resource authored by g18L5754. It was published on Roboflow Universe by Roboflow in October 2023.

Conflicts of Interest

Jiapeng Yang is employed by Suzhou Jiean Information Technology Co., Ltd. The rest of the authors declare no conflicts of interest.

References

Stevens, J.R.; Newton, R.W.; Tlusty, M.; Little, D.C. The rise of aquaculture by-products: Increasing food production, value, and sustainability through strategic utilisation. Mar. Policy 2018, 90, 115–124. [Google Scholar] [CrossRef]
Campbell, B.; Pauly, D. Mariculture: A global analysis of production trends since 1950. Mar. Policy 2013, 39, 94–100. [Google Scholar] [CrossRef]
Wang, Q.; Liu, H.; Sui, J. Mariculture: Developments, present status and prospects. In Aquaculture in China: Success Stories and Modern Trends; Wiley: Hoboken, NJ, USA, 2018; pp. 38–54. [Google Scholar]
Mandić, M.; Ikica, Z.; Gvozdenović, S. Mariculture in the Boka Kotorska Bay: Tradition, current state and perspective. In The Boka Kotorska Bay Environment; Springer: Berlin/Heidelberg, Germany, 2017; pp. 395–409. [Google Scholar]
Zheng, L.; Liu, Q.; Liu, J.; Xiao, J.; Xu, G. Pollution control of industrial mariculture wastewater: A mini-review. Water 2022, 14, 1390. [Google Scholar] [CrossRef]
Wang, Z.; Liu, H.; Zhang, G.; Yang, X.; Wen, L.; Zhao, W. Diseased fish detection in the underwater environment using an improved yolov5 network for intensive aquaculture. Fishes 2023, 8, 169. [Google Scholar] [CrossRef]
Gao, T.; Xiong, Z.; Li, Z.; Huang, X.; Liu, Y.; Cai, K. Precise underwater fish measurement: A geometric approach leveraging medium regression. Comput. Electron. Agric. 2024, 221, 108932. [Google Scholar] [CrossRef]
Cai, K.; Miao, X.; Wang, W.; Pang, H.; Liu, Y.; Song, J. A modified YOLOv3 model for fish detection based on MobileNetv1 as backbone. Aquac. Eng. 2020, 91, 102117. [Google Scholar] [CrossRef]
Khan, A.; Fouda, M.M.; Do, D.-T.; Almaleh, A.; Alqahtani, A.M.; Rahman, A.U. Underwater target detection using deep learning: Methodologies, challenges, applications and future evolution. IEEE Access 2024, 12, 12618–12635. [Google Scholar] [CrossRef]
Peli, T.; Malah, D. A study of edge detection algorithms. Comput. Graph. Image Process. 1982, 20, 1–21. [Google Scholar] [CrossRef]
Belongie, S.; Malik, J.; Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 509–522. [Google Scholar] [CrossRef]
Bhanu, B. Automatic target recognition: State of the art survey. IEEE Trans. Aerosp. Electron. Syst. 1986, 22, 364–379. [Google Scholar] [CrossRef]
Chapelle, O.; Haffner, P.; Vapnik, V. Support vector machines for histogram-based image classification. IEEE Trans. Neural Netw. 1999, 10, 1055–1064. [Google Scholar] [CrossRef]
Zhou, H.; Jiang, T. Decision tree based sea-surface weak target detection with false alarm rate controllable. IEEE Signal Process. Lett. 2019, 26, 793–797. [Google Scholar] [CrossRef]
Guo, Z.X.; Shui, P.L. Anomaly based sea-surface small target detection using K-nearest neighbor classification. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4947–4964. [Google Scholar] [CrossRef]
Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310. [Google Scholar] [CrossRef]
Nanni, L.; Ghidoni, S.; Brahnam, S. Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recognit. 2017, 71, 158–172. [Google Scholar] [CrossRef]
Devulapalli, S.; Potti, A.; Krishnan, R.; Khan, S. Experimental evaluation of unsupervised image retrieval application using hybrid feature extraction by integrating deep learning and handcrafted techniques. Mater. Today Proc. 2023, 81, 983–988. [Google Scholar] [CrossRef]
Kamal, S.; Mohammed, S.K.; Pillai, P.R.S.; Supriya, M.H. Deep learning architectures for underwater target recognition. In Proceedings of the 2013 Ocean Electronics (SYMPOL), Kochi, India, 23–25 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 48–54. [Google Scholar]
Liu, P.; Hongbo, Y.A.N.G.; Hu, Y.; Fu, J. Research on target recognition of underwater robot. In Proceedings of the 2018 IEEE International Conference on Advanced Manufacturing (ICAM), Yunlin, Taiwan, 16–18 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 463–466. [Google Scholar]
Er, M.J.; Chen, J.; Zhang, Y.; Gao, W. Research challenges, recent advances, and popular datasets in deep learning-based underwater marine object detection: A review. Sensors 2023, 23, 1990. [Google Scholar] [CrossRef] [PubMed]
Zeng, L.; Sun, B.; Zhu, D. Underwater target detection based on Faster R-CNN and adversarial occlusion network. Eng. Appl. Artif. Intell. 2021, 100, 104190. [Google Scholar] [CrossRef]
Wang, Q.; Zhang, Y.; He, B. Intelligent Marine Survey: Lightweight Multi-Scale Attention Adaptive Segmentation Framework for Underwater Target Detection of AUV. IEEE Trans. Autom. Sci. Eng. 2024. [Google Scholar] [CrossRef]
Han, F.; Yao, J.; Zhu, H.; Wang, C. Underwater image processing and object detection based on deep CNN method. J. Sens. 2020, 2020, 6707328. [Google Scholar] [CrossRef]
Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater target detection based on improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
Zhai, X.; Wei, H.; He, Y.; Shang, Y.; Liu, C. Underwater sea cucumber identification based on improved YOLOv5. Appl. Sci. 2022, 12, 9105. [Google Scholar] [CrossRef]
Wang, P.; Yang, Z.; Pang, H.; Zhang, T.; Cai, K. A novel fft_yolox model for underwater precious marine product detection. Appl. Sci. 2022, 12, 6801. [Google Scholar] [CrossRef]
Bao, Z.; Guo, Y.; Wang, J.; Zhu, L.; Huang, J.; Yan, S. Underwater target detection based on parallel high-resolution networks. Sensors 2023, 23, 7337. [Google Scholar] [CrossRef]
Chen, L.; Zheng, M.; Duan, S.; Luo, W.; Yao, L. Underwater target recognition based on improved YOLOv4 neural network. Electronics 2021, 10, 1634. [Google Scholar] [CrossRef]
Fan, Y.; Zhang, L.; Li, P. A Lightweight Model of Underwater Object Detection Based on YOLOv8n for an Edge Computing Platform. J. Mar. Sci. Eng. 2024, 12, 697. [Google Scholar] [CrossRef]
Wei, Y.; Fang, Y.; Cheng, F.; Zhang, M.; Cao, M.; Zhang, H. A lightweight underwater target detection network for seafood. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 242–26 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 8381–8387. [Google Scholar]
Liu, Q.; Huang, W.; Duan, X.; Wei, J.; Hu, T.; Yu, J.; Huang, J. DSW-YOLOv8n: A new underwater target detection algorithm based on improved YOLOv8n. Electronics 2023, 12, 3892. [Google Scholar] [CrossRef]
Wang, X.; Gao, H.; Jia, Z.; Li, Z. BL-YOLOv8: An improved road defect detection model based on YOLOv8. Sensors 2023, 23, 8361. [Google Scholar] [CrossRef]
Tian, Y.; Zhao, C.; Zhang, T.; Wu, H.; Zhao, Y. Recognition Method of Cabbage Heads at Harvest Stage under Complex Background Based on Improved YOLOv8n. Agriculture 2024, 14, 1125. [Google Scholar] [CrossRef]
Pon, M.Z.A.; Krishna Prakash, K.K. Hyperparameter tuning of deep learning models in keras. Sparkling Light Trans. Artif. Intell. Quantum Comput. STAIQC 2021, 1, 36–40. [Google Scholar] [CrossRef]
Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 July 2021; pp. 10886–10895. [Google Scholar]
Han, K.; Wang, Y.; Guo, J.; Wu, E. ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 15751–15761. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]

Figure 1. RTL model flowchart.

Figure 2. Structure of RetBlock.

Figure 3. Structure of triplet attention.

Figure 4. Structure of lightweight coupled detection head.

Figure 5. Figures (a–d) compare the [email protected], precision, recall, and training loss of YOLOv8n and RTL.

Figure 6. The green background represents the first dataset, and the blue background represents the second dataset.

Figure 7. Figures (a–c) compare the precision, recall, and [email protected] of different models on Dataset 1, respectively. Figure (d) compares the [email protected] of the models on Dataset 2.

Figure 8. Heat map of the six models. Each row represents a different comparison image, the first column is the original image, all other columns represent the detection results of different models. (a) the original image set, (b) the YOLOv8n heatmap set, (c) the YOLON8-DBBESPLEAN heatmap set, (d) the YOLOv8-DynamicHGNet heatmap set, (e) the YOLOv8-FasterEMA heatmap set, (f) the YOLOv8-GhostHGNet heatmap set, (g) the RTL-YOLOv8n(our) heatmap group.

Figure 9. Model detection image comparison, each row represents a different comparison image, the first column is the original image, all other columns represent the detection results of different models. (a) the original image set, (b) the YOLOv8n classification result set, (c) the YOLON8-DBBESPLEAN classification result set, (d) the YOLOv8-DynamicHGNet classification result set, (e) the YOLOv8-FasterEMA classification result set, (f) the YOLOv8-GhostHGNet classification result set. (g) RTL-YOLOv8n (our) classification result set.

Figure 10. Comparison of the six models’ parameters, GFLOPS, and [email protected].

Table 1. Comparison between RTL and YOLOv8n detection heads.

.	YOLOv8n	RTL	V8n-Head	LCD	V8n-Head/v8n (%)	LCD/RTL (%)	LCD/v8n-Head (%)
FLOPS (G)	8.1	5.8	2.98	1.4	36.8%	24.1%	47%
Parameters (M)	3.01	2.06	0.75	0.1	25%	4.9%	13.3%

Table 2. Training Parameters.

Parameters	Value
Epochs	250
Batch	8
Optimizer	SGD
CUDA	11.3.1
Pytorch	1.12.1
Python	3.9.18

Table 3. Ablation experiments by module.

Models	RetBlock	Triplet	LCD	FMloU	Parameters (M)	FLOPS (G)	mAPO.5 (%)	mAP0.95 (%)
Baseline					3.01	8.2	79.8	46.2
YOLOv8					−	−	−	−
Model1	√				2.83 (−6%)	7.3 (−11%)	80.3	46.5
Model2		√			−	−	80	46.5
Model3			√		2.37 (−21.3%)	6.6 (−19.5%6)	80	46.4
Model4				√	−	−	80.4	46.4
Model5	√	√			2.7 (−10.3%)	7.3 (−11%)	80.4	46.5
Model6	√	√	√		2.06 (−31.6%)	5.8 (−29.3%)	80.5	46.7
RTL	√	√	√	√	2.06 (−31.6%)	5.8 (−29.3%)	80.8	46.9

Table 4. Model detection image comparison.

	B	C	D	E	F	G
Experimental Figures (1)	starfish: 0.81 holothurian: None	starfish: 0.79 holothurian: None	starfish: 0.79 holothurian: None	starfish: 0.72 holothurian: None	starfish: 0.81 holothurian: None	starfish: 0.82 holothurian: 0.27
Experimental Figures (2)	echinus: 0.75 scallop: None holothurian: None	echinus: 0.67 scallop: None holothurian: None	echinus: 0.58 scallop: None holothurian: None	echinus: 0.55 scallop: 0.31 holothurian: None	echinus: 0.75 scallop: 0.30 holothurian: None	echinus: 0.76 scallop: 0.56 holothurian: 0.31
Experimental Figures (3)	holothurian: 0.79 echinus: None	holothurian: 0.77 echinus: None	holothurian: 0.72 echinus: 0.34 echinus: 0.33	holothurian: 0.70 holothurian: 0.22 echinus: 0.41	holothurian: 0.75 echinus: None	holothurian: 0.79 echinus: 0.52
Experimental Figures (4)	dojou: None	dojou: None	dojou: None	dojou:0.33	dojou: None	dojou: 0.44
Experimental Figures (5)	dojou1: 0.68 dojou2: None	dojou1: 0.62 dojou2: None	dojou1: 0.68 dojou2: None	dojou1: 0.66 dojou2: None	dojou1: 0.53 dojou2: None	dojou1: 0.78 dojou2: 0.48
Experimental Figures (6)	dojou1: 0.67 dojou2: None funa: 0.88	dojou1: 0.65 dojou2: 0.32 funa: 0.77	dojou1: 0.71 dojou2: 0.37 funa: 0.89	dojou1: 0.73 dojou2: 0.37 funa: 0.80	dojou1: 0.75 dojou2: 0.33 funa: 0.84	dojou1: 0.78 dojou2: 0.43 funa: 0.88

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Feng, G.; Xiong, Z.; Pang, H.; Gao, Y.; Zhang, Z.; Yang, J.; Ma, Z. RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection. Fishes 2024, 9, 294. https://doi.org/10.3390/fishes9080294

AMA Style

Feng G, Xiong Z, Pang H, Gao Y, Zhang Z, Yang J, Ma Z. RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection. Fishes. 2024; 9(8):294. https://doi.org/10.3390/fishes9080294

Chicago/Turabian Style

Feng, Guanbo, Zhixin Xiong, Hongshuai Pang, Yunlei Gao, Zhiqiang Zhang, Jiapeng Yang, and Zhihong Ma. 2024. "RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection" Fishes 9, no. 8: 294. https://doi.org/10.3390/fishes9080294

Article Menu

RTL-YOLOv8n: A Lightweight Model for Efficient and Accurate Underwater Target Detection

Abstract

1. Introduction

2. Related Work

3. Materials and Methods

3.1. Data Source

3.2. Method

3.2.1. YOLOv8n Baseline

3.2.2. RetBlock

3.2.3. Triplet Attention

3.2.4. LCD-Head

3.2.5. Focaler–MPDIoU

4. Experiments

4.1. Experiment Platform

4.2. Experiment Result

4.3. Ablation Experiments

4.4. Comparative Experiments

5. Discussion and Future Development

5.1. Discussion

5.2. Future Development

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI