Article

Research on the Identification and Classification of Marine Debris Based on Improved YOLOv8

1 School of Electrical Engineering and Electronic Information, Xihua University, Chengdu 610039, China
2 Sichuan Provincial Key Laboratory of Signal and Information Processing, Xihua University, Chengdu 610039, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(10), 1748; https://doi.org/10.3390/jmse12101748
Submission received: 26 August 2024 / Revised: 27 September 2024 / Accepted: 30 September 2024 / Published: 3 October 2024
(This article belongs to the Special Issue Application of Deep Learning in Underwater Image Processing)

Abstract

Autonomous underwater vehicles equipped with target recognition algorithms are a primary means of removing marine debris. However, due to poor underwater visibility, light scattering by suspended particles, and the coexistence of organisms and debris, current methods have problems such as poor recognition and classification effects, slow recognition speed, and weak generalization ability. In response to these problems, this article proposes a marine debris identification and classification algorithm based on improved YOLOv8. The algorithm incorporates the CloFormer module, a context-aware local enhancement mechanism, into the backbone network, fully utilizing shared and context-aware weights. Consequently, it enhances high- and low-frequency feature extraction from underwater debris images. The proposed C2f-spatial and channel reconstruction (C2f-SCConv) module combines the SCConv module with the neck C2f module to reduce spatial and channel redundancy in standard convolutions and enhance feature representation. WIoU v3 is employed as the bounding box regression loss function, effectively managing low- and high-quality samples to improve overall model performance. The experimental results on the TrashCan-Instance dataset indicate that compared to the classical YOLOv8, the [email protected] and F1 scores are increased by 5.7% and 6%, respectively. Meanwhile, on the TrashCan-Material dataset, the [email protected] and F1 scores also improve, by 5.5% and 5%, respectively. Additionally, the model size has been reduced by 12.9%. These research results are conducive to maintaining marine life safety and ecosystem stability.

1. Introduction

Human domestic waste enters the ocean, where it can be ingested by or become entangled with marine organisms, affecting ecosystem stability and human health [1,2]. Currently, the state-of-the-art technological approach is to use autonomous underwater vehicles equipped with marine debris recognition algorithms to clean up oceanic debris [3,4]. However, owing to weak underwater light, interference from suspended particles, biological adhesion, and changes in the shape of debris, the quick and accurate identification of marine debris remains an urgent problem.
Marine debris identification can be roughly divided into traditional and deep learning methods. Traditional methods generally rely on either sensing technology (e.g., sonar, lidar) or traditional machine learning (e.g., dictionary learning). Initially, sonar or lidar was used to conduct underwater detection directly. Zhang et al. (2010) and Tucker et al. (2011) conducted underwater detection based on different sonar systems; the detection range was improved, but resolution was low and anti-interference capability was weak [5,6]. Pellen et al. (2012) and Gao et al. (2014) used lidar to detect underwater targets, which improved resolution and anti-interference performance, but the recognition effect was poor and propagation loss was large [7,8]. With the development of machine learning, dictionary learning has been combined with sonar images for underwater target detection. Azimi-Sadjadi et al. (2017) proposed a subspace-based underwater sonar image detection method that reduced propagation loss but had low generalization ability [9]. Similarly, Lu et al. (2019) improved the recognition rate by using sparse representation to identify underwater targets, but signal loss made the algorithm's performance unstable [10]. In summary, traditional marine debris identification methods suffer from poor recognition effects, large propagation loss, weak anti-interference capability, low generalization ability, and unstable performance, and they are difficult to extend to all scene types.
With the development of deep learning, convolutional neural networks (CNNs) have been widely used in image processing. Valdenegro-Toro (2016) trained a CNN classifier to identify marine debris, but it identified fewer types and overlooked environmental disturbances [11]. Xian et al. (2018) developed an underwater man-made object recognition system. Due to the use of synthetic underwater images, authenticity was lacking, and the model was complex [12]. Hong et al. (2020) used a classifier trained with enhanced data to classify and identify marine debris and applied it to a real environment, but there were few recognition categories [13]. Politikos et al. (2021) utilized region-based CNNs to detect submarine debris in a real environment, expanding marine debris categories, but their approach had a low recognition rate [14]. Wei et al. (2022) proposed an improved U-Net-based architecture, which enriched the semantic segmentation dataset of marine debris. However, the recognition speed was slow [15]. Sinthia et al. (2023) improved the YOLOv8 model to detect marine debris, with good recognition effects but weak generalization ability [16]. In summary, marine debris identification based on deep learning has the advantages of automatic feature learning, adaptability to different types of marine debris, and strong scalability, but it requires a large amount of data support. Current deep learning algorithms have shortcomings such as poor recognition and classification effects, slow recognition speed, complex models, and weak generalization capabilities.
This article proposes a marine debris identification and classification algorithm based on improved YOLOv8. The main contributions are as follows:
(1)
Because small targets in marine debris occupy few pixels in an image and have unclear and missing features, the CloFormer module with a dual-branch architecture is introduced in the backbone network to enhance the perception of image information;
(2)
The C2f-SCConv module is proposed to enhance feature representation capabilities, addressing the problem that recognition is easily affected by factors such as underwater suspended matter, light intensity, and biological habits, resulting in overlap and damage of debris and organisms in an image, and hence feature confusion and redundancy;
(3)
WIoU v3 with weight factors is used as the bounding box loss function to reduce harmful gradients caused by low-quality samples while managing samples of different quality;
(4)
Simulation experiments show that the proposed algorithm has strong generalization performance, good recognition and classification effects, fast recognition, and low complexity.
The remainder of this paper is structured as follows. Section 2 introduces the dataset used in this study and discusses the strengths and limitations of the classic YOLOv8. Section 3 outlines the improvements made to YOLOv8. Section 4 details the experiment and provides an in-depth analysis of the results. Finally, Section 5 summarizes the conclusions of this research.

2. Materials and Methods

2.1. Dataset Preparation

The TrashCan dataset compiled by Hong et al. [17] was used for training. The images were sourced from the deep-sea image e-library of the Japan Agency for Marine-Earth Science and Technology (JAMSTEC). The authors extracted debris data from nearly 1000 videos of various lengths captured by underwater vehicles. The dataset includes 7212 annotated images, divided into two labeled subsets: TrashCan-Instance and TrashCan-Material.
TrashCan-Instance, with an 84:16 training-validation split, contains 6065 and 1147 images, respectively, and 9540 and 2588 labels across 22 categories. These labels include rov (artificial objects deliberately placed in the scene), plants, eels, unknown instances, nets, cups, bottles, pipes, snack wrappers, and clothing. TrashCan-Material, split into 83:17 training and validation sets, comprises 6008 and 1204 images, respectively, with 9741 and 2595 labels across 16 categories. These labels include plants, fish, eels, metals, plastics, rubbers, wood, fishing gear, paper, and fabric (Table 1).
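As a sanity check on the split sizes in Table 1, the image and label counts can be reproduced from YOLO-format annotation files. The directory layout below is a hypothetical assumption for illustration, not the dataset's actual structure.

```python
# Hypothetical sanity check of the TrashCan split sizes reported in Table 1.
# Assumes YOLO-format annotations: one .txt file per image, one label line per object.
from pathlib import Path

def count_split(image_dir: str, label_dir: str):
    """Return (number of images, number of label instances) for one split."""
    images = [p for p in Path(image_dir).iterdir()
              if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
    labels = 0
    for txt in Path(label_dir).glob("*.txt"):
        labels += sum(1 for line in txt.read_text().splitlines() if line.strip())
    return len(images), labels

for split in ("train", "val"):
    n_img, n_lbl = count_split(f"TrashCan-Instance/images/{split}",
                               f"TrashCan-Instance/labels/{split}")
    print(f"{split}: {n_img} images, {n_lbl} labels")
```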

2.2. Classic YOLOv8 Network

Released by Ultralytics in 2023, YOLOv8 has four advantages: (1) the anchor-free structure avoids the problems that anchor boxes encounter with non-standard-shaped objects; (2) advanced data augmentation enhances the robustness and generalization ability of the model; (3) adaptive training strategies that optimize the learning rate and balance the loss function improve model performance; (4) the flexible architecture lets users easily adjust the structure and parameters to suit a variety of target detection tasks.
The YOLOv8 network consists of an Input, Backbone, Neck, and Head, as shown in Figure 1.
YOLOv8 has shown excellent performance in image recognition in fields such as industrial defect detection [18], agricultural pest detection [19], and medical imaging [20], owing to its efficient target detection and real-time processing. However, three problems arise in marine debris identification: (1) some C2f modules in the backbone network extract features too many times, so using them directly degrades feature extraction and causes insufficient fusion of key information during feature fusion; (2) because the neck C2f modules sit behind the concatenation layers, directly stacking features from different layers introduces redundancy and interference during feature integration and hampers feature optimization; (3) the CIoU loss function cannot handle samples of differing quality in the TrashCan dataset, which reduces model performance.

3. The Proposed Approach

3.1. Improved YOLOv8 Network

To address the above-mentioned issues, the classic YOLOv8 network is enhanced, as shown in red in Figure 2. The lightweight CloFormer module [21], featuring context-aware local enhancement, is integrated into backbone layers 4 and 6 to boost the C2f module’s feature extraction. This mechanism realizes the deep mining of both high-frequency local information (such as edges and texture of marine debris) and low-frequency global information (such as the overall structure and spatial layout of the image), improving focus on debris features while minimizing background interference. Consequently, feature fusion becomes more accurate and effective. The SCConv [22] module is integrated into the four C2f modules in the neck, optimizing the fine reconstruction of features across spatial and channel dimensions. The former eliminates redundant spatial information to make the feature map more compact and richer in key information, while the latter enables SCConv to further reduce the interference between channels by optimizing the channel correlations, thereby enhancing the overall feature consistency and discrimination. This dual-dimension optimization decreases feature dimension during integration, lightening the computational load on subsequent processing layers and significantly improving feature expression. WIoU v3 is used as the bounding box regression loss function, and its wise gradient gain distribution method is used to reduce the competitiveness of high-quality anchor boxes and the interfering gradients generated by low-quality samples, which improves the recognition effect.

3.2. Improved Backbone C2f Structure

Before the improvement, the backbone C2f module divided the input feature map into two parts. Feature extraction in the bottleneck was limited for small debris targets and struggled to differentiate complex background information (Figure 3a). The CloFormer module is introduced into the convolution block of its internal bottleneck, yielding the C2f-CloFormer module with a dual-branch architecture, as shown in Figure 3b. This module focuses on the small targets themselves and screens useful background information, so the features of the input image are better understood and represented, improving feature extraction and feature fusion for small targets in marine debris images.
The Clo block structure in CloFormer is shown in Figure 4. The global branch downsamples K (key) and V (value) and applies ordinary attention to Q (query), K, and V to capture low-frequency global information. The local branch adopts the attention-style convolution operator AttnConv. To aggregate high-frequency local information, depth-wise convolution (DWconv) with shared weights extracts local representations. The Hadamard product of $Q_o$ and $K_o$ is then computed and passed through further transformations to generate context-aware weights that enhance the local features. Finally, the outputs of the global and local branches are fused, allowing the model to capture both high- and low-frequency information. This process is defined in Formulas (1)–(5).
$$Y_{\mathrm{global}} = \mathrm{Attn}\big(Q, \mathrm{Pool}(K), \mathrm{Pool}(V)\big) \tag{1}$$
$$Q, K, V = F_c(P_{\mathrm{in}}) \tag{2}$$
$$V_o = \mathrm{DWconv}(V),\quad Q_o = \mathrm{DWconv}(Q),\quad K_o = \mathrm{DWconv}(K) \tag{3}$$
$$Y_{\mathrm{local}} = \mathrm{Tanh}\!\left(\frac{F_c\big(\mathrm{Swish}\left(F_c(Q_o \odot K_o)\right)\big)}{\sqrt{d}}\right) \odot V_o \tag{4}$$
$$Y_o = F_c\big(\mathrm{Concat}(Y_{\mathrm{global}}, Y_{\mathrm{local}})\big) \tag{5}$$
where $Y_{\mathrm{global}}$ is the output of the global branch; Attn is the attention mechanism; Pool denotes downsampling; $P_{\mathrm{in}}$ is the input of AttnConv; $F_c$ is a fully connected layer; $V_o$, $Q_o$, and $K_o$ are the outputs of the depth-wise convolutions; $Y_{\mathrm{local}}$ is the output of the local branch; Tanh and Swish are activation functions; $\odot$ is the Hadamard product; $d$ is the number of token channels; and $Y_o$ is the fused output of the local and global branches.
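To make Formulas (2)–(4) concrete, the following PyTorch sketch implements a simplified, single-head AttnConv-style local branch. The 1×1 convolutions standing in for the fully connected layers, the kernel size, and the class name are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SimpleAttnConv(nn.Module):
    """Simplified AttnConv-style local branch (cf. Formulas (2)-(4))."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)          # Fc producing Q, K, V
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)  # shared-weight DWconv
        self.fc1 = nn.Conv2d(dim, dim, kernel_size=1)
        self.fc2 = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.SiLU()                                        # Swish activation
        self.scale = dim ** -0.5                                    # 1 / sqrt(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=1)                       # Formula (2)
        q_o, k_o, v_o = self.dwconv(q), self.dwconv(k), self.dwconv(v)  # Formula (3)
        # Context-aware weights from the Hadamard product of Q_o and K_o (Formula (4))
        attn = torch.tanh(self.fc2(self.act(self.fc1(q_o * k_o))) * self.scale)
        return attn * v_o                                           # weight the local values

x = torch.randn(1, 64, 40, 40)
print(SimpleAttnConv(64)(x).shape)   # torch.Size([1, 64, 40, 40])
```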

3.3. Improved Neck C2f Structure

The classic neck C2f module suffers a significant increase in feature dimension during feature integration, which complicates feature representation during feature optimization, as shown in Figure 5a. Multiple SCConv modules are introduced to replace the bottleneck in the C2f module, creating a multi-branch structure called the C2f-SCConv module. This module reduces redundant information and enhances feature representation, as illustrated in Figure 5b. By adaptively adjusting the spatial structure and channel relationships of features, it increases the representation ability of features during integration and makes feature optimization more flexible, improving the recognition of overlapping and damaged targets in marine debris images.
Figure 6 and Figure 7 show the spatial reconstruction unit (SRU) and channel reconstruction unit (CRU), respectively, which constitute SCConv. The SRU suppresses the spatial redundancy of feature maps through Separate-Reconstruct, and the CRU reduces channel redundancy through Split-Transform-Fuse.
As shown in Figure 6, the SRU first separates information-rich feature maps from less informative ones according to their spatial content, weights them, and cross-reconstructs the two parts to obtain features $X_{w1}$ and $X_{w2}$, which are concatenated into the spatially refined feature map $X_w$.
As shown in Figure 7, the CRU divides the feature map $X_w$ into a main part (the upper path) and a supplementary part (the lower path). Group-wise convolution (GWC) and point-wise convolution (PWC) are applied in the upper path to obtain $Y_1$, and the lower path output after point-wise convolution is concatenated with its input to obtain $Y_2$. After global average pooling (P), softmax is applied to $S_1$ and $S_2$, and the two paths are fused into the refined channel feature $Y$.
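As an illustration of the Separate-Reconstruct idea in Figure 6, the sketch below implements a simplified spatial reconstruction unit that scores channels with GroupNorm scale parameters, splits the feature map into information-rich and less informative parts, and cross-reconstructs them. The gating threshold and the half-and-half channel split are simplifying assumptions, not the exact SCConv code.

```python
import torch
import torch.nn as nn

class SimpleSRU(nn.Module):
    """Simplified spatial reconstruction unit (Separate-Reconstruct, cf. Figure 6)."""
    def __init__(self, channels: int, groups: int = 4, gate_threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gn_x = self.gn(x)
        # GroupNorm scale parameters act as a per-channel "information richness" score.
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(gn_x * w)
        x_info = torch.where(gate >= self.gate_threshold, x, torch.zeros_like(x))  # X_w1
        x_less = torch.where(gate <  self.gate_threshold, x, torch.zeros_like(x))  # X_w2
        # Cross-reconstruction: mix the two parts so less informative features are reused.
        a1, a2 = x_info.chunk(2, dim=1)
        b1, b2 = x_less.chunk(2, dim=1)
        return torch.cat([a1 + b2, a2 + b1], dim=1)                                # X_w

x = torch.randn(1, 64, 40, 40)
print(SimpleSRU(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```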

3.4. Loss Function

YOLOv8 uses DFL and CIoU to calculate the regression loss of the bounding box. The CIoU and bounding box regression loss functions are defined as Formulas (6) and (7) [23,24].
$$CIoU = IoU - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha\nu \tag{6}$$
$$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha\nu \tag{7}$$
where IoU is the overlap ratio between the predicted and ground-truth boxes, $\rho\left(b, b^{gt}\right)$ is the Euclidean distance between their center points, $\alpha$ is a balance parameter, $\nu$ measures the consistency of the aspect ratios, and $c$ is the diagonal length of the minimum enclosing box covering the two bounding boxes.
CIoU has three disadvantages: (1) overemphasizing one factor in Formula (6) may bias the model during training; (2) CIoU may not accurately reflect the quality of the predicted box when the target is occluded by other objects; and (3) it is difficult for the model to learn correct predictions for targets with extreme aspect ratios (e.g., eels, pipes, wood).
CIoU also does not consider the impact of low-quality samples in the training data, so we replace it with WIoU v3 [25] in our YOLOv8-based marine debris identification model. WIoU v3 employs a gradient allocation mechanism that assigns smaller gradient gains to both small and large outliers, effectively identifying prediction errors caused by poor sample quality and suppressing the harmful gradients generated by low-quality samples. In addition, WIoU v3 adjusts the gradient magnitude and direction, enabling the model to prioritize high-quality, representative samples during optimization. The improved loss function is shown in Formulas (8)–(10).
$$R_{WIoU} = \exp\!\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \in [1, e) \tag{8}$$
$$L_{WIoU\,v1} = R_{WIoU} \cdot L_{IoU}, \quad L_{IoU} \in [0, 1] \tag{9}$$
$$L_{WIoU\,v3} = r \cdot L_{WIoU\,v1}, \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}, \quad \beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty) \tag{10}$$
where $x$ and $y$ are the center coordinates of the predicted box; $x_{gt}$ and $y_{gt}$ are the center coordinates of the ground-truth box; $W_g$ and $H_g$ are the width and height of the minimum enclosing box; the superscript $*$ denotes the separation operation (detaching from the computational graph); $R_{WIoU}$ is an amplification factor based on the center-point distance between the predicted and ground-truth boxes; $L_{IoU}$ is the IoU loss; $r$ is the gradient gain; $\alpha$ and $\delta$ are hyperparameters; $\beta$ is the outlier degree; $L_{IoU}^{*}$ is the monotonic focusing coefficient; and $\overline{L_{IoU}}$ is the average IoU loss value.
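A minimal sketch of Formulas (8)–(10) for axis-aligned boxes in (x1, y1, x2, y2) format is given below. The hyperparameter values and the externally maintained running mean of the IoU loss are illustrative assumptions rather than the exact training configuration.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0):
    """WIoU v3 loss for boxes in (x1, y1, x2, y2) format (cf. Formulas (8)-(10)).
    `iou_mean` is a running mean of L_IoU maintained outside this function."""
    # Intersection and union for the plain IoU loss
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # R_WIoU: center-point distance normalized by the enclosing-box diagonal (Formula (8));
    # the denominator is detached, mirroring the '*' separation operation.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2)
                       / (wg ** 2 + hg ** 2 + 1e-7).detach())

    # Outlier degree beta and non-monotonic gradient gain r (Formula (10))
    beta = l_iou.detach() / (iou_mean + 1e-7)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()
```

In practice, `iou_mean` would be updated as a running (e.g., exponential moving) average of `l_iou` across training steps, so that the outlier degree β adapts as training progresses.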

4. Experiment and Results

4.1. Experimental Environment and Parameter Settings

The experimental platform was built on Python 3.7, PyTorch 1.10, CUDA 10.2, and CUDNN 7.6, running on a Windows 10 machine equipped with a 24 GB NVIDIA GeForce RTX 3090 GPU. The model was trained using Stochastic Gradient Descent (SGD) optimization, with initial and final learning rates of 0.001 and 0.0001, respectively; weight decay and momentum set to 0.0003 and 0.91, respectively; and a batch size of 16. Training ran for up to 200 epochs and was stopped early if the model did not improve for 20 consecutive epochs.
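Assuming the standard Ultralytics training interface, the settings above could be expressed roughly as follows; the dataset YAML filename and model configuration are placeholders rather than the authors' actual files.

```python
from ultralytics import YOLO

# Hypothetical training call mirroring the settings above (placeholder paths).
model = YOLO("yolov8m.yaml")            # the improved model would use a custom config
model.train(
    data="trashcan_instance.yaml",      # placeholder dataset description file
    epochs=200,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.001,                          # initial learning rate
    lrf=0.1,                            # final LR = lr0 * lrf = 0.0001
    momentum=0.91,
    weight_decay=0.0003,
    patience=20,                        # stop early after 20 epochs without improvement
)
```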

4.2. Evaluation Indicator and Training Process

To comprehensively and objectively evaluate the proposed algorithm, training was carried out on both the Instance and Material subsets of the TrashCan dataset, and precision, recall, F1 score, frames per second (FPS), model size (Size), [email protected], [email protected], and mAP [0.5:0.95] were used to evaluate model performance. Here, [email protected] and [email protected] are the mean average precision over all categories at IoU thresholds of 0.5 and 0.75, respectively, and mAP [0.5:0.95] is the mean of the mAP values obtained as the IoU threshold increases from 0.5 to 0.95 in steps of 0.05. The relevant definitions are given in Formulas (11)–(14).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}, \quad AP = \int_0^1 P(R)\, dR \tag{14}$$
where TP (true positive) is the number of correctly predicted positive cases, FP (false positive) is the number of negative cases predicted as positive, and FN (false negative) is the number of positive cases predicted as negative. P is precision and R is recall. AP is the average detection precision for one object category, N is the total number of target categories, and mAP is the mean of the AP values over all categories.
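A minimal sketch of Formulas (11)–(14) for a single class is shown below, assuming TP/FP/FN counts and a precision–recall curve have already been computed; the trapezoidal integration is a simplification of the usual interpolated AP.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Formulas (11)-(13) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Formula (14), right)."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_ap(ap_per_threshold_per_class: np.ndarray) -> float:
    """mAP: mean AP over classes; mAP[0.5:0.95] additionally averages over IoU thresholds.
    Expected shape: (num_iou_thresholds, num_classes), e.g. thresholds 0.5, 0.55, ..., 0.95."""
    return float(ap_per_threshold_per_class.mean())
```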
The above indicators are employed to assess the trained model. The proposed model is trained as shown in Algorithm 1.
Algorithm 1 Training steps of the model
1: Basic parameters: θ = {image size = (640, 640, 3), epochs = 200, batch size = 16, no pre-trained weights, optimizer = SGD with learning rate = 0.001, validation = per epoch}
2: Evaluation indicators: [P, R, F1, [email protected], [email protected], mAP [0.5:0.95], Size (MB), FPS]
3: Create data loaders for Trashcan_train, Trashcan_val, and Trashcan_test
4: Initialize model training
5: for epoch = 1 to total_epochs do
6:  for iteration = 1 to number_of_iterations_per_epoch do
7:   Feed Trashcan_train and Trashcan_val batches into the model
8:   Split each batch into images M and labels N
9:   Forward M through the model to obtain the output tensor T
10:   Compute the loss L from the label tensor N and the output tensor T
11:   Compute the gradient of the loss L with respect to the model parameters
12:   Update the model weights using the SGD optimizer
13:  end for
14: end for
15: Obtain the trained model α
16: Load α for model testing
17: Initialize the prediction list X and the ground-truth label list Y as empty
18: for each batch in Trashcan_test do
19:  Split the batch into images M and labels N
20:  Pass M through α to obtain the output tensor T
21:  Append the output tensor T and the label tensor N to the lists X and Y, respectively
22: end for
23: Output [P, R, F1, [email protected], [email protected], mAP [0.5:0.95], Size (MB), FPS]
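The testing phase of Algorithm 1 (steps 16–23) corresponds to a plain evaluation loop; the sketch below uses placeholder names for the data loader, trained model, and metric computation.

```python
import torch

@torch.no_grad()
def evaluate(model, test_loader, metric_fn, device="cuda"):
    """Collect predictions X and ground-truth labels Y over the test set (Algorithm 1, steps 16-23)."""
    model.eval()
    predictions, ground_truths = [], []          # lists X and Y in Algorithm 1
    for images, labels in test_loader:           # split each batch into images M and labels N
        outputs = model(images.to(device))       # pass M through the trained model alpha
        predictions.append(outputs)
        ground_truths.append(labels)
    # metric_fn stands in for the P/R/F1/mAP computation of step 23
    return metric_fn(predictions, ground_truths)
```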

4.3. TrashCan-Instance Dataset Simulation Analysis

To validate the proposed model, it was trained and evaluated on the TrashCan-Instance dataset and compared with current algorithms, including a similar single-stage algorithm and a two-stage algorithm biased toward accuracy. The results are shown in Table 2 and Table 3.
(1)
The comparison algorithms in Table 2 improve accuracy in different ways. As can be seen in the table, our model has superior [email protected], [email protected], and mAP [0.5:0.95] values, with relative improvements of 4.5%, 4.9%, and 4.9%, respectively, over the best-performing comparison model, MLDet [29].
(2)
Hong [17] uses a two-stage algorithm with high accuracy, and SSD [30], YOLOTrashCan [31], and Improved YOLOv5 [32] are fast single-stage algorithms. As can be seen from Table 3, the [email protected] of the proposed model is the best, the size is second only to that of the Improved YOLOv5 [32], and the FPS is second only to that of SSD [30], with a small gap.
(3)
It can be seen from Table 2 and Table 3 that the proposed model achieves a balance in evaluation indicators and realizes better recognition and classification effects with less running time.
To intuitively compare the effectiveness of each improvement, we used a YOLOv8m baseline to verify the effectiveness of the proposed model. The results of these ablation experiments are shown in Table 4, from which it can be seen that, except for the FPS of the proposed model, which is slightly worse than the original, other indicators are improved. The precision, recall, F1 score, and [email protected] are increased by 6%, 5%, 6%, and 5.7%, respectively, and the size is reduced by 6.4 MB. Figure 8 shows that the proposed model achieves high accuracy with fewer training cycles, indicating faster convergence and enhanced recognition performance.
A comparison of the [email protected] for each category between the improved YOLOv8 model and the unimproved YOLOv8m model in the TrashCan-Instance dataset is shown in Figure 9.
(1)
As can be seen from Figure 9, the improvement in the recognition rate of the proposed model exceeds 40% for the clothing and crab categories, and the other 16 categories are improved to varying degrees.
(2)
The recognition rates of can, branch, wreckage, and tarp declined slightly. This may be because the improved model discriminates the features of other small-target categories more strongly, whereas cans, which are smaller than the other object types in the images, suffer from low feature discrimination caused by over-sensitivity to some labels. Moreover, because the dataset contains many categories and the branch, wreckage, and tarp categories have few labels, the model may be biased during training toward categories with many labels, slightly reducing their recognition rates.
To evaluate the proposed model’s capability to handle low-resolution images and resist interference, this study selected low-quality data with noise from the TrashCan-Instance dataset and conducted heatmap experiments, as depicted in Figure 10.
(1)
The heat maps in Figure 10b,c demonstrate that the proposed model displays brighter and more concentrated warm colors for the target, effectively capturing important features from low-resolution images and providing more accurate target position information.
(2)
In the heatmap in Figure 10c, the proposed model exhibits reduced brightness in noisy areas, indicating a significant decrease in the impact of suspended particle noise during image processing, showcasing the model’s anti-interference ability in noisy environments.
To visualize the marine debris recognition effect of the model, it was compared with the YOLOv8m model on the TrashCan-Instance dataset, with results as shown in Figure 11. The bounding box indicates the position of an object in the image, and the object category and recognition rate are indicated above each box.
(1)
From Figure 11b,c, it can be seen that the YOLOv8m model missed the detection of starfish and plants and mistakenly detected tarp and rov. However, the recognition rate of the proposed model was slightly less than that of YOLOv8m for bags. In all other categories, this model improved the recognition rate, with no missed or false detections.
(2)
Overall, the proposed model showed good results, better than those of the YOLOv8m model on the TrashCan-Instance dataset, and could accurately identify and classify multiple targets and information-damaged targets in a variety of complex underwater environments.

4.4. Simulation Analysis of TrashCan-Material Dataset

To verify the generalization performance of the proposed model, experiments similar to those in Section 4.3 were performed on the TrashCan-Material dataset, with results as shown in Table 5 and Table 6.
(1)
The comparison models in Table 5 address various problems in the recognition of marine debris images using different algorithms. It can be seen that all indicators of the proposed model exceed those of the comparison models, with relative improvements of 5%, 6.3%, and 3.5% in [email protected], [email protected], and mAP [0.5:0.95], respectively, over the best existing model, MLDet [29].
(2)
Table 6 shows the comparison of three algorithms with the proposed model in terms of speed and accuracy. It can be seen that the [email protected] of the proposed model exceeds those of the compared models, and the size is much smaller. The FPS of the proposed model is 13 less than that of SSD [30] but is higher than those of the other two models.
(3)
From the analysis of Table 5 and Table 6, the proposed model achieved comparatively good results in terms of recognition speed and effect.
To visually assess the improvement effect of the proposed model, it was compared with the unimproved YOLOv8m model in ablation experiments on the TrashCan-Material dataset, with results as shown in Table 7. While the FPS of the proposed model was slightly worse, other indicators improved. Precision increased by 3%, recall by 6%, F1 score by 5%, and [email protected] by 5.5%, and size was reduced by 6.4 MB. In addition, Figure 12 demonstrates that the proposed model also performs well in recognition.
A comparison between the [email protected] for each category of the improved YOLOv8 model and the unimproved YOLOv8m model in the TrashCan-Material dataset is shown in Figure 13, from which we make the following observations:
(1)
Using the proposed model, the recognition rate of paper increased the most, by 25%, and the rates for the other 12 categories improved to varying degrees;
(2)
The recognition rates of fabric, other trash, and plastic decreased slightly, perhaps due to insufficient learning of these categories' features: although the improved modules strengthen the model's ability to distinguish the other categories, the strong feature similarity among fabric, other trash, and plastic affects training and leads to a slight decrease in their recognition rates.
Figure 14 depicts the heatmap experiment conducted on the TrashCan-Material dataset. Consistent with the findings in Section 4.3, the proposed model demonstrates the capacity to handle low-resolution images and strong resistance to interference.
To illustrate the effectiveness of the proposed algorithm, the visualization results on the TrashCan-Material dataset test are shown in Figure 15. We make the following observations:
(1)
When there are multiple small and overlapping targets, the unmodified YOLOv8m is prone to missing fish, other animals, starfish, and eels, whereas the proposed model does not have this problem and identifies them all with a high recognition rate;
(2)
Small target starfish that are not marked in the original image can be identified by the proposed model, indicating good feature learning potential.

4.5. Analysis and Summary

We observe the following from the test results on the TrashCan-Instance and TrashCan-Material datasets:
(1)
From Section 4.3 and Section 4.4, it can be seen that the proposed model shows excellent performance on both subsets, indicating good generalization ability.
(2)
The constructed network model and loss function have certain effects on the extraction and fusion of marine debris image features, the integration and optimization of features, and the suppression of harmful gradients, reflecting certain progress in terms of recognition and classification effects, recognition speed, and model complexity. In addition, the model exhibits a certain ability to identify targets in low-resolution images while resisting interference.
(3)
The FPS of the proposed model is slightly lower than that of the classic, unimproved model, reflecting a trade-off of inference efficiency for better recognition results. The model size has also not reached the optimal level, possibly because no further lightweight improvements have been made to the YOLOv8 model. In future work, we will consider applying model pruning to remove weights or connections that contribute little to model performance.

5. Conclusions

This study proposes a marine debris recognition and classification algorithm based on an improved YOLOv8, addressing the poor recognition and classification performance, slow recognition speed, model complexity, and weak generalization capability of existing marine debris recognition methods. Experimental results indicate that the proposed model achieves a [email protected] and speed of 72% and 66 FPS, respectively, on the TrashCan-Instance dataset, and a [email protected] and speed of 66.70% and 71 FPS, respectively, on the TrashCan-Material dataset. Our design also reduced the model size by 12.9%. Visual assessment reveals effective recognition and classification in complex and variable underwater environments, significantly minimizing missed and false detections.
Future research will target categories affected by feature similarity and those with limited labels. By employing image enhancement and model pruning techniques, we aim to tackle identification challenges arising from poor original image quality and large model sizes, further enhancing marine debris identification and classification.

Author Contributions

Conceptualization, W.J. and L.Y.; methodology, L.Y. and Y.B.; validation, W.J. and L.Y.; data curation, W.J. and L.Y.; investigation, L.Y. and Y.B.; writing—original draft preparation, L.Y.; writing—review and editing, W.J.; supervision, W.J.; funding acquisition, W.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Xihua University Science and Technology Innovation Competition Project for Postgraduate Students under grant no. YK20240002, Sichuan Science and Technology Program under grant no. 2021JDJQ0027, and the Natural Science Foundation of China under grant no. 61875166.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

Wenbo Jiang would like to acknowledge the Sichuan Provincial Academic and Technical Leader Training Plan and the Overseas Training Plan of Xihua University (September 2014–September 2015, University of Michigan, Ann Arbor, MI, USA).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

YOLOv8: You Only Look Once Version 8
CloFormer: Clo block Transformer
WIoU v3: Wise Intersection over Union version 3
C2f: Coarse to Fine
SPPF: Spatial Pyramid Pooling with less FLOPs
CIoU: Complete Intersection over Union
DFL: Distribution Focal Loss
BCE: Binary Cross-Entropy
Bbox: Bounding Box
Cls: Classification

References

  1. Yu, J.; Liu, J. Exploring governance policy of marine fishery litter in China: Evolution, challenges and prospects. Mar. Pollut. Bull. 2023, 188, 114606. [Google Scholar] [CrossRef] [PubMed]
  2. Patra, S.; Khurshid, M.; Basir, A.; Mishra, P.; Ramanamurthy, M.V. Marine litter management: A sustainable action plan and recommendations for the South Asian Seas region. Mar. Policy 2023, 157, 105854. [Google Scholar] [CrossRef]
  3. Zhang, B.; Ji, D.; Liu, S.; Zhu, X.; Xu, W. Autonomous underwater vehicle navigation: A review. Ocean Eng. 2023, 273, 113861. [Google Scholar] [CrossRef]
  4. Chen, Z.; Jiao, W.; Ren, K.; Yu, J.; Tian, Y.; Chen, K.; Zhang, X. A survey of research status on the environmental adaptation technologies for marine robots. Ocean Eng. 2023, 286, 115650. [Google Scholar] [CrossRef]
  5. Zhang, L.; Huang, J.; Jin, Y.; Hau, Y.; Jiang, M.; Zhang, Q. Waveform diversity based sonar system for target localization. J. Syst. Eng. Electron. 2010, 21, 186–190. [Google Scholar] [CrossRef]
  6. Tucker, J.D.; Azimi-Sadjadi, M.R. Coherence-based underwater target detection from multiple disparate sonar platforms. IEEE J. Ocean. Eng. 2011, 36, 37–51. [Google Scholar] [CrossRef]
  7. Pellen, F.; Jezequel, V.; Zion, G.; Le Jeune, B. Detection of an underwater target through modulated lidar experiments at grazing incidence in a deep wave basin. Appl. Opt. 2012, 51, 7690–7700. [Google Scholar] [CrossRef] [PubMed]
  8. Gao, J.; Sun, J.; Wang, Q. Experiments of ocean surface waves and underwater target detection imaging using a slit Streak Tube Imaging Lidar. Optik 2014, 125, 5199–5201. [Google Scholar] [CrossRef]
  9. Azimi-Sadjadi, M.R.; Klausner, N.; Kopacz, J. Detection of underwater targets using a subspace-based method with learning. IEEE J. Ocean. Eng. 2017, 42, 869–879. [Google Scholar] [CrossRef]
  10. Yao, L.; Du, X. Identification of underwater targets based on sparse representation. IEEE Access 2019, 8, 215–228. [Google Scholar] [CrossRef]
  11. Valdenegro-Toro, M. Submerged marine debris detection with autonomous underwater vehicles. In Proceedings of the 2016 International Conference on Robotics and Automation for Humanitarian Applications (RAHA), Amritapuri, India, 18–20 December 2016; pp. 1–7. [Google Scholar]
  12. Yu, X.; Xing, X.; Zheng, H.; Fu, X.; Huang, Y.; Ding, X. Man-made object recognition from underwater optical images using deep learning and transfer learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1852–1856. [Google Scholar]
  13. Hong, J.; Fulton, M.; Sattar, J. A generative approach towards improved robotic detection of marine litter. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020; pp. 10525–10531. [Google Scholar]
  14. Politikos, D.V.; Fakiris, E.; Davvetas, A.; Klampanos, I.A.; Papatheodorou, G. Automatic detection of seafloor marine litter using towed camera images and deep learning. Mar. Pollut. Bull. 2021, 164, 111974. [Google Scholar] [CrossRef]
  15. Wei, L.; Kong, S.; Wu, Y.; Yu, J. Image semantic segmentation of underwater garbage with modified U-Net architecture model. Sensors 2022, 22, 6546. [Google Scholar] [CrossRef] [PubMed]
  16. Sinthia, A.K.; Rasel, A.A.; Haque, M. Real-time Detection of Submerged Debris in Aquatic Ecosystems using YOLOv8. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; pp. 1–6. [Google Scholar]
  17. Hong, J.; Fulton, M.; Sattar, J. Trashcan: A semantically-segmented dataset towards visual detection of marine debris. arXiv 2020, arXiv:2007.08097. [Google Scholar]
  18. Song, X.; Cao, S.; Zhang, J.; Hou, Z. Steel Surface Defect Detection Algorithm Based on YOLOv8. Electronics 2024, 13, 988. [Google Scholar] [CrossRef]
  19. Uddin, M.S.; Mazumder, M.K.A.; Prity, A.J.; Mridha, M.F.; Alfarhood, S.; Safran, M.; Che, D. Cauli-Det: Enhancing cauliflower disease detection with modified YOLOv8. Front. Plant Sci. 2024, 15, 1373590. [Google Scholar] [CrossRef]
  20. Lalinia, M.; Sahafi, A. Colorectal polyp detection in colonoscopy images using yolo-v8 network. Signal Image Video Process. 2024, 18, 2047–2058. [Google Scholar] [CrossRef]
  21. Fan, Q.; Huang, H.; Guan, J.; He, R. Rethinking local perception in lightweight vision transformer. arXiv 2023, arXiv:2303.17803. [Google Scholar]
  22. Li, J.; Wen, Y.; He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  23. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  24. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  25. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  26. Deng, H.; Ergu, D.; Liu, F.; Ma, B.; Cai, Y. An embeddable algorithm for automatic garbage detection based on complex marine environment. Sensors 2021, 21, 6391. [Google Scholar] [CrossRef]
  27. Ali, M.; Khan, S. Underwater object detection enhancement via channel stabilization. In Proceedings of the 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, 30 November–2 December 2022; pp. 1–8. [Google Scholar]
  28. Corrigan, B.C.; Tay, Z.Y.; Konovessis, D. Real-time instance segmentation for detection of underwater litter as a plastic source. J. Mar. Sci. Eng. 2023, 11, 1532. [Google Scholar] [CrossRef]
  29. Ma, D.; Wei, J.; Li, Y.; Zhao, F.; Chen, X.; Hu, Y.; Yu, S.; He, T.; Jin, R.; Li, Z.; et al. MLDet: Towards efficient and accurate deep learning method for Marine Litter Detection. Ocean. Coast. Manag. 2023, 243, 106765. [Google Scholar] [CrossRef]
  30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  31. Zhou, W.; Zheng, F.; Yin, G.; Pang, Y.; Yi, J. Yolotrashcan: A deep learning marine debris detection network. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
  32. Liu, J.; Zhou, Y. Marine debris detection model based on the improved YOLOv5. In Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 24–26 February 2023; pp. 725–728. [Google Scholar]
  33. Zocco, F.; Lin, T.C.; Huang, C.I.; Wang, H.C.; Khyam, M.O.; Van, M. Towards more efficient efficientdets and real-time marine debris detection. IEEE Rob. Autom. Lett. 2023, 8, 2134–2141. [Google Scholar] [CrossRef]
  34. Dai, L.; Liu, H.; Song, P.; Tang, H.; Ding, R.; Li, S. Edge-guided representation learning for underwater object detection. CAAI Trans. Intell. Technol. 2024. [CrossRef]
  35. Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222. [Google Scholar] [CrossRef]
Figure 1. Classic YOLOv8 network structure.
Figure 2. Improved YOLOv8 network structure, where the yellow box represents the classic structure of YOLOv8, and the red box represents the improved structure in our model.
Figure 3. Comparison of the structures of the two types of C2f before and after improvement, where the purple box represents the classic Bottleneck, the blue box represents the improved Bottleneck-CloFormer, the yellow box represents the convolutional layer, the green box represents the Split operation, and the red box represents the Concat operation. (a) Classic backbone C2f structure. (b) Improved backbone C2f-CloFormer structure.
Figure 4. The internal structure of the Clo block in CloFormer, where the blue dashed box represents the Global branch module, the red dashed box represents the Local branch module, and the green dashed box represents the Context-aware module.
Figure 5. Comparison of the two types of neck C2f structures before and after improvement, where the red box represents a multi-branch structure composed of multiple SCConv modules. (a) Classic neck C2f structure. (b) Improved neck C2f-SCConv structure.
Figure 6. The internal structure of the spatial reconstruction unit.
Figure 7. The internal structure of the channel reconstruction unit.
Figure 8. Training curves on the TrashCan-Instance dataset.
Figure 9. Comparison of the two models on the TrashCan-Instance dataset in terms of [email protected] for each category.
Figure 10. (a) Original images. Comparison of the heatmaps for (b) YOLOv8m and (c) the proposed model on the TrashCan-Instance dataset, where the brighter color indicates higher attention and the darker color indicates lower attention.
Figure 11. (a) Original images. Comparison of the recognition results of (b) YOLOv8m and (c) the proposed model on the TrashCan-Instance dataset.
Figure 12. Training curves on the TrashCan-Material dataset.
Figure 13. Comparison of the two models on the TrashCan-Material dataset in terms of [email protected] for each category.
Figure 14. (a) Original images. Comparison of the heatmaps for (b) YOLOv8m and (c) the proposed model on the TrashCan-Material dataset, where the brighter color indicates higher attention and the darker color indicates lower attention.
Figure 15. (a) Original images. Comparison of the recognition results of (b) YOLOv8m and (c) the proposed model on the TrashCan-Material dataset.
Table 1. Division and classification statistics of the TrashCan dataset.

| TrashCan | Training Set | Validation Set | Training Set Labels | Validation Set Labels | Number of Categories |
|----------|--------------|----------------|---------------------|-----------------------|----------------------|
| Instance | 6065 | 1147 | 9540 | 2588 | 22 |
| Material | 6008 | 1204 | 9741 | 2595 | 16 |
Table 2. Comparison of multi-threshold mAP performance on the TrashCan-Instance dataset.

| Model | [email protected] (%) | [email protected] (%) | mAP [0.5:0.95] (%) | Reference |
|-------|--------------|---------------|--------------------|-----------|
| Improved Mask R-CNN | 65.00 | 48.10 | 44.10 | Deng et al. (2021) [26] |
| IEM | 63.09 | 48.38 | 44.03 | Ali et al. (2022) [27] |
| YOLACT | 58.80 | 42.50 | 37.70 | Corrigan et al. (2023) [28] |
| MLDet | 68.90 | 55.10 | 49.20 | Ma et al. (2023) [29] |
| Improved YOLOv8 | 72.00 | 57.80 | 51.60 | This study |
Table 3. Comprehensive evaluation of model performance on the TrashCan-Instance dataset.

| Model | Size (MB) | [email protected] (%) | FPS | Reference |
|-------|-----------|--------------|-----|-----------|
| Faster R-CNN | 795 | 55.40 | 18 | Hong et al. (2020) [17] |
| SSD | 205 | 58.12 | 78 | Liu et al. (2016) [30] |
| YOLOTrashCan | 214 | 65.01 | 36 | Zhou et al. (2022) [31] |
| Improved YOLOv5 | 17.2 | 67.00 | 61 | Liu et al. (2023) [32] |
| Improved YOLOv8 | 43.2 | 72.00 | 66 | This study |
Table 4. Performance of ablation experiments conducted by integrating different modules on the TrashCan-Instance dataset.

| Group | Model | Precision | Recall | F1 | Size (MB) | [email protected] (%) | FPS |
|-------|-------|-----------|--------|----|-----------|--------------|-----|
| 1 | YOLOv8m | 0.73 | 0.62 | 0.67 | 49.6 | 66.30 | 89 |
| 2 | + CloFormer | 0.78 | 0.62 | 0.69 | 46.9 | 68.00 | 61 |
| 3 | + SCConv | 0.79 | 0.63 | 0.70 | 45.7 | 68.90 | 73 |
| 4 | + WIoU v3 | 0.75 | 0.62 | 0.68 | 49.6 | 67.50 | 89 |
| 5 | + CloFormer + SCConv | 0.80 | 0.65 | 0.72 | 43.2 | 71.10 | 66 |
| 6 | + CloFormer + WIoU v3 | 0.79 | 0.63 | 0.70 | 46.9 | 69.00 | 61 |
| 7 | + SCConv + WIoU v3 | 0.76 | 0.64 | 0.70 | 45.7 | 69.30 | 73 |
| 8 | Our Model | 0.79 | 0.67 | 0.73 | 43.2 | 72.00 | 66 |

Note: + represents the modules added by each group.
Table 5. Comparison of multi-threshold mAP performance on the TrashCan-Material dataset.

| Model | [email protected] (%) | [email protected] (%) | mAP [0.5:0.95] (%) | Reference |
|-------|--------------|---------------|--------------------|-----------|
| IEM | 56.70 | 38.68 | 36.11 | Ali et al. (2022) [27] |
| EfficientDets | 27.80 | 20.90 | 18.60 | Zocco et al. (2023) [33] |
| ERL-Net | 58.90 | / | 37.00 | Dai et al. (2024) [34] |
| GCC-Net | 61.20 | / | 41.30 | Dai et al. (2024) [35] |
| MLDet | 63.50 | 45.70 | 42.30 | Ma et al. (2023) [29] |
| Improved YOLOv8 | 66.70 | 48.60 | 43.80 | This study |
Table 6. Comprehensive evaluation of model performance on the TrashCan-Material dataset.

| Model | Size (MB) | [email protected] (%) | FPS | Reference |
|-------|-----------|--------------|-----|-----------|
| Mask R-CNN | 795 | 54.00 | 21 | Hong et al. (2020) [17] |
| SSD | 194 | 55.80 | 84 | Liu et al. (2016) [30] |
| YOLOTrashCan | 214 | 58.66 | 37 | Zhou et al. (2022) [31] |
| Improved YOLOv8 | 43.2 | 66.70 | 71 | This study |
Table 7. Performance of ablation experiments conducted by incorporating different modules on the TrashCan-Material dataset.

| Group | Model | Precision | Recall | F1 | Size (MB) | [email protected] (%) | FPS |
|-------|-------|-----------|--------|----|-----------|--------------|-----|
| 1 | YOLOv8m | 0.70 | 0.58 | 0.63 | 49.6 | 61.20 | 90 |
| 2 | + CloFormer | 0.69 | 0.59 | 0.63 | 46.9 | 63.40 | 69 |
| 3 | + SCConv | 0.71 | 0.59 | 0.64 | 45.7 | 63.60 | 79 |
| 4 | + WIoU v3 | 0.69 | 0.58 | 0.63 | 49.6 | 63.30 | 90 |
| 5 | + CloFormer + SCConv | 0.74 | 0.60 | 0.66 | 43.2 | 65.30 | 71 |
| 6 | + CloFormer + WIoU v3 | 0.71 | 0.60 | 0.65 | 46.9 | 63.90 | 69 |
| 7 | + SCConv + WIoU v3 | 0.72 | 0.61 | 0.66 | 45.7 | 64.90 | 79 |
| 8 | Our Model | 0.73 | 0.64 | 0.68 | 43.2 | 66.70 | 71 |

Note: + represents the modules added by each group.