1. Introduction
In the contemporary marine environment, a substantial share of pollution is attributable to waste, with plastic debris as the predominant contributor. According to statistics, approximately 15 million tons of plastic are discharged into the oceans annually, and this figure is growing exponentially year over year [1]. Plastic waste that remains in water for prolonged periods either decomposes into minuscule particles that are imperceptible to the naked eye or entangles underwater organisms, damaging ecological environments and species diversity. This critical issue compels us to intensify efforts to clean up marine plastic litter [2]. Currently, two primary methods are employed for marine litter removal. The first is manual cleaning, which primarily targets surface litter and suffers from low efficiency, high costs, and potential safety hazards for workers. The second is the use of intelligent devices, such as underwater robots. The latter approach offers high efficiency, reliability, and safety and can address both surface and underwater litter, so it is gradually emerging as the mainstream option in litter cleaning operations [3]. A capable object detection algorithm can be regarded as the eyes of an underwater robot, furnishing it with real-time and precise object information to guide litter collection tasks [4]. Because underwater environments are considerably more complex in practical applications, object detection in marine settings is more difficult. Research specifically addressing underwater object detection methods for marine litter is therefore particularly important [5].
Object detection is a computer vision problem that aims to identify and locate particular items in images or video footage; each detected object is typically represented by a bounding box. The development of object detection can be divided into two phases: the era of conventional object detection algorithms (1998–2012) and the era of deep-learning-based object detection algorithms (2012–present). Traditional object detection relied on manual feature extraction. By contrast, deep-learning-based detectors employ convolutional neural networks (CNNs) to extract the characteristics of the target automatically, learning these features from the training data. This approach replaced the manual design of filters and offers improved generalization ability and robustness. As a result, traditional object detection algorithms have gradually fallen out of use [6]. Currently, deep-learning-based object detection algorithms can be categorized into two main types: region-proposal-based algorithms (two-stage detection algorithms) and regression-based algorithms (one-stage detection algorithms). The former, represented by R-FCN [7] and the R-CNN [8] series of algorithms, first identify potential target regions and then proceed with classification. While these algorithms exhibit high precision, their detection speed is relatively slow, making them less appropriate for practical production applications. The latter, exemplified by SSD [9] (single-shot multi-box detector) and the YOLO [10] series (You Only Look Once, a widely used model family known for its speed and precision, first introduced by Joseph Redmon et al. in 2016 and since developed through several iterations), directly extract features and perform object classification and localization using CNNs. This type of algorithm has been widely adopted in object detection because of its faster detection speed [11].
Underwater robots rely on underwater video footage or images for underwater litter detection [12]. Common methods for video and image acquisition include satellite remote sensing, sonar, and optical cameras. However, satellite remote sensing is primarily used for detecting large debris targets on the water surface and is ineffective for underwater debris detection. It is also difficult to deploy on mobile platforms such as underwater robots. Sonar can detect underwater debris and is suitable for deployment on mobile devices, but its high cost and susceptibility to noise interference limit its general applicability. Optical cameras, being economical, stable, and flexible, are the preferred choice for video or image capture [13]. Nevertheless, the intricate underwater space and lighting conditions often lead to indistinct optical images, posing challenges to the localization and identification of underwater litter [14]. Chen et al. [15] proposed a small-target detection network, SWIPENet, which incorporates the sample reweighting algorithm IMA to mitigate the impact of underwater environmental noise on detection results. Lin et al. [16] introduced an underwater image-enhancement method, RoIMix, which synthesizes enhanced samples from multiple images as training data and yields an evident enhancement effect. These image-enhancement methods effectively address challenges in complex underwater images, such as uneven lighting, color distortion, and noise, thereby providing significant assistance in underwater litter detection. Currently, numerous deep-learning-based object detection methods are employed for underwater litter detection. Tian et al. [4] proposed an underwater litter detection method for underwater robots based on an improved YOLOv4, attaining rapid and accurate object detection. To address the limited storage space and computational capabilities of underwater mobile devices, Wu et al. [17] proposed an underwater litter detection algorithm based on an improved YOLOv5s, ensuring high detection precision while reducing the model size. Serious inter-class similarity (deformation) and intra-class variability (discoloration) exist in plastic litter deposited in the ocean [18]. This means that the attributes of marine litter of the same category are no longer uniform, and different categories of marine litter may resemble one another in photographs. To deal with this problem, Ma et al. [19] presented an effective and accurate deep learning technique called MLDet for detecting marine litter. This method does not rely on general detection frameworks and has demonstrated favorable results.
To balance the precision and real-time performance of the object detection algorithm and to address the inter-class similarity and intra-class variability of underwater litter effectively, we chose to improve the YOLOv7-tiny model from the YOLOv7 [20] series. We incorporated a series of efficient components designed for the detection of underwater plastic litter, resulting in the proposed YOLOv7t-CEBC model. Compared to previous object detection models, YOLOv7t-CEBC demonstrates improvements in both precision and detection speed, making it suitable for portable mobile devices. Experimental validation on underwater litter images confirms the efficacy of the improved model for detecting underwater litter. In this article, we discuss what makes YOLOv7t-CEBC stand out and how it compares to other object detection algorithms. The contributions of this article can be summarized as follows:
- (1) The ConvNeXt Block (CNeB) [21] full-convolution module was incorporated into the backbone network, making use of its simplicity and efficiency in full convolution. By drawing inspiration from the structural advantages of models like Swin Transformer [22], ResNeXt [23], and MobileNetV2 [24], the model’s ability to learn from input feature maps was enhanced.
- (2) The introduction of the EMA (efficient multiscale attention) [25] mechanism in the backbone network improved the feature extraction capacity for capturing global target features. The inclusion of the BiFormer (bi-level routing attention) [26] attention mechanism in the head network enhanced detection performance for small and densely packed targets.
- (3) In the head network, the upsampling layer was replaced with the universal upsampling operator CARAFE (content-aware reassembly of features) [27]. CARAFE provides a larger receptive field during feature reassembly and guides the reassembly process using the input content. It achieved superior performance compared to regular upsampling while introducing minimal additional parameters and computational overhead.
An overview of the sections of this article is provided:
Section 2 describes the YOLOv7-tiny algorithm in detail and introduces the dataset used in the experiments.
Section 3 presents the proposed YOLOv7t-CEBC model.
Section 4 validates the effectiveness of the YOLOv7t-CEBC model through experiments using an underwater litter dataset and analyzes the limitations of the model.
Section 5 provides a summary of this article.
3. Methods
To address the impact of complex underwater environments on optical images and the influence of inter-class similarity and intra-class variability of underwater litter on detection precision, several components designed for marine litter detection were introduced into the improved YOLOv7t-CEBC. The specific introductions made are discussed in the following subsections.
3.1. CNeB (ConvNeXt Block)
The ConvNeXt Block [21] is the basic unit of ConvNeXt, a convolution-based visual model proposed by the FAIR team. Building upon the design principles of Swin Transformer [28], ConvNeXt [31] adjusts the stage stacking ratio of ResNet50 to (3, 3, 9, 3) and replaces the stem downsampling module with a patchify layer. Borrowing the concept of group convolution from ResNeXt [32], ConvNeXt adopts the more aggressive depthwise convolution and adjusts the initial channel count from 64 to 96, consistent with Swin Transformer [28]. Additionally, ConvNeXt incorporates the inverted bottleneck module from MobileNetV2 [24] and, following Swin Transformer, enlarges the depthwise convolution kernel from 3 × 3 to 7 × 7. In terms of activation functions and normalization layers, ConvNeXt uses fewer of both, replacing ReLU with GELU as the activation function and swapping batch normalization (BN) for the layer normalization (LN) used in transformers. Lastly, a separate downsampling layer is introduced, composed of layer normalization and a convolution layer with a kernel size of 2 and a stride of 2. A structure diagram of the ConvNeXt Block is shown in Figure 4; it combines the aforementioned improvements, offering multifaceted optimizations for model performance.
ConvNeXt is a purely convolutional model that leverages the advantages of various state-of-the-art models, demonstrating faster inference speed and higher accuracy while retaining the inherent simplicity and efficiency of standard ConvNets. Exploiting these outstanding features, we integrated ConvNeXt into the backbone network of YOLOv7t-CEBC to improve the extraction of characteristics and learning capabilities of the backbone network for better handling of underwater litter, especially addressing the non-rigidity and susceptibility-to-deformation characteristics of plastic waste. Moreover, this fusion method efficiently mitigated the problem of information loss, allowing the network to attain increased depth without experiencing gradient vanishing. Additionally, it enhanced the network’s sensitivity to variations in network weights. Consequently, this enhanced the overall detection efficiency and precision of the model.
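To make the cost of these design choices concrete, the following sketch counts the learnable parameters of one ConvNeXt-style block (7 × 7 depthwise convolution, layer normalization, and an inverted bottleneck of two 1 × 1 convolutions). The 4× expansion ratio, the bias terms, and the function names are assumptions made for illustration; the optional layer-scale parameter is omitted, and the numbers are not taken from this article.

```python
def convnext_block_params(channels: int, kernel_size: int = 7, expansion: int = 4) -> int:
    """Count learnable parameters of one ConvNeXt-style block (assumed layout)."""
    hidden = channels * expansion
    # Depthwise conv: one k x k filter per channel, plus one bias per channel.
    depthwise = channels * kernel_size * kernel_size + channels
    # Layer normalization: one scale and one shift per channel.
    layer_norm = 2 * channels
    # Inverted bottleneck: 1 x 1 conv expanding to `hidden`, then 1 x 1 conv projecting back.
    expand = channels * hidden + hidden
    project = hidden * channels + channels
    return depthwise + layer_norm + expand + project


def dense_conv_params(channels: int, kernel_size: int) -> int:
    """Parameters of a standard (non-depthwise) k x k convolution with bias, for comparison."""
    return channels * channels * kernel_size * kernel_size + channels


if __name__ == "__main__":
    c = 96  # initial channel count mentioned in the text
    print("ConvNeXt-style block:", convnext_block_params(c))
    print("Dense 7 x 7 convolution:", dense_conv_params(c, 7))
```

Under these assumptions, the whole block with a 7 × 7 depthwise kernel needs far fewer parameters than a single dense 7 × 7 convolution at the same width, which is why the large kernel remains affordable.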
3.2. EMA (Efficient Multiscale Attention)
EMA, a novel and efficient multiscale attention mechanism proposed in [25], does not require dimensionality reduction. It primarily relies on channel attention and embeds spatial attention into channel attention to enhance feature fusion (Figure 5 shows the EMA network structure). EMA utilizes feature grouping and adopts the shared 1 × 1 convolution component of the CA [33] attention mechanism as its 1 × 1 branch. This branch decomposes the input tensor into two parallel 1D feature encoding vectors. Let the input tensor be $X \in \mathbb{R}^{C \times H \times W}$, where $C$ represents the number of input channels and $H$ and $W$ represent the spatial dimensions of the input features. Performing 1D global average pooling along the horizontal direction for channel $c$ at height $H$ ($z_c^H(H)$) can be expressed as:

$$z_c^H(H) = \frac{1}{W} \sum_{0 \le i < W} x_c(H, i)$$

Similarly, performing 1D global average pooling along the vertical direction for channel $c$ at width $W$ ($z_c^W(W)$), which encodes global information along the vertical direction, can be expressed as:

$$z_c^W(W) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, W)$$

Here, $x_c$ represents the input feature at channel $c$, and $i$ and $j$ represent the positions along the width and height dimensions, respectively, in the pooling operation. These two 1D global average pooling operations encode global information along the two spatial directions and capture long-distance spatial interactions in each direction, helping the network improve its understanding of feature maps. The two encodings are then concatenated and processed, and a sigmoid function is applied to recalibrate the weights of each channel. A 3 × 3 convolution is placed in parallel with the 1 × 1 branch: the 3 × 3 branch expands the receptive field, capturing short-range spatial interactions and aggregating multiscale spatial structural information. Furthermore, these parallel substructures help the network avoid more sequential processing and deeper architectures while establishing both short- and long-range feature dependencies, which improves training and inference speed and ultimately yields better performance.
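The two directional pooling operations described above can be sketched in plain Python on a nested-list feature map indexed as `x[c][h][w]`. The function names and the toy input are hypothetical and serve only to illustrate the averaging; a real implementation would operate on tensors.

```python
def pool_along_width(x):
    """x[c][h][w] -> z[c][h]: average each row over the width axis (horizontal direction)."""
    return [[sum(row) / len(row) for row in channel] for channel in x]


def pool_along_height(x):
    """x[c][h][w] -> z[c][w]: average each column over the height axis (vertical direction)."""
    return [[sum(col) / len(col) for col in zip(*channel)] for channel in x]


# Toy feature map with C = 1, H = 2, W = 3.
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]

z_h = pool_along_width(x)   # one value per (channel, row)
z_w = pool_along_height(x)  # one value per (channel, column)
print(z_h, z_w)
```

Each output keeps one spatial axis intact, which is what lets the branch retain positional information along that axis while summarizing the other.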
Next, cross-spatial information along different spatial dimensions is aggregated to achieve richer feature aggregation, using 2D global average pooling ($z_c$), as follows:

$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(j, i)$$

where $x_c(j, i)$ is the input feature at channel $c$, row $j$, and column $i$. This pooling is applied separately to encode the outputs of the 1 × 1 branch and the 3 × 3 branch. After a softmax function is applied to each result, the two encodings are fused with the original outputs of the 3 × 3 branch and the 1 × 1 branch, respectively. This process generates two spatial attention maps that preserve complete and precise spatial positional information, which is crucial for capturing pixel-level relationships. Finally, the two generated maps are aggregated, passed through a sigmoid function, and multiplied with the input feature map, resulting in a feature map with redistributed weights.
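A minimal sketch of this cross-spatial aggregation follows, again on nested lists indexed as `x[c][h][w]`. It assumes that both branch outputs have the same shape as the input group, and it omits the normalization and grouping steps of the full module. All function names are hypothetical.

```python
import math


def global_avg_pool_2d(x):
    """x[c][h][w] -> one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_map(descriptor_branch, feature_branch):
    """Weight the channels of one branch by the pooled, softmaxed descriptor of the other."""
    weights = softmax(global_avg_pool_2d(descriptor_branch))
    height, width = len(feature_branch[0]), len(feature_branch[0][0])
    return [[sum(w * ch[i][j] for w, ch in zip(weights, feature_branch))
             for j in range(width)] for i in range(height)]


def ema_reweight(x, branch_1x1, branch_3x3):
    """Sum the two spatial maps, apply a sigmoid gate, and rescale the input."""
    map_a = spatial_map(branch_1x1, branch_3x3)
    map_b = spatial_map(branch_3x3, branch_1x1)
    gate = [[1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]
    return [[[v * g for v, g in zip(row, gate_row)] for row, gate_row in zip(ch, gate)]
            for ch in x]
```

With all-zero branch outputs, the gate equals 0.5 everywhere, so the input is simply halved; informative branch outputs push the gate toward 0 or 1 at individual pixels.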
The goal of EMA is to minimize computational costs while maintaining information for every channel. This was accomplished by transforming a section of the batch dimension into the channel dimension, avoiding some form of dimensionality reduction through generic convolutions. Additionally, the channel dimension was partitioned into several sub-features, guaranteeing an equitable distribution of spatial semantic data within each feature group. In addition to encoding global information in each parallel branch to adjust the weights of each channel, the output features from the two parallel branches are also combined through cross-dimensional interactions to capture associations between pixels at the pairwise level.
The ELAN-T module borrows the design of the ELAN module [34] and consists of two branches: the first branch applies a 1 × 1 convolution to adjust the channel count, while the second branch uses a 1 × 1 convolution module to change the channel number, followed by two 3 × 3 convolution modules for feature extraction. The final feature extraction result is obtained by aggregating four features (as shown in Figure 6). We integrated EMA into the second branch of the ELAN-T module, before the 3 × 3 convolutions, forming the ELAN-EMA module (as illustrated in Figure 6). The aim is to retain the feature information fed into the 3 × 3 convolutions of the second branch and alleviate feature degradation during feature extraction. Such modifications make the network topology more efficient, avoiding information loss caused by dimension reduction through generic convolutions and enhancing the network’s robustness. This customized modification for underwater litter detection better preserves learned litter features, strengthens cross-channel feature fusion, and reduces computational and parameter overheads.
3.3. BiFormer (Bi-Level Routing Attention)
BiFormer (proposed in [26]) is built on bi-level routing attention, which achieves content-aware sparse patterns in a query-adaptive manner, and it employs this attention as its fundamental building block. In other words, BiFormer is a dynamic, query-aware sparse attention mechanism. The main concept is to filter out most of the irrelevant key–value pairs at a coarse region level, keeping only a small set of routed regions, and then to apply fine-grained token-to-token attention within these routed regions to better capture relationships and context information between tokens.
BiFormer utilizes sparsity to optimize computation and memory usage. It relies exclusively on GPU-friendly dense matrix multiplication while enabling more adaptable, content-aware allocation of computation. Because BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without being distracted by irrelevant ones, it offers good performance and computational efficiency, making it particularly suitable for detecting small and dense targets. We appended BiFormer to the end of the ELAN-T module to form the ELAN-BIF module, as shown in Figure 6. The purpose is to strengthen deep feature extraction and fusion and thereby improve the network’s capability in detecting small and densely packed litter.
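The coarse routing step can be illustrated with a small sketch: each region is summarized by the mean of its token vectors, region-to-region affinities are computed as dot products, and each query region keeps only its top-k key regions. This covers the routing stage only; the subsequent token-to-token attention inside the routed regions is not shown. The function names, the `top_k` parameter, and the toy data are hypothetical.

```python
def region_means(tokens_per_region):
    """Average the token vectors inside each region, giving one vector per region."""
    return [[sum(col) / len(col) for col in zip(*tokens)] for tokens in tokens_per_region]


def route_regions(queries, keys, top_k):
    """For every query region, return the indices of the top_k most related key regions."""
    region_queries = region_means(queries)
    region_keys = region_means(keys)
    routes = []
    for q in region_queries:
        # Region-level affinity: dot product between region descriptors.
        scores = [sum(a * b for a, b in zip(q, k)) for k in region_keys]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routes.append(ranked[:top_k])
    return routes


# Three regions with one 2-D token each (toy values).
queries = [[[1.0, 0.0]], [[0.0, 1.0]], [[3.0, 1.0]]]
keys = [[[2.0, 0.0]], [[0.0, 2.0]], [[1.0, 1.0]]]
print(route_regions(queries, keys, top_k=2))
```

Only the key–value pairs of the selected regions would then be gathered for fine-grained attention, which is where the savings in computation and memory come from.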
3.4. CARAFE (Content-Aware Reassembly of Features)
The CARAFE method, a lightweight general-purpose upsampling operator (proposed in [27]), involves two main stages: a kernel prediction module and a content-aware reassembly module that generates the output. Given an input feature map with dimensions $H \times W \times C$ and an upsampling factor of $\sigma$, and assuming that the predicted size of the upsampling kernel is $k_{up} \times k_{up}$ (larger kernels result in a larger receptive field and an increased computational complexity), the upsampling kernel prediction begins by compressing the channel count of the input feature map to $C_m$ through convolutional operations. Subsequently, the compressed feature map is encoded using a convolutional layer with a kernel size of $k_{encoder} \times k_{encoder}$. The encoded result is then unfolded spatially to obtain a collection of upsampling kernels with dimensions $\sigma H \times \sigma W \times k_{up}^2$. Following this, normalization is applied to the upsampling kernels to ensure that the weights of each kernel sum to 1. The content-aware reassembly module then performs the upsampling. Each position in the output feature map is mapped back to the input feature map, and a $k_{up} \times k_{up}$ region centered at that position is extracted. The output value is obtained by calculating the dot product between the extracted region and the predicted upsampling kernel at this point. (Different channels share the same upsampling kernel at the same position.) Ultimately, this process yields a feature map of size $\sigma H \times \sigma W \times C$. Figure 7 shows the workflow of the universal upsampling operator CARAFE.
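The shape bookkeeping and the reassembly step can be sketched as follows for a single channel. The symbol names (`sigma` for the upsampling factor, `k_up` for the reassembly kernel size, `c_m` for the compressed channel count) are chosen here for illustration. The softmax normalization and the zero padding at the borders are assumptions of this sketch, and the kernel logits are passed in directly instead of being predicted by convolutions.

```python
import math


def carafe_shapes(c, h, w, sigma, k_up, c_m):
    """Tensor shapes at each stage of kernel prediction and reassembly."""
    return {
        "compressed": (c_m, h, w),
        "encoded": (sigma * sigma * k_up * k_up, h, w),
        "kernels": (sigma * h, sigma * w, k_up * k_up),
        "output": (c, sigma * h, sigma * w),
    }


def normalize_kernel(logits):
    """Softmax, so that the weights of one reassembly kernel sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reassemble(feature, kernel_logits, sigma, k_up):
    """feature[h][w] is one channel; kernel_logits[y][x] holds k_up * k_up logits per output pixel."""
    height, width = len(feature), len(feature[0])
    radius = k_up // 2
    output = []
    for y in range(sigma * height):
        row = []
        for x in range(sigma * width):
            cy, cx = y // sigma, x // sigma  # source location of this output pixel
            weights = normalize_kernel(kernel_logits[y][x])
            value = 0.0
            for n in range(k_up):
                for m in range(k_up):
                    sy, sx = cy + n - radius, cx + m - radius
                    if 0 <= sy < height and 0 <= sx < width:  # zero padding outside the map
                        value += weights[n * k_up + m] * feature[sy][sx]
            row.append(value)
        output.append(row)
    return output
```

With `k_up = 1` the operator reduces to nearest-neighbor upsampling, which makes a convenient sanity check; larger kernels let each output pixel blend a content-dependent neighborhood.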
We replaced the upsampling module in the YOLOv7t-CEBC head network with the CARAFE module. The CARAFE module covers a wider range of input information and can more finely utilize the surrounding details during the upsampling process. It can utilize content information from the lower layers to predict recombination kernels and recombine features within predefined neighboring regions. Based on content information, CARAFE can adaptively and optimally use the recombination kernels at different positions. This enables CARAFE to design corresponding feature recombination processes according to the different shapes of litter features, thereby enhancing the shape perception capability of YOLOv7t-CEBC. Compared to mainstream upsampling operations such as interpolation or deconvolution, CARAFE achieves a better performance and can minimize parameter count as much as possible, maintaining a lightweight network model.
3.5. YOLOv7t-CEBC
The detailed structure of the improved YOLOv7t-CEBC, after incorporating the modules designed for underwater litter detection, is illustrated in Figure 8. (Red boxes represent modules designed specifically for underwater litter detection.)
4. Experiment
This section offers a comprehensive account of the configuration of the experiment, comprising the arrangement of the environment, hyperparameters, evaluation criteria, and analysis of the experimental results. The experimental findings demonstrate that the YOLOv7t-CEBC model significantly enhanced the precision of underwater object identification while just slightly increasing the parameter count. This has been confirmed to be effective and superior in underwater detection situations.
4.1. Experimental Environment
The experimental platform used an Intel® Core™ i9-10900X X-series CPU @ 3.7 GHz and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, running a 64-bit Windows 11 operating system. The runtime environment comprised PyTorch, CUDA 11.8, cuDNN 8.2.2, and Python 3.10.
4.2. Experimental Parameter Setting
For every experiment in this study, identical initial training conditions were used. The model was trained for 500 epochs with the Adam optimizer, and input images were resized before training. Hyperparameters such as the learning rate, momentum, and weight decay are listed in Table 1.
4.3. Model Evaluation Metrics
The primary metrics used for object detection are precision, recall, IOU, AP, and mAP. The IOU metric measures how much the ground-truth bounding boxes in the original image and the predicted bounding boxes generated by the algorithm overlap, and it is commonly used to assess the localization quality of a detection model. A higher IOU value indicates a greater overlap between the model’s detection results and the ground truth and thus better detection performance. It is calculated as the ratio of the intersection to the union of the detection result and the ground truth, as follows:

$$\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ denotes the predicted bounding box and $B$ denotes the ground-truth bounding box.
The experiment established the IOU threshold. When the IOU value, which measures the overlap between the detection result and the true value, exceeds the specified threshold, the detection result can be considered as a true positive (TP), indicating accurate target recognition. On the contrary, when the IOU value is less than the threshold, the detection result can be considered to be a false positive (FP), indicating an error in target recognition. The number of undetected targets would be called false negative (FN). This statistic represents the number of ground truths that do not have related detection results.
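The IOU computation and the resulting TP/FP/FN counts can be sketched as follows. The `(x1, y1, x2, y2)` box format, the default threshold of 0.5, and the greedy one-to-one matching (predictions assumed to be sorted by confidence beforehand) are assumptions of this sketch, not details stated in the article.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, threshold=0.5):
    """Return (TP, FP, FN) using greedy matching of predictions to unmatched ground truths."""
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_index, best_iou = -1, 0.0
        for index, truth in enumerate(ground_truths):
            if index in matched:
                continue
            overlap = iou(pred, truth)
            if overlap > best_iou:
                best_index, best_iou = index, overlap
        if best_iou >= threshold:
            matched.add(best_index)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)  # ground truths without any detection
    return tp, fp, fn
```

For example, one exact match plus one stray prediction against two ground-truth boxes yields one TP, one FP, and one FN.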
Precision is the ratio of correctly identified positive instances to all instances that the model identified as positive, expressed as a percentage:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

The recall rate is the ratio of correctly identified positive samples in the test set to all positive samples:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
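Given the TP, FP, and FN counts defined above, both metrics reduce to simple ratios. The sketch below uses hypothetical function names and returns 0.0 when a denominator is zero, which is a convention chosen here and not something specified in the article.

```python
def precision(tp: int, fp: int) -> float:
    """Share of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of ground-truth objects that were detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Illustrative counts only (not results from the article).
print(precision(tp=8, fp=2), recall(tp=8, fn=8))
```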
AP and mAP consider precision and recall jointly and are typically used to assess the effectiveness of models across multiple categories. The precision–recall (PR) curve represents the relationship between the precision and recall of a classifier, with precision plotted on the vertical axis and recall on the horizontal axis. This curve illustrates how well the classifier can detect positive examples while still covering all of them. The average precision (AP) is the area under the PR curve, and a higher AP value indicates better performance:

$$AP = \int_0^1 P(R)\,dR$$

where $P(R)$ denotes precision as a function of recall.
Object detection models typically identify multiple types of targets, and each type yields its own PR curve from which an AP value is calculated. The mAP is obtained by averaging the AP values across all $N$ categories:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
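One common way to compute the area under the PR curve is all-point interpolation, in which precision is first made monotonically non-increasing and the area is then summed over recall steps. The article does not state which interpolation scheme was used, so the choice here is an assumption, and the function names and class labels are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve; `recalls` must be sorted in increasing order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy values only (not results from the article).
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap, mean_average_precision({"class_a": 0.5, "class_b": 1.0}))
```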
4.4. Experimental Results and Analysis
4.4.1. Experimental Results
The proposed YOLOv7t-CEBC model was evaluated on the Deep Plastic dataset, and the experimental results are shown in Figure 9. The results demonstrate that the improved model achieved better detection across all categories than the baseline model. The most noticeable improvement is in the “plastic” category, whose average precision (AP) increased by 7.2%. The model’s overall mean average precision (mAP) was 81.8%.
4.4.2. Comparison of Different Object Detection Models
To further validate the superiority of the improved YOLOv7t-CEBC model, we compared it with well-known object detection algorithms such as SSD, Faster-RCNN, RetinaNet, YOLOv3, YOLOv4, YOLOv5s, YOLOXs, YOLOv8n, and the original YOLOv7-tiny. The Deep Plastic dataset served as the experimental dataset, and the evaluation included precision, recall, mAP@0.5, model computational complexity (GFLOPs), parameter count, and frame rate (FPS) (refer to Table 2 for details). From Table 2, it is evident that our proposed YOLOv7t-CEBC outperforms all the other compared object detection algorithms in terms of precision and recall. Regarding the detection metric mAP@0.5, our improved algorithm showed a 3.8% increase compared to the original YOLOv7-tiny, attaining a clear advantage over the other detection algorithms. In terms of computational complexity (GFLOPs), YOLOv7t-CEBC requires more computation than YOLOv8n, YOLOv7-tiny, YOLOXs, and YOLOv5s. The parameter count of YOLOv7t-CEBC is also higher than that of YOLOv7-tiny and YOLOv8n. The increase is attributed to the integration of modules specifically designed for underwater litter detection, which enlarges the overall size of YOLOv7t-CEBC. However, the improvement in mAP and the strong performance on the other evaluation metrics compensate for these drawbacks. The algorithm achieved a detection speed of 118 FPS, slightly lower than the original YOLOv7-tiny but significantly higher than current mainstream detection algorithms, ensuring good real-time performance. The experimental results demonstrate the strong performance of YOLOv7t-CEBC in underwater litter detection, making it more suitable for marine litter detection than other mainstream object detectors.
4.4.3. Ablation Experiment on the Deep Plastic Dataset
This study used ablation experiments to assess the contribution of each improvement to model performance. The ablation experimental results are shown in Table 3.
Table 3 shows that adding the CNeB module to the backbone network increased the parameter count by 1.46M compared to the baseline model and yielded a 2.1% improvement in mAP. Adding the EMA and BiFormer attention mechanisms on this basis resulted in a 0.7% improvement in mean average precision (mAP); relative to the baseline model, the parameter increase at this stage was 0.25M. Lastly, replacing the upsampling module of the improved model with the CARAFE upsampling operator enhanced precision further. In comparison with the baseline model, the final model achieved a 3.8% rise in mAP with an additional 0.89M parameters. If only the CNeB module and the CARAFE upsampling operator were added to the baseline model, the mAP improved by only 0.6% compared to the baseline model, while the parameter count increased by 35%. This comparison with the final improved model shows that the EMA and BiFormer attention mechanisms not only reduced the parameter count and network complexity but also contributed to higher precision. Over the entire ablation experiment, detection speed decreased slightly, reaching a final speed of 118 FPS, which is sufficient to meet the model’s rapid detection requirements.
Furthermore, the values of the encoder kernel size $k_{encoder}$ and the reassembly kernel size $k_{up}$ in CARAFE also affected the final results. Comparative experiments showed (see Table 4) that increasing $k_{up}$ requires a larger $k_{encoder}$ because the content encoder needs a larger receptive field to predict a larger recombination kernel. Simultaneously increasing both $k_{encoder}$ and $k_{up}$ could improve detection precision, whereas increasing only one of them did not. We summarize this relationship as:

$$k_{encoder} = k_{up} - 2$$

The experimental results indicated that three combinations of ($k_{encoder}$, $k_{up}$), namely (1, 3), (3, 5), and (5, 7), were all good choices. The larger the kernel size used, the better the results, and increasing the kernel size did not lead to a significant increase in the number of parameters, nor did it substantially affect detection speed. Therefore, the CARAFE upsampling operator with a kernel combination of (5, 7) was chosen as the model improvement module.
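The pairing rule between the encoder kernel size and the reassembly kernel size can be captured in a small helper. The second function estimates the weight count of the kernel-prediction convolution alone, assuming it maps `c_m` compressed channels to `sigma² · k_up²` output channels; biases and the channel compressor are excluded. The names `encoder_kernel_for`, `kernel_predictor_weights`, `c_m`, and `sigma`, as well as the example values, are assumptions for illustration and are not taken from the article.

```python
def encoder_kernel_for(k_up: int) -> int:
    """Encoder kernel size paired with a reassembly kernel of size k_up (k_up - 2)."""
    if k_up < 3 or k_up % 2 == 0:
        raise ValueError("k_up is assumed to be an odd integer >= 3")
    return k_up - 2


def kernel_predictor_weights(c_m: int, sigma: int, k_up: int) -> int:
    """Weights of the k_encoder x k_encoder conv mapping c_m channels to sigma^2 * k_up^2 channels."""
    k_encoder = encoder_kernel_for(k_up)
    return k_encoder * k_encoder * c_m * sigma * sigma * k_up * k_up


# The three kernel combinations discussed in the text.
pairs = [(encoder_kernel_for(k), k) for k in (3, 5, 7)]
print(pairs)
```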
4.5. Model Performance Discussion
Deep learning networks are often regarded as black boxes, which makes them difficult to interpret. Understanding the model’s recognition process is essential for examining its internal operations, structure, training data, feature extraction, and prediction processes [35]. In our experiments, we used Grad-CAM [36], a technique for visualizing the attention of deep learning models. This method uses gradient information propagated back through the model to generate a class activation map (CAM) that highlights the input image regions most crucial for the model’s predictions. The Grad-CAM heatmaps are illustrated in Figure 10. The results demonstrate that the enhanced model has a stronger capability in recognizing and extracting underwater litter features than the original model. Moreover, it is less susceptible to the influence of complex underwater environmental factors. The improved model is therefore better equipped to handle the inter-class similarity and intra-class variability of underwater litter while effectively filtering out irrelevant information.
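The core Grad-CAM computation can be sketched without a deep learning framework, given the activations of a chosen convolutional layer and the gradients of the class score with respect to them (both indexed as `[k][i][j]`). Obtaining these two arrays from a real model requires framework hooks that are not shown; the function name and the final normalization to [0, 1] are choices made for this sketch.

```python
def grad_cam(activations, gradients):
    """Return a class activation map from per-channel activations and gradients."""
    # Channel importance: spatial mean of the gradient for each feature map.
    weights = [sum(sum(row) for row in g) / (len(g) * len(g[0])) for g in gradients]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum of feature maps, followed by ReLU to keep positive evidence only.
    cam = [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)] for i in range(height)]
    # Scale to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Two 1 x 2 feature maps with opposite gradient signs (toy values).
activations = [[[1.0, 2.0]], [[3.0, 0.0]]]
gradients = [[[1.0, 1.0]], [[-1.0, -1.0]]]
print(grad_cam(activations, gradients))
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image to produce the heatmap.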
To better evaluate the detection capability of the enhanced algorithm, we ran detection on the validation set of the dataset. Experimental findings show that, in comparison with the original YOLOv7-tiny algorithm, the presented YOLOv7t-CEBC performed better in harsh underwater scenarios. As shown in Figure 11, a comparison of the detection outcomes of YOLOv7-tiny and YOLOv7t-CEBC on the Deep Plastic dataset demonstrated that the proposed YOLOv7t-CEBC achieved significantly better detection performance. It not only detected more targets accurately but also achieved higher precision.
However, the model still has limitations. For example, YOLOv7t-CEBC still produces false detections and missed detections in complex underwater environments, as shown in Figure 12, and further research is required. Furthermore, owing to the time spent underwater and to lighting conditions, the shape and color of underwater litter change significantly. Despite the integration of modules designed for underwater litter detection, the impact of inter-class similarity and intra-class variability on detection precision could not be fully resolved, and missed detections and false alarms persisted. It is therefore necessary to collect images of marine litter with different immersion times and different forms to enrich our dataset and enhance the model’s transfer and detection capacity. Because our dataset currently focuses on shallow-water litter detection, the influence of factors such as light and noise on the target images is relatively weak and has little impact on the detection outcomes. Consequently, the method proposed in this article does not involve image preprocessing. Furthermore, mainstream image-enhancement methods are primarily run offline and have not been deployed in real-time detection models. There is therefore an urgent need for advanced image-enhancement algorithms that can be deployed with detection models for real-time image processing, to meet the demands of deepwater object detection in intricate circumstances. The “tiny” network architecture is a targeted solution designed for resource-constrained environments, with the aim of achieving efficient detection with limited resources. When underwater robots perform tasks, they typically rely on portable computing devices whose computational resources are severely constrained. Such compact network architectures are therefore needed to deliver the required functionality and enhance the overall efficiency of the robots.
Although we utilized a “tiny” network structure as the foundational architecture improvement, there remains room for further lightweight optimization. These improvement strategies will be the focal point of our next phase of work.
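To make the two error types discussed in this section concrete, the following is a minimal sketch of how false alarms and missed detections could be counted for a single image and a single class. It is not the evaluation code used in this study: the function names, the greedy highest-score-first matching rule, and the IoU threshold of 0.5 are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(
    preds: List[Tuple[Box, float]],  # (box, confidence score)
    gts: List[Box],                  # ground-truth boxes
    iou_thr: float = 0.5,            # assumed matching threshold
) -> Tuple[int, int, int]:
    """Return (true positives, false alarms, missed detections)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    # Greedy matching: the most confident predictions choose first.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    false_alarms = len(preds) - tp  # predictions with no matching object
    missed = len(gts) - tp          # objects with no matching prediction
    return tp, false_alarms, missed


# Two litter objects; one is detected, one is missed, and one
# prediction falls on empty background.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(count_errors(preds, gts))  # (1, 1, 1)
```

In this toy case, precision and recall are both 1/2. A full evaluation would repeat the matching per class and per image and sweep the confidence threshold to obtain the average precision.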
5. Conclusions
To tackle the complex problem of detecting litter in underwater environments, this study introduced a series of components specifically designed for underwater litter detection and proposed the underwater litter detection algorithm YOLOv7t-CEBC. Firstly, the CNeB module was incorporated into the backbone network to enhance the learning capability for underwater litter features, expedite network convergence, and elevate detection precision. Secondly, the EMA attention mechanism and the BiFormer attention mechanism were introduced to improve the network’s ability to capture the global characteristics of targets and to detect densely clustered litter, while effectively reducing the network’s complexity. Finally, the conventional upsampling module was replaced with CARAFE to enlarge the receptive field, make better use of surrounding information, and enhance the deep network’s ability to extract underwater litter features. Experiments were conducted on the underwater litter dataset Deep Plastic, and YOLOv7t-CEBC was compared with state-of-the-art object detection algorithms. The results demonstrated that the YOLOv7t-CEBC model attained a mean average precision (mAP) of 81.8% for detecting litter in complex underwater settings at a detection speed of 118 FPS, outperforming the compared state-of-the-art object detection models and meeting real-time requirements.
Currently, the method proposed in this paper largely fulfills the requirements for shallow-water litter detection by underwater robots and can guide the mechanical arm in collecting underwater litter. In the future, to improve the model’s generalization performance and detection capacity, we will concentrate on gathering a large and varied dataset of underwater litter. Advanced image processing techniques will be employed to restore underwater optical images, thereby improving detection precision and extending the robot’s working range to deeper waters. Additionally, efforts will be made to make the model more lightweight, alleviating the computational burden on underwater robots and helping them carry out litter cleaning tasks more efficiently. In summary, this research is of considerable importance for underwater litter detection and subsequent cleanup efforts. Through the continued refinement of the model, it is expected to contribute to the protection of the marine environment.