Article

EMR-YOLO: A Multi-Scale Benthic Organism Detection Algorithm for Degraded Underwater Visual Features and Computationally Constrained Environments

1 College of Marine Engineering, Dalian Maritime University, Dalian 116026, China
2 School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
3 China Academy of Transportation Sciences, Beijing 100029, China
4 State Key Laboratory of Maritime Technology and Safety, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Mar. Sci. Eng. 2025, 13(9), 1617; https://doi.org/10.3390/jmse13091617
Submission received: 8 July 2025 / Revised: 13 August 2025 / Accepted: 22 August 2025 / Published: 24 August 2025

Abstract

Marine benthic organism detection (BOD) is essential for underwater robotics and seabed resource management but suffers from motion blur, perspective distortion, and background clutter in dynamic underwater environments. To address visual feature degradation and computational constraints, this paper introduces EMR-YOLO, a deep learning-based multi-scale BOD method. To handle the diverse sizes and morphologies of benthic organisms, we propose an Efficient Dynamic Sparse Head (EDSHead), which combines a unified attention mechanism and dynamic sparse operators to enhance spatial modeling. For robust feature extraction under resource limitations, we design a lightweight Multi-Branch Fusion Downsampling (MBFDown) module that utilizes cross-stage feature fusion and a multi-branch architecture to capture rich gradient information. Additionally, a Regional Two-Level Routing Attention (RTRA) mechanism is developed to mitigate background noise and sharpen focus on target regions. The experimental results demonstrate that EMR-YOLO achieves improvements of 2.33%, 1.50%, and 4.12% in AP, AP50, and AP75, respectively, outperforming state-of-the-art methods while maintaining efficiency.

1. Introduction

The detection, recognition, and interpretation of targets and environmental information in underwater scenes using vision, acoustics, and other sensing modalities are crucial for the perception and decision-making processes of underwater robots [1]. Detection, as a core task, is extensively applied in pipeline inspection, seabed exploration, and scientific investigation, and it exhibits significant potential for benthic organism detection (BOD) [2]. Although generalized detection methods, e.g., Faster R-CNN, the YOLO series, and DETR, have made substantial progress in natural image detection and have shown great potential in BOD, they still encounter numerous challenges:
Benthic organisms (BOs) exhibit significant variation in size and often coexist across multiple scales. The high proportion of small targets [3], coupled with changes in viewing angle, occlusion, and blurring, can cause geometric deformation of underwater targets, thereby increasing the difficulty of detection. Given the reliance on embedded hardware with limited computational performance and storage space, underwater robots must strike a balance between computational overhead and detection accuracy to achieve real-time detection with high precision. The selective absorption and scattering of light by water cause image degradation, characterized by color deviation, low contrast, and blurred details [4]. Furthermore, some BOs are attached to habitats with similar morphological structures, which exacerbates the difficulty of distinguishing targets from the background and further compromises feature expression.
The detection accuracy of BOD is highly dependent on the extraction of effective features. Traditional machine learning-based methods, such as Haar [5], the Histogram of Oriented Gradients (HOG) [6], and Oriented FAST and Rotated BRIEF (ORB) [7], rely on hand-crafted a priori features that typically cover only low-level information such as texture, edges, and color. These methods are often designed for specific scenes and have limited generalization ability, making it difficult to extract effective features in complex underwater environments. Additionally, the sliding window strategy employed by these methods has low computational efficiency and a cumbersome processing flow, which makes it challenging to meet the real-time demands of resource-constrained scenarios.
Convolutional Neural Networks (CNNs) are core tools in deep learning for target detection tasks. Based on the detection process and the use of prior anchor boxes, existing deep learning-based target detection algorithms can be categorized into three types: multi-stage methods, single-stage methods using anchor boxes, and single-stage methods without anchor boxes [8]. Multi-stage methods, such as R-CNN, Fast R-CNN, and Mask R-CNN, use a Region Proposal Network (RPN) to generate candidate regions, followed by feature extraction and classification therein. These methods achieve high accuracy but suffer from slow inference and a complex structure, making them unsuitable for real-time detection. Additionally, candidate regions based on low-resolution feature maps are often susceptible to the misdetection and missed detection of small targets [9]. Single-stage methods using anchor boxes, such as YOLO and SSD, transform detection into an end-to-end regression problem, significantly improving detection speed. However, the lack of a discriminative feature construction process might result in lower accuracy, especially in scenarios involving overlapping or occluded objects. Single-stage methods without anchor boxes detect keypoints or centroids to locate the target, reducing the dependence on anchor box settings. However, the absence of prior anchor boxes and information guidance leads to poor detection accuracy, particularly for targets with blurred details and severe deformation [10]. Although several works have successfully applied deep learning to BOD [11,12,13,14], these approaches have yet to achieve a balance between detection accuracy and model complexity, and most are complicated in design and difficult to deploy on underwater devices.
To address the aforementioned limitations, this paper presents a multi-scale BOD method, i.e., EMR-YOLO, designed for underwater visual feature degradation and resource-constrained environments. The proposed method focuses on the synergistic optimization of three key components: the backbone network, detection head, and attention mechanism. The main contributions are summarized as follows:
  • To improve the efficiency of information flow and gradient propagation, a lightweight downsampling convolution module (MBFDown) is designed, which employs a multi-branch structure and a cross-stage feature fusion strategy. It can effectively reduce the number of parameters and computational complexity while maintaining robust feature extraction capabilities.
  • To suppress the feature representation of complex underwater backgrounds, we propose the Regional Two-Level Routing Attention (RTRA) mechanism. By filtering irrelevant regions and mitigating background noise, this mechanism enables the detector to focus on target regions, thereby reducing missed detections and enhancing semantic understanding and generalization in complex environments.
  • To improve the adaptive perception capability of the anchor-free detection head, we design an Efficient Dynamic Sparse Head (EDSHead), which integrates a unified attention mechanism with a dynamic sparse operator, thereby improving multi-scale fusion and the spatial modeling of BOs with variable sizes and morphologies.

2. Related Work

2.1. Learning with Multi-Scale and Morphological Diversity

To boost multi-scale perception, most target detection algorithms introduce the Feature Pyramid Network (FPN) [15] during the feature extraction stage to fuse high-level semantics and low-level spatial details. The Feature Pyramid Transformer (FPT) [16] extends the FPN structure by incorporating three types of transformer modules (Self, Grounding, and Rendering), achieving the non-local enhancement of multi-scale features and semantic detail complementarity and improving the detection accuracy of small targets. The Recursive Feature Pyramid (RFP) [17] enhances multi-scale perception by feeding the output features of the FPN back into the backbone network at various stages. To tackle variable target morphology, on the other hand, existing methods mainly enhance spatial modeling capability by improving the convolution operator. Dai et al. [18] proposes deformable convolutions (DCNs), which adjust sampling positions through learnable offsets to adaptively match target geometric variations. Zhu et al. [19] introduces a modulation mechanism to further optimize the DCN's ability to focus on the target, and Yu et al. [20] uses dilated convolution to expand the receptive field and enhance contextual perception.
Despite the achievements of these methods, there are still limitations. On the one hand, traditional pyramid structures are prone to semantic drift and feature redundancy during cross-scale information transfer, making it difficult to fully model the complex relationships between scales. On the other hand, existing deformable or dilated convolution operators still have limited robustness when facing drastic changes in target morphology.

2.2. Learning with Computational Constraints

Currently, model compression and lightweight network architecture design are the two primary approaches to reducing computational overhead. In model compression, knowledge distillation [21] reduces model size and computation without significant performance loss by migrating knowledge from large teacher models to lightweight student models, while network pruning [22] makes the model structure sparser and more efficient by reducing parameters and connections in the neural network. Lightweight network architecture design aims to reduce complexity intrinsically. For instance, SqueezeNet [23] effectively compresses the number of parameters by stacking Fire Modules and interspersing them with max-pooling operations; MobileNet [24] adopts depthwise separable convolution to significantly reduce the computational overhead; and ShuffleNet [25] reduces computational complexity and performs inter-channel information interaction through channel shuffling operations, achieving efficient feature representation and computational acceleration. These methods provide effective support for deploying models in resource-constrained scenarios.
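To make the savings concrete, the following minimal sketch (not taken from any of the cited works) compares the parameter count of a standard 3 × 3 convolution with a MobileNet-style depthwise separable replacement for an assumed 64-to-128-channel layer.

```python
import torch.nn as nn

# Standard 3x3 convolution: 64 * 128 * 3 * 3 = 73,728 weights
std_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable replacement: 64 * 3 * 3 + 64 * 128 = 576 + 8,192 = 8,768 weights
dws_conv = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1, bias=False))                       # pointwise

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std_conv), count(dws_conv))  # 73728 vs. 8768, roughly an 8.4x reduction
```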
Existing methods have several limitations: Model compression methods rely on pretrained complex models or fine pruning strategies, which may bring about accuracy loss during the compression process. Moreover, existing lightweight architectures focus on simplifying the overall network structure, but are insufficiently optimized for specific functional modules. This can lead to information loss or under-expression in the early stages of information flow, thereby limiting the further enhancement of feature extraction capabilities.

2.3. Learning with Visual Feature Degradation

Previous work has attempted to deal with degraded features by recovering degraded information and enhancing feature representation through image enhancement algorithms. Based on a CNN, Li et al. [26] introduces an enhancement network grounded in prior knowledge of underwater degraded scenes, which is capable of simulating multiple types of degradation and recovering images and videos; Wu et al. [27] proposes a two-stage underwater image enhancement network, UWCNN-SD, which first enhances the high-frequency and low-frequency information of the original image, respectively, and then optimizes the color and detail expression through a refinement module. Generative Adversarial Networks (GANs) have also demonstrated significant potential in underwater image enhancement. Li et al. [28] employs weakly supervised color transfer networks to learn cross-domain mapping relationships, effectively correcting color distortion and reducing the reliance on paired training samples. Guo et al. [29] combines multi-scale residual modules with spectral normalization to develop a multi-scale dense GAN, which corrects the color bias and enhances image details, achieving robust performance across diverse degradation scenarios. Additionally, some studies have focused on enhancing the extraction of underwater features to improve detection accuracy. Liu et al. [30] embeds selective kernel convolution into ResNet101 and optimizes the anchor box design based on BOs' features to enhance the ability to perceive fuzzy targets. Xu et al. [31] adopts the VOVDarkNet backbone structure, utilizing cross-layer aggregation and multi-branch strategies to fully leverage intermediate layer features, thereby improving the representation of fuzzy target texture and shape.
Most existing methods rely on training for specific degradation types and lack effective modeling of target diversity features in dynamic underwater environments. Particularly for low-contrast targets in complex backgrounds, these methods often fail to provide sufficient robustness, leading to misdetection or missed detections.

3. EMR-YOLO Scheme

While YOLOv8 has been widely adopted in industry and demonstrates numerous mature application cases, its role as a general-purpose detector leaves certain limitations when applied to BOD [32]. Specifically, its standard downsampling blocks introduce redundant computation and inflate model complexity; its feature representation remains vulnerable to interference from dynamic underwater backgrounds; and its fixed-scale detection head struggles to accommodate large variations in the scale and morphology of BOs. To systematically optimize for these limitations, we propose three targeted modules for each core issue. MBFDown reduces computational resource demands, RTRA suppresses background noise and mitigates visual feature degradation, and EDSHead enhances adaptability to diverse target scales and morphologies. Based on the YOLOv8 framework, we integrate these three modules to develop the EMR-YOLO architecture, as illustrated in Figure 1.

3.1. Multi-Branch Fusion Downsampling

Underwater robots rely on embedded hardware, with limited computational capacity and storage space, to perform the real-time detection of BOs. To reduce the computational burden while preserving detection performance, we propose Multi-Branch Fusion Downsampling (MBFDown), a lightweight module that maintains robust feature extraction capabilities. MBFDown is deployed as the downsampling module for each stage in the backbone and neck networks, serving as a replacement for the standard convolution modules in YOLOv8 to enhance efficiency. As shown in Figure 2, when the MBFDown module processes the input feature map, it first downsamples the features through a 2 × 2 average pooling layer to smooth the features and reduce information loss.
Subsequently, the feature map is divided into three parts along the channel dimension: the first part extracts local features through a 3 × 3 convolution; the second part retains the original features; and the third part enhances the key features through 3 × 3 max-pooling and then adjusts the number of channels with a 1 × 1 convolution, which significantly reduces the computational load. Finally, the three parts are concatenated along the channel dimension, resulting in an output with 2C channels and a feature map size that is half of the input image. The calculation formula is shown in Equation (1):
$\mathrm{Output} = \mathrm{Concat}\big(\mathrm{Conv}(X_1),\ X_2,\ \mathrm{Conv}(\mathrm{MaxPool}(X_3))\big) \quad (1)$
where $X_1$, $X_2$, and $X_3$ represent the feature maps of the three input branches, respectively.
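For illustration, the following is a minimal PyTorch sketch of an MBFDown-style block under stated assumptions: the input channel count is divisible by three, and the per-branch output widths (which the text specifies only through the 2C total) are chosen here as C, C/3, and 2C/3; normalization and activation layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class MBFDownSketch(nn.Module):
    """Illustrative sketch of Multi-Branch Fusion Downsampling (channel split is an assumption)."""
    def __init__(self, c_in: int):
        super().__init__()
        assert c_in % 3 == 0, "sketch assumes the channel count is divisible by 3"
        c = c_in // 3
        self.avg = nn.AvgPool2d(2, stride=2)              # 2x2 average pooling halves H and W
        self.conv3 = nn.Conv2d(c, c_in, 3, padding=1)     # branch 1: 3x3 conv for local features
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)  # branch 3: 3x3 max-pooling
        self.conv1 = nn.Conv2d(c, 2 * c, 1)               # branch 3: 1x1 conv adjusts channels

    def forward(self, x):
        x = self.avg(x)                                   # smooth and downsample
        x1, x2, x3 = torch.chunk(x, 3, dim=1)             # split along the channel dimension
        # concat: C + C/3 + 2C/3 = 2C output channels at half the input resolution
        return torch.cat([self.conv3(x1), x2, self.conv1(self.pool(x3))], dim=1)

# Example: a (1, 96, 80, 80) input becomes (1, 192, 40, 40)
y = MBFDownSketch(96)(torch.randn(1, 96, 80, 80))
```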

3.2. Regional Two-Level Routing Attention

RTRA is integrated into the baseline model to enhance the representation of BO features in underwater feature degradation environments. It is a novel lightweight attention mechanism that achieves the efficient allocation of computational resources in a dynamic, query-aware manner. Specifically, RTRA employs a two-stage attention mechanism that operates at both coarse-grained and fine-grained levels. Initially, it filters out most irrelevant regions at the region level. Subsequently, it applies fine-grained token-to-token attention within the union of the filtered regions, combined with depthwise separable convolution to achieve an efficient and lightweight attention mechanism. Through hierarchical filtering and focusing, RTRA effectively filters out irrelevant regions, mitigates the interference of background noise, dense distribution, and motion blur on feature extraction, reduces computational cost, and enhances the model’s detection capability. RTRA is realized as follows, and the structure is shown in Figure 3.
For a given 2D input feature map $X \in \mathbb{R}^{H \times W \times C}$, we first partition it into $S \times S$ non-overlapping regions and reshape $X$ into $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, so that each region contains $\frac{HW}{S^2}$ feature vectors. Subsequently, we apply linear projections to $X^r$ to obtain the Query, Key, and Value tensors, denoted as $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$:
$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \quad (2)$
where $W^q$, $W^k$, and $W^v \in \mathbb{R}^{C \times C}$ denote the projection weights corresponding to the Query, Key, and Value, respectively.
First Stage: Coarse-Grained Attention. Firstly, depthwise separable convolution is applied to the feature vectors of each region to extract features, yielding the region-level query and key $Q^r, K^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$. By capturing spatial information on a per-channel basis and subsequently fusing channel information point-wise, the most representative feature vectors of the regions are formed, effectively reducing the computational load and the number of parameters.
$Q^r = \mathrm{DWConv}(Q), \quad K^r = \mathrm{DWConv}(K) \quad (3)$
We then apply matrix multiplication to $Q^r$ and the transpose of $K^r$ to compute the semantic affinity matrix $A^r \in \mathbb{R}^{S^2 \times S^2}$, which indicates the degree of semantic relatedness between regions. Next, we use the row-by-row top-k operator to select the k most relevant regions for each region from $A^r$, resulting in the region index matrix $I^r \in \mathbb{N}^{S^2 \times k}$. The i-th row of $I^r$ contains the indices of the k most relevant regions for the i-th region, i.e., $I^r_{(i,1)}, I^r_{(i,2)}, \ldots, I^r_{(i,k)}$.
Second Stage: Fine-Grained Attention. For each Query in region i, all Key–Value pairs are computed over the selected k routing regions. Since these regions are distributed throughout the feature map, and modern GPUs achieve efficient computation by loading blocks of consecutive bytes, we first aggregate the required key and value tensors:
$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r) \quad (4)$
where $K^g, V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ denote the aggregated key and value tensors. The attention mechanism is then applied to $Q$, $K^g$, and $V^g$:
$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V) \quad (5)$
Local Context Enhancement (LCE) [33] is introduced here to compensate for feature information that may be lost during region filtering. LCE is parameterized using depthwise separable convolution to capture finer-grained local features.
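The two-stage routing described above can be summarized in a compact, purely illustrative PyTorch sketch. It follows the region partitioning, DWConv-based region scoring, top-k routing, gather step, and LCE term outlined above, but the 5 × 5 LCE kernel, the per-region average pooling after the depthwise convolution, and the absence of an output projection are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RTRASketch(nn.Module):
    """Illustrative two-stage (region-to-region, then token-to-token) routing attention."""
    def __init__(self, dim: int, s: int = 7, k: int = 4):
        super().__init__()
        self.s, self.k = s, k
        self.qkv = nn.Linear(dim, dim * 3)
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depthwise conv for region-level Q/K
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # local context enhancement (assumed 5x5 DWConv)

    def forward(self, x):                                         # x: (B, H, W, C); H, W divisible by s
        B, H, W, C = x.shape
        s, k = self.s, self.k
        q, kk, v = self.qkv(x).chunk(3, dim=-1)

        def region_desc(t):                                       # region-level descriptors (B, s*s, C)
            t = self.dw(t.permute(0, 3, 1, 2))
            return F.adaptive_avg_pool2d(t, (s, s)).flatten(2).transpose(1, 2)

        affinity = region_desc(q) @ region_desc(kk).transpose(1, 2)   # (B, s*s, s*s) semantic affinity
        idx = affinity.topk(k, dim=-1).indices                        # route each region to k regions

        def to_regions(t):                                        # (B, H, W, C) -> (B, s*s, HW/s^2, C)
            t = t.view(B, s, H // s, s, W // s, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(B, s * s, -1, C)

        qg, kg, vg = to_regions(q), to_regions(kk), to_regions(v)
        n = kg.shape[2]
        gidx = idx.view(B, s * s, k, 1, 1).expand(-1, -1, -1, n, C)
        kg = torch.gather(kg.unsqueeze(1).expand(-1, s * s, -1, -1, -1), 2, gidx).reshape(B, s * s, k * n, C)
        vg = torch.gather(vg.unsqueeze(1).expand(-1, s * s, -1, -1, -1), 2, gidx).reshape(B, s * s, k * n, C)

        attn = (qg @ kg.transpose(-2, -1)) * C ** -0.5            # fine-grained token-to-token attention
        out = attn.softmax(dim=-1) @ vg                           # (B, s*s, HW/s^2, C)
        out = out.view(B, s, s, H // s, W // s, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out + self.lce(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)   # add LCE(V)

# Example: a (1, 28, 28, 64) feature map with S = 7 regions per side and k = 4 routed regions
y = RTRASketch(64, s=7, k=4)(torch.randn(1, 28, 28, 64))
```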

3.3. Efficient Dynamic Sparse Head

Affected by factors such as image scale variation, view angle change, and blurring in the underwater environment, the static sampling method of traditional convolution struggles to adapt to the complex changes in target morphology and scale [18]. Although existing detection heads are designed for large, medium, and small targets with corresponding scales, the baseline still lacks sufficient multi-scale modeling capability due to the fixed and manually set scales. Meanwhile, limited computational resources restrict model complexity, resulting in the poor generalization of the existing detection heads in underwater scenes.
EDSHead is proposed to address the aforementioned challenges. By incorporating the efficient sparse attention operator Deformable Convolution v4 (DCNv4) and deploying self-attention mechanisms across the level, spatial, and channel dimensions, the detection head is enabled to adaptively focus on key target regions. This enhances both the multi-scale dynamic detection capability and the feature fusion capability. The structure of the EDSHead Block, the core module of EDSHead, is depicted in Figure 4.
For the output feature maps P3, P4, and P5 of the feature pyramid, we can align the feature maps of different scales to the intermediate scale through upsampling and downsampling. The feature pyramid can then be represented as a three-dimensional tensor $F \in \mathbb{R}^{L \times S \times C}$. By introducing attention mechanisms to each of the three dimensions of this tensor, we obtain
$W(F) = \Pi_C\big(\Pi_S\big(\Pi_L(F) \cdot F\big) \cdot F\big) \cdot F \quad (6)$
where $\Pi_L$, $\Pi_S$, and $\Pi_C$ correspond to the scale-aware, spatial-aware, and task-aware attention functions, which act on dimensions L, S, and C, respectively, with expressions shown in (7)–(9).
$\Pi_L(F) \cdot F = \sigma\Big(f\Big(\frac{1}{SC}\sum_{S,C} F\Big)\Big) \cdot F \quad (7)$
where $f(\cdot)$ is a linear function approximated with a 1 × 1 convolutional layer, and $\sigma(x) = \max\big(0, \min\big(1, \frac{x+1}{2}\big)\big)$ is a hard-sigmoid function. $\frac{1}{SC}\sum_{S,C} F$ represents the average pooling operation performed across both the spatial and channel dimensions for each feature level.
DCNv4 was introduced in $\Pi_S$ to enhance the spatial feature modeling capability. This operator enhances the standard convolution by adding learnable offsets to its sampling points, allowing it to adapt to the geometric variations in objects; a more detailed discussion of its principles is provided in Section 5.3. $\Pi_S$ can be expressed as
$\Pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{g=1}^{G}\sum_{k=1}^{K} w_g\, m_{gk}\, x_g(p + p_k + \Delta p_{gk}) \quad (8)$
where $G$ denotes the number of groups and $C' = C/G$ the number of channels per group; $w_g \in \mathbb{R}^{C \times C'}$ denotes the reduced-dimensional weight matrix; $m_{gk} \in \mathbb{R}$ denotes the modulation scalar of the k-th sampling point in the g-th group of the convolution kernel; and $x_g \in \mathbb{R}^{C' \times H \times W}$ denotes the input feature map of the corresponding slice. The definitions of $p$, $p_k$, and $\Delta p_{gk}$ are provided in Section 5.3.
$\Pi_C(F) \cdot F = \max\big(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\big) \quad (9)$
where $F_c$ denotes the feature map of the c-th channel. The $\max(\cdot, \cdot)$ function used here represents DyReLU [34], a piecewise linear activation function composed of two segments. The shape of this function is dynamically controlled by the parameters $\alpha^1(F), \beta^1(F)$ and $\alpha^2(F), \beta^2(F)$, which define the slopes and intercepts of the two linear components, respectively. $[\alpha^1(F), \alpha^2(F), \beta^1(F), \beta^2(F)]^T = \theta(\cdot)$ is the hyperfunction learned to control the activation thresholds. $\theta(\cdot)$ is computed in a similar way as in DyReLU, where global average pooling is first applied to downscale over the $L \times S$ dimensions, followed by processing through two fully connected layers. Since the three attention mechanisms act sequentially, Equation (6) can be nested multiple times. By stacking the EDSHead Block, the depth of the network can be increased.
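To make the composition in Equation (6) concrete, the sketch below implements the scale-aware gate of Equation (7) and a DyReLU-style task-aware gate in the spirit of Equation (9) for an aligned feature tensor $F \in \mathbb{R}^{L \times S \times C}$; the spatial-aware term $\Pi_S$ relies on DCNv4 and is omitted here (see Section 5.3). The layer sizes, slope/intercept normalization, and reduction ratio are illustrative assumptions rather than the exact configuration used in EDSHead.

```python
import torch
import torch.nn as nn

class ScaleAwareGate(nn.Module):
    """Sketch of Pi_L in Equation (7): average over S and C per level,
    a learned linear map f (1x1 conv equivalent), then a hard sigmoid."""
    def __init__(self):
        super().__init__()
        self.f = nn.Conv1d(1, 1, kernel_size=1)             # linear function f

    def forward(self, feat):                                # feat: (B, L, S, C)
        pooled = feat.mean(dim=(2, 3))                      # (B, L): average pooling over S and C
        logits = self.f(pooled.unsqueeze(1)).squeeze(1)     # (B, L)
        gate = torch.clamp((logits + 1.0) / 2.0, 0.0, 1.0)  # hard sigmoid sigma(x)
        return feat * gate[:, :, None, None]                # re-weight each pyramid level

class TaskAwareGate(nn.Module):
    """Sketch of Pi_C in Equation (9): DyReLU-style activation whose slopes and
    intercepts come from a hyperfunction theta(.) over globally pooled features."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.theta = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 4 * channels), nn.Tanh())

    def forward(self, feat):                                # feat: (B, L, S, C)
        B, _, _, C = feat.shape
        a1, a2, b1, b2 = self.theta(feat.mean(dim=(1, 2))).view(B, 4, C).unbind(1)
        a1 = 1.0 + 0.5 * a1                                 # keep slope 1 near the identity initialization
        a2, b1, b2 = 0.5 * a2, 0.5 * b1, 0.5 * b2
        bc = lambda t: t[:, None, None, :]                  # broadcast (B, C) over L and S
        return torch.maximum(bc(a1) * feat + bc(b1), bc(a2) * feat + bc(b2))

# Usage: feat is a (B, L, S, C) tensor of aligned pyramid levels (L levels, S spatial positions)
feat = torch.randn(2, 3, 400, 256)
out = TaskAwareGate(256)(ScaleAwareGate()(feat))            # Pi_C(Pi_L(F) * F) * F, spatial term omitted
```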

4. Experimental Studies and Comparisons

4.1. Dataset and Configurations

Dataset: The Detecting Underwater Objects (DUO) dataset [35] is selected to evaluate the effectiveness of the proposed method. To ensure the quality of the training samples, we re-annotate poorly and invalidly labeled images and subsequently remove severely blurred or duplicate samples based on the Laplacian variance (threshold < 100) and the Perceptual Hash algorithm (Hamming distance ≤ 5). Ultimately, 5760 images are retained, representing 74% of the original dataset. This indicates that only a small fraction of the images in the refined dataset are similar to one another. The dataset comprises four types of objects, Holothurian, Echinus, Scallop, and Starfish, and is randomly split into training and testing sets in a 7:3 ratio. Figure 5 illustrates the number and distribution of instances across different scales in the dataset.
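The two filtering criteria can be reproduced with standard tooling; the snippet below is a hypothetical sketch using OpenCV and the imagehash package (the paper does not state which implementation was used), with the stated thresholds of Laplacian variance < 100 and Hamming distance ≤ 5.

```python
import cv2
import imagehash
from PIL import Image

def is_severely_blurred(path: str, threshold: float = 100.0) -> bool:
    """Flag an image whose variance of the Laplacian falls below the threshold."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def is_near_duplicate(path_a: str, path_b: str, max_hamming: int = 5) -> bool:
    """Flag a pair of images whose perceptual-hash Hamming distance is at most 5."""
    ha, hb = imagehash.phash(Image.open(path_a)), imagehash.phash(Image.open(path_b))
    return (ha - hb) <= max_hamming
```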
Configurations: All experiments are conducted on a workstation running Ubuntu 22.04. The workstation is equipped with an Intel Core i7 processor, three NVIDIA RTX A6000 GPUs, and 32 GB of RAM. The experiments are implemented using PyTorch 1.12.1, CUDA 11.6, and cuDNN 8.3.2. Moreover, the model is optimized using the AdamW optimizer, starting with an initial learning rate of 0.01 and a batch size of 16. The number of training epochs is set to 200. Specifically, the hyperparameters k and S are set to 4 and 7, respectively. To ensure the stability and reliability of the experimental results, all metrics are averaged over five independent runs.

4.2. Efficient Dynamic Sparse Head Performance

To effectively demonstrate the efficiency and superiority of the proposed EDSHead, comparisons are made in terms of both Dyhead and depth. YOLOv8 is used as the benchmark with a resolution of 640 × 640. AP, AP50, and AP75 are used as indicators, and the comparison results are shown in Table 1. Compared with the baseline, the Dyhead-based scheme improves detection accuracy because the unified attention mechanism significantly enhances the representation ability of the detection head. In contrast, the proposed EDSHead-based scheme further improves detection accuracy compared to Dyhead. This is attributed to the depthwise separable convolution and grouped convolution techniques, which more effectively extract key semantic information. Additionally, we evaluate the efficiency of EDSHead by controlling the network depth. We adjust the number of stacked EDSHead Blocks (i.e., 1, 2, 3, 4, 5, or 6 blocks) and analyze their performance versus computational cost (GFLOPs) at different depths. The results show that as the number of blocks increases, the detection accuracy does not consistently improve but rather decreases when the number of blocks is greater than or equal to 4, despite a significant rise in computational overhead. In contrast, EDSHead with block = 1 achieves optimal accuracy while maintaining the lowest computational cost, indicating the efficiency of its structural design. It is worth noting that, compared with Dyhead, EDSHead achieves higher accuracy at a lower computational cost, improving AP, AP50, and AP75 by 0.39%, 0.46%, and 0.51%, respectively. This indicates that EDSHead is more expressive and easier to converge, making it suitable for application scenarios with limited computational resources.

4.3. Attention Performance Comparisons

To fully demonstrate the superiority and justification of the RTRA, we conduct a comprehensive comparison using five representative attention-based detection methods, including MLCA [36], GAM [37], CA [38], CBAM [39], and ECA [40], with the results presented in Table 2. RTRA performs best on AP, AP50, and AP75. MLCA integrates channel and spatial attention mechanisms to enhance both local and global information representation, achieving accurate detection across all categories, second only to the RTRA framework. In contrast, GAM is limited by its focus on local channel–space relationships and lacks the ability to model long-range dependencies, making it difficult to capture the global semantic information of benthic organisms in complex backgrounds. CA's global pooling technique has limited effectiveness in suppressing noisy regions, hindering the accurate extraction of BO features. CBAM's spatial attention, which relies on max-pooling and Avg-pooling operations along the channel dimension, tends to overlook small-scale targets, thereby limiting detection performance. ECA achieves efficient channel attention modeling through 1D convolution without dimensionality reduction, resulting in optimal detection performance for the Scallop category.

4.4. Ablation Study

To systematically validate the effectiveness of each proposed component and analyze both their individual contributions and combined effects, we conducted a comprehensive ablation study. Starting from the YOLOv8n baseline, we evaluate the impact of integrating each of the three modules—EDSHead, MBFDown, and RTRA—both independently and in various combinations. This experimental design enables us to quantify the incremental gains introduced by each module and to examine their synergistic interactions in contributing to the overall performance of the EMR-YOLO model. The results are shown in Table 3.
As shown in Table 3, when individually added to YOLOv8n, each proposed module contributes positively to performance. The introduction of EDSHead leads to improvements of 1.1% in AP and 1.38% in AP50, highlighting substantial accuracy gains through enhanced adaptability to multi-scale and multi-morphology BOs. It should be noted that the MBFDown module, while delivering a 0.17% gain in AP, also reduces the parameter count and computational load by 0.48 M and 1 GFLOPs, respectively, confirming the efficiency of its lightweight design. The RTRA module also increases AP by 0.59%, demonstrating its effectiveness in suppressing background noise. This study also reveals the synergistic effects between modules. For example, introducing the RTRA attention mechanism on top of EDSHead increases AP to 54.57%, representing a 1.84% improvement over the baseline. This gain exceeds the sum of their individual contributions, indicating that the two modules work effectively together to improve context modeling in complex scenarios. Ultimately, the complete EMR-YOLO scheme achieves respective improvements of 2.33%, 1.50%, and 4.12% on AP, AP50, and AP75, with only a modest increase in parameters and computation, demonstrating an excellent balance between detection accuracy and model complexity.
To visually demonstrate the cumulative and synergistic effects of our key modules, we select four representative configurations from Table 3 (rows 1, 2, 6, and 8) to plot their Precision–Recall (P-R) curves in Figure 6. Specifically, these four configurations are (1) Baseline, original YOLOv8; (2) Baseline + EDSHead; (3) Baseline + EDSHead + RTRA; and (4) the final EMR-YOLO model. The P-R curves clearly show that each successive addition progressively shifts the curve towards the top-right corner, indicating consistent improvements in detection performance. Notably, EMR-YOLO achieves a superior balance of precision and recall for the Echinus and Scallop categories. Although precision gains are limited for Starfish and Holothurian, the increase in recall indicates that the scheme can enhance the effective detection capability of the model.

4.5. Comparisons with Typical Object Detection Approaches

Furthermore, to thoroughly demonstrate the superiority of the proposed EMR-YOLO scheme, we compare it with a range of typical detection methods, including the two-stage network Faster R-CNN [41], the single-stage networks SSD [42] and RetinaNet [43], CenterNet [10], FCOS [44], PAA [45], YOLOv3 [46], YOLOv4 [47], YOLOv5, YOLOv6 [48], YOLOv7 [49], YOLOv8, YOLOv9 [50], YOLOv10 [51], and YOLOv11. The comparison results are presented in Table 4.
As can be seen in the table, the detection accuracies of the two-stage scheme Faster R-CNN are 26.32%, 29.46%, and 33.07% lower than those of the proposed EMR-YOLO in terms of the AP, AP50, and AP75 metrics, respectively. This indicates that Faster R-CNN's accuracy is significantly lower than that of EMR-YOLO in BOD. This is because Faster R-CNN relies on region proposals with fixed anchor boxes, making it difficult to adapt to the variable morphology and scale differences of BOs. Additionally, the two-stage architecture restricts efficient optimization, resulting in an insufficient ability to accurately model small and occluded targets. In contrast, SSD based on the VGG16 backbone simplifies the detection process and significantly reduces model complexity to only 12.9% of that of Faster R-CNN, but this comes at the cost of lower accuracy. Among the other single-stage methods, RetinaNet addresses the class imbalance problem by introducing Focal Loss, improving the accuracy of small object detection. However, its high computational complexity makes it less efficient in terms of speed and resource consumption compared to some lightweight models. The anchor-free CenterNet achieves target detection by predicting the center point, width, and height of the target. By simplifying the detection process and reducing computation, it strikes a balance between accuracy and efficiency. CenterNet excels in detecting Echinus and Starfish, achieving AP values of 83.51% and 77.86%, respectively. Although FCOS and PAA achieve relatively high accuracy, with AP50 scores of 72.08 and 74.24, respectively, both models suffer from large parameter sizes and high computational complexity.
For the YOLO series, YOLOv3, as an earlier version, demonstrates good performance, with its AP value exceeding 60% in all categories. However, its substantial computational overhead prevents it from achieving the high efficiency of EMR-YOLO. YOLOv4 and YOLOv5 further improve accuracy. YOLOv5 in particular, while maintaining a lower number of parameters and lower computational complexity, achieves accuracy comparable to YOLOv4. This is attributed to the introduction of CSPDarknet53 as the backbone, which enhances feature extraction and reduces model complexity, and the adaptive anchor box construction strategy, which improves detection capability. The performance improvement from YOLOv6 to YOLOv11 is even more obvious. YOLOv6 introduces the re-parameterizable backbone EfficientRep and RepPAN, which significantly boost detection accuracy. YOLOv7 enhances the learning capacity of the convolutional network through the Extended Efficient Layer Aggregation Network (E-ELAN) and a dynamic label assignment strategy. YOLOv8, YOLOv9, YOLOv10, and YOLOv11 achieve very high AP values, close to or exceeding 80% across all categories. Their success is attributed to further optimization of the network structure, loss function, and training strategy. YOLOv8 integrates a Cross-Stage Partial network with a two-way fusion module and a task-aligned allocation strategy, enhancing feature extraction and localization performance in complex underwater environments and achieving higher detection accuracy for the Echinus and Starfish categories. YOLOv9, with the introduction of the GELAN backbone and the PGI mechanism, further improves the model's expressive ability and detection accuracy on a lightweight basis, matching the accuracy of YOLOv8. YOLOv10 and YOLOv11, through the introduction of the Dual-Path Fusion Network and Faster Feature Pyramid Networks, respectively, enhance the detection of multi-scale BOs.
Compared with existing mainstream methods, the proposed EMR-YOLO achieves superior detection performance in Scallop and Starfish target detection. In terms of comprehensive metrics, including AP, AP50, and AP75, the EMR-YOLO framework outperforms other state-of-the-art algorithms. Additionally, EMR-YOLO features a smaller model size and lower computational overhead, making it well suited for edge deployment in complex underwater environments.

4.6. Visualization of Test Results

To verify the generality of the proposed EMR-YOLO scheme in real underwater environments, we selected four typical underwater scenarios for testing: background and BO confusion (the first column), motion blur (the second column), the dense distribution of BOs (the third column), and the coexistence of multi-scale BOs (the fourth column), as shown in Figure 7. The results indicate that the EMR-YOLO scheme outperforms similar methods in scenarios involving background confusion and multi-scale coexistence. Although not all BOs are correctly detected, the EMR-YOLO scheme effectively addresses the deformation problem of densely occluded targets, such as Echinus and Holothurian, due to its superior morphological and multi-scale modeling capabilities. In contrast, in the scenario of multi-scale BO coexistence, YOLOv7 achieves the highest detection precision and accuracy. This is primarily attributed to its introduction of the E-ELAN structure and label allocation strategy, which significantly enhances the feature expression capability of the deep convolutional network and improves detection performance across different scales.

5. Discussion

5.1. MBFDown: Lightweight Design and Efficient Feature Extraction

The main architecture of MBFDown draws on the design concepts of Inception [52,53,54] and CSPNet [55]. Inception enhances network feature representation by extracting multi-scale features in parallel within the same layer using convolutions and pooling operations of different scales, but parallel computation with large convolution kernels incurs a high computational cost.
CSPNet is a lightweight model whose main feature is reducing computational redundancy by partitioning the feature map. The architecture utilizes a cross-stage feature fusion strategy and a gradient truncation strategy to optimize the gradient information flow, thus achieving richer gradient combinations, enhancing the variability of the features learned within different layers, and significantly reducing the computational effort. Combining the advantages of both, a lightweight and efficient MBFDown downsampling module is designed. A structural comparison between CSPNet and MBFDown is presented in Figure 8.

5.2. RTRA: Enhancing Underwater Background Adaptability

To enhance the model's ability to represent target features in degraded underwater visual environments, we propose Regional Two-Level Routing Attention (RTRA). The implementation of RTRA is similar to that of BiFormer [56], which first filters regions and then applies attention within the union of the filtered regions. Compared with static sparse attention methods, BiFormer introduces an additional step to determine the regions, which results in higher computational overhead. RTRA instead utilizes depthwise separable convolution to obtain the region-level Query and Key. This improvement allows RTRA to more effectively mitigate background noise while maintaining lower computational overhead, thereby reducing missed detections and enhancing semantic understanding in dynamic underwater environments.

5.3. EDSHead: Improving Multi-Scale Target Perception Capability

EDSHead integrates the concepts of DCNv4 [57] and Dynamic Head [58]. By incorporating the efficient sparse attention operator DCNv4 and deploying self-attention mechanisms across the level, spatial, and channel dimensions, the detection head is enabled to adaptively focus on key target regions.
DCNv4 was introduced in $\Pi_S$ to enhance the spatial feature modeling capability. To better understand its role, the principle of DCN is first reviewed.
As shown in Figure 9, the sampling locations for a standard 3 × 3 convolutional kernel are as follows:
$I = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\} \quad (10)$
where $I$ denotes the set of positions of the sampling points.
For each location $p$ on the output feature map $y$,
$y(p) = \sum_{p_k \in I} w(p_k) \cdot x(p + p_k) \quad (11)$
where $w$ denotes the convolution weight and $x$ represents the pixel value at the corresponding location in the input feature map.
In DCN, the regular grid $I$ is augmented by offsets $\{\Delta p_k \mid k = 1, \ldots, K\}$, where $K = |I|$, and the sampled position $\tilde{p}$ after the offset can be represented as
$\tilde{p} = p_k + \Delta p_k \quad (12)$
Then, the response at each position $p$ on $y$ can be expressed as
$y(p) = \sum_{p_k \in I} w(p_k) \cdot x(p + p_k + \Delta p_k) \quad (13)$
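The deformable sampling above can be made concrete with a short, purely illustrative sketch that resolves the fractional sampling positions via bilinear interpolation; the tensor shapes, the scalar per-point weights shared across channels, and the use of grid_sample are simplifying assumptions rather than the actual DCN/DCNv4 implementation.

```python
import torch
import torch.nn.functional as F

def deformable_sample_sketch(x, offsets, weight):
    """Illustrative y(p) = sum_k w(p_k) * x(p + p_k + Δp_k) for a 3x3 kernel.
    x:       (B, C, H, W) input feature map
    offsets: (B, 9, 2, H, W) learned offsets Δp_k as (Δy, Δx) per output location
    weight:  (9,) per-sampling-point weights w(p_k), shared across channels for brevity
    """
    B, C, H, W = x.shape
    K = 9
    # regular grid p_k of a 3x3 kernel: (-1,-1), (-1,0), ..., (1,1)
    pk = torch.stack(torch.meshgrid(torch.arange(-1, 2), torch.arange(-1, 2),
                                    indexing="ij"), dim=-1).reshape(K, 2).float()
    # base coordinates p of every output location
    py, px = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    p = torch.stack([py, px], dim=0).float()                        # (2, H, W)
    y = torch.zeros_like(x)
    for k in range(K):
        loc = p + pk[k].view(2, 1, 1) + offsets[:, k]               # p + p_k + Δp_k, shape (B, 2, H, W)
        # normalize to [-1, 1] for grid_sample (x coordinate first, then y)
        grid = torch.stack([2 * loc[:, 1] / (W - 1) - 1,
                            2 * loc[:, 0] / (H - 1) - 1], dim=-1)   # (B, H, W, 2)
        y = y + weight[k] * F.grid_sample(x, grid, align_corners=True)  # bilinear interpolation
    return y

# Example: zero offsets and uniform weights reduce to a 3x3 box-filter-like aggregation
out = deformable_sample_sketch(torch.randn(1, 8, 16, 16),
                               torch.zeros(1, 9, 2, 16, 16), torch.full((9,), 1.0 / 9))
```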
DCNv4 is a deformable convolution operator that differs from the original DCN, with the main improvements including the following: (1) introducing a modulation scalar $\Delta m_k$ to regulate the feature amplitude at different spatial locations, thereby enhancing the ability of DCN to control the spatial support region; (2) introducing a grouping mechanism that divides the spatial aggregation process into G groups, with each group independently learning sampling offsets $\Delta p_k$ and modulation scalars $m_k$ to improve the module's feature expression capability; (3) having all sampling points of the same convolutional kernel share the projection weight $w$, which reduces the number of parameters and improves the efficiency of the model; and (4) optimizing the DCN memory access process to reduce redundant operations and merge memory instructions, thereby significantly improving the processing speed.

6. Limitations and Future Work

Despite the promising results achieved in this study, several avenues for future improvement remain. A notable limitation lies in the absence of real-world deployment and evaluation on embedded hardware platforms. While we report theoretical metrics such as FLOPs and parameter count, these do not fully capture the practical performance of the model. To address this, further experiments on actual underwater embedded devices are planned to comprehensively assess the practical performance of the model, including inference speed, latency, and power consumption. Another important consideration is the reliance on a single dataset. Although DUO is a comprehensive and challenging benchmark, and Section 4.6 further discusses the applicability of the model to real-world scenarios, this alone is insufficient to fully demonstrate the generalizability of the proposed method. Future work will extend the evaluation of EMR-YOLO to additional public underwater datasets to more thoroughly assess its generalization.

7. Conclusions

By systematically analyzing the limitations of YOLOv8 in BOD, this paper proposes EMR-YOLO, an efficient multi-scale detection algorithm designed for degraded underwater visual features and computationally constrained environments. To address these existing challenges and limitations, we design three targeted modules to optimize the baseline model. To tackle the variable sizes and morphologies of BOs, we design EDSHead, which integrates a unified attention detection head with dynamic sparse convolution to enhance spatial perception and multi-scale feature extraction capabilities. To mitigate the issue of limited computational resources underwater, we introduce MBFDown, a lightweight downsampling convolution that reduces the model's parameter count and complexity while preserving feature expression capabilities. To combat degraded underwater visual features, we propose RTRA, a lightweight attention mechanism that suppresses background feature expression and effectively mitigates the impact of underwater interference on detection performance. The experimental results demonstrate that EMR-YOLO achieves improvements of 2.33%, 1.50%, and 4.12% in AP, AP50, and AP75, respectively, on the DUO dataset with minimal computational overhead. The algorithm also exhibits superior performance across various complex underwater environments, reflecting a well-balanced trade-off between accuracy and complexity.

Author Contributions

Conceptualization, D.Z. and S.Z.; Data Curation, G.L.; Funding Acquisition, S.L.; Investigation, D.Z., S.Z. and J.Z.; Methodology, D.Z. and S.Z.; Supervision, Z.J., X.F. and S.L.; Validation, D.Z. and S.Z.; Writing—Original Draft, D.Z. and S.Z.; Writing—Review and Editing, D.Z., S.Z. and M.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (corresponding author: Siyuan Liu).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article. The experiments in this study were conducted on a refined version of the publicly available Detecting Underwater Objects (DUO) dataset. The original dataset can be accessed at https://github.com/chongweiliu/DUO (accessed on 21 February 2025). Our detailed data preprocessing methodology, including the criteria used for filtering images, is described in Section 4.1. Further information regarding the dataset is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, H.; Tang, Q.; Li, J.; Zhang, W.; Bao, X.; Zhu, H.; Wang, G. A review on underwater autonomous environmental perception and target grasp, the challenge of robotic organism capture. Ocean Eng. 2020, 195, 106644. [Google Scholar] [CrossRef]
  2. Chen, T.; Wang, N. Shuffled Grouping Cross-Channel Attention-Based Bilateral-Filter-Interpolation Deformable ConvNet with Applications to Benthonic Organism Detection. IEEE Trans. Artif. Intell. 2024, 5, 4506–4518. [Google Scholar] [CrossRef]
  3. Dakhil, R.A.; Khayeat, A.R.H. Review on Deep Learning Techniques for Underwater Object Detection. In Proceedings of the Data Science and Machine Learning, Copenhagen, Denmark, 17–18 September 2022; pp. 49–63. [Google Scholar]
  4. Zhang, D.; Wu, C.; Zhou, J.; Zhang, W.; Li, C.; Lin, Z. Hierarchical attention aggregation with multi-resolution feature learning for GAN-based underwater image enhancement. Eng. Appl. Artif. Intell. 2023, 125, 106743. [Google Scholar] [CrossRef]
  5. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; p. I. [Google Scholar]
  6. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  7. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  8. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Syst. Appl. 2023, 211, 118665. [Google Scholar] [CrossRef]
  9. Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 2023, 527, 204–232. [Google Scholar] [CrossRef]
  10. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar]
  11. Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164. [Google Scholar] [CrossRef]
  12. Zhang, J.; Zhu, L.; Xu, L.; Xie, Q. MFFSSD: An Enhanced SSD for Underwater Object Detection. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; pp. 5938–5943. [Google Scholar]
  13. Hua, X.; Cui, X.; Xu, X.; Qiu, S.; Liang, Y.; Bao, X.; Li, Z. Underwater object detection algorithm based on feature enhancement and progressive dynamic aggregation strategy. Pattern Recognit. 2023, 139, 109511. [Google Scholar] [CrossRef]
  14. Chen, L.; Zhou, F.; Wang, S.; Dong, J.; Li, N.; Ma, H.; Wang, X.; Zhou, H. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognit. 2022, 132, 108926. [Google Scholar] [CrossRef]
  15. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  16. Fu, X.; Liu, Y.; Liu, Y. A case study of utilizing YOLOT based quantitative detection algorithm for marine benthos. Ecol. Inform. 2022, 70, 101603. [Google Scholar] [CrossRef]
  17. Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. arXiv 2020, arXiv:2006.02334. [Google Scholar] [CrossRef]
  18. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  19. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168. [Google Scholar] [CrossRef]
  20. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2016, arXiv:1511.07122. [Google Scholar] [CrossRef]
  21. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  22. Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the Value of Network Pruning. arXiv 2019, arXiv:1810.05270. [Google Scholar] [CrossRef]
  23. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  25. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083. [Google Scholar] [CrossRef]
  26. Li, C.; Anwar, S.; Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 2020, 98, 107038. [Google Scholar] [CrossRef]
  27. Wu, S.; Luo, T.; Jiang, G.; Yu, M.; Xu, H.; Zhu, Z.; Song, Y. A Two-Stage Underwater Enhancement Network Based on Structure Decomposition and Characteristics of Underwater Imaging. IEEE J. Ocean. Eng. 2021, 46, 1213–1227. [Google Scholar] [CrossRef]
  28. Li, C.; Guo, J.; Guo, C. Emerging from Water: Underwater Image Color Correction Based on Weakly Supervised Color Transfer. IEEE Signal Process. Lett. 2018, 25, 323–327. [Google Scholar] [CrossRef]
  29. Guo, Y.; Li, H.; Zhuang, P. Underwater Image Enhancement Using a Multiscale Dense Generative Adversarial Network. IEEE J. Ocean. Eng. 2020, 45, 862–870. [Google Scholar] [CrossRef]
  30. Liu, Y.; Wang, S. A quantitative detection algorithm based on improved faster R-CNN for marine benthos. Ecol. Inform. 2021, 61, 101228. [Google Scholar] [CrossRef]
  31. Xu, X.; Liu, Y.; Lyu, L.; Yan, P.; Zhang, J. MAD-YOLO: A quantitative detection algorithm for dense small-scale marine benthos. Ecol. Inform. 2023, 75, 102022. [Google Scholar] [CrossRef]
  32. Chen, L.; Huang, Y.; Dong, J.; Xu, Q.; Kwong, S.; Lu, H.; Lu, H.; Li, C. Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future. arXiv 2024, arXiv:2410.05577. [Google Scholar] [CrossRef]
  33. Ren, S.; Zhou, D.; He, S.; Feng, J.; Wang, X. Shunted Self-Attention via Multi-Scale Token Aggregation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10843–10852. [Google Scholar]
  34. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic ReLU. arXiv 2020, arXiv:2003.10027. [Google Scholar] [CrossRef]
  35. Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A Dataset and Benchmark of Underwater Object Detection for Robot Picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  36. Bornstein, T.; Lange, D.; Münchmeyer, J.; Woollam, J.; Rietbrock, A.; Barcheck, G.; Grevemeyer, I.; Tilmann, F. PickBlue: Seismic phase picking for ocean bottom seismometers with deep learning. arXiv 2023, arXiv:2304.06635. [Google Scholar] [CrossRef]
  37. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
  38. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. arXiv 2021, arXiv:2103.02907. [Google Scholar] [CrossRef]
  39. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  40. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. arXiv 2020, arXiv:1910.03151. [Google Scholar] [CrossRef]
  41. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  42. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
  43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [PubMed]
  44. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355. [Google Scholar]
  45. Kim, K.; Lee, H.S. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXV. pp. 355–371. [Google Scholar]
  46. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  47. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  48. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  49. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  50. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  51. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  52. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]
  53. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar] [CrossRef]
  54. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015, arXiv:1512.00567. [Google Scholar] [CrossRef]
  55. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  56. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. arXiv 2023, arXiv:2303.08810. [Google Scholar]
  57. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Wang, W.; Lu, T.; Li, H.; Qiao, Y.; et al. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications. arXiv 2024, arXiv:2401.06197. [Google Scholar] [CrossRef]
  58. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. arXiv 2021, arXiv:2106.08322. [Google Scholar] [CrossRef]
Figure 1. An overview of the EMR-YOLO architecture. The model adopts a standard backbone–neck–head design and integrates three key modules: MBFDown, RTRA, and EDSHead. The diagrams on the right show overviews of the RTRA and EDSHead modules.
Figure 2. The structure of the MBFDown module.
Figure 3. The detailed structure of the proposed RTRA. RTRA enhances target focus and reduces computational cost by applying attention to semantically relevant regions selected through top-k filtering and DWConv-based scoring.
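For readers who want a concrete handle on the routing step described in the caption, the PyTorch sketch below partitions the feature map into regions, scores them with a depthwise convolution, keeps the top-k regions, and applies self-attention only inside the selected regions. The region size, the scorer, and the global top-k rule are illustrative assumptions, not the authors' RTRA implementation.

```python
# Simplified sketch of top-k region selection followed by attention, in the
# spirit of the routing step shown in Figure 3. NOT the authors' code: the
# region size, the DWConv scorer, and the global top-k rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRegionAttention(nn.Module):
    def __init__(self, dim: int, region: int = 8, topk: int = 4, heads: int = 4):
        super().__init__()
        self.region, self.topk = region, topk
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DWConv scorer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        r = self.region
        gh, gw = H // r, W // r
        # 1) DWConv-based scoring, pooled to one score per region.
        score = F.adaptive_avg_pool2d(self.dwconv(x).mean(1, keepdim=True), (gh, gw))
        idx = score.flatten(1).topk(self.topk, dim=1).indices            # (B, k)
        # 2) Rearrange the map into region tokens: (B, gh*gw, r*r, C).
        tokens = (x.unfold(2, r, r).unfold(3, r, r)                      # (B, C, gh, gw, r, r)
                   .permute(0, 2, 3, 4, 5, 1).reshape(B, gh * gw, r * r, C))
        # 3) Gather the selected regions and attend only over their pixels.
        gather = idx[:, :, None, None].expand(-1, -1, r * r, C)
        sel = torch.gather(tokens, 1, gather).reshape(B, -1, C)
        attended, _ = self.attn(sel, sel, sel)
        # 4) Write the attended regions back; unselected regions pass through.
        out = tokens.clone()
        out.scatter_(1, gather, attended.reshape(B, self.topk, r * r, C))
        # 5) Fold the region tokens back into a (B, C, H, W) feature map.
        return (out.reshape(B, gh, gw, r, r, C)
                   .permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W))

if __name__ == "__main__":
    print(TopKRegionAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # (2, 64, 32, 32)
```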
Figure 4. The architecture of the proposed EDSHead. EDSHead processes multi-scale features using a unified attention mechanism across three dimensions: scale, space (via DCNv4), and channel.
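The three attention stages named in the caption can be pictured with a short sketch in the spirit of the unified "+Dyhead"-style head compared in Table 1. A plain 3 × 3 convolution stands in for the DCNv4 spatial step, and the per-level gating is a simplification of scale attention; all names and ratios are illustrative assumptions, not the EDSHead code.

```python
# Condensed sketch of a scale -> spatial -> channel attention sequence as
# described for EDSHead in Figure 4. A plain 3x3 convolution is only a
# placeholder for the DCNv4 spatial step; names and ratios are assumptions.
import torch
import torch.nn as nn

class UnifiedHeadAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.scale_gate = nn.Sequential(            # per-level gating (a simplified
            nn.AdaptiveAvgPool2d(1),                # form of scale attention)
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # DCNv4 stand-in
        self.channel_gate = nn.Sequential(          # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, feats):
        # feats: list of pyramid levels (e.g. P3/P4/P5) with equal channel count.
        out = []
        for f in feats:
            f = f * self.scale_gate(f)              # scale-aware re-weighting
            f = self.spatial(f)                     # spatial aggregation
            f = f * self.channel_gate(f)            # channel-aware re-weighting
            out.append(f)
        return out

if __name__ == "__main__":
    levels = [torch.randn(1, 128, s, s) for s in (80, 40, 20)]
    print([t.shape for t in UnifiedHeadAttention(128)(levels)])
```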
Figure 5. Statistical distribution of instances in the dataset. (a) Statistical distribution of dataset categories and instance counts. (b) Relative size distribution of instances. (c) Relative area distribution of instances.
Figure 6. Precision–Recall curves for four ablation configurations, corresponding to the four benthic organism categories in the DUO dataset: (a) Echinus, (b) Scallop, (c) Starfish, and (d) Holothurian.
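For reference, each curve in Figure 6 is traced by ranking a category's detections by confidence and accumulating true and false positives. The NumPy sketch below shows the computation on toy values; the scores and match flags are made up for illustration and are not data from the paper.

```python
# Minimal sketch of how a precision-recall curve (as in Figure 6) is traced
# from ranked detections. The scores and TP/FP flags are toy values.
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    order = np.argsort(-scores)          # rank detections by confidence
    tp = np.cumsum(is_tp[order])         # cumulative true positives
    fp = np.cumsum(~is_tp[order])        # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / num_gt                 # num_gt: ground-truth instances of the class
    return precision, recall

if __name__ == "__main__":
    scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40])
    is_tp = np.array([True, True, False, True, False, True])
    p, r = pr_curve(scores, is_tp, num_gt=5)
    print(np.round(p, 2), np.round(r, 2))
```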
Figure 7. Comparison of real-world detection results.
Figure 8. (a) The structure of CSPNet. (b) The structure of MBFDown.
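As background for the comparison in Figure 8, the cross-stage partial idea in CSPNet splits the channels, transforms only one part, and re-merges the two paths. The sketch below illustrates that split–transform–merge pattern; the bottleneck used as the transform is an arbitrary choice, and this is not the paper's MBFDown code.

```python
# Minimal sketch of the cross-stage partial (CSP) pattern contrasted with
# MBFDown in Figure 8: split the channels, transform one part, then merge.
# The two-layer bottleneck used as the transform is an arbitrary choice.
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.transform = nn.Sequential(             # dense / bottleneck path
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(inplace=True))
        self.merge = nn.Conv2d(channels, channels, 1, bias=False)  # transition

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x.chunk(2, dim=1)                    # cross-stage split
        return self.merge(torch.cat([a, self.transform(b)], dim=1))

if __name__ == "__main__":
    print(CSPBlock(64)(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 40, 40)
```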
Figure 9. (a) Sampling positions of standard 3 × 3 convolution. (b) Sampling positions of DCN. The green dots represent the standard sampling points, and the red dots represent the offset sampling points.
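The sampling behaviour contrasted in Figure 9 can be reproduced with torchvision's deformable convolution: an all-zero offset tensor recovers the regular 3 × 3 grid of panel (a), while non-zero offsets displace the sampling points as in panel (b). The random offsets below simply stand in for the offsets a network would predict.

```python
# Reproducing the sampling behaviour of Figure 9 with torchvision's
# deform_conv2d: zero offsets sample on the regular 3x3 grid (panel a),
# non-zero offsets displace the sampling points (panel b).
import torch
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

x = torch.randn(1, 8, 16, 16)
weight = torch.randn(8, 8, 3, 3)                    # (out_ch, in_ch, kH, kW)
# One (dy, dx) pair per 3x3 sampling point -> 2 * 9 = 18 offset channels.
zero_offset = torch.zeros(1, 18, 16, 16)
random_offset = 0.5 * torch.randn(1, 18, 16, 16)    # stand-in for predicted offsets

y_regular = deform_conv2d(x, zero_offset, weight, padding=1)
y_deformed = deform_conv2d(x, random_offset, weight, padding=1)

# With zero offsets the result matches a standard convolution.
print(torch.allclose(y_regular, F.conv2d(x, weight, padding=1), atol=1e-5))  # True
print(y_regular.shape, y_deformed.shape)            # both torch.Size([1, 8, 16, 16])
```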
Table 1. Comparison of AP across different detection heads.

| Scheme | Block | Para (M) | GFLOPs | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| Baseline | | 3.01 | 8.1 | 52.73 | 86.91 | 58.31 |
| +Dyhead | | 5.19 | 17.6 | 53.74 | 87.83 | 60.42 |
| +EDSHead | 1 | 4.75 | 14.6 | 53.83 | 88.29 | 60.43 |
| +EDSHead | 2 | 6.24 | 16.2 | 53.92 | 87.71 | 60.41 |
| +EDSHead | 3 | 7.73 | 17.8 | 53.91 | 87.86 | 60.40 |
| +EDSHead | 4 | 9.22 | 19.3 | 53.82 | 87.21 | 60.36 |
| +EDSHead | 5 | 10.70 | 20.3 | 53.56 | 87.22 | 60.32 |
| +EDSHead | 6 | 12.93 | 23.6 | 49.64 | 83.20 | 54.44 |
Bold values in the tables indicate the best results for each metric; subsequent tables follow the same convention.
Table 2. Comparison of AP and mAP across different attention approaches.

| Scheme | AP (Echinus) | AP (Scallop) | AP (Starfish) | AP (Holothurian) | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|
| MLCA | 92.17 | 88.78 | 93.02 | 75.17 | 53.12 | 87.29 | 59.83 |
| GAM | 92.35 | 88.53 | 92.67 | 74.32 | 53.19 | 86.96 | 59.01 |
| CA | 92.13 | 88.72 | 93.12 | 75.12 | 53.14 | 87.27 | 59.78 |
| CBAM | 91.78 | 88.76 | 92.64 | 75.42 | 53.23 | 87.15 | 59.73 |
| ECA | 92.23 | 88.91 | 92.91 | 74.23 | 52.96 | 87.07 | 59.17 |
| RTRA | 92.26 | 89.74 | 92.95 | 75.49 | 53.31 | 87.61 | 59.84 |
Table 3. Ablation studies of different components.

| YOLOv8 | EDSHead | MBFDown | RTRA | Para (M) | GFLOPs | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 3.01 | 8.1 | 52.73 | 86.91 | 58.31 |
| ✓ | ✓ | | | 4.75 | 14.6 | 53.83 | 88.29 | 60.43 |
| ✓ | | ✓ | | 2.53 | 7.1 | 52.90 | 87.12 | 58.91 |
| ✓ | | | ✓ | 4.01 | 14.1 | 53.32 | 87.61 | 59.84 |
| ✓ | ✓ | ✓ | | 4.31 | 13.6 | 53.67 | 87.92 | 61.12 |
| ✓ | ✓ | | ✓ | 5.49 | 15.6 | 54.57 | 88.40 | 61.27 |
| ✓ | | ✓ | ✓ | 4.01 | 14.0 | 53.34 | 87.56 | 59.66 |
| ✓ | ✓ | ✓ | ✓ | 5.05 | 14.7 | 55.06 | 88.41 | 62.43 |
Table 4. Comparison with typical object detection methods.

| Scheme | Backbone | Para (M) | GFLOPs | AP (Echinus) | AP (Scallop) | AP (Starfish) | AP (Holothurian) | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|
| Faster RCNN | VGG16 | 41.32 | 251.4 | 68.11 | 40.16 | 69.88 | 48.97 | 28.74 | 59.19 | 29.36 |
| SSD | VGG16 | 26.44 | 32.5 | 44.39 | 34.66 | 49.81 | 30.17 | 22.06 | 39.09 | 24.81 |
| RetinaNet | ResNet50 | 34.13 | 100.5 | 78.61 | 62.18 | 72.29 | 65.15 | 33.49 | 68.81 | 31.17 |
| CenterNet | Hourglass104 | 11.76 | 15.6 | 83.51 | 72.48 | 77.86 | 61.62 | 38.93 | 73.87 | 36.58 |
| FCOS | ResNet18 | 19.14 | 39.12 | 77.64 | 66.42 | 76.29 | 67.96 | 47.92 | 72.08 | 53.72 |
| PAA | ResNet18 | 18.86 | 38.41 | 79.27 | 70.13 | 80.42 | 67.12 | 52.42 | 74.24 | 59.28 |
| YOLOv3 | Darknet53 | 61.53 | 193.8 | 79.88 | 64.27 | 80.96 | 60.05 | 30.38 | 69.17 | 24.88 |
| YOLOv4 | CSPDarknet53 | 52.59 | 119.8 | 84.97 | 71.94 | 82.03 | 73.66 | 33.31 | 76.84 | 31.36 |
| YOLOv5-S | CSPDarknet53 | 7.34 | 16.6 | 84.96 | 73.14 | 83.91 | 74.69 | 44.08 | 76.71 | 42.05 |
| YOLOv6-N | EfficientRep | 4.32 | 11.1 | 87.71 | 74.55 | 84.64 | 71.98 | 45.01 | 77.87 | 43.96 |
| YOLOv7-Tiny | E-ELAN | 6.01 | 13.1 | 88.15 | 75.68 | 85.21 | 75.07 | 46.84 | 78.41 | 45.09 |
| YOLOv8-N | CSPDarknet53 | 3.01 | 8.1 | 91.80 | 85.34 | 91.93 | 78.58 | 52.73 | 86.91 | 58.31 |
| YOLOv9-T | CSPDarknet53 | 7.12 | 29.3 | 92.39 | 84.99 | 92.59 | 80.14 | 51.97 | 87.66 | 58.09 |
| YOLOv10-S | CSPDarknet53 | 7.23 | 21.4 | 94.25 | 86.36 | 91.15 | 80.89 | 52.06 | 87.81 | 59.12 |
| YOLOv11-S | CSPDarknet53 | 9.52 | 21.7 | 95.06 | 85.25 | 93.41 | 79.98 | 52.19 | 88.01 | 59.47 |
| Ours | CSPDarknet53 | 5.05 | 14.7 | 93.38 | 88.78 | 93.52 | 78.92 | 55.06 | 88.41 | 62.43 |
To ensure fairness, each YOLO model variant used in the comparison (e.g., YOLOv5-S, YOLOv8-N) was selected to match EMR-YOLO in scale and complexity. All models were evaluated with a fixed input resolution of 640 × 640.