1. Introduction
With the maturity of target detection technology on land, the development of underwater resource exploitation, marine ecological protection, and underwater robotics has become a trend, and the vigorous growth of the marine economy is a major guarantee of national security [1]. Target detection is crucial for the protection and exploitation of seabed organisms and resource systems, and underwater image processing has been widely applied in a variety of underwater scenarios, playing an important role in exploring them. Nevertheless, dynamic underwater environments make degraded underwater regions more complex and unpredictable, and underwater robots require target detection algorithms that fit within tight storage and computational budgets, a requirement made more pressing by the limited on-board storage and computing capacity available at this stage. In addition, underwater environments suffer from several problems such as low light, dense distribution of organisms, low contrast, and small object size (see Figure 1). Because of these challenges, underwater images exhibit blurred boundaries and loss of texture information, making underwater target detection more difficult [2].
In target detection, algorithms fall into two categories: two-stage and single-stage. Two-stage algorithms first extract candidate boxes from the image and then perform a secondary refinement on the candidate regions to obtain the detection result; common examples include R-CNN [3], Fast R-CNN [4], and Faster R-CNN [5]. Single-stage algorithms omit candidate-box extraction and directly classify and localize targets on the feature map; typical examples are SSD [6] and the YOLO (You Only Look Once) [7,8] series. Compared with two-stage detectors, single-stage detectors have lower detection accuracy, but they are fast, carry relatively low computational overhead, and offer good real-time performance. The YOLO series is widely used in academic and industrial settings thanks to its high performance and efficiency, and it has proven to be a classical choice for single-stage detection networks in practical applications. However, despite YOLO's strong performance in real-time detection, its effectiveness degrades in particular application scenarios, such as dynamic underwater environments.
Underwater target detection differs from detection on land in that the underwater environment is more complex, and the difficulty of creating and labeling datasets increases by orders of magnitude. Underwater targets are generally small, making them hard to capture and recognize, which inevitably reduces the effectiveness of many target detection algorithms. Beyond accuracy, computation and parameter count are two indispensable considerations when porting detection algorithms to underwater mobile devices; detection algorithms suited to underwater scenarios must combine detection accuracy with practical deployability. Recent research has developed models specialized for underwater detection. Liu et al. [9] proposed YWnet to address the problem of small and blurred targets in underwater environments. Yeh et al. [10] introduced a lightweight underwater detection algorithm that integrates a color conversion technique with a neural network to counter color absorption. Nonetheless, these techniques often depend heavily on the natural quality of underwater images, and excessive enhancement can discard important image details, adversely affecting detection accuracy. Existing methods also largely ignore the significant computational overhead incurred when addressing underwater environmental challenges, which hinders practical deployment.
To address the above problems, Xu et al. [11] proposed using inverted residual blocks and group convolution [12] to extract deep features. This lets the feature extraction network prioritize information-rich channels and discard insignificant ones, simplifying the separation of perceptual targets from background information, further improving model accuracy, and significantly reducing the memory footprint required for inference. However, the method consumes more computation and parameter storage, exceeding the budgets typical of underwater image processing. Tang [13] explored the localization of small targets and fuzzy target boundaries in medical image processing and found that the Global-to-Local Spatial Aggregation (GLSA) module, which aggregates and represents both global and local spatial features, markedly improves localization for both large and small underwater targets; its huge parameter count, however, potentially limits its use in resource-constrained scenarios. Chen [14] developed a streamlined backbone network using Deformable Convolutional Network (DCN) v3 [15] to fuse feature maps across scales and achieved a mAP@0.5 of 86.1% on the DUO dataset, but the added computational complexity curtails its application under real-time and resource constraints. Detail-enhanced convolution (DEConv) [16] promotes feature learning and can integrate prior information to supplement the original features and enhance representation capability, yet significantly reducing model parameters and computation at the expense of accuracy is rarely advisable.
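As a back-of-envelope illustration of why group convolution [12] reduces parameters, consider the weight count of a standard versus a grouped convolution. The channel widths and group number below are hypothetical examples, not the configuration of any of the cited methods:

```python
def conv2d_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution layer (bias omitted).

    Each of the `groups` branches maps c_in/groups input channels to
    c_out/groups output channels with its own k x k kernels, so the
    total shrinks by a factor of `groups` versus a dense convolution.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

# Standard 3x3 convolution over 256 -> 256 channels
standard = conv2d_params(256, 256, 3)           # 589,824 weights
# Same layer split into 4 groups: 4x fewer weights
grouped = conv2d_params(256, 256, 3, groups=4)  # 147,456 weights
assert standard == 4 * grouped
```

The same arithmetic explains the memory savings reported by Xu et al. [11]: grouping trades cross-group channel mixing for a proportional cut in weights and multiply-accumulates.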
To address these problems, this paper presents BGLE-YOLO, a lightweight model built on structural improvements to YOLOv8. The main contributions of the work are as follows:
To lighten the YOLOv8 backbone, EMC convolution is designed with a parameter count and computational load lower than those of the original 3 × 3 convolution. Inheriting the grouping idea of group convolution, it lets the backbone extract multi-scale feature information more efficiently during feature extraction.
The BIG module is introduced. It reduces the erroneous information that accumulates in high-level features as detection depth increases. In the neck network, it fuses the local spatial detail information and global spatial semantic information extracted by the backbone, enabling fast and efficient multi-scale feature fusion.
The LSH module is introduced because shared convolution drastically lowers the parameter count, while a Scale layer rescales features to cope with the inconsistent scales of the underwater targets handled by each detection head. Detail-enhanced convolution and group normalization are also incorporated to improve detail capture and minimize accuracy loss while keeping the detection head's parameters and computation small.
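The contributions above lean on two parameter-saving ideas. To give a rough sense of the second one, sharing head convolutions across pyramid levels, the sketch below counts weights for independent versus shared heads; the channel width, kernel size, and number of levels are illustrative assumptions, not the exact BGLE-YOLO head:

```python
def head_params(c, k, levels, shared):
    """Weight count of the detection-head convolutions (bias omitted).

    Independent heads: each of `levels` pyramid levels owns its own
    k x k conv over c channels. A shared head keeps one conv and adds
    only a learnable Scale scalar per level to re-adapt magnitudes.
    """
    conv = c * c * k * k
    return conv + levels if shared else levels * conv

independent = head_params(128, 3, levels=3, shared=False)  # 442,368
shared = head_params(128, 3, levels=3, shared=True)        # 147,459
assert shared < independent / 2
```

Under these toy numbers, weight sharing cuts the head to roughly a third of its size, at the cost of forcing all levels through the same filters, which is exactly why a cheap per-level Scale factor is needed to restore scale-specific behavior.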
The remainder of the paper is arranged as listed below.
Section 2 presents the overall network structure and its individual components, fully exploring the role each building block plays in the overall network and the correlations between the modules.
Section 3 provides the experimental results, which fully confirm the validity of the proposed model through ablation and comparison experiments.
Section 4 presents a discussion, summarizes the proposed model and its limitations, and suggests an outlook for the improvement of the subsequent benchmark model. Finally,
Section 5 provides the research conclusion.
4. Discussion
4.1. Findings
This paper proposes a novel underwater fish detection method called BGLE-YOLO. By introducing EMC, it retains redundant feature maps while strengthening the correlation between them; incorporating the EMC module alone already greatly decreases the parameters and computations of BGLE-YOLO. The BIG module, designed within the feature fusion network, enhances long-range spatial relationships between pixels and effectively extracts local features of interest in images of small underwater fish. GN (group normalization) and DEConv (detail-enhanced convolution) in the LSH module are crucial for training underwater detection tasks, for balancing accuracy against parameter count, and for coping with the degraded visual quality, such as low contrast and color distortion, of captured images.
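For intuition on why GN suits this setting, the sketch below computes group normalization for a single sample, simplified to one activation value per channel; the channel values and group count are arbitrary examples, not taken from the model:

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization over one sample.

    x: flat list of per-channel activations (one value per channel,
    for brevity). Channels are split into `num_groups` contiguous
    groups; each group is normalized with its own mean and variance.
    Because the statistics come from a single sample, they do not
    depend on batch size -- helpful when detectors are trained with
    the small batches typical of high-resolution underwater imagery.
    """
    n = len(x)
    size = n // num_groups
    out = []
    for g in range(num_groups):
        grp = x[g * size:(g + 1) * size]
        mean = sum(grp) / size
        var = sum((v - mean) ** 2 for v in grp) / size
        out += [(v - mean) / math.sqrt(var + eps) for v in grp]
    return out

y = group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2)
# Each group is centered independently: (1, 3) and (10, 30)
assert abs(y[0] + y[1]) < 1e-6 and abs(y[2] + y[3]) < 1e-6
```

In a full implementation each group would also pool statistics over its spatial positions and apply learnable affine parameters, but the batch-independence shown here is the property that matters for stable small-batch training.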
Systematic ablation experiments verify the effectiveness of each innovative module. In the DUO ablation, after introducing the EMC module, the model's mAP@0.5:0.95 reached 65.5%, while parameters and computational complexity fell by 9% and 6%, respectively. After incorporating BIG, mAP@0.5:0.95 rose to 65.7%, computational complexity dropped by 0.3 GFLOPs, and parameters fell by 21%, indicating that BIG plays a key role in capturing global and local spatial features and in optimizing convolution kernel parameters. Finally, the group normalization (GN) and detail-enhanced convolution (DEConv) submodules in the LSH module significantly improved BGLE-YOLO's performance: activating DEConv with RPC position embedding and combining it with GN achieved the best accuracy, with mAP@0.5 and mAP@0.5:0.95 reaching 84.2% and 65%, respectively, while maintaining the lowest computational complexity and parameter count, reduced by 47% and 23%, respectively. In the RUOD ablation, BGLE-YOLO likewise showed a strong lightweight advantage: after adding EMC, accuracy increased to 85.4% while parameters and computation fell by 10% and 6%, respectively; with the BIG and LSH modules introduced, mAP@0.5 reached 84.1% with parameters and computation lowered by 47% and 23%, respectively. These experiments fully confirm the modules' effectiveness in enhancing model performance while reducing parameters and computation.
4.2. Limitations and Future Works
Despite significant performance improvements on the RUOD and DUO datasets, BGLE-YOLO still faces some challenges. First, its accuracy is equal to or slightly lower than that of the baseline model on both datasets, and some practical application scenarios may demand higher accuracy. In addition, the generalization ability and robustness of BGLE-YOLO need further verification, and the model should be made more applicable to other datasets. Finally, as underwater vision technology develops, increasingly complex underwater conditions arise, such as turbidity, fog effects, and color deviation. Subsequent work should focus on improving accuracy as much as possible at even lower parameter counts and computational burdens. Sufficient experiments on embedded or edge devices should also be conducted to verify effectiveness in different environments.