Article

DyFish-DETR: Underwater Fish Image Recognition Based on Detection Transformer

School of Computer Science and Technology, Guangdong University of Technology, No.100, Outer Ring West Road, Guangzhou University Town, Xiaoguwei Street, Panyu District, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(6), 864; https://doi.org/10.3390/jmse12060864
Submission received: 19 April 2024 / Revised: 20 May 2024 / Accepted: 21 May 2024 / Published: 22 May 2024
(This article belongs to the Section Physical Oceanography)

Abstract

Due to the complexity of underwater environments and the lack of training samples, the application of target detection algorithms to the underwater environment has yet to provide satisfactory results. It is crucial to design specialized underwater target recognition algorithms for different underwater tasks. To achieve this goal, we created a dataset of freshwater fish captured from multiple angles and under multiple lighting conditions, aiming to improve underwater target detection of freshwater fish in natural environments. We propose a method suitable for underwater target detection, called DyFish-DETR (Dynamic Fish Detection with Transformers). In DyFish-DETR, we propose DyFishNet (Dynamic Fish Net) to better extract fish body texture features, and we design a Slim Hybrid Encoder to fuse fish body feature information. Ablation experiments show that DyFishNet effectively improves the mean Average Precision (mAP) of detection, that the Slim Hybrid Encoder effectively improves Frames Per Second (FPS), and that both reduce model parameters and Floating Point Operations (FLOPs). On our proposed freshwater fish dataset, DyFish-DETR achieves a mAP of 96.6%. The benchmarking results show that the Average Precision (AP) and Average Recall (AR) of DyFish-DETR are higher than those of several state-of-the-art methods. Additionally, DyFish-DETR achieves 99%, 98.8%, and 83.2% mAP on three other underwater datasets, respectively.

1. Introduction

The application of computer vision technology for exploring the hitherto uncharted underwater realm constitutes one of the most dynamically advancing research frontiers [1,2]. Fish detection in underwater object detection tasks has been a focal point of attention [3]. Ensuring efficient and accurate fish detection is crucial for aquaculture. The task of fish detection faces challenges as different fish species exhibit high similarity in their features, and fish, being non-rigid objects, can twist their bodies while swimming, making detection difficult. Limited training samples make it challenging to develop an automated fish classification system that can operate on underwater images with various backgrounds and lighting conditions.
In recent years, researchers have mainly utilized deep learning-based methods for fish detection tasks [4]. Due to their ability to automatically extract features, Convolutional Neural Networks (CNNs) have the advantage of efficiently extracting features and performing multi-feature analysis [5]. Prasetyo et al. [6] proposed a new residual network Multi-Level Residual (MLR) fusion to incorporate high-level and low-level features of fish. Cai et al. [7] employed a method combining YOLOv3 and MobileNetv1 for fish detection. Raza et al. [8] improved YOLOv3 and utilized k-means clustering to enhance anchor boxes and improve the accuracy of fish detection. Wang et al. [9] introduced a deep learning method that combines SPP (Spatial Pyramid Pooling) and DenseNet (Dense neural network) for identifying species of freshwater fish images. To train an accurate fish recognition model from underwater images, Zhang et al. [10] presented a method called AdvFish. Differences in color and fish scales form distinct fish body texture features. Fish are non-rigid objects, and the twisting of the body when swimming results in various forms. Previous studies mainly focused on using advanced deep learning methods to fuse features of fish to enhance detection accuracy. However, prior research did not address the challenges in extracting texture features from fish bodies and the challenges posed by the twisting of the body during fish movement.
Underwater fish detection faces the challenge of limited labeled data [11,12]. Allken et al. [13] proposed a method based on real image simulation in deep visual images to expand training samples, achieving a 94% classification accuracy for cod, Atlantic herring, and Atlantic mackerel. Banan et al. [14] applied a VGG16 deep learning model pre-trained on the ImageNet dataset to fish species recognition, achieving an average classification accuracy of 100% for four species of Asian carps. The above studies mainly use deep learning techniques to expand training samples or address data imbalance issues. Previous research has mainly focused on fish detection in marine areas. Fish detection in natural freshwater environments faces even greater challenges. Very few researchers focus on the detection of freshwater fish, and training samples for freshwater fish are extremely scarce.
There are two main frameworks for deep learning-based object detection: two-stage and one-stage. In the two-stage framework, a region proposal stage is performed first, which involves identifying regions with a high probability of containing objects. These proposed regions are then input to a convolutional neural network (CNN) to detect objects within the regions. Examples of two-stage algorithms include R-CNN, R-FCN, Fast RCNN, Faster RCNN, and Mask RCNN. On the other hand, one-stage algorithms such as the YOLO series [15,16,17,18,19,20,21,22], SSD, and RetinaNet combine the region proposal process with object detection. These methods are known for their accuracy and speed.
Recently, models based on Transformers, such as ViT (Vision Transformer) [23], Swin Transformer [24], DETR (End-to-End Object Detection with Transformers) [25] and DINO (DETR with Improved deNoising anchOr boxes) [26], have shown significant advantages across various visual task domains. These models can capture long-range dependencies between objects, enabling Transformer-based detectors to achieve performance comparable to or better than the most fine-grained classical detectors. In the field of object detection, researchers have utilized encoder–decoder structures with Transformers to construct a series of Transformer-based object detection models.
In this paper, we focus on the task of detecting fish in freshwater environments, considering the texture features of fish bodies as well as challenges such as the distortion of the bodies when fish swim. We proposed an underwater target detection method that can be utilized for freshwater fish detection as well as various other underwater detection tasks. Our main contributions can be summarized as follows:
  • We construct a freshwater fish dataset. Using underwater cameras in natural underwater environments, we captured images and built an underwater fish dataset containing two freshwater species, tilapia and koi. The dataset records the activity of fish in their natural underwater environment.
  • We design a feature extraction network called DyFishNet (Dynamic Fish Net) to better extract fish body texture features and solve the problem of body distortion when fish are swimming. We design a Slim Hybrid Encoder to merge fish features and reduce computation.
  • We propose a method for underwater object detection, called DyFish-DETR (Dynamic Fish Detection with Transformers). DyFish-DETR consists of DyFishNet as the backbone and Slim Hybrid Encoder as the Neck. In the fish detection task, the experimental results show that DyFish-DETR has better accuracy compared with the baseline method while reducing the number of parameters.

2. Related Work

2.1. End-to-End Object Detection with Transformers

The DETR (Detection Transformer) model represents an innovative end-to-end approach to object detection, departing from conventional methodologies by eschewing the reliance on anchor boxes and the Non-Maximum Suppression (NMS) step. This paradigm shift stems from recasting object detection as a set prediction problem. By leveraging Transformer architectures, DETR integrates image features and class semantics via its encoder, subsequently employing a distinctive decoder to directly infer a set of bounding box coordinates alongside their associated class labels. This transformation not only attains remarkable performance in object detection tasks but also streamlines traditionally intricate procedures, thereby facilitating model training and enhancing interpretability. Consequently, DETR stands as a pivotal advancement that harmonizes efficiency and effectiveness in contemporary object detection frameworks.
First, DETR uses an encoder to encode the input image features into a series of feature vectors. Then, a decoder generates predictions for the target boxes, including their positions and classes. During the training phase, the model learns the interactions between different elements with the help of multi-head self-attention mechanisms and positional encoding, and predictions are assigned to ground-truth objects by optimizing a Hungarian (bipartite) matching loss. In the inference phase, the decoder outputs are used directly as the final detection results, without anchor generation or NMS.
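As a concrete illustration of this bipartite matching step, the following minimal sketch (PyTorch with SciPy's Hungarian solver) assigns predicted boxes to ground-truth boxes using a cost that combines the class probability and an L1 box distance. The cost weights and box format are illustrative assumptions, not the exact cost used by DETR or DyFish-DETR (which also includes a generalized IoU term).

```python
# Minimal sketch of DETR-style bipartite matching (illustrative weights, not the paper's exact cost).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_l1=5.0):
    """pred_logits: [N, C], pred_boxes: [N, 4], gt_labels: [M], gt_boxes: [M, 4]."""
    prob = pred_logits.softmax(-1)                      # class probabilities per query
    cost_cls = -prob[:, gt_labels]                      # [N, M]: low cost = high prob of the GT class
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # [N, M]: L1 distance between boxes
    cost = w_cls * cost_cls + w_l1 * cost_l1
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))  # each ground truth matched to one query

# Example: 5 object queries, 2 ground-truth fish of class 1 (random tensors for illustration).
matches = hungarian_match(torch.randn(5, 3), torch.rand(5, 4),
                          torch.tensor([1, 1]), torch.rand(2, 4))
print(matches)
```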
The DETR model has been applied to many downstream tasks. The combination of attention mechanisms with Transformers has been applied in video object detection tasks and has achieved good results [27]. Ickler et al. [28] discussed the feasibility of using the DETR model for volumetric medical object detection. The high computational cost of DETR limits its practical application. To address the high computational cost of DETR, the authors of [29] designed the RT-DETR (Real-Time DEtection TRansformer) as an efficient hybrid encoder that efficiently processes multi-scale features by decoupling intra-scale interactions and cross-scale fusion. They also proposed IoU-aware query selection to further improve performance by providing higher-quality initial object queries to the decoder.
Previous research [4,6,7,8,9] in underwater target detection has focused on employing one-stage detection algorithms, which have proven effective in achieving substantial detection accuracies. The exploration of DETR’s potential in this domain remains limited. Despite this, DETR has demonstrated superior detection performance compared to both conventional one-stage and two-stage approaches in selected applications, highlighting its prowess. Given this untapped potential, our work aims to stimulate inquiry into the application of DETR models within the realm of underwater target detection. Consequently, the DETR framework has been selected for the specific application of underwater fish detection, thereby contributing to the expansion of knowledge in this emerging frontier.

2.2. Convolution Module

Tubular structures are a crucial type of structure in various fields like clinical settings and nature, and their precise segmentation ensures the accuracy and efficiency of downstream tasks. Tubular structures exhibit delicate and slender local structural features along with complex and diverse global morphological features. Qi et al. [30] noted the slender and continuous characteristics of tubular structures and used this information to enhance perception simultaneously in three stages of neural networks: feature extraction, feature fusion, and loss constraints. They designed Dynamic Snake Convolution, a multi-perspective feature fusion strategy with continuous topological constraint loss. Experimental results have shown that the proposed Dynamic Snake Convolution provides better accuracy and continuity in tubular structure segmentation tasks.
The trunk, fins, and tail of a fish are rich in fish scales and fish bones. The colors of fish scales, fish bones, and fish body result in a surface texture of the fish formed by complex tubular structures, as shown in Figure 1. Additionally, fish are non-rigid objects, and the body distortions during swimming lead to richer variations in the body texture. Therefore, Dynamic Snake Convolution focuses on the slender and continuous characteristics of tubular structures, making it suitable for better feature extraction of fish.
The Transformer architecture in DETR requires a large amount of computational resources for training and inference, leading to high computational costs and making it unsuitable for resource-constrained environments. Object detection is an important downstream task in computer vision; accurate real-time detection typically relies on large models, whereas lightweight models built from a large number of depthwise separable convolution layers cannot achieve sufficient accuracy. GSConv (Guided Spatially-Sparse Convolution) [31] combines DWConv (Depthwise Separable Convolution) [32] and standard convolution to reduce the weight of the model while maintaining accuracy, striking a good balance between model accuracy and speed. The single aggregation module VoV-GSCSP (Visual-Object Validity-guided Cross Stage Partial Network) [31] is built on GSConv and regular bottleneck modules, and it reduces the complexity of computation and network structure while retaining sufficient accuracy.
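To make the structure concrete, the following is a minimal PyTorch sketch of a GSConv-style block as we read [31]: a standard convolution and a depthwise convolution each produce half of the output channels, which are concatenated and channel-shuffled. The kernel sizes, activation function, and shuffle pattern are our assumptions rather than the reference implementation.

```python
# Simplified sketch of a GSConv-style block: dense conv branch + cheap depthwise branch,
# concatenation, then channel shuffle. Kernel sizes and SiLU activation are assumptions.
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(              # dense (standard) convolution branch
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(            # depthwise branch applied to the dense output
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.conv(x)
        y2 = self.dwconv(y1)
        y = torch.cat((y1, y2), dim=1)          # [B, c_out, H, W]
        b, c, h, w = y.shape                    # channel shuffle: interleave the two branches
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 64, 80, 80)
print(GSConvSketch(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```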
SSFF (Scale Sequence Feature Fusion module) [33] is a novel scale sequence feature fusion method that combines the high-dimensional information of deep feature maps with the detailed information of shallow feature maps more effectively. In this process, the image size changes during downsampling, but features that are invariant to scale remain unchanged. Scale space is built along the scale axis of the image, representing not just one scale but various scale ranges that the target can have. Scale denotes the details of an image. A blurry image may lose details, but the structural features of the image can be preserved. The tubular texture on the surface of fish is usually fine and complex, influenced by factors such as the distance of the capture and the clarity of the water. SSFF can enhance the network’s ability to extract multi-scale information. This advantage of SSFF is particularly useful for detecting fine tubular textures on the body surface of fish in underwater fish detection tasks.
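The sketch below gives one plausible reading of the scale-sequence idea in [33]: feature maps from several levels are projected to a common channel width, resized to the finest resolution, stacked along a new scale axis, and fused with a 3D convolution. The channel widths and the 3D kernel shape are illustrative assumptions, not the published configuration.

```python
# Hedged sketch of scale-sequence feature fusion: project, resize, stack along a scale axis, fuse in 3D.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSFFSketch(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), c_out=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, c_out, 1) for c in in_channels)
        # depth kernel of 3 collapses the scale axis of length 3 to 1
        self.fuse = nn.Conv3d(c_out, c_out, kernel_size=(3, 3, 3), padding=(0, 1, 1))

    def forward(self, feats):  # feats: list of [B, C_i, H_i, W_i], finest resolution first
        h, w = feats[0].shape[-2:]
        scales = [F.interpolate(p(f), size=(h, w), mode='nearest')
                  for p, f in zip(self.proj, feats)]
        seq = torch.stack(scales, dim=2)     # [B, c_out, S=3, H, W]: the "scale sequence"
        return self.fuse(seq).squeeze(2)     # [B, c_out, H, W]

feats = [torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20)]
print(SSFFSketch()(feats).shape)  # torch.Size([1, 256, 80, 80])
```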

2.3. Attention Mechanism

Many researchers choose to combine attention mechanisms with neural networks to improve the detection accuracy of underwater targets [34,35]. Sun et al. [36] attempted to use the Swin Transformer to design a new network as the backbone for underwater target detection, achieving performance similar to the cascade R-CNN with the ResNeXt101 backbone.
The attention mechanisms commonly used in computer vision are traditionally categorized into three types: spatial attention, channel attention, and hybrid attention. In addition to these common attention mechanisms in computer vision, Transformers utilize self-attention mechanisms that can effectively capture global information. The multi-head structure used in Transformers enables better fusion and expressive capabilities, allowing feature maps to be integrated across multiple spatial scales. Transformer mechanisms have been applied in computer vision, such as the Vision Transformer and Swin Transformer. However, Transformer modules require substantial computation, making them primarily suitable for large networks and challenging to use in mobile networks.
The general attention mechanism has high computational costs and a large memory footprint. Bi-Level Routing Attention [37] is a novel dynamic sparse attention mechanism that allocates computation efficiently in a dynamic, query-aware manner. Its core idea is to filter out the most irrelevant key-value pairs at a coarse region level: a region-level directed graph is first constructed and pruned, and fine-grained token-to-token attention is then applied only within the union of the routed regions.
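The following heavily simplified, single-head sketch illustrates this two-level routing: each query region first selects its top-k most related regions using mean-pooled region descriptors, and token-to-token attention is then computed only over the tokens of those routed regions. The region size, k, and the omission of multiple heads and relative position terms are simplifications of ours, not the BiFormer implementation.

```python
# Simplified bi-level routing attention: coarse region routing followed by fine token attention.
import torch
import torch.nn.functional as F

def bi_level_routing_attention(q, k, v, region=4, topk=2):
    """q, k, v: [B, H, W, C] with H and W divisible by `region`."""
    B, H, W, C = q.shape
    nH, nW = H // region, W // region
    nR, T = nH * nW, region * region                       # number of regions, tokens per region

    def to_regions(x):                                     # [B, H, W, C] -> [B, nR, T, C]
        x = x.view(B, nH, region, nW, region, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, nR, T, C)

    qr, kr, vr = to_regions(q), to_regions(k), to_regions(v)
    # coarse routing: region-level affinity from mean-pooled queries and keys
    a_region = qr.mean(2) @ kr.mean(2).transpose(-1, -2)   # [B, nR, nR]
    idx = a_region.topk(topk, dim=-1).indices              # [B, nR, topk] routed regions
    idx_exp = idx[..., None, None].expand(-1, -1, -1, T, C)
    k_sel = kr.unsqueeze(1).expand(-1, nR, -1, -1, -1).gather(2, idx_exp).flatten(2, 3)
    v_sel = vr.unsqueeze(1).expand(-1, nR, -1, -1, -1).gather(2, idx_exp).flatten(2, 3)
    # fine-grained token-to-token attention restricted to the routed regions
    attn = F.softmax(qr @ k_sel.transpose(-1, -2) / C ** 0.5, dim=-1)   # [B, nR, T, topk*T]
    out = attn @ v_sel                                                  # [B, nR, T, C]
    out = out.view(B, nH, nW, region, region, C).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, C)

x = torch.randn(2, 16, 16, 64)
print(bi_level_routing_attention(x, x, x).shape)  # torch.Size([2, 16, 16, 64])
```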
Adding Bi-Level Routing Attention to the DETR model can make the model capture global and local feature information, improve the performance of the model, and dynamically calculate, allocate, and perceive content.
In summary, there is a dearth of research exploring the use of DETR for underwater fish detection. The application of the DETR model to this domain faces challenges like significant computational demands and a cumbersome number of model parameters. Additionally, previous fish detection studies overlooked the complexities stemming from fish texture distortion. To address these limitations, DyFish-DETR is introduced as a potential solution.

3. Methodology

3.1. Freshwater Fish Dataset

We focus on underwater target detection of freshwater fish. In order to deal with the scarcity of fish data sets and make data sets with clear fish texture, we captured fish images using RGB cameras submerged in water at Zhujiang Park in Guangzhou, Guangdong Province, China. A dataset of freshwater fish in a real natural water environment was created. The fish in the images belong to two species, koi and tilapia.
Tilapia and koi are both freshwater fish. Tilapia is a common edible fish species. In the fisheries industry, tilapia is a commonly raised fish, with many farms breeding tilapia found in natural rivers and ponds. Koi, on the other hand, are typically kept for ornamental purposes and are often found in park ponds or lakes. There are some distinct differences in the external features of tilapia and koi. Tilapia have smooth, hard scales covering their bodies, with small scales that are closely packed and do not reflect light like those of koi. Tilapia typically have an oval or slightly flattened body shape. They are relatively small compared to adult koi. Tilapia have various fins, including dorsal fins, anal fins, tail fins, and pectoral fins. Their fins are relatively short and sturdy, usually lacking the broad and unique shapes seen in koi. Koi fish are known for their colorful hues and distinctive patterns, while tilapia tend to have simpler colors, often appearing in shades of gray, silver, or brown. Koi are larger and more rounded compared to tilapia.
The created freshwater fish dataset contains multiple angles and parts of fish photographed under different brightness levels, as shown in Figure 2. We can clearly find that due to the color of the fish body and fish scales, there are apparent tubular textures on the surface of the fish. The fish fins and tail also exhibit distinct striped tubular textures. When fish swim, they need to move their tail and fins to control the direction. From this, we can observe that besides the distortion of the fish body, the shape and texture of the fish fins and tail also undergo significant changes. To address the challenges mentioned above, we propose DyFishNet for better extraction of tubular texture features of fish and introduce DyFish-DETR to enhance the accuracy of underwater fish detection tasks.

3.2. Dynamic Fish Detection with Transformers (DyFish-DETR)

In underwater fish target detection tasks, challenges arise from the texture characteristics of fish bodies and the distortion of the body when fish swim. At the same time, the target detection algorithm DETR faces the challenge of high computational cost, making it unsuitable for use in resource-limited environments. Drawing on the RT-DETR model, we designed DyFish-DETR for underwater target detection tasks. DyFish-DETR mainly consists of two structures: a feature extraction network DyFishNet dedicated to extracting fish features and a Slim Hybrid Encoder used to fuse fish feature information and reduce computational costs. The proposed DyFish-DETR network architecture is shown in Figure 3.

3.2.1. Dynamic Fish Net (DyFishNet)

To address the challenges of extracting the tubular texture features of fish bodies and the torsion of the trunk during fish movement, we designed a feature extraction network called DyFishNet (Dynamic Fish Net). The structure of DyFishNet is shown in Figure 4. The CBR module refers to sequentially processing the feature maps using a convolutional layer, batch normalization layer, and ReLU activation function. The CB module includes only a convolutional layer and a batch normalization layer. The feature extraction process of DyFishNet is divided into 5 stages.
In Stage 1, an input image of size c × h × w undergoes a CBR layer with a 3 × 3 convolutional kernel to obtain a feature map of size c1 × h/2 × w/2. The obtained feature map is then processed through two CBR layers and a MaxPool layer to obtain a feature map of size c2 × h/4 × w/4.
In Stage 2, a BasicBlock with Avg = False is used. The BasicBlock typically consists of a series of convolutional layers, batch normalization layers, and activation functions. It can increase the depth of the network, significantly reduce the number of parameters in deep learning models without sacrificing too much performance, and better capture the complex features of fish bodies in the input image.
In Stage 3, a BasicBlock with Avg = True is used. An AvgPool layer is introduced in the BasicBlock to downsample the input feature map, helping the model capture global information in the input data and retain the most important features. A feature map of size c3 × h/8 × w/8 is obtained.
In Stage 4, we designed the DSCBlock specifically to address the challenges posed by the tubular texture on fish and the body torsion. We incorporated DySnakeConv (Dynamic Snake Convolution) into the BasicBlock. The feature map processed by DySnakeConv is stacked with the one processed by the AvgPool layer to obtain a more refined feature map of size c4 × h/16 × w/16.
Finally, in Stage 5, we designed an attention module to handle the feature map after DSCBlock processing. We included a BiLevelRoutingAttention in the BasicBlock. By leveraging attention mechanisms, the network can focus more on the tubular texture features and torsional changes in fish bodies in the feature map. The Bi-Level Routing Attention dynamically and query-awarely allocates computations effectively, allowing the network to obtain fine-grained feature maps at a lower cost.
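Putting the five stages together, the following structural sketch paraphrases the backbone described above. The channel widths, block internals, and the DySnakeConv and Bi-Level Routing Attention modules (replaced here by plain residual placeholders) are assumptions drawn from the text and from [30,37], not the authors' released code.

```python
# Structural sketch of the five DyFishNet stages; widths and block internals are illustrative.
import torch
import torch.nn as nn

def cbr(c_in, c_out, k=3, s=1):
    """CBR: Conv + BatchNorm + ReLU, as defined in Section 3.2.1."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class BasicBlockSketch(nn.Module):
    """Residual-style block; with avg=True an AvgPool branch downsamples the input (assumed)."""
    def __init__(self, c_in, c_out, avg=False):
        super().__init__()
        stride = 2 if avg else 1
        self.body = nn.Sequential(cbr(c_in, c_out, 3, stride), cbr(c_out, c_out, 3, 1))
        self.skip = (nn.Sequential(nn.AvgPool2d(2), nn.Conv2d(c_in, c_out, 1))
                     if avg else nn.Conv2d(c_in, c_out, 1))

    def forward(self, x):
        return self.body(x) + self.skip(x)

class DyFishNetSketch(nn.Module):
    def __init__(self, c=3, c1=32, c2=64, c3=128, c4=256):
        super().__init__()
        self.stage1 = nn.Sequential(cbr(c, c1, 3, 2),              # c1 x h/2 x w/2
                                    cbr(c1, c2), cbr(c2, c2),
                                    nn.MaxPool2d(2))               # c2 x h/4 x w/4
        self.stage2 = BasicBlockSketch(c2, c2, avg=False)
        self.stage3 = BasicBlockSketch(c2, c3, avg=True)           # c3 x h/8 x w/8
        self.stage4 = BasicBlockSketch(c3, c4, avg=True)           # placeholder for the DSCBlock, c4 x h/16 x w/16
        self.stage5 = BasicBlockSketch(c4, c4, avg=False)          # placeholder for the attention block

    def forward(self, x):
        x = self.stage2(self.stage1(x))
        p3 = self.stage3(x)
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        return p3, p4, p5    # multi-scale stage outputs (illustrative hand-off to the Neck)

print([t.shape for t in DyFishNetSketch()(torch.randn(1, 3, 640, 640))])
```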

3.2.2. Slim Hybrid Encoder

Although the introduction of the RT-DETR model has better met the real-time requirements of object detection, due to the use of the Transformer structure in RT-DETR, it requires a large number of computational resources for training and inference, making it unsuitable for applications in resource-constrained underwater environments. In order to further meet the real-time requirements of underwater detection, and to extract and fuse multi-scale features of fish, we propose the Slim Hybrid Encoder, as shown in Figure 5. When the feature map reaches the Neck, the channel dimension of the feature map reaches its maximum, while the width and height dimensions reach their minimum, eliminating the need for transformation. In the Slim Hybrid Encoder, we use GSConv to effectively handle concatenated feature maps from different levels, reduce redundancy, and maintain effective information transmission. This approach can reduce unnecessary computation and storage requirements, thereby achieving a lightweight model. Using VoV-GSCSP enables effective information interaction and aggregation between feature maps at different levels, achieving effective detection of multi-scale fish features. Additionally, the use of the VoV-GSCSP module reduces the complexity of computation and network structure while maintaining sufficient accuracy.
Due to the varying sizes of textures among different fish species, as well as differences in water clarity and distance, the textures and shapes of the same species of fish can also vary. In order to effectively capture features at different spatial scales, we introduced the SSFF module, which fuses feature maps from different levels or after being processed at different downsampling rates, enhancing the model’s ability to recognize objects of various scales. Additionally, in order for the Neck to effectively combine low-level, mid-level, and high-level features and improve the model’s understanding and analysis capabilities of complex scenes, we used the fusion module, the structure of which is shown in Figure 6.

4. Experiment and Results

4.1. Dataset Details

In the freshwater fish dataset, more than one fish appears in each photo, and the fish are captured at various angles and brightness levels, so that the dataset stays as close as possible to the natural living state of the fish. We used the open-source image annotation tool LabelImg to create the ground truth, as shown in Figure 7, selecting 4691 images for our dataset. Following a 7:1:2 ratio, we divided the images into training, validation, and test sets.
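A minimal sketch of such a 7:1:2 split is shown below; the directory layout, file extension, and random seed are hypothetical and not taken from the paper.

```python
# Hedged sketch of a 7:1:2 train/val/test split over the annotated images.
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("freshwater_fish/images").glob("*.jpg"))  # hypothetical dataset folder
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)
splits = {"train": images[:n_train],
          "val":   images[n_train:n_train + n_val],
          "test":  images[n_train + n_val:]}                    # remaining ~20%

for name, files in splits.items():
    print(name, len(files))
```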
To verify the performance of DyFish-DETR, we applied DyFish-DETR to three public underwater datasets separately. The Fish4Knowledge23 [38] dataset consists solely of underwater fish photos. This dataset was collected by the Taiwan Power Company, the Taiwan Ocean Research Institute, and Kenting National Park from 1 October 2010 to 30 September 2013, at underwater observatories in the Nanwan Strait, Orchid Island, and Houbihu Lake in Taiwan. The dataset includes images of 23 types of fish, totaling 27,370 fish images.
The Brackish [39] dataset was captured in the straits in northern Denmark and includes fish, crabs, and other marine creatures. The positions of the targets are annotated with bounding boxes. This dataset contains 14,518 images with annotations for 28,518 instances belonging to six categories. The brackish dataset mainly covers dim and fuzzy underwater scenes.
The RUOD [40] dataset covers general underwater scenes and encompasses a variety of underwater detection challenges. The target categories in this dataset include fish, divers, starfish, corals, sea turtles, sea urchins, sea cucumbers, scallops, squids, and jellyfish, making a total of 10 classes. In addition to regular training and test sets, the RUOD dataset also includes three environmental challenge test sets: fog, color deviation, and light interference. This allows for a comprehensive evaluation of detector performance.

4.2. Implementation Details

For model training and inference, we used Ubuntu 20.04.6 LTS (Canonical, London, UK), an AMD EPYC 7543P 32-core CPU (AMD, Santa Clara, CA, USA), and CUDA 12.0 (NVIDIA, Santa Clara, CA, USA). The graphics processing unit (GPU) was an NVIDIA RTX A5000 with 24 GB of memory (NVIDIA, Santa Clara, CA, USA). The network development framework was torch-2.0.1+cu117, and the integrated development environment (IDE) was PyCharm. We set the number of epochs to 200, the batch size to 8, and the image size to 640 × 640. The optimizer was Adam with Weight Decay Correction (AdamW), with an initial learning rate of 0.0001 and a weight decay of 0.0001.
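These hyperparameters correspond to a training setup along the following lines; the model and data here are placeholders, and only a single optimizer step is shown rather than the full 200-epoch loop used in the experiments.

```python
# Minimal configuration mirroring the stated settings: AdamW, lr = 1e-4, weight decay = 1e-4,
# 200 epochs, batch size 8, 640 x 640 input. Model and batch are placeholders.
import torch

model = torch.nn.Conv2d(3, 8, 3)                       # placeholder standing in for DyFish-DETR
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

EPOCHS, BATCH_SIZE, IMG_SIZE = 200, 8, 640
dummy_batch = torch.randn(BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE)

for step in range(1):                                  # one step shown; real training runs 200 epochs
    out = model(dummy_batch)
    loss = out.mean()                                  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```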

4.3. Evaluation Metrics

We chose Precision, Recall, mean Average Precision (mAP), Parameters, Floating Point Operations (FLOPs), and Frames Per Second (FPS) as the comparative metrics to evaluate detection performance and determine the strengths and weaknesses of each model. Using IoU = 0.5 as the standard, precision and recall can be calculated with Equations (1) and (2).
Precision = TP / (TP + FP)        (1)

Recall = TP / (TP + FN)        (2)
TP represents the number of true positive samples correctly identified as positive, FP represents the number of false positive samples incorrectly classified as positive, and FN represents the number of false negative samples incorrectly classified as negative. mAP50 is the area under the precision-recall (PR) curve at an IoU threshold of 0.5. For mAP50:95, the area under the PR curve is computed at ten IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05, and the results are averaged. Frames per second (FPS) represents the number of images detected by the model per second.
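As a small worked example of Equations (1) and (2), and of the area-under-the-PR-curve computation behind mAP, consider the following sketch; it uses a simple all-point interpolation rather than COCO's exact 101-point rule, and the counts are invented for illustration.

```python
# Worked example of precision, recall, and AP as the area under the PR curve.
import numpy as np

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(90, 10, 15))   # (0.9, 0.857...) for 90 TP, 10 FP, 15 FN at IoU = 0.5

def average_precision(recall, precision):
    """Area under the PR curve for one class at one IoU threshold (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # make precision monotonically non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(average_precision(np.array([0.2, 0.5, 0.9]), np.array([1.0, 0.8, 0.6])))  # 0.68
```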

4.4. Benchmarking Experiment

4.4.1. Comparison of Different Models in the Freshwater Fish Dataset

To demonstrate the detection performance of the DyFish-DETR model, we also conducted comparative experiments with other object detection algorithms. In particular, results on the freshwater fish dataset were compared against Sparse R-CNN, DINO, RT-DETR, and YOLOv8. Sparse R-CNN is a two-stage algorithm, YOLOv8 is a one-stage algorithm, while DINO and RT-DETR are both Transformer-based algorithms.
In order to comprehensively and rigorously compare the performance of different models in real scenarios, we used COCO evaluation metrics to compare the strengths and weaknesses among different models. Average Precision (AP) is calculated based on the Precision-Recall curve, incorporating both precision and recall. In COCO, mean Average Precision (mAP) is used to measure the overall model performance by averaging the AP values of all categories. Additionally, COCO introduces the concept of Intersection over Union (IoU) threshold, calculating AP for different IoU thresholds and reporting AP@[0.5:0.05:0.95], where AP values are computed for IoU from 0.5 to 0.95 in steps of 0.05 and averaged, reflecting the model performance under varying localization difficulties. Objects in COCO datasets vary in size, hence the evaluation metrics differentiate targets into small, medium, and large sizes, calculating corresponding AP values to ensure balanced performance of the model across different target sizes. COCO also provides Average Recall (AR) with a fixed number of detections to measure how well the model recalls correct targets when the number of output detection boxes is limited.
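These COCO metrics are typically computed with the pycocotools evaluator, roughly as sketched below; the annotation and detection result file paths are hypothetical, and the detections must already be exported in the standard COCO result format.

```python
# Sketch of computing COCO AP/AR (AP@[0.5:0.05:0.95], per-size AP, AR at fixed maxDets) with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_test.json")            # ground truth (hypothetical path)
coco_dt = coco_gt.loadRes("results/dyfish_detr_test.json")   # detections in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP/AR lines like those in Table 1 (-1 when a size bucket is empty)
```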
The experimental results are shown in Table 1. Since there are no small or medium-sized targets in the freshwater fish dataset, the values of these two metrics are both −1.000. It can be clearly observed that DyFish-DETR outperforms the other object detection models on all evaluation metrics. Over the IoU range from 0.50 to 0.95, DyFish-DETR reaches an AP of 76.6%, 1.8 percentage points higher than the second-best YOLOv8. With a maximum detection count (maxDets) of 100, DyFish-DETR reaches an AR of 84.7%, 1.5 percentage points higher than the second-best DINO.

4.4.2. Comparison of Experimental Results in Different Underwater Datasets

Table 2 shows the experimental results of DyFish-DETR on the proposed freshwater fish dataset. The precision-recall curve is illustrated in Figure 8. DyFish-DETR has achieved excellent performance on various metrics such as P, R, and mAP. DyFish-DETR has reached an mAP50 of 96.6% in the freshwater fish dataset. In the detection of tilapia, DyFish-DETR’s evaluation metrics are higher than those for koi. Specifically, DyFish-DETR has a 3.8% higher P in detecting tilapia compared to koi. Figure 9 shows the detection results of DyFish-DETR in the freshwater fish dataset.
To further evaluate the proposed DyFish-DETR model’s performance in underwater fish detection tasks, we conducted experiments on the Fish4Knowledge23 dataset. The experimental results are shown in Table 3. The precision-recall curve is illustrated in Figure 10. DyFish-DETR achieved a P of 98.3%, R of 98.4%, mAP50 of 99%, and mAP50:95 of 83.7% on the Fish4Knowledge23 dataset. The highest mAP50 value of 99.5% was achieved in the detection of various fish species such as Myripristis kuntee, Amphiprion clarkia, and Plectroglyphidodon dickii. In the detection of Zebrasoma scopas, the mAP50 value reached its lowest at 95.7%. Figure 11 shows the detection results of DyFish-DETR on Fish4Knowledge23. DyFish-DETR has demonstrated excellent performance in underwater fish species detection tasks.
We applied the DyFish-DETR model to the Brackish dataset to validate its performance in dim and blurry underwater scenes. The experimental results are presented in Table 4. The precision-recall curve is illustrated in Figure 12. In the Brackish dataset, DyFish-DETR achieved a P of 97.9%, R of 97.9%, and mAP50 of 98.8%, as well as an mAP50:95 of 81.7%. Among the Fish, Crab, and Starfish categories, DyFish-DETR reached its highest mAP50 value of 99.5%. Not only did DyFish-DETR demonstrate exceptional performance in fish detection, but it also proved capable of excellent detection of marine organisms such as crabs and starfish. Although it achieved its lowest mAP50 in detecting Small fish compared to the other categories, the model still registered an impressive 97.5% mAP50. Figure 13 shows the detection outcomes of DyFish-DETR on the Brackish dataset. DyFish-DETR excels in detecting objects in dark and blurry underwater environments and demonstrates equally commendable performance in the detection of other marine life forms.
To evaluate the performance of the DyFish-DETR model across various underwater scenarios, experiments were conducted on the RUOD dataset. The precision-recall curve is illustrated in Figure 14. The experimental results are summarized in Table 5, where DyFish-DETR achieved a P of 85.1%, R of 77.2%, and mAP50 of 83.2%, along with an mAP50:95 of 57.9% on the RUOD dataset. Notably, DyFish-DETR attained its highest mAP50 score of 97.8% in the detection of cuttlefish. Because the RUOD dataset does not provide a fine-grained categorization of fish species, the lowest mAP50 for DyFish-DETR, 68.6%, occurred in the ’fish’ category. Upon overall analysis, DyFish-DETR demonstrates commendable performance in underwater object detection tasks under diverse underwater conditions. Figure 15 shows the detection results of DyFish-DETR on the RUOD dataset.

4.5. Ablation Experiment

In this section, we progressively substantiate the effectiveness of DyFishNet and the Slim Hybrid Encoder through ablation studies conducted on the freshwater fish dataset. Our proposed DyFish-DETR is an improvement based on the RT-DETR model. This section validates the rationality of our approach through ablation experiments to ascertain whether the added modules indeed contribute positively. RT-DETR employs either ResNet (Residual Network) or HGNet (Hierarchical Graph Network) as its backbone network, with an Efficient Hybrid Encoder serving as its Neck. The experimental results are presented in Table 6.
The first four rows of Table 6 display the initial experimental outcomes of the base RT-DETR. Among these, the best results were obtained when Res2Net50 was used as the backbone network. Hence, we used Res2Net50 as the backbone and the Slim Hybrid Encoder as the Neck to verify the performance of the Slim Hybrid Encoder. While several evaluation metrics decreased slightly compared to the original RT-DETR (Res2Net50 + Efficient Hybrid Encoder), the parameter count was reduced by 9.2M and the computational load dropped to 100 GFLOPs. In addition, the FPS of RT-DETR (Res2Net50 + Slim Hybrid Encoder) reached the highest value of 120. When DyFishNet was adopted as the backbone for RT-DETR, the improved model had a parameter count of 24.0M and a computation requirement of merely 62 GFLOPs. Finally, combining DyFishNet and the Slim Hybrid Encoder yields our proposed DyFish-DETR model. DyFish-DETR delivered the most outstanding experimental results, achieving a peak mAP50 of 96.6% and an mAP50:95 of 77.2%. Although the FPS of DyFish-DETR is only 58, it is only slightly lower than that of RT-DETR (Res2Net101 + Efficient Hybrid Encoder). In addition, the parameter count of DyFish-DETR is only 23.7M, and its computational cost is only 61 GFLOPs. This demonstrates the efficiency and effectiveness of the DyFish-DETR model.
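For reference, one common way to obtain parameter counts and GFLOPs such as those in Table 6 is to count parameters directly and profile FLOPs with a third-party tool such as thop; this is our assumption about the procedure, as the paper does not state which tool was used, and the model below is only a placeholder.

```python
# Counting parameters and profiling FLOPs with the thop package (procedure assumed, model a placeholder).
import torch
from thop import profile

model = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1))   # placeholder model
x = torch.randn(1, 3, 640, 640)

params_m = sum(p.numel() for p in model.parameters()) / 1e6
flops, _ = profile(model, inputs=(x,), verbose=False)
print(f"{params_m:.1f}M parameters, {flops / 1e9:.1f} GFLOPs")
```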

5. Conclusions

This study focuses on improving the reliability and effectiveness of identifying underwater objects, specifically freshwater fish, in their natural habitats. We created an extensive image dataset of freshwater fish that documents the diverse textural characteristics of fish under different lighting conditions and viewpoints. We introduced DyFish-DETR, a detection method tailored for aquatic settings. It addresses the changing tubular textures and shapes of swimming fish through DyFishNet, a network specially designed for extracting fish features. To optimize efficiency, the Slim Hybrid Encoder is incorporated at the intermediate stage to combine information from multiple scales while minimizing computational demands.
Experimental results show that DyFish-DETR is suitable for underwater fish detection tasks. In our proposed freshwater fish dataset, DyFish-DETR achieves a mean average precision of 96.6%. When applied to three public underwater datasets, the mean average precision of DyFish-DETR is 99%, 98.8%, and 83.2% respectively. Furthermore, in comparison to several benchmark algorithms, DyFish-DETR exhibits advantages, with the empirical results indicating higher AP and AR metrics.
In the future, we will continue to expand the freshwater fish dataset, planning to include underwater images of freshwater fish from more species and ecological environments to enhance its diversity and comprehensiveness. In upcoming research, we will not only optimize fish detection models but also develop a series of related techniques, including high-precision real-time fish tracking, accurate and efficient fish quantity estimation, non-destructive fish weight estimation based on visual features, and deep learning for early automatic identification and diagnosis of fish diseases. Together, these will provide strong data and technical support for fisheries resource management, ecological conservation, and related scientific research.

Author Contributions

Conceptualization, Z.W. and Z.R.; methodology, Z.W. and Z.R.; validation, Z.R.; writing—original draft preparation, Z.R.; writing—review and editing, Z.W. and C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangzhou Science and Technology Plan Project 202201011835.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and results supporting the findings of this study can be obtained from the corresponding author upon reasonable request.

Acknowledgments

The authors greatly appreciate the constructive comments of the reviewers and editor.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, Y.; Zheng, W.; Du, X.; Yan, Z. Underwater Small Target Detection Based on YOLOX Combined with MobileViT and Double Coordinate Attention. J. Mar. Sci. Eng. 2023, 11, 1178. [Google Scholar] [CrossRef]
  2. Yan, J.; Zhou, Z.; Zhou, D.; Su, B.; Xuanyuan, Z.; Tang, J.; Lai, Y.; Chen, J.; Liang, W. Underwater Object Detection Algorithm Based on Attention Mechanism and Cross-Stage Partial Fast Spatial Pyramidal Pooling. Front. Mar. Sci. 2022, 9, 1056300. [Google Scholar] [CrossRef]
  3. Ahmed, S.; Aurpa, T.; Azad, M.A. Fish Disease Detection Using Image Based Machine Learning Technique in Aquaculture. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 5170–5182. [Google Scholar] [CrossRef]
  4. Badawi, U.A. Fish Classification Using Extraction of Appropriate Feature Set. Int. J. Electr. Comput. Eng. 2022, 12, 2488. [Google Scholar] [CrossRef]
  5. Shahi, T.B.; Xu, C.-Y.; Neupane, A.; Guo, W. Recent Advances in Crop Disease Detection Using UAV and Deep Learning Techniques. Remote Sens. 2023, 15, 2450. [Google Scholar] [CrossRef]
  6. Prasetyo, E.; Suciati, N.; Fatichah, C. Multi-Level Residual Network VGGNet for Fish Species Classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 5286–5295. [Google Scholar] [CrossRef]
  7. Cai, K.; Miao, X.; Wang, W.; Pang, H.; Liu, Y.; Song, J. A Modified YOLOv3 Model for Fish Detection Based on MobileNetv1 as Backbone. Aquac. Eng. 2020, 91, 102117. [Google Scholar] [CrossRef]
  8. Raza, K.; Hong, S. Fast and Accurate Fish Detection Design with Improved YOLO-v3 Model and Transfer Learning. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 7–16. [Google Scholar] [CrossRef]
  9. Wang, H.; Shi, Y.; Yue, Y.; Zhao, H. Study on Freshwater Fish Image Recognition Integrating SPP and DenseNet Network. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 13–16 October 2020. [Google Scholar]
  10. Zhang, Z.; Du, X.; Jin, L.; Wang, S.; Wang, L.; Liu, X. Large-Scale Underwater Fish Recognition via Deep Adversarial Learning. Knowl. Inf. Syst. 2022, 64, 353–379. [Google Scholar] [CrossRef]
  11. Xu, X.; Li, W.; Duan, Q. Transfer Learning and SE-ResNet152 Networks-Based for Small-Scale Unbalanced Fish Species Identification. Comput. Electron. Agric. 2021, 180, 105878. [Google Scholar] [CrossRef]
  12. Alaba, S.Y.; Nabi, M.M.; Shah, C.; Prior, J.; Campbell, M.D.; Wallace, F.; Ball, J.E.; Moorhead, R. Class-Aware Fish Species Recognition Using Deep Learning for an Imbalanced Dataset. Sensors 2022, 22, 8268. [Google Scholar] [CrossRef] [PubMed]
  13. Allken, V.; Handegard, N.O.; Rosen, S.; Schreyeck, T.; Mahiout, T.; Malde, K. Fish Species Identification Using a Convolutional Neural Network Trained on Synthetic Data. ICES J. Mar. Sci. 2019, 76, 342–349. [Google Scholar] [CrossRef]
  14. Banan, A.; Nasiri, A.; Taheri-Garavand, A. Deep Learning-Based Appearance Features Extraction for Automated Carp Species Identification. Aquac. Eng. 2020, 89, 102053. [Google Scholar] [CrossRef]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  16. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv: Computer Vision and Pattern Recognition. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  18. Mao, Q.-C.; Sun, H.-M.; Liu, Y.-B.; Jia, R.-S. Mini-YOLOv3: Real-Time Object Detector for Embedded Applications. IEEE Access 2019, 7, 133529–133538. [Google Scholar] [CrossRef]
  19. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  20. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  21. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  22. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021. [Google Scholar]
  25. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020, Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  26. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  27. Fujitake, M. Video Sparse Transformer With Attention-Guided Memory for Video Object Detection. IEEE Access 2022, 10, 65886–65900. [Google Scholar] [CrossRef]
  28. Ickler, M.K.; Baumgartner, M.; Roy, S.; Wald, T.; Maier-Hein, K.H. Taming Detection Transformers for Medical Object Detection. In BVM Workshop; Springer Fachmedien Wiesbaden: Wiesbaden, Germany, 2023; pp. 183–188. [Google Scholar]
  29. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  30. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023. [Google Scholar]
  31. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-Neck by GSConv: A Better Design Paradigm of Detector Architectures for Autonomous Vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  32. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  33. Kang, M.; Ting, C.-M.; Ting, F.; Phan, R.-W. ASF-YOLO: A Novel YOLO Model with Attentional Scale Sequence Fusion for Cell Instance Segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  34. Li, J.; Zhu, Y.; Chen, M.; Wang, Y.; Zhou, Z. Research on Underwater Small Target Detection Algorithm Based on Improved YOLOv3. In Proceedings of the 2022 16th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 21–24 October 2022. [Google Scholar]
  35. Zhai, X.; Wei, H.; He, Y.; Shang, Y.; Liu, C. Underwater Sea Cucumber Identification Based on Improved YOLOv5. Appl. Sci. 2022, 12, 9105. [Google Scholar] [CrossRef]
  36. Sun, Y.; Wang, X.; Yao, L.; Qi, S.; Yi, H. Underwater Object Detection with Swin Transformer. In Proceedings of the 2022 4th International Conference on Data Intelligence and Security (ICDIS), Shenzhen, China, 24–26 August 2022. [Google Scholar]
  37. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  38. Fisher, R.B.; Chen-Burger, Y.H.; Giordano, D.; Hardman, L.; Lin, F.P. (Eds.) Fish4Knowledge: Collecting and Analyzing Massive Coral Reef Fish Video Data; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  39. Pedersen, M.; Lehotský, D.; Nikolov, I.; Moeslund, T.B. BrackishMOT: The Brackish Multi-Object Tracking Dataset. In Scandinavian Conference on Image Analysis; Springer Nature: Cham, Switzerland, 2023. [Google Scholar]
  40. Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking General Underwater Object Detection: Datasets, Challenges, and Solutions. Neurocomputing 2023, 517, 243–256. [Google Scholar] [CrossRef]
Figure 1. Whole and partial pictures of fish in air and underwater scenes. Column (a) is an overall picture of the fish, showing the overall shape and body surface texture of the fish. Column (b) is a local detail map of the fish body, which has rich texture characteristics of the tube state. Column (c) is a partial detail of the fin with a strip-shaped tubular texture. Column (d) is a local detail map of a fish tail, which has rich texture and shape characteristics.
Figure 2. Overview of the proposed freshwater fish dataset, containing pictures of tilapia and koi at different angles and brightness levels. Columns (a,b) and columns (c,d) show fish of the two species at different brightness levels and angles.
Figure 3. Overview of DyFish-DETR. The detailed structures of DyFishNet and Slim Hybrid Encoder will be shown below.
Figure 4. The structure of DyFishNet.
Figure 5. The structure of the Slim Hybrid Encoder.
Figure 6. The structure of the fusion module.
Figure 7. Dataset labeling. Tilapia is in the purple box. Koi fish is in the green box.
Figure 8. The precision-recall curve of the DyFish-DETR in the freshwater fish dataset.
Figure 9. Detection results in Freshwater Fish dataset.
Figure 10. The precision-recall curve of the DyFish-DETR in Fish4Knowledge23 dataset.
Figure 11. Detection results in the Fish4Knowledge23 dataset.
Figure 12. The precision-recall curve of the DyFish-DETR in the Brackish dataset.
Figure 13. Detection results in the Brackish dataset.
Figure 14. The precision-recall curve of the DyFish-DETR in the RUOD dataset.
Figure 15. Detection results in the RUOD dataset.
Table 1. Results of comparative experiments.
Metric | Sparse R-CNN | DINO | RT-DETR | YOLOv8 | DyFish-DETR
AP (IoU=0.50:0.95, area=all, maxDets=100) | 64.1% | 73.9% | 74.6% | 74.8% | 76.6%
AP (IoU=0.50, area=all, maxDets=100) | 90.2% | 94.6% | 95.2% | 95.1% | 95.9%
AP (IoU=0.75, area=all, maxDets=100) | 73.3% | 86% | 86.7% | 87.3% | 87.8%
AP (IoU=0.50:0.95, area=small, maxDets=100) | −1.000 | −1.000 | −1.000 | −1.000 | −1.000
AP (IoU=0.50:0.95, area=medium, maxDets=100) | −1.000 | −1.000 | −1.000 | −1.000 | −1.000
AP (IoU=0.50:0.95, area=large, maxDets=100) | 64.1% | 73.9% | 74.6% | 74.8% | 76.6%
AR (IoU=0.50:0.95, area=all, maxDets=1) | 45.8% | 49.5% | 50.8% | 50.6% | 51.6%
AR (IoU=0.50:0.95, area=all, maxDets=10) | 75.8% | 80.6% | 80.1% | 80.7% | 81.5%
AR (IoU=0.50:0.95, area=all, maxDets=100) | 78.6% | 83.2% | 82.8% | 80.8% | 84.7%
AR (IoU=0.50:0.95, area=small, maxDets=100) | −1.000 | −1.000 | −1.000 | −1.000 | −1.000
AR (IoU=0.50:0.95, area=medium, maxDets=100) | −1.000 | −1.000 | −1.000 | −1.000 | −1.000
AR (IoU=0.50:0.95, area=large, maxDets=100) | 78.6% | 83.2% | 82.8% | 80.8% | 84.7%
Table 2. The results of DyFish-DETR in the freshwater fish dataset.
Class | P | R | mAP50 | mAP50:95
All | 92.2% | 92.2% | 96.6% | 77.2%
Koi | 90.3% | 91.9% | 96.4% | 75.6%
Tilapia | 94.1% | 92.5% | 96.7% | 78.7%
Table 3. The results of DyFish-DETR in the Fish4Knowledge23 dataset.
Class | P | R | mAP50 | mAP50:95
All | 98.3% | 98.4% | 99% | 83.7%
Dascyllus reticulatus | 99% | 89.4% | 98.1% | 83.9%
Myripristis kuntee | 99.8% | 100% | 99.5% | 84.4%
Amphiprion clarkia | 99.9% | 100% | 99.5% | 75.6%
Plectroglyphidodon dickii | 99.8% | 100% | 99.5% | 75.4%
Chromis chrysura | 100% | 98.1% | 98.5% | 79.3%
Lutjanus fulvus | 97.3% | 94.7% | 98.6% | 74.8%
Pomacentrus moluccensis | 99.4% | 100% | 99.5% | 86.7%
Abudefduf vaigiensis | 99.2% | 100% | 99.5% | 78.1%
Zebrasoma scopas | 99.3% | 90.0% | 95.7% | 79.2%
Chaetodon trifascialis | 99.2% | 100% | 99.5% | 84.5%
Acanthurus nigrofuscus | 92.7% | 95.7% | 99.2% | 85.9%
Siganus fuscescens | 96.9% | 100% | 99.5% | 89.5%
Canthigaster valentine | 99.2% | 100% | 99.5% | 75.9%
Balistapus undulates | 97.2% | 100% | 99.5% | 94.8%
Hemigymnus melapterus | 97.0% | 100% | 99.5% | 85.7%
Scolopsis bilineata | 93.5% | 100% | 99.5% | 81.1%
Ncoglyphidodon nigroris | 96.8% | 100% | 99.5% | 89.5%
Scaridae | 98.2% | 100% | 99.5% | 87.5%
Hemigymnus fasciatus | 99.4% | 100% | 99.5% | 86.3%
Chaetodon lunulatus | 100% | 99.3% | 99.5% | 81.8%
Pempheris vanicolensis | 100% | 100% | 99.5% | 99.5%
Neoniphon samara | 99.5% | 96.8% | 96.5% | 81.7%
Table 4. The results of DyFish-DETR in the Brackish dataset.
Class | P | R | mAP50 | mAP50:95
All | 97.9% | 97.9% | 98.8% | 81.7%
Fish | 99.4% | 97.8% | 99.5% | 84.2%
Small fish | 97.6% | 93.5% | 97.5% | 70.9%
Crab | 99.8% | 99.7% | 99.5% | 89.6%
Shrimp | 93.1% | 98.2% | 98.4% | 76.9%
Jellyfish | 97.9% | 98.3% | 98.4% | 72.3%
Starfish | 99.9% | 99.8% | 99.5% | 96.4%
Table 5. The results of DyFish-DETR in the RUOD dataset.
Class | P | R | mAP50 | mAP50:95
All | 85.1% | 77.2% | 83.2% | 57.9%
Holothurian | 84.6% | 66.2% | 75.2% | 41.1%
Echinus | 86.6% | 82.8% | 88.9% | 48.6%
Scallop | 84.0% | 69.3% | 79.4% | 49.1%
Starfish | 85.6% | 84.8% | 88.6% | 55.6%
Fish | 80.4% | 59.6% | 68.6% | 46.8%
Corals | 74.4% | 65.5% | 72.0% | 50.4%
Diver | 91.4% | 89.9% | 94.1% | 71.2%
Cuttlefish | 96.2% | 95.4% | 97.8% | 82.8%
Turtle | 95.2% | 90.3% | 93.8% | 77.3%
Jellyfish | 72.5% | 68.5% | 73.6% | 55.9%
Table 6. The results of ablation experiments.
Method | P | R | mAP50 | mAP50:95 | Parameters | GFLOPs | FPS
RT-DETR (Res2Net50 + Efficient Hybrid Encoder) | 92.7% | 91.6% | 96.4% | 77.2% | 42.0M | 130 | 115
RT-DETR (Res2Net101 + Efficient Hybrid Encoder) | 92.0% | 92.5% | 96.2% | 77.0% | 74.7M | 247 | 60
RT-DETR (HGNetv2-L + Efficient Hybrid Encoder) | 93.2% | 89.5% | 95.2% | 75.6% | 32.0M | 103 | 71
RT-DETR (HGNetv2-X + Efficient Hybrid Encoder) | 94.1% | 88.7% | 95.7% | 76.4% | 65.5M | 222 | 74
RT-DETR (Res2Net50 + Slim Hybrid Encoder) | 90.9% | 91.7% | 95.9% | 76.0% | 32.8M | 100 | 120
RT-DETR (DyFishNet + Efficient Hybrid Encoder) | 91.7% | 92.5% | 96.3% | 76.3% | 24.0M | 62 | 62
DyFish-DETR (DyFishNet + Slim Hybrid Encoder) | 92.2% | 92.2% | 96.6% | 77.2% | 23.7M | 61 | 58