Article

YOLOv8-TF: Transformer-Enhanced YOLOv8 for Underwater Fish Species Recognition with Class Imbalance Handling

1 Northern Gulf Institute, Mississippi State University, Starkville, MS 39759, USA
2 The School of Engineering and Applied Sciences, Western Kentucky University, Bowling Green, KY 42101, USA
3 Department of Electrical and Computer Engineering, James Worth Bagley College of Engineering, Mississippi State University, Starkville, MS 39762, USA
4 National Marine Fisheries Services, Southeast Fisheries Science Center, 3209 Frederic Street, Pascagoula, MS 39567, USA
5 NOAA Fisheries, 4700 Avenue U, Galveston, TX 77551, USA
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(6), 1846; https://doi.org/10.3390/s25061846
Submission received: 19 February 2025 / Revised: 8 March 2025 / Accepted: 14 March 2025 / Published: 16 March 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

In video-based fish surveys, species recognition plays a vital role in stock assessments, ecosystem analysis, production management, and protection of endangered species. However, implementing fish species detection algorithms in underwater environments presents significant challenges due to factors such as varying lighting conditions, water turbidity, and the diverse appearances of fish species. In this work, a transformer-enhanced YOLOv8 (YOLOv8-TF) is proposed for underwater fish species recognition. YOLOv8-TF enhances the performance of YOLOv8 by adjusting depth scales, incorporating a transformer block into the backbone and neck, and introducing a class-aware loss function to address class imbalance in the dataset. The class-aware loss considers the count of instances within each species and assigns a higher weight to species with fewer instances. This approach enables fish species recognition through object detection, encompassing the classification of each fish species and localization to estimate their position and size within an image. Experiments were conducted using the 2021 Southeast Area Monitoring and Assessment Program (SEAMAPD21) dataset, a detailed and extensive reef fish dataset from the Gulf of Mexico. The experimental results on SEAMAPD21 demonstrate that the YOLOv8-TF model, with a mean Average Precision (mAP) of 87.9% at an IoU of 0.5 (mAP0.5) and 61.2% averaged over IoU 0.5 to 0.95 (mAP0.5-0.95), achieves better detection results for underwater fish species recognition than state-of-the-art YOLO models. Additionally, experimental results on publicly available datasets, such as the Pascal VOC and MS COCO datasets, demonstrate that the model outperforms existing approaches.

1. Introduction

Accurate fish species identification holds significant importance in fisheries management and environmental monitoring. It plays a vital role in various aspects, including the identification of endangered species, optimizing harvesting timing and size, ecosystem monitoring, and establishing efficient production management systems [1,2,3]. Accurate fish species recognition becomes even more crucial due to legal constraints on fishing methods, particularly for threatened or endangered species. Traditional approaches for identifying fish species often rely on manual, labor-intensive processes that consume substantial time. Furthermore, hook-based sampling can disturb the natural behavior of fish. Conventional methods present difficulties in ensuring the availability of reliable data for effectively managing sustainability in fisheries, monitoring federal fisheries, evaluating fish populations, and distinguishing between various species of fish. In contrast, leveraging computer vision, deep learning (DL) technology, and annotation tools offers the potential for robust models in the identification of fish species, leading to cost and time savings while enhancing the accuracy of identification [4,5,6,7].
Machine vision solutions have the potential to replace manual counting and species identification by offering comparable or even superior precision. Several methods for fish detection, including lidar [8,9], sonar [10], and RGB imaging [11], are available. When it comes to species identification in clear water, RGB imaging stands out as the favored choice. It enables a straightforward identification of fish using their color, texture, and geometry. Additionally, RGB imaging is not only cost-effective and lightweight, but is also environmentally friendly as it does not disrupt fish habitat. In earlier studies, the analysis of video frames [12] was performed on an individual basis to detect objects present in each frame. Recently, multiple camera systems have been employed to monitor fish stocks and assess the sustainability of marine ecosystems. Nevertheless, these systems face a common challenge of time-consuming manual processing bottlenecks [13]. Recent advancements in deep learning have demonstrated significant potential for extracting valuable information about marine ecology through object detection and classification [14,15,16]. Though RGB imaging offers advantages in clear water conditions, its effectiveness is significantly hindered by the challenges posed by underwater environments, such as low light and turbid conditions [17,18], occlusion [19], low-resolution images and videos, and the inherent difficulty in distinguishing fish from the background [20]. These factors can lead to blurry images and noisy data, making it difficult to accurately detect and classify objects, particularly smaller fish species. To address these issues, innovative approaches such as the SWIPENET+CMA framework have been proposed in [21]. By leveraging advanced DL techniques, this framework aims to improve the accuracy of object detection and classification in challenging underwater conditions. While optical cameras struggle in underwater environments characterized by poor lighting and turbidity, sonar cameras [22] offer a more reliable solution. Sonar systems can cover a wider area and penetrate deeper into the water column, making them suitable for large-scale underwater surveys. The integration of machine learning (ML) techniques with sonar data [23] has enabled the development of automated systems for analyzing sonar data, facilitating precise fish detection and classification [24]. However, the high cost of high-resolution 3D sonar cameras remains a significant barrier to their widespread adoption. The dynamic movement of fish introduces complexities such as shape variations and object overlaps, which pose challenges in accurately identifying and detecting underwater fish species. Another significant challenge for marine applications is marine snow: composed of drifting organic and inorganic particles, it significantly degrades underwater image clarity. Solutions targeting this issue include end-to-end dual-channel frameworks that address multiple degradations including marine snow removal [25], deep learning approaches fusing spatial and Fourier domain information for effective marine snow restoration [26], and the introduction of a dedicated benchmarking dataset to facilitate further research in this area [27]. In addition, DL, a prominent field within computer vision, has found widespread application in addressing diverse challenges such as detection, localization, estimation, and classification [4,28,29,30,31].
Numerous ML and DL algorithms have been created for the purpose of categorizing fish species. For example, Jäger et al. [32] utilized the AlexNet architecture for feature extraction and employed a multiclass support vector machine (SVM) for classification. Similarly, hierarchical features combined with SVM have been utilized for fish classification. Carion et al. [33] introduced end-to-end object detection with transformers (DETR), with experiments conducted on the challenging MS COCO object detection dataset. Fang et al. [34] presented a transformer-based You Only Look Once (YOLO) model for object detection on the public MS COCO dataset.
The You Only Look Once (YOLO) model [35,36,37,38], widely used for object detection, is effective in identifying fish, especially in videos. YOLO operates as a single-shot detection model [39], processing the entire image or video frame in one pass to predict object locations and classes. The YOLO model is specifically designed for efficient processing, making it ideal for real-time use. With extensive training on large datasets, it can achieve impressive levels of accuracy. The YOLO model utilizes a convolutional neural network (CNN) architecture and is trained using a set of labeled images. Through training, the model becomes adept at predicting bounding boxes that represent the location of fish in an image or video, along with the associated class probabilities (i.e., the fish species). Several loss functions are employed during training to penalize inaccurate predictions and encourage continuous improvement in the model’s prediction capabilities over time.
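As a concrete illustration of this single-shot workflow, the sketch below runs an off-the-shelf YOLOv8 detector from the Ultralytics package [65] on one frame and reads back the predicted boxes, class indices, and confidence scores; the weights file and image path are placeholders rather than the authors' setup.

```python
# Minimal single-shot detection sketch using the off-the-shelf Ultralytics YOLOv8 API [65].
# The weights file and image path are placeholders, not the authors' configuration.
from ultralytics import YOLO

model = YOLO("yolov8l.pt")                             # pretrained detection weights
results = model.predict("reef_frame.jpg", conf=0.25)   # one forward pass per frame

for r in results:
    for box, cls_id, score in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        x1, y1, x2, y2 = box.tolist()
        print(f"class={int(cls_id)} score={float(score):.2f} "
              f"box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```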
YOLO provides a unified framework for object detection, allowing it to identify multiple objects in one traversal of the network. This is advantageous for tasks that require detecting and recognizing multiple objects. YOLO offers the advantage of efficient processing by being highly parallelizable, allowing it to effectively utilize multiple GPUs simultaneously. However, due to its single-stage detection network architecture, YOLO sacrifices some precision compared to two-stage networks like Faster R-CNN [40]. Underwater images often contain numerous small objects, which are typically overlooked by the pooling layers in CNNs due to their size, making detection and recognition difficult. Moreover, similar to other object detection models, YOLO encounters challenges in accurately detecting fish in situations of low lighting or partial obstructions. Despite these limitations, YOLO continues to be widely favored and proven effective for tasks involving fish classification and detection [41]. The YOLOv5 technique for object detection was initially introduced by Jocher et al., 2021 [42], who utilized it on the public MS COCO dataset [43]. Subsequently, Jung et al., 2022 [44] implemented YOLOv5 specifically for object detection in drone images. Wang et al. [45] presented an optimized nighttime nail detection approach with an advanced YOLOv5 model for improved road safety. In our previous work [5], we enhanced YOLOv5 by modifying its backbone to improve fish species recognition. In addition, Varghese et al., 2024 [46] introduced a novel YOLOv8 model for object detection on the public MS COCO dataset. Compared to the aforementioned algorithms, YOLOv8 represents a significant advancement in the YOLO series, known for its high accuracy, fast processing speed, and capacity to detect multiple objects within an image [47]. Following that, Li et al. [48] introduced a modified YOLOv8-based technique for recognizing objects in images collected from the air. Ansah et al. [49] proposed SB-YOLO-V8, a multilayered deep learning framework for real-time human detection. Bi et al. [50] introduced a streamlined detection network based on an improved YOLOv8 model, tailored for identifying small targets in drone-captured aerial images. The modifications aim to enhance detection accuracy for objects observed from high altitudes. None of the aforementioned techniques have been adequately implemented for recognizing fish species in underwater environments with highly imbalanced species distributions. In this study, we propose YOLOv8-TF, a transformer-enhanced YOLOv8 with class imbalance handling, to better address the challenges posed by class imbalance in underwater fish species recognition and to further improve detection effectiveness in these complex environments.
The primary contributions of this research are as follows:
  • To enhance fish species recognition, we made modifications to the depth scale of various layers within the backbone of the YOLOv8 model. These modifications resulted in improved performance and accuracy in identifying fish species.
  • We incorporated the transformer block into both the backbone and neck networks of the YOLOv8-based approach. This enhancement boosts the model’s capabilities by enabling it to capture more contextual and global information, thereby improving its performance.
  • To tackle the challenges posed by a highly imbalanced dataset, we introduced a class-aware loss function along with Wise-IoUv3 [48]. This approach enhances both classification and localization accuracy, leading to more precise and reliable results.

2. Related Works

The rapid rise in popularity of DL techniques has led to their successful implementation across various industries, including the fishery industry [51]. DL techniques have greatly influenced information and data processing in the realm of smart fish farming, presenting both new opportunities and challenges. Within aquaculture, DL is extensively utilized for various tasks, including live fish identification [52], species classification [53], behavioral analysis [54], feeding selection [55], size or biomass estimation [56], and water quality forecasting [57]. Several DL approaches exist that are particularly well-suited for fish datasets, allowing for the development of tailored DL models specific to each task.
The YOLO algorithm has gained significant popularity in fish identification and detection tasks. A notable early study conducted by Xu et al. (2018) [29] employed YOLO to develop a model capable of identifying fish in underwater videos. The study utilized three diverse datasets captured at water power sites to train the YOLO model specifically for this purpose. In a recent study, the Southeast Area Monitoring and Assessment Program Dataset (SEAMAPD21) was utilized to examine the effectiveness of YOLOv4 in the identification and categorization of different fish species. The results showcased the successful localization and identification of various fish species within the SEAMAPD21 dataset [58] using the YOLOv4 model. Alaba et al. [4] employed two feature extraction networks, namely MobileNetv3-large [59] and VGG16 [60], to extract relevant features from underwater images. They then utilized a single-shot multi-box detector (SSD) to classify fish species and detect the precise location of the fish within the images from the SEAMAPD21 dataset. Wang et al. [61] introduced a novel attention mechanism, Channel and Spatial Fusion Attention (CSFA), which integrates channel and spatial attention into the YOLOv5 architecture. By combining these attention mechanisms, the network can effectively focus on both the salient features and spatial context of detected objects, leading to improved detection performance. To overcome challenges like turbidity and low lighting in underwater object detection, Pachaiyappan et al. [62] incorporated attention mechanisms, transformers, and diffusion models. These approaches enhance image quality, enabling more accurate detection and classification using an advanced imaging technique YOLO network (AIT-YOLOv7), supporting Sustainable Development Goal (SDG) 14: Life Below Water, which aims to conserve and sustainably use the oceans, seas, and marine resources.
Sung et al. [63] introduced a real-time fish detection model based on the YOLO algorithm, specifically designed for underwater vision. The effectiveness and precision of the proposed method were evaluated using real fish video images. The CNN model achieved an impressive classification accuracy of 93%, a predicted bounding box intersection over union of 0.634, and a fish detection rate of 16.7 frames per second. Notably, it outperformed a fish detector that utilized a sliding window algorithm and a classifier trained with histogram of oriented gradients features and an SVM. Jalal et al. [64] presented a two-step DL approach for identifying and classifying temperate fishes. In the initial step, they employed the YOLO object detection method to detect each fish in the image, disregarding its species or gender. In the subsequent step, a CNN with the Squeeze-and-Excitation structure was utilized to detect each individual fish in the image without any preliminary filtering. To address the limited samples issue, transfer learning was employed to improve the overall classification accuracy.
Collectively, these studies highlight the potential of DL, specifically the YOLO-based approach, in fish identification and detection tasks. They showcase promising outcomes in terms of accuracy and processing speed. However, challenges remain in addressing issues related to the complex and diverse underwater environment, as well as improving the models’ ability to generalize and handle new and unseen fish species effectively.

3. Proposed Approach

3.1. A YOLOv8 Technique for the Recognition of Fish Species in Underwater Environments

In YOLOv8, both the conventional convolution module and the C2f module [48,65] were employed to accomplish effective feature extraction and image downsampling, resulting in high-quality outputs, as shown in Figure 1. The C2f module integrates high-level features with contextual data to enhance detection performance. To gain richer gradient flow information while maintaining low weight, this module expands the gradient branch by reducing one conventional convolutional layer and fully utilizing the bottleneck module. YOLOv8 offers a range of models, including YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv8l, each with varying architectural complexities and performance characteristics. While the underlying principle remains consistent, the classification of these models is based on their respective memory usage. For YOLOv8l, the backbone consists of C2f modules and conventional convolutional modules arranged as 3×C2f, 6×C2f, 6×C2f, and 3×C2f. Figure 2 shows a detailed diagram of the C2f module. Within the YOLOv8 architecture, feature maps are segregated into several distinct scale features in descending order. These features are denoted as C2f layers in the backbone, and the FPN (Feature Pyramid Network) [66] structure and the PAN (Path Aggregation Network) [67] structure in the neck. The PAN-FPN structure employed in the YOLOv8 model serves as a complement to the conventional FPN. It utilizes a top-down approach to effectively transfer deep semantic features. In the final stage, the head component utilizes the image features obtained from the neck module to generate predictions, involving steps for both class and box predictions.
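The following is a simplified PyTorch sketch of a C2f-style module as described above: the input is split into two branches, one branch passes through a stack of bottlenecks, and every intermediate output is concatenated to enrich the gradient flow. Layer widths and kernel choices are illustrative assumptions, not the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv2d + BatchNorm + SiLU, the basic convolution unit used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual 3x3 + 3x3 bottleneck used inside C2f."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=3)
        self.cv2 = ConvBNSiLU(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """Simplified C2f block: split the features, run one branch through n bottlenecks,
    and concatenate every intermediate output for richer gradient flow."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, k=1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # two parallel branches
        for blk in self.m:
            y.append(blk(y[-1]))                # keep every bottleneck output
        return self.cv2(torch.cat(y, dim=1))    # fuse all branches
```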

3.2. YOLOv8-TF: Transformer-Enhanced YOLOv8 for Underwater Fish Species Recognition with Class Imbalance Handling

The YOLOv8 backbone consists of C2f layers [48,65]. The C2f module establishes connections across layers through additional branches and adjusts varying channel numbers for different scale models. This design not only ensures a lightweight model but also captures richer gradient flow information. For the improvement in the backbone, we have modified the width and depth scales of the C2f layers. As shown in Figure 1, the C2f modules for the YOLOv8-TF model are arranged as 6×C2f, 12×C2f, 12×C2f, and 6×C2f.
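In configuration terms, the change described above amounts to doubling the per-stage repeat counts of the backbone C2f blocks, from (3, 6, 6, 3) in YOLOv8l to (6, 12, 12, 6) in YOLOv8-TF. The short sketch below captures this, reusing the C2f module sketched earlier; the stage channel widths are placeholder values.

```python
# Per-stage C2f repeat counts taken from the text: YOLOv8l uses (3, 6, 6, 3),
# while YOLOv8-TF doubles the depth scale to (6, 12, 12, 6).
C2F_REPEATS = {
    "yolov8l":   (3, 6, 6, 3),
    "yolov8_tf": (6, 12, 12, 6),
}

def build_backbone_stages(variant, channels=(128, 256, 512, 512)):
    """Builds the four backbone C2f stages (see the C2f sketch above) with the
    chosen depth scale; the channel widths here are placeholders."""
    return [C2f(c, c, n=n) for c, n in zip(channels, C2F_REPEATS[variant])]

stages = build_backbone_stages("yolov8_tf")
```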

3.2.1. Transformer Block

We have introduced a transformer (trans) block [5,69,70], as shown in Figure 3, in the backbone network of YOLOv8-TF, resulting in C2f and transformer (trans) modules arranged as 6×C2f, 12×C2f, 12×C2f, and 6×trans. For the improvements in the neck, C2f modules are replaced with transformer (trans) modules. The transformer layers help capture long-range dependencies and global contextual information, which enhances the model’s ability to detect fish species more accurately by effectively distinguishing subtle visual patterns across different scales. For the processing of 2D images in the trans layer, we flatten the spatial dimensions of the 2D feature map x ∈ R^(h×w×d) into a sequence x_p ∈ R^(hw×d), where (h, w) represents the original feature map resolution, d denotes the number of channels, and h × w functions as the effective input sequence length for the transformer layer. To imbue the attention operation with positional awareness, transformer-based architectures commonly employ position encoding. We employ standard learnable 1D positional embeddings through a linear layer to maintain positional information.
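A minimal PyTorch sketch of such a trans block is given below: it flattens the h × w grid into a sequence of length h·w, adds learnable 1D positional embeddings, and applies multi-head self-attention followed by an MLP before reshaping back to a feature map. The head count, MLP width, and pre-norm layout are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class TransBlock(nn.Module):
    """Illustrative transformer (trans) block: flatten the h x w grid into a sequence
    of length h*w, add learnable 1D positional embeddings, and apply multi-head
    self-attention plus an MLP before folding the sequence back into a feature map."""
    def __init__(self, d, num_heads=8, max_len=4096):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, d))   # learnable 1D positions
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                        # x: (B, d, h, w)
        b, d, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, h*w, d), sequence length h*w
        seq = seq + self.pos[:, : h * w, :]      # positional awareness
        q = self.norm1(seq)
        attn_out, _ = self.attn(q, q, q)
        seq = seq + attn_out                     # residual attention
        seq = seq + self.mlp(self.norm2(seq))    # residual MLP
        return seq.transpose(1, 2).reshape(b, d, h, w)
```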

3.2.2. Wise-IoU Loss Function

YOLOv8 employs the anchor-free concept such that the Binary Cross Entropy (BCE) loss is used for the classification loss, and Bounding Box Regression (BBR) loss and distribution focal (DF) loss are used for the regression portion [48,65].
L_{loss} = \alpha_1 L_{BC} + \alpha_2 L_{DF} + \alpha_3 L_{BBR}
where L_BC denotes the classification loss, L_BBR represents the regression loss, and L_DF defines the distribution focal loss, with regularization coefficients α1, α2, and α3.
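For concreteness, the weighted combination above can be written as a small helper; the default coefficients are the values reported later in Section 4.1, and the individual loss terms are assumed to be computed elsewhere (a sketch, not the Ultralytics implementation).

```python
def total_loss(l_bce, l_dfl, l_bbr, alpha1=0.5, alpha2=1.5, alpha3=7.5):
    """Weighted sum of the three YOLOv8 loss terms. The default coefficients are the
    values reported in Section 4.1; the individual terms are assumed precomputed."""
    return alpha1 * l_bce + alpha2 * l_dfl + alpha3 * l_bbr
```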
Wise Intersection over Union (WIoU) [71,72] introduces a dynamic, non-monotonic focus mechanism that evaluates anchor box quality through dissimilarity rather than traditional IoU. Then, the cost function can be estimated as:
L_{loss} = \alpha_1 L_{BC} + \alpha_2 L_{DF} + L_{WIoUv3}
where L_WIoUv3 denotes the Wise-IoU loss term, which can be estimated as below:
L_{WIoUv3} = \left(1 - \frac{|bbox \cap wbox|}{|bbox \cup wbox|}\right) \exp\left(\frac{(x_{bbox} - x_{wbox})^2 + (y_{bbox} - y_{wbox})^2}{w^2 + h^2}\right) \kappa
\kappa = \frac{\theta}{\tau \, \beta^{\theta - \tau}}
The L_WIoUv3 loss evaluates the quality of an anchor box by considering both the spatial overlap and the distance between the centers of the predicted and ground truth boxes. Unlike traditional Intersection over Union (IoU) metrics, which only measure overlap, Wise-IoU introduces a distance-based penalty for more insightful gradient gain allocation. In this formulation, x_bbox and y_bbox represent the center coordinates of the predicted anchor box, while x_wbox and y_wbox denote the center coordinates of the ground truth anchor box. The terms bbox and wbox correspond to the predicted and ground truth bounding boxes, respectively, with w and h representing the width and height of the reference prediction box. The term θ represents the abnormality degree of the prediction box, and β and τ are hyperparameters. By incorporating the center distance into the loss calculation, Wise-IoU dynamically adjusts gradients to prioritize anchor boxes closer to the ground truth, ultimately improving object detection performance.
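A minimal PyTorch sketch of this Wise-IoU-style penalty, following the reconstruction above, is shown below; the box format, the per-box abnormality degree θ, and the hyperparameter defaults are assumptions for illustration rather than the authors' implementation.

```python
import torch

def wiou_v3_loss(pred, gt, theta, beta=1.8, tau=3.0, eps=1e-7):
    """Illustrative sketch of the Wise-IoU-style loss described above. `pred` and `gt`
    are (N, 4) boxes in (x1, y1, x2, y2) format, `theta` is the per-box abnormality
    degree, and (beta, tau) are hyperparameters; this follows the reconstruction in
    Section 3.2.2 and is an assumption, not the authors' exact code."""
    # plain IoU between predicted and ground-truth boxes
    x1 = torch.maximum(pred[:, 0], gt[:, 0]); y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2]); y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # center-distance penalty normalised by the size (w, h) of the reference prediction box
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    penalty = torch.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2) / (w ** 2 + h ** 2 + eps))

    # non-monotonic gradient gain computed from the abnormality degree
    kappa = theta / (tau * beta ** (theta - tau))
    return (1.0 - iou) * penalty * kappa
```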

3.2.3. Class-Aware Loss Function

To improve the loss function, we have introduced class-aware (ca) regularization [4] terms in classification, regression, and distribution focal loss terms of the YOLOv8 model.
L_{loss} = \alpha_1 L_{BC}^{ca} + \alpha_2 L_{DF}^{ca} + L_{WIoUv3}^{ca}
The proposed class-aware classification, distribution focal, and regression loss terms can be defined as:
L_{BC}^{ca} = \left(1 - \frac{n_s}{n}\right)\left(1 - \frac{n_s}{n}\right)^{\eta} L_{BC}, \quad L_{DF}^{ca} = \left(1 - \frac{n_s}{n}\right)\left(1 - \frac{n_s}{n}\right)^{\eta} L_{DF}, \quad L_{WIoUv3}^{ca} = \left(1 - \frac{n_s}{n}\right)\left(1 - \frac{n_s}{n}\right)^{\eta} L_{WIoUv3},
where L_BC^ca is the class-aware classification loss, L_WIoUv3^ca is the class-aware Wise-IoU loss, and L_DF^ca is the class-aware distribution focal loss. The quantity n_s represents the number of training instances associated with each species, while n denotes the total number of training samples, and η is a tunable hyper-parameter described below. It is important to note that in this context, n_s is considerably smaller than n.
When n_s is characterized by a small value and the parameter η is increased, the value of the class-aware multiplying term increases as well. This increase effectively amplifies the weight assigned to the less-represented class. To achieve optimal performance, it is essential to fine-tune the hyper-parameter values. In most cases, raising the η value results in an increase in the value of the multiplying term within the loss functions. However, this alone may not lead to an overall performance improvement, as it can have repercussions on other classes. Consequently, the ideal hyper-parameter value for maximizing performance on a specific dataset must be systematically adjusted, beginning with an initial setting of η = 1. For this specific training scenario, we set η to the value of 8.
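The per-species multiplier can be precomputed once from the training-instance counts. The sketch below follows the class-aware weighting term as reconstructed above; the example counts and the η value are illustrative, and the expression itself is an assumption rather than the authors' exact code.

```python
import torch

def class_aware_weights(instance_counts, eta=8.0):
    """Per-species weights from training-instance counts, following the class-aware
    weighting term reconstructed above (an assumption, not the authors' code).
    Rare species keep a weight close to 1, while well-represented species are
    down-weighted more strongly as eta grows."""
    counts = torch.as_tensor(instance_counts, dtype=torch.float32)
    ratio = counts / counts.sum()                  # n_s / n for each species
    return (1.0 - ratio) * (1.0 - ratio) ** eta

# Example: a rare species (50 instances) vs. more common ones.
print(class_aware_weights([50, 5000, 800], eta=8.0))   # the rare species gets the largest weight
```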

3.3. Dataset Description

The Southeast Area Monitoring and Assessment Program Dataset 2021 (SEAMAPD21) [58] provides a large-scale collection of reef fish data from the Gulf of Mexico. This dataset encompasses 130 distinct classes of fish species captured in underwater environments, with a total of 28,319 images. However, certain species have limited representation in the dataset, as shown in Figure 4, and the model’s performance is biased toward species with a higher number of samples per class. Furthermore, detecting fish in low-resolution underwater environments presents challenges due to the difficulty in distinguishing fish from the background. Images from the dataset are illustrated in Figure 5. The task of detecting fish in certain images presents challenges, even for a human. For the experimental setup, a train-validation-test split ratio of 70/15/15 is utilized. The mean average precision (mAP) is then evaluated on the test set to assess the model’s performance.
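A minimal sketch of such a split is shown below, assuming image-level identifiers and a fixed random seed; the authors' exact partitioning procedure is not specified, so this is illustrative only.

```python
import random

def split_dataset(image_ids, seed=0):
    """Hypothetical 70/15/15 train/validation/test split of SEAMAPD21 image IDs,
    mirroring the ratio stated above; the seed and ID source are placeholders."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_dataset(range(28319))
print(len(train_ids), len(val_ids), len(test_ids))   # 19823 / 4247 / 4249 images
```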

4. Experimental Results

For the methods in comparison, parameters were tuned according to MobileNetv3-large [4], VGG300 [4], VGG512 [4], YOLOv5s [5], YOLOv5m [5], YOLOv5l [5], YOLOv8s [65], YOLOv5enh [5], YOLOv8m [65], and YOLOv8l [65].

4.1. Implementation Details

For training the proposed YOLOv8-TF, we utilized PyTorch 1.13.0 and employed an NVIDIA A100-SXM GPU. The models were trained and tested using this setup. The approach was trained from scratch, and Stochastic Gradient Descent (SGD) was employed as the optimizer for the model. A total of 300 epochs were conducted during the training process. The values for η were varied from 1 to 16, and optimal performance was found at η = 8. Optimal values of α1 = 0.5, α2 = 1.5, α3 = 7.5, θ = 2, β = 1.8, and τ = 3 were used.
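For reference, a comparable baseline run can be configured with the off-the-shelf Ultralytics API [65] as sketched below; the dataset YAML path is a placeholder, and the stock YOLOv8l configuration stands in for the modified YOLOv8-TF architecture, which is not part of the public package.

```python
# Hedged sketch of a comparable training run with the off-the-shelf Ultralytics API [65];
# "seamapd21.yaml" is a hypothetical dataset configuration file, and the stock yolov8l
# architecture stands in for the modified YOLOv8-TF, which is not publicly packaged.
from ultralytics import YOLO

model = YOLO("yolov8l.yaml")        # build from config, i.e., train from scratch
model.train(
    data="seamapd21.yaml",          # hypothetical dataset definition
    epochs=300,                     # 300 epochs, matching Section 4.1
    optimizer="SGD",                # Stochastic Gradient Descent
    device=0,                       # single NVIDIA A100 GPU
)
```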

4.2. Performance

To evaluate the performance of the proposed YOLOv8-TF and other existing versions of YOLOv8, the commonly used metric mean average precision (mAP) is employed. Two variants of mAP are utilized: mAP0.5:0.95, which calculates the average precision over a range of Intersection over Union (IoU) values from 0.5 to 0.95, and mAP0.5, which focuses on IoU at 0.5. These metrics provide a comprehensive assessment of the detection performance across different IoU thresholds. Table 1 presents the mAP for the proposed YOLOv8-TF approach in contrast to alternative methods. Certain species exhibit a limited number of samples, such as fewer than 10, which may result in inadequate samples for the training, validation, and testing sets. As a result, 121 species are employed to gauge the mAP. The results indicate that the proposed YOLOv8-TF surpasses other versions, including YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, YOLOv5s, YOLOv5m, YOLOv5l, MobileNetv3-large, VGG300, and VGG512, in the detection of fish species in underwater environments. Specifically, YOLOv8-TF exhibits superior performance with a mAP0.5 of 87.9% and a mAP0.5:0.95 of 61.2%.
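As an illustration of how these two metrics are computed, the sketch below evaluates a toy prediction with the MeanAveragePrecision utility from the torchmetrics package (with its COCO-style backend) over the ten IoU thresholds from 0.5 to 0.95; the boxes and labels are made-up values, not SEAMAPD21 results.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Average precision over IoU thresholds 0.5, 0.55, ..., 0.95 (mAP0.5:0.95) plus mAP0.5.
metric = MeanAveragePrecision(iou_thresholds=[0.5 + 0.05 * i for i in range(10)])

preds = [dict(boxes=torch.tensor([[10.0, 20.0, 100.0, 200.0]]),   # toy predicted box
              scores=torch.tensor([0.9]),
              labels=torch.tensor([3]))]
target = [dict(boxes=torch.tensor([[12.0, 22.0, 98.0, 195.0]]),   # toy ground-truth box
               labels=torch.tensor([3]))]

metric.update(preds, target)
result = metric.compute()
print(result["map"], result["map_50"])   # mAP0.5:0.95 and mAP0.5
```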
Additionally, Table 1 displays the network size, measured in terms of the number of parameters in millions (M), the number of calculations in GFLOPS, and speed measured in terms of Frames Per Second (FPS) on an NVIDIA A100-SXM GPU. This facilitates a comparison of the impact of the proposed approach with existing YOLOv8-based techniques. It is noticeable that YOLOv8-TF exhibits a higher number of parameters (51.08 M) and GFLOPS (165.4) compared to YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, YOLOv5s, YOLOv5m, and YOLOv5l. YOLOv8 outperforms YOLOv5 in terms of FPS for fish species recognition in underwater environments. YOLOv8-TF provides a competitive FPS of 116 compared to other methods. When considering accuracy measured in terms of mAP, YOLOv8-TF also demonstrates superiority compared to other methods.
Figure 6 presents a comparison between YOLOv8-TF and other existing versions like YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5enh, YOLOv8s, YOLOv8m, and YOLOv8l focusing on metrics such as mAP, frames per second (FPS), and GFLOPS. The GFLOPS is depicted through the radii of the circles. While YOLOv8-TF demands higher GFLOPS compared to YOLOv8m, the performance improvement, measured in terms of mAP, makes up for this difference. In terms of speed, YOLOv8-TF with 116 FPS outperforms YOLOv5m, YOLOv5l, YOLOv5enh although it is slower compared to YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv8l. The trade-off between detection accuracy and speed provided by YOLOv8-TF makes it superior in many cases compared to other versions.
Table 2 shows that the proposed YOLOv8-TF, with a mAP0.5 of 94.30% and a mAP0.5:0.95 of 58.10%, outperforms all other compared methods on the publicly available Pascal VOC dataset. Additionally, parameters (in millions) and speed measured in FPS (frames per second) are reported. The proposed YOLOv8-TF shows better accuracy than the other methods at the expense of a higher parameter count of 30.51 M and a lower FPS of 130. Similarly, Table 3 shows that the proposed YOLOv8-TF, with a mAP0.5 of 69.50% and a mAP0.5:0.95 of 51.80%, outperforms all other compared methods on the publicly available MS COCO dataset. Moreover, the proposed YOLOv8-TF achieves a higher mean average precision than the other compared methods, although it introduces additional complexity with a higher parameter count of 30.54 M and a slower FPS of 139.
Figure 7, Figure 8, Figure 9 and Figure 10 depict detection maps for fish species in underwater environments using SEAMAPD21. The images are generated using the established YOLOv8s, YOLOv8m, YOLOv8l, and the proposed YOLOv8-TF. The numbers after the names refer to the version of the YOLO model. Notably, YOLOv8-TF as shown in Figure 10 exhibits superior detection performance compared to other models, as shown by the higher mAP values in Table 1. In certain images, even humans may find it challenging to detect missed fish. For instance, the bottom two images in Figure 8 and the upper left corner of Figure 9 may not be identifiable or verifiable by a human.
Table 4 presents an ablation study conducted on the proposed YOLOv8-TF technique. The removal of the transformer block, depicted in the backbone and neck sections of Figure 1, reduces the accuracy of YOLOv8-TF with the class-aware (ca) loss and modified depth scale of the backbone C2f structure. Specifically, the mAP0.5 is 82.8%, and the mAP0.5–0.95 is 54.8%. Furthermore, upon removing the class-aware (ca) loss function from YOLOv8-TF, the accuracy of YOLOv8-TF with the modified depth scale of the backbone C2f structure, without the ca loss and trans block, decreases to a mAP0.5 of 81.8% and a mAP0.5–0.95 of 53.2%. Parameters in millions (params) and FPS (frames per second) are also reported for the ablation analysis. It can be observed that the proposed YOLOv8-TF achieves a higher mean average precision than the other versions while consuming more parameters (30.56 M) and running at a slower FPS of 116. This is due to the transformer block in the backbone and neck, the class-aware loss function along with the Wise-IoU v3 loss, and the increased depth scale in the backbone of the proposed YOLOv8-TF.

5. Discussion

For the proposed YOLOv8-TF on SEAMAPD21, the parameter count, GFLOPS, and FPS are listed in Table 1. Similarly, Table 4 shows the params (parameters in millions) and FPS (frames per second) for the ablation analysis of YOLOv8-TF. It can be observed that the proposed YOLOv8-TF outperforms the other compared methods in terms of mean average precision (mAP). However, it suffers from an increased parameter count and a slower FPS. Due to the attention mechanism in the transformer block, the increased depth scale, and the class imbalance loss along with the Wise-IoU v3 loss, the proposed YOLOv8-TF has a larger number of parameters. An increase in the number of parameters leads to slower recognition speed because more resources are needed for each inference. Our model achieves higher accuracy at the cost of an increased parameter count and a decreased FPS, which results in a higher inference time. The model requires substantial memory because of its higher number of parameters compared to other baseline models, such as YOLOv8m, YOLOv8s, and YOLOv8n, which may limit its use in resource-constrained environments. Our model may need optimization to address the issues of slower inference time and higher memory usage in resource-constrained environments such as embedded systems.

6. Conclusions

This study presents an enhanced approach, YOLOv8-TF, for fish species recognition in underwater environments. The effectiveness of the proposed YOLOv8-TF is evaluated through experiments conducted on the Southeast Area Monitoring and Assessment Program Dataset 2021 (SEAMAPD21), which comprises an underwater fish species dataset obtained from the Gulf of Mexico. The experimental results highlight the superiority of YOLOv8-TF compared to existing YOLOv8-based techniques. The method’s effectiveness is improved by adjusting the backbone’s depth scale, and adding a transformer block to both the backbone and neck sections further enhances the recognition of different fish species. Moreover, class-aware (ca) loss further enhances the detection performance by helping mitigate the class imbalance issue in the SEAMAPD21 dataset. We compare different YOLO models to illustrate the performance improvements achieved by YOLOv8-TF.
Future research directions will concentrate on building a lightweight model that can be deployed in resource-constrained environments and devices, aiming to enhance real-time processing capabilities and achieve higher accuracy. Furthermore, we plan to expand the complexity and scope of our training data by incorporating a full training library that includes all available video annotations. This expansion will allow YOLOv8-TF to leverage the temporal information inherent in video data and enrich the model’s understanding of fish behavior and movement patterns, significantly enhancing our model’s capabilities in fish tracking and counting across multiple images or videos. Such advancements will improve our ability to monitor and understand aquatic ecosystems and pave the way for new marine biology and conservation applications.

Author Contributions

Conceptualization, C.S.; methodology, C.S.; software, C.S.; validation, J.E.B. and R.M.; formal analysis, C.S.; investigation, C.S.; resources, J.E.B. and R.M.; data curation, J.P., M.D.C., C.S., M.M.N. and S.Y.A.; writing—original draft preparation, C.S., S.Y.A., M.M.N. and I.A.E.; writing—review and editing, J.E.B., R.M., J.P., M.D.C., R.C., M.D.G., T.R. and F.W.; visualization, C.S., S.Y.A. and M.M.N.; supervision, J.E.B., R.M., T.R. and F.W.; project administration, J.E.B., R.M., T.R. and F.W.; funding acquisition, J.E.B., R.M., T.R. and F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by award NA21OAR4320190 to the Northern Gulf Institute at Mississippi State University from NOAA’s Office of Oceanic and Atmospheric Research, U.S. Department of Commerce.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://github.com/SEFSC/SEAMAPD21 (accessed on 8 March 2025). The linked repository is updated over time, so it may contain more data than were used in this experiment. In addition, the source code for this work will be made publicly available at https://github.com/SEFSC/FATES-ATI-EnhancedYOLOv8 (accessed on 8 March 2025).

Acknowledgments

The authors are thankful for the source of funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLOv8: You Only Look Once, version 8
YOLOv8enh: enhanced form of You Only Look Once, version 8
SEAMAPD21: The Southeast Area Monitoring and Assessment Program Dataset 2021
trans: Transformer block

Appendix A

Table A1. mAP (in %) for various methods on SEAMAPD21 dataset [4,5].
Species | MobileNetv3 | VGG300 | VGG512 | YOLOv5m | YOLOv5l | YOLOv5enh | YOLOv8s | YOLOv8m | YOLOv8l | YOLOv10l | YOLOv8-TF
ACANTHURUSCOERULEUS---0.960.540.188.04020.014.343.9
ACANTHURUS20.554.5559.3248.849.851.447.548.348.156.365.3
ALECTISCILIARIS72.7336.3629.964.464.762.762.270.465.468.265.9
ANISOTREMUSVIRGINICUS---15.915.733.86.723.739.749.666.3
ANOMURA061.3672.7374.172.481.278.185.285.782.275.9
ANTHIINAE---38.249.044.938.243.649.755.553.0
ARCHOSARGUSPROBATOCEPHALUS15.0245.4554.1338.344.847.547.150.547.150.561.2
BALISTESCAPRISCUS67.5373.4274.3170.070.872.673.175.175.576.777.3
BALISTESVETULA20.737.2642.9254.763.574.281.653.058.283.267.0
BODIANUSPULCHELLUS38.0755.6359.2161.461.462.061.465.466.466.471.9
BODIANUSRUFUS020.4141.5337.735.339.443.336.737.047.857.4
CALAMUSBAJONADO078.689.689.579.679.689.579.679.679.663.3
CALAMUSLEUCOSTEUS54.8669.2473.1671.872.272.775.777.177.577.280.1
CALAMUSNODOSUS7.576.8910.374.775.977.377.278.178.781.780.0
CALAMUSPRORIDENS15.3418.6412.7564.566.467.268.570.970.973.574.8
CALAMUS26.7348.458.2258.664.162.859.363.864.867.662.1
CANTHIDERMISSUFFLAMEN2.647.7348.4245.849.152.845.751.552.853.649.6
CANTHIGASTERROSTRATUS---45.053.765.032.551.247.466.957.7
CARANXBARTHOLOMAEI46.6250.6454.3558.160.562.659.758.660.764.676.1
CARANXCRYSOS19.1348.9445.5653.754.858.157.057.556.660.659.0
CARANXRUBER084.8554.5568.575.073.974.977.071.576.174.0
CARCHARHINUSFALCIFORMIS10054.5554.5581.684.383.578.383.283.283.078.5
CARCHARHINUSPEREZI100100100---- --
CARCHARHINUSPLUMBEUS06.0636.3657.359.867.160.566.462.865.350.0
CAULOLATILUSCHRYSOPS41.8542.4339.2273.072.875.274.279.977.880.179.1
CAULOLATILUSCYANOPS10.889.3210.8668.169.770.070.571.372.473.373.0
CENTROPRISTISOCYURA---63.965.567.568.875.474.374.174.0
CEPHALOPHOLISCRUENTATA31.170.6365.1454.655.354.753.856.660.456.151.5
CHAETODONOCELLATUS---29.831.735.425.231.231.348.850.1
CHAETODONSEDENTARIUS9.099.416.5248.450.052.647.551.450.759.353.7
CHAETODON---13.419.632.30.5719.030.643.240.7
CHROMISENCHRYSURUS1.23042.0819.816.812.824.19.9517.027.651.3
CHROMISINSOLATUS10010011.7648.365.044.446.040.057.335.558.5
CHROMIS---2.49001.811.537.460.3432.3
DERMATOLEPISINERMIS17.446.4313.0178.780.681.083.685.18.4487.287.0
DIODONTIDAE1001001003.7319.99.9514.99.959.9559.775.8
DIPLECTRUMFORMOSUM27.7244.4660.0444.647.649.245.147.650.256.458.6
EPINEPHELUSADSCENSIONIS21.4463.6875.841.145.846.642.145.550.558.564.5
EPINEPHELUSFLAVOLIMBATUS90.8590.7690.4182.482.185.387.189.188.686.589.1
EPINEPHELUSMORIO33.6937.535.3373.574.577.177.178.379.381.680.2
EPINEPHELUSNIGRITUS10010010064.061.856.867.162.065.564.569.8
EPINEPHELUS36.3663.6463.6446.748.055.462.057.963.066.569.2
EQUETUSLANCEOLATUS100100100-- -- -69.2
EQUETUSUMBROSUS39.573.2673.6466.3-71.171.571.173.374.970.8
GONIOPLECTRUSHISPANUS81.8285.7184.03000000080.6
GYMNOTHORAXMORINGA---48.355.256.052.457.054.460.966.3
GYMNOTHORAXSAXICOLA---62.261.163.759.666.166.268.470.2
HAEMULONAUROLINEATUM55.5768.375.6862.662.967.063.968.469.268.867.9
HAEMULONFLAVOLINEATUM024.1167.8534.135.439.644.137.249.457.651.6
HAEMULONMACROSTOMUM5010010044.248.053.039.243.861.065.480.5
HAEMULONMELANURUM53.0377.2984.9464.266.868.164.669.172.471.970.0
HAEMULONPLUMIERI39.4847.6864.3142.341.950.744.054.050.157.765.8
HALICHOERESBATHYPHILUS9.2726.5552.3639.748.147.051.452.251.960.737.9
HALICHOERESBIVITTATUS---22.526.631.421.725.627.940.341.4
HALICHOERESGARNOTI---23.221.329.928.735.039.634.247.9
HALICHOERES---32.1-37.131.541.038.945.045.2
HOLACANTHUSBERMUDENSIS11.0218.6523.3892.8-68.968.370.070.774.175.1
HOLACANTHUS---31.533.939.234.044.644.636.870.9
HOLANTHIUSMARTINICENSIS---78.483.547.532.741.845.358.654.9
HOLOCENTRUS084.2293.7848.3 62.550.961.960.064.154.0
HYPOPLECTRUSGEMMA---13.519.933.413.120.927.343.020.0
HYPOPLECTRUS---25.329.634.213.129.922.451.261.6
HYPOPLECTRUSUNICOLOR06.0633.6453.257.067.762.969.067.068.656.2
IOGLOSSUS---37.743.347.229.039.341.256.855.8
KYPHOSUS24.7618.5827.9147.545.763.359.558.653.362.227.6
LACHNOLAIMUSMAXIMUS9.812.739.3259.160.360.163.260.063.657.062.4
LACTOPHRYSTRIGONUS---30.735.134.826.332.840.551.572.3
LIOPROPOMAEUKRINES3.0320.7821.5670.865.069.365.076.370.571.380.8
LUTJANUSANALIS41.1228.0221.2157.363.558.157.661.759.259.969.3
LUTJANUSAPODUS044.4454.5548.940.341.930.429.447.254.130.2
LUTJANUSBUCCANELA62.0277.0981.7270.971.773.774.876.976.579.077.1
LUTJANUSCAMPECHANUS37.5849.2650.1868.969.772.371.574.274.375.575.5
LUTJANUSGRISEUS16.340.154.5658.060.161.760.461.863.267.767.2
LUTJANUSSYNAGRIS11.8113.169.4962.363.065.462.664.065.271.272.4
LUTJANUS---16.928.738.134.841.837.427.443.1
LUTJANUSVIVANUS70.6636.5645.4587.586.687.585.990.192.991.179.3
MALACANTHUSPLUMIERI---56.156.859.155.259.460.163.570.1
MULLOIDICHTHYSMARTINICUS072.7372.7354.385.285.273.184.275.370.049.7
MURAENARETIFERA70.9489.0571.7363.756.669.563.166.071.776.169.0
MYCTEROPERCABONACI45.4590.9190.9148.047.152.344.456.152.959.254.8
MYCTEROPERCAINTERSTIALIS070.9139.0975.678.377.776.773.274.380.182.3
MYCTEROPERCAINTERSTITIALIS68.8263.4269.8767.070.869.368 .573.272.671.976.1
MYCTEROPERCAMICROLEPIS---59.749.759.769.769.769.769.781.0
MYCTEROPERCAPHENAX26.2640.5241.4868.169.570.170.372.372.574.677.6
MYCTEROPERCA14.5554.6545.8631.931.935.932.242.451.037.630.4
OCYURUSCHRYSURUS21.2442.1655.5539.643.946.541.546.045.652.151.2
OPHICHTHUSPUNCTICEPS---57.960.561.960.363.367.565.371.9
OPISTOGNATHUSAURIFRONS---24.831.638.718.330.330.641.744.6
PAGRUSPAGRUS21.2128.5531.6864.765.667.866.769.569.772.671.6
PARANTHIASFURCIFER54.5556.2115.3634.144.737.233.640.325.141.543.1
POMACANTHUSARCUATUS18.9217.4318.8164.064.466.967.469.267.773.468.2
POMACANTHUSPARU27.8659.8271.8368.966.574.673.673.672.274.971.9
POMACANTHUS0063.64000000028.9
POMACENTRIDAE010.9925.7730.532.637.327.032.333.842.244.6
POMACENTRUSPARTITUS---11.019.924.210.112.818.828.633.9
POMACENTRUS---01.341.5600027.50
PRIACANTHUSARENATUS---52.254.724.242.439.855.260.482.6
PRISTIGENYSALTA012.3312.8664.667.266.965.167.868.775.176.3
PRISTIPOMOIDESAQUILONARIS3.5815.4418.1785.750.251.148.350.852.452.451.3
PSEUDUPENEUSMACULATUS---64.669.674.650.070.540.667.540.5
PTEROIS32.8323.5722.975.978.278.781.683.385.183.482.9
RACHYCENTRONCANADUM45.45027.7289.583.073.179.882.380.288.474.9
RHOMBOPLITESAURORUBENS44.7360.667.2959.060.361.862.163.864.564.664.0
RYPTICUSMACULATUS50.7970.9473.9661.163.265.163.866.669.167.569.3
SCARIDAE---1.5501.149.970.271.210.2113.5
SCARUSVETULA---40.034.834.940.638.331.056.052.0
SERIOLADUMERILI51.6261.1959.4869.069.670.672.373.574.174.373.0
SERIOLAFASCIATA45.6261.8954.6661.863.465.264.969.368.368.269.6
SERIOLARIVOLIANA51.5560.9366.4669.071.872.772.174.674.576.074.1
SERIOLA---23.222.639.832.238.317.643.643.3
SERIOLAZONATA---69.789.569.769.779.669.779.657.8
SERRANUSANNULARIS---64.165.167.965.070.270.177.879.0
SERRANUSPHOEBE---40.640.744.641.540.745.047.554.9
SPARIDAE---19.854.759.754.843.142.357.269.7
SPARISOMAAUROFRENATUM---0.720.7200.8301.038.60.60
SPARISOMAVIRIDE---30.138.038.738.646.539.447.60
SPHYRAENABARRACUDA045.4545.4571.173.478.472.171.883.784.174.4
STENOTOMUSCAPRINUS---0.8801.080.413.12.8317.311.5
SYACIUM---79.179.176.579.581.28.4488.486.7
SYNODONTIDAE---50.658.063.438.759.35.5768.476.8
THALASSOMABIFASCIATUM---3.366.1311.42.219.96.5834.238.2
UPENEUSPARVUS1.0717.6537.3625.628.926.326.533.422.631.053.1
XANTHICHTHYSRINGENS23.2110010078.381.083.077.784.78.7387.574.7

References

  1. Chang, C.M.; Fang, W.; Jao, R.C.; Shyu, C.Z.; Liao, I.C. Development of an intelligent feeding controller for indoor intensive culturing of eel. Aquac. Eng. 2004, 32, 343–353. [Google Scholar] [CrossRef]
  2. Cabreira, A.G.; Tripode, M.; Madirolas, A. Artificial neural networks for fish-species identification. ICES J. Mar. Sci. 2009, 66, 1119–1129. [Google Scholar] [CrossRef]
  3. Alaba, S.; Shah, C.; Nabi, M.; Ball, J.; Moorhead, R.; Han, D.; Prior, J.; Campbell, M.; Wallace, F. Semi-supervised learning for fish species recognition. In Proceedings of the Ocean Sensing and Monitoring XV, Orlando, FL, USA, 3–4 May 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12543, pp. 248–255. [Google Scholar] [CrossRef]
  4. Alaba, S.Y.; Nabi, M.; Shah, C.; Prior, J.; Campbell, M.D.; Wallace, F.; Ball, J.E.; Moorhead, R. Class-aware fish species recognition using deep learning for an imbalanced dataset. Sensors 2022, 22, 8268. [Google Scholar] [CrossRef]
  5. Shah, C.; Alaba, S.Y.; Nabi, M.M.; Prior, J.; Campbell, M.; Wallace, F.; Ball, J.E.; Moorhead, R. An enhanced YOLOv5 model for fish species recognition from underwater environments. In Proceedings of the Ocean Sensing and Monitoring XV, Orlando, FL, USA, 3–4 May 2023; Hou, W., Mullen, L.J., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2023; Volume 12543, p. 125430O. [Google Scholar] [CrossRef]
  6. Shah, C.; Alaba, S.Y.; Nabi, M.M.; Caillouet, R.; Prior, J.; Campbell, M.; Wallace, F.; Ball, J.E.; Moorhead, R. MI-AFR: Multiple instance active learning-based approach for fish species recognition in underwater environments. In Proceedings of the Ocean Sensing and Monitoring XV, Orlando, FL, USA, 3–4 May 2023; Hou, W., Mullen, L.J., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2023; Volume 12543, p. 125430N. [Google Scholar] [CrossRef]
  7. Prior, J.; Campbell, M.; Dawkins, M.; Mickle, P.; Moorhead, R.; Alaba, S.; Shah, C.; Salisbury, J.; Rademacher, K.; Felts, A.; et al. Estimating precision and accuracy of automated video post-processing: A step towards implementation of AI/ML for optics-based fish sampling. Front. Mar. Sci. 2023, 10, 1150651. [Google Scholar] [CrossRef]
  8. Jalali, M.A.; Ierodiaconou, D.; Monk, J.; Gorfine, H.; Rattray, A. Predictive mapping of abalone fishing grounds using remotely-sensed LiDAR and commercial catch data. Fish. Res. 2015, 169, 26–36. [Google Scholar] [CrossRef]
  9. Churnside, J.H.; Wells, R.; Boswell, K.M.; Quinlan, J.A.; Marchbanks, R.D.; McCarty, B.J.; Sutton, T.T. Surveying the distribution and abundance of flying fishes and other epipelagics in the northern Gulf of Mexico using airborne lidar. Bull. Mar. Sci. 2017, 93, 591–609. [Google Scholar] [CrossRef]
  10. Boswell, K.M.; Wilson, M.P.; Cowan, J.H., Jr. A semiautomated approach to estimating fish size, abundance, and behavior from dual-frequency identification sonar (DIDSON) data. N. Am. J. Fish. Manag. 2008, 28, 799–807. [Google Scholar] [CrossRef]
  11. Villon, S.; Chaumont, M.; Subsol, G.; Villéger, S.; Claverie, T.; Mouillot, D. Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between Deep Learning and HOG+SVM methods. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems, Lecce, Italy, 24–27 October 2016; Springer: Cham, Switzerland, 2016; pp. 160–171. [Google Scholar] [CrossRef]
  12. Bicknell, A.W.; Godley, B.J.; Sheehan, E.V.; Votier, S.C.; Witt, M.J. Camera technology for monitoring marine biodiversity and human impact. Front. Ecol. Environ. 2016, 14, 424–432. [Google Scholar] [CrossRef]
  13. Shortis, M.; Harvey, E.; Abdo, D. A review of underwater stereo-image measurement for marine biology and ecology applications. In Oceanography and Marine Biology; CRC Press: Boca Raton, FL, USA, 2016; pp. 269–304. [Google Scholar] [CrossRef]
  14. Panetta, K.; Kezebou, L.; Oludare, V.; Agaian, S. Comprehensive Underwater Object Tracking Benchmark Dataset and Underwater Image Enhancement With GAN. IEEE J. Ocean. Eng. 2022, 47, 59–75. [Google Scholar] [CrossRef]
  15. Slonimer, A.L.; Dosso, S.E.; Albu, A.B.; Cote, M.; Marques, T.P.; Rezvanifar, A.; Ersahin, K.; Mudge, T.; Gauthier, S. Classification of Herring, Salmon, and Bubbles in Multifrequency Echograms Using U-Net Neural Networks. IEEE J. Ocean. Eng. 2023, 48, 1236–1254. [Google Scholar] [CrossRef]
  16. Ntouskos, V.; Mertikas, P.; Mallios, A.; Karantzalos, K. Seabed Classification From Multispectral Multibeam Data. IEEE J. Ocean. Eng. 2023, 48, 874–887. [Google Scholar] [CrossRef]
  17. Xiao, F.; Yuan, F.; Huang, Y.; Cheng, E. Turbid underwater image enhancement based on parameter-tuned stochastic resonance. IEEE J. Ocean. Eng. 2022, 48, 127–146. [Google Scholar] [CrossRef]
  18. Gu, K.; Liu, J.; Shi, S.; Xie, S.; Shi, T.; Qiao, J. Self-organizing multichannel deep learning system for river turbidity monitoring. IEEE Trans. Instrum. Meas. 2022, 71, 9510713. [Google Scholar] [CrossRef]
  19. Zeng, L.; Sun, B.; Zhu, D. Underwater target detection based on Faster R-CNN and adversarial occlusion network. Eng. Appl. Artif. Intell. 2021, 100, 104190. [Google Scholar] [CrossRef]
  20. Harden Jones, F. The reaction of fish to moving backgrounds. J. Exp. Biol. 1963, 40, 437–446. [Google Scholar] [CrossRef]
  21. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognit. 2022, 132, 108926. [CrossRef]
  22. Cardaillac, A.; Ludvigsen, M. Camera-sonar combination for improved underwater localization and mapping. IEEE Access 2023, 11, 123070–123079. [Google Scholar] [CrossRef]
  23. Almanza-Medina, J.E.; Henson, B.; Zakharov, Y.V. Deep learning architectures for navigation using forward looking sonar images. IEEE Access 2021, 9, 33880–33896. [Google Scholar] [CrossRef]
  24. Chang, C.C.; Ubina, N.A.; Cheng, S.C.; Lan, H.Y.; Chen, K.C.; Huang, C.C. A Two-Mode Underwater Smart Sensor Object for Precision Aquaculture Based on AIoT Technology. Sensors 2022, 22, 7603. [Google Scholar] [CrossRef]
  25. Wang, Y.; Yu, X.; An, D.; Wei, Y. Underwater image enhancement and marine snow removal for fishery based on integrated dual-channel neural network. Comput. Electron. Agric. 2021, 186, 106182. [Google Scholar] [CrossRef]
  26. Ju, Y.; Xiao, J.; Zhang, C.; Xie, H.; Luo, A.; Zhou, H.; Dong, J.; Kot, A.C. Towards marine snow removal with fusing Fourier information. Inf. Fusion 2025, 117, 102810. [Google Scholar] [CrossRef]
  27. Kaneko, R.; Sato, Y.; Ueda, T.; Higashi, H.; Tanaka, Y. Marine Snow Removal Benchmarking Dataset. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 771–778. [Google Scholar] [CrossRef]
  28. Debnath, B.; Ebu, I.A.; Biswas, S.; Gurbuz, A.C.; Ball, J.E. Fmcw radar range profile and micro-doppler signature fusion for improved traffic signaling motion classification. In Proceedings of the 2024 IEEE Radar Conference (RadarConf24), Denver, CO, USA, 6–10 May 2024; pp. 1–6. [Google Scholar] [CrossRef]
  29. Xu, W.; Matzner, S. Underwater fish detection using deep learning for water power applications. In Proceedings of the 2018 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 12–14 December 2018; pp. 313–318. [Google Scholar] [CrossRef]
  30. Nabi, M.; Shah, C.; Alaba, S.Y.; Prior, J.; Campbell, M.D.; Wallace, F.; Moorhead, R.; Ball, J.E. Probabilistic model-based active learning with attention mechanism for fish species recognition. In Proceedings of the OCEANS 2023-MTS/IEEE US Gulf Coast, Biloxi, MS, USA, 25–28 September 2023; pp. 1–8. [Google Scholar] [CrossRef]
  31. Shah, C.; Nabi, M.; Alaba, S.Y.; Prior, J.; Caillouet, R.; Campbell, M.D.; Wallace, F.; Ball, J.E.; Moorhead, R. A zero shot detection based approach for fish species recognition in underwater environments. In Proceedings of the OCEANS 2023-MTS/IEEE US Gulf Coast, Biloxi, MS, USA, 25–28 September 2023; pp. 1–7. [Google Scholar] [CrossRef]
  32. Jäger, J.; Rodner, E.; Denzler, J.; Wolff, V.; Fricke-Neuderth, K. SeaCLEF 2016: Object Proposal Classification for Fish Detection in Underwater Videos. In Proceedings of the CLEF (Working Notes), Évora, Portugal, 5–8 September 2016; pp. 481–489. [Google Scholar]
  33. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  34. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You only look at one sequence: Rethinking transformer in vision through object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197. [Google Scholar]
  35. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13029–13038. [Google Scholar] [CrossRef]
  36. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  37. Feng, J.; Jin, T. CEH-YOLO: A composite enhanced YOLO-based model for underwater object detection. Ecol. Inform. 2024, 82, 102758. [Google Scholar] [CrossRef]
  38. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-Object-Detection Algorithm Based on Improved YOLOv8 for UAV Images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
  39. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  40. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  41. Ortenzi, L.; Aguzzi, J.; Costa, C.; Marini, S.; D’Agostino, D.; Thomsen, L.; De Leo, F.C.; Correa, P.V.; Chatzievangelou, D. Automated species classification and counting by deep-sea mobile crawler platforms using YOLO. Ecol. Inform. 2024, 82, 102788. [Google Scholar] [CrossRef]
  42. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Kwon, Y.; Michael, K.; Changyu, L.; Fang, J.; Skalski, P.; Hogan, A.; et al. ultralytics/yolov5: V6. 0-YOLOv5n’Nano’models, Roboflow integration, TensorFlow export, OpenCV DNN support. Zenodo 2021. [Google Scholar] [CrossRef]
  43. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  44. Jung, H.K.; Choi, G.S. Improved YOLOv5: Efficient Object Detection Using Drone Images under Various Conditions. Appl. Sci. 2022, 12, 7255. [Google Scholar] [CrossRef]
  45. Wang, H.; Hu, Z.; Mo, H.; Zhao, X. Enhanced nighttime nail detection using improved YOLOv5 for precision road safety. Sci. Rep. 2025, 15, 5224. [Google Scholar] [CrossRef]
  46. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  47. Zhang, Y.; Wu, Z.; Wang, X.; Fu, W.; Ma, J.; Wang, G. Improved YOLOv8 Insulator Fault Detection Algorithm Based on BiFormer. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023; pp. 962–965. [Google Scholar] [CrossRef]
  48. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
  49. Ansah, P.A.K.; Appati, J.K.; Owusu, E.; Boahen, E.K.; Boakye-Sekyerehene, P.; Dwumfour, A. SB-YOLO-V8: A Multilayered Deep Learning Approach for Real-Time Human Detection. Eng. Rep. 2025, 7, e70033. [Google Scholar] [CrossRef]
  50. Bi, J.; Li, K.; Zheng, X.; Zhang, G.; Lei, T. SPDC-YOLO: An Efficient Small Target Detection Network Based on Improved YOLOv8 for Drone Aerial Image. Remote Sens. 2025, 17, 685. [Google Scholar] [CrossRef]
  51. Yang, X.; Zhang, S.; Liu, J.; Gao, Q.; Dong, S.; Zhou, C. Deep learning for smart fish farming: Applications, opportunities and challenges. Rev. Aquac. 2021, 13, 66–90. [Google Scholar] [CrossRef]
  52. Villon, S.; Mouillot, D.; Chaumont, M.; Darling, E.S.; Subsol, G.; Claverie, T.; Villéger, S. A deep learning method for accurate and fast identification of coral reef fishes in underwater images. Ecol. Inform. 2018, 48, 238–244. [Google Scholar] [CrossRef]
  53. Rathi, D.; Jain, S.; Indu, S. Underwater fish species classification using convolutional neural network and deep learning. In Proceedings of the 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), Bangalore, India, 27–30 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
  54. Shreesha, S.; Pai, M.M.; Pai, R.M.; Verma, U. Pattern detection and prediction using deep learning for intelligent decision support to identify fish behaviour in aquaculture. Ecol. Inform. 2023, 78, 102287. [Google Scholar] [CrossRef]
  55. Zhou, C.; Xu, D.; Chen, L.; Zhang, S.; Sun, C.; Yang, X.; Wang, Y. Evaluation of fish feeding intensity in aquaculture using a convolutional neural network and machine vision. Aquaculture 2019, 507, 457–465. [Google Scholar] [CrossRef]
  56. Abinaya, N.; Susan, D.; Sidharthan, R.K. Deep learning-based segmental analysis of fish for biomass estimation in an occulted environment. Comput. Electron. Agric. 2022, 197, 106985. [Google Scholar] [CrossRef]
  57. Zambrano, A.F.; Giraldo, L.F.; Quimbayo, J.; Medina, B.; Castillo, E. Machine learning for manually-measured water quality prediction in fish farming. PLoS ONE 2021, 16, e0256380. [Google Scholar] [CrossRef] [PubMed]
  58. Boulais, O.; Alaba, S.Y.; Ball, J.E.; Campbell, M.; Iftekhar, A.T.; Moorehead, R.; Primrose, J.; Prior, J.; Wallace, F.; Yu, H.; et al. SEAMAPD21: A large-scale reef fish dataset for fine-grained categorization. In Proceedings of the Eight Workshop on Fine-Grained Visual Categorization, Online, 25 June 2021. [Google Scholar]
  59. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  60. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  61. Wang, X.; Xue, G.; Huang, S.; Liu, Y. Underwater Object Detection Algorithm Based on Adding Channel and Spatial Fusion Attention Mechanism. J. Mar. Sci. Eng. 2023, 11, 1116. [Google Scholar] [CrossRef]
  62. Pachaiyappan, P.; Chidambaram, G.; Jahid, A.; Alsharif, M.H. Enhancing Underwater Object Detection and Classification Using Advanced Imaging Techniques: A Novel Approach with Diffusion Models. Sustainability 2024, 16, 7488. [Google Scholar] [CrossRef]
  63. Sung, M.; Yu, S.C.; Girdhar, Y. Vision based real-time fish detection using convolutional neural network. In Proceedings of the OCEANS 2017-Aberdeen, Aberdeen, UK, 19–22 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
  64. Jalal, A.; Salman, A.; Mian, A.; Shortis, M.; Shafait, F. Fish detection and species classification in underwater environments using deep learning with temporal information. Ecol. Inform. 2020, 57, 101088. [Google Scholar] [CrossRef]
65. Ultralytics. Ultralytics/ultralytics: NEW - YOLOv8 in PyTorch > ONNX > CoreML > TFLite. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 8 March 2025). [Google Scholar]
  66. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  67. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
  68. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  69. Roy, S.K.; Sukul, A.; Jamali, A.; Haut, J.M.; Ghamisi, P. Cross hyperspectral and LiDAR attention transformer: An extended self-attention for land use and land cover classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5512815. [Google Scholar] [CrossRef]
70. Zhang, Z.; Lu, X.; Cao, G.; Yang, Y.; Jiao, L.; Liu, F. ViT-YOLO: Transformer-Based YOLO for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2799–2808. [Google Scholar] [CrossRef]
  71. Wu, T.; Dong, Y. YOLO-SE: Improved YOLOv8 for Remote Sensing Object Detection and Recognition. Appl. Sci. 2023, 13, 12977. [Google Scholar] [CrossRef]
  72. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Figure 1. The enhanced YOLOv8-TF architecture for fish species detection on the SEAMAPD21 dataset.
Figure 2. C2f block [68] of the YOLOv8 model. For an input feature map X with dimensions h × w × c_in, the output of the C2f layer is Y with dimensions h × w × c_out. The block consists of n bottlenecks connected in series to transfer information.
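To make the C2f structure in Figure 2 concrete, the following is a minimal PyTorch sketch of a C2f-style block. The class names (ConvBNSiLU, Bottleneck, C2f), the 50/50 channel split, and the default kernel sizes are illustrative assumptions rather than the exact Ultralytics implementation [65].

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual (shortcut) connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=3, p=1)
        self.cv2 = ConvBNSiLU(c, c, k=3, p=1)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """C2f: split the features, pass one part through n serial bottlenecks,
    and concatenate all intermediate outputs before the final 1x1 convolution."""
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        self.c = c_out // 2                       # hidden channel width (assumed split)
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c)   # produces the two branches
        self.cv2 = ConvBNSiLU((n + 2) * self.c, c_out)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))     # split into two c-channel parts
        for m in self.m:
            y.append(m(y[-1]))                    # each bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))      # output of size h x w x c_out

# Example: a (1, 64, 80, 80) input maps to a (1, 128, 80, 80) output.
x = torch.randn(1, 64, 80, 80)
print(C2f(64, 128, n=2)(x).shape)
```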
Figure 3. Transformer (trans) block of the YOLOv8-TF approach, where q, k, and v denote the query, key, and value, respectively. Here, ⊕ denotes element-wise summation.
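As a companion to Figure 3, below is a hedged PyTorch sketch of a transformer (trans) block of the kind described: the feature map is flattened into tokens, projected to q, k, and v for multi-head self-attention, and the two residual additions correspond to the ⊕ summations. The head count, MLP expansion ratio, and normalization placement are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransBlock(nn.Module):
    """Self-attention over the spatial tokens of a CNN feature map.
    q, k, and v are projections of the tokens; the two element-wise
    summations (⊕ in Figure 3) are the residual connections."""
    def __init__(self, c, num_heads=4, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(c)
        self.mlp = nn.Sequential(
            nn.Linear(c, mlp_ratio * c),
            nn.GELU(),
            nn.Linear(mlp_ratio * c, c),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)     # (b, h*w, c) token sequence
        q = k = v = self.norm1(t)
        a, _ = self.attn(q, k, v)            # multi-head self-attention
        t = t + a                            # first ⊕ (residual summation)
        t = t + self.mlp(self.norm2(t))      # second ⊕ after the feed-forward layers
        return t.transpose(1, 2).reshape(b, c, h, w)

# Example: a 256-channel, 20x20 feature map keeps its shape.
x = torch.randn(1, 256, 20, 20)
print(TransBlock(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```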
Figure 4. The distribution of sample occurrences per species in SEAMAPD21 [58] exhibits a highly imbalanced structure.
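The imbalance shown in Figure 4 is what the class-aware (CA) loss evaluated in Table 4 is meant to counteract: species with fewer instances receive a larger weight. The snippet below is only a minimal sketch of inverse-frequency weighting with an assumed smoothing exponent; the paper's exact class-aware formulation may differ.

```python
import torch

def class_aware_weights(instance_counts, exponent=0.5):
    """Assign larger loss weights to species with fewer instances.
    `exponent` is an assumed smoothing hyperparameter controlling how
    strongly rare classes are up-weighted; weights are normalized to mean 1."""
    counts = torch.as_tensor(instance_counts, dtype=torch.float32)
    w = (counts.sum() / counts) ** exponent   # inverse-frequency weighting
    return w * len(counts) / w.sum()          # keep the average weight at 1

# Example with a toy, highly imbalanced count vector (not SEAMAPD21 values):
print(class_aware_weights([12000, 800, 35]))
# the rarest species (35 instances) receives the largest weight
```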
Figure 5. Images from the SEAMAPD21 dataset depict scenes where fish are often difficult to distinguish from the background, adding to the challenge of identification. Some fish images pose challenges even for human detection, possibly due to occlusion by vertical bars or other fish.
Figure 6. Performance comparison of the proposed YOLOv8-TF with YOLOv5- and YOLOv8-based approaches on the SEAMAPD21 dataset in terms of FPS. GFLOPS is indicated by the radius of each circle.
Figure 7. Detection images for YOLOv8s in SEAMAPD21.
Figure 8. Detection images for YOLOv8m in SEAMAPD21.
Figure 9. Detection images for YOLOv8l in SEAMAPD21.
Figure 10. Detection images for YOLOv8-TF in SEAMAPD21. All fish species in each image are detected with high confidence.
Table 1. The mean average precision (mAP%) was computed for 121 species using the SEAMAPD21 dataset (see Table A1).

| Method | mAP0.5 | mAP0.5:0.95 | Parameters | GFLOPS | FPS |
|---|---|---|---|---|---|
| MobileNetv3-large | - | 32.51 | - | - | 105 |
| VGG300 | - | 48.99 | - | - | 67 |
| VGG512 | - | 52.75 | - | - | 54 |
| YOLOv5s | 71.9 | 43.9 | 7.37 M | 17.1 | 117 |
| YOLOv5m | 75.6 | 47.9 | 21.37 M | 49.5 | 110 |
| YOLOv5l | 78.5 | 50.6 | 46.80 M | 109.9 | 103 |
| YOLOv5enh | 81.1 | 53.0 | 61.30 M | 151.0 | 99 |
| YOLOv8n | 72.1 | 45.4 | 3.66 M | 11.2 | 146 |
| YOLOv8s | 76.1 | 49.6 | 11.21 M | 29.1 | 137 |
| YOLOv8m | 80.3 | 52.7 | 25.91 M | 79.1 | 128 |
| YOLOv8l | 81.1 | 53.4 | 43.70 M | 151.0 | 120 |
| YOLOv10l | 84.2 | 58.5 | 25.92 M | 127.4 | 125 |
| YOLOv8-TF | 87.9 | 61.2 | 30.56 M | 195.7 | 116 |
Table 2. The mean average precision (mAP%) was computed for 20 classes using the Pascal VOC dataset.

| Method | mAP0.5 | mAP0.5:0.95 | Parameters | FPS |
|---|---|---|---|---|
| YOLOv8n | 83.90 | 53.70 | 3.93 M | 194 |
| YOLOv8s | 87.65 | 54.80 | 14.81 M | 168 |
| YOLOv8m | 89.10 | 54.10 | 26.49 M | 155 |
| YOLOv8l | 91.70 | 56.40 | 47.31 M | 141 |
| YOLOv10l | 92.40 | 57.12 | 25.79 M | 147 |
| YOLOv8-TF | 94.60 | 58.21 | 30.51 M | 130 |
Table 3. The mean average precision (mAP%) was computed for 80 classes using the MS COCO dataset.

| Method | mAP0.5 | mAP0.5:0.95 | Parameters | FPS |
|---|---|---|---|---|
| YOLOv8n | 47.70 | 32.40 | 4.07 M | 242 |
| YOLOv8s | 59.44 | 42.43 | 14.83 M | 216 |
| YOLOv8m | 62.90 | 45.70 | 26.86 M | 190 |
| YOLOv8l | 67.20 | 49.60 | 47.36 M | 153 |
| YOLOv10l | 68.20 | 50.50 | 25.88 M | 162 |
| YOLOv8-TF | 69.50 | 51.80 | 30.54 M | 139 |
Table 4. Ablation analysis. The first four columns indicate which YOLOv8-TF components are enabled (✓).

| Depth Scale | CA Loss | Trans | Wise-IoU v3 Loss | mAP0.5 | mAP0.5:0.95 | Params | FPS |
|---|---|---|---|---|---|---|---|
| ✓ | | | | 81.8 | 53.2 | 26.87 M | 126 |
| ✓ | ✓ | | | 82.8 | 54.8 | 27.12 M | 123 |
| ✓ | ✓ | ✓ | | 86.2 | 59.2 | 30.52 M | 118 |
| ✓ | ✓ | ✓ | ✓ | 87.9 | 61.2 | 30.56 M | 116 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
