Article

HRA-YOLO: An Effective Detection Model for Underwater Fish

School of Mechanical Engineering, Jiangsu University of Science and Technology, Zhenjiang 212100, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3547; https://doi.org/10.3390/electronics13173547
Submission received: 10 August 2024 / Revised: 30 August 2024 / Accepted: 3 September 2024 / Published: 6 September 2024

Abstract

In intelligent fisheries, accurate fish detection is essential for monitoring underwater ecosystems. By using underwater cameras and computer vision technologies to detect fish distribution, timely feedback can be provided to staff, enabling effective fishery management. This paper proposes a lightweight underwater fish detection algorithm based on YOLOv8s, named HRA-YOLO, to meet the demand for a high-precision and lightweight object detection algorithm. First, the lightweight High-Performance GPU Net (HGNetV2) replaces the backbone network of the YOLOv8s model to lower the computational cost and reduce the model size. Second, to enhance the extraction of fish feature information and reduce missed detections, we design a residual attention (RA) module, formed by embedding the efficient multiscale attention (EMA) mechanism at the end of the Dilation-Wise Residual (DWR) module, and adopt it to replace the bottleneck of the YOLOv8s model to increase detection precision. To ensure generality, we establish an underwater fish dataset for the subsequent experiments by collecting data in various waters. Comprehensive experiments on this self-constructed dataset demonstrate that the precision of the HRA-YOLO model improves to 93.1%, surpassing the original YOLOv8s model, while the computational complexity is reduced by 19% (5.4 GFLOPs) and the model size by 25.3% (5.7 MB). Compared with other state-of-the-art detection models, our model also shows superior overall performance. We further perform experiments on other datasets to verify the adaptability of our model; the results on the Fish Market dataset indicate that it outperforms the original model overall and generalizes well.

1. Introduction

Fish are among the most important sources of high-quality protein for humans. Since the twentieth century, aquaculture has made significant contributions to providing a sustainable and nutritionally rich source of animal protein for the world's growing population. According to statistics, global aquaculture production reached a historical high of 122.6 million tons in 2020, accounting for 49.2% of total global aquatic animal production [1]. To meet the ever-growing human need for aquatic products, intelligent management has become the inevitable trend in aquaculture. Within aquaculture, the level of intelligence in fish farming has attracted increasing attention. With intelligent monitoring devices, the distribution of fish schools in different underwater regions can be detected effectively and quickly, which can guide breeders toward accurate feeding, benefiting both fish health and feed savings.
Fish detection methods based on optical vision primarily fall into two categories: traditional image processing methods and deep-learning-based methods. The former were widely adopted for fish detection in earlier work. For example, Spampinato et al. [2] used spatial Gabor filters and co-occurrence matrix properties to extract the texture features of fish and utilized curvature scale space transformation to further delineate shape features to recognize and classify fish species. However, they did not consider the color variations of fish in different environments. To solve this problem, Tharwat et al. [3] used the Weber Local Descriptor (WLD) and color moments to extract texture and color features from fish images and adopted an AdaBoost classifier for fish recognition. Although these methods can effectively identify fish, they rely on manually designed fish features, which may overlook subtle and hidden characteristics not easily noticed by humans. As a result, their detection performance is relatively poor, especially in complex underwater environments. On the contrary, deep-learning-based methods are data driven and extract features automatically, which can significantly improve the precision of fish recognition. Therefore, they have been extensively applied in various aspects of aquaculture in recent years, including feeding detection [4], image segmentation [5], live fish identification [6], and behavior analysis [7]. Deep-learning-based methods are typically of two types, namely, two-stage methods and one-stage methods. Two-stage methods, such as Fast R-CNN [8] and Faster R-CNN [9], detect targets in two steps: generating region proposals and then classifying and refining them. Because of these two separate steps, such methods have slow detection speeds, high complexity, and large model sizes, so they are not suitable for applications that demand high real-time performance on low computing power. One-stage methods regress the target's positional coordinates and class in a single pass, without a separate region proposal stage, significantly increasing detection speed while maintaining precision. Therefore, one-stage detection methods are being applied more and more widely in various detection tasks. Common one-stage object detection methods include RetinaNet [10], SSD (Single-Shot MultiBox Detector) [11], and the YOLO (You Only Look Once) series [12]. Among them, the YOLO series has been widely studied because of its outstanding performance and rapid deployment capabilities. Jalal et al. [13] introduced an integrated solution that incorporates both optical flow and Gaussian mixture models into the YOLO network, achieving a unified method for detecting and classifying fish in unconstrained underwater videos. Although this approach improved detection performance, its increased computational complexity limited deployment on embedded devices. To solve this problem, Zhang et al. [14] built a YOLOv4-based method that combines MobileNet v2 and depthwise separable convolutions to reduce the parameter count and size of the model and incorporated attention mechanisms into the feature fusion module to increase recognition precision. They performed detection experiments on different datasets and obtained satisfactory results, yet missed detections may still occur when only shallow features are available. To overcome this, some researchers have focused on improving multiscale representation capabilities. Li et al. [15] modified YOLOv5 by introducing the Res2Net residual structure and embedding the CA (coordinate attention) mechanism at its end, thus reducing the size of the model while maintaining precision and increasing the representation of multiscale features. Liu et al. [16] addressed the limitations of current detection methods by improving YOLOv7. Unlike previous scholars, they not only adopted lightweight modules but also added dense skip connections between them, improving feature extraction capabilities and network inference speed. However, their method requires more storage space and is not well suited to underwater detectors.
To solve the above-mentioned issues, we propose a new YOLOv8s-based detection model for fish detection in underwater aquaculture. Our main contributions are as follows:
  • Build a fish dataset under different water quality conditions. To ensure practicality in various real environments, fish data were collected in two distinct water quality settings and supplemented with images from a laboratory with clear water and fish images from the Internet. Data augmentation was then applied to the collected data to form the datasets.
  • Construct HRA-YOLO by reshaping the YOLOv8s model. We modify the YOLOv8s model in two ways. One is adopting the lightweight network HGNetV2 as the backbone of YOLOv8s, and the other is using a newly designed module named RA (residual attention) to replace the bottleneck of its C2f module.
  • Perform comprehensive experiments and analyze the experimental results from multiple perspectives.

2. Materials and Methods

2.1. YOLOv8s Model

YOLOv8s [17] is an upgraded model in the YOLO series developed by Ultralytics. It contains three main parts: the backbone, the neck, and the head. The backbone network is made up of CBS convolution modules, C2f (CSPDarknet53 to 2-stage FPN) modules, and an SPPF (spatial pyramid pooling-fast) module. YOLOv8s adopts the C2f module to extract features and regulate the number of channels, which speeds up feature extraction. The neck network consists of upsampling operations (Upsample), concatenation operations (Concat), C2f modules, and CBS convolution modules. It retains the YOLOv5 path aggregation network + feature pyramid network (PAN + FPN) architecture to improve feature integration. The head network is made up of three detection layers that detect features of different scales produced by the neck. Each detection layer employs a decoupled head structure, separating the classification and regression branches.

2.2. HRA-YOLO Model

The YOLOv8s model has made significant progress in the aspect of detection precision and speed and has demonstrated good performance in multiscale prediction. However, there are still deficiencies in its feature extraction capabilities, which limit further improvements in its detection precision. Moreover, given the need to deploy the model on embedded devices, optimizing the model to reduce its dependence on memory and computational resources becomes particularly urgent.
To address these challenges, our goal is to optimize the YOLOv8s model without sacrificing detection speed while ensuring precision. First, we replace the YOLOv8s backbone network with the HGNetV2 feature extraction network. Specifically, we remove all of the C2f and Conv layers in the original backbone network, add stem, HG-Block, and DWConv layers, and connect them in a specific sequence to reduce the computational burden and improve operational efficiency. Second, we design a residual attention (RA) module, specifically embedding the EMA attention mechanism at the end of the Dilation-Wise Residual (DWR) structure. This embedded configuration enables more complex feature extraction within the network, thus increasing detection precision. Finally, all C2f modules in the neck network are improved, turning the C2f modules into residual attention feature extraction (RAFE) modules by replacing the original C2f bottleneck structure with the RA module. Through these methods, we obtain a new lightweight and high-precision network model called HRA-YOLO, as shown in Figure 1.

2.2.1. Lightweight Backbone Network

High-Performance GPU Net (HGNetV2) is a lightweight network developed by the Baidu PaddlePaddle team [18] and serves as the backbone network of the RT-DETR model. The overall structure of HGNetV2 comprises one stem module and four stage modules, as depicted in Figure 2a. The stem module, as the network's preprocessing layer, is made up of standard convolution modules. Each stage module incorporates the core HG-Block structure. Stage 1 contains only one HG-Block, while Stages 2 to 4 each contain one Learnable Down-Sampling (LDS) layer and multiple HG-Blocks. The HG-Block uses several 3 × 3 standard convolution layers to capture a wide variety of features, as shown in Figure 2b. The concatenated features then pass through a 1 × 1 convolution layer for compression, followed by an excitation convolution layer (1 × 1 Conv) for further feature processing, and a connection operation merges and outputs the feature information. The HG-Block is designed to process data hierarchically, allowing the network to progressively extract information from low-level to high-level features during learning. The LDS layer is designed to reduce the spatial dimensions of the feature map and expand the receptive field, using depthwise convolution (DWConv) to perform the convolution operations.
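To make the structure concrete, the following is a minimal PyTorch sketch of an HG-Block as described above; the number of stacked 3 × 3 convolutions, the channel widths, and the shortcut condition are illustrative assumptions rather than the exact PaddlePaddle definition. The ConvBNAct helper defined here is reused in the later sketches.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + ReLU helper used throughout the sketches."""
    def __init__(self, c_in, c_out, k=3, dilation=1, groups=1):
        super().__init__()
        pad = dilation * (k // 2)
        self.conv = nn.Conv2d(c_in, c_out, k, padding=pad,
                              dilation=dilation, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class HGBlock(nn.Module):
    """HG-Block sketch: stacked 3x3 convs, concatenation of the input and all
    intermediate features, 1x1 squeeze and 1x1 excitation convs, and an
    optional identity shortcut."""
    def __init__(self, c_in, c_mid, c_out, n=4, shortcut=True):
        super().__init__()
        self.layers = nn.ModuleList()
        c = c_in
        for _ in range(n):                            # several 3x3 standard convs
            self.layers.append(ConvBNAct(c, c_mid, k=3))
            c = c_mid
        cat_channels = c_in + n * c_mid               # input + every intermediate output
        self.squeeze = ConvBNAct(cat_channels, c_out // 2, k=1)   # 1x1 compression
        self.excite = ConvBNAct(c_out // 2, c_out, k=1)           # 1x1 excitation
        self.shortcut = shortcut and c_in == c_out

    def forward(self, x):
        outs, y = [x], x
        for layer in self.layers:
            y = layer(y)
            outs.append(y)
        y = self.excite(self.squeeze(torch.cat(outs, dim=1)))
        return y + x if self.shortcut else y

# quick shape check with arbitrary sizes
print(HGBlock(64, 32, 64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```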
Depthwise convolution is an efficient convolution operation in convolutional neural networks. It significantly reduces computation and the number of parameters by performing convolution operations on each channel individually, making it suitable for resource-constrained environments. Assuming that the input channel count is $C_1$, the output channel count is $C_2$, the convolution kernel size is $K \times K$, and the feature map size is $H \times W$, the parameter count $P_{DW}$ and computational cost $F_{DW}$ of depthwise convolution are given by
$$P_{DW} = C_1 \times K \times K \quad (1)$$
$$F_{DW} = C_1 \times H \times W \times K \times K \quad (2)$$
The parameter count P and computational cost F for standard convolution are
$$P = C_1 \times C_2 \times K \times K \quad (3)$$
$$F = C_1 \times C_2 \times H \times W \times K \times K \quad (4)$$
According to Formulas (1) to (4), we can obtain
$$\frac{F_{DW}}{F} = \frac{P_{DW}}{P} = \frac{1}{C_2} \quad (5)$$
Since $C_2$ is greater than one, $F_{DW} < F$ and $P_{DW} < P$; that is, the parameter count and computational cost of DWConv are significantly lower than those of standard convolution, making it suitable for lightweight networks.
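As a quick sanity check of Formulas (1)-(5), the snippet below counts the parameters of a standard and a depthwise 3 × 3 convolution for an arbitrary, illustrative channel configuration (the numbers are not taken from the paper).

```python
import torch.nn as nn

# Illustrative configuration: C1 = 64 input channels, C2 = 128 output channels,
# 3x3 kernels. Any values would do; these are not from the paper.
c1, c2, k = 64, 128, 3

standard = nn.Conv2d(c1, c2, kernel_size=k, padding=1, bias=False)
depthwise = nn.Conv2d(c1, c1, kernel_size=k, padding=1, groups=c1, bias=False)

p_std = sum(p.numel() for p in standard.parameters())    # C1 * C2 * K * K = 73,728
p_dw = sum(p.numel() for p in depthwise.parameters())     # C1 * K * K = 576

print(p_std, p_dw, p_dw / p_std)  # ratio is 1/C2 = 1/128, as in Formula (5)
```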

2.2.2. Dilation-Wise Residual Module (DWR)

The Dilation-Wise Residual (DWR) module, proposed by Wei et al. [19], is an efficient two-step method for acquiring multiscale contextual information. It employs a dilated residual structure with a multi-branch configuration, where each branch uses a dilation depthwise convolution with varying dilation rates. The overall structure, as shown in Figure 3, consists mainly of regional and semantic residualization. In regional residualization, regional residual features are first generated by a standard 3 × 3 convolution layer, batch normalization (BN) layer, and the activation function ReLU, and a series of simplified feature maps with diverse sizes are obtained. These feature maps serve as the input for subsequent morphological filtering. Then, multirate expanded depthwise convolution (D-n 3 × 3 DConv) is employed to perform morphological filtering on these regional features of varied sizes to realize semantic residualization. Finally, all feature maps are merged and undergo BN, followed by pointwise convolution (1 × 1 Conv) to combine features to obtain the final residual. The final residual is merged with the initial input features to obtain a more comprehensive representation of the features.
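The PyTorch sketch below mirrors this description, reusing the ConvBNAct helper from the HG-Block sketch in Section 2.2.1; the dilation rates and the choice to keep every branch at the full channel width are simplifying assumptions, and the original DWRSeg module differs in such details.

```python
import torch
import torch.nn as nn

class DWR(nn.Module):
    """Dilation-Wise Residual sketch: regional residualization (3x3 Conv + BN +
    ReLU), multi-rate dilated depthwise convs for semantic residualization,
    BN + 1x1 pointwise fusion, and a residual connection to the input."""
    def __init__(self, c, dilations=(1, 3, 5)):
        super().__init__()
        self.region = ConvBNAct(c, c, k=3)  # helper from the HG-Block sketch
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d, groups=c, bias=False)
            for d in dilations
        )
        self.bn = nn.BatchNorm2d(c * len(dilations))
        self.fuse = nn.Conv2d(c * len(dilations), c, 1, bias=False)  # 1x1 Conv

    def forward(self, x):
        r = self.region(x)                                   # regional features
        feats = torch.cat([b(r) for b in self.branches], 1)  # morphological filtering
        return self.fuse(self.bn(feats)) + x                 # merge with the input
```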

2.2.3. Residual Attention (RA) Structure

The efficient multiscale attention module (EMA) [20] is a novel mechanism of attention that emphasizes the importance of interactions between spatial positions. It employs parallel substructures to lessen the number of layers in the network, re-encodes the global information, calibrates the channel weights in each parallel branch, and uses cross-space interaction methods to aggregate the output features of all the parallel branches, resulting in enhanced pixel-level attention for high-level feature maps.
The EMA module, as shown in Figure 4b, utilizes three branching paths, two 1 × 1 branches and one 3 × 3 branch, to extract attention weights from grouped feature maps. The input $X$ of size $C \times H \times W$ is divided into $G$ subfeatures, expressed as $X = [X_0, X_1, \ldots, X_{G-1}]$, with $X_i \in \mathbb{R}^{C/G \times H \times W}$. In the 1 × 1 branches, two one-dimensional global average pooling operations are applied along the x and y directions to encode the channels, establishing cross-channel information interaction. The two encoded features are then concatenated along the height direction and merged by a shared 1 × 1 convolution layer, generating two vectors along the H and W directions, which are subjected to nonlinear fitting using the sigmoid activation function. After re-weighted adaptive feature selection, the results of the two 1 × 1 branches are merged. In the 3 × 3 branch, a single 3 × 3 convolution is employed to extract multiscale features, which serves as the branch output. Cross-spatial learning then proceeds in two steps. First, two-dimensional global average pooling encodes the global information of the 1 × 1 branch output, which is activated with the Softmax function and multiplied pointwise with the output of the 3 × 3 branch to create the first spatial attention map. Second, two-dimensional global average pooling and Softmax activation are applied to the 3 × 3 branch output, which is then multiplied pointwise with the group-normalized output of the 1 × 1 branch to produce the second spatial attention map. Finally, these two spatial attention maps are merged and processed through the sigmoid function and re-weighted adaptive feature selection to obtain global contextual information. The formula for two-dimensional global average pooling is
$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j) \quad (6)$$
Here, $H$ and $W$ represent the height and width of the feature map, respectively, and $x_c(i, j)$ denotes the value of the element located in the i-th row and j-th column of the c-th channel of the feature map.
Although the DWR network can both increase the efficiency of multiscale information capture and reduce the computational load, it may decrease the detection precision. To accurately gain feature information and increase detection precision without increasing the overall size of the model, the RA module is developed by inserting the multiscale EMA attention module after the 1 × 1 Conv layer in the DWR module, as illustrated in Figure 4a. Figure 4b shows the details of the EMA module.
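A sketch of the RA module is given below. The EMA class follows the commonly circulated reference implementation of the efficient multiscale attention module [20] (the grouping factor is an assumption and must divide the channel count), DWR is the sketch from Section 2.2.2, and RA simply applies EMA to the DWR output, as in Figure 4a.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient multiscale attention (sketch after the reference code of [20])."""
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))          # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # average over width -> per-row descriptor
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # average over height -> per-column descriptor
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)
        # 1x1 branches: directional pooling, shared 1x1 conv, sigmoid re-weighting
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: multiscale features
        x2 = self.conv3x3(g)
        # cross-spatial learning: two spatial attention maps, merged and re-weighted
        a1 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        w1 = torch.matmul(a1, x2.reshape(b * self.groups, c // self.groups, -1))
        w2 = torch.matmul(a2, x1.reshape(b * self.groups, c // self.groups, -1))
        weights = (w1 + w2).reshape(b * self.groups, 1, h, w).sigmoid()
        return (g * weights).reshape(b, c, h, w)

class RA(nn.Module):
    """Residual attention: DWR (sketch from Section 2.2.2) followed by EMA."""
    def __init__(self, c):
        super().__init__()
        self.dwr = DWR(c)
        self.ema = EMA(c)

    def forward(self, x):
        return self.ema(self.dwr(x))
```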

2.2.4. Residual Attention Feature Extraction Module (RAFE)

To increase the detection performance of the presented model, this article reconstructs the YOLOv8s C2f module. The bottleneck structure of the C2f module has certain limitations in extracting feature information from fish objects, and its efficiency in capturing contextual information also needs to be improved. To overcome these shortcomings, we first choose DWR dilated residual modules to replace all bottleneck structures of the C2f module. Then, all DWR modules are replaced with the RA residual attention structure to create the Residual Attention Feature Extraction module (RAFE), illustrated in Figure 5.
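The RAFE module can then be sketched as a C2f-style block whose bottlenecks are RA modules; the split-and-append pattern below follows the general C2f design, while the hidden-channel ratio and the number of RA blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RAFE(nn.Module):
    """C2f-style block with RA modules (Section 2.2.3) in place of bottlenecks:
    a 1x1 conv splits the features, n RA blocks are applied sequentially with
    every intermediate output kept, and a final 1x1 conv fuses the concatenation.
    Note: c_out // 2 must be divisible by the EMA grouping factor."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * c_hidden, 1, bias=False)
        self.blocks = nn.ModuleList(RA(c_hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * c_hidden, c_out, 1, bias=False)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))
```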

3. Experimental Results and Discussion

3.1. Evaluation Metrics and Experimental Environment

It is essential to evaluate the effectiveness and efficiency of the proposed model. In this paper, we employ precision (P), recall (R), mean average precision (mAP), model size (MB), and FLOPs as evaluation metrics [21]. P refers to the proportion of true-positive samples in the samples predicted as positive. R indicates the proportion of true-positive samples that are correctly predicted. mAP is the average of the average precision across all categories. Model size refers to the amount of storage space occupied by a deep learning model. FLOPs measure the model’s complexity by the number of floating-point operations.
The formulas for three of the evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP} \times 100\% \quad (7)$$
$$R = \frac{TP}{TP + FN} \times 100\% \quad (8)$$
$$mAP = \frac{1}{K} \sum_{k=1}^{K} \int_{0}^{1} P(R)\,dR \quad (9)$$
In Formulas (7) and (8), TP indicates the number of positive samples correctly classified as positive by the model, FP denotes the number of negative samples incorrectly classified as positive, and FN denotes the number of positive samples incorrectly classified as negative. In Formula (9), K is the number of object categories to be detected, and P(R) is the precision-recall curve.
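As a small illustration of how these metrics can be computed, the sketch below derives precision and recall from detection counts and approximates the average precision of a single class by integrating a precision-recall curve with the standard all-points interpolation; the counts are made-up numbers, not results from the paper.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts (Formulas (7) and (8))."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under a precision-recall curve using all-points interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Illustrative counts for one class: 90 true positives, 10 false positives,
# 12 false negatives. mAP is the mean of average_precision over all K classes.
print(precision_recall(tp=90, fp=10, fn=12))        # (0.9, 0.882...)
```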
The experimental configuration for this paper is as follows: the Windows 10 operating system, an NVIDIA GeForce RTX 3060 graphics card with 12 GB of video memory, an Intel Core i5-10400F processor at 2.90 GHz, 16 GB of system RAM, Python 3.9 used as the programming language, and PyTorch 2.0.1 used as the deep learning framework. The acceleration environment includes CUDA 11.8 and CUDNN 8.9.2.
The hyperparameters of the model are configured as follows: the initial learning rate is set to 0.01, the cyclical learning rate is 0.01, and the weight decay coefficient is 0.0005. The batch size is set to 32, and the number of training epochs is set to 300. The input image size is configured to 640 × 640 pixels. The model is optimized using the SGD (Stochastic Gradient Descent) optimization algorithm. Other hyperparameters not mentioned here are set to their default values.
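For orientation only, the hyperparameters above correspond roughly to an Ultralytics training call such as the one sketched below; the model and dataset YAML file names are placeholders, and a custom architecture definition for HRA-YOLO is assumed to exist.

```python
from ultralytics import YOLO

# Hypothetical training call mirroring the reported settings; "hra-yolo.yaml"
# and "fish.yaml" are placeholder files, not artifacts shipped with the paper.
model = YOLO("hra-yolo.yaml")      # custom model definition (assumed)
model.train(
    data="fish.yaml",              # self-constructed fish dataset (placeholder)
    epochs=300,
    imgsz=640,
    batch=32,
    lr0=0.01,                      # initial learning rate
    lrf=0.01,                      # final learning rate factor
    weight_decay=0.0005,
    optimizer="SGD",
)
```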

3.2. Production of Experimental Data

3.2.1. Data Collection and Annotation

The quality of the dataset affects the practicality of the model. Therefore, we collected images from various environments to construct a high-quality fish dataset. We collected data at the Zhangsi Reservoir in our city (Figure 6a) and the Donggu Reservoir on our campus (Figure 6b) using underwater monitoring equipment (model HK90) produced by Shenzhen Haxtech (Figure 6c). The equipment has a frame rate of 25 FPS and an image resolution of 1080 × 1920 pixels. Taking into account the impact of varying light conditions at different times, we collected data at 9 a.m., 1 p.m., and 5 p.m. Beijing time. A fixed amount of bait was used to attract fish each time, the feeding process lasted for 60 min, and image capture started 10 min after feeding. The captured video data were stored in AVI format on a storage card and imported into a computer for frame-by-frame image extraction. The intense swimming of the fish at the beginning of bait feeding resulted in unusable blurred images, which were deleted from the dataset. In total, 2608 fish images were retained from the two environments. From these, 600 images were set aside as part of the test dataset, and the remaining 2008 images constitute the original dataset; there is no information overlap between the two.
The diversity of data is crucial for the performance of the model. To avoid model overfitting caused by a single sample during the training process, we additionally included two sets of data. One comprises 220 fish images sourced from the Internet, while the other consists of fish images captured in a laboratory setup with an experimental water tank. As illustrated in Figure 7, the setup of the experimental aquarium includes a tank, an external fish tank filter, two light sources, and underwater detection equipment (see Figure 6c). The tank is 2 m long, 1 m wide, and 0.7 m high. The camera is positioned at the edge of the tank, angled 30 degrees towards the bottom to minimize the impact of direct light. The tank houses ten living fish. During the experiment, video data of the fish in the tank were collected at 9 a.m., 1 p.m., and 5 p.m. Beijing time. The captured video data were stored in AVI format on a memory card and then transferred to a computer, where blurry images were removed, resulting in 902 clear images. Additionally, 20 images were selected from the 220 Internet images and 30 images from the 902 experimental aquarium images as another part of the test set.
Examples of images from environment one, environment two, the laboratory, and the Internet are shown in Figure 8. A total of 3080 original image samples and 650 test images were acquired. LabelImg v1.8.6 was used to annotate all targets in the image samples with rectangular labels, producing text labels in a TXT format readable by HRA-YOLO to form the original datasets.
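For reference, each annotation file in this format holds one line per fish, with the class index followed by the box center and size normalized to [0, 1]; a minimal parser is sketched below (the file path is a hypothetical example).

```python
from pathlib import Path

def load_yolo_labels(label_file: str):
    """Parse a YOLO-format .txt label file: class x_center y_center width height."""
    boxes = []
    for line in Path(label_file).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes

print(load_yolo_labels("datasets/fish/labels/img_0001.txt"))  # hypothetical path
```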

3.2.2. Offline Data Augmentation

The quantity of data samples significantly influences the detection precision, and utilizing data augmentation techniques can enhance the model’s generalization capability. Therefore, in this study, we employ YOLOv8s as the base model and use offline data augmentation techniques to augment the original dataset obtained earlier. The augmented images are then applied to both the training and validation datasets. Different offline data augmentation methods may yield varying effects on the model. Thus, this paper designs experiments to compare different data augmentation methods.
Offline data augmentation methods include geometric transformations, photometric changes, and intensity transformations [22]. Given the minimal variation in illumination among images in small-scale datasets, relying solely on illumination changes offers limited benefit. Consequently, this paper groups photometric changes with intensity transformations for a comprehensive comparison with the other methods. The geometric transformations explored include random cropping, translation, random rotation, and mirroring, and the intensity transformations involve adjustments to brightness, the addition of random noise, cutout, and random erasure. We selected 250 images from environment one, environment two, and the laboratory images, applied the different data augmentation methods, and expanded the datasets fourfold for analytical comparison. The experimental results are shown in Table 1.
Table 1 shows that, regardless of the data augmentation method employed, the augmented datasets show apparent improvements in P, R, and mAP. The geometric transformation method obtains 0.3% higher precision and 0.5% higher mAP than the intensity transformation method, but it shows a 0.5% lower recall rate. The combined use of intensity and geometric transformations yields the least favorable results.
From the analysis mentioned above, we can conclude that adopting geometric transformation methods is the best choice when augmenting the underwater fish datasets offline with a small number of images. Therefore, we used geometric transformation methods to augment the original datasets in this study.
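One possible offline implementation of the chosen geometric transformations (random cropping, translation, random rotation, and mirroring) is sketched below using the albumentations library; the probabilities, crop size, and rotation limit are illustrative values rather than the settings used in this study.

```python
import albumentations as A

# Geometric augmentation pipeline that also remaps YOLO-format bounding boxes.
geometric_aug = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                                 # mirroring
        A.Rotate(limit=15, p=0.5),                               # random rotation
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=0, p=0.5),               # translation only
        A.RandomCrop(height=576, width=576, p=0.3),              # random cropping
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage (image as a NumPy array, boxes in YOLO format):
# out = geometric_aug(image=img, bboxes=boxes, class_labels=labels)
```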
So far, a total of 3262 training images, 818 validation images, and 650 test images have been obtained, which constitute the final fish datasets.

3.3. Experimental Results of the HRA-YOLO Model

Experimental evaluations of the improved model were performed on the self-constructed datasets. Figure 9 illustrates the changes in mAP and loss during the training process of the improved model. As Figure 9 shows, the loss decreases rapidly during the initial stages as a result of the high learning rate. As the number of training iterations increases, the loss curve flattens, indicating convergence. Similarly, mAP increases quickly in the initial stages; around 190 epochs, training stabilizes and mAP essentially plateaus without declining. The training curves demonstrate the stability of the improved model's performance, with no signs of overfitting.
To further evaluate the performance of the improved model in detecting fish in real-world environments, we tested the model using images from the test dataset. Since the self-constructed fish dataset was collected in actual environments, the sizes of the fish vary, ensuring the reliability of the model. The detection results are shown in Figure 10. As can be seen in Figure 10, the model can accurately identify multiple targets under different fish sizes (a, b, c, d, e, f), various occlusion conditions (a, b), clear conditions (c), blurry conditions (d), different brightness levels (e, f), laboratory environments (g), and network images (h). This indicates that the improved model has good generalization.
The training results of the improved and original models are shown in Table 2. Compared to the original model, the improved model increases precision and mAP by 1% and 0.1%, respectively. Furthermore, there is a significant reduction in computational complexity, as indicated by a 19% decrease (5.4 G) in FLOPs, a 26% decrease (approximately 2.9 M) in the number of parameters, and a 25.3% decrease (5.7 MB) in model size.
To demonstrate changes in the target region of interest (ROI), this article uses the Grad-CAM algorithm [23] to visualize the effects of feature extraction with heatmaps, as illustrated in Figure 11. In the heatmap, warmer colors (e.g., red) indicate areas of higher attention, while cooler colors (e.g., blue) represent areas of lower attention.
In Figure 11, we can see that the ROIs are blurred due to background interference with YOLOv8s. On the contrary, there is a noticeable decrease in attention to areas without fish and an increase in ROIs with our model. This indicates that the proposed model can reduce interference from irrelevant backgrounds and effectively capture multiscale contextual information, thereby focusing on fish and improving the precision of fish detection.
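For readers who want to reproduce this kind of visualization, the snippet below computes a minimal, generic Grad-CAM map with forward and backward hooks; it is demonstrated on a torchvision classifier purely so the example stays self-contained, not on the detector used in this paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()          # stand-in network for illustration
target_layer = model.layer4[-1]                # last convolutional block

activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                # dummy input image
scores = model(x)
scores[0, scores.argmax()].backward()          # gradient of the top class score

weights = gradients["g"].mean(dim=(2, 3), keepdim=True)   # channel importance
cam = F.relu((weights * activations["a"]).sum(dim=1))     # weighted activation map
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```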

3.4. Comparison of Different Attention Mechanisms

To verify that embedding the EMA attention mechanism into the DWR module is the optimal choice, we performed experiments using different attention mechanisms under the same replacement method and experimental conditions, including CA [24], the NAM (normalization-based attention module) [25], the SimAM (simple attention module) [26], and the EMA [20]. The results are shown in Table 3.
The experimental results demonstrate that incorporating different attention mechanisms into the DWR module can improve various metrics of the model. Specifically, including EMA improves the model's precision to the greatest extent, whereas adding SimAM leads to a decrease in precision. Regarding the recall rate, SimAM achieves the highest increase, but including CA results in a lower recall rate. In terms of the mAP value, the gains from adding SimAM and EMA are the largest. The CA mechanism contributes the least to the overall improvement in model performance, while NAM, SimAM, and EMA improve it markedly. Compared to using the DWR module alone, integrating the EMA mechanism achieves the highest precision improvement, with an increase of 1.3%; additionally, the recall rate increases by 0.4% and mAP improves by 0.8%, giving the best overall performance. These comparative experiments show that combining the EMA attention mechanism with the DWR module has advantages over the other attention mechanisms on the custom fish datasets.

3.5. Ablation Experiments

To evaluate the effectiveness of incorporating various improvement modules, we designed several groups of ablation experiments based on the YOLOv8s model by sequentially integrating different improvement modules, as shown in Table 4.
From the results of Experiment 2, replacing the model's backbone network with the lightweight HGNetV2 network yields an increase in precision, a slight decline in mAP, and a substantial reduction in FLOPs, indicating that HGNetV2 can significantly reduce the model's parameter count while ensuring precision. Comparing Experiments 1 and 3 and Experiments 2 and 5, substituting the DWR module for the C2f bottleneck structure decreases the computational load and increases the recall rate but reduces both precision and mAP, showing that this substitution further lightens the model at the cost of overall performance. Furthermore, comparing Experiments 3 and 4 and Experiments 5 and 6, replacing all DWR modules with RAFE modules increases FLOPs by only 0.3 G while significantly increasing precision, recall, and mAP. This demonstrates that embedding the EMA mechanism at the end of the DWR modules indeed allows the network to capture feature information at more diverse scales. Overall, incorporating all the improvement modules reduces FLOPs to 23.0 G, with precision and mAP reaching 93.1% and 94.5%, respectively. In summary, the ablation experiments prove that HRA-YOLO can effectively improve detection precision while simultaneously reducing computational costs.

3.6. Comparison of Different Object Detection Models

To test the superiority of the proposed model, we compared it with other state-of-the-art detection models, including RT-DETR-L [18], YOLOv7-tiny [27], YOLOv5s [28], SSD [11], EfficientDet [29], RC-YOLOv5 [15], YOLOv9s [30], and YOLOv10s [31]. The experimental results are presented in Table 5.
Table 5 shows a significant overall gap between RT-DETR-L and our model. YOLOv7-tiny is very close to our model in mAP and even exceeds it in recall rate and speed, but it trails substantially in precision. The YOLO series models generally outperform the other models in precision and mAP. Although YOLOv5s is faster than our model, its mAP is the lowest among the four YOLO models. RC-YOLOv5 achieves the highest speed but is lower than our model in the other three metrics. YOLOv9s has the highest recall rate and mAP among all the detection models, but its other metrics are inferior to our model's, and all metrics of YOLOv10s fall below our model's. Among the nine detection models, SSD shows the lowest precision and recall rate. Compared to our model, EfficientDet shows significant differences in mAP and FPS.
Detecting targets accurately is the most crucial task of the model. Our model has the highest precision, and although some other metrics are slightly lower than those of individual models, considering precision, detection speed, and computational complexity together, our model offers the best overall performance. It achieves a detection speed of 103.3 FPS, meeting the detection requirements.
The performance of the model is mainly influenced by three key factors: precision, speed, and FLOPs. As shown in Figure 12, the horizontal axis represents speed, the vertical axis represents precision, and the size of each circle indicates the magnitude of the FLOPs.
As illustrated in Figure 12, our model demonstrates its effectiveness in detection speed, computational load, and precision when balancing trade-offs between model precision and computational resources. This makes it suitable for practical deployment in environments that require high precision but have limited hardware resources.

3.7. Cross-Dataset Validation

The HRA-YOLO model performed well on the self-constructed fish dataset, demonstrating the effectiveness of our work. To further verify its performance, cross-dataset validation was performed using the Fish Market dataset [32]. The Fish Market dataset includes 19 species of fish, with a total of 16,859 fish images. As shown in Figure 13, these 19 species are aair, boal, chapila, deshi puti, foli, ilish, kal baush, katla, koi, magur, mrigel, pabda, pangas, puti, rui, shol, taki, tara baim, and telapiya. The dataset is divided into a training set (12,474 images), a validation set (3106 images), and a test set (1279 images). The categories and distribution of the Fish Market dataset differ from those of the self-constructed dataset, which effectively tests the adaptability of the HRA-YOLO model to different ecological environments.
Table 6 shows that the HRA-YOLO model outperforms the YOLOv8s model on the Fish Market dataset. Specifically, it improves precision by 0.6% and mAP by 0.4%, while it reduces the number of parameters by approximately 2.9 M. Moreover, HRA-YOLO maintains high detection performance despite the reduced number of parameters. Although the speed decreases (by 19.3 FPS), the improvements in precision and mAP indicate greater effectiveness in detection tasks, and this advantage becomes particularly evident when dealing with diverse data. The experimental results also demonstrate HRA-YOLO’s strong generalization capability. Therefore, the structural design of HRA-YOLO is considered both reasonable and successful.

3.8. Missed Detection Analysis

The HRA-YOLO model can accurately identify underwater fish, but due to complex underwater environments, factors such as blurring, occlusion, and small target size will lead to missed detections in some images, as shown in Figure 14.
In Figure 14b, the circled areas represent the missed fish. The likely cause is the rapid turning movements of the fish, where the tail obscures the head, creating these blurred areas and reducing the feature information, thus preventing the model from recognizing the target fish. However, in the frames before (Figure 14a) and after (Figure 14c) the missed detection, the target fish is accurately identified. Therefore, the occasional missed detection of individual fish will not affect the final detection performance. To further reduce occurrences of mid-frame detection misses, future research will focus on increasing the number of similar images in the dataset and using underwater image processing algorithms.

4. Conclusions

Deploying a high-precision, lightweight fish detection algorithm on underwater monitoring equipment with limited computational resources is a challenging task. In this study, we propose a lightweight fish detection model, HRA-YOLO, based on YOLOv8s to address these challenges. First, we collected fish data from different environments and augmented the dataset with geometric transformations to obtain our self-constructed dataset. Then, we reshaped the structure of YOLOv8s to form the HRA-YOLO model. We used HGNetV2 to replace the backbone network of the YOLOv8s model and designed a residual attention (RA) module to improve the original C2f structure. Finally, we carried out comprehensive experiments to verify our work.
Experiments carried out on our self-constructed fish dataset showed that HRA-YOLO improved precision by 1% and mAP by 0.1% and reduced model parameters by 26%. The embedded attention experiments verified the superiority of combining the DWR structure with the EMA attention mechanism. Ablation experiments demonstrated the effectiveness of each improvement. Compared to current mainstream detection algorithms, HRA-YOLO showed optimal performance in terms of detection precision, computational load, and detection speed. Additionally, cross-dataset experiments on the Fish Market dataset revealed that HRA-YOLO improved precision by 0.6% and mAP by 0.4%, indicating the model’s excellent generalization and robustness.
In summary, the proposed HRA-YOLO detection model achieved outstanding performance in underwater fish detection tasks. It provides an effective solution for the detection of underwater fish.
However, the detection speed of the HRA-YOLO model still lags behind the most advanced models, and there is still room for further improvement in reducing the number of model parameters. In the future, we will focus on the following areas:
  • Explore more efficient network structures and optimization algorithms to further enhance the real-time data processing speed of the model.
  • Investigate underwater image enhancement algorithms and data augmentation methods to reduce false detection rates, improve detection precision, and enhance the model generalization ability.
  • Research knowledge distillation and model pruning techniques to further reduce the model size while maintaining precision.
  • Further study cross-domain transfer learning and multimodal fusion techniques to improve the model’s adaptability in environments where the visible spectrum is unreliable, in addition to other practical scenarios.

Author Contributions

H.W.: conceptualization, methodology, writing—review and editing, supervision, funding acquisition. J.Z.: software, validation, formal analysis, investigation, writing—original draft. H.C.: data curation, formal analysis, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Access to the experimental data presented in this article can be obtained by contacting the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. FAO. The State of World Fisheries and Aquaculture 2022. Towards Blue Transformation; FAO: Rome, Italy, 2022; pp. 1–15. [Google Scholar] [CrossRef]
  2. Spampinato, C.; Giordano, D.; Salvo, R.D.; Chen-Burger, Y.-H.J.; Fisher, R.B.; Nadarajan, G. Automatic fish classification for underwater species behavior understanding. In Proceedings of the First ACM International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams, Firenze, Italy, 29 October 2010; pp. 45–50. [Google Scholar] [CrossRef]
  3. Tharwat, A.; Hemedan, A.A.; Hassanien, A.E.; Gabel, T. A biometric-based model for fish species classification. Fish. Res. 2018, 204, 324–336. [Google Scholar] [CrossRef]
  4. Xu, C.; Wang, Z.; Du, R.; Li, Y.; Li, D.; Chen, Y.; Li, W.; Liu, C. A method for detecting uneaten feed based on improved YOLOv5. Comput. Electron. Agric. 2023, 212, 108101. [Google Scholar] [CrossRef]
  5. Fernandes, A.F.A.; Turra, E.M.; de Alvarenga, É.R.; Passafaro, T.L.; Lopes, F.B.; Alves, G.F.O.; Singh, V.; Rosa, G.J.M. Deep Learning image segmentation for extraction of fish body measurements and prediction of body weight and carcass traits in Nile tilapia. Comput. Electron. Agric. 2020, 170, 105274. [Google Scholar] [CrossRef]
  6. Xu, X.; Li, W.; Duan, Q. Transfer learning and SE-ResNet152 networks-based for small-scale unbalanced fish species identification. Comput. Electron. Agric. 2021, 180, 105878. [Google Scholar] [CrossRef]
  7. Chen, L.; Yin, X. Recognition Method of Abnormal Behavior of Marine Fish Swarm Based on In-Depth Learning Network Model. J. Web Eng. 2021, 20, 575–596. [Google Scholar] [CrossRef]
  8. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  13. Jalal, A.; Salman, A.; Mian, A.; Shortis, M.; Shafait, F. Fish detection and species classification in underwater environments using deep learning with temporal information. Ecol. Inform. 2020, 57, 101088. [Google Scholar] [CrossRef]
  14. Zhang, M.; Xu, S.; Song, W.; He, Q.; Wei, Q. Lightweight Underwater Object Detection Based on YOLO v4 and Multi-Scale Attentional Feature Fusion. Remote Sens. 2021, 13, 4706. [Google Scholar] [CrossRef]
  15. Li, L.; Shi, G.; Jiang, T. Fish detection method based on improved YOLOv5. Aquac. Int. 2023, 31, 2513–2530. [Google Scholar] [CrossRef]
  16. Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater Target Detection Based on Improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
  17. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 January 2024).
  18. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar] [CrossRef]
  19. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation. arXiv 2022, arXiv:2212.01173. [Google Scholar] [CrossRef]
  20. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  21. Bono, F.M.; Radicioni, L.; Cinquemani, S. A novel approach for quality control of automated production lines working under highly inconsistent conditions. Eng. Appl. Artif. Intell. 2023, 122, 106149. [Google Scholar] [CrossRef]
  22. Khalifa, N.E.; Loey, M.; Mirjalili, S. A comprehensive survey of recent trends in deep learning for digital images augmentation. Artif. Intell. Rev. 2022, 55, 2351–2377. [Google Scholar] [CrossRef] [PubMed]
  23. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  24. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  25. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based Attention Module. arXiv 2021, arXiv:2111.12419. [Google Scholar] [CrossRef]
  26. Yang, L.; Zhang, R.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 11863–11874. [Google Scholar]
  27. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  28. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J. YOLOv5 by Ultralytics, Version 7.0. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 2 March 2024).
  29. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  30. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  31. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  32. Roboflow100. Fish Market Dataset. 2023. Available online: https://universe.roboflow.com/roboflow-100/fish-market-ggjso (accessed on 20 August 2024).
Figure 1. The architecture of the HRA-YOLO model.
Figure 2. HGNetV2 and its key structure: (a) the structure of HGNetV2; (b) the structure of the HG-block.
Figure 3. DWR module. Note: c denotes the base number of channels in the feature map; Conv represents convolution; DConv means depthwise convolution; D-n represents dilated convolution with a dilation rate of n.
Figure 4. RA module and its key components: (a) RA module; (b) EMA module.
Figure 5. C2f module (left), RAFE module (right).
Figure 6. Data acquisition: (a) environment one; (b) environment two; (c) collection equipment.
Figure 7. Schematic diagram of the experimental water tank.
Figure 8. Image samples from different environments: (a) environment one image; (b) environment two image; (c) laboratory image; (d) Internet image.
Figure 9. Comparison curves of parameters before and after model improvement.
Figure 10. Detection results of the HRA-YOLO model.
Figure 11. Comparison of heatmaps before and after model improvement: (a) scene one; (b) scene two; (c) scene three.
Figure 12. Comprehensive comparison of results from different detection models.
Figure 13. The Fish Market dataset: (a) instance distribution; (b) instance size distribution.
Figure 14. Continuous frame detection effect: (a) result of the previous frame; (b) result of the middle frame; (c) result of the next frame.
Table 1. Comparative results of different data augmentation methods.
| Type of Datasets | Number of Images | Precision/% | Recall/% | mAP/% |
|---|---|---|---|---|
| Original | 3080 | 91.1 | 86.8 | 92.3 |
| Original + Intensity Transformation | 4080 | 91.8 | 89.0 | 93.9 |
| Original + Geometric Transformation | 4080 | 92.1 | 88.5 | 94.4 |
| Original + Both Intensity and Geometric Transformation | 4080 | 91.8 | 87.4 | 93.5 |
Table 2. Comparative results of the evaluation metrics.
| Model | Precision/% | Recall/% | mAP/% | FLOPs/G | Parameters/M | Speed/FPS | Model Size/MB |
|---|---|---|---|---|---|---|---|
| YOLOv8s | 92.1 | 88.5 | 94.4 | 28.4 | 11.125971 | 124.6 | 22.5 |
| HRA-YOLO | 93.1 | 88.3 | 94.5 | 23.0 | 8.225795 | 103.3 | 16.8 |
Table 3. Comparative results of fusion experiments with different attention mechanisms.
| Model | Precision/% | Recall/% | mAP/% |
|---|---|---|---|
| DWR (no mechanism) | 91.8 | 87.9 | 93.7 |
| DWR + CA | 92.7 | 86.1 | 93.0 |
| DWR + NAM | 92.1 | 88.0 | 93.7 |
| DWR + SimAM | 91.5 | 88.7 | 94.5 |
| DWR + EMA | 93.1 | 88.3 | 94.5 |
Table 4. Results of ablation experiments.
| Experiments | YOLOv8s | HGNetV2 | DWR | RAFE | Precision/% | Recall/% | mAP/% | FLOPs/G |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | 92.1 | 88.5 | 94.4 | 28.4 |
| 2 | ✓ | ✓ | | | 92.4 | 87.6 | 94.2 | 23.3 |
| 3 | ✓ | | ✓ | | 91.7 | 88.7 | 93.9 | 27.8 |
| 4 | ✓ | | | ✓ | 92.3 | 89.1 | 94.2 | 28.1 |
| 5 | ✓ | ✓ | ✓ | | 91.8 | 87.9 | 93.7 | 22.7 |
| 6 | ✓ | ✓ | | ✓ | 93.1 | 88.3 | 94.5 | 23.0 |
Table 5. Results of nine different object detection models.
| Model | Precision/% | Recall/% | mAP/% | Speed/FPS | FLOPs/G |
|---|---|---|---|---|---|
| SSD | 89.3 | 81.9 | 92.0 | 28.7 | 84.1 |
| EfficientDet | 91.0 | 83.4 | 90.4 | 18.7 | 19.2 |
| RT-DETR-L | 91.2 | 88.3 | 93.2 | 57.6 | 100.6 |
| YOLOv5s | 91.3 | 87.5 | 93.4 | 138.5 | 15.8 |
| RC-YOLOv5 | 92.0 | 87.3 | 93.8 | 143.6 | 12.6 |
| YOLOv7-tiny | 91.1 | 88.4 | 94.0 | 133.3 | 13.2 |
| YOLOv9s | 91.7 | 89.7 | 94.7 | 87.6 | 26.7 |
| YOLOv10s | 91.9 | 87.8 | 93.9 | 100.7 | 24.4 |
| HRA-YOLO | 93.1 | 88.3 | 94.5 | 103.3 | 23.0 |
Table 6. Performance comparison of YOLOv8s and HRA-YOLO based on the Fish Market dataset.
| Model | Precision/% | Recall/% | mAP/% | Parameters/M | Speed/FPS |
|---|---|---|---|---|---|
| YOLOv8s | 98.9 | 99.1 | 99.2 | 11.132937 | 122.2 |
| HRA-YOLO | 99.5 | 99.0 | 99.6 | 8.232761 | 102.9 |