1. Introduction
The report by the Food and Agriculture Organization, titled “Outlook for Fisheries and Aquaculture 2030”, forecasts that aquaculture production will reach 100 million tons by 2027, achieving this milestone for the first time, and rise further to 106 million tons by 2030 [1]. The growing global population has heightened the demand for seafood, degrading the oceans and driving a continual decline in marine fisheries resources. To address this challenge, modernized marine ranching is viewed as a promising approach to the efficient restoration of marine fisheries resources and ecosystems [2]. Recently, the deployment of underwater observation networks has enabled real-time monitoring of changes in marine environments and in the biodiversity of biological resources, presenting new opportunities for the scientific management of marine ranching. However, the underwater environment complicates the statistical assessment of biological resources: visibility is limited by low light, scattering, and absorption caused by suspended particles and algae in the water. Advances in computer vision have led to the increasing use of deep learning-based object detection in complex underwater settings, offering robust tools for detection and classification. This technology also supports analysis of the population, behavior, and interactions of underwater organisms within their habitats, including research on fish diseases and responses to hypoxic conditions [3].
With the rapid development of marine ranching, underwater fish resource assessment has primarily progressed through three stages.
The first stage involves traditional counting methods, primarily relying on fishing techniques such as angling and trawling. Fishermen estimate the quantity of cultured fish through their catches, a process that incurs significant time and labor costs while also harming the welfare and growth of the fish [4].
The second stage entails sensor-based counting methods using counters, sonar, and other devices. With advances in information technology, counting devices based on sensor principles and acoustic technologies have been widely applied for fish identification and counting in this phase. A number of researchers employ sensors based on different principles for counting, such as infrared optical counters [5] and resistivity fish counters [6]. Optical counters perform poorly in turbid water with limited transparency and depend on channel structures to constrain fish movement; as a result, counting accuracy suffers when fish overlap or swim back and forth, producing estimation errors. Visible-light imaging underwater is also restricted by distance: as distance increases, images become blurred and less informative. The advantages of sound waves, such as long-distance propagation, make them the optimal method for remote detection and identification of underwater targets. Particularly in underwater environments where light conditions change over time, acoustic imaging has become scholars’ preferred approach to obtaining information about underwater targets. Counting methods based on acoustic technology, which are unaffected by the distance limitations of underwater optical imaging, can be categorized into acoustic measurement and acoustic imaging and are employed for underwater measurement and counting. DIDSON is a high-definition imaging sonar that can provide high-quality acoustic images under dark and turbid underwater conditions [7,8,9,10,11,12]. However, studies based on DIDSON demonstrate poor performance in detecting and counting small fish. While optical and acoustic methods can achieve fish counting, they cannot classify fish into different categories.
The third stage involves deep learning-based automatic resource counting from underwater video for identification and enumeration. Traditional counting methods and sensor-dependent approaches face challenges in time efficiency, cost-effectiveness, and classification. Deep learning is crucial to automatic resource counting from underwater video, leveraging its outstanding capabilities in adaptive feature extraction and non-linear mapping. With the rapid advancement of hardware devices and their widespread deployment in underwater observation, devices such as underwater cameras, sonar, and underwater drones enable real-time visualization of underwater biological resources, promoting scientific fisheries management and sustainable production [13,14,15]. Object detection, a computer vision technique for locating and classifying semantic objects in images or videos, is integral to this stage. Object detection algorithms have found widespread application in various domains, including face recognition [16,17], text detection [18,19], pedestrian recognition [20,21], and vehicle detection [22,23], among others [24]. With the rapid development of deep learning-based object detection, these algorithms surpass human visual accuracy and find extensive application in aquaculture, such as fish counting, body length measurement, and individual behavior analysis. In contrast to terrestrial environments, the lenses of underwater cameras are prone to sediment and biological fouling such as algae. At the same time, water turbidity varies significantly over time, producing pronounced changes in the illumination and color distribution of underwater images, as illustrated in Figure 1. However, the rapid development of attached sensors and advances in underwater image enhancement make automatic resource counting from underwater video feasible.
In recent years, with advancements in computer vision, deep learning-based object detection has played a crucial role in fish counting. Image segmentation struggles to accurately delineate targets in complex underwater environments, impairing counting precision. In contrast, object detection localizes objects with bounding boxes and assigns categories, so the object count is simply the number of boxes; it exhibits higher generalization capability and accuracy. Anchor-based object detection frameworks can be categorized into two-stage and one-stage frameworks. The primary distinction is that a one-stage detector is an end-to-end model that directly regresses and outputs the positions and categories of detected objects.
In contrast, a two-stage detector implements a two-step, coarse-to-fine strategy: candidate anchor boxes are first filtered and then refined by regression. This strategy sacrifices speed to enhance precision. Classic two-stage frameworks include Faster R-CNN [25] and Mask R-CNN [26], which have been applied to detecting and counting fish in underwater environments. Classic one-stage detection frameworks include the You Only Look Once (YOLO) series, YOLOv1–YOLOv10 [27,28,29,30,31,32,33,34,35,36], and SSD [37].
Many researchers have applied object detection techniques to counting fish in aquaculture, utilizing both image- and video-based analysis approaches.
Based on images, researchers employ segmentation and detection techniques for fish counting. However, underwater images suffer from blurriness, noise, occlusion, and overlapping fish, making computer vision-based counting inherently challenging. French et al. [38] proposed a method using the N4-field algorithm for scene segmentation and counting, achieving a per-fish counting error of between 2% and 16%. Nevertheless, the fish datasets they processed were captured on deck and exhibit relatively high contrast, unlike datasets acquired in natural underwater environments. To extend model applicability to complex underwater environments, Li et al. [39] introduced a fish detection system based on Fast R-CNN, which outperformed the Deformable Parts Model (DPM) in mean Average Precision (mAP). However, its use of Selective Search for generating Regions of Interest (ROIs) caused redundant computation of target features, slowing both training and inference. Faster R-CNN [40] incorporated a Region Proposal Network (RPN) to increase detection speed and achieve end-to-end training. Li et al. [25] accelerated underwater fish detection using Faster R-CNN [40], obtaining a mean Average Precision (mAP) of 82.7% with a detection time one-third that of Fast R-CNN. However, counting fish from frame-by-frame image analysis cannot track individual trajectories without an inter-frame matching algorithm, and this omission may cause significant counting errors.
Based on video analysis for detection and counting, fish appearance and motion information can be fused for real-time fish detection, while associating targets across frames helps prevent repeated counting during tracking. Ditria et al. [26] applied Mask R-CNN in aquatic ecology, demonstrating that deep learning achieves higher accuracy and speed in fish abundance estimation. Arvind et al. [41] combined Mask R-CNN instance segmentation with the Generic Object Tracker for fish detection and tracking in large ponds and tanks. Their study indicated that a multi-region parallel detection method yielded the best results, achieving a fish detector with an F1 score of 0.91 at 16 frames per second. However, this approach is limited by the inability of drones to detect fish in turbid water. Mohamed et al. [42] combined multiple algorithms in fish farms, integrating the MSR-YOLO detector with optical flow for fish tracking. While the method performed well in fish farms, a significant gap remained relative to real underwater environments. Liu et al. [43] proposed a real-time fish detection and tracking method (RMCF) for multi-class fish counting in the real underwater environment of marine ranches, using YOLOv4 as the backbone network and achieving a recognition accuracy of 95.6%. However, the method’s accuracy decreases when the experimental area changes, and fish motion fitted with a constant-velocity model may deviate from the irregular movements of real underwater fish. In challenging underwater environments, Liu et al. [44] addressed the difficulty of fish detection and segmentation by proposing an adaptive multiscale background modeling method applied to deep-sea cage videos. They used an online segmentation algorithm for fish detection and counting, maintaining robust results even without additional annotated videos for training and fine-tuning the framework. However, segmentation from the background model poses the following issues: (1) inference is relatively slow, making real-time computation unfeasible; (2) foreground segmentation of moving objects can achieve only single-class semantic segmentation, not multi-class instance segmentation; and (3) a single moving object may be segmented into several connected domains belonging to different foreground objects, leading to erroneous object counts.
Building on the multiscale regression Gaussian background model proposed by Liu et al. [44], our network simplifies it into a single-scale block Gaussian background model. Moreover, in the second stage of object detection, the network determines the class of each connected domain, eliminates noisy regions, and refines the positions of the target bounding boxes.
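The single-scale block principle can be illustrated with a minimal NumPy sketch: each non-overlapping pixel block keeps a running Gaussian of its mean intensity, and blocks deviating by more than k standard deviations are flagged as foreground. The class name, block size, learning rate, and threshold are illustrative assumptions, not this paper's implementation.

```python
import numpy as np

class BlockGaussianBackground:
    """Minimal single-scale block Gaussian background model (sketch)."""

    def __init__(self, block=8, alpha=0.05, k=2.5, init_std=15.0):
        self.block = block        # side length of each square block
        self.alpha = alpha        # learning rate of the running Gaussian
        self.k = k                # foreground threshold in std deviations
        self.init_std = init_std  # initial per-block standard deviation
        self.mean = None
        self.var = None

    def apply(self, frame):
        """Return a boolean per-block foreground mask for a grayscale frame."""
        b = self.block
        h, w = frame.shape[0] // b, frame.shape[1] // b
        # Average intensity of each non-overlapping b-by-b block.
        x = frame[:h * b, :w * b].astype(np.float64)
        x = x.reshape(h, b, w, b).mean(axis=(1, 3))
        if self.mean is None:
            # Bootstrap the Gaussian from the first frame.
            self.mean = x.copy()
            self.var = np.full_like(x, self.init_std ** 2)
            return np.zeros_like(x, dtype=bool)
        fg = np.abs(x - self.mean) > self.k * np.sqrt(self.var)
        # Update the Gaussian only for background blocks, so moving
        # targets are not absorbed into the model.
        bg = ~fg
        d = x - self.mean
        self.mean[bg] += self.alpha * d[bg]
        self.var[bg] += self.alpha * (d[bg] ** 2 - self.var[bg])
        return fg
```

After warming up on background-only frames, a bright region entering one block raises that block's deviation above the threshold and flags it as foreground.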
However, numerous challenges remain in underwater video detection and recognition algorithms. The primary reasons for these challenges include the following:
Variability in the distribution of video and image samples input to the network. In real underwater environments, especially with continuously sampled video data, the presence of foreign objects and changes in water turbidity can significantly alter the illumination, color, and other characteristics of the images [45,46]. This leads to a decline in the model’s inferential capability over time [47,48]. One solution is to collect a sufficiently diverse training dataset covering various periods and underwater conditions, and to simulate additional samples through data augmentation. However, because no dataset can encompass all possible scenarios, the model may exhibit limited adaptability to certain underwater environments. Regular analysis of misclassified samples and fine-tuning are essential to maintain the model’s adaptability to new samples.
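Such augmentation can be sketched as a random global gain (lighting change) plus a per-channel colour cast (scattering and turbidity). The ranges below are illustrative assumptions, not values from this paper.

```python
import numpy as np

def jitter_underwater(img, rng):
    """Augment an RGB image (H, W, 3, uint8) with a random global gain
    and a per-channel colour cast simulating underwater conditions."""
    gain = rng.uniform(0.6, 1.4)
    cast = np.array([rng.uniform(0.7, 1.0),   # red attenuates fastest underwater
                     rng.uniform(0.9, 1.1),   # green
                     rng.uniform(0.9, 1.1)])  # blue
    out = img.astype(np.float64) * gain * cast
    return np.clip(out, 0.0, 255.0).astype(np.uint8)

# Example: generate one augmented variant of a dummy frame.
rng = np.random.default_rng(42)
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
aug = jitter_underwater(frame, rng)
```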
To address these challenges, we propose a two-stage object detection network, named the Foreground Region Convolution Network (FR-CNN), designed to detect and enumerate fish in open sea areas; our network also eliminates the need for anchor boxes. FR-CNN uses the foreground moving-object segmentation results obtained through background modeling to build a target region extraction network. A convolutional network then extracts features from the candidate regions for target classification and refinement of the bounding box coordinates.
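The candidate-region step can be sketched as extracting one bounding box per connected component of the binary foreground mask. This BFS-based labelling is a self-contained illustration of the idea, not the paper's target region extraction network.

```python
def mask_to_boxes(mask):
    """Extract candidate bounding boxes (x0, y0, x1, y1) from a binary
    foreground mask via 4-connected component labelling."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                # Flood-fill one component, tracking its extent.
                stack = [(i, j)]
                seen[i][j] = True
                y0 = y1 = i
                x0 = x1 = j
                while stack:
                    y, x = stack.pop()
                    y0, y1 = min(y0, y), max(y1, y)
                    x0, x1 = min(x0, x), max(x1, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

In the full pipeline these boxes would be passed to the second stage, which classifies each region and discards noise.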
In our proposed two-stage object detection network FR-CNN, we use a combination of unsupervised and supervised approaches to identify and count fish species accurately. First, image pre-processing and enhancement techniques, including image clarity, contrast and brightness enhancement, and the creation of a dynamic background representation using a multiscale regression Gaussian background model, are used to overcome problems associated with low visibility, lighting variations, and turbidity. Foreground segmentation is then used to isolate potential fish targets from the background, and background subtraction techniques are used to identify regions of motion and serve as candidate regions for further analysis by the convolutional network. In the feature extraction and object detection stages, a convolutional neural network (CNN) is used to extract high-level features, and bounding boxes are generated using an anchor-free approach. A classification network is implemented for species identification to classify detected objects into specific fish species based on the extracted features. Finally, the number of fish is counted directly through the bounding boxes generated by the detection network, and the counts are aggregated across frames to provide a comprehensive assessment of the dynamics and behavior of the fish population. By combining these methods, FR-CNN effectively translates the signals of underwater images into fish species identification and counting, overcoming the challenges posed by the underwater environment and improving the accuracy and efficiency of marine resource assessment.
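Aggregating per-frame counts into a single population estimate can be done in several ways; taking the median count of each class across frames is one simple, outlier-robust illustration. The aggregation rule and class names are assumptions, not the paper's exact procedure.

```python
from statistics import median

def aggregate_counts(per_frame_counts):
    """Combine per-frame, per-class counts into one estimate by taking
    the median count of each class across frames (robust to a spurious
    detection appearing in only a few frames)."""
    classes = set()
    for counts in per_frame_counts:
        classes.update(counts)
    return {c: median(counts.get(c, 0) for counts in per_frame_counts)
            for c in classes}

# Example: three frames; counts fluctuate slightly from frame to frame.
frames = [
    {"sea_bream": 3, "rockfish": 1},
    {"sea_bream": 3},
    {"sea_bream": 4, "rockfish": 1},
]
print(aggregate_counts(frames))
```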
Overall, the contributions of this paper mainly include the following aspects:
We propose FR-CNN as an anchor-free two-stage object detection network. This network combines the advantages of both unsupervised and supervised methods, using an unsupervised approach in the first stage to generate candidate regions and a supervised approach to extract target features from them. The unsupervised method effectively alleviates the prior errors introduced by publicly available datasets in the supervised stage, dynamically correcting existing datasets and improving the precision of fine screening.
We introduce a multiscale regression Gaussian background model as an unsupervised method for dynamic calibration of existing datasets. In the context of two-stage object detection, to address the prior errors associated with anchor boxes on publicly available datasets, we employ an unsupervised approach that automatically updates anchor boxes on local datasets, ensuring closer alignment with the color variations caused by illumination changes in underwater video imagery.
The remaining sections of this paper are organized as follows: Section 2 offers a detailed introduction to the FR-CNN framework. Section 3 presents the experimental results, comparing them with various object detection networks on the validation set, and provides discussions. Finally, Section 4 concludes the paper.