Article

Robust Fish Recognition Using Foundation Models toward Automatic Fish Resource Management

by
Tatsuhito Hasegawa
1,* and
Daichi Nakano
2
1
Graduate School of Engineering, University of Fukui, Fukui, Fukui 910-8507, Japan
2
Fukui Prefectural Fisheries Experimental Station, Marine Resources Research Center, Tsuruga, Fukui 914-0843, Japan
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(3), 488; https://doi.org/10.3390/jmse12030488
Submission received: 2 February 2024 / Revised: 2 March 2024 / Accepted: 13 March 2024 / Published: 14 March 2024

Abstract

Resource management for fisheries plays a pivotal role in fostering a sustainable fisheries industry. In Japan, resource surveys rely on manual measurements by staff, incurring high costs and limiting the number of feasible measurements. This study endeavors to revolutionize resource surveys by implementing image-recognition technology. Our methodology involves developing a system that detects individual fish regions in images and automatically identifies crucial keypoints for accurate fish length measurements. We use grounded-segment-anything (Grounded-SAM), a foundation model, for fish instance segmentation. Additionally, we employ a Mask Keypoint R-CNN trained on the fish image bank (FIB), an original dataset of fish images, to accurately detect significant fish keypoints. Diverse fish images were gathered for evaluation experiments, demonstrating the robust capabilities of the proposed method in accurately detecting both fish regions and keypoints.

1. Introduction

A sustainable society is being discussed in various fields, such as transportation [1] and shipping [2]. With the growing demand for marine conservation and the sustainability of fisheries, fishery resource assessment and sustainable use (FRASU) have become a global concern [3,4]. Resource assessment aims to recover species threatened by overfishing and to establish sustainable fisheries. Accurate surveys are indispensable for implementing resource assessments based on scientific evidence. However, existing surveys primarily rely on manual methods, which are time-consuming and inefficient, thereby hindering continuous and detailed monitoring. To address this challenge, introducing new technologies and methodologies is imperative, promising substantial enhancements in the accuracy and efficiency of resource management.
The scope of resource surveys is extensive, and one specific task involves surveying the catch in the marketplace. Because of the nature of many fishing methods, where many fish are caught simultaneously, individually measuring each catch's species, length, and weight is impractical, given the excessive labor it would entail. In Japanese resource surveys, box counts and tonnage are recorded for each species, and only a subset of the catch is sampled for measurement, with the species and length recorded. Measurements are typically conducted manually at the market by multiple staff members, necessitating swift assessments within a constrained timeframe. This process requires substantial time and effort and limits the number of samples that can be measured. Therefore, we developed a system that automates length measurement using image-recognition techniques. The functional requirements of the proposed system are shown in Figure 1. The system utilizes a depth camera to capture images of the catch from above and automatically measures the species and length of each fish in the image through image recognition. The system necessitates the following functions:
(A) Instance segmentation: (Input) an RGB image including any fish. (Output) region masks for each fish.
(B) Keypoint detection: (Input) an RGB image cropped to the region of one fish. (Output) the XY coordinates of the nine defined keypoints of the fish.
(C) Species identification: (Input) an RGB image cropped to the region of one fish. (Output) the species name of the fish.
(D) Length calculation: (Input) the XY coordinates of the keypoints and depth information. (Output) several length types, such as the standard, fork, and total length.
(E) Error check: (Input) an RGB image cropped to the region of one fish. (Output) a determination of whether the fish image is suitable for the resource survey.
Regarding (A) and (C), several research studies fulfill the aforementioned functional requirements [5,6,7] by employing convolutional neural networks (CNNs), which are deep learning models. These studies accomplished catch recognition by constructing datasets and training an instance segmentation model, specifically Mask R-CNN [8]. In one instance, Garcia et al. enhanced the performance of fish overlap detection by introducing a gradient-based method [5]. Álvarez-Ellacuría et al. proposed a methodology that uses Mask R-CNN to detect only the head and estimates the total length through a separate classical regression model [6]. Tseng et al. utilized Mask R-CNN to detect fish caught on board a ship, implementing species identification and instance segmentation within the Mask R-CNN framework [7]. Additionally, Bravata et al. proposed a method to directly estimate fish length and weight from images with simple backgrounds using a simple CNN [9]. CNNs have also found applications in the aquaculture industry for automatically counting the total number of fish in salmon and shrimp aquaculture. Furthermore, Saleh et al. conducted a comprehensive survey of resource assessment research utilizing image recognition, summarizing the results in their survey study [10].
Concerning (B) and (D), several research studies have addressed fish keypoint detection [11,12,13]. Suo et al. introduced a method that combines fish region detection using Faster R-CNN and keypoint detection using a stacked hourglass network [11]. Similarly, Yu et al. proposed a keypoint detection method utilizing CenterNet [14] and constructed their own dataset with keypoint labels [12]. Dong et al. introduced a method integrating fish region detection using YOLOv5 and keypoint detection using Lite-HRNet [13].
Several issues were evident in the aforementioned studies. First, these methods require extensive effort to amass large amounts of training data and manually annotate class labels, region masks, and keypoints. Although datasets such as Fish4Knowledge [15], LargeFish [16], and DeepFish [17] exist for fish instance segmentation, merely training models on these datasets does not guarantee robust performance across diverse fisheries. This is particularly problematic, given that resource surveys occur in various regions and seasons, making the time and effort required to construct large datasets for each region a formidable obstacle to system implementation. The second issue pertains to the absence of (E) an error check. Despite the high detection accuracy of image-recognition-based methods, achieving 100% detection is challenging. From a resource assessment perspective, only reliably detected data should be used in the survey. Hence, incorporating a function to analyze detection results becomes imperative.
In this study, we devised a robust catch-detection method based on foundation models and image synthesis by leveraging our exclusive dataset. Figure 1 illustrates the proposed automated resource survey system and the role of our method. The system utilizes RGB images and depth data from a depth camera as input, providing species and length details for each fish in the image as output. The novel aspect of this research is the method's robust performance across various fishing grounds in fulfilling functional requirements (A) and (B), along with its capacity to identify prediction errors. We aim to promote the widespread adoption of automated resource surveys. Moreover, the proposed method holds potential for resource assessments, automating aquaculture management, and driving operational reforms in the fishery industry.
The contributions of this study are outlined as follows:
  • We devised (A) an instance segmentation model applicable to resource surveys in diverse environments through prompt engineering that leverages the capabilities of the foundation model grounded-segment-anything (Grounded-SAM). Additionally, we validated the method's performance using benchmark datasets and self-captured market images, showcasing, for the first time, the utility of Grounded-SAM in detecting fish regions for resource surveys.
  • We showed that, by utilizing the foundation model for instance segmentation, subsequent detection models can be trained under the assumption of a single catch per image with the background removed. Building on this assumption, we devised robust (B) keypoint detection and (E) error-checking models, demonstrating their effectiveness through validation.
  • To facilitate the training of the above models, we created and released the fish image bank (FIB) dataset. This dataset is a new benchmark for fish detection and image synthesis, comprising 405 4K-resolution images, each featuring a single fish specimen. The dataset was meticulously annotated with labels for fish species, region masks, and keypoints.
Note that (C) is not addressed in this study, as it can be implemented through various techniques, including our previous study [18], Fish-Pak [19], and WildFish [20,21]. As shown in Figure 2, since the detected keypoints are represented in a two-dimensional coordinate system with pixels as the unit of measurement, the length of the fish cannot be estimated directly. Therefore, by using information obtained from a depth camera or similar devices to project the keypoints into a three-dimensional coordinate system, the length of the fish can be estimated by calculating the Euclidean distance. Additionally, since the method described in related research [12] is feasible, this study also excludes approach (D) from consideration.
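As a minimal illustration of this projection step (not the authors' implementation; the pinhole intrinsics fx, fy, cx, and cy, as well as the depth-map indexing, are assumptions), the length between two detected keypoints could be computed as follows:

```python
import numpy as np

def keypoint_to_3d(u, v, depth, fx, fy, cx, cy):
    """Project a pixel keypoint (u, v) with depth z (same unit as the desired length)
    into camera coordinates using a pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth], dtype=float)

def length_between_keypoints(kp_a, kp_b, depth_map, intrinsics):
    """Euclidean distance between two projected keypoints,
    e.g., the tip of the mandible and the tip of the tail for the total length."""
    fx, fy, cx, cy = intrinsics
    (ua, va), (ub, vb) = kp_a, kp_b
    p_a = keypoint_to_3d(ua, va, depth_map[int(va), int(ua)], fx, fy, cx, cy)
    p_b = keypoint_to_3d(ub, vb, depth_map[int(vb), int(ub)], fx, fy, cx, cy)
    return float(np.linalg.norm(p_a - p_b))
```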

2. Proposed Method

2.1. Overall Process and Advantages

As depicted in the process overview of the proposed method shown in Figure 3, the approach takes an RGB image as input and produces region masks and keypoints for each fish in the image. First, considering the presence of multiple fish in the input image, (A) instance segmentation becomes imperative for detecting individual fish regions. The instance segmentation process generates multiple images, each featuring an isolated individual fish, obtained by cropping with masks detected through foundation models. Subsequently, for each identified individual fish, (B) keypoint detection, which is crucial for measuring fish length, and (E) error checking of the recognition results are executed. Ultimately, through postprocessing involving all inputs and outputs, only accurately measured individuals are chosen, and they constitute the final output of the entire method.
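The following sketch shows the glue logic of this flow under stated assumptions: the three model wrappers (segment, detect_keypoints, check_errors) are hypothetical callables supplied by the caller, and background removal is simplified to whiting out pixels outside each mask.

```python
from typing import Callable, Dict, List
import numpy as np

def survey_image(
    rgb: np.ndarray,
    segment: Callable[[np.ndarray], List[np.ndarray]],      # (A) Grounded-SAM wrapper: returns binary masks
    detect_keypoints: Callable[[np.ndarray], np.ndarray],   # (B) keypoint model wrapper: returns a (9, 2) array
    check_errors: Callable[[np.ndarray], Dict[str, bool]],  # (E) error-check wrapper: {"single": ..., "overall": ...}
) -> List[Dict]:
    """Pipeline glue: segment, isolate each fish, detect keypoints, and keep only reliable fish."""
    results = []
    for mask in segment(rgb):
        # White out the background so downstream models see a single isolated fish.
        crop = np.where(mask[..., None].astype(bool), rgb, 255).astype(rgb.dtype)
        keypoints = detect_keypoints(crop)
        flags = check_errors(crop)
        if flags.get("single", False) and flags.get("overall", False):
            results.append({"mask": mask, "keypoints": keypoints})
    return results
```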
As mentioned above, achieving instance segmentation of fish regions, automatic detection of their classes, and keypoints in a single deep-learning model using Mask Keypoint R-CNN or CenterNet is technically feasible. However, in this study, we opted to implement instance segmentation, keypoint detection, and fish species identification as distinct deep learning models. There are two advantages to employing multiple models. First, overall performance was enhanced when task-specific models were utilized. Our previous study [22] demonstrated that instance segmentation performance for unknown fish species improved by training the Mask R-CNN model to predict a single class, fish, instead of directly predicting fish species. Additionally, fish species identification is a complex task akin to fine-grained image recognition (FGIR), necessitating specialized models for accurate identification. Second, employing multiple models substantially reduces the time required to prepare the training datasets, enabling the development of a robust method applicable to various fisheries. When using a single model, one needs to prepare numerous images featuring diverse fish species captured in various environments and train the model for each fishing ground by annotating labels for region masks, fish species, and keypoints. In contrast, using foundation models, our proposed method can accurately detect fish-like regions without identifying fish species, irrespective of the shooting environment. This eliminates the need for an additional dataset, allowing the cropping of a single fish region in an image and background removal, thereby reducing the dataset’s requirement of background diversity for keypoint detection and fish species identification. The division of roles among multiple models enables keypoint detection and fish species identification models to specialize in recognizing keypoints and species from a single fish without considering the background or number of fish in the image. Because it is more straightforward to prepare a dataset specialized for recognition from a single fish than for a single model, the proposed method can realize robust models applicable to various fishing grounds.
Admittedly, utilizing multiple models has limitations, such as a potential slowdown in inference speed and an increase in model size. However, in the context of resource surveys, inference speeds as high as ten frames per second are not required. Inference can be performed offline after recording a video, which mitigates the impact of a potential decrease in speed. While the system should operate on edge devices, an alternative approach involves connecting it to a high-performance inference server, making the increase in model size less significant. Given these considerations, we contend that the advantages of distributing roles among multiple models outweigh the limitations relative to a single-model resource-surveying approach.

2.2. Instance Segmentation by Foundation Models

2.2.1. Foundation Models

A foundation model is a broad term referring to a large-scale model pretrained on an extensive dataset. Models of this type can be applied to various downstream tasks without additional training. For instance, OpenAI's CLIP [23] is renowned for its image-to-text capabilities.
In this study, we utilize Grounded-SAM 1, a foundation model chosen to implement the instance segmentation of catches. Grounded-SAM is an instance segmentation method that integrates Grounding DINO [24] and the segment-anything model (SAM) [25]. These large-scale models, having undergone pretraining on substantial datasets, can detect the objects corresponding to a prompt in an image, instance by instance, without requiring additional training, given an image and natural-language prompts as input.
SAM [25] is a segmentation model introduced by Meta in 2023, pretrained on an extensive dataset comprising over 11 million licensed images and more than 1 billion mask annotations. This pretraining enables SAM to segment various objects without requiring additional training. The model employs an image encoder and a prompt encoder to convert the input image and prompts into feature maps, ultimately generating segmentation masks by fusing information from both sources. Owing to its exceptional segmentation performance, SAM is anticipated to be a focal point of research in 2024 [26]. It has been successfully applied in various domains, demonstrating its proficiency in tasks such as satellite image segmentation [27]. However, its performance in medical imaging has been noted as insufficient for certain tasks [28]. Notably, the potential use of SAM in fisheries has not yet been explored within the scope of our survey. Existing SAM implementations 2 possess specific characteristics, wherein prompts can be either point-based or box-based. Figure 4a illustrates an example prediction by SAM. The green box and star marks represent box-based and point-based prompts, respectively, whereas the blue-filled area depicts the output mask. This allows the segmentation of objects in the vicinity of points and of optimal objects within boxes by providing points and boxes as prompts, in addition to images. In essence, SAM facilitates pixel clustering with consideration for the object's semantic meaning.
Grounding DINO [24] integrates the grounding functionality proposed in GLIP [29] with the self-supervised learning method DINO [30]. It utilizes image and text encoders to convert input images and text prompts into feature maps, generating bounding boxes corresponding to the provided text. Unlike traditional object detection models limited to predefined classes, Grounding DINO, through its connection with the text encoder and grounding function, enables the specification of various classes at the time of inference using natural language, akin to CLIP. In Figure 4b, an example prediction of Grounding DINO is illustrated, producing six bounding boxes by specifying “fish’’ as a text prompt for the input image.

2.2.2. Prompt Engineering

The current SAM necessitates user-provided points or boxes as prompts, making it incapable of automatically detecting fish areas. Conversely, while Grounding DINO can detect objects specified by text, its performance diminishes when a specific fish species name, such as “mackerel”, is used as a prompt, and achieving good detection performance for obscure fish species is particularly challenging. For instance, in Figure 4b, both mackerel and flatfish are detected for the prompt “mackerel”, and the confidence level is low. Despite Grounding DINO being trained on extensive datasets, its effectiveness may be compromised when dealing with text that deviates substantially from the data included in the training set.
Hence, in this study, we employ the abstract text “fish” instead of specific fish species names as the prompt for Grounding DINO. This allows us to detect general-purpose fish bounding boxes, which are then utilized as prompts to identify instance masks in SAM. As depicted in Figure 4c, even though individual fish species cannot be identified, all fish are generally detected as “fish”, resulting in accurate segmentation. A distinctive feature of our proposed method is its ability to detect fish regions comprehensively and accurately by separating the instance segmentation and fish species identification models. This concept was inspired by SAM's approach. As highlighted by Zhang et al. [31], conventional segmentation tasks involve both label and mask prediction, with SAM specializing in the latter. By omitting fish species identification, our approach enhances accurate mask detection and improves the accuracy of fish species identification, a fine-grained image-recognition task.
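A sketch of this prompting scheme is shown below, assuming the reference implementations of Grounding DINO and SAM distributed with the Grounded-SAM repository; the configuration file, checkpoint paths, image path, and thresholds are placeholders rather than the authors' settings.

```python
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths and thresholds; adjust to the locally downloaded checkpoints.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_source, image_tensor = load_image("catch.jpg")   # RGB array and normalized tensor
boxes, logits, phrases = predict(
    model=dino,
    image=image_tensor,
    caption="fish",               # abstract prompt instead of a species name
    box_threshold=0.35,
    text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy for SAM.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

predictor.set_image(image_source)
masks = []
for box in boxes_xyxy.numpy():
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])         # one binary mask per detected fish-like region
```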

2.3. Keypoint and Error Detection by FIB

Using instance segmentation, multiple images containing only one fish without a background were extracted from a single image. Owing to the variable resolution of each cropped region, resizing and padding were applied to achieve a uniform size of 512 × 512 px while preserving the aspect ratio. Although zero padding (black fill) is a common choice for padding, it can pose challenges in accurately recognizing the contours of the catch and may adversely affect subsequent processing. In this study, we opted for white fill during padding to address these issues.
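A minimal sketch of this preprocessing step is given below; centering the resized crop on the white canvas is an assumption, as the text only specifies aspect-ratio-preserving resizing and white padding to 512 × 512 px.

```python
import numpy as np
import cv2

def resize_with_white_padding(crop: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize an RGB crop so its longer side equals `size`, then pad with white (not black)."""
    h, w = crop.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), 255, dtype=np.uint8)   # white fill instead of zero padding
    y0 = (size - resized.shape[0]) // 2
    x0 = (size - resized.shape[1]) // 2
    canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return canvas
```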
The resized individual fish images are input for both the keypoint detection and error-checking models. In this section, we elaborate on the dataset (FIB) essential for training each model and the methodologies employed for model training.

2.3.1. FIB

FIB is our newly compiled image dataset, comprising 405 images of 4K resolution, each featuring one fish specimen from 24 different species, as depicted in Figure 5. The selection of 24 fish species was based on their common circulation in Fukui, where the authors conducted their research. The FIB was constructed with the cooperation of markets and wholesalers in Fukui, which limits the number of fish species it contains. The dataset is released under a CC BY-SA license and is available for download at the following URL: https://haselab.fuis.u-fukui.ac.jp/research/fish/fib.html (accessed on 1 February 2024). Detailed information is provided on the same web page.
Each image was annotated with the fish species, a segmentation mask, and keypoints. The keypoints, based on previous studies [11,12,13], included nine keypoint locations (tip of mandible, eye, pectoral fin, dorsal fin, pelvic fin, root of tail, tip of tail (top, between, and bottom)). This selection allows the measurement of the standard, fork, and total length by utilizing a combination of certain keypoints. The authors manually labeled the segmentation masks and keypoints using coco-annotator 3 through visual inspection of the images. It is important to note that information regarding keypoints not visible in images may be subject to inaccuracies. FIB is a resource for training and evaluating image-recognition and segmentation models and for various applications employing the image synthesis method outlined below.

2.3.2. Keypoint Detection

In the context of keypoint detection, traditional algorithms such as SIFT [32] and SURF [33] are well known for identifying distinctive points related to local textures and shapes within images as keypoints. Recently, heuristic improvement methods have also been proposed and utilized for applications such as image stitching [34]. On the other hand, approaches such as Keypoint R-CNN [8] and PoseNet [35] detect keypoints based on semantically predefined parts, such as human eyes or limb joints. In this study, we adopt the latter approach for supervised keypoint detection, aimed at measuring the length of fish.
In this study, the Mask Keypoint R-CNN [8] serves as the keypoint detection model and undergoes training with the FIB dataset. While conventional training typically involves using diverse images capturing fish species in various background environments, our proposed method mitigates this limitation by employing distinct models for segmentation and fish species discrimination. Consequently, FIB segmentation masks eliminate the background, treating all classes uniformly as a class labeled “fish’’ for training data. This approach streamlines the training process and circumvents the need for diverse images for each fish species from various backgrounds.
The model was trained with the following specifications: 90,000 iterations, an input resolution of 512 × 512 px, SGD as the optimization method (with a learning rate of 0.002, momentum of 0.9, and weight decay of 0.0005), and a cosine annealing learning rate scheduler. Basic data augmentation techniques were implemented, including scaling, rotation, and blurring. Additionally, recognizing that segmentation results might be obstructed by market tags in resource surveys, random erasing [36] was applied to address this potential issue.
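The following configuration sketch illustrates these settings, assuming a Detectron2-style Mask/Keypoint R-CNN; the framework, base configuration file, and dataset registration are not specified in the paper and are therefore illustrative.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

# Illustrative Detectron2 configuration; dataset registration and OKS sigmas for the
# nine-keypoint evaluation are omitted for brevity.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.MASK_ON = True                        # predict masks in addition to keypoints
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1             # a single "fish" class
cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = 9   # the nine keypoints defined in FIB
cfg.INPUT.MIN_SIZE_TRAIN = (512,)               # 512 x 512 px inputs
cfg.INPUT.MAX_SIZE_TRAIN = 512
cfg.SOLVER.MAX_ITER = 90000
cfg.SOLVER.BASE_LR = 0.002                      # SGD with momentum 0.9 and weight decay 5e-4
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.WEIGHT_DECAY = 0.0005
cfg.SOLVER.LR_SCHEDULER_NAME = "WarmupCosineLR" # cosine annealing schedule
```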
Since the detected keypoints may occasionally extend beyond the fish region, a keypoint correction step was introduced as a postprocessing measure. This correction was applied to keypoints identified outside the segmentation mask detected by SAM, and these keypoints were replaced with coordinates from the nearest masked area.
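A minimal sketch of this correction, assuming a brute-force nearest-pixel search over the binary SAM mask, could look as follows:

```python
import numpy as np

def snap_keypoints_to_mask(keypoints: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace keypoints that fall outside the SAM mask with the nearest masked pixel.

    keypoints: (N, 2) array of (x, y) coordinates; mask: (H, W) boolean array.
    """
    ys, xs = np.nonzero(mask)
    mask_px = np.stack([xs, ys], axis=1)                    # (M, 2) in (x, y) order
    corrected = keypoints.copy()
    if mask_px.size == 0:
        return corrected                                    # nothing to snap to
    for i, (x, y) in enumerate(keypoints.astype(int)):
        inside = 0 <= y < mask.shape[0] and 0 <= x < mask.shape[1] and mask[y, x]
        if not inside:
            d = np.linalg.norm(mask_px - keypoints[i], axis=1)
            corrected[i] = mask_px[np.argmin(d)]            # nearest point inside the mask
    return corrected
```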

2.3.3. Error Check

When fish are densely populated, the area identified by Grounded-SAM may encompass images unsuitable for resource surveys. For instance, multiple fish might be intertwined in one detected area, or the entire fish might not be captured. In the context of resource surveys from densely packed fish images, such as those of boxed fish, it is preferable to meticulously choose and record only those fish that can be accurately detected rather than forcibly calculating the lengths of all fish. To address this, we devised an error-checking model to select the fish to be detected carefully.
ResNet50 [37] was employed as the encoder, and the output layer incorporated two heads for multitasking learning. Each head is responsible for estimating whether ‘‘multiple fishes are included (multi)’’ and ‘‘the whole fish (overall).’’ The loss function comprises binary cross-entropy losses for each task, with the two losses summed to obtain the total loss. The training process involved 100,000 iterations, an input resolution of 512 × 512 px, and SGD as the optimization method (learning rate 0.01, momentum 0.9, weight decay 0.0005). The learning rate was dynamically adjusted using cosine annealing.
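A sketch of this architecture and loss, assuming a PyTorch/torchvision implementation (which the text does not specify), is shown below.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ErrorChecker(nn.Module):
    """ResNet50 encoder with two binary heads: 'multi' (multiple fish) and 'overall' (whole fish)."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # random init; swap in pretrained weights if desired
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer
        self.head_multi = nn.Linear(2048, 1)
        self.head_overall = nn.Linear(2048, 1)

    def forward(self, x):
        feat = self.encoder(x).flatten(1)          # (B, 2048)
        return self.head_multi(feat), self.head_overall(feat)

def total_loss(logit_multi, logit_overall, y_multi, y_overall):
    """Sum of the two binary cross-entropy losses, as described above (targets are float tensors)."""
    bce = nn.BCEWithLogitsLoss()
    return bce(logit_multi, y_multi) + bce(logit_overall, y_overall)
```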
The training data were dynamically generated using synthetic images from FIB. For each iteration, one to three fish were randomly selected from the FIB and placed randomly after applying the data augmentation outlined in the previous section. An ‘‘overall’’ error is labeled if any of the crucial keypoints (tip of the mandible, tail root, and three points at the tail tip) are absent from the synthesized image or if more than 30% of the original region is missing due to synthesis and random erasing. The number of fish determines the “multi’’ label in the synthesized image. Figure 6 illustrates the synthesized data in each case.
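The labeling rule could be sketched as follows; the keypoint names and the per-fish bookkeeping (visible keypoints and remaining area ratio after pasting and random erasing) are hypothetical representations, not the released code.

```python
# Illustrative names for the crucial keypoints: tip of the mandible, tail root,
# and the three tail-tip points.
CRUCIAL_KEYPOINTS = {"mandible_tip", "tail_root", "tail_tip_top", "tail_tip_between", "tail_tip_bottom"}

def label_synthetic_sample(pasted_fish):
    """Derive the {multi, overall-error} labels for one synthesized training image.

    pasted_fish: one dict per pasted fish, with
      'visible_keypoints'  -- set of keypoint names still visible after pasting/erasing
      'visible_area_ratio' -- remaining mask area divided by the original mask area
    """
    multi = len(pasted_fish) > 1
    overall_error = any(
        not CRUCIAL_KEYPOINTS.issubset(f["visible_keypoints"]) or f["visible_area_ratio"] < 0.7
        for f in pasted_fish
    )
    return {"multi": multi, "overall_error": overall_error}
```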

3. Experimental Settings

Table 1 provides a comprehensive overview of the datasets used to evaluate our proposed method. In addition to our FIB, we incorporate the Multiple and Webfish datasets for evaluation purposes. The Multiple (1) and (2) datasets consist of images of fish captured at our laboratory (Lab) and the Tsuruga branch of the Fukui Prefecture Fisheries Co-operative Associations (Market). Dataset (1) was utilized in our previous study [38] and is manually labeled with instance segmentation labels, whereas (2) remains unlabeled. Table 2 offers a detailed breakdown of dataset (2). Labs 1–4 involved photographs of commercially available fresh fish on desks and conveyors, captured with an iPhone as the camera equipment. Markets 1–3 encompass images of fish in boxes taken at actual resource survey sites, again using an iPhone. The Multiple (1) dataset comprises data from the Lab1 to Lab3 environments. Examples of (2) in each environment are illustrated in Figure 7.
Webfish is an image dataset compiled by scraping in February 2023 with the permission of Web Sakana Zukan (a visual dictionary of fish on the web) 4. Each image was labeled with the fish species; however, segmentation labels were not provided. Given its user-contributed nature, many images in this dataset were captured by anglers. From the collected dataset, 205 fish species were selected based on their relevance to resource surveys in Japan. Up to five images of each fish species were extracted, each showcasing the entire fish and featuring only one fish. The Webfish dataset generally covers the fish species necessary for Japanese resource surveys but does not include mollusks or shellfish. The fish species not covered in the dataset are Physiculus maximowiczi, Neosalangichthys ishikawae, and Dexistes rikuzenius, but similar species such as Hexagrammos otakii, Salangichthys microdon, and several species of Pleuronectiformes are included.
This study uses the above datasets to address the following three research questions through experiments.
RQ1 
Can the instance segmentation performance of the foundation models, Grounded-SAM, be effectively employed in various resource survey environments without additional training (robustness to fish species, background environment, and density)?
RQ2 
Does the keypoint detection model, trained on FIB and coupled with background removal, offer robust keypoint detection across diverse environments and fish species?
RQ3 
Can the error-checking model identify detection errors made by the foundation models in images taken in a real environment?
In Table 1, FIB and Multiple (1) are employed to assess the instance segmentation performance on single- and multi-tailed images, respectively, using mean average precision (mAP) as the metric. FIB and Multiple (1) serve as test datasets that are not publicly accessible on the web, allowing us to examine the generalization ability of Grounded-SAM, which is trained on vast web image data. Multiple (2) was utilized for a more realistic evaluation of instance segmentation, keypoint detection, and error checking. Owing to the absence of correct labels, the results were visually inspected. Regarding robustness to fish species, Webfish was utilized for an evaluation of instance segmentation and keypoint detection. Although there is a possibility that the publicly available Webfish images might be part of Grounded-SAM's training data, this is considered unlikely.

4. Results

4.1. RQ1. Instance Segmentation Performance

In this section, we assess the robustness of instance segmentation to diverse background environments and fish densities, addressing RQ1. The evaluation regarding the robustness of fish species will be discussed in Section 4.4.
Table 3 presents the evaluation results for the FIB and Multiple (1) datasets. Grounded-SAM exhibited excellent segmentation performance, with mAP scores of 0.997 and 0.864 for the FIB and Multiple (1) datasets, respectively. As depicted in Figure 8, visual examination of both datasets revealed accurate detection for most fish. However, in Figure 8a, a Seriola quinqueradiata with only its mouth appearing in the upper right corner was not detected, and in Figure 8b, an Arctoscopus japonicus located in the center was missed. In the latter case, the bounding box was not detected, indicating that the oversight was due to a Grounding DINO error and can be addressed by adjusting the threshold for the “fish” prompt in Grounding DINO. However, lowering the threshold too much may lead to the detection of unwanted objects, necessitating hyperparameter tuning during operation.
We then assessed the fish detection performance in crowded environments using the Multiple (2) dataset. Table 4 presents the results for each environment, demonstrating an overall recall of 73.9% and a precision of 97.9%. This indicates a few missed fish but a low rate of false detection. Specifically, only one case resulted in objects (boots) being misdetected as fish, and in another case, multiple individuals were erroneously identified as a single fish in an excessively crowded scenario (as indicated in “Misdetection” and “Union”, respectively).
Figure 9 displays the results of Grounded-SAM inference for each image in Figure 7. All fish were successfully detected in Labs 1–3. Lab1, with a relatively dense arrangement, exhibited accurate detection since there was no overlap in the depth direction. However, Lab4, with its extremely crowded and overlapping fish, resulted in suboptimal detection for many fish. Despite some overlap, fish in Markets 1 or 2 were accurately detected. Detection, however, faced challenges when the overlap was as substantial as in Market3. This difficulty may be partly attributed to the challenge of discriminating boundary lines, particularly for Oplegnathus fasciatus. In both cases, Grounding DINO failed to detect boxes, indicating that improving instance segmentation in dense environments may involve tuning the hyperparameters of Grounding DINO.
Despite variations across environments, the visualized results revealed the following trends:
  • Fish detection is feasible even in dense placements but becomes challenging when fish overlap in the depth direction, particularly when decreasing the confidence of “fish” in the Grounding DINO.
  • Instances of missing detection occur when only a portion of the head or tail is visible at the screen’s edge, as illustrated in Figure 8a.
Most of the misses can be attributed to these two factors. Therefore, creating an environment where fish do not overlap substantially can ensure accurate instance segmentation of fish, regardless of the shooting date or background environment.

4.2. RQ2. Keypoint Detection Performance

In this section, addressing RQ2, we assess the robustness of keypoint detection for fish whose regions were mostly detected by Grounded-SAM inference on the Multiple (2) dataset (robustness to fish species is discussed further in Section 4.4).
Table 5 shows the accuracy of keypoint detection in various environments. The visual inspection categorized the outputs into three groups: “Mostly correct”, where all keypoints were generally predicted correctly; “Part of tail missing”, where all keypoints could be utilized for resource surveys but some tail keypoints were not detected; and “Critical missing”, where essential keypoints such as the mouth tip and most of the tail could not be predicted. For instance, the fish in Figure 10a is labeled as “Mostly correct”; the fish in (b) is labeled as “Part of tail missing” because a portion of the tail tip was not detected; and the fish in (c) is labeled as “Critical missing” because its tail was not accurately detected. Additionally, based on Grounded-SAM's detection results, the mask was classified as “Correct mask” when it was entirely predicted and “Part of mask missing” when some of the mask was not detected. For example, the fish on the left and right in Figure 10a are classified as “Correct mask” because the entire fish body was detected. In contrast, the fish in the center was classified as “Part of mask missing” because a segment of the fish's body was obscured by a tag used in the market.
Table 5 indicates that in cases where the correct mask was predicted, 79.3% of the fish accurately detected all keypoints, irrespective of the environment. Figure 10 visually confirms the coordinates’ accuracy for each keypoint. Conversely, in images where part of the mask was missing, 38.9% of the fish were correctly detected, increasing to 81.5% if the absence of the part of the tail was acceptable. However, in both situations, 6.1% and 18.5% of the errors were deemed critical, rendering them unsuitable for resource assessments.
We investigated the factors contributing to performance degradation due to missing mask parts. One factor is that the masked area excludes the regions displaying the keypoints, as illustrated by the tail of fish in the center of Figure 10b. This issue can be resolved using techniques such as expanding the detected masks and enhancing the performance of instance segmentation by improving the bbox that serves as a prompt for SAM. Another factor is the division of the masked region of the fish body, as observed in the fish on the left of Figure 10c. Here, the fish body is substantially obscured by market tags. While the tail is visible in the image, the broken-up mask may misidentify the boundary as the tail. Altering the text prompt in Grounded-SAM can detect tags; thus, measures such as inpainting the tag area are expected to enhance performance.

4.3. RQ3. Error-Checking Performance

In this section, addressing RQ3, we assess the performance of the error-checking model when applied to the results derived from Grounded-SAM inference on the Multiple (2) dataset. The error-checking model involves two inquiries: whether the entire fish body was accurately detected (overall) and whether the detected region encompasses more than one fish (multi). Each output is examined independently.
Table 6 displays the results of predicting the “overall” category using a confusion matrix. In addition to the correct labels from Table 5, annotations included cases in which the entire fish was detected but the mask was over-detected (Excess mask) and cases where critical areas were not detected (Critical missing). The predicted values were categorized as either correct or missing. The table indicates that while we successfully detected 80.6% of the correct instances (recall), the overall accuracy was 69.2%. Notably, the precision was 70.8%. Considering that 29.2% of the results inferred as correct are actually incorrect, and recognizing that only accurate results should be utilized in resource surveys, it is advisable to enhance precision by adjusting the threshold value.
It is reasonable to consider fish that touch the edges of the screen, such as the fish at the top and bottom of Figure 10b, as not fully captured because part of the body is out of view. Therefore, we introduced a rule-based postprocessing step that categorizes fish bordering the screen edge as missing; the results are presented in Table 7. Postprocessing enhanced the overall accuracy to 75.4% and the precision to 81.8%. Although the number of missed detections increased, there was no excessive detection of incomplete masks, thereby demonstrating its potential for application in resource surveys.
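A minimal sketch of this rule, with the border margin as an assumed parameter, is:

```python
import numpy as np

def touches_image_border(mask: np.ndarray, margin: int = 1) -> bool:
    """Rule-based check: treat a fish whose mask reaches the image edge as not fully captured."""
    return bool(
        mask[:margin, :].any() or mask[-margin:, :].any()
        or mask[:, :margin].any() or mask[:, -margin:].any()
    )
```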
The results of predicting multiple tails are presented in Table 8. This dataset is imbalanced, as there were only 10 cases in which Grounded-SAM detected multiple tails together. Consequently, the overall accuracy was 93.4%, but the macro F1 score was 64.9%. It is noteworthy that the F1 score for the “single” class is remarkably high at 96.5%. Specifically, the precision was 99.6%, signifying that a fish predicted as single by the proposed model was almost certainly a single fish. Therefore, it can be concluded that the multi-detection model achieved sufficient performance for use in resource surveys.

4.4. Robustness to Species

We assessed the performance of instance segmentation and keypoint detection on the Webfish dataset, which comprises 205 species in 19 orders. The aim was to assess the proposed method’s robustness across various fish species for resource assessment. In this experiment, we applied the procedures of the proposed method to each Webfish image. We visually verified the correctness of instance segmentation and keypoint detection, employing the same criteria as in previous experiments.
Because of the wide variety of species, the aggregated results are presented in Table 9. First, instance segmentation achieved high accuracy, successfully detecting 94.1% of the images overall. Subsequently, keypoint detection demonstrated accuracy in 41.3% of the images, increasing to 72.6% when allowing for partial tail detection. The accuracy varied across orders, with relatively high performance for Aulopiformes, Clupeiformes, and Tetraodontiformes and lower accuracy for Anguilliformes, Lophiiformes, and Zeiformes. Since the keypoint detection model was trained on the FIB dataset, the accuracy for fish not included in the FIB was expected to be lower. In contrast, keypoints of Clupeiformes and Tetraodontiformes, which were not included in the FIB, were also detected with high accuracy. This is thought to be because the keypoint detection model was trained without discriminating fish species, and keypoint features were learned using fish species with similar appearances.
Given these results, the robustness of instance segmentation to fish species is high, but the accuracy of keypoint detection varies and may not be sufficient for certain species. The successful estimation of keypoints for fish species not included in the FIB, provided they have a similar appearance, suggests that enriching the knowledge of the keypoint detection model by adding various fish species with different appearances to the FIB could improve performance. Moreover, while Keypoint R-CNN was adopted as the baseline due to its ease of implementation, switching to the latest keypoint-detection models such as HRNet [39], ViTPose [40], and PCT [41] could also lead to performance enhancements. Additionally, while keypoints are utilized for accurately estimating the fish length, it is possible to avoid using keypoints for species where keypoint detection accuracy is poor. Owing to the high segmentation performance of the proposed method, it is feasible to roughly estimate the fish length, albeit with slightly lower accuracy, by employing a simple method that connects the ends of the segmentation mask.
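As a rough illustration of such a mask-only estimate (an assumption of how the ends could be connected, here via the mask's principal axis rather than the authors' exact rule):

```python
import numpy as np

def rough_length_px(mask: np.ndarray) -> float:
    """Roughly estimate body length in pixels by projecting mask pixels onto their
    principal axis and taking the span between the two extreme projections."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # Principal axis of the mask (direction of largest variance).
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    proj = pts @ vt[0]
    return float(proj.max() - proj.min())
```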
In addition, upon examining misclassified images in Webfish, it was observed that Grounded-SAM segmentation results could be inaccurate, particularly when confronted with complex backgrounds, as illustrated in Figure 11 (note that these are not Webfish data but a reference image). The instance involving Seriola quinqueradiata in the lower section of the figure highlights accurate box detection by Grounding DINO but SAM segmentation failure. Because SAM employs a box as a prompt to detect plausible objects within the specified region, it seems that objects other than fish are sometimes detected.

5. Conclusions

In this study, we introduced deep-learning-based image-recognition models for automating aspects of resource survey tasks, focusing on instance segmentation, keypoint detection, and error checking for fish. The foundation model Grounded-SAM was used for instance segmentation, while the keypoint detection and error-checking models were trained using our original FIB dataset and images synthesized from it. Experimental assessments were conducted using image data captured in the laboratory and in Japanese market settings. The results demonstrated the effectiveness of the proposed method, achieving highly accurate instance segmentation in scenarios where fish do not overlap. However, challenges arose in detecting closely overlapping fish in some instances. The keypoint detection model exhibited high accuracy when the entire fish was visible; however, accuracy declined when tail regions were fragmented. The error-checking model displayed robust performance in detecting multiple tails, while its accuracy in identifying missing fish reached approximately 75%. The method demonstrated general robustness across 205 fish species surveyed in Japanese resource assessments.
As described, the proposed method can accurately detect the fish regions of species targeted for resource survey in Japan, as long as there is minimal overlap among the fish. Even fish detected as overlapping can be excluded from the survey target by using our error-checking model. However, in actual fishing operations, where the catch is densely packed in boxes for shipment, operational ingenuity is required to utilize the proposed method. For example, arranging the fish in boxes without overlap in the depth direction, as shown in Figure 10, allows the proposed method to detect the catch efficiently. Similarly, in conveyor environments like Lab3 or sorting tables like Lab4, human assistance in arranging to avoid overlaps during photography can lead to the construction of a resource survey system that enhances the collaboration between AI and humans. Furthermore, enhancing the FIB at each fishing port can improve keypoint detection performance, and fine-tuning SAM itself is expected to enhance the detection performance of the fish region.
This study has several limitations. First, its validation is restricted to fish species found in Japanese waters, and it has not been tested on a global scale using fish species from various countries. Future studies should involve expanding the dataset and evaluating the model’s robustness across diverse fish species. The second limitation pertains to the reduced accuracy of instance segmentation in high-density environments. The model struggled, especially when fish overlapped in the depth direction, failing to detect even the topmost fish. The third limitation relates to detection errors in SAM, as illustrated in Figure 11. While box detection performs well, addressing SAM detection errors could benefit from the method proposed by Dai et al. [42].

Author Contributions

Conceptualization, T.H.; Data curation, T.H. and D.N.; Formal analysis, T.H.; Funding acquisition, T.H.; Investigation, T.H.; Methodology, T.H.; Project administration, T.H.; Resources, T.H.; Software, T.H.; Supervision, T.H.; Validation, T.H.; Visualization, T.H.; Writing—original draft, T.H.; Writing—review and editing, T.H. and D.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by JST ACT-X grant number JPMJAX20AJ.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Some of the data underlying this article were provided by Zukan.com, Inc., with permission. Data will be shared directly from each organization by requesting it from them. The FIB dataset we collected will be published at the following URL after the publication of this manuscript. https://haselab.fuis.u-fukui.ac.jp/research/fish/fib.html (accessed on 1 February 2024).

Acknowledgments

The WebFish dataset used in this study was provided by Zukan.com, Inc. The authors would like to express their gratitude to them. We also thank the Tsuruga Branch, Fukui Prefecture of Fisheries Cooperative Association for their cooperation with our survey at the fish market. This research was supported by the Stock Assessment Program funded by the Fisheries Agency, Japan.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Notes

1
Grounded-SAM: https://github.com/IDEA-Research/Grounded-Segment-Anything (accessed on 1 February 2024)
2
Segmentation anything https://github.com/facebookresearch/segment-anything (accessed on 1 February 2024)
3
github: coco-annotator: https://github.com/jsbroks/coco-annotator (accessed on 1 February 2024)
4
Web Sakana Zukan by Zukan.com, Inc., Chiyoda-ku, Tokyo, Japan. https://zukan.com/fish/ (accessed on 1 February 2024)

References

  1. Keimer, A.; Laurent-Brouty, N.; Farokhi, F.; Signargout, H.; Cvetkovic, V.; Bayen, A.M.; Johansson, K.H. Information Patterns in the Modeling and Design of Mobility Management Services. Proc. IEEE 2018, 106, 554–576. [Google Scholar] [CrossRef]
  2. Chen, X.; Lv, S.; long Shang, W.; Wu, H.; Xian, J.; Song, C. Ship energy consumption analysis and carbon emission exploitation via spatial-temporal maritime data. Appl. Energy 2024, 360, 122886. [Google Scholar] [CrossRef]
  3. Worm, B.; Hilborn, R.; Baum, J.K.; Branch, T.A.; Collie, J.S.; Costello, C.; Fogarty, M.J.; Fulton, E.A.; Hutchings, J.A.; Jennings, S.; et al. Rebuilding global fisheries. Science 2009, 325, 578–585. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, P.; Xie, M.; Zhou, W.; Suo, A. Research on Fishery Resource Assessment and Sustainable Utilization (FRASU) during 1990–2020: A bibliometric review. Glob. Ecol. Conserv. 2021, 29, e01720. [Google Scholar] [CrossRef]
  5. Garcia, R.; Prados, R.; Quintana, J.; Tempelaar, A.; Gracias, N.; Rosen, S.; Vågstøl, H.; Løvall, K. Automatic segmentation of fish using deep learning with application to fish size measurement. ICES J. Mar. Sci. 2019, 77, 1354–1366. [Google Scholar] [CrossRef]
  6. Álvarez-Ellacuría, A.; Palmer, M.; Catalán, I.A.; Lisani, J.L. Image-based, unsupervised estimation of fish size from commercial landings using deep learning. ICES J. Mar. Sci. 2019, 77, 1330–1339. [Google Scholar] [CrossRef]
  7. Tseng, C.H.; Kuo, Y.F. Detecting and counting harvested fish and identifying fish types in electronic monitoring system videos using deep convolutional neural networks. ICES J. Mar. Sci. 2020, 77, 1367–1378. [Google Scholar] [CrossRef]
  8. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
  9. Bravata, N.; Kelly, D.; Eickholt, J.; Bryan, J.; Miehls, S.; Zielinski, D. Applications of deep convolutional neural networks to predict length, circumference, and weight from mostly dewatered images of fish. Ecol. Evol. 2020, 10, 9313–9325. [Google Scholar] [CrossRef]
  10. Saleh, A.; Sheaves, M.; Jerry, D.; Rahimi Azghadi, M. Applications of deep learning in fish habitat monitoring: A tutorial and survey. Expert Syst. Appl. 2024, 238, 121841. [Google Scholar] [CrossRef]
  11. Suo, F.; Huang, K.; Ling, G.; Li, Y.; Xiang, J. Fish Keypoints Detection for Ecology Monitoring Based on Underwater Visual Intelligence. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020; pp. 542–547. [Google Scholar]
  12. Yu, Y.; Zhang, H.; Yuan, F. Key point detection method for fish size measurement based on deep learning. IET Image Proc. 2023, 17, 4142–4158. [Google Scholar] [CrossRef]
  13. Dong, J.; Shangguan, X.; Zhou, K.; Gan, Y.; Fan, H.; Chen, L. A detection-regression based framework for fish keypoints detection. Intell. Mar. Technol. Syst. 2023, 1, 9. [Google Scholar] [CrossRef]
  14. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  15. Boom, B.J.; Huang, P.X.; He, J.; Fisher, R.B. Supporting ground-truth annotation of image datasets using clustering. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; pp. 1542–1545. [Google Scholar]
  16. Ulucan, O.; Karakaya, D.; Turkan, M. A Large-Scale Dataset for Fish Segmentation and Classification. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–5. [Google Scholar]
  17. Garcia-d’Urso, N.; Galan-Cuenca, A.; Pérez-Sánchez, P.; Climent-Pérez, P.; Fuster-Guillo, A.; Azorin-Lopez, J.; Saval-Calvo, M.; Guillén-Nieto, J.E.; Soler-Capdepón, G. The DeepFish computer vision dataset for fish instance segmentation, classification, and size estimation. Sci. Data 2022, 9, 287. [Google Scholar] [CrossRef]
  18. Hasegawa, T.; Kondo, K.; Senou, H. Transferable Deep Learning Model for the Identification of Fish Species for Various Fishing Grounds. J. Mar. Sci. Eng. 2024, 12, 415. [Google Scholar] [CrossRef]
  19. Shah, S.Z.H.; Rauf, H.T.; IkramUllah, M.; Khalid, M.S.; Farooq, M.; Fatima, M.; Bukhari, S.A.C. Fish-Pak: Fish species dataset from Pakistan for visual features based classification. Data Brief 2019, 27, 104565. [Google Scholar] [CrossRef]
  20. Zhuang, P.; Wang, Y.; Qiao, Y. WildFish: A Large Benchmark for Fish Recognition in the Wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1301–1309. [Google Scholar]
  21. Zhuang, P.; Wang, Y.; Qiao, Y. Wildfish++: A Comprehensive Fish Benchmark for Multimedia Research. IEEE Trans. Multimed. 2021, 23, 3603–3617. [Google Scholar] [CrossRef]
  22. Hasegawa, T.; Tanaka, M. Few-shot Fish Length Recognition by Mask R-CNN for Fisheries Resource Management. IPSJ Trans. Consum. Devices Syst. 2022, 12, 38–48. (In Japanese) [Google Scholar]
  23. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  24. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
  25. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  26. Zhang, C.; Liu, L.; Cui, Y.; Huang, G.; Lin, W.; Yang, Y.; Hu, Y. A Comprehensive Survey on Segment Anything Model for Vision and Beyond. arXiv 2023, arXiv:2305.08196. [Google Scholar]
  27. Ren, S.; Luzi, F.; Lahrichi, S.; Kassaw, K.; Collins, L.M.; Bradbury, K.; Malof, J.M. Segment anything, from space? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–10 January 2024; pp. 8355–8365. [Google Scholar]
  28. Huang, Y.; Yang, X.; Liu, L.; Zhou, H.; Chang, A.; Zhou, X.; Chen, R.; Yu, J.; Chen, J.; Chen, C.; et al. Segment anything model for medical images? Med. Image Anal. 2024, 92, 103061. [Google Scholar] [CrossRef]
  29. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-training. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10955–10965. [Google Scholar]
  30. Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 9630–9640. [Google Scholar]
  31. Zhang, C.; Puspitasari, F.D.; Zheng, S.; Li, C.; Qiao, Y.; Kang, T.; Shan, X.; Zhang, C.; Qin, C.; Rameau, F.; et al. A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering. arXiv 2023, arXiv:2306.06211. [Google Scholar]
  32. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  33. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Computer Vision—ECCV 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  34. Prokop, K.; Połap, D. Heuristic-based image stitching algorithm with automation of parameters for smart solutions. Expert Syst. Appl. 2024, 241, 122792. [Google Scholar] [CrossRef]
  35. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
  36. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. AAAI 2020, 34, 13001–13008. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Hasegawa, T.; Tanaka, M. Validation of the effectiveness of Detic as a zero-shot fish catch recognition system. In Proceedings of the 11th IIAE International Conference on Industrial Application Engineering (ICIAE), Okinawa, Japan, 26–30 March 2023. [Google Scholar]
  39. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar]
  40. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
  41. Geng, Z.; Wang, C.; Wei, Y.; Liu, Z.; Li, H.; Hu, H. Human Pose as Compositional Tokens. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 660–671. [Google Scholar]
  42. Dai, H.; Ma, C.; Liu, Z.; Li, Y.; Shu, P.; Wei, X.; Zhao, L.; Wu, Z.; Zeng, F.; Zhu, D.; et al. SAMAug: Point Prompt Augmentation for Segment Anything Model. arXiv 2023, arXiv:2307.01187. [Google Scholar]
Figure 1. Overview of automatic fish resource survey system.
Figure 2. An example of a fish length estimation method.
Figure 3. Process outline of the proposed method for detecting regions and keypoints in each fish.
Figure 4. (a) Segmentation results by SAM: the left horse mackerel is detected based on the green bbox, and the bottom flatfish is detected based on the points of the green star mark. (b) Object detection results by Grounding DINO: all boxes are detected based on the text prompt “mackerel”. (c) Instance segmentation results by Grounded-SAM: all fish regions are detected by SAM based on boxes detected by Grounding DINO with the prompt “fish”.
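As the Figure 4 caption describes, Grounding DINO first proposes bounding boxes from a text prompt and SAM then converts each box into an instance mask. The following is a minimal sketch of that box-prompted pipeline, assuming the publicly released groundingdino and segment_anything packages and their standard checkpoints; the file paths, prompt, and thresholds are placeholders.

```python
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Load both foundation models (config and checkpoint paths are placeholders).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# 1) Text-prompted box detection with Grounding DINO.
image_source, image = load_image("catch_box.jpg")      # RGB array + preprocessed tensor
boxes, logits, phrases = predict(model=dino, image=image, caption="fish",
                                 box_threshold=0.35, text_threshold=0.25)

# Grounding DINO returns normalized (cx, cy, w, h); convert to pixel (x1, y1, x2, y2).
h, w = image_source.shape[:2]
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# 2) Box-prompted instance segmentation with SAM, one mask per detected fish.
predictor.set_image(image_source)
masks = []
for box in boxes_xyxy.numpy():
    m, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(m[0])                                  # boolean HxW mask
print(f"{len(masks)} fish regions segmented")
```

Prompting SAM with the detected boxes yields one mask per proposed fish region, which is the behavior illustrated in Figure 4c.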
Figure 5. Examples from the FIB dataset, each comprising a 4K image (containing a single fish), a species label, a segmentation mask, and a keypoint label.
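Each FIB entry in Figure 5 couples an image with a species label, a segmentation mask, and keypoints. The snippet below sketches how one such entry could be stored as a COCO-style record; the field names, polygon, species, and two-keypoint layout (e.g., snout tip and tail fork) are hypothetical illustrations rather than the actual FIB schema.

```python
# Hypothetical COCO-style record for one FIB entry (illustrative values only).
fib_record = {
    "image": {"id": 1, "file_name": "fib_000001.jpg", "width": 3840, "height": 2160},
    "categories": [{"id": 7, "name": "Trachurus japonicus"}],        # species label
    "annotation": {
        "image_id": 1,
        "category_id": 7,
        "bbox": [820, 600, 1400, 520],                                # x, y, w, h
        # polygon outlining the fish body (segmentation mask)
        "segmentation": [[820, 860, 1200, 620, 2100, 700, 2220, 860, 2100, 1020, 1200, 1100]],
        # keypoints as (x, y, visibility) triplets, e.g. snout tip and tail fork
        "keypoints": [830, 860, 2, 2210, 865, 2],
        "num_keypoints": 2,
    },
}
print(fib_record["categories"][0]["name"], fib_record["annotation"]["num_keypoints"])
```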
Figure 6. Synthesized images for training the error-checking model: (1) a correct synthetic image labeled {single, correct}; (2) a multiple-fish synthetic image labeled {multi, correct}; (3) a missing-part synthetic image labeled {single, missing}; and (4) a multiple-fish, missing-part synthetic image labeled {multi, missing}.
Figure 7. Sample images from the test datasets. Lab1 and Lab4 exhibit relatively dense fish arrangements, whereas Lab2 and Lab3 contain minimal overlap between fish. All market images show actual boxed catches, so the degree of overlap ranges from minimal to severe.
Figure 8. Cases of partial detection failure. (a) A Seriola quinqueradiata with only its mouth visible in the upper-right corner was not detected. (b) An Arctoscopus japonicus located in the center was missed.
Figure 9. Detection results of sample images of the test datasets.
Figure 10. Keypoint detection results for sample images (a–c) from the Market 1–3 datasets. Different mask colors indicate different instances.
Figure 11. Examples of SAM segmentation failures.
Table 1. Summary of the test dataset. ✓ denotes where we tested, and ✗ indicates areas we could not use for testing due to potential utilization as a training dataset.
| Dataset | # of Pics. | # of Fishes | # of Species | IS (mAP) | IS (Ratio) | KP (Ratio) | EC (Ratio) |
| FIB | 405 | 405 | 24 | | | | |
| Multiple (1) | 12 | 97 | 9 | | | | |
| Multiple (2) | 86 | 553 | 26 | | | | |
| Webfish | 842 | 842 | 205 | | | | |
Table 2. Summary of the Multiple (2) dataset.
| Env. | # of Pics. | # of Fishes | # of Species | Date | Resolution |
| Lab1 | 9 | 132 | 8 | 11 March 2021 | full HD |
| Lab2 | 24 | 102 | 5 | 1 April 2021 | full HD |
| Lab3 | 12 | 87 | 5 | 21 July 2022 | 4K |
| Lab4 | 5 | 79 | 6 | 11 May 2023 | 4K |
| Market1 | 12 | 39 | 8 | 4 March 2021 | 4K |
| Market2 | 11 | 43 | 11 | 21 April 202 | 4K |
| Market3 | 13 | 71 | 11 | 7 December 2023 | 4K |
| Total | 86 | 553 | 26 | | |
Table 3. Evaluation results of instance segmentation performance using FIB and Multiple (1) datasets.
| Dataset | mAP | AP50 | AP75 | AR1 | AR10 | AR100 |
| FIB | 0.997 | 0.998 | 0.998 | 0.999 | 0.999 | 0.999 |
| Multiple (1) | 0.864 | 0.976 | 0.976 | 0.114 | 0.760 | 0.889 |
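The mAP, AP50, AP75, and AR values in Table 3 are the standard COCO-style instance segmentation metrics. A minimal sketch of computing them with pycocotools is given below, assuming ground truth and predictions exported in COCO JSON format (the file names are placeholders).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and predictions in COCO JSON format (placeholder file names).
coco_gt = COCO("fib_groundtruth.json")
coco_dt = coco_gt.loadRes("fib_predictions.json")

# 'segm' evaluates masks rather than boxes, matching instance segmentation.
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints mAP, AP50, AP75, AR@1, AR@10, AR@100, ...
```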
Table 4. Evaluation results of instance segmentation performance using Multiple (2) dataset.
| Env. | Recall [%] | Precision [%] | Fishes | Success | Misdetection | Union |
| Lab1 | 70.8 | 100.0 | 130 | 92 | 0 | 0 |
| Lab2 | 98.0 | 100.0 | 102 | 100 | 0 | 0 |
| Lab3 | 100.0 | 100.0 | 89 | 89 | 0 | 0 |
| Lab4 | 59.0 | 97.9 | 78 | 46 | 0 | 1 |
| Market1 | 79.5 | 96.9 | 39 | 31 | 0 | 1 |
| Market2 | 68.2 | 90.9 | 44 | 30 | 1 | 2 |
| Market3 | 32.5 | 86.2 | 77 | 25 | 0 | 4 |
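The percentages in Table 4 are consistent with a simple counting rule: recall as successfully detected fish over all fish, and precision as successes over all predicted regions (successes plus misdetections plus union regions). The check below reproduces the Lab4 row under that assumed reading.

```python
def recall_precision(fishes, success, misdetection, union):
    """Recall = success / fishes; precision = success / (success + misdetection + union).
    This is an inferred reading of Table 4, not a formula quoted from the paper."""
    recall = 100.0 * success / fishes
    precision = 100.0 * success / (success + misdetection + union)
    return round(recall, 1), round(precision, 1)

print(recall_precision(78, 46, 0, 1))   # Lab4 row -> (59.0, 97.9)
```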
Table 5. Evaluation results of keypoint detection performance.
| Mask Condition | Env. | Mostly Correct | Part of Tail Missing | Critical Missing |
| Correct mask | Lab1 | 48 (85.7%) | 6 (10.7%) | 2 (3.6%) |
| | Lab2 | 66 (78.6%) | 11 (13.1%) | 7 (8.3%) |
| | Lab3 | 61 (76.3%) | 13 (16.3%) | 6 (7.5%) |
| | Lab4 | 27 (84.4%) | 4 (12.5%) | 1 (3.1%) |
| | Market1 | 9 (90.0%) | 1 (10.0%) | 0 (0.0%) |
| | Market2 | 8 (66.7%) | 3 (25.0%) | 1 (8.3%) |
| | Market3 | 3 (50.0%) | 3 (50.0%) | 0 (0.0%) |
| | Total | 222 (79.3%) | 41 (14.6%) | 17 (6.1%) |
| Part of mask missing | Lab1 | 1 (14.3%) | 6 (85.7%) | 0 (0.0%) |
| | Lab4 | 3 (33.3%) | 0 (0.0%) | 6 (66.7%) |
| | Market1 | 6 (35.3%) | 8 (47.1%) | 3 (17.6%) |
| | Market2 | 6 (54.5%) | 5 (45.5%) | 0 (0.0%) |
| | Market3 | 5 (50.0%) | 4 (40.0%) | 1 (10.0%) |
| | Total | 21 (38.9%) | 23 (42.6%) | 10 (18.5%) |
Table 6. Confusion matrix of the prediction result of the error-checking model (overall, whether the entire body of fish can be detected).
| Pred.\True | Correct Mask | Excess Mask | Part of Mask Missing | Critical Missing |
| Correct | 228 | 0 | 19 | 75 |
| Missing | 52 | 3 | 35 | 71 |
| Recall | 80.6% | | 53.0% | |
| Precision | 70.8% | | 65.8% | |
| F1 score | 75.4% | | 58.7% | |
| Accuracy | 69.2% | | | |
| Macro F1 score | 67.0% | | | |
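The summary rows of Table 6 appear to collapse the four true mask conditions into a binary decision, treating {correct mask, excess mask} as regions to keep and {part of mask missing, critical missing} as regions to flag. Under that assumed reading, the short script below reproduces the reported recall, precision, accuracy, and macro-F1 from the confusion matrix.

```python
import numpy as np

# Rows: predicted {Correct, Missing}; columns: true {Correct mask, Excess mask,
# Part of mask missing, Critical missing}, as in Table 6.
cm = np.array([[228, 0, 19, 75],
               [52, 3, 35, 71]])

def f1(p, r):
    return 2 * p * r / (p + r)

keep_tp = cm[0, :2].sum()            # predicted "Correct" on a keepable mask
flag_tp = cm[1, 2:].sum()            # predicted "Missing" on a defective mask
recall_keep, recall_flag = keep_tp / cm[:, :2].sum(), flag_tp / cm[:, 2:].sum()
prec_keep, prec_flag = keep_tp / cm[0].sum(), flag_tp / cm[1].sum()
accuracy = (keep_tp + flag_tp) / cm.sum()
macro_f1 = (f1(prec_keep, recall_keep) + f1(prec_flag, recall_flag)) / 2

print(f"recall {recall_keep:.1%}/{recall_flag:.1%}, precision {prec_keep:.1%}/{prec_flag:.1%}, "
      f"accuracy {accuracy:.1%}, macro-F1 {macro_f1:.1%}")
# -> recall 80.6%/53.0%, precision 70.8%/65.8%, accuracy 69.2%, macro-F1 67.0%
```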
Table 7. Confusion matrix of the prediction result of the error-checking model with postprocessing (overall, whether the entire body of fish can be detected).
| Pred.\True | Correct Mask | Excess Mask | Part of Mask Missing | Critical Missing |
| Correct | 211 | 0 | 19 | 28 |
| Missing | 69 | 3 | 35 | 118 |
| Recall | 74.6% | | 76.5% | |
| Precision | 81.8% | | 68.0% | |
| F1 score | 78.0% | | 72.0% | |
| Accuracy | 75.4% | | | |
| Macro F1 score | 75.0% | | | |
Table 8. Confusion matrix of the prediction result of the error-checking model (multi, whether the multiple fish can be detected in the same region).
| Pred.\True | Single | Multiple |
| Single | 443 | 2 |
| Multiple | 30 | 8 |
| Recall | 93.7% | 80.0% |
| Precision | 99.6% | 21.1% |
| F1 score | 96.5% | 33.3% |
| Accuracy | 93.4% | |
| Macro F1 score | 64.9% | |
Table 9. Instance segmentation (IS) and keypoint detection (KP) accuracy for each fish order. KP (strict) counts only the “Mostly correct” results, whereas KP counts both the “Mostly correct” and “Part of tail missing” results from Table 5.
| Order | # of Fish | IS | KP (Strict) | KP |
| Perciformes | 417 | 94.0% | 42.4% | 72.2% |
| Pleuronectiformes | 151 | 92.7% | 18.5% | 63.6% |
| Aulopiformes | 87 | 97.7% | 63.2% | 89.7% |
| Clupeiformes | 30 | 96.7% | 76.7% | 93.3% |
| Tetraodontiformes | 29 | 100.0% | 48.3% | 89.7% |
| Stomiiformes | 23 | 91.3% | 56.5% | 73.9% |
| Beloniformes | 17 | 88.2% | 47.1% | 58.8% |
| Gadiformes | 16 | 81.3% | 62.5% | 81.3% |
| Anguilliformes | 15 | 93.3% | 20.0% | 46.7% |
| Lophiiformes | 10 | 90.0% | 10.0% | 30.0% |
| Zeiformes | 10 | 90.0% | 20.0% | 50.0% |
| Ophidiiformes | 5 | 100.0% | 0.0% | 100.0% |
| Beryciformes | 5 | 100.0% | 60.0% | 100.0% |
| Salmoniformes | 5 | 100.0% | 100.0% | 100.0% |
| Gasterosteiformes | 5 | 80.0% | 0.0% | 20.0% |
| Myliobatiformes | 5 | 100.0% | 0.0% | 20.0% |
| Mugiliformes | 5 | 100.0% | 60.0% | 100.0% |
| Argentiniformes | 4 | 100.0% | 75.0% | 100.0% |
| Lampriformes | 3 | 100.0% | 0.0% | 33.3% |
| Total | 842 | 94.1% | 41.3% | 72.6% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
