Figure 1.
Examples from existing datasets with different feeding states. These images show the differences between feeding and non-feeding states; motion blur, adhesion, occlusion, overlap, and clustering are present in the feeding images. (a) Non-feeding; (b) moderate feeding; (c) weak feeding; (d) strong feeding.
Figure 2.
Multi-scale variation in fish feeding images. (A) Multi-scale variation between different categories; (B,C) multi-scale variation within the fish2 semantic category; (D,E) multi-scale variation within the fish1 semantic category. The white labels indicate the semantic category of each fish object.
Figure 3.
Example images of fish segmentation in previous works. (A1) Single fish and single category against a simple background; (A2) multiple fish and a single category in an underwater environment; (A3) single fish and multiple categories against a simple background; (A4) single fish and multiple categories, but only one category per image; (B1) multiple fish and multiple categories underwater; (B2,B3) single fish in underwater fish school images; (B4) multiple fish and multiple categories in a complex underwater environment.
Figure 4.
Multi-scale module for extracting contextual semantic information.
Figure 5.
The overall architecture of our proposed FSFS-Net. (A) The original U-Net. (B) Our proposed method. The encoder downsamples the 360 × 480 input image to a size of 22 × 30 at the bottleneck; the decoder then upsamples it back to the input resolution. Three additional output heads are used for the deep supervision loss.
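As a sanity check on the sizes quoted in the caption, the spatial shapes can be traced with a few lines of PyTorch. This is a minimal sketch, assuming standard U-Net-style 2× max-pooling with floor rounding (the channel width of 3 is just the RGB input):

```python
import torch
import torch.nn as nn

# Four 2x downsamplings take a 360 x 480 input to 22 x 30
# at the bottleneck (45 // 2 = 22 due to floor rounding).
x = torch.randn(1, 3, 360, 480)
pool = nn.MaxPool2d(kernel_size=2)
for _ in range(4):
    x = pool(x)
    print(tuple(x.shape[2:]))
# (180, 240) -> (90, 120) -> (45, 60) -> (22, 30)
```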
Figure 6.
Shuffle Polarized Self-Attention (SPSA) architecture.
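If the "shuffle" in SPSA follows the ShuffleNet-style channel shuffle (an assumption; the figure shows the exact design), the operation can be sketched as:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: interleave channels across
    groups so information can flow between grouped branches.
    Assumes the channel count is divisible by `groups`."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)
```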
Figure 7.
Polarized Self-Attention (PSA) module.
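For orientation, a rough sketch of the channel-only branch, following the formulation in the original Polarized Self-Attention work; the layer widths, normalization placement, and the assumption that the channel count is even are ours, not the figure's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelOnlyPSA(nn.Module):
    """Channel-only polarized self-attention (sketch). A spatial softmax
    query pools the value features into a channel descriptor, which is
    re-expanded and used to gate the input channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.wq = nn.Conv2d(channels, 1, kernel_size=1)               # C -> 1
        self.wv = nn.Conv2d(channels, channels // 2, kernel_size=1)   # C -> C/2
        self.wz = nn.Conv2d(channels // 2, channels, kernel_size=1)   # C/2 -> C
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        q = F.softmax(self.wq(x).view(b, 1, h * w), dim=-1)    # (B, 1, HW)
        v = self.wv(x).view(b, c // 2, h * w)                  # (B, C/2, HW)
        z = torch.bmm(v, q.transpose(1, 2))                    # (B, C/2, 1)
        attn = self.wz(z.view(b, c // 2, 1, 1)).view(b, c)     # (B, C)
        attn = torch.sigmoid(self.ln(attn)).view(b, c, 1, 1)
        return x * attn
```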
Figure 8.
(A) A stack of standard convolutions. (B) Our proposed LMSM.
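The parameter savings of LMSM over LMSM * in Table 4 come from depthwise separable convolutions. A generic sketch of such a block (not the exact LMSM layout, which Figure 8 shows): a 3×3 standard convolution costs 9·Cin·Cout weights, whereas the depthwise + pointwise pair costs only 9·Cin + Cin·Cout.

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv; the dilation
    rate lets parallel branches cover multiple receptive-field scales."""
    def __init__(self, in_ch: int, out_ch: int, dilation: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))
```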
Figure 9.
Segmentation head for deep supervision. Notation: Conv is the convolution operation; BN denotes batch normalization; Upsample denotes bilinear interpolation; N is the number of segmentation object categories; C is the number of output channels.
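A minimal sketch consistent with the caption's Conv → BN → Upsample description; the exact layer ordering, the intermediate width `mid_ch` (C in the figure), and the ReLU are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Auxiliary segmentation head: Conv -> BN -> ReLU, a 1x1 classifier
    producing N class maps, then bilinear upsampling to the input size."""
    def __init__(self, in_ch: int, mid_ch: int, num_classes: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(mid_ch)
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, x, out_size):
        x = F.relu(self.bn(self.conv(x)))
        x = self.classifier(x)
        return F.interpolate(x, size=out_size, mode='bilinear',
                             align_corners=False)
```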
Figure 10.
The image acquisition station. Images were taken from a real industrial recirculating aquaculture system; there were approximately 60 fish in the pond.
Figure 11.
Labeled sample images. From top to bottom, the three rows present the labeling rules for the two target categories, the original images, and the ground truth. The red and yellow boxes represent the fish1 and fish2 categories, respectively.
Figure 12.
Improved ASPP module, adapted to our fish segmentation task.
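The exact modifications are shown in Figure 12; for reference, a simplified sketch of a standard ASPP block, with the dilation rates assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel dilated 3x3 branches plus a 1x1 branch, concatenated
    and fused with a 1x1 conv. Dilation rates here are assumptions."""
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```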
Figure 13.
Visualized segmentation results on the fish school feeding dataset, showing that our proposed method can segment multiple instance objects. The example fish images were randomly selected from the test set. From left to right, the four columns show the input image, the ground truth (label), the FSFS-Net prediction, and the final image. The red rectangles in rows (a–f) show instances that were segmented successfully; the yellow rectangles in rows (g–i,m) show some segmentation failures.
Figure 14.
Visualization of the pixel counts extracted from a 3 min video clip of fish feeding. (a) Trends of the P1 and P2 values across the 3 min clip; (b) trends of the P and PR values across the 3 min clip. P1, P2, and P denote the pixel counts of fish1, fish2, and their total, respectively; PR denotes the pixel ratio of fish2 to fish1.
Table 1.
Summary of fish segmentation methods and their applications. Category and Number denote the number of segmentation categories and the number of fish in an image, respectively.
Type | Method | Category | Number | Application |
---|---|---|---|---|
Traditional method | Background models [15], clustering [23] | One-class | Fish school | Counting, biomass, and behavior |
Traditional method | Contour-based [16], improved thinning [24], color-based [17,18] | One-class | Single fish | Mass and size measurement, identification |
Semantic segmentation | SegNet [25] | Two-class | Single fish | Size measurement |
Semantic segmentation | ResNet-FCN [26], U-Net [27], DPANet [28], SegNet [29] | One-class | Fish school, multi-fish | Counting, size measurement |
Instance segmentation | Self-proposed [30] | Multi-category | Multi-fish | Identification |
Instance segmentation | Mask R-CNN [31] and others [32] | One-class | Single fish | Size measurement, identification |
Instance segmentation | Mask R-CNN [33,34] | One-class | Multi-fish | Tracking, size measurement |
Instance segmentation | Mask R-CNN [35] | Four-class | Single fish | Morphological features |
Table 2.
MIoU comparison with other methods on the fish feeding behavior dataset. Our proposed method achieves state-of-the-art performance.
Method | Backbone | Image Size | MIoU (%) |
---|---|---|---|
LinkNet [51] | ResNet18 | 704 × 1280 | 59.06 |
ENet [52] | __ | 720 × 1280 | 65.03 |
BiSeNet v2 [47] | __ | 720 × 1280 | 72.49 |
DDRNet [50] | DDRNet-39 | 704 × 1280 | 75.68 |
OCNet [43] | ResNet101 | 360 × 480 | 42.39 |
DANet [42] | ResNet101 | 360 × 480 | 43.29 |
FCN-8s [53] | VGG16 | 352 × 480 | 69.44 |
SegNet [36] | VGG16 | 360 × 480 | 69.52 |
DFN [41] | ResNet101 | 352 × 480 | 69.77 |
DFN [41] | ResNet50 | 352 × 480 | 55.26 |
ExFuse [54] | ResNet50 | 352 × 480 | 70.46 |
ExFuse [54] | ResNet101 | 352 × 480 | 70.05 |
PSPNet [38] | ResNet50 | 473 × 473 | 71.18 |
HRNet V2 [55] | W48 | 352 × 480 | 75.36 |
DeepLab v3+ [39] | VGG16 | 360 × 480 | 75.83 |
GCN [49] | VGG16 | 352 × 480 | 75.81 |
GCN [49] | ResNet152 | 352 × 480 | 76.14 |
U-Net [37] | __ | 360 × 480 | 77.98 |
FSFS-Net (ours) | __ | 360 × 480 | 79.62 |
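MIoU in Table 2 is the standard mean intersection-over-union. A minimal NumPy sketch; averaging only over classes present in either prediction or ground truth is one common convention and an assumption here:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union over the classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```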
Table 3.
Performance comparison of our proposed method and the baseline U-Net. U-Net+ denotes the U-Net model with five down-sampling stages. DA: data augmentation; NSC: number of skip connections.
Method | DA | NSC | MIoU (%) | Params | GFLOPs |
---|---|---|---|---|---|
U-Net | W/O | 4 | 77.98 | 3.45 M | 166.2 |
U-Net | W | 4 | 78.15 | 3.45 M | 166.2 |
U-Net+ | W/O | 5 | 78.80 | 5.35 M | 99.1 |
U-Net+ | W/O | 4 | 78.97 | 2.99 M | 89.3 |
U-Net+ | W | 4 | 79.29 | 2.99 M | 89.3 |
FSFS-Net | W | 4 | 79.37 | 2.45 M | 89.6 |
FSFS-Net | W/O | 4 | 79.62 | 2.45 M | 89.6 |
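For reproducibility, the Params column can be obtained by summing trainable tensors; the helper below is a generic sketch, not code from the paper:

```python
import torch

def count_params(model: torch.nn.Module) -> float:
    """Trainable parameters in millions, as in the Params column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs are typically measured with a profiler such as thop, e.g.:
#   from thop import profile
#   macs, params = profile(model, inputs=(torch.randn(1, 3, 360, 480),))
# (thop reports multiply-accumulates; GFLOPs conventions vary by a factor of 2.)
```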
Table 4.
Ablation study of adding the LMSM to the proposed method. LMSM *: the depthwise separable convolutions are replaced with standard convolutions; LMSM: the proposed module.
Context Module | MIoU (%) | Params | GFLOPs |
---|---|---|---|
PPM | 79.14 | 3.33 M | 94.7 |
ASPP | 79.29 | 4.51 M | 102.9 |
LMSM * | 79.38 | 4.54 M | 103.4 |
LMSM | 79.62 | 2.45 M | 89.6 |
Table 5.
Ablation study of adding the SPSA to the proposed method.
Attention Module | MIoU (%) | Params | GFLOPs |
---|---|---|---|
PSA | 79.49 | 2.52 M | 97.9 |
SPSA | 79.62 | 2.45 M | 89.6 |
Table 6.
Ablation study of gradually adding the SPSA, the LMSM, and deep supervision (DS) to the baseline U-Net. The experimental results show that performance is further improved by combining the LMSM and SPSA strategies.
U-Net | SPSA | LMSM | DS | MIoU (%) | Gain |
---|---|---|---|---|---|
✓ | | | | 77.98 | |
✓ | ✓ | | | 79.14 | ↑ 1.16 |
✓ | | ✓ | | 79.46 | ↑ 1.48 |
✓ | ✓ | ✓ | | 79.52 | ↑ 1.54 |
✓ | ✓ | ✓ | ✓ | 79.62 | ↑ 1.64 |
Table 7.
Different loss functions bring different improvements.
Method | MIoU (%) |
---|---|
Focal loss | 76.75 |
Weighted cross-entropy | 77.70 |
LDAM loss | 77.72 |
OHEM | 77.94 |
Lovász loss | 79.01 |
RMI loss | 79.62 |
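As one concrete example of the loss choices compared above, weighted cross-entropy in PyTorch; the weight values and class order below are placeholders, not the paper's settings:

```python
import torch
import torch.nn as nn

# Weighted cross-entropy: up-weight rare foreground classes against
# the dominant background class to counter pixel imbalance.
class_weights = torch.tensor([0.5, 1.0, 2.0])  # background, fish1, fish2 (assumed)
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 3, 360, 480)         # (batch, classes, H, W)
target = torch.randint(0, 3, (4, 360, 480))  # integer label mask
loss = criterion(logits, target)
```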