Article

FDMNet: A Multi-Task Network for Joint Detection and Segmentation of Three Fish Diseases

The Higher Educational Key Laboratory for Measuring and Control Technology and Instrumentations of Heilongjiang Province, Harbin University of Science and Technology, Harbin 150080, China
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(9), 305; https://doi.org/10.3390/jimaging11090305
Submission received: 3 August 2025 / Revised: 3 September 2025 / Accepted: 5 September 2025 / Published: 6 September 2025

Abstract

Fish diseases are one of the primary causes of economic losses in aquaculture. Existing deep learning models have progressed in fish disease detection and lesion segmentation. However, many models still have limitations, such as detecting only a single type of fish disease or completing only a single task within fish disease detection. To address these limitations, we propose FDMNet, a multi-task learning network. Built upon the YOLOv8 framework, the network incorporates a semantic segmentation branch with a multi-scale perception mechanism. FDMNet performs detection and segmentation simultaneously. The detection and segmentation branches use the C2DF dynamic feature fusion module to address information loss during local feature fusion across scales. Additionally, we use uncertainty-based loss weighting together with PCGrad to mitigate conflicting gradients between tasks, improving the stability and overall performance of FDMNet. On a self-built image dataset containing three common fish diseases, FDMNet achieved 97.0% mAP50 for the detection task and 85.7% mIoU for the segmentation task. Relative to the multi-task YOLO-FD baseline, FDMNet’s detection mAP50 improved by 2.5% and its segmentation mIoU by 5.4%. On the dataset constructed in this study, FDMNet achieved competitive accuracy in both detection and segmentation. These results suggest potential practical utility.
Key Contribution: This study proposes a multi-task framework integrating fish disease detection and semantic segmentation. It includes a C2DF module for cross-scale feature fusion, and uses uncertainty-based loss weighting with PCGrad to improve training stability and task coordination.

1. Introduction

Aquaculture, a globally important food production sector, contributes to food security and promotes sustainable economic development [1,2,3]. However, as stocking density increases, fish diseases—such as bacterial hemorrhagic septicemia, saprolegniasis, and fish lice—occur more frequently and remain a significant constraint to profitability [4,5,6]. Outbreaks spread rapidly, causing large-scale mortality and secondary environmental pollution [7,8,9]. Therefore, early detection and precise lesion segmentation are critical for intelligent aquaculture systems [10].
Traditional diagnosis relies on manual sampling and visual inspection, which is subjective, time-consuming, and prone to missing the optimal treatment window. With advances in computer vision, deep learning provides alternatives to manual inspection for object detection and image segmentation in fish-health monitoring. However, many existing systems treat detection or segmentation in isolation or target only a single fish disease.
To address these gaps, we propose FDMNet, a multi-task framework built on YOLOv8. It adds a semantic segmentation branch and enables simultaneous detection and segmentation tasks. The segmentation branch employs multi-scale feature fusion to capture lesion boundaries and fine details better. FDMNet uses C2DF modules that combine channel and spatial attention to enhance segmentation of blurred boundaries and small lesions. We adopt uncertainty-based loss weighting with PCGrad for joint training to mitigate gradient conflicts between tasks and improve training stability.
The main contributions of this work are as follows:
  • We present FDMNet, a lightweight multi-task model that integrates a YOLOv8 detection head and a semantic segmentation branch to simultaneously detect and segment three fish diseases (bacterial hemorrhagic septicemia, saprolegniasis, and fish lice).
  • We adapt the C2DF module by integrating dynamic feature fusion into C2f blocks. C2DF replaces C2f modules in the neck and the segmentation branch, improving the local-detail representation and boundary modelling.
  • The study adopts a multi-task optimisation strategy that combines uncertainty-based loss weighting with PCGrad to improve training stability and reduce inter-task gradient interference.
  • This study developed a multi-disease fish image dataset and evaluated all proposed modules on it. We compared FDMNet with Faster R-CNN, YOLOv8n, YOLOv11n, RT-DETR, YOLO-FD, and Mask R-CNN for the detection task, and with U-Net, DeepLabv3-ResNet50, DeepLabv3+-ResNet50, YOLO-FD, and Mask R-CNN for the segmentation task. On our dataset, FDMNet achieved competitive performance on both tasks, aligning with the needs of image-based fish disease diagnosis.
This paper is organised as follows. Section 2 reviews related work on fish disease detection and lesion segmentation. Section 3 introduces our self-constructed dataset, then describes the overall structure of FDMNet and introduces the C2DF block and the multi-scale semantic segmentation branch. It also describes a training strategy that combines uncertainty weighting with PCGrad. Section 4 details the training configuration and specifies the evaluation metrics. It then reports the experimental results. These include detection comparisons with Faster R-CNN, YOLOv8n, YOLOv11n, RT-DETR, YOLO-FD, and Mask R-CNN, segmentation comparisons with U-Net, DeepLabv3, DeepLabv3+, YOLO-FD, and Mask R-CNN, ablation studies, and Grad-CAM visualisation. Section 5 discusses the main findings, notes model and data limitations, and outlines future work. Section 6 presents the conclusions.

2. Related Work

Research on fish disease image analysis has progressed from classical machine-learning pipelines to deep learning approaches. Classical machine learning pipelines combined pre-processing, feature extraction, and a classifier [11]. Early methods relied on handcrafted features with simple classifiers, which improved efficiency over manual inspection. Ahmed et al. reported an SVM-based approach using image enhancement, k-means clustering, and texture features, achieving 94.1% accuracy on a salmon-disease image dataset with augmentation [12]. Waleed et al. investigated RGB, YCbCr, and XYZ colour spaces with Gaussian modelling to segment infected regions [13]. However, the reliance on manual feature design made adaptation and maintenance increasingly demanding as datasets and task complexity grew. With the advent of convolutional neural networks, two main lines have emerged: detection models that localise diseased fish and segmentation models that delineate lesions.
Building on this shift, advances in CNN-based methods have further accelerated fish disease image analysis [14]. In detection, Yu et al. proposed a YOLOv4-MobileNetV3 variant for identifying four deep-sea fish diseases and reported low latency on their dataset [15]. Hamzaoui et al. introduced FishDETECT, which reported high precision in low-light conditions using transfer learning and tailored optimisation [16]. Zhou et al. designed a YOLOv7-based feature-enhancement module and reported accuracy and speed improvements for static underwater imagery [17]. Li et al. presented DDEYOLOv9 with the DRNELAN4 module and a dynamic-convolution head, improving abnormal behaviour detection in their setting [18]. Cai et al. proposed NAM-YOLOv7, which combines Auto-MSRCR enhancement, NAM attention in ELAN, and MPDIoU loss [19]. They reported 97.3% accuracy, 93.8% recall, and approximately 0.18 s per image, outperforming several YOLO-series baselines for SVC symptom localisation.
In segmentation, Zhang et al. proposed an iterative instance segmentation mechanism combining YOLOv5 and YOLOv8, which reported strong performance on the DeepFish dataset [20]. Long et al. enhanced YOLOv5s with Coordinate Attention and CBAM, plus positive-sample matching and EMA, and reported 95.88% mAP and improvements over YOLOv3, YOLOv4, YOLOv5s, YOLOv5m and SSD on their dataset [21]. Ben Tamou, Benzinou, and Nasreddine explored ResNeXt-101 transfer learning with targeted augmentation and a hierarchical family-to-species classifier, reporting 99.86% accuracy on FRGT and 81.53% on LifeClef-2015 [22]. Li et al. improved DeepLabv3+ with adaptive-threshold pre-processing and reported 92.6% mIoU on Fish4Knowledge [23]. Kong et al. proposed AASNet with Linear Correlation Attention and Dynamic Adaptive Focal Loss, reporting an mAP of 47.4 on USIS and 28.9 ms inference time [24]. Akram et al. proposed an autonomous net-pen defect detector using a multi-scale semantic-segmentation topology that fuses attention across decomposition levels and reported gains of 6.58%, 3.69%, 6.44%, and 4.78% mAP on LABUST, KU, NDv1, and NDv2, respectively [25]. Paul et al. introduced FishSegSSL, a semi-supervised fisheye image segmentation method with pseudo-label filtering and dynamic thresholding, reporting a 10.49% improvement over supervised baselines [26]. To address both tasks simultaneously, Li et al. proposed YOLO-FD, integrating a segmentation branch into a YOLO detector for fish disease analysis [27].
Although these studies have advanced aquaculture image analysis, most remain limited to a single task. The multi-task model YOLO-FD detects only a single disease and cannot accommodate additional categories. Moreover, its segmentation performance can be further improved. These limitations motivate the development of a unified multi-task framework that couples detection with pixel-level lesion segmentation, enabling coverage of more fish disease categories and improving both detection and segmentation performance.

3. Materials and Methods

3.1. Dataset Acquisition

The image dataset used in this study was obtained from laboratory-reared fish and field aquaculture environments, covering three common freshwater fish diseases: bacterial hemorrhagic septicemia, saprolegniasis, and fish lice. As shown in Figure 1, bacterial hemorrhagic septicemia typically presents with congestion or haemorrhaging in areas such as the mouth, gills, jaws, eyes, and fins. Saprolegniasis is characterised by white, cotton-like mycelial growths on the body surface. Fish lice are ectoparasites that attach to the fish’s skin, exhibiting a semi-transparent or light brown appearance, with some species presenting as black spots or striped patterns.
This study collected images of healthy and diseased fish above and below the water surface using a smartphone with high-resolution imaging capabilities. Data augmentation techniques were applied to expand the dataset, followed by rigorous screening and cleaning to retain only high-quality images for experimentation to enhance the model’s generalisation ability. The final dataset comprised 2148 images, including 528 images of bacterial hemorrhagic septicemia, 512 saprolegniasis, 508 fish lice, and 600 healthy fish. We split the dataset into training, validation, and test sets in a 7:2:1 ratio. For detection, bounding boxes were annotated with LabelImg. For segmentation, masks were annotated with LabelMe. We applied the same label taxonomy in both tasks: bacterial hemorrhagic septicemia as “red,” saprolegniasis as “sap,” and fish lice as “lice.” Table 1 displays the image counts by class and data split (train/validation/test) for bacterial hemorrhagic septicemia, saprolegniasis, and fish lice. It also lists the overall totals of detection bounding boxes and segmentation masks.

3.2. Overall Structure of FDMNet

We adopt YOLOv8n as the base model because of its established architecture, modular design, and mature deployment ecosystem. However, standard YOLOv8 follows a single-task paradigm. We therefore extend YOLOv8n with a semantic-segmentation branch and a multi-task training scheme, forming FDMNet, which jointly performs disease detection and lesion segmentation for three fish diseases. As shown in Figure 2, the entire network structure of FDMNet consists of four main components: a backbone, a neck, a detection head, and a segmentation branch. The backbone extracts multi-scale feature representations from input images, and the neck uses a feature pyramid to enhance and fuse cross-scale features. The detection head localises and classifies fish disease targets. The segmentation branch uses multi-scale feature fusion to improve granularity and delineate lesion regions at the pixel level. In this architecture, the backbone and neck jointly function as shared encoders, whereas the detection head and segmentation branch serve as task-specific decoders. Object detection operates at the image level to identify diseased fish, whereas semantic segmentation functions at the pixel level to precisely outline lesion areas. By unifying these complementary tasks within a multi-task learning framework, FDMNet enables more accurate and comprehensive diagnosis of fish diseases.
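The shared-encoder, task-specific-decoder layout described above can be summarised in a short PyTorch sketch. The module names below are placeholders for the YOLOv8-derived components in Figure 2, not the released FDMNet code.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Schematic of FDMNet's shared-encoder / task-specific-decoder layout.
    `backbone`, `neck`, `det_head`, and `seg_branch` are placeholders for the
    actual YOLOv8-derived modules described in the paper."""

    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 det_head: nn.Module, seg_branch: nn.Module):
        super().__init__()
        self.backbone = backbone      # multi-scale feature extractor (shared)
        self.neck = neck              # cross-scale feature fusion (shared)
        self.det_head = det_head      # object-level disease detection
        self.seg_branch = seg_branch  # pixel-level lesion segmentation

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)          # list of multi-scale feature maps
        fused = self.neck(feats)          # pyramid-fused features
        det_out = self.det_head(fused)    # boxes + class scores
        seg_out = self.seg_branch(fused)  # per-pixel class logits
        return det_out, seg_out
```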

3.2.1. C2DF Module

The three fish diseases—bacterial hemorrhagic septicemia, saprolegniasis, and fish lice—often show small lesion areas, indistinct boundaries, and substantial inter-class variability. These characteristics challenge both detection and segmentation. During multi-scale feature fusion, the original YOLOv8 may attenuate fine local cues, particularly near subtle boundaries and complex textures. This attenuation produces imprecise localisation and blurred masks. To address these issues, we integrate the Dynamic Feature Fusion (DFF) mechanism from MSCB-Unet [28] into the YOLOv8 C2f block, yielding a modified unit termed C2DF. As shown in Figure 2, from layer 12 onward throughout the neck, detection head, and segmentation branch, C2f is replaced by C2DF. This fusion method enables the model to dynamically enhance local details and effectively integrate global semantics while preserving the lightweight main structure, improving detection accuracy and segmentation performance.
As shown in Figure 3, the proposed C2DF module comprises the following components: input channel expansion convolution, Split operation, bottleneck_DFF modules, a Concat fusion layer, and a final output convolution.
We denote the input feature $X_{cf}^{b,m,i,j} \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $b \in \{1, 2, \ldots, B\}$ indexes the batch, $m \in \{1, 2, \ldots, C_{in}\}$ indexes the input channels, and $i \in \{1, 2, \ldots, H\}$, $j \in \{1, 2, \ldots, W\}$ index the spatial height and width. Here, $B$ is the batch size, $C_{in}$ is the number of input channels, and $H$ and $W$ are the spatial dimensions of the input feature map. The module outputs a feature tensor $Y_{cf} \in \mathbb{R}^{B \times C_{out} \times H \times W}$, where $C_{out}$ denotes the number of output channels and $C_{in} = C_{out}$.
As shown in Equation (1), the input feature $X_{cf}^{b,m,i,j}$ is first processed by a 1 × 1 convolutional layer (Conv1), which performs a pointwise channel expansion. This operation yields an intermediate feature map $F_0^{b,k_1,i,j} \in \mathbb{R}^{B \times 2C \times H \times W}$, where $C = C_{out} \cdot e$, $0 < e < 1$, and $e$ denotes the channel compression ratio. Here, $C$ is the reduced number of channels after compression. Let $W_{c1}^{k_1,m}$ denote the weight matrix of Conv1, where $k_1 \in \{1, 2, \ldots, 2C\}$ indexes the output channels. The element $X_{cf}^{b,m,i,j}$ is the value at sample $b$, channel $m$, and spatial position $(i, j)$:

$$F_0^{b,k_1,i,j} = \mathrm{Conv1}\left(X_{cf}^{b,m,i,j}\right) = \sum_{m=1}^{C_{in}} W_{c1}^{k_1,m}\, X_{cf}^{b,m,i,j} \quad (1)$$

The feature map $F_0^{b,k_1,i,j}$ is evenly split along the channel dimension using a Split operation, producing two sub-feature maps. One part is used as the initial residual feature, denoted $F_0^{b,k_2,i,j} \in \mathbb{R}^{B \times C \times H \times W}$. The other part, denoted $F_{in}^{b,k_2,i,j} \in \mathbb{R}^{B \times C \times H \times W}$, serves as the input to the subsequent Bottleneck_DFF module.
The Bottleneck_DFF module performs deep feature extraction and adaptive fusion. It applies two sequential 3 × 3 convolutional layers (Conv3), followed by the Dynamic Feature Fusion (DFF) operation. The input $F_{in}^{b,k_2,i,j}$ is first passed through a Conv3 layer to produce $F_{in}'^{\,b,k_2,i,j} \in \mathbb{R}^{B \times C \times H \times W}$, which is then processed by a second Conv3 layer to yield $F_{in}''^{\,b,k_2,i,j} \in \mathbb{R}^{B \times C \times H \times W}$. Both $F_{in}'^{\,b,k_2,i,j}$ and $F_{in}''^{\,b,k_2,i,j}$ are subsequently fed into the DFF module for feature fusion, as shown in Equation (2). We denote the output channel index of each convolution as $k_2 \in \{1, 2, \ldots, C\}$:

$$F_{in}'^{\,b,k_2,i,j} = \mathrm{Conv3}\left(F_{in}^{b,k_2,i,j}\right) = \sum_{m=1}^{C}\sum_{u=-1}^{1}\sum_{v=-1}^{1} W_{c2}^{k_2,m,u+1,v+1}\, F_{in}^{b,m,i+u,j+v}$$
$$F_{in}''^{\,b,k_2,i,j} = \mathrm{Conv3}\left(F_{in}'^{\,b,k_2,i,j}\right) = \sum_{m=1}^{C}\sum_{u=-1}^{1}\sum_{v=-1}^{1} W_{c3}^{k_2,m,u+1,v+1}\, F_{in}'^{\,b,m,i+u,j+v} \quad (2)$$

Here, $W_{c2}^{k_2,m,u+1,v+1}$ and $W_{c3}^{k_2,m,u+1,v+1}$ denote the learnable weights of the first and second 3 × 3 convolutional layers in the Bottleneck_DFF module, where $u, v \in \{-1, 0, 1\}$ are the relative spatial offsets of the kernel along the height and width dimensions.
The DFF module integrates both channel attention and spatial attention mechanisms, enabling dynamic recalibration of feature importance based on image content. This attention increases sensitivity to small lesion regions that are difficult to detect. Compared with static fusion, DFF offers greater adaptability and selectivity and reduces performance drops when appearance varies across disease types. In our experiments, it yields higher detection accuracy and sharper segmentation boundaries.
Equation (3) shows the processing flow of the channel attention within the DFF. Specifically, we concatenate the input features $F_{in}'^{\,b,k_2,i,j}$ and $F_{in}''^{\,b,k_2,i,j}$ via the Concat operation to produce the fused feature map $F_l \in \mathbb{R}^{B \times 2C \times H \times W}$. The model passes this feature map through a global average pooling layer (AVGPool), a Conv1, and a Sigmoid to generate the channel attention weights $W_{ch}$:

$$F_l^{b,k_2,i,j} = \mathrm{Concat}\left(F_{in}'^{\,b,k_2,i,j},\, F_{in}''^{\,b,k_2,i,j}\right) \quad (3)$$

As shown in Equation (4), the fused feature map $F_l^{b,k_2,i,j}$ is passed through an AVGPool to obtain the channel descriptor $S_c^{b,k_2} \in \mathbb{R}^{B \times 2C}$, which captures the mean activation over all spatial positions $(i, j)$ for channel $k_2$:

$$S_c^{b,k_2} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F_l^{b,k_2,i,j} \quad (4)$$

Subsequently, the descriptor $S_c^{b,k_2}$ is passed through a Conv1 with weight matrix $W_4^{k_2,m}$, followed by a Sigmoid activation function, to generate the final channel attention weights $W_{ch}^{b,k_2}$, as shown in Equation (5). Here, $W_4^{k_2,m} \in \mathbb{R}^{C \times 2C}$ denotes the learnable kernel of the 1 × 1 convolution that maps the concatenated channel descriptor to the desired attention dimension:

$$W_{ch}^{b,k_2} = \mathrm{Sigmoid}\left(\mathrm{Conv1}\left(S_c^{b,k_2}\right)\right) = \mathrm{Sigmoid}\left(\sum_{m=1}^{2C} W_4^{k_2,m} S_c^{b,m}\right) = \frac{1}{1 + \exp\left(-\sum_{m=1}^{2C} W_4^{k_2,m} S_c^{b,m}\right)} \quad (5)$$

Applying the channel attention weights $W_{ch}^{b,k_2}$ to the fused feature map $F_l^{b,k_2,i,j}$ performs a channel-wise weighting. Next, a Conv1 with weight matrix $W_5^{k_2,m} \in \mathbb{R}^{C \times 2C}$ generates the channel-attention-enhanced output $F_{chl}^{b,k_2,i,j}$; this weight matrix projects the weighted features into the intermediate space. Here, $\odot$ denotes element-wise multiplication, as defined in Equation (6):

$$F_{chl}^{b,k_2,i,j} = \mathrm{Conv1}\left(W_{ch} \odot F_l\right) = \sum_{m=1}^{2C} W_5^{k_2,m}\left(W_{ch}^{b,m} \odot F_l^{b,m,i,j}\right) \quad (6)$$
We formulate the spatial attention mechanism in the DFF module as shown in Equation (7). Two Conv1 layers, with learnable weight matrices $W_6^{k_2,m}$ and $W_7^{k_2,m}$, individually process $F_{in}'^{\,b,k_2,i,j}$ and $F_{in}''^{\,b,k_2,i,j}$. The resulting feature maps are summed element-wise and passed through a Sigmoid activation function to generate the spatial attention map $W_{sp}^{b,0,i,j}$. Here, $W_6^{k_2,m} \in \mathbb{R}^{1 \times C}$ and $W_7^{k_2,m} \in \mathbb{R}^{1 \times C}$ control the contribution of each channel in the respective features to the final spatial weighting:

$$W_{sp}^{b,0,i,j} = \mathrm{Sigmoid}\left(\mathrm{Conv1}\left(F_{in}'^{\,b,k_2,i,j}\right) + \mathrm{Conv1}\left(F_{in}''^{\,b,k_2,i,j}\right)\right) = \frac{1}{1 + \exp\left(-\left(\sum_{m=1}^{C} W_6^{k_2,m} F_{in}'^{\,b,m,i,j} + \sum_{m=1}^{C} W_7^{k_2,m} F_{in}''^{\,b,m,i,j}\right)\right)} \quad (7)$$

As defined in Equation (8), the spatial attention weight $W_{sp}^{b,0,i,j}$ is then applied to $F_{chl}^{b,k_2,i,j}$ via element-wise multiplication ($\odot$) to obtain the final output of the Bottleneck_DFF module, denoted $F_{out}$:

$$F_{out} = W_{sp} \odot F_{chl} = W_{sp}^{b,0,i,j} \odot F_{chl}^{b,k_2,i,j} \quad (8)$$

Outputs from multiple Bottleneck_DFF modules, together with the initial residual feature $F_0^{b,k_2,i,j}$, are concatenated using the Concat operation to form the aggregated feature map $F_{cat}^{b,m,i,j} \in \mathbb{R}^{B \times (n+1)C \times H \times W}$, where $n$ is the number of Bottleneck_DFF modules, as formulated in Equation (9):

$$F_{cat}^{b,m,i,j} = \mathrm{Concat}\left(F_0, F_1, F_2, \ldots, F_n\right) \quad (9)$$

Finally, a convolution layer with kernel $W_8^{k_3,m}$ is applied to produce the final output $Y_{cf} \in \mathbb{R}^{B \times C_{out} \times H \times W}$, as formulated in Equation (10). The kernel $W_8^{k_3,m}$, where $k_3 \in \{1, 2, \ldots, C_{out}\}$, projects the concatenated features to the desired output channel dimension $C_{out}$:

$$Y_{cf}^{b,k_3,i,j} = \sum_{m=1}^{(n+1)C} W_8^{k_3,m} F_{cat}^{b,m,i,j} \quad (10)$$
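For clarity, the following PyTorch sketch shows one plausible implementation of Bottleneck_DFF and the DFF attention defined in Equations (2)–(8). It is a schematic under stated assumptions: the channel widths and the 2C-to-2C channel-attention mapping are our reading of the equations, not the released code.

```python
import torch
import torch.nn as nn

class DFF(nn.Module):
    """Dynamic Feature Fusion sketch following one reading of Eqs. (3)-(8):
    channel attention on the concatenated pair, then a spatial gate built
    from both inputs."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: GAP -> 1x1 conv -> sigmoid over 2C channels (Eqs. 4-5)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.ch_fc = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        # 1x1 conv that projects the re-weighted 2C channels back to C (Eq. 6)
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # Spatial attention: one 1x1 conv per branch, summed and squashed (Eq. 7)
        self.sp_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.sp_b = nn.Conv2d(channels, 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([f1, f2], dim=1)                     # Eq. (3)
        w_ch = self.sigmoid(self.ch_fc(self.avg_pool(fused)))  # Eqs. (4)-(5)
        f_chl = self.reduce(fused * w_ch)                      # Eq. (6)
        w_sp = self.sigmoid(self.sp_a(f1) + self.sp_b(f2))     # Eq. (7)
        return f_chl * w_sp                                    # Eq. (8)

class BottleneckDFF(nn.Module):
    """Bottleneck_DFF sketch: two 3x3 convolutions whose outputs are fused by DFF."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dff = DFF(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.conv1(x)        # F'_in, Eq. (2)
        f2 = self.conv2(f1)       # F''_in, Eq. (2)
        return self.dff(f1, f2)   # adaptive fusion of both stages
```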

3.2.2. Multi-Scale Semantic Segmentation Branch

Object detection localises lesions at the object level but does not describe lesion shape or boundaries in detail. To complement detection, we add a semantic segmentation branch that provides pixel-level segmentation of lesion extent. The branch builds on YOLOv8 core modules and is inspired by multi-scale perceptual parsing in Unified Perceptual Parsing Network (UPerNet) [29], with additional design cues from the segmentation branch of YOLO-FD.
We extract multi-layer features with varying semantic depths from the Backbone and Neck to serve as input for the segmentation branch. After spatial alignment and channel compression, we fuse the features to balance global semantic understanding and local edge information recovery. This fusion process enhances the model’s ability to model complex lesion regions. As shown in Figure 4, the segmentation branch selects feature maps from layers 9, 12, and 15 as input, representing high semantic, balanced, and high spatial resolution features, respectively. Specifically, the ninth layer provides the strongest semantic representations, making it ideal for modelling the global semantic structure of lesion regions. The 12th layer captures a balanced combination of semantic and spatial information, bridging global understanding and fine-grained detail. In contrast, the 15th layer retains the richest edge and texture details, crucial for accurately extracting lesion contours.
To ensure effective alignment and fusion of multi-scale feature maps, as shown in Figure 4, the 20 × 20 × 1024 feature map from the ninth layer is first upsampled by a factor of four and passed through a convolutional layer to reduce the channel dimension to 256, resulting in an 80 × 80 × 256 feature map. The 40 × 40 × 512 feature map from the 12th layer is upsampled by a factor of two and compressed to 256 channels, yielding a second feature map of the same 80 × 80 × 256 resolution. The feature map from the 15th layer is already 80 × 80 × 256. As a result, the three feature maps are spatially and dimensionally aligned, enabling efficient fusion via the Concat operation.
The decoder adopts a hierarchical upsampling strategy. Each stage comprises an upsampling layer, a Conv for refinement, and a C2DF block to enhance feature interactions. This progression restores resolution from 80 × 80 to 160 × 160, 320 × 320, and 640 × 640. A final Conv produces a four-channel output corresponding to three disease classes and background, yielding masks. On our dataset, the segmentation branch contributes to clearer boundary depiction and stable multi-scale responses, and it works jointly with the detection head within the multi-task framework.
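The multi-scale fusion and decoder described above can be sketched as follows. This is a schematic under stated assumptions: the intermediate decoder widths (128/64/32) and the plain convolution blocks standing in for the C2DF units are ours, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegBranch(nn.Module):
    """Sketch of the multi-scale segmentation branch: align the layer-9/12/15
    features to 80x80x256, concatenate, then decode back to full resolution."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.reduce9 = nn.Conv2d(1024, 256, 1)   # layer 9: high-semantic features
        self.reduce12 = nn.Conv2d(512, 256, 1)   # layer 12: balanced features
        # layer 15 is already 80x80x256, so no reduction is applied
        self.fuse = nn.Conv2d(3 * 256, 256, 1)
        self.decoder = nn.ModuleList([
            nn.Sequential(nn.Conv2d(256, 128, 3, padding=1), nn.SiLU()),  # 80 -> 160
            nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.SiLU()),   # 160 -> 320
            nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.SiLU()),    # 320 -> 640
        ])
        self.head = nn.Conv2d(32, num_classes, 1)  # 3 diseases + background

    def forward(self, p9, p12, p15):
        a = F.interpolate(self.reduce9(p9), scale_factor=4, mode="nearest")
        b = F.interpolate(self.reduce12(p12), scale_factor=2, mode="nearest")
        x = self.fuse(torch.cat([a, b, p15], dim=1))
        for stage in self.decoder:
            x = stage(F.interpolate(x, scale_factor=2, mode="nearest"))
        return self.head(x)  # B x num_classes x 640 x 640 logits
```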

3.3. Multi-Task Optimisation Strategy

In fish disease analysis, detection focuses on target localisation and classification, while segmentation focuses on pixel-level region delineation. The two tasks differ in their objectives, feature emphasis, and gradient propagation directions. These differences can lead to loss imbalance, gradient conflicts, and task competition during joint multi-task training, weakening overall model performance. This paper introduces weight uncertainty and PCGrad into FDMNet to mitigate these issues. Weight uncertainty balances the convergence of the two tasks at the level of the loss function, while PCGrad reduces optimisation interference between tasks in the gradient direction. Their combined effect improves the stability and performance of FDMNet.

3.3.1. Weight Uncertainty

Following Kendall et al. [30], we use uncertainty-based multi-task weighting to balance detection and segmentation losses. The method maximises the joint likelihood across tasks, yielding an adaptive loss with one learnable uncertainty parameter per task. During training, these scalars are learned jointly with the network parameters, so the effective loss weights adapt to task difficulty rather than being fixed a priori.
Let the input image be $x$ and the model parameters be $w$. The detection head prediction is denoted by $f^{w}(x)$, the detection target is $y_1$, and the scalar $\sigma_1$ denotes the task uncertainty for detection. As shown in Equation (11), we model detection regression with a Gaussian whose mean is the model prediction and whose variance is $\sigma_1^2$:

$$p\left(y_1 \mid x, w, \sigma_1\right) = \mathcal{N}\left(y_1;\, f^{w}(x),\, \sigma_1^2\right) \quad (11)$$

For semantic segmentation, the target is $y_2$, the segmentation score map is denoted by $g^{w}(x)$, and the scalar $\sigma_2$ denotes the task uncertainty for segmentation. In the uncertainty model, we scale the input to the Softmax by $1/\sigma_2^2$. The per-class likelihood is written as in Equation (12); in practice, this is applied per pixel:

$$p\left(y_2 \mid x, w, \sigma_2\right) = \mathrm{Softmax}\left(\frac{1}{\sigma_2^2}\, g^{w}(x)\right) \quad (12)$$

Assuming conditional independence across tasks given $x$ and $w$, the joint likelihood factorises as in Equation (13):

$$p\left(y_1, \ldots, y_k \mid x, w\right) = p\left(y_1 \mid x, w\right) \cdots p\left(y_k \mid x, w\right) \quad (13)$$

Combining the Gaussian likelihood in Equation (11) with the Softmax likelihood in Equation (12) for the two-task case, Equation (14) gives:

$$p\left(y_1, y_2 \mid x, w, \sigma_1, \sigma_2\right) = p\left(y_1 \mid x, w, \sigma_1\right) \cdot p\left(y_2 \mid x, w, \sigma_2\right) = \mathcal{N}\left(y_1;\, f^{w}(x),\, \sigma_1^2\right) \cdot \mathrm{Softmax}\left(\frac{1}{\sigma_2^2}\, g^{w}(x)\right) \quad (14)$$
By maximum likelihood, we minimise the negative log-likelihood in Equation (15):

$$L\left(W, \sigma_1, \sigma_2\right) = -\log p\left(y_1, y_2 \mid x, w, \sigma_1, \sigma_2\right) = -\log p\left(y_1 \mid x, w, \sigma_1\right) - \log p\left(y_2 \mid x, w, \sigma_2\right) \quad (15)$$

Expanding the joint loss function $L\left(W, \sigma_1, \sigma_2\right)$ yields the final multi-task total loss function. As shown in Equation (16), $L_1(W)$ is the detection task loss and $L_2(W)$ is the segmentation task loss:

$$L\left(W, \sigma_1, \sigma_2\right) = \frac{1}{2\sigma_1^2}\left\|y_1 - f^{w}(x)\right\|^2 + \log\sigma_1 - \frac{1}{\sigma_2^2}\log \mathrm{Softmax}\left(y_2, f^{w}(x)\right) + \log\sigma_2 \approx \frac{1}{2\sigma_1^2} L_1(W) + \frac{1}{\sigma_2^2} L_2(W) + \log\sigma_1 + \log\sigma_2 \quad (16)$$
The total loss function consists of the detection task loss $L_{det}$ and the segmentation task loss $L_{seg}$. The detection loss is a weighted combination of $L_{cls}$ (classification loss), $L_{iou}$ (bounding-box regression loss), and $L_{dfl}$ (distribution focal loss). Equation (17) shows the detection loss, where $\alpha_1 = 0.5$, $\alpha_2 = 7.5$, and $\alpha_3 = 1.5$, all of which are conventional default values:

$$L_{det} = \alpha_1 L_{cls} + \alpha_2 L_{iou} + \alpha_3 L_{dfl} \quad (17)$$

The multi-task total loss function is defined in Equation (18):

$$L_{all} = \frac{1}{2\sigma_1^2} L_{det} + \frac{1}{\sigma_2^2} L_{seg} + \log\sigma_1 + \log\sigma_2 \quad (18)$$
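A minimal PyTorch sketch of the uncertainty-weighted total loss in Equation (18) is shown below. Parameterising the logarithm of each σ keeps the uncertainties positive; the initial values follow the training configuration reported in Section 4.1, and the exact parameterisation in the released code may differ.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of Eq. (18): learnable log-sigmas balance the detection and
    segmentation losses. Initial values follow the paper (sigma1 = 10.689,
    sigma2 = 0.17)."""

    def __init__(self, sigma1: float = 10.689, sigma2: float = 0.17):
        super().__init__()
        self.log_sigma1 = nn.Parameter(torch.tensor(sigma1).log())
        self.log_sigma2 = nn.Parameter(torch.tensor(sigma2).log())

    def forward(self, loss_det: torch.Tensor, loss_seg: torch.Tensor) -> torch.Tensor:
        w_det = 0.5 * torch.exp(-2.0 * self.log_sigma1)  # 1 / (2 * sigma1^2)
        w_seg = torch.exp(-2.0 * self.log_sigma2)        # 1 / sigma2^2
        return w_det * loss_det + w_seg * loss_seg + self.log_sigma1 + self.log_sigma2
```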

3.3.2. PCGrad

In multi-task learning, detection and segmentation are combined with different objectives. Gradients generated during backpropagation for the different tasks can partially cancel each other out, hindering optimisation and slowing convergence. To address this issue, this paper introduces the PCGrad algorithm proposed by Yu et al. [31]. At each backpropagation step, the algorithm computes per-task gradients and checks for conflicts. When a task's gradient is misaligned with another's, PCGrad adjusts that gradient via orthogonal projection to avoid interfering with the optimisation direction of the other task. As shown in Equation (19), let $g_i$ denote the gradient of the object-detection loss and $g_j$ the gradient of the semantic-segmentation loss. Their cosine similarity is

$$\cos\theta_{ij} = \frac{g_i \cdot g_j}{\left\|g_i\right\|\left\|g_j\right\|} \quad (19)$$

When $\cos\theta_{ij} < 0$, the two task gradients are in conflict, and PCGrad projects $g_i$ to remove its component along $g_j$, as shown in Equation (20):

$$g_i' = g_i - \frac{g_i \cdot g_j}{\left\|g_j\right\|^2}\, g_j \quad (20)$$

If $\cos\theta_{ij} \geq 0$, there is no conflict between the gradients, and PCGrad keeps the original gradient unchanged. Unlike the uncertainty weighting mechanism, which adjusts task contributions at the level of loss scale, PCGrad operates on the gradient direction.
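The projection step in Equations (19) and (20) can be sketched for the two-task case as follows; the function operates on flattened gradient vectors and is illustrative rather than the training loop actually used.

```python
import torch

def pcgrad_two_tasks(g_det: torch.Tensor, g_seg: torch.Tensor):
    """Sketch of PCGrad for two tasks on flattened gradient vectors.
    If the gradients conflict (negative dot product), each is projected onto
    the normal plane of the other; otherwise both are left unchanged."""
    g_i, g_j = g_det.clone(), g_seg.clone()
    dot = torch.dot(g_det, g_seg)
    if dot < 0:
        g_i = g_det - dot / g_seg.norm().pow(2) * g_seg  # project detection grad
        g_j = g_seg - dot / g_det.norm().pow(2) * g_det  # project segmentation grad
    return g_i, g_j

# Toy example: conflicting gradients become mutually orthogonal after projection.
gi, gj = pcgrad_two_tasks(torch.tensor([1.0, 0.0]), torch.tensor([-1.0, 1.0]))
```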

4. Results

4.1. Experimental Details

All models were trained for 200 epochs on 2148 images, with 640 × 640 inputs and batch size 8, using a single NVIDIA GeForce RTX 4070 GPU (NVIDIA, Santa Clara, CA, USA). Training used automatic mixed precision (AMP) and a fixed random seed (11) with deterministic flags for reproducibility. The initial uncertainty parameters in the loss, $\sigma_1$ and $\sigma_2$, were set to 10.689 and 0.17, respectively. The system configuration is summarised in Table 2.
Optimisation used AdamW with an initial learning rate of 4.7 × 10−4, cosine decay to 1% of the initial value over 200 epochs, with a 3-epoch warm-up. The exponential decay rates were β1 = 0.937 and β2 = 0.999, and the weight decay was 5 × 10−4. This configuration yields stable convergence and good performance on a held-out in-distribution test set.
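As a reference, the optimiser and schedule described above can be assembled with standard PyTorch utilities. The warm-up start factor below is an assumption, since the paper does not specify it, and the placeholder model stands in for FDMNet.

```python
import torch

base_lr = 4.7e-4
model = torch.nn.Conv2d(3, 16, 3)  # placeholder model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                              betas=(0.937, 0.999), weight_decay=5e-4)

# 3-epoch linear warm-up, then cosine decay to 1% of the initial rate by epoch 200.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=3)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=197,
                                                    eta_min=0.01 * base_lr)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer,
                                                  schedulers=[warmup, cosine],
                                                  milestones=[3])

for epoch in range(200):
    # ... one training epoch over the loader ...
    scheduler.step()  # epoch-wise schedule update
```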
To mitigate overfitting, we applied data augmentation to reduce the model’s memorization of sample-specific details. The data augmentation comprised HSV jitter (h = 0.015, s = 0.7, v = 0.4), random translation (±0.1 of the image size), random scaling (±0.5), and horizontal flip (p = 0.5). Mosaic augmentation was enabled for most training and disabled during the final 12 epochs to stabilise optimisation. Additionally, we used a MixUp ratio of 0.1. We applied weight decay (5 × 10−4) to penalise large weights and constrain effective model capacity. Finally, we used a held-out validation set with early stopping (patience = 30) based on validation mAP. We maintained an exponential moving average (EMA) of the parameters (decay = 0.9999) to smooth updates and improve generalisation.

4.2. Experimental Results

4.2.1. Evaluation Metrics

We evaluate experimental results using established metrics: Precision, Recall, mAP50, and mIoU. This choice follows common practice and enables direct comparison with widely used models. Precision measures the proportion of predicted positives that are correct. It reflects sensitivity to false positives. Recall measures the proportion of ground-truth objects that are detected. It reflects the risk of missed detections. Using Precision and Recall together makes the trade-off between false positives and false negatives explicit. The metric mAP50 is the mean Average Precision computed at an IoU threshold of 0.5. It averages this value across classes, summarising detection accuracy and localization at a fixed overlap threshold.
As shown in Equations (21) and (22), the calculation formulas for Precision and Recall are as follows. A prediction is a true positive (TP) if its intersection over union with the best-matching ground truth is at least 0.5. Unmatched predictions are false positives (FP), and unmatched ground truths are false negatives (FN):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (21)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (22)$$

For each class $k$, the Average Precision $\mathrm{AP}_k$ is the area under the precision–recall curve at IoU = 0.5. With $K$ classes, as shown in Equation (23),

$$\mathrm{mAP50} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{AP}_k \quad (23)$$

For segmentation, we report the mean Intersection over Union (mIoU). For class $k$, as shown in Equation (24),

$$\mathrm{IoU}_k = \frac{TP_k}{TP_k + FP_k + FN_k} \quad (24)$$

where $TP_k$, $FP_k$, and $FN_k$ are pixel counts for class $k$. The mean IoU is given by Equation (25):

$$\mathrm{mIoU} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{IoU}_k \quad (25)$$
In this study, K = 3 disease classes (excluding background). All metrics were computed on the held-out test set.
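For completeness, a small NumPy sketch of the per-class pixel IoU and its mean over the three disease classes (background excluded) is given below; it assumes integer label maps with 0 reserved for background.

```python
import numpy as np

def compute_miou(pred: np.ndarray, target: np.ndarray, num_classes: int = 3) -> float:
    """Per-class pixel IoU averaged over the disease classes, following the
    IoU and mIoU formulas above. Class 0 is treated as background and excluded."""
    ious = []
    for k in range(1, num_classes + 1):            # classes 1..3
        tp = np.logical_and(pred == k, target == k).sum()
        fp = np.logical_and(pred == k, target != k).sum()
        fn = np.logical_and(pred != k, target == k).sum()
        denom = tp + fp + fn
        if denom > 0:                               # skip classes absent from both maps
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```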

4.2.2. Object Detection Results

To assess FDMNet’s effectiveness in object detection, we compared it with representative detectors: Faster R-CNN, YOLOv8n and YOLOv11n, Transformer-based RT-DETR, and the multi-task models YOLO-FD and Mask R-CNN. Evaluation metrics include precision, recall, mAP50, parameter count, and per-image speed. The results are summarised in Table 3.
On our dataset, FDMNet achieves the highest overall detection performance among the compared methods, with a precision of 95.3%, a recall of 92.1%, and an mAP50 of 97.0%. Relative to the YOLOv8n baseline, FDMNet improves precision by 3.9%, recall by 1.4%, and mAP50 by 3.4%. FDMNet adds just 0.1 M parameters over YOLOv8n and supports semantic segmentation. RT-DETR attains the highest precision (96.1%) but exhibits substantially lower recall, with a larger parameter budget and longer latency, which limits its suitability for real-time and resource-constrained deployments. Compared with Mask R-CNN and YOLO-FD, FDMNet achieves higher precision, recall, and mAP50. While its parameter count is slightly larger than YOLO-FD, improving key accuracy metrics may justify this trade-off for accuracy-sensitive scenarios. Overall, FDMNet provides a favourable accuracy–efficiency balance for fish disease detection and indicates potential for practical deployment within the evaluated conditions.
All models were trained under the same input resolution and evaluated on our dataset. The best checkpoint for each model was selected on the validation performance and then evaluated on the test set. Figure 5 presents qualitative comparisons for bacterial hemorrhagic septicemia, saprolegniasis, and fish lice using Faster R-CNN, Mask R-CNN, and FDMNet. For saprolegniasis, the single-task detector Faster R-CNN often yields incomplete detections, particularly on the tail and fins. The multi-task Mask R-CNN improves coverage and detects tail lesions, yet misses portions of the affected regions. In these examples, FDMNet provides more complete detections across the tail and fin areas.

4.2.3. Semantic Segmentation Results

Classic semantic segmentation networks, including U-Net, DeepLabv3-ResNet50, and DeepLabv3+-ResNet50, were chosen to compare with the proposed model. We also evaluated lightweight DeepLabv3 and DeepLabv3+ variants with MobileNet backbones to keep parameter counts comparable to FDMNet. In addition, we included YOLO-FD and Mask R-CNN as multi-task comparison models. We report mIoU, parameter count, FLOPs, and speed in Table 4. On our dataset, FDMNet attains the highest mIoU. Relative to U-Net, DeepLabv3-ResNet50, DeepLabv3+-ResNet50, Mask R-CNN, and YOLO-FD, the mIoU gains are 20.9%, 12.3%, 9.0%, 16.2%, and 5.4%. While FDMNet’s parameter count and latency are slightly higher than YOLO-FD, the overall accuracy–efficiency trade-off remains favourable for practical deployment. These quantitative results support the effectiveness of joint detection–segmentation training for this application while maintaining a compact model design.
Figure 6 provides qualitative comparisons among U-Net, Mask R-CNN, and FDMNet. Across the three diseases—bacterial hemorrhagic septicemia, saprolegniasis, and fish lice—FDMNet produces more complete masks and more precise lesion segmentation in these examples. U-Net and Mask R-CNN show task-specific limitations. For bacterial hemorrhagic septicemia, both tend to under-segment small lesion regions. For saprolegniasis, they frequently miss portions of tail- and fin-area lesions. For fish lice, they exhibit noticeable boundary errors and fragmented masks. By contrast, FDMNet more consistently localises lesions and yields accurate, contiguous segmentations across the three conditions on the same images.

4.3. Ablation Experiments

To investigate the coupling effects between detection and segmentation tasks in multi-task learning, we compare single-task training (detection-only or segmentation-only) with joint multi-task training. As shown in Table 5, the results demonstrate consistent performance improvements across all metrics when using multi-task learning, regardless of whether the C2f or C2DF module was employed. Specifically, multi-task training yielded higher detection precision, mAP50, and mIoU values than single-task approaches. These findings suggest a synergistic relationship between object detection and semantic segmentation, where each task enhances the other’s performance. The segmentation task contributes by refining the model’s understanding of spatial structures and semantic features through pixel-level boundary analysis. In contrast, the detection task provides precise object localization, improving segmentation boundary accuracy.
Replacing C2f with C2DF during multi-task training yields additional gains. Relative to the C2f variant, precision increases from 0.933 to 0.953, recall from 0.898 to 0.921, mAP50 from 0.945 to 0.970, and mIoU from 0.823 to 0.857. These increases suggest that C2DF provides improved representational capacity for joint detection–segmentation under our settings. In our setup, Detection-only models train only the detection head and do not report mIoU. Segmentation-only models train only the segmentation branch and do not report detection metrics. Non-applicable metrics are marked with “—”.
To compare the attention patterns of C2DF with those of the original C2f for lesion-relevant features, we select layer 15 in the detection head as the site for heat map analysis. This layer aggregates fine-grained information from lower levels. It is informative for small targets and edge-sensitive cues, common in fish disease lesions characterised by small size, complex morphology, and blurred boundaries.
We apply Grad-CAM to the output of Layer 15 to display how C2DF and C2f attend to lesion regions. As shown in Figure 7, the C2f-based maps sometimes provide insufficient coverage of diseased areas for bacterial hemorrhagic septicemia and exhibit spurious activations for fish lice. In these examples, FDMNet with C2DF produces more contiguous activations that align more closely with the annotated lesions across all three diseases. These observations are consistent with the improved feature separation and fine-grained sensitivity afforded by C2DF and complement the quantitative gains reported in Table 5.
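The heat maps in Figure 7 follow the standard Grad-CAM recipe: channel weights from spatially averaged gradients, a weighted sum of the layer-15 activations, ReLU, and upsampling to image size. A minimal hook-based sketch is shown below; `score_fn` is a placeholder for reducing the detector output to a scalar and is not part of the FDMNet API.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Minimal Grad-CAM sketch with forward/backward hooks on `target_layer`."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        score = score_fn(model(image))   # scalar score, e.g. top detection confidence
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)             # channel weights
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))   # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalise to [0, 1]
```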

5. Discussion

5.1. Model Evaluation

FDMNet achieves consistent detection and segmentation performance, providing an integrated solution for multi-task fish disease analysis. On our dataset, FDMNet achieved 97.0% mAP50 for detection and 85.7% mIoU for segmentation. Relative to the multi-task baseline YOLO-FD, FDMNet improved mAP50 by 2.5% and mIoU by 5.4% under the same evaluation protocol, while maintaining a compact parameter count and low single-image latency on our setup. These gains are attributable to two design choices: a multi-scale segmentation branch that aggregates features across semantic levels for small, diffuse, or boundary-sensitive lesions; and the C2DF module, which combines channel and spatial attention to enhance feature separation and fine-grained perception. In addition, coupling uncertainty-based loss weighting with PCGrad mitigates inter-task gradient conflicts and stabilises joint training. Although FDMNet incurs a modest increase in latency and parameters compared with the lightest baselines, the resulting accuracy–efficiency trade-off aligns with practical deployment constraints.

5.2. Model Limitations and Future Work

Although FDMNet achieves favourable results on our dataset, it has more parameters and higher inference latency than the lightest baselines. This overhead is mainly due to the added segmentation branch. It may make deployment more difficult on edge devices with limited resources. Early in training, fusing low-level features can misalign multi-scale representations. This misalignment produces noisy or inconsistent gradients and destabilises training, especially near ambiguous lesion boundaries. The evaluation scope is limited to a random split of a single dataset, so the reported results reflect within-distribution performance only. Robustness under distribution shift has not yet been assessed. We did not conduct out-of-distribution validation across different farms, imaging modalities, or acquisition settings. In addition, the current dataset does not include examples of fish with more than one disease. This absence may restrict performance in real-world scenarios where multi-disease cases occur.
Future work will expand disease coverage beyond the three categories to include additional common fish diseases. Data collection will also be broadened across species, farms, devices, and acquisition conditions. We will systematically collect cases where a single fish has multiple diseases, with multi-label annotations at the image, object, and pixel levels. We will evaluate cross-domain generalisation and report both in-distribution and out-of-distribution performance. On the architectural side, we will explore more efficient cross-scale alignment and fusion, employ lightweight attention, pruning, quantisation, and knowledge distillation, and assess on-device latency and throughput to target a favourable accuracy–efficiency trade-off. For learning stability, we will investigate curriculum or warm-up fusion, uncertainty-aware loss reweighting, and strategies to mitigate gradient conflicts. Beyond vision, we aim to integrate sonar, ultrasound, infrared thermography, and environmental pressure sensing, examining early and late fusion, robustness to missing modalities, and sensor synchronisation. We will also pursue model calibration and uncertainty estimation, human-in-the-loop triage for low-confidence cases, periodic retraining, and on-site validation to improve reliability and facilitate deployment in resource-constrained aquaculture systems.

6. Conclusions

This study presents FDMNet, a lightweight multi-task framework built on YOLOv8n. The model adds a semantic segmentation branch and the C2DF dynamic feature-fusion modules to jointly perform detection and lesion segmentation for three common fish diseases. The training strategy combines uncertainty-based loss weighting with PCGrad to reduce conflicts between tasks and improve training stability. Together, these components provide an integrated pipeline for image-level detection and pixel-level segmentation while retaining a compact parameter budget.
On the held-out test split of our in-house dataset, FDMNet achieved 97.0% mAP50 for detection and 85.7% mIoU for segmentation, with improvements over representative single-task and multi-task baselines. Compared with YOLO-FD, improvements were 2.5% in mAP50 and 5.4% in mIoU. Ablation analyses indicate synergistic benefits of multi-task training and additional improvements from C2DF. Overall, the proposed architecture offers a favourable accuracy–efficiency trade-off under the evaluated conditions.

Author Contributions

Conceptualization, Z.L., Z.Y. and G.L.; data collection and analysis, Z.Y. and G.L.; methodology, Z.L. and Z.Y.; original draft preparation, Z.Y.; review and editing, Z.L., Z.Y. and G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and IACUC regulations, and approved by the Faculty Research Committee of Measuring and Control Technology and Instrumentation (protocol code No. 20250317 and 17 March 2025).

Informed Consent Statement

This article does not contain any studies with human participants performed by any of the authors.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

AI was merely used as a translation aid.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jiang, Q.J.; Wu, M.M.; Zhang, D.Y. Evidence of the Contribution of the Technological Progress on Aquaculture Production for Economic Development in China-Research Based on the Transcendental Logarithmic Production Function Method. Agriculture 2023, 13, 544. [Google Scholar] [CrossRef]
  2. Verdegem, M.; Buschmann, A.H.; Latt, U.W.; Dalsgaard, A.J.T.; Lovatelli, A. The contribution of aquaculture systems to global aquaculture production. J. World Aquac. Soc. 2023, 54, 206–250. [Google Scholar] [CrossRef]
  3. Garlock, T.M.; Asche, F.; Anderson, J.L.; Eggert, H.; Anderson, T.M.; Che, B.; Chavez, C.A.; Chu, J.J.; Chukwuone, N.; Dey, M.M.; et al. Environmental, economic, and social sustainability in aquaculture: The aquaculture performance indicators. Nat. Commun. 2024, 15, 5274. [Google Scholar] [CrossRef] [PubMed]
  4. Li, D.L.; Li, X.; Wang, Q.; Hao, Y.F. Advanced Techniques for the Intelligent Diagnosis of Fish Diseases: A Review. Animals 2022, 12, 2938. [Google Scholar] [CrossRef]
  5. Ina-Salwany, M.Y.; Al-saari, N.; Mohamad, A.; Mursidi, F.A.; Mohd-Aris, A.; Amal, M.N.A.; Kasai, H.; Mino, S.; Sawabe, T.; Zamri-Saad, M. Vibriosis in Fish: A Review on Disease Development and Prevention. J. Aquat. Anim. Health 2019, 31, 3–22. [Google Scholar] [CrossRef]
  6. Zhao, S.L.; Zhang, S.; Liu, J.C.; Wang, H.; Zhu, J.; Li, D.L.; Zhao, R. Application of machine learning in intelligent fish aquaculture: A review. Aquaculture 2021, 540, 736724. [Google Scholar] [CrossRef]
  7. Ziarati, M.; Zorriehzahra, M.J.; Hassantabar, F.; Mehrabi, Z.; Dhawan, M.; Sharun, K.; Bin Emran, T.; Dhama, K.; Chaicumpa, W.; Shamsi, S. Zoonotic diseases of fish and their prevention and control. Vet. Q. 2022, 42, 95–118. [Google Scholar] [CrossRef]
  8. Li, K.; Jiang, R.T.; Qiu, J.Q.; Liu, J.L.; Shao, L.; Zhang, J.H.; Liu, Q.G.; Jiang, Z.J.; Wang, H.; He, W.H.; et al. How to control pollution from tailwater in large scale aquaculture in China: A review. Aquaculture 2024, 590, 741085. [Google Scholar] [CrossRef]
  9. Anjur, N.; Sabran, S.F.; Daud, H.M.; Othman, N.Z. An update on the ornamental fish industry in Malaysia: Aeromonas hydrophila-associated disease and its treatment control. Vet. World 2021, 14, 1143–1152. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, C.; Li, Z.; Wang, T.; Xu, X.B.; Zhang, X.S.; Li, D.L. Intelligent fish farm-the future of aquaculture. Aquac. Int. 2021, 29, 2681–2711. [Google Scholar] [CrossRef] [PubMed]
  11. Liu, H.C.; Ma, X.; Yu, Y.N.; Wang, L.; Hao, L. Application of Deep Learning-Based Object Detection Techniques in Fish Aquaculture: A Review. J. Mar. Sci. Eng. 2023, 11, 867. [Google Scholar] [CrossRef]
  12. Ahmed, M.S.; Aurpa, T.T.; Azad, M.A. Fish Disease Detection Using Image Based Machine Learning Technique in Aquaculture. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 5170–5182. [Google Scholar] [CrossRef]
  13. Waleed, A.; Medhat, H.; Esmail, M.; Osama, K.; Samy, R.; Ghanim, T.M. Automatic Recognition of Fish Diseases in Fish Farms. In Proceedings of the 2019 14th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, 17 December 2019; pp. 201–206. [Google Scholar]
  14. Islam, S.I.; Ahammad, F.; Mohammed, H. Cutting-edge technologies for detecting and controlling fish diseases: Current status, outlook, and challenges. J. World Aquac. Soc. 2024, 55, e13051. [Google Scholar] [CrossRef]
  15. Yu, G.; Zhang, J.; Chen, A.; Wan, R. Detection and Identification of Fish Skin Health Status Referring to Four Common Diseases Based on Improved YOLOv4 Model. Fishes 2023, 8, 186. [Google Scholar] [CrossRef]
  16. Hamzaoui, M.; Ould-Elhassen Aoueileyine, M.; Romdhani, L.; Bouallegue, R. An Improved Deep Learning Model for Underwater Species Recognition in Aquaculture. Fishes 2023, 8, 514. [Google Scholar] [CrossRef]
  17. Zhou, S.Y.; Cai, K.W.; Feng, Y.H.; Tang, X.M.; Pang, H.S.; He, J.Q.; Shi, X. An Accurate Detection Model of Takifugu rubripes Using an Improved YOLO-V7 Network. J. Mar. Sci. Eng. 2023, 11, 1051. [Google Scholar] [CrossRef]
  18. Li, Y.J.; Hu, Z.Y.; Zhang, Y.X.; Liu, J.H.; Tu, W.; Yu, H. DDEYOLOv9: Network for Detecting and Counting Abnormal Fish Behaviors in Complex Water Environments. Fishes 2024, 9, 242. [Google Scholar] [CrossRef]
  19. Cai, Y.Y.; Yao, Z.K.; Jiang, H.B.; Qin, W.; Xiao, J.; Huang, X.X.; Pan, J.J.; Feng, H. Rapid detection of fish with SVC symptoms based on machine vision combined with a NAM-YOLO v7 hybrid model. Aquaculture 2024, 582, 740558. [Google Scholar] [CrossRef]
  20. Zhang, J.S.; Wang, Y. A New Workflow for Instance Segmentation of Fish with YOLO. J. Mar. Sci. Eng. 2024, 12, 1010. [Google Scholar] [CrossRef]
  21. Long, W.; Wang, Y.W.; Hu, L.X.; Zhang, J.T.; Zhang, C.; Jiang, L.H.; Xu, L.H. Triple Attention Mechanism with YOLOv5s for Fish Detection. Fishes 2024, 9, 151. [Google Scholar] [CrossRef]
  22. Ben Tamou, A.; Benzinou, A.; Nasreddine, K. Targeted Data Augmentation and Hierarchical Classification with Deep Learning for Fish Species Identification in Underwater Images. J. Imaging 2022, 8, 214. [Google Scholar] [CrossRef]
  23. Li, D.S.; Yang, Y.F.; Zhao, S.W.; Yang, H.H. A fish image segmentation methodology in aquaculture environment based on multi-feature fusion model. Mar. Environ. Res. 2023, 190, 106085. [Google Scholar] [CrossRef]
  24. Kong, J.L.; Tang, S.N.; Feng, J.M.; Mo, L.P.; Jin, X.B. AASNet: A Novel Image Instance Segmentation Framework for Fine-Grained Fish Recognition via Linear Correlation Attention and Dynamic Adaptive Focal Loss. Appl. Sci. 2025, 15, 3986. [Google Scholar] [CrossRef]
  25. Cai, X.J.; Zhu, Y.; Liu, S.W.; Yu, Z.Y.; Xu, Y.Y. FastSegFormer: A knowledge distillation-based method for real-time semantic segmentation of surface defects in navel oranges. Comput. Electron. Agric. 2024, 217, 108604. [Google Scholar] [CrossRef]
  26. Paul, S.; Patterson, Z.; Bouguila, N. FishSegSSL: A Semi-Supervised Semantic Segmentation Framework for Fish-Eye Images. J. Imaging 2024, 10, 71. [Google Scholar] [CrossRef] [PubMed]
  27. Li, X.F.; Zhao, S.L.; Chen, C.L.; Cui, H.W.; Li, D.L.; Zhao, R. YOLO-FD: An accurate fish disease detection method based on multi-task learning. Expert Syst. Appl. 2024, 258, 125085. [Google Scholar] [CrossRef]
  28. Wang, Y.; Zhang, H.W.; Fu, J.X.; Tian, H. MSCB-UNet: Elevating skin lesion segmentation performance with Multi-scale Spatial-Channel Bridging Network. Biomed. Signal Process. Control. 2025, 110, 107986. [Google Scholar] [CrossRef]
  29. Xiao, T.T.; Liu, Y.C.; Zhou, B.L.; Jiang, Y.N.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 432–448. [Google Scholar]
  30. Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar]
  31. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. arXiv 2020, arXiv:2001.06782. [Google Scholar] [CrossRef]
Figure 1. Lesion characteristics of bacterial hemorrhagic septicemia, saprolegniasis, and fish lice.
Figure 2. Overview of FDMNet. Each module is labelled by its layer number. Tensor dimensions are annotated beneath modules as height (H) × width (W) × channels (C). Insets show the internal structures of the C2f block and the SPPF module.
Figure 3. C2DF module and its internal network structure. Inset diagrams detail the internal structures of C2DF, Bottleneck_DFF and DFF.
Figure 4. Seghead processing flowchart. Tensor dimensions are annotated beneath modules as height (H) × width (W) × channels (C).
Figure 5. Qualitative comparison of detection results on the held-out test set for three fish diseases—bacterial hemorrhagic septicemia, saprolegniasis, and fish lice—using Faster R-CNN, Mask R-CNN, and FDMNet. Panels map as follows: (a,d,g) bacterial hemorrhagic septicemia, (b,e,h) saprolegniasis, (c,f,i) fish lice, where (ac) are Faster R-CNN, (df) are Mask R-CNN, and (gi) are FDMNet. All models were evaluated under the same protocol on the same test split at an input resolution of 640 × 640. Visualisations use a confidence threshold of 0.5. Bounding boxes and mask overlays are shown as applicable. Colour legend: red mask for bacterial hemorrhagic septicemia, green mask for saprolegniasis, and yellow mask for fish lice.
Figure 6. Qualitative comparison of semantic segmentation on the same test images for three fish diseases—bacterial hemorrhagic septicemia, saprolegniasis, and fish lice—using U-Net, Mask R-CNN, and FDMNet. Panels map as follows: (a,d,g) bacterial hemorrhagic septicemia, (b,e,h) saprolegniasis, (c,f,i) fish lice, where (ac) are U-Net, (df) are Mask R-CNN, and (gi) are FDMNet. All models were evaluated under the same protocol on the held-out test split at an input resolution of 640 × 640. red mask = bacterial hemorrhagic septicemia, green = saprolegniasis, yellow = fish lice.
Figure 7. Attention heatmaps comparing the C2f and C2DF variants on the same test images for three fish diseases: bacterial hemorrhagic septicemia, saprolegniasis, and fish lice. Panels map as follows: (ac) C2f variant and (df) C2DF variant; columns correspond to (a,d) bacterial hemorrhagic septicemia, (b,e) saprolegniasis, and (c,f) fish lice. Heatmaps are computed with Grad-CAM on the P3 detection-head feature (Figure 2, layer 15; 80 × 80, stride 8). All visualisations use the same test split, input resolution 640 × 640.
Table 1. Dataset characteristics and annotation statistics.
Disease Type | Images | Train | Validation | Test | Objects | Masks
bacterial hemorrhagic septicemia | 528 | 370 | 106 | 52 | 989 | 1002
saprolegniasis | 512 | 358 | 102 | 52 | 913 | 925
fish lice | 508 | 355 | 102 | 51 | 907 | 919
Healthy | 600 | 420 | 120 | 60 | 0 | 0
Total | 2148 | 1503 | 430 | 215 | 2809 | 2846
Notes. “Disease type” lists the three disease categories considered. “Healthy” is included only as a negative class to model background and reduce false positives and is not treated as a fourth disease. Images: number of labelled images per class (classes are mutually exclusive and sum to the total). Train/Validation/Test: image counts per split using an approximately 7:2:1 ratio (rounded to whole images). Objects: total number of annotated detection boxes in YOLO-format txt files across all splits. Masks: total number of polygon instances in LabelMe annotations across all splits (a single image can contribute multiple masks).
Table 2. Experimental configuration.
Configuration | Parameter
CPU | 12th Gen Intel Core i7-12800HX (Intel, Santa Clara, CA, USA)
GPU | NVIDIA GeForce RTX 4070
Operating system | Windows 11
Accelerated environment | CUDA Toolkit 12.4
Development environment | Visual Studio Code
Deep learning framework | PyTorch 2.5.0
Table 3. Comparison of FDMNet with representative object detection models.
Algorithms | Precision | Recall | mAP50 | Parameters | Speed
Faster R-CNN | 0.742 | 0.694 | 0.834 | 41.3 M | 9.6 ms
YOLOv8n | 0.914 | 0.907 | 0.936 | 3.2 M | 3.8 ms
YOLOv11n | 0.892 | 0.855 | 0.928 | 2.59 M | 4.3 ms
RT-DETR | 0.961 | 0.770 | 0.961 | 43.7 M | 14.5 ms
Mask R-CNN | 0.762 | 0.709 | 0.831 | 37.7 M | 65.9 ms
YOLO-FD | 0.933 | 0.898 | 0.945 | 3.23 M | 5.9 ms
FDMNet | 0.953 | 0.921 | 0.970 | 3.33 M | 6.8 ms
Table 4. Segmentation performance comparison of FDMNet with baseline models.
Algorithms | mIoU | Params | FLOPs | Speed
U-Net | 0.648 | 4.3 M | 40.1 G | 18.6 ms
DeepLabv3-MobileNet | 0.719 | 5.1 M | 5.8 G | 7.4 ms
DeepLabv3-ResNet50 | 0.734 | 39.6 M | 51.1 G | 23.5 ms
DeepLabv3+-MobileNet | 0.751 | 5.2 M | 16.8 G | 8.0 ms
DeepLabv3+-ResNet50 | 0.767 | 39.9 M | 62.4 G | 28.1 ms
Mask R-CNN | 0.695 | 37.7 M | 95.7 G | 65.9 ms
YOLO-FD | 0.803 | 3.23 M | 14.7 G | 5.9 ms
FDMNet | 0.857 | 3.33 M | 15.1 G | 6.8 ms
Table 5. Results of the ablation experiment.
Algorithms | Precision | Recall | mAP50 | mIoU
Det only | 0.914 | 0.907 | 0.936 | —
Seg only | — | — | — | 0.808
Multi-task (C2f) | 0.933 | 0.898 | 0.945 | 0.823
Multi-task (C2DF) | 0.953 | 0.921 | 0.970 | 0.857
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
