To comprehensively validate the effectiveness of the proposed method, this paper selects five representative multi-task modeling methods as comparative baselines and combines quantitative metrics with visual analysis to systematically demonstrate model effectiveness in multi-task scenarios.
4.3.1. Comparative Experiments on the SCUT-FBP Dataset
Table 1 presents the experimental results of the various models evaluated on the SCUT-FBP dataset. Swin [39] employs a hierarchical architecture with a shifted window attention mechanism, enabling strong multiscale representation learning and delivering robust performance across diverse vision tasks. MTLoRA [52] introduces task-agnostic and task-specific low-rank adaptation modules to facilitate parameter-efficient multi-task fine-tuning, significantly reducing training overhead. InvPT [53], the first Transformer-based framework for multi-task dense prediction, adopts an inverted pyramid architecture to enhance high-resolution cross-task feature interaction. TaskPrompter [54] proposes a prompt-learning-based multi-task Transformer that integrates spatial-channel interaction to jointly model task-shared and task-specific representations, achieving end-to-end MTL. DiffusionMTL [55] formulates the multi-task partial annotation problem as a pixel-wise denoising task, leveraging diffusion processes and task-conditioned modeling to improve prediction quality under weak supervision. For the classification task, our proposed model, MoMamba, outperforms all baselines across multiple evaluation metrics. It achieves an ACC of 68%, surpassing InvPT [53] at 66% and TaskPrompter [54] at 64%. Furthermore, MoMamba obtains an F1 score of 65.27% and an AP of 69.88%, significantly surpassing those of TaskPrompter [54], which are 60.31% and 60.11%, respectively. For the regression task, our model achieves an R² of 0.7473, slightly surpassing the 0.7466 achieved by TaskPrompter [54], a relative improvement of 0.09% that indicates enhanced fidelity in capturing the underlying target distribution. MoMamba obtains an MAE of 0.2912, marginally higher (by 0.0021) than the 0.2891 reported by TaskPrompter [54]. This reflects a deliberate trade-off favoring overall prediction robustness and task-level balance, accepting a slight increase in individual error to achieve better generalization. Notably, the RMSE of MoMamba is 0.3646, lower than the 0.3650 reported by TaskPrompter [54], further evidencing improved regression stability. Additionally, its PC is 0.8694, compared with 0.8657 for TaskPrompter [54], demonstrating superior alignment with the ground truth in regression outputs. Overall, across both classification and regression tasks, our model consistently delivers more robust and generalizable performance, highlighting its effectiveness in MTL scenarios.
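For reference, all of the reported metrics correspond to standard library routines; the sketch below shows one way to compute them with scikit-learn and SciPy. It is illustrative only: the argument arrays are placeholders, and the macro averaging used for F1 and AP is an assumption rather than a detail confirmed by the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             mean_absolute_error, mean_squared_error, r2_score)
from sklearn.preprocessing import label_binarize

def classification_metrics(y_true, y_pred, y_score, n_classes):
    """ACC, macro F1, and macro AP for the beauty-level classification task."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    # AP requires per-class probability scores, so binarize the labels first.
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    ap = average_precision_score(y_bin, y_score, average="macro")
    return acc, f1, ap

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, R², and Pearson correlation for the beauty-score task."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = r2_score(y_true, y_pred)
    pc, _ = pearsonr(y_true, y_pred)
    return mae, rmse, r2, pc
```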
To further analyze the classification performance of each model, we visualize their confusion matrices on the test set, as shown in Figure 9. Specifically, subfigures (a)–(f) correspond to the confusion matrices of Swin [39], MTLoRA [52], InvPT [53], TaskPrompter [54], DiffusionMTL [55], and MoMamba. Each confusion matrix illustrates the prediction distribution across categories, where the horizontal and vertical axes represent the predicted and ground-truth labels, respectively. The values along the main diagonal indicate the number of correctly classified samples for each class, where higher values correspond to better recognition accuracy for that class. Conversely, off-diagonal values reflect the degree of misclassification between classes. As shown in Table 1, InvPT [53] achieves the second-highest ACC; however, as observed from Figure 9c, it fails to correctly classify any samples in class “3”. In contrast, as shown in Figure 9f, our proposed model successfully classifies 50% of the samples in this category. This substantial difference in class-wise recognition performance directly contributes to the overall accuracy gap and serves as a key reason why InvPT [53] underperforms our model in classification accuracy.
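For readers who wish to reproduce this kind of visualization, the sketch below shows one way to lay out such a grid of confusion matrices with scikit-learn and matplotlib. The `preds_by_model` mapping is a hypothetical placeholder, not the paper's plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_confusion_grid(y_true, preds_by_model, class_names):
    """preds_by_model: hypothetical dict {model name: predicted labels}."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 9))
    for ax, (name, y_pred) in zip(axes.ravel(), preds_by_model.items()):
        # Rows are ground-truth labels and columns are predictions, matching
        # the axis convention described for Figure 9.
        ConfusionMatrixDisplay.from_predictions(
            y_true, y_pred, display_labels=class_names, ax=ax, colorbar=False)
        ax.set_title(name)
    fig.tight_layout()
    return fig
```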
Figure 10 illustrates the regression performance of each model, where subfigures (a)–(f) correspond to Swin [39], MTLoRA [52], InvPT [53], TaskPrompter [54], DiffusionMTL [55], and MoMamba. Each subfigure displays predicted versus true values, along with marginal histograms and residual error distributions. Blue and orange points represent training and test samples, respectively.
Figure 10a,b show pronounced deviations, especially at the low and high score ranges. Although Figure 10c–e demonstrate moderate improvements, the predictions remain more scattered, with larger residual errors observed between 2.0 and 3.5. In contrast, Figure 10f illustrates that our model achieves a tight clustering of predictions along the ideal line, and the marginal histograms of the training and test sets are highly consistent, indicating strong generalization capability. Residuals are concentrated near zero, particularly within the range of 3.5–4.0, where prediction errors are minimal. Our proposed model achieves RMSE values of 0.1462 and 0.3646 on the training and test sets, respectively, a difference of 0.2184. By comparison, TaskPrompter [54] records RMSE values of 0.1152 and 0.3726 on the training and test sets, respectively, a larger difference of 0.2574. This result demonstrates that our model generalizes better and is less prone to overfitting.
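A plot of this kind can be assembled with standard matplotlib primitives. The sketch below is a simplified, illustrative reconstruction (the residual subplot is omitted); the variable names are placeholders and the layout details are assumptions, not the paper's plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def plot_regression_fit(train_true, train_pred, test_true, test_pred):
    fig = plt.figure(figsize=(6, 6))
    grid = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                            hspace=0.05, wspace=0.05)
    ax = fig.add_subplot(grid[1, 0])                   # main scatterplot
    ax_top = fig.add_subplot(grid[0, 0], sharex=ax)    # marginal histogram (true)
    ax_right = fig.add_subplot(grid[1, 1], sharey=ax)  # marginal histogram (pred)

    ax.scatter(train_true, train_pred, s=8, c="tab:blue", label="train")
    ax.scatter(test_true, test_pred, s=8, c="tab:orange", label="test")
    lims = [min(ax.get_xlim()[0], ax.get_ylim()[0]),
            max(ax.get_xlim()[1], ax.get_ylim()[1])]
    ax.plot(lims, lims, "k--", lw=1)                   # ideal y = x line
    ax.set_xlabel("true score")
    ax.set_ylabel("predicted score")
    ax.legend()

    # Marginal histograms expose distributional shift between the splits.
    ax_top.hist([np.asarray(train_true), np.asarray(test_true)], bins=30,
                color=["tab:blue", "tab:orange"])
    ax_right.hist([np.asarray(train_pred), np.asarray(test_pred)], bins=30,
                  orientation="horizontal", color=["tab:blue", "tab:orange"])

    # Train/test RMSE gap: a smaller gap suggests less overfitting.
    gap = rmse(test_true, test_pred) - rmse(train_true, train_pred)
    fig.suptitle(f"RMSE gap (test - train) = {gap:.4f}")
    return fig
```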
In summary, our model achieves superior performance in both classification and regression tasks, with higher accuracy, stronger regression metrics, and better generalization from training to test data, confirming its effectiveness in MTL scenarios.
4.3.2. Comparative Experiments on the SCUT-FBP5500 Dataset
Table 2 summarizes the experimental results of all models on the SCUT-FBP5500 dataset. Compared with mainstream MTL approaches, the proposed model achieves consistently superior performance across all evaluation metrics. In the classification task, the model achieves an ACC of 78.36%, an improvement of 4.00 percentage points over the 74.36% reached by DiffusionMTL [55]. The F1 score of the proposed model is 77.28%, which is 6.29 percentage points higher than the 70.99% reported by Swin [39]. In terms of AP, the proposed model attains 80.94%, which is 7.75 percentage points higher than the 73.19% achieved by Swin [39]. For the regression task, the proposed model also demonstrates higher predictive accuracy and stability. Its PC reaches 0.9109, an improvement of 0.0107 over the 0.9002 yielded by DiffusionMTL [55]. The RMSE is reduced to 0.2952, a decrease of 0.0074 compared to the 0.3026 reported by DiffusionMTL [55]. In addition, R² reaches 0.8083, an improvement of 0.0091 over the 0.7992 achieved by DiffusionMTL [55]. These results demonstrate the effectiveness and robustness of the proposed model in multi-task FBP, with substantial improvements in both classification and regression tasks.
Figure 11 presents a comprehensive comparison of confusion matrices across the different models in the classification task. The second-best-performing model, DiffusionMTL [55], shown in Figure 11e, exhibits critical limitations in minority class detection: it completely fails to identify the minority classes, recording 0% prediction ACC for classes “0”, “1”, and “4”. Although DiffusionMTL [55] achieves a marginally higher ACC of 91.6% on the dominant class “2”, compared to the 84.2% achieved by our model, its predictions are severely biased toward this single dominant category, indicating problematic overfitting behavior and a lack of adaptability to imbalanced class distributions. In contrast, as illustrated in Figure 11f, MoMamba exhibits superior performance in recognizing all categories, effectively addressing both dominant and minority classes, with prediction ACCs of 40.0%, 49.3%, 84.2%, 89.5%, and 4.3% for classes “0” through “4”, respectively. Notably, although class “4” is an extremely rare category, the model still recognizes a portion of its samples (4.3%, versus 0% for DiffusionMTL [55]), demonstrating its robustness and class awareness.
These results highlight the balanced performance of the proposed model in classification, combining high accuracy on dominant classes with significantly better recognition of minority classes. This balance translates into enhanced generalization and more robust overall classification outcomes. The findings validate the effectiveness of the proposed multi-task decoding architecture in addressing class imbalance, positioning it as a superior solution for real-world applications where comprehensive class recognition is paramount.
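For concreteness, the per-class accuracies quoted above are the class-wise recalls of the confusion matrix, i.e., each diagonal entry divided by its row total. A minimal sketch, assuming a raw count matrix `cm`:

```python
import numpy as np

def per_class_accuracy(cm):
    """cm[i, j] = number of class-i samples predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Guard against classes absent from the ground truth.
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)
```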
Figure 12 shows the visualization of regression performance for all models. As shown in Figure 12a,b, Swin [39] and MTLoRA [52] suffer from significant deviations from the ideal line, especially at the lower and upper score ranges, reflecting weaker fitting capacity.
Figure 12c–e, corresponding to InvPT [53], TaskPrompter [54], and DiffusionMTL [55], respectively, show moderate improvements in distribution compactness over Swin [39] and MTLoRA [52]. However, the overall spread remains considerable, limiting their generalization ability. Specifically, Swin [39] and MTLoRA [52] exhibit pronounced residual errors in the 1.0–2.0 range, while InvPT [53] and TaskPrompter [54] experience larger errors in the 2.0–3.0 range. Although DiffusionMTL [55] produces lower overall errors, its generalization performance remains inferior to that of the proposed model. Taking RMSE as an example, the proposed model yields RMSE values of 0.2085 and 0.2953 on the training and test sets, respectively, a difference of 0.0868. In contrast, DiffusionMTL [55] produces RMSE values of 0.1690 and 0.3026 on the corresponding sets, a larger difference of 0.1336. For MAE, the proposed model achieves values of 0.1658 and 0.2296 on the training and test sets, respectively, a difference of 0.0638, whereas DiffusionMTL [55] records values of 0.1319 and 0.2296, a larger difference of 0.0977. In terms of R², MoMamba reaches values of 0.9079 and 0.8174 on the training and test sets, respectively, a difference of 0.0905, while DiffusionMTL [55] records values of 0.9395 and 0.8083, a difference of 0.1312. Across these three key metrics, the proposed model exhibits substantially smaller gaps between training and test performance, indicating stronger stability and generalization across data partitions.
In contrast, as shown in Figure 12f, the proposed model produces a tighter clustering of samples around the ideal line in the main scatterplot, indicating superior fitting performance compared to the other methods. Moreover, the marginal histograms show highly consistent value distributions between the training and test sets, suggesting minimal distributional shift and strong generalization capability. In the residual plot, prediction errors are primarily concentrated near zero, with the highest density of near-zero test residuals occurring in the 2.0–4.0 score range, indicating minimal error and the greatest stability in this interval.
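The interval-wise error claims above can be checked with a simple binned residual analysis: group residuals by true-score interval and summarize each bin. A minimal sketch, with bin edges and array names chosen purely for illustration:

```python
import numpy as np

def binned_residuals(y_true, y_pred, edges=(1.0, 2.0, 3.0, 4.0, 5.0)):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    residuals = y_pred - y_true
    stats = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_true >= lo) & (y_true < hi)
        if mask.any():
            # Mean residual (bias) and spread (stability) within this bin.
            stats[(lo, hi)] = (residuals[mask].mean(), residuals[mask].std())
    return stats
```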
In summary, combining the quantitative results and visual analyses of both classification and regression tasks, the proposed model demonstrates clear overall advantages in MTL scenarios.