Article

A Multi-Task Fusion Model Combining Mixture-of-Experts and Mamba for Facial Beauty Prediction

1 School of Electronics and Information Engineering, Wuyi University, Jiangmen 529020, China
2 School of Electronic Information and Control Engineering, Guangzhou University of Software, Guangzhou 510990, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(10), 1600; https://doi.org/10.3390/sym17101600
Submission received: 6 August 2025 / Revised: 9 September 2025 / Accepted: 15 September 2025 / Published: 26 September 2025

Abstract

Facial beauty prediction (FBP) is a cutting-edge task in deep learning that aims to equip machines with the ability to assess facial attractiveness in a human-like manner. In human perception, facial beauty is strongly associated with facial symmetry, where balanced structures often reflect aesthetic appeal. Leveraging symmetry provides an interpretable prior for FBP and offers geometric constraints that enhance feature learning. However, existing multi-task FBP models still face challenges such as limited annotated data, insufficient frequency–temporal modeling, and feature conflicts from task heterogeneity. The Mamba model excels in feature extraction and long-range dependency modeling but encounters difficulties in parameter sharing and computational efficiency in multi-task settings. In contrast, mixture-of-experts (MoE) enables adaptive expert selection, reducing redundancy while enhancing task specialization. This paper proposes MoMamba, a multi-task decoder combining Mamba’s state-space modeling with MoE’s dynamic routing to improve multi-scale feature fusion and adaptability. A detail enhancement module fuses high- and low-frequency components from discrete cosine transform with temporal features from Mamba, and a state-aware MoE module incorporates low-rank expert modeling and task-specific decoding. Experiments on SCUT-FBP and SCUT-FBP5500 demonstrate superior performance in both classification and regression, particularly in symmetry-related perception modeling.

1. Introduction

Symmetry, as a fundamental principle in aesthetics and visual perception, plays a crucial role in human evaluation of facial attractiveness. Facial beauty prediction (FBP), as a frontier topic in visual cognition within artificial intelligence, aims to simulate the complex mechanisms of human perception and aesthetic evaluation. It has become a prominent issue in both academic research and industrial applications. Early studies on FBP primarily relied on handcrafted features [1] and traditional machine learning algorithms [2], such as K-nearest neighbors [3] and Gaussian empirical models [4], for classification or scoring of facial attractiveness. However, these methods lack the ability to comprehensively capture complex aesthetic features and fall short in end-to-end learning capabilities, resulting in suboptimal accuracy, efficiency, and adaptability.
With the rapid development of deep learning, FBP has made significant progress, generally categorized into classification and regression tasks. Classification-based approaches [5,6,7,8] typically divide facial beauty into discrete levels, emphasizing coarse-grained modeling of aesthetic preferences. In contrast, regression-based methods [9,10,11,12] aim to predict continuous scores to capture subjective nuances in facial beauty perception. Despite notable advances in both directions, classification methods struggle with subtle differences in scores, while regression models often suffer from instability due to ambiguous and subjective evaluation criteria. Furthermore, the scarcity of annotated data limits the potential of single-task models. In response, recent studies have begun to explore multi-task learning (MTL) frameworks that jointly model classification and regression tasks to leverage their complementary advantages, thereby improving prediction performance and generalization.
Current mainstream MTL frameworks typically adopt an encoder–decoder architecture, where a shared encoder extracts general representations, followed by task-specific decoders for prediction. Some studies focus on optimizing the shared encoder [13,14,15], aiming to enhance task-invariant feature learning through supervised signals, while others emphasize decoder design [16,17,18] to capture task relationships. However, these methods often rely on static network structures, which are insufficiently flexible to handle task heterogeneity, leading to redundant representations and task interference that limit model performance.
Mixture-of-experts (MoE) models have been proposed to tackle these issues. MoE models employ a set of independent expert networks, each specialized for certain feature types, and a gating network to dynamically allocate computational resources based on input characteristics. This dynamic feature selection improves both representation quality and intertask conflict mitigation. Recent studies [19,20] have introduced MoE in the encoder to enable task-adaptive feature extraction. MoE-based encoders intelligently distribute resources across tasks and samples, enhancing shared representation while preserving task-specific traits. Its flexibility and scalability make MoE highly suitable for large-scale and complex scenarios. However, compared with its use in encoders, MoE-based decoder structures remain relatively underexplored. Recent work [21] first introduced MoE into the decoder design, enabling dynamic expert selection during task-specific decoding. Compared with static decoders, MoE-based decoders allow adaptive activation of expert networks based on input features, enabling more precise feature transformation and decision making. This approach not only improves adaptability but also facilitates cross-task information interaction, alleviating task conflicts and enhancing overall performance. Motivated by this, we explore the use of MoE in decoder architectures. Despite these advantages, both MoE-based encoders and decoders still face limitations in modeling complex feature relationships and task interactions. MoE-based encoders often rely on local feature selection, hindering the capture of global cross-task associations. Decoders are capable of task-specific transformation; however, they are often insufficient for exploiting complementary information across tasks.
In recent years, Mamba has emerged as an innovative architecture built on linear state-space models (SSMs) [22,23]. It shows strong capabilities in modeling long-range dependencies and supporting efficient information interaction. With higher computational efficiency than traditional architectures, Mamba has been applied in language modeling [24], medical image segmentation [25,26], object detection [27], and image classification [28,29,30]. However, most existing studies on Mamba have concentrated on single-task scenarios, leaving its potential in MTL largely unexplored. MTL requires models to share representations across tasks, but task heterogeneity may lead to conflicts, sometimes causing negative transfer, in which sharing knowledge harms rather than helps performance. Traditional Mamba designs focus mainly on global feature modeling and are not well optimized to handle such task differences. A key open question is therefore how to integrate Mamba's efficient feature modeling with task-specific transformations, enabling information exchange across tasks while avoiding negative transfer.
Facial beauty prediction (FBP) under a multi-task setting faces intrinsic challenges, such as task conflict, negative transfer, and difficulty in learning shared representations. While the MoE mechanism has been explored to mitigate task interference through dynamic expert routing, it often suffers from routing inefficiency and imbalanced expert utilization. Meanwhile, the recently proposed Mamba architecture excels at long-range dependency modeling with linear-time efficiency, yet its potential for dense prediction tasks like FBP remains underexplored. These limitations motivate us to design a unified framework that leverages the complementary strengths of MoE and Mamba for multi-task FBP.
In this paper, we propose MoMamba, a novel multi-task decoder architecture that integrates the strengths of the mixture-of-experts (MoE) mechanism and the Mamba architecture for facial beauty prediction. Unlike prior approaches that either focus solely on single-task regression or treat MoE and Mamba as independent components, MoMamba introduces two key innovations. First, the detail enhancement module (DEM) captures complementary frequency-domain and contextual features, enabling more comprehensive aesthetic representation. Second, the state-aware mixture-of-experts (SAME) module achieves fine-grained expert scheduling across tasks, effectively mitigating negative transfer in multi-task learning. Finally, by unifying classification and regression tasks within a single framework, MoMamba demonstrates strong generalization and efficiency on benchmark datasets. These contributions move beyond the incremental combination of existing methods and establish MoMamba as a genuinely novel architecture for multi-task facial beauty prediction. Extensive experiments on the SCUT-FBP [31] and SCUT-FBP5500 [32] datasets show that our proposed method outperforms state-of-the-art models in both classification and regression tasks, validating the effectiveness of MoMamba in MTL scenarios.
The main contributions of this work are as follows:
(1)
We propose MoMamba, a novel multi-task model for FBP that combines Mamba for long-range dependency modeling with MoE for dynamic feature selection, tackling the issue of limited labeled data.
(2)
We design a DEM that fuses high-frequency and low-frequency features extracted via DCT with the contextual modeling capabilities of the Mamba architecture, establishing a complementary relationship between the frequency and temporal domains and enhancing the representational capacity for MTL.
(3)
We develop the SAME module, incorporating state-driven feature modeling and dynamic task-specific routing to achieve more precise modeling of task heterogeneity and cross-task interactions.

2. Related Work

2.1. FBP and MTL

FBP has achieved remarkable progress under MTL frameworks, with numerous studies demonstrating strong performance in feature representation and predictive accuracy. Existing methods commonly introduce related tasks such as gender and ethnicity as auxiliary supervision signals to enhance the main task. By employing shared feature extractors and attention mechanisms, these approaches enable collaborative optimization across tasks, thereby improving prediction accuracy and model generalization. For instance, the study in [33] employs a multi-stream input structure combined with a pretrained convolutional neural network (CNN) to jointly model gender, ethnicity, and facial beauty, enhancing facial feature representation. The method in [34] incorporates an adaptive sharing strategy and an attention-based fusion module into a ResNet-18 backbone, improving label utilization across datasets and alleviating overfitting. In [35], a residual network is used to share features for modeling intertask relationships, enabling joint learning of facial beauty, gender, and ethnicity prediction.
Current multi-task models can generally be categorized into two architectural paradigms, namely, encoder-dominated and decoder-dominated frameworks. Encoder-dominated frameworks typically emphasize shared representation learning by using a common backbone to capture task-invariant features. In contrast, decoder-dominated frameworks incorporate task-specific modules in the decoding stage, enhancing task interaction and improving feature decoding accuracy. Among encoder-based methods, MQ Transformer [13] employs a task query mechanism to facilitate efficient information reasoning. The method in [14] introduces a context modulation module to promote cross-task feature interactions, while [15] proposes a Task-Interaction-Free network that reduces task coupling via multiscale feature modeling to enhance efficiency. For decoder-based approaches, the study in [16] designs an automatic task routing mechanism based on tree structures and Gumbel-Softmax to mitigate oversharing and negative transfer. In [17], the exploitation of interdecoder variation enhances change detection performance. The work in [18] integrates Mamba to model long-range dependencies within tasks while simultaneously improving intertask feature interaction. In addition to architectural innovations, advances in multitask loss design [36], attention mechanisms [37], and MoE models [19,20] have also been widely explored to boost MTL performance. Decoder-centric models offer the distinct advantage of fully leveraging the global features extracted by powerful vision backbones, such as Vision Transformer (ViT) [38] and Swin Transformer [39], while enabling fine-grained reconstruction and task-specific information extraction during decoding. Motivated by this, our work adopts a decoder-centric perspective and focuses on integrating MoE and Mamba techniques to dynamically generate task-specific representations in the decoding stage.
Furthermore, the DCT [40], a classical technique in frequency-domain processing, has been extensively used for image feature extraction, particularly in modeling high- and low-frequency components. In this study, DEM leverages DCT-derived high- and low-frequency features and integrates them with the contextual modeling capability of the Mamba architecture. This allows for complementary interactions between the frequency and temporal domains, further enhancing multi-task representation capabilities.

2.2. MoE Models

In recent years, MoE has been widely used to enhance model expressiveness and computational efficiency. By introducing multiple expert modules, each specializing in different aspects of the input data, MoE enables dynamic selection through a gating mechanism, which activates only a subset of experts per instance. The design enhances nonlinear modeling and leads to improved performance across tasks. MoE has also demonstrated significant potential in the field of computer vision. For example, V-MoE [41] integrates MoE modules into the ViT framework to improve image classification performance. Swin-MoE [42] incorporates expert mechanisms into a hierarchical convolutional structure to better adapt to diverse visual tasks. The pMoE model [43] further enhances spatial modeling by introducing position-aware capabilities into sparse activation strategies. MoE has gradually been extended to MTL, where it addresses the diversity and complexity across tasks. These approaches typically employ multiple expert networks, each learning task-discriminative features, while gating networks or routers dynamically select task-specific expert combinations via hard or soft routing. For instance, ADV-MoE [44] leverages attention-guided routing to improve expert–task matching. AdaMV-MoE [19] introduces an adaptive mechanism to adjust the number of active experts per task, mitigating overfitting on simple tasks and underfitting on complex ones. Moreover, the study in [20] integrates MoE into the ViT backbone with a focus on task-specific sparse activation during inference. By introducing a hardware-aware scheduling strategy to reduce latency and switching overhead on edge devices, this work demonstrates the deployment potential of MoE in real-world multitask applications. While these MoE-based MTL models [19,20] primarily focus on encoder-side integration to enhance feature extraction flexibility and expressiveness, recent work, such as [21], pioneers the introduction of MoE into decoder structures. This approach enables the dynamic generation of more discriminative task-specific representations by introducing multiple shared subspaces along with task-aware gating in the decoding phase.

2.3. Mamba Models

SSMs [22,23] are mathematical frameworks that model the relationship between input and output in dynamic systems through hidden states. In recent years, SSMs have gained increasing popularity in sequence modeling tasks [23,45]. Mamba, a structured SSM-based architecture, integrates input selection mechanisms with dynamic state updates, allowing for effective modeling of salient information across sequences. Its ability to capture long-range dependencies with linear computational complexity has led to the development of efficient Mamba-augmented architectures for various computer vision tasks, such as H-vmunet [25], MambaBTS [26], Voxel Mamba [27], LE-Mamba [28], and DualMamba [29].
For example, H-vmunet [25] combines a high-order 2D scanning module and a local SS2D module to overcome the limitations of traditional CNNs and ViT in medical image segmentation, especially in capturing long-range dependencies and local details. MambaBTS [26] enhances the U-Net architecture by combining Mamba-based state-space modeling with cascaded residual multiscale convolution, improving segmentation of tumors with diverse sizes and shapes. Voxel Mamba [27] proposes a group-free approach that serializes 3D voxel space into a 1D sequence, using the linear complexity of SSMs to reduce spatial adjacency loss. Additionally, a dual-scale SSM block is proposed to increase the receptive field and retain local spatial coherence. LE-Mamba [28] is an architecture that adapts the original Mamba model from 1D sequence tasks to noncausal hyperspectral image classification. DualMamba [29] presents a lightweight model that dynamically integrates global features from Mamba and local features from convolutions via an adaptive global–local fusion strategy. This helps alleviate the computational burden and performance bottlenecks in extracting spectral–spatial features faced by CNN-based and Transformer-based models. MorpMamba [30] combines morphological operations with the SSM framework using a label enhancement module to dynamically adjust spatial and spectral features for improved fusion, with the aim of overcoming the inefficiency of Transformers in hyperspectral image classification.
While these Mamba-based models have demonstrated strong performance in single-task scenarios, the application of state-space modeling in MTL remains in an early stage. MTMamba [18] is among the first attempts to integrate Mamba into multi-task dense prediction. It introduces a cross-task modeling module to enhance explicit task interactions and validates its effectiveness across various downstream tasks.
In summary, despite their effectiveness in single-task visual modeling, Mamba models face key challenges in MTL applications, particularly in dynamic expert activation, information sharing between tasks, and task decoupling. To address this, we propose SAME by incorporating Mamba-based global modeling into the MoE framework. By embedding state-space mechanisms into the expert activation and fusion processes, SAME mitigates the selection bias inherent in traditional MoE structures and significantly improves generalization in multi-task representation learning.

3. Methods

3.1. Overall Architecture

In this work, we propose MoMamba, a decoder-dominated MTL framework tailored for FBP, which simultaneously optimizes both classification and regression tasks. The model enables efficient feature encoding and decoding by integrating frequency and spatial domain cues, global and local features, and shared and task-specific representations. As illustrated in Figure 1, MoMamba is built upon a ViT encoder and introduces two novel modules: the DEM and the SAME module. The SAME module serves as the central decoding component, playing a pivotal role in both task-specific output generation and feature disentanglement through state modeling and dynamic expert selection mechanisms.
First, the input RGB image $I \in \mathbb{R}^{H \times W \times 3}$ is divided into nonoverlapping patches of size $P \times P$, which are subsequently processed by the ViT encoder. Let the output of the $l$-th layer be denoted as $F^{(l)} \in \mathbb{R}^{N_P \times D}$, where $H$ and $W$ denote the height and width of the input image, respectively; $N_P = \frac{HW}{P^2}$ represents the total number of patches; and $D$ is the feature dimension. To enhance the capability of multiscale semantic modeling, the model dynamically extracts a set of intermediate features from multiple layers of the encoder, denoted as $\{F^{(l)}\}_{l \in L}$, where $L$ represents the set of selected layer indices, $L = \{3, 6, 9, 12\}$. In this work, we adopt ViT-Base as the backbone encoder to provide hierarchical representations from different depths, which respectively capture local textures, mid-level semantics, and high-level structural information, thereby facilitating task-specific feature fusion and modeling in subsequent stages. Second, the final-layer output of the encoder is fed into the proposed DEM. This module reconstructs the 2D spatial layout of features and performs frequency-domain decomposition using the DCT. It separates high-frequency components, which characterize fine-grained facial textures, from low-frequency components, which capture coarse contours and structural information. Additionally, the DEM incorporates a bidirectional Mamba-based temporal modeling branch, which employs state-space mechanisms to capture contextual dependencies among patch sequences. The frequency-aware and Mamba-enhanced spatial features are then fused to produce refined representations, which serve as informative inputs for the decoder. Third, in the decoding stage, the enhanced features are forwarded to the proposed SAME module. SAME comprises multiple low-rank expert modules and employs a state-aware dynamic routing strategy to achieve task-specific decomposition and reconstruction of the shared representations. Finally, the task-specific disentangled features are passed to separate output heads for classification and regression, enabling both label prediction and continuous scoring of facial attractiveness.
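To make the data flow above concrete, the sketch below illustrates the patch partitioning and the collection of intermediate features from layers {3, 6, 9, 12} in PyTorch. The encoder is a generic stand-in built from nn.TransformerEncoderLayer rather than the pretrained ViT-Base backbone used in our experiments, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping P x P patches and project them to dimension D."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N_P, D)

class SimpleEncoder(nn.Module):
    """Stand-in for the ViT-Base backbone; collects features from selected layers."""
    def __init__(self, dim=768, depth=12, heads=12, selected=(3, 6, 9, 12)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(depth)])
        self.selected = set(selected)

    def forward(self, tokens):                 # tokens: (B, N_P, D)
        feats = {}
        for i, blk in enumerate(self.blocks, start=1):
            tokens = blk(tokens)
            if i in self.selected:             # keep multi-scale features F^(l)
                feats[i] = tokens
        return feats                           # the deepest feature feeds the DEM

img = torch.randn(2, 3, 224, 224)
feats = SimpleEncoder()(PatchEmbed()(img))
print({k: v.shape for k, v in feats.items()})  # each entry: torch.Size([2, 196, 768])
```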

3.2. Detail Enhancement Module

In multi-task FBP, high-quality feature modeling relies not only on global structural representations but also on precise perception of local details. Although backbone networks such as ViT possess strong spatial modeling capabilities, their quadratic computational complexity and limited ability to capture frequency-domain information hinder the effective extraction of fine-grained features from high-resolution images. To address this issue, we propose a DEM, which integrates frequency-domain decomposition with state-space modeling to jointly enhance local detail perception and long-range dependency modeling, as illustrated in Figure 2.
This module takes the output feature from the $L$-th layer of the ViT encoder $F^{(L)} \in \mathbb{R}^{N_P \times D}$ as the input. After reshaping, it yields a spatially structured feature map $F \in \mathbb{R}^{C \times H \times W}$. First, the DEM employs the 2D DCT [40] to map the input feature map $F$ from the spatial domain to the frequency domain, separating structurally dominant low-frequency components from more discriminative high-frequency details. To achieve controllable decoupling in the frequency domain, a gating mask mechanism based on a control parameter $\alpha$ is introduced, which adjusts the frequency separation boundary to adapt to different spectral characteristics of the images. The mask design follows two key principles: natural images predominantly concentrate energy in the low-frequency region [46], which indicates that preserving most low-frequency components helps prevent structural information loss, while high-frequency components carry stronger discriminative details in facial images [47]. Accordingly, with $\alpha = 0.05$, a high-frequency mask $M_{\text{high}}$ is constructed to retain only the bottom-right 5% high-frequency region, while a low-frequency mask $M_{\text{low}}$ is defined with $\alpha = 0.95$, retaining the top-left 95% low-frequency region. Let $T_{DCT}$ and $T_{DCT}^{-1}$ denote the DCT and inverse DCT (IDCT) transforms, respectively. Subsequently, $F_{\text{high}}$ and $F_{\text{low}}$ are concatenated along the channel dimension to form a multifrequency feature representation, preserving both local details and global structures simultaneously. The overall procedure can be formulated as follows:
$$F_{\text{high}} = T_{DCT}^{-1}\left(M_{\text{high}} \odot T_{DCT}(F)\right), \quad F_{\text{low}} = T_{DCT}^{-1}\left(M_{\text{low}} \odot T_{DCT}(F)\right)$$
$$F_{\text{freq}} = \text{Concat}\left(F_{\text{high}}, F_{\text{low}}\right)$$
where $\odot$ denotes the element-wise multiplication operation.
Second, the DEM introduces a bidirectional structure, Bi-Mamba, based on SSMs to enhance the long-range dependency modeling capability in the spatial domain. This module operates on the input feature map $F \in \mathbb{R}^{C \times H \times W}$, first expanding the channel dimension through a $3 \times 3$ convolution to obtain $F' \in \mathbb{R}^{2C \times H \times W}$ and then flattening it into sequential form $F_{\text{seq}} \in \mathbb{R}^{L \times 2C}$, where $L = H \times W$. The sequence is divided into $N$ consecutive subblocks $\{F_{\text{seq}}^{i}\}_{i=1}^{N}$, which are fed into forward and backward Mamba modules. After concatenating the bidirectional outputs, the spatial structure is reconstructed to form state-enhanced features $F_{\text{Mamba}}$. The overall formulation is expressed as follows:
$$F_{\text{forward}}^{i} = \overrightarrow{\text{Mamba}}\left(F_{\text{seq}}^{i}\right)$$
$$F_{\text{backward}}^{i} = \overleftarrow{\text{Mamba}}\left(F_{\text{seq}}^{i}\right)$$
$$y_{i}^{\text{fused}} = F_{\text{forward}}^{i} + F_{\text{backward}}^{i}, \quad i = 1, 2, \ldots, N$$
$$F_{\text{Mamba}} = \text{Reshape}\left(\text{Concat}\left(y_{1}^{\text{fused}}, \ldots, y_{N}^{\text{fused}}\right)\right)$$
where $\text{Concat}(\cdot)$ denotes tensor concatenation, $\text{Reshape}(\cdot)$ represents tensor shape transformation, and $y_{i}^{\text{fused}}$ is the enhanced representation of the $i$-th subblock after fusing the forward and backward Mamba outputs.
Finally, the frequency-domain enhanced features $F_{\text{freq}}$ and the state modeling features $F_{\text{Mamba}}$ are fused along the channel dimension. Two convolutional modules are then stacked on the fused feature $F_{\text{fused}}$, where each module consists of convolution, batch normalization, and a rectified linear unit activation function, denoted as $\text{ConvBNReLU}$, to enhance local contextual awareness and restore the original channel dimension $C$. The final output $F_{\text{DEM}}$ serves as the result of the DEM, which is fed to the downstream multi-task decoder.
$$F_{\text{fused}} = \text{Concat}\left(F_{\text{freq}}, F_{\text{Mamba}}\right)$$
$$F_{\text{DEM}} = \text{ConvBNReLU}^{2}\left(F_{\text{fused}}\right)$$
where $F_{\text{fused}} \in \mathbb{R}^{4C \times H \times W}$ and $F_{\text{DEM}} \in \mathbb{R}^{C \times H \times W}$.
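The frequency branch of the DEM can be sketched as follows, under two stated assumptions: the 2D DCT is realized as an orthonormal DCT-II matrix product, and the "top-left 95%"/"bottom-right 5%" regions are interpreted as rectangular corner masks. The Bi-Mamba branch and the final ConvBNReLU fusion are omitted for brevity.

```python
import math
import torch

def dct_matrix(n):
    # Orthonormal DCT-II basis: T @ x applies a 1D DCT along the row dimension.
    k = torch.arange(n, dtype=torch.float32)
    basis = torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)

def dct2(x, T):   return T @ x @ T.t()      # 2D DCT over the last two dims
def idct2(x, T):  return T.t() @ x @ T      # inverse transform (T is orthonormal)

def corner_mask(h, w, frac, corner):
    """Binary mask keeping a (frac*h) x (frac*w) rectangle in the given corner."""
    mask = torch.zeros(h, w)
    rh, rw = max(1, round(frac * h)), max(1, round(frac * w))
    if corner == "top_left":
        mask[:rh, :rw] = 1.0
    else:                                    # "bottom_right"
        mask[-rh:, -rw:] = 1.0
    return mask

def frequency_split(F, alpha=0.05):
    """F: (C, H, W) feature map -> (F_high, F_low) via DCT-domain masking."""
    C, H, W = F.shape
    T = dct_matrix(H)                        # assumes a square feature map (H == W)
    spec = dct2(F, T)
    F_high = idct2(spec * corner_mask(H, W, alpha, "bottom_right"), T)
    F_low = idct2(spec * corner_mask(H, W, 1.0 - alpha, "top_left"), T)
    return F_high, F_low

F = torch.randn(768, 14, 14)
F_high, F_low = frequency_split(F)
F_freq = torch.cat([F_high, F_low], dim=0)   # channel-wise concatenation
print(F_freq.shape)                          # torch.Size([1536, 14, 14])
```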

3.3. SAME Module

To simultaneously model global dynamic context and local task-relevant features, we propose the SAME module, which combines state-space duality with MoE to enhance modeling capability while maintaining parameter efficiency, as illustrated in Figure 3.

3.3.1. State-Space Modeling Path

First, the 2D feature $F \in \mathbb{R}^{C \times H \times W}$ is flattened into sequential form $F'$. We design a shared parameter generation path to generate the dynamic parameters required for state-space modeling. The input 1D feature $F'$ undergoes channel projection through 1D convolution, followed by spatial modeling via depth-wise separable convolution. Finally, the output is divided along the channel dimension into three feature groups $B$, $C$, and $\Delta$, which can be expressed as follows:
$$F' = \text{Flatten}(F) \in \mathbb{R}^{L \times C}, \quad L = H \times W$$
$$B, C, \Delta = \text{Split}\left(\text{DWConv}\left(\text{Conv1D}\left(F'\right)\right), \ \text{dim} = 1\right)$$
where $L$ denotes the sequence length, $B$ is used for state interaction modeling, $C$ is used for output mapping, and $\Delta$ represents the input-dependent state offset.
Second, to enable adaptive modeling of state evolution directions, we introduce a learnable state bias base vector $\theta_A \in \mathbb{R}^{N}$, which serves as a prior in the state space. Specifically, the final state transition matrix is computed as follows:
$$A = \text{Softmax}\left(\Delta + \theta_A\right)$$
where $\Delta$ is a dynamic offset that is strongly correlated with the input. After weighting, normalization is performed along the state dimension through the Softmax operation to obtain the final state transition matrix $A \in \mathbb{R}^{N \times L}$, so that each sample possesses adaptive state transition capabilities. In contrast, classical SSMs such as Mamba [48] adopt implicit recursive computation ($h_t = f(h_{t-1}, x_t)$), in which state propagation depends on the output at the previous position; this makes parallelization along the sequence dimension difficult and renders them susceptible to sequence length effects in image tasks. The proposed state interaction is instead expressed as follows:
$$h = F' \odot \left(A B^{T}\right) \in \mathbb{R}^{C \times N}$$
where $\odot$ represents the element-wise multiplication operation. For each spatial position $l \in [1, L]$, a state interaction tensor $A_{:,:,l} B_{:,:,l}^{T} \in \mathbb{R}^{N \times N}$ is constructed to achieve information fusion between different state channels. The interaction results are element-wise multiplied with the input features along the channel dimension, enhancing state awareness while preserving spatial resolution.
Finally, the results $h$ are fused with the output mapping channels $C$ to compute state-aware output features, which are then reshaped into the 2D feature form $F_{\text{HSSD}}$, i.e.,
$$y = h \otimes C \in \mathbb{R}^{C \times L}$$
$$F_{\text{HSSD}} = \text{Reshape}(y) \in \mathbb{R}^{C \times H \times W}, \quad H = W = \sqrt{L}$$
where $\otimes$ represents matrix multiplication. This approach not only preserves the spatial structure of images but also enhances the expressive capability of state awareness and task adaptability. Figure 4 shows the structure of HSSD.
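As a rough illustration of the shared parameter-generation path above (the $B$, $C$, $\Delta$ split and $A = \text{Softmax}(\Delta + \theta_A)$), the sketch below assumes a number of state channels $N$ as a hyperparameter and uses a simplified convolution stack; the subsequent state-interaction and output-mapping steps are not reproduced here.

```python
import torch
import torch.nn as nn

class ParamPath(nn.Module):
    def __init__(self, channels, n_states=16):
        super().__init__()
        hidden = 3 * n_states
        self.proj = nn.Conv1d(channels, hidden, kernel_size=1)       # channel projection
        self.dw = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                           # depth-wise spatial modeling
        self.theta_A = nn.Parameter(torch.zeros(n_states, 1))        # learnable state bias prior

    def forward(self, F):                # F: (B, C, H, W)
        seq = F.flatten(2)               # (B, C, L) with L = H * W
        out = self.dw(self.proj(seq))    # (B, 3N, L)
        B_, C_, Delta = out.chunk(3, dim=1)                          # each (B, N, L)
        A = torch.softmax(Delta + self.theta_A, dim=1)               # normalize over the state dim
        return A, B_, C_

A, B_, C_ = ParamPath(channels=768)(torch.randn(2, 768, 14, 14))
print(A.shape, B_.shape, C_.shape)       # each torch.Size([2, 16, 196])
```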

3.3.2. MoE Modeling Path

The MoE modeling consists of three main components, i.e., a base pathway, a shared expert pathway, and a task-specific expert pathway. This design enables efficient, dynamic, and task-adaptive feature representation.
Initially, the base pathway stage employs standard 3 × 3 convolution operations to effectively extract shared low-level features across multiple tasks, with the output form as:
$$F_{\text{base}} = \text{Conv}_{3 \times 3}(F)$$
where $F \in \mathbb{R}^{C \times H \times W}$ represents the input features and $F_{\text{base}}$ denotes the output generic representation features. The design objective of this pathway is to provide unified features for all tasks, thereby offering stable and compatible foundational feature support for subsequent multi-task modeling.
Subsequently, in the shared expert pathway stage, to construct a shared expert ensemble with diversity and expressive capability, inspired by LoRA [49], this paper introduces $n$ structurally consistent yet parameter-independent low-rank expert modules $\text{LoRA}_i$, $i = 1, 2, \ldots, n$, where each expert achieves efficient feature modeling based on a convolution structure that first reduces and then restores the rank. Assume the $i$-th expert consists of a $3 \times 3$ down-ranking convolution matrix $W_{\text{down}}^{i} \in \mathbb{R}^{3 \times 3 \times C \times r_i}$ and a $1 \times 1$ up-ranking convolution matrix $W_{\text{up}}^{i} \in \mathbb{R}^{1 \times 1 \times r_i \times C}$, where $C$ represents the number of input or output channels and $r_i$ denotes the rank value of the $i$-th expert. This structure can be represented as:
$$F_i = \text{LoRA}_i(F) = W_{\text{up}}^{i} * \left(W_{\text{down}}^{i} * F\right)$$
where $*$ represents the 2D convolution operation and $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map.
While expert structures are shared among tasks, independent parameterization is adopted to improve the perception of distinct feature patterns. This paper equips each task with a task-specific routing network to dynamically schedule these expert outputs. This network generates an expert weight vector $g^{t} = [g_1^{t}, \ldots, g_n^{t}] \in \mathbb{R}^{n}$ based on the input features $F^{t}$, reflecting the applicability and contribution of different experts to the task. To improve the precision and stability of routing selection, the routing network employs a "spatial-channel joint attention" mechanism, combining spatial attention [50] and lightweight channel attention [51] for feature enhancement. The generated weight vector $g^{t}$ is subsequently processed through a Top-K strategy, retaining only the Top-K scoring experts, denoted as $S^{t} \subseteq \{1, 2, \ldots, n\}$ with $|S^{t}| = K$, while the weights of the remaining experts are set to zero, effectively suppressing noise interference and redundant computation. The final task-specific output $F_{\text{MoE}}^{t}$ is obtained through weighted fusion of the selected expert features, i.e.,
$$F_{\text{MoE}}^{t} = \text{BN}\left(\sum_{i \in S^{t}} g_i^{t} F_i\right)$$
where $t$ indexes the task and $\text{BN}(\cdot)$ denotes the batch normalization operation, which is used to balance multi-expert feature distributions and improve training convergence and task discriminability.
Finally, in the task-specific expert pathway stage, although shared experts can adapt to multiple tasks, certain tasks still possess unique feature patterns and representation requirements. Therefore, this paper constructs a task-specific low-rank expert pathway for each task individually to capture task-specific attributes. The parameters of this pathway are updated only for the corresponding task and are completely isolated from other tasks. The features from the three pathways are finally fused together and represented as follows:
$$\tilde{F}^{t} = \text{LoRA}^{t}(F) = W_{\text{up}}^{t} * \left(W_{\text{down}}^{t} * F\right)$$
$$F^{t} = F_{\text{base}} + F_{\text{MoE}}^{t} + \tilde{F}^{t}$$
where $W_{\text{down}}^{t} \in \mathbb{R}^{3 \times 3 \times C \times r_t}$, $W_{\text{up}}^{t} \in \mathbb{R}^{1 \times 1 \times r_t \times C}$, $r_t$ represents the rank value of the task-specific expert, and $F^{t}$ denotes the features produced by the MoE modeling path under task $t$.
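A minimal sketch of the shared expert pathway with Top-K routing is given below. The low-rank experts follow the down-ranking/up-ranking convolution structure described above, while the spatial-channel joint attention router is replaced by a simple global-average-pooling gate purely for brevity; the base and task-specific pathways are omitted.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    def __init__(self, channels, rank):
        super().__init__()
        self.down = nn.Conv2d(channels, rank, kernel_size=3, padding=1)  # down-ranking conv
        self.up = nn.Conv2d(rank, channels, kernel_size=1)               # up-ranking conv

    def forward(self, x):
        return self.up(self.down(x))

class SharedExpertPath(nn.Module):
    def __init__(self, channels, ranks=(16, 24, 32, 40, 48, 56, 64, 72, 80), top_k=4):
        super().__init__()
        self.experts = nn.ModuleList([LoRAExpert(channels, r) for r in ranks])
        self.gate = nn.Linear(channels, len(ranks))   # simplified task-specific router
        self.bn = nn.BatchNorm2d(channels)
        self.top_k = top_k

    def forward(self, x):                             # x: (B, C, H, W)
        logits = self.gate(x.mean(dim=(2, 3)))        # pooled descriptor -> expert scores
        weights = torch.softmax(logits, dim=-1)
        topv, topi = weights.topk(self.top_k, dim=-1) # keep the Top-K experts, zero the rest
        mask = torch.zeros_like(weights).scatter(-1, topi, topv)
        out = sum(mask[:, i, None, None, None] * expert(x)
                  for i, expert in enumerate(self.experts))
        return self.bn(out)                           # BN balances multi-expert statistics

y = SharedExpertPath(channels=768)(torch.randn(2, 768, 14, 14))
print(y.shape)                                        # torch.Size([2, 768, 14, 14])
```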

3.3.3. Cross-Attention Mechanism

To achieve information fusion between the state-space modeling and MoE modeling features, we introduce multihead cross-attention, as shown in Figure 5. This mechanism receives two input features $F_{\text{HSSD}}$ and $F^{t}$, both with dimensions $\mathbb{R}^{C \times H \times W}$, representing the guiding feature and the guided feature, respectively, and dynamically fuses contextual information from $F_{\text{HSSD}}$ into $F^{t}$ through the attention mechanism. The input features are first flattened into sequence representations to accommodate attention computation, and then three linear transformations are applied to $F_{\text{HSSD}}$ and $F^{t}$ to generate query, key, and value vectors. Subsequently, to enhance expressive capability, the module adopts a multihead attention structure with total projected dimension $d$, dividing each vector into $h$ subspaces, each with dimension $d_h = d / h$. Scaled dot-product attention is performed independently on each head. Finally, all head outputs are concatenated and passed through a linear projection to obtain the fused features, which are then processed through dropout and reshape operations to restore the spatial structure, completing the task-specific representation enhancement $F_{\text{final}}$. The specific operations are as follows:
$$F'_{\text{HSSD}} = \text{Flatten}\left(F_{\text{HSSD}}\right) \in \mathbb{R}^{L \times C}, \quad F'^{t} = \text{Flatten}\left(F^{t}\right) \in \mathbb{R}^{L \times C}, \quad L = H \times W$$
$$Q = F'_{\text{HSSD}} W_Q, \quad K = F'^{t} W_K, \quad V = F'^{t} W_V, \quad W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$$
$$Q_i = Q_{:, \, (i-1) d_h : i d_h}, \quad i = 1, 2, \ldots, h$$
$$\text{Attention}_i = \text{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_h}}\right) V_i$$
$$F_{\text{atten}} = \text{Concat}\left(\text{Attention}_1, \ldots, \text{Attention}_h\right) W_o$$
$$F_{\text{final}} = \text{Reshape}\left(\text{Dropout}\left(F_{\text{atten}}\right)\right)$$
where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are learnable linear projection matrices; $\text{Flatten}(\cdot)$ flattens spatial features into sequences; $Q, K, V \in \mathbb{R}^{L \times d}$ are the projected query, key, and value matrices; and $W_o \in \mathbb{R}^{d \times d}$ is the output projection matrix.
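The fusion step can be sketched as follows, with queries taken from $F_{\text{HSSD}}$ and keys/values from $F^{t}$ as in the equations above. PyTorch's nn.MultiheadAttention is used here as a stand-in for the per-head projections, and the head count and dropout rate are assumed values.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, channels, heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, dropout=dropout,
                                          batch_first=True)
        self.drop = nn.Dropout(dropout)

    def forward(self, f_hssd, f_task):            # both: (B, C, H, W)
        b, c, h, w = f_hssd.shape
        q = f_hssd.flatten(2).transpose(1, 2)     # (B, L, C) queries from the guiding feature
        kv = f_task.flatten(2).transpose(1, 2)    # (B, L, C) keys/values from the guided feature
        fused, _ = self.attn(q, kv, kv)           # multi-head scaled dot-product attention
        fused = self.drop(fused).transpose(1, 2).reshape(b, c, h, w)
        return fused                              # restored spatial layout (B, C, H, W)

out = CrossAttentionFusion(channels=768)(torch.randn(2, 768, 14, 14),
                                         torch.randn(2, 768, 14, 14))
print(out.shape)                                  # torch.Size([2, 768, 14, 14])
```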

4. Experimental Results and Analysis

4.1. Experimental Setup

SCUT-FBP [31] is a public dataset for FBP released by South China University of Technology and provides high-quality facial images with corresponding classification and regression labels for facial beauty prediction tasks. The images display simple backgrounds and are all captured under natural lighting, ensuring the clarity of facial features. The dataset contains individuals of different ages, face shapes, and expressions, helping improve the ability of the model to learn diverse features. In terms of classification labels, the dataset is divided into five categories, i.e., "0," "1," "2," "3," and "4," where higher label numbers represent greater attractiveness. Specifically, there are 24, 240, 177, 30, and 29 images with labels "0," "1," "2," "3," and "4," respectively, totaling 500 images, with their specific distribution shown in Figure 6. The regression labels are distributed in the range [1.0, 5.0], where higher scores indicate that the facial image has stronger attractiveness, with the specific distribution shown in Figure 7. The dataset is divided into training and test sets at an 8:2 ratio, which means that the training and test sets contain 400 and 100 images, respectively.
SCUT-FBP5500 [32] is another public dataset for facial beauty prediction established by South China University of Technology. It contains 5500 color frontal face images with a resolution of 350 × 350, covering different ethnicities including Asian and Caucasian, genders, and age ranges from 15 to 60 years old. Each image was scored by 60 volunteers, and every image is equipped with both classification and regression labels. The classification labels are divided into five categories: “0”, extremely unattractive; “1”, unattractive; “2”, average; “3”, attractive; and “4”, extremely attractive. The regression label scores range from 1.0 to 5.0, where higher scores indicate greater attractiveness. Based on ethnicity and gender, the dataset is divided into four equal subsets, specifically including 2000 Asian female images, 2000 Asian male images, 750 Caucasian female images, and 750 Caucasian male images. Example images and corresponding labels from the subsets are shown in Figure 8. The dataset is divided into training and test sets at an 8:2 ratio, which means that the training and test sets contain 4400 and 1100 images, respectively.

4.2. Experimental Environment and Evaluation Metrics

All experiments in this paper are conducted on a computer equipped with an NVIDIA RTX 3090 Ti GPU and 64 GB of memory running the Ubuntu 22.04 operating system. The main software environments include Python 3.10, PyTorch 2.1.0, and CUDA 11.8. The experiments employ the AdamW optimizer for parameter updates, with an initial learning rate of 0.0001 and a weight decay coefficient of 0.00001. The learning rate adjustment strategy adopts cosine annealing scheduling, with a maximum decay period of 100 and a minimum learning rate of 0. The total number of training epochs is 100, with a batch size of 16. Unless otherwise specified, all headline results reported in the summary tables are obtained with nine experts, Top-K = 4, and LoRA ranks configured using the values {16, 24, 32, 40, 48, 56, 64, 72, 80}. This paper employs multiple evaluation metrics for classification and regression tasks to comprehensively assess the model performance on FBP. For classification tasks, accuracy (ACC), F1 score, and average precision (AP) are adopted as performance measures, with specific formulas as follows:
$$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
$$\text{AP} = \sum_{i=1}^{N} \left(\text{Recall}_i - \text{Recall}_{i-1}\right) \text{Precision}_i$$
where $TP$, $TN$, $FP$, and $FN$ represent the numbers of true positives, true negatives, false positives, and false negatives, respectively; $N$ denotes the number of classification threshold points; and $\text{Precision}_i$ and $\text{Recall}_i$ represent the precision and recall at the $i$-th threshold, respectively.
For regression tasks, Pearson correlation (PC), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R2) are used as evaluation metrics. Among these, lower MAE and RMSE values indicate better model performance, while higher values are better for all other metrics. The specific formulas are as follows:
$$\text{PC} = \frac{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2} \sqrt{\sum_{i=1}^{N} \left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$$
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ and $\bar{\hat{y}}$ are their respective means, and $N$ represents the total number of samples.
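For reference, the regression metrics defined above can be computed with a few lines of NumPy; this is only a convenience sketch for verifying predictions and is not part of the proposed model.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute PC, MAE, RMSE, and R^2 for a set of predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    pc = np.corrcoef(y_true, y_pred)[0, 1]                      # Pearson correlation
    mae = np.mean(np.abs(y_true - y_pred))                      # mean absolute error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))             # root mean squared error
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"PC": pc, "MAE": mae, "RMSE": rmse, "R2": r2}

print(regression_metrics([2.1, 3.4, 4.0, 1.8], [2.3, 3.1, 3.9, 2.0]))
```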

4.3. Comparative Experiments

This paper selects five representative multi-task modeling methods as comparative baselines to comprehensively validate the effectiveness of the proposed method. This paper combines quantitative metrics and visual analysis to provide a systematic demonstration of model effectiveness in multi-task scenarios.

4.3.1. Comparative Experiments on the SCUT-FBP Dataset

Table 1 presents the experimental results of various models evaluated on the SCUT-FBP dataset. Swin [39] employs a hierarchical architecture with a shifted window attention mechanism, enabling strong multiscale representation learning and delivering robust performance across diverse vision tasks. MTLoRA [52] introduces task-agnostic and task-specific low-rank adaptation modules to facilitate parameter-efficient multi-task fine-tuning, significantly reducing training overhead. InvPT [53], as the first Transformer-based framework for multi-task dense prediction, adopts an inverted pyramid architecture to enhance high-resolution cross-task feature interaction. TaskPrompter [54] proposes a prompt learning-based multi-task Transformer that integrates spatial channel interaction to jointly model task-shared and task-specific representations, achieving end-to-end MTL. DiffusionMTL [55] formulates the multi-task partial annotation problem as a pixel-wise denoising task, leveraging diffusion processes and task-conditioned modeling to improve prediction quality under weak supervision. For the classification task, our proposed model, i.e., MoMamba, outperforms all baselines across multiple evaluation metrics. It achieves an ACC of 68%, outperforming InvPT [53] at 66% and TaskPrompter [54] at 64%. Furthermore, MoMamba obtains an F1 score of 65.27% and an AP of 69.88%, significantly surpassing those of TaskPrompter [54] at 60.31% and 60.11%, respectively. For the regression task, our model achieves an R2 of 0.7473, slightly surpassing the 0.7466 achieved by TaskPrompter [54]. This corresponds to an improvement of 0.09%, indicating enhanced fidelity in capturing the underlying target distribution. MoMamba obtained an MAE of 0.2912, marginally higher than the 0.2891 reported by TaskPrompter [54], a difference of 0.0021. This reflects a deliberate trade-off favoring overall prediction robustness and task-level balance, allowing for slight increases in individual error to achieve better generalization. Notably, the RMSE of MoMamba is 0.3646, which is lower than the 0.3650 reported by TaskPrompter [54], further evidencing improved regression stability. Additionally, its PC is 0.8694, compared with 0.8657 reported by TaskPrompter [54], demonstrating superior alignment with ground truth in regression outputs. Overall, across both classification and regression tasks, our model consistently delivers more robust and generalizable performance, highlighting its effectiveness in MTL scenarios.
To further analyze the classification performance of each model, we visualize their confusion matrices on the test set, as shown in Figure 9. Specifically, subfigures (a)–(f) correspond to the confusion matrices of Swin [39], MTLoRA [52], InvPT [53], TaskPrompter [54], DiffusionMTL [55], and MoMamba. Each confusion matrix illustrates the prediction distribution across categories, where the horizontal and vertical axes represent the predicted and ground-truth labels, respectively. The values along the main diagonal indicate the number of correctly classified samples for each class, where higher values correspond to better recognition accuracy for that class. Conversely, off-diagonal values reflect the degree of misclassification between classes. As shown in Table 1, InvPT achieves the second-highest ACC; however, as observed from Figure 9c, it fails to correctly classify any samples in class “3”. In contrast, as shown in Figure 9f, our proposed model successfully classifies 50% of the samples in this category. This substantial difference in class-wise recognition performance directly contributes to the overall accuracy gap and serves as a key reason why InvPT [53] underperforms compared to our model in classification accuracy.
Figure 10 illustrates the regression performance of each model, where subfigures (a)–(f) correspond to Swin [39], MTLoRA [52], InvPT [53], TaskPrompter [54], DiffusionMTL [55], and MoMamba. Each subfigure displays predicted versus true values, along with marginal histograms and residual error distributions. Blue and orange points represent training and test samples, respectively.
Figure 10a,b show pronounced deviations, especially at the low and high score ranges. Although Figure 10c–e demonstrate moderate improvements, the predictions remain more scattered, with larger residual errors observed between 2.0 and 3.5. In contrast, Figure 10f illustrates that our model achieves a tight clustering of predictions along the ideal line, and the marginal histograms of the training and test sets are highly consistent, indicating strong generalization capability. Residuals are concentrated near zero, particularly within the range of 3.5–4.0, where prediction errors are minimal. Our proposed model achieves RMSE values of 0.1462 and 0.3646 on the training and test sets, respectively, resulting in a difference of 0.2184. By comparison, TaskPrompter [54] records RMSE values of 0.1152 and 0.3726 on the training and test sets, respectively, with a larger difference of 0.2574. This result demonstrates that our model achieves better generalization and is less prone to overfitting.
In summary, our model achieves superior performance in both classification and regression tasks, with higher accuracy, stronger regression metrics, and better generalization across datasets, confirming its effectiveness in MTL scenarios.

4.3.2. Comparative Experiments on the SCUT-FBP5500 Dataset

Table 2 summarizes the experimental results of all models on the SCUT-FBP5500 dataset. Compared with mainstream MTL approaches, the proposed model achieves consistently superior performance across all evaluation metrics. In the classification task, the model achieves an ACC of 78.36%, representing an improvement of 4.00 percentage points over DiffusionMTL [55], reaching 74.36%. The F1 score of the proposed model is 77.28%, which is 6.29 percentage points higher than the 70.99% reported by Swin. In terms of AP, the proposed model attains 80.94%, which is 7.75 percentage points higher than the 73.19% achieved by Swin [39]. For the regression task, the proposed model also demonstrates higher predictive accuracy and stability. Its PC coefficient reaches 0.9109, reflecting a 1.07% improvement over DiffusionMTL [55], which yields 0.9002. The RMSE is reduced to 0.2952, compared to 0.3026 reported by DiffusionMTL [55], representing a decrease of 0.73%. In addition, R2 reaches 0.8083, showing a 0.91% improvement over the 0.7992 achieved by DiffusionMTL [55]. These results clearly demonstrate the effectiveness and robustness of the proposed model in multi-task FBP, with substantial improvements observed in both classification and regression tasks.
Figure 11 presents a comprehensive comparison of confusion matrices across different models in the classification task. The second-best-performing model, DiffusionMTL [55], shown in Figure 11e, exhibits critical limitations in minority class detection. This model completely fails to identify minority classes, recording 0% prediction ACC for classes "0", "1", and "4". Although DiffusionMTL [55] achieves a marginally higher ACC of 91.6% on the dominant class "2" compared to the 84.2% ACC achieved by our model, its predictions demonstrate severe bias toward this single dominant category, indicating problematic overfitting behavior and a lack of adaptability to imbalanced class distributions. In contrast, as illustrated in Figure 11f, MoMamba exhibits superior performance in recognizing all categories, effectively addressing both dominant and minority classes. The model achieves prediction ACCs of 40.0%, 49.3%, 84.2%, 89.5%, and 4.3% for classes "0" through "4", respectively. Notably, despite class "4" representing an extremely rare category, the model maintains meaningful recognition capability, demonstrating its inherent robustness and class awareness.
These results highlight the balanced performance of the proposed model in classification, combining high accuracy on dominant classes with significantly better recognition of minority classes. This balanced performance translates to enhanced generalization capabilities and more robust overall classification outcomes. The findings validate the effectiveness of the proposed multi-task decoding architecture in addressing class imbalance challenges, positioning it as a superior solution for real world applications where comprehensive class recognition is paramount.
Figure 12 shows the visualization of regression performance for all models. As shown in Figure 12a,b, Swin [39] and MTLoRA [52] suffer from significant deviations from the ideal line, especially at the lower and upper score ranges, reflecting weaker fitting capacity. Figure 12c–e, corresponding to InvPT [53], TaskPrompter [54], and DiffusionMTL [55], respectively, show moderate improvements in distribution compactness over Swin [39] and MTLoRA [52]. However, the overall spread remains considerable, limiting their generalization ability. Specifically, Swin [39] and MTLoRA [52] exhibit pronounced residual errors in the 1.0–2.0 range, while InvPT [53] and TaskPrompter [54] experience larger errors in the 2.0–3.0 range. Although DiffusionMTL [55] produces lower overall errors, its generalization performance remains inferior to that of the proposed model. Taking RMSE as an example, the proposed model yields RMSE values of 0.2085 and 0.2953 on the training and test sets, respectively, with a difference of 0.0868. In contrast, DiffusionMTL [55] produces RMSE values of 0.1690 and 0.3026 on the training and test sets, respectively, resulting in a larger difference of 0.1336. For MAE, the proposed model achieves values of 0.1658 and 0.2296 on the training and test sets, respectively, with a difference of 0.0606, whereas DiffusionMTL [55] records values of 0.1319 and 0.2296 on the corresponding sets, with a difference of 0.0977. In terms of R2, MoMamba reaches R2 values of 0.9079 and 0.8174 on the training and test sets, respectively, with a difference of 0.0905, while DiffusionMTL [55] records values of 0.9395 and 0.8083 on the corresponding sets, resulting in a difference of 0.1312. Across these three key metrics, the proposed model exhibits substantially smaller performance gaps between the training and test sets, indicating stronger stability and generalization across different data partitions.
In contrast, as shown in Figure 12f, the proposed model produces a tighter clustering of samples around the ideal line in the main scatterplot, indicating superior fitting performance compared to the other methods. Moreover, the marginal histograms show highly consistent value distributions between the training and test sets, suggesting a minimal distributional shift and strong generalization capability. In the residual plot, prediction errors are primarily concentrated near zero, with the highest density of test samples at zero within the range of 2.0–4.0, indicating minimal error and the highest stability in this interval.
In summary, combining the quantitative results and visual analyses of both classification and regression tasks, the proposed model demonstrates clear overall advantages in MTL scenarios.

4.4. Comparative Experiments with State-of-the-Art Models

To further evaluate the modeling capability of the proposed multi-task framework on individual tasks, we conduct a comparative analysis based on the SCUT-FBP5500 dataset by isolating the classification and regression tasks, respectively. The performance is compared against representative single-task models, as presented in Table 3 and Table 4.
In the classification task, the self-supervised method [5] focuses on improving performance under weak supervision via label refinement and pretraining techniques, while CARD [8] combines a diffusion-based conditional generative model with a conditional mean estimator to improve conditional distribution prediction. DiffMIC [56] applies diffusion models to medical image classification with dual conditional guidance for robust feature learning. CD-Loop [57] leverages diffusion-based denoising with pre-trained priors and density clustering to achieve accurate chromatin loop prediction. In comparison, the proposed model achieves an accuracy of 78.36% in the classification task, representing a 2.59 percentage point improvement over DiffMIC [56] with an accuracy of 75.77%, outperforming other baseline methods. These results indicate that the proposed SAME module possesses a stronger capability in extracting discriminative features and enhancing task adaptability.
In the regression task, EN-CNN [9] proposed an ensemble of pre-trained CNNs to address the challenge of accurately predicting facial beauty, achieving results closer to human evaluation. AaNet [12] improves fine-grained modeling through multiscale fusion and a multibranch architecture, while E-BLS [58] enhances continuous score prediction with multi-scale features and attention mechanisms. However, all these methods are based on single-task optimization and cannot balance shared and task-specific representations. MoMamba addresses this limitation in a multi-task setting, achieving a PC coefficient of 0.9109 and surpassing E-BLS [58], which records 0.9104; this result demonstrates that the model excels at modeling continuous beauty scores while effectively leveraging shared task representations. Although EN-CNN reports a higher single-task regression score of 0.9350, MoMamba maintains competitive regression accuracy and simultaneously delivers strong classification performance, demonstrating its advantage in jointly modeling multiple facial beauty tasks.
In summary, the proposed model excels in discrete classification tasks and achieves competitive performance in continuous regression tasks under the multi-task framework. These results highlight the effectiveness of MoMamba in capturing complex features and enabling robust multi-task generalization.

4.5. Ablation Study

A series of ablation experiments were conducted based on both single-task and MTL baselines to evaluate the effectiveness of the proposed modules under an MTL framework. Additionally, inspired by the methodology in [59], a unified metric termed multi-task improvement rate, denoted as ΔM, was designed to quantify the overall performance gains introduced by each module in multi-task scenarios. For classification tasks, ACC, F1 score, and AP are used to comprehensively assess model performance. For regression tasks, PC and R2 are used to assess positive correlation, and the inverse values of MAE and RMSE are applied to unify the metric directionality, ensuring that higher values consistently reflect better performance. This design avoids the influence of scale differences on improvement rate calculations. ΔM is computed by averaging the relative improvement rates across multiple evaluation metrics for both classification and regression tasks and then integrating the results to provide a unified description of overall performance changes across heterogeneous tasks. This metric is particularly useful in scenarios involving multiple heterogeneous tasks, as it effectively reflects the synergistic gains or potential performance degradation resulting from structural enhancements.
The average relative improvement rate $\Delta_{\text{cls}}$ for the classification task is defined as:
$$\Delta_{\text{cls}} = \frac{1}{|C|} \sum_{m \in C} \frac{S_m^{\text{MTL}} - S_m^{\text{STL}}}{S_m^{\text{STL}}}$$
where $C = \{\text{ACC}, \text{F1}, \text{AP}\}$ denotes the set of evaluation metrics used for the classification task.
The average relative improvement rate $\Delta_{\text{reg}}$ for the regression task is defined as:
$$\Delta_{\text{reg}} = \frac{1}{|R|} \sum_{m \in R} \begin{cases} \dfrac{\left(1 / S_m^{\text{MTL}}\right) - \left(1 / S_m^{\text{STL}}\right)}{1 / S_m^{\text{STL}}}, & \text{if } m \in \{\text{MAE}, \text{RMSE}\} \\ \dfrac{S_m^{\text{MTL}} - S_m^{\text{STL}}}{S_m^{\text{STL}}}, & \text{otherwise} \end{cases}$$
where $R = \{\text{PC}, \text{MAE}, \text{RMSE}, R^2\}$ denotes the set of evaluation metrics used for the regression task.
The overall multi-task improvement rate ΔM is
$$\Delta M = \frac{1}{2} \left( \Delta_{\text{cls}} + \Delta_{\text{reg}} \right)$$
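A small sketch of the ΔM computation is given below: MAE and RMSE are inverted so that larger values always indicate better performance before the relative change between the single-task (STL) and multi-task (MTL) scores is taken. The numbers in the usage example are made up for illustration and are not taken from the result tables.

```python
def delta_m(cls_stl, cls_mtl, reg_stl, reg_mtl):
    """Each argument is a dict mapping metric name -> score."""
    inverted = {"MAE", "RMSE"}                  # lower-is-better metrics are inverted

    def rel_gain(stl, mtl, name):
        s, m = stl[name], mtl[name]
        if name in inverted:
            s, m = 1.0 / s, 1.0 / m             # use 1/metric so that higher means better
        return (m - s) / s

    d_cls = sum(rel_gain(cls_stl, cls_mtl, k) for k in cls_stl) / len(cls_stl)
    d_reg = sum(rel_gain(reg_stl, reg_mtl, k) for k in reg_stl) / len(reg_stl)
    return 0.5 * (d_cls + d_reg)

# Hypothetical illustration with made-up numbers:
cls_stl = {"ACC": 0.62, "F1": 0.60, "AP": 0.61}
cls_mtl = {"ACC": 0.64, "F1": 0.62, "AP": 0.63}
reg_stl = {"PC": 0.81, "MAE": 0.31, "RMSE": 0.40, "R2": 0.70}
reg_mtl = {"PC": 0.82, "MAE": 0.30, "RMSE": 0.39, "R2": 0.72}
print(f"Delta M = {delta_m(cls_stl, cls_mtl, reg_stl, reg_mtl):.2%}")
```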

4.5.1. Ablation Study of Individual Modules on the SCUT-FBP Dataset

The ablation results of each module on the SCUT-FBP dataset are presented in Table 5. First, compared to the single-task baseline, the multi-task baseline without any structural enhancements shows clear performance degradation in the regression task. For example, the PC coefficient decreases from 0.8104 in the single-task setting to 0.7716 in the multi-task setting, resulting in a drop of 0.0388. The corresponding multi-task improvement rate, denoted as ΔM, reaches −4.79%, reflecting a typical case of task conflict, also known as negative transfer in MTL. After incorporating the DEM, the classification performance improves significantly. The ACC increases from 62.00% to 64.00%, yielding a gain of 2.00 percentage points. The F1 score increases from 58.85% to 62.20%, marking an improvement of 3.58 percentage points. In the regression task, the PC coefficient recovers to 0.7981, and the overall ΔM reaches 0.46%, validating the effectiveness of complementary modeling in the frequency and temporal domains for mitigating task conflict. Subsequently, when the SAME module is integrated with the multi-task baseline, the model demonstrates enhanced cross-task feature sharing. The AP for classification increases from 59.64% to 69.81%, exhibiting a significant gain of 10.17 percentage points. In the regression task, the MAE decreases from 0.3801 to 0.3074, and ΔM improves from 0% to 8.37%, indicating more efficient task cooperation. Finally, the combined integration of both the DEM and SAME module leads to comprehensive improvements. The model achieves an ACC of 68.00% in classification. For the regression task, its PC coefficient and R2 reach 0.8694 and 0.7473, respectively. The overall ΔM increases to 15.31%. These results highlight the complementary synergy between the DEM and SAME module, demonstrating their effectiveness in cross-task feature alignment and dynamic expert selection.

4.5.2. Ablation Study of Individual Modules on SCUT-FBP5500

The ablation results of each module based on the SCUT-FBP5500 dataset are summarized in Table 6. Firstly, the multi-task baseline shows significant performance degradation compared to the single-task baseline. For example, in the regression task, the PC coefficient drops from 0.9023 under the single-task setting to 0.8734 under the multi-task setting, a decrease of 0.0289, leading to a ΔM of −4.00%. This indicates a typical case of negative transfer in MTL. Secondly, after introducing the DEM module, the model performance improves over the multi-task baseline. The PC coefficient increases from 0.8734 to 0.8894, and the RMSE decreases from 0.3409 to 0.3180. In classification, the F1 score also rises from 74.23% to 75.93%. The ΔM improves from −4.00% to −1.44%, marking a gain of 2.56 percentage points. Although performance is still slightly below the single-task baseline, this result demonstrates the initial effectiveness of DEM in mitigating intertask interference. Thirdly, when the SAME module is added independently, overall performance is further enhanced. In the classification task, the AP increases from 78.71% to 79.35%, an improvement of 0.64 percentage points. In the regression task, the mean absolute error decreases from 0.2589 to 0.2311, and the PC coefficient reaches 0.8989, nearly recovering to the single-task value of 0.9023. The ΔM rises to +0.74%, achieving positive transfer for the first time and verifying the effectiveness of the dynamic expert routing mechanism in cross-task modeling. Finally, when both the DEM and SAME modules are incorporated, the model achieves optimal performance across both classification and regression tasks. Compared with the single-task baseline, the accuracy increases from 76.27% to 78.36%, the F1 score improves from 75.10% to 77.28%, and the AP rises from 79.04% to 80.94%, representing gains of 2.09, 2.18, and 1.90 percentage points, respectively. In the regression task, the RMSE is reduced from 0.2989 to 0.2953, and the ΔM reaches +2.11%.
In summary, the DEM and SAME module demonstrate strong complementarity in feature selection and dynamic modeling. Their joint contribution effectively enhances the overall performance of the MTL model and alleviates potential intertask conflicts.

4.5.3. Ablation Study on the Number of Experts Based on the SCUT-FBP5500 Dataset

To systematically investigate the impact of the number of low-rank experts and the LoRA rank within the SAME module, this study evaluates model performance under different expert configurations while keeping the architecture and training setup fixed. The number of experts is varied from 0 to 15, and the LoRA rank per expert is adjusted from 16 to 128 in increments of 8. The corresponding influence on the overall multi-task improvement rate, denoted as ΔM, is summarized in Table 7.
When no expert mechanism is applied, that is, the number of experts is set to zero, the model exhibits the weakest performance. The ΔM reaches −6.57%, indicating a significant decline in task modeling capacity and the emergence of negative transfer in MTL due to the absence of expert routing. However, model performance improves steadily as the number of experts increases. The highest performance is observed with nine experts. Beyond this point, performance begins to degrade. This degradation is primarily caused by the inability of the routing mechanism to effectively allocate an excessive number of experts, resulting in task-level feature confusion. At the optimal setting with nine experts, the classification metrics reach their peak values. The ACC reaches 78.36%, the F1 score improves to 77.28%, and the AP reaches 80.94%. In regression tasks, the PC coefficient rises to 0.9109, the RMSE drops to 0.2953, and R2 reaches 0.8174. The ΔM improves to +2.11%. These results confirm that a moderate number of experts can effectively reduce task interference and promote better cross-task representation learning. However, performance begins to decline again when the number of experts exceeds nine. In particular, the regression task suffers from increased MAE and RMSE, suggesting a loss of generalization caused by parameter redundancy and unstable routing. As illustrated in Figure 13 and Figure 14, the ΔM initially increases with the number of experts but begins to decrease once the number surpasses the optimal value.
In summary, the number of experts significantly influences model performance. A moderate number of experts, specifically nine in this setting, achieves a favorable trade-off between accuracy and stability, leading to improved MTL performance.
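To make the roles of the expert count and the LoRA rank concrete, the sketch below shows one plausible LoRA-style low-rank expert in PyTorch: a shared frozen projection plus a trainable rank-r update, so each additional expert contributes only about 2·d·r expert-specific parameters. The class name, dimensions, and initialization are illustrative assumptions rather than the authors' SAME implementation.

```python
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    """Illustrative LoRA-style expert: a shared (frozen) projection plus a
    trainable low-rank update B @ A; only 2 * d * r parameters are expert-specific."""
    def __init__(self, shared_proj: nn.Linear, rank: int = 64, alpha: float = 1.0):
        super().__init__()
        d_in, d_out = shared_proj.in_features, shared_proj.out_features
        self.shared = shared_proj                        # shared across experts, kept frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: expert starts as identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shared(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: nine experts of rank 64 over a 256-dimensional token stream (hypothetical sizes).
shared = nn.Linear(256, 256)
for p in shared.parameters():
    p.requires_grad = False
experts = nn.ModuleList([LowRankExpert(shared, rank=64) for _ in range(9)])
tokens = torch.randn(4, 196, 256)      # (batch, tokens, channels)
out = experts[0](tokens)               # one expert's output
print(out.shape)                       # torch.Size([4, 196, 256])
```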

4.5.4. Ablation of Top-K Expert Selection Strategy on the SCUT-FBP5500 Dataset

To investigate the impact of the Top-K selection mechanism in the dynamic MoE module on MTL performance, a series of experiments were conducted by fixing the number of experts at nine while varying the value of Top-K. The results are summarized in Table 8.
The results reveal a nonlinear relationship between the Top-K setting and overall model performance. When Top-K is set to one, each sample is routed to only a single expert. While this configuration offers high computational efficiency, it often leads to biased routing decisions and lacks feature collaboration and redundancy tolerance across experts. Consequently, it yields the poorest performance across all metrics, with the multi-task improvement rate ΔM dropping to −4.85%, indicating a clear case of negative transfer. As Top-K increases, the number of accessible experts per sample grows, making the routing mechanism more flexible and enhancing both task-sharing and task-specific modeling, thereby leading to consistent performance improvements. The optimal performance is achieved when Top-K equals four. At this point, the routing mechanism maintains selective allocation while allowing sufficient feature fusion, resulting in the best performance across classification and regression tasks. However, further increasing Top-K beyond four causes performance degradation. On the one hand, excessive expert access introduces redundant information that interferes with specialization. On the other hand, it increases gradient conflicts during training, negatively affecting convergence stability. Although a higher Top-K increases the frequency of expert usage, it does not necessarily improve feature collaboration. Instead, it may harm multi-task performance due to routing redundancy. As illustrated in Figure 13 and Figure 14, ΔM increases with Top-K up to four but begins to decline when Top-K exceeds this value. Therefore, this study adopts a Top-K setting of four.
In conclusion, the Top-K strategy plays a critical role in the effectiveness of the expert routing mechanism. A small value restricts collaboration, while a large value introduces redundancy. A moderate Top-K setting achieves the best balance between routing efficiency and expert diversity, resulting in superior MTL performance.
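For intuition about the routing itself, the following sketch shows one common way to implement Top-K gating over such an expert bank: a linear gate scores all experts, only the K largest softmax weights per token are kept and renormalized, and the selected expert outputs are mixed. It reuses the illustrative experts from the previous sketch and is a simplified stand-in for the SAME routing, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative Top-K gating: keep the K largest gate weights per token,
    renormalize them, and sum the selected experts' outputs."""
    def __init__(self, dim: int, experts: nn.ModuleList, top_k: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, len(experts))
        self.experts = experts
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.gate(x), dim=-1)                  # (B, N, E)
        weights, idx = torch.topk(gates, self.top_k, dim=-1)     # (B, N, K)
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the kept gates
        # Dense evaluation of all experts for clarity; a real MoE would dispatch sparsely.
        all_out = torch.stack([e(x) for e in self.experts], dim=2)          # (B, N, E, C)
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))          # (B, N, K, C)
        picked = torch.gather(all_out, dim=2, index=idx_exp)                # (B, N, K, C)
        return (weights.unsqueeze(-1) * picked).sum(dim=2)                  # (B, N, C)

# Usage with the illustrative experts and tokens from the previous sketch:
# router = TopKRouter(dim=256, experts=experts, top_k=4)
# mixed = router(tokens)   # shape (4, 196, 256)
```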

4.5.5. Impact of Frequency Components in DEM

To investigate the role of frequency information in the detail enhancement module (DEM), we conducted an ablation study by selectively enabling or disabling frequency extraction. Specifically, we considered four configurations: using raw image features without frequency decomposition (without frequency), retaining only the low-frequency components via DCT (only low-frequency), retaining only the high-frequency components (only high-frequency), and finally integrating both low- and high-frequency features (low- and high-frequency).
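A minimal sketch of such a DCT-based split is given below: it zeroes the coefficients outside (or inside) a low-frequency corner of the 2-D DCT spectrum before inverting. The cutoff size and the use of SciPy here are illustrative assumptions, not the exact DEM configuration.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_frequency_split(img: np.ndarray, cutoff: int = 16):
    """Illustrative DCT decomposition of a grayscale image (H, W).
    Coefficients inside the top-left cutoff x cutoff corner form the
    low-frequency part; all remaining coefficients form the high-frequency part."""
    coeffs = dctn(img, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:cutoff, :cutoff] = 1.0                       # low-frequency corner of the spectrum
    low = idctn(coeffs * mask, norm="ortho")           # global structure (face shape, skin)
    high = idctn(coeffs * (1.0 - mask), norm="ortho")  # edges, wrinkles, fine texture
    return low, high

# Example on a random 224 x 224 "image"; by linearity, low + high reconstructs the input.
img = np.random.rand(224, 224).astype(np.float32)
low, high = dct_frequency_split(img, cutoff=16)
print(np.allclose(low + high, img, atol=1e-5))         # True
```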
The results of the frequency ablation study are summarized in Table 9, which presents the performance under four different frequency configurations in DEM. From the table, we observe that using raw image features without frequency decomposition yields the lowest performance across most metrics, with an ACC of 74.91%, F1 of 73.22%, and PC of 0.8741. Introducing only low-frequency components improves the overall performance, particularly in capturing global facial structures, resulting in an ACC of 76.45% and PC of 0.8827. Retaining only high-frequency components also improves over the configuration without frequency, highlighting that fine-grained details such as edges and textures are beneficial for both classification and regression tasks. The best results are obtained when both low- and high-frequency features are integrated, achieving an ACC of 78.36%, F1 of 77.28%, PC of 0.9109, and R2 of 0.8174, indicating that complementary information from different frequency bands significantly enhances model performance.
To further illustrate the impact of frequency features, we visualize the feature activations under each configuration in Figure 15. Figure 15a–d present the heatmaps corresponding to four configurations: without frequency, only low-frequency, only high-frequency, and low- and high-frequency. The heatmaps reveal that low-frequency features mainly emphasize global structures, such as overall face shape and skin regions, whereas high-frequency features highlight local details, including eyes, mouth, and wrinkles. The fused low- and high-frequency representation provides the most comprehensive coverage, capturing both structural and textural cues, which explains the superior quantitative performance observed in Table 9.
In summary, these results demonstrate that frequency information plays a crucial role in facial representation. Low-frequency components provide global structural context, high-frequency components preserve fine details, and their integration allows the DEM to capture both macro and micro facial cues, enhancing the robustness and accuracy of the proposed multi-task FBP model.

4.6. Discussion and Limitations

This section discusses the results and analyzes the limitations of the proposed model. We further conducted subgroup experiments on the SCUT-FBP5500 dataset by dividing the samples into male and female subsets, followed by separate training and evaluation. The results are shown in Table 10.
The gender-based error analysis shows clear performance differences. For female faces, the model achieves an ACC of 76.01%, PC of 0.8936, and MAE of 0.2499, while for male faces the ACC rises to 80.69%, with a PC of 0.9271 and MAE of 0.2096. These results indicate that the model performs more reliably on male samples, achieving a higher predictive accuracy and lower error. Although the model achieves strong overall performance, it exhibits a gender-related performance gap. Furthermore, the dataset is limited in its demographic diversity, particularly with respect to ethnicity and age, which may hinder the generalizability of the findings to broader populations. Future work should address these limitations by incorporating more balanced datasets and fairness-aware training strategies to improve robustness across demographic groups.

5. Conclusions

This paper proposes MoMamba, a novel multi-task model that integrates the strengths of MoE and Mamba for FBP. The model design incorporates two key modules: the DEM and the SAME module. Experimental results demonstrate the superior performance of MoMamba in MTL scenarios. By enhancing frequency-domain and contextual information through the DEM and achieving fine-grained expert scheduling via the SAME module, the model achieves significant improvements in classification accuracy and regression stability. Compared with state-of-the-art methods, MoMamba outperforms existing approaches on multiple metrics, achieving an ACC of 78.36%, a PC of 0.9109, and an MAE of 0.2296 on SCUT-FBP5500. These results validate its practical value and research potential in FBP. Future work will explore knowledge distillation and lightweight architecture design to enhance model applicability on mobile and edge devices.

Author Contributions

Funding acquisition, J.G.; Methodology, J.G., Z.Z. and H.C.; Visualization, J.G.; Writing—original draft, Z.Z.; Writing—review and editing, J.G., Z.Z., W.X., Z.C. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61771347 (Project Title: Weakly Supervised Deep Learning for Facial Beauty Prediction in Big Data Environment).

Data Availability Statement

The datasets used in this study are publicly available and were created by the South China University of Technology (SCUT). The SCUT-FBP dataset is accessible at http://www.hcii-lab.net/data/scut-fbp/cn/introduce.html (accessed on 17 September 2024), and the SCUT-FBP5500 dataset is available at https://github.com/HCIILAB/SCUT-FBP5500-Database-Release (accessed on 17 September 2024). Researchers can obtain these datasets from the provided links under the terms and conditions specified by their creators.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Eisenthal, Y.; Dror, G.; Ruppin, E. Facial attractiveness: Beauty and the machine. Neural Comput. 2006, 18, 119–142.
  2. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
  3. Zhou, Y.; Xia, H.; Yu, D.; Cheng, J.; Li, J. Outlier detection method based on high-density iteration. Inf. Sci. 2024, 662, 120286–120292.
  4. Sun, L.; Shi, W.; Tian, X.; Li, J.; Zhao, B.; Wang, S.; Tan, J. A plane stress measurement method for CFRP material based on array LCR waves. NDT E Int. 2025, 151, 103318–103324.
  5. Gan, J.; Zhou, Q.; He, G. A novel method to facial beauty prediction based on self-supervised learning. Signal Process. 2023, 39, 1500–1509.
  6. Zhai, Y.; Huang, Y.; Xu, Y.; Gan, J.; Cao, H.; Deng, W.; Labati, R.D.; Piuri, V.; Scotti, F. Asian female facial beauty prediction using deep neural networks via transfer learning and multi-channel feature fusion. IEEE Access 2020, 8, 56892–56907.
  7. Saeed, J.N.; Abdulazeez, A.M.; Ibrahim, D.A. FIAC-Net: Facial image attractiveness classification based on light deep convolutional neural network. In Proceedings of the 2022 Second International Conference on Computer Science, Engineering and Applications (ICCSEA), Gunupur, India, 8 September 2022; pp. 1–6.
  8. Han, X.; Zheng, H.; Zhou, M. Card: Classification and regression diffusion models. Adv. Neural Inf. Process. Syst. 2022, 35, 18100–18115.
  9. Boukhari, D.E.; Chemsa, A.; Taleb-Ahmed, A.; Ajgou, R.; Bouzaher, M.T. Facial beauty prediction using an ensemble of deep convolutional neural networks. Eng. Proc. 2023, 56, 125.
  10. Arabo, W.; Abdulazeez, A.M. Facial Beauty Prediction Based on Deep Learning: A Review. Indones. J. Comput. Sci. 2024, 13, 7269–7286.
  11. Sun, Z.; Lin, L.; Yu, Y.; Jin, L. Learning feature alignment across attribute domains for improving facial beauty prediction. Expert Syst. Appl. 2024, 249, 123644–123663.
  12. Lin, L.; Liang, L.; Jin, L.; Chen, W. Attribute-Aware Convolutional Neural Networks for Facial Beauty Prediction. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 847–853.
  13. Xu, Y.; Li, X.; Yuan, H.; Yang, Y.; Zhang, L. Multi-task learning with multi-query transformer for dense prediction. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1228–1240.
  14. Shah, U.; Tukur, M.; Alzubaidi, M.; Pintore, G.; Gobbetti, E.; Househ, M.; Schneider, J.; Agus, M. MultiPanoWise: Holistic deep architecture for multi-task dense prediction from a single panoramic image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 17–18 June 2024; pp. 1311–1321.
  15. Sirejiding, S.; Bayramli, B.; Lu, Y.; Yang, Y.; Alsarhan, T.; Lu, H.; Ding, Y. Task-Interaction-Free multi-task learning with efficient hierarchical feature representation. In Proceedings of the 32nd ACM International Conference on Multimedia, Lisbon, Portugal, 28 October–1 November 2024; pp. 6103–6112.
  16. Guo, P.; Lee, C.Y.; Ulbricht, D. Learning to branch for multi-task learning. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Vienna, Austria, 14–18 June 2009; pp. 3854–3863.
  17. Li, Z.; Wang, X.; Fang, S.; Zhao, J.; Yang, S.; Li, W. A decoder-focused multitask network for semantic change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
  18. Lin, B.; Jiang, W.; Chen, P.; Zhang, Y.; Liu, S.; Chen, Y.C. MTMamba: Enhancing multi-task dense scene understanding by mamba-based decoders. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 314–330.
  19. Chen, T.; Chen, X.; Du, X.; Rashwan, A.; Yang, F.; Chen, H.; Wang, Z.; Li, Y. AdaMV-MoE: Adaptive multi-task vision mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17346–17357.
  20. Fan, Z.W.; Sarkar, R.; Jiang, Z.; Chen, T.; Zou, K.; Cheng, Y.; Hao, C.; Wang, Z. M3vit: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. Adv. Neural Inf. Process. Syst. 2022, 35, 28441–28457.
  21. Ye, H.R.; Xu, D. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 21828–21837.
  22. Gu, A.; Goel, K.; Gupta, A.; Ré, C. On the parameterization and initialization of diagonal state space models. Adv. Neural Inf. Process. Syst. 2022, 35, 35971–35983.
  23. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585.
  24. Peng, Z. Ptm-mamba: A ptm-aware protein language model with bidirectional gated mamba blocks. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), Birmingham, UK, 21–25 October 2024; pp. 5475–5478.
  25. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order vision mamba unet for medical image segmentation. Neurocomputing 2025, 23, 129447–129460.
  26. Zhou, R.; Wang, J.; Xia, G.; Xing, J.; Shen, H.; Shen, X. Cascade residual multiscale convolution and mamba-structured unet for advanced brain tumor image segmentation. Entropy 2024, 26, 385–397.
  27. Zhang, G.; Fan, L.; He, C.; Lei, Z.; Zhang, Z.X.; Zhang, L. Voxel mamba: Group-free state space models for point cloud based 3D object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 81489–81509.
  28. Wang, C.; Huang, J.; Lv, M.; Du, H.; Wu, Y.; Qin, R. A local enhanced mamba network for hyperspectral image classification. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104092–104106.
  29. Sheng, J.; Zhou, J.; Wang, J.; Ye, P. Dualmamba: A lightweight spectral-spatial mamba-convolution network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 63, 1–15.
  30. Ahmad, M.; Butt, M.H.F.; Khan, A.M.; Mazzara, M.; Distefano, S. Spatial-spectral morphological mamba for hyperspectral image classification. Neurocomputing 2025, 636, 129995.
  31. Xie, D.; Liang, L.; Jin, L.; Xu, J.; Li, M. SCUT-FBP: A benchmark dataset for facial beauty perception. In Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics, Hong Kong, China, 9–12 October 2015; pp. 1821–1826.
  32. Liang, L.; Lin, L.; Jin, L.; Xie, D.; Li, M. SCUT-FBP5500: A diverse benchmark dataset for multi-paradigm facial beauty prediction. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1598–1603.
  33. Vahdati, E.; Suen, C.Y. Facial beauty prediction from facial parts using multi-task and multi-stream convolutional neural networks. Int. J. Pattern Recognit. Artif. Intell. 2021, 35, 2160002–2160013.
  34. Gan, J.; Luo, H.; Xiong, J.; Xie, X.; Li, H.; Liu, J. Facial beauty prediction combined with multi-task learning of adaptive sharing policy and attentional feature fusion. Electronics 2023, 13, 179–190.
  35. Vahdati, E.; Suen, C.Y. Facial beauty prediction using transfer and multi-task learning techniques. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, Paris, France, 1–3 June 2020; pp. 441–452.
  36. Gong, T.; Lee, T.; Stephenson, C.; Renduchintala, V.; Padhy, S.; Ndirango, A.; Keskin, G.; Elibol, O.H. A comparison of loss weighting strategies for multi task learning in deep neural networks. IEEE Access 2019, 7, 141627–141632.
  37. Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880.
  38. Chen, Z.; Xie, L.; Niu, J.; Liu, X.; Wei, L. Visformer: The vision-friendly transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 589–598.
  39. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  40. Woods, R.E.; Gonzalez, R.C. Digital Image Processing, Third Edition; Electronic Industry Press: Beijing, China, 2021.
  41. Riquelme, C.; Puigcerver, J.; Mustafa, B. Scaling vision with sparse mixture of experts. Adv. Neural Inf. Process. Syst. 2021, 34, 8583–8595.
  42. Hwang, C.; Cui, W.; Xiong, Y.; Yang, Z.; Liu, Z.; Hu, H.; Wang, Z.; Salas, R.; Jose, J.; Ram, P.; et al. Tutel: Adaptive mixture-of-experts at scale. Proc. Mach. Learn. Syst. 2023, 5, 269–287.
  43. Yun, S.; Lee, H.; Kim, J.; Shin, J. Patch-level representation learning for self-supervised vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8354–8363.
  44. Zhang, Y.; Cai, R.; Chen, T.; Zhang, G.; Zhang, H.; Chen, P.Y.; Chang, S.; Wang, Z.; Liu, S. Robust mixture-of-expert training for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 90–101.
  45. Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep state space models for time series forecasting. Adv. Neural Inf. Process. Syst. 2018, 31, 7785–7794.
  46. Cho, P.; Dash, S.; Tsaris, A.; Yoon, H.J. Image transformers for classifying acute lymphoblastic leukemia. In Medical Imaging 2022: Computer-Aided Diagnosis; SPIE: San Diego, CA, USA, 2022; Volume 12033, pp. 647–653.
  47. Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5295–5305.
  48. Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML), Honolulu, HI, USA, 21–27 July 2024; pp. 10041–10071.
  49. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 May 2022; Volume 1, pp. 3–23.
  50. Rizzolatti, G.; Craighero, L. Spatial attention: Mechanisms and theories. In Advances in Psychological Science; Psychology Press: Hove, UK, 2014; pp. 171–198.
  51. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  52. Agiza, A.; Neseem, M.; Reda, S. MTLoRA: Low-rank adaptation approach for efficient multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 16–22 June 2024; pp. 16196–16205.
  53. Ye, H.; Xu, D. Inverted pyramid multi-task transformer for dense scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 514–530.
  54. Ye, H.; Xu, D. TaskPrompter: Spatial-channel multi-task prompting for dense scene understanding. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022; pp. 1–20.
  55. Ye, H.; Xu, D. DiffusionMTL: Learning multi-task denoising diffusion model from partially annotated data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 16–22 June 2024; pp. 27960–27969.
  56. Yang, Y.; Fu, H.; Aviles-Rivero, A.I.; Schönlieb, C.-B.; Zhu, L. Diffmic: Dual-guidance Diffusion Network for Medical Image Classification. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Vancouver, BC, Canada, 8–12 October 2023; pp. 95–105.
  57. Shen, J.; Wang, Y.; Luo, J. Cd-loop: A chromatin loop detection method based on the diffusion model. Front. Genet. 2024, 15, 1393406.
  58. Gan, J.; Xie, X.; Zhai, Y.; He, G.; Mai, C.; Luo, H. Facial beauty prediction fusing transfer learning and broad learning system. Soft Comput. 2023, 27, 13391–13404.
  59. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609.
Figure 1. Overall architecture of the proposed MoMamba.
Figure 2. Architecture of the DEM. Among them, Green represents low-frequency information, orange represents high-frequency information, and dark blue represents mask information.
Figure 3. Architecture of the SAME module.
Figure 4. Architecture of the HSSD.
Figure 5. Architecture of the cross-attention mechanism.
Figure 6. SCUT-FBP classification label distribution.
Figure 7. SCUT-FBP regression labels distribution.
Figure 8. SCUT-FBP5500: Sample images by ethnicity and gender with beauty scores.
Figure 9. Confusion matrices of different models for the classification task on the SCUT-FBP dataset.
Figure 10. Regression prediction and residual visualization of various models on the SCUT-FBP dataset.
Figure 11. Confusion matrices of different models for the classification task on the SCUT-FBP5500 dataset.
Figure 12. Regression prediction and residual visualization of various models on the SCUT-FBP5500 dataset.
Figure 13. Impact of varying the number of experts on MTL performance.
Figure 14. Impact of Top-K selection on MTL performance improvement.
Figure 15. Heatmaps of facial features under different frequency components.
Table 1. Comparative results on the SCUT-FBP dataset.

| Model | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 |
|---|---|---|---|---|---|---|---|
| Swin [39] | 58.00 | 54.41 | 53.61 | 0.6458 | 0.4490 | 0.5674 | 0.3878 |
| MTLoRA [52] | 57.00 | 53.17 | 55.21 | 0.6423 | 0.4175 | 0.5626 | 0.3980 |
| InvPT [53] | 66.00 | 59.14 | 61.90 | 0.8461 | 0.3002 | 0.3871 | 0.7150 |
| TaskPrompter [54] | 64.00 | 60.31 | 60.11 | 0.8657 | 0.2891 | 0.3650 | 0.7466 |
| Diffusion-MTL [55] | 64.00 | 59.02 | 62.40 | 0.8595 | 0.2930 | 0.3726 | 0.7359 |
| MoMamba | 68.00 | 65.27 | 69.88 | 0.8694 | 0.2912 | 0.3646 | 0.7473 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 2. Comparative results on the SCUT-FBP5500 dataset.

| Model | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 |
|---|---|---|---|---|---|---|---|
| Swin [39] | 72.64 | 70.99 | 73.19 | 0.8358 | 0.2939 | 0.3816 | 0.6951 |
| MTLoRA [52] | 73.10 | 70.10 | 72.36 | 0.8228 | 0.3044 | 0.3937 | 0.6756 |
| InvPT [53] | 73.82 | 66.84 | 60.32 | 0.8961 | 0.2345 | 0.3106 | 0.7981 |
| TaskPrompter [54] | 73.27 | 66.34 | 62.89 | 0.8860 | 0.2500 | 0.3260 | 0.7776 |
| DiffusionMTL [55] | 74.36 | 67.32 | 62.69 | 0.9002 | 0.2296 | 0.3026 | 0.8083 |
| MoMamba | 78.36 | 77.28 | 80.94 | 0.9109 | 0.2296 | 0.2953 | 0.8174 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 3. Comparative results of the classification task on the SCUT-FBP5500 dataset.

| Model | ACC (%) |
|---|---|
| Self-supervised method [5] | 72.16 |
| CARD [8] | 71.40 |
| DiffMIC [56] | 75.77 |
| CD-Loop [57] | 65.48 |
| MoMamba | 78.36 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 4. Comparative results of the regression task on the SCUT-FBP5500 dataset.

| Model | PC |
|---|---|
| P-AaNet [12] | 0.8965 |
| EN-CNN [9] | 0.9350 |
| AaNet [12] | 0.9056 |
| E-BLS [58] | 0.9104 |
| MoMamba | 0.9109 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 5. Ablation results based on the SCUT-FBP dataset.

| Setting | Model Configuration | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 | ΔM (%) |
|---|---|---|---|---|---|---|---|---|---|
| Single-task | B | 62.00 | 58.62 | 59.26 | 0.8104 | 0.3747 | 0.4521 | 0.6761 | 0.00 |
| Multi-task | B | 62.00 | 58.85 | 59.64 | 0.7716 | 0.3801 | 0.5032 | 0.5184 | −4.79 |
| Multi-task | B + D | 64.00 | 62.20 | 64.65 | 0.7981 | 0.3658 | 0.4780 | 0.5654 | +0.46 |
| Multi-task | B + S | 64.00 | 59.02 | 69.81 | 0.8360 | 0.3074 | 0.4058 | 0.6867 | +8.37 |
| Multi-task | B + D + S | 68.00 | 65.27 | 69.88 | 0.8694 | 0.2911 | 0.3645 | 0.7473 | +15.31 |

Note: ACC, F1 score, and AP are expressed in percentage. DEM, SAME, and baseline are denoted as D, S, and B, respectively. Bold indicates the best performance.
Table 6. Ablation results based on the SCUT-FBP5500 dataset.

| Setting | Model Configuration | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 | ΔM (%) |
|---|---|---|---|---|---|---|---|---|---|
| Single-task | B | 76.27 | 75.10 | 79.04 | 0.9023 | 0.2375 | 0.2989 | 0.8130 | 0.00 |
| Multi-task | B | 76.72 | 74.23 | 78.71 | 0.8734 | 0.2589 | 0.3409 | 0.7567 | −4.00 |
| Multi-task | B + D | 77.18 | 75.93 | 77.98 | 0.8894 | 0.2431 | 0.3180 | 0.7884 | −1.44 |
| Multi-task | B + S | 77.91 | 76.34 | 79.35 | 0.8989 | 0.2311 | 0.3031 | 0.8076 | +0.74 |
| Multi-task | B + D + S | 78.36 | 77.28 | 80.94 | 0.9109 | 0.2296 | 0.2953 | 0.8174 | +2.11 |

Note: ACC, F1 score, and AP are expressed in percentage. DEM, SAME, and baseline are denoted as D, S, and B, respectively. Bold indicates the best performance.
Table 7. Impact of expert number on model performance on the SCUT-FBP5500 dataset.

| Expert Number | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 | ΔM (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | 76.27 | 74.66 | 76.94 | 0.8774 | 0.2757 | 0.3704 | 0.7127 | −6.57 |
| 3 | 76.91 | 74.93 | 80.09 | 0.9073 | 0.2383 | 0.3119 | 0.7963 | −0.43 |
| 6 | 77.54 | 75.17 | 79.79 | 0.9023 | 0.2274 | 0.3046 | 0.8056 | +0.66 |
| 9 | 78.36 | 77.28 | 80.94 | 0.9109 | 0.2296 | 0.2953 | 0.8174 | +2.11 |
| 12 | 77.36 | 77.34 | 79.34 | 0.9062 | 0.2378 | 0.3104 | 0.7983 | +0.15 |
| 15 | 77.27 | 75.80 | 78.91 | 0.8939 | 0.2390 | 0.3177 | 0.7888 | −0.96 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 8. Impact of Top-K on model performance on the SCUT-FBP5500 dataset.

| Top-K | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 | ΔM (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 76.00 | 74.28 | 74.90 | 0.8863 | 0.2589 | 0.3423 | 0.7547 | −4.85 |
| 2 | 76.36 | 75.38 | 79.09 | 0.8914 | 0.2593 | 0.3362 | 0.7633 | −3.26 |
| 3 | 77.63 | 77.38 | 79.80 | 0.8981 | 0.2568 | 0.3308 | 0.7708 | −1.89 |
| 4 | 78.36 | 77.28 | 80.94 | 0.9109 | 0.2296 | 0.2953 | 0.8174 | +2.11 |
| 5 | 77.18 | 74.97 | 79.03 | 0.8979 | 0.2593 | 0.3332 | 0.7675 | −2.93 |
| 6 | 76.81 | 75.89 | 78.43 | 0.9026 | 0.2721 | 0.3515 | 0.7414 | −4.39 |

Note: Bold indicates the best performance. Underline indicates the second best.
Table 9. Performance with different frequency configurations.

| Configuration | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 |
|---|---|---|---|---|---|---|---|
| Without frequency | 74.91 | 73.22 | 74.08 | 0.8741 | 0.2757 | 0.3629 | 0.7242 |
| Only low-frequency | 76.45 | 74.51 | 78.50 | 0.8827 | 0.2552 | 0.3414 | 0.7559 |
| Only high-frequency | 75.27 | 74.11 | 76.47 | 0.8825 | 0.2629 | 0.3511 | 0.7420 |
| Low- and high-frequency | 78.36 | 77.28 | 80.94 | 0.9109 | 0.2296 | 0.2953 | 0.8174 |

Note: Bold indicates the best performance.
Table 10. Performance comparison across gender groups.

| Gender | ACC (%) | F1 (%) | AP (%) | PC | MAE | RMSE | R2 |
|---|---|---|---|---|---|---|---|
| Female | 76.01 | 74.99 | 77.25 | 0.8936 | 0.2499 | 0.3192 | 0.7881 |
| Male | 80.69 | 79.62 | 85.30 | 0.9271 | 0.2096 | 0.2697 | 0.8444 |