Article

Multimodal Image Translation Algorithm Based on Singular Squeeze-and-Excitation Network

1 School of Computer Science and Technology, Zhejiang University, Hangzhou 310015, China
2 School of Computer and Computational Science, Hangzhou City University, Hangzhou 310015, China
3 College of Engineering, Zhejiang University of Technology, Hangzhou 310015, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(1), 177; https://doi.org/10.3390/math13010177
Submission received: 21 November 2024 / Revised: 18 December 2024 / Accepted: 29 December 2024 / Published: 6 January 2025

Abstract

Image-to-image translation methods have advanced from focusing on image-level information to incorporating pixel-level and instance-level details. However, when only feature-level constraints are applied, the network tends to overemphasize convolutional features and neglect traditional image feature extraction, which causes deviations. To address this, we propose MASSE, a multimodal image translation algorithm based on a Singular Squeeze-and-Excitation Network that combines GANs and the SENet. It uses SVD features to help the SENet manage the degree of channel scaling: the SENet employs SVD-extracted features to enhance the Excitation operation, obtaining new channel attention weights and forming attention feature maps. Image content features are then refined by combining convolutional and attention feature maps, while style features are obtained by the style generator. Finally, content and style features are combined to generate images in new styles. Ablation experiments show that the optimal SVD parameter is N = 128, which produces the best translation results. According to the FID metric, MASSE outperforms current methods in generating diverse images.

1. Introduction

In computer vision, image translation encompasses a wide range of complex challenges [1,2,3]. The primary goal is to convert images from one domain to another. With the advancement of deep learning, the application of GANs to address image translation problems has gained recognition from numerous scholars and experts, which includes both paired image translation [4,5] and unpaired image translation [6,7] techniques. The introduction of paired image translation, exemplified by methods like pix2pix, has resolved several practical issues in image generation [8,9,10]. However, these methods are not without their limitations. Paired image translation methods, such as pix2pix, require substantial amounts of data. On the other hand, images generated by unpaired translation methods tend to lack diversity, failing to produce images in various forms. There are specific tasks, such as virtual clothing try-on [11] and facial expression [12] image translation, that demand the network to output data with multiple modalities. To address these issues, Huang [13] proposed the multimodal image translation (MUNIT) framework. The key innovation of this network lies in its introduction of content and style generators, which integrate the characteristics of content and style features from various domains to tackle the problem of single output in existing models. This approach successfully enables multimodal image translation, providing diverse and varied outputs.
Specifically, MUNIT also faces challenges. It requires the network to extract content and style features and create new modal images by combining different content and style features. However, the distinction between content and style characteristics is not always clear, as the feature constraints depend on the discriminator’s ability to differentiate images within the same domain. This reliance results in less obvious distinctions between content and style features. Therefore, the primary focus for improving MUNIT lies in developing methods to effectively constrain the generator to produce the desired content and style features.
In the process of extracting content and style features, the attention mechanism and edge information are more effective at constraining the generation of content features. By contrast, extracting style features has relatively high requirements for network color structure feature extraction, as it requires the network to identify the corresponding style and form category information. Therefore, we focus on enhancing content information to distinguish it from style information for network training. The Squeeze-and-Excitation Network (SENet) [14] is a sophisticated channel attention mechanism which establishes the attention map by modeling the relationships between channels, thereby facilitating the extraction of content features. The SENet aims to model the interdependence between feature channels. Specifically, the network automatically learns the importance of each channel, promoting useful features and suppressing less important ones based on their significance. However, the SENet lacks external auxiliary information to guide the identification of key features, which can lead to slow convergence and reduced convergence accuracy.
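For reference, the basic squeeze-and-excitation computation that this work builds on can be sketched as follows. This is a minimal illustration, not the exact configuration used later in the paper; the Keras layer choices and the reduction ratio are assumptions of the sketch.

import tensorflow as tf

def se_block(feature_map, reduction=16):
    # Minimal squeeze-and-excitation block (sketch).
    # feature_map: tensor of shape [batch, H, W, C]; reduction: bottleneck ratio.
    channels = feature_map.shape[-1]
    # Squeeze: global average pooling collapses H and W into one value per channel.
    squeezed = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: two fully connected layers produce per-channel weights in (0, 1).
    excited = tf.keras.layers.Dense(channels // reduction, activation="relu")(squeezed)
    excited = tf.keras.layers.Dense(channels, activation="sigmoid")(excited)
    # Scale: reshape to [batch, 1, 1, C] and reweight the original channels.
    weights = tf.keras.layers.Reshape((1, 1, channels))(excited)
    return feature_map * weights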
Deep learning algorithms also have drawbacks. They use convolution, pooling, and feature fusion to extract key features from images, but their feature-extraction efficiency is lower than that of traditional feature algorithms. This leads to two problems: (1) the network eventually learns the features that traditional algorithms provide directly, but at a considerable cost in time and computing resources; (2) the network fails to learn those features at all, reducing the precision with which that aspect of the image is reconstructed. Considering these problems, combining deep learning with traditional methods is necessary work.
Singular value decomposition (SVD) [15] is a crucial method for extracting the primary features of images, as singular value features contain substantial image information. Therefore, when the SENet searches for important channel features, incorporating singular value features can enhance the attention mechanism’s ability to recognize the global features of the image. This improvement boosts both the convergence speed and the accuracy of the attention mechanism. Consequently, combining SVD with the SENet is an effective approach to addressing the network’s key feature extraction challenges.
The SENet extracts attention features across channels, represented by a vector that assigns a weight to each channel. These weights are then multiplied by the corresponding feature-map channels to amplify or attenuate them. However, the degree of amplification or attenuation depends entirely on the optimization of the loss function, which may cause the network to oscillate and fail to converge. Most pixel regions in an image are common, meaning many pixels have nearly the same value; in the image matrix, one pixel can theoretically approximate its neighbours. SVD can be viewed as obtaining a weighted solution and averaging the sum after compression. Therefore, using SVD to moderate the amplification and attenuation applied by the SENet is a feasible approach. Content features are sensitive to differences in pixel values and can only be determined when there is a large difference between the pixel values at an object boundary and those of the surrounding pixels, whereas style features are not highly sensitive to pixel values. Increasing the scaling of the SENet is therefore an effective way to separate MUNIT content and style features.
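As a concrete illustration of the singular value features referred to above, the following sketch extracts the singular values of each channel of an image with NumPy; the function name is ours and only illustrative.

import numpy as np

def channel_singular_values(image):
    # Return the singular values of each channel of an H x W x C image (sketch).
    # np.linalg.svd returns them sorted in descending order, so the leading values
    # carry most of the image's energy and the trailing ones behave like noise.
    return np.stack([
        np.linalg.svd(image[:, :, c], compute_uv=False)
        for c in range(image.shape[-1])
    ])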
In summary, to enable the network to learn more comprehensive content and style features, we combined the SENet with channel singular values to achieve a more complete extraction of content features. This approach integrates traditional features with convolutional features. The main contributions of our work are summarized as follows:
  • A novel multimodal image translation algorithm: We propose MASSE, a multimodal image translation algorithm based on the Singular Squeeze-and-Excitation Network. MASSE converts between images by utilizing the content and style of the generated feature blocks.
  • An enhanced attention mechanism: We develop an attention mechanism combined with singular value channels, named the SSEnet. This mechanism integrates channel weights with singular value features to enhance image content features.
  • A new feature cascading method: We introduce feature layer insertion (FLI), which efficiently combines traditional features with convolutional features.
  • Demonstrated empirical effectiveness: We empirically demonstrate the effectiveness of our method in image translation and image illustration translation. Qualitative and quantitative comparisons with state-of-the-art models validate the superior performance of MASSE.

2. Related Work

2.1. Generative Adversarial Networks

The generative adversarial network (GAN) is a model within the realm of deep learning, recognized as one of the most promising methods for handling complex distributions in unsupervised learning in recent years. The model achieves excellent outputs through the adversarial learning process of its two components: the generator network (G) and the discriminator network (D). During training, the objective of G is to produce images that are as realistic as possible in order to deceive the D. Conversely, the goal of D is to distinguish the images generated by G from real images. In summary, G and D engage in a dynamic and adversarial interplay, driving the learning process.
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
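The objective above is the standard minimax loss. In practice it is usually split into a discriminator term and a non-saturating generator term, roughly as in the following sketch; the loss helper and tensor names are illustrative, not a specific implementation.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def gan_losses(d_real, d_fake):
    # d_real, d_fake: discriminator outputs D(x) and D(G(z)) in (0, 1).
    # Discriminator: push D(x) toward 1 and D(G(z)) toward 0.
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    # Generator: push D(G(z)) toward 1 (non-saturating form of the minimax objective).
    g_loss = bce(tf.ones_like(d_fake), d_fake)
    return d_loss, g_loss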

2.2. Image-to-Image Translation

To the best of our knowledge, traditional approaches to image translation were first introduced by Rosales et al. [16], who employed Bayes’ theorem to translate output images. With the development of deep learning, Gatys et al. [17] pioneered the use of convolutional neural networks for image translation. Taigman et al. [18] further advanced the field with the Domain Transfer Network (DTN), enabling the conversion between different domains. However, these methods have not achieved high-quality image translation.
Using GANs for image translation tasks has emerged as a mainstream approach in recent years. Due to differences in data type, GAN-based methods are categorized into supervised and unsupervised translations. The supervised approach, exemplified by the well-known pix2pix GAN [19], relies on conditional GANs (CGANs) to map input images to corresponding output images in paired datasets. However, this type of algorithm often results in limited diversity in generated outputs.
To address this limitation, Zhu et al. introduced BicycleGAN [20], which combines CVAE-GAN [21] and CLR-GAN [22] to enforce bijective consistency between the output and feature layer codes. This enhancement enables the network to achieve diverse output modalities.
However, the aforementioned algorithms rely on paired datasets for completing translation tasks. In practical scenarios, acquiring large quantities of paired datasets can be challenging, necessitating the use of unsupervised algorithms for translation tasks. Zhu et al. proposed CycleGAN [23] to address image translation tasks without paired datasets. CycleGAN leverages cycle consistency losses to achieve translation between unpaired data from domain A to domain B (and vice versa).
Similarly, DualGAN [24] and DiscoGAN [25] have enhanced loss functions to facilitate image translation with unpaired datasets. Currently, achieving unpaired image translation using CycleGAN has become mainstream [26,27,28].
Recently, FUNIT [29] identified a significant limitation in current algorithms, noting that the need for extensive feature data from multi-class target domains greatly restricts their practical application. FUNIT is therefore designed as a few-shot, unsupervised image-to-image translation algorithm suitable for translating images of unseen object classes. COCO-FUNIT [30] introduced a conditional style encoder to mitigate irrelevant appearance information in translated images. This encoder adjusts style-code changes based on the input content images, effectively minimizing the impact of irrelevant data during network translation with limited sample data. Considering the scarcity of face sketches and the abundance of face photos, Li [31] proposed a few-shot face sketch-to-photo synthesis model based on asymmetric image translation, in which the sketch-to-photo direction uses a feature-embedded generating network while the photo-to-sketch direction uses a style transfer network.
More recently, the development of diffusion models has provided a variety of approaches for image translation. The Brownian Bridge Diffusion Model (BBDM) [32] formulates image-to-image translation as a stochastic Brownian bridge process and directly learns a bidirectional diffusion process between the two domains instead of a conditional generation process.
Among the mentioned methods, few image translation algorithms integrate traditional features. To our knowledge, only UNTF [33,34,35] has incorporated features extracted via SVD in unpaired image translation. However, this method solely connects these features in a cascading manner, lacking flexibility in feature fusion.

3. Multimodal Image Translation Algorithm Based on Singular Squeeze-and-Excitation Network (MASSE)

3.1. Channel Attention Mechanism

For convolutional neural networks (CNNs), the primary method of feature extraction involves updating the old feature maps to new convolutional maps using convolution kernels, a process known as local-region feature fusion. This fusion integrates the spatial dimensions (H for height and W for width) and the channel dimension (C) of the image. However, indiscriminately combining H, W, and C does not necessarily enhance training accuracy, which is why spatial attention mechanisms and channel attention mechanisms have been developed. In this context, singular value decomposition (SVD) can be applied to images channel by channel, providing a useful way to combine the channel attention mechanism with singular value features.

3.2. The Introduction of MASSE Model

To address the MUNIT challenge, we incorporate singular value features (SVD features) of images into the squeeze and excitation operations. This approach, termed the Singular Squeeze-and-Excitation Network (SSEnet), aims to extract attention maps that assist the network in generating content features with clearly defined boundaries.
Based on the MUNIT framework enhanced by the SENet, we propose the multimodal image translation algorithm based on the Singular Squeeze-and-Excitation Network (MASSE). This model leverages the SSEnet to achieve image conversion across diverse datasets. The model structure is illustrated in Figure 1. We begin by outlining the network structure, followed by an explanation of the algorithm's implementation, and conclude with detailed training procedures.

3.2.1. The Structure of MASSE

In discussing the multimodal image translation task, Figure 1 illustrates the overall structure of MASSE, highlighting the decomposition of domain A and domain B images into content features (A and B) and style features (A and B). Content features depict the image contours and poses, while style features capture the distinctive individual characteristics that differentiate similar images. Initially, each image passes through three networks: (1) the style network generates style features, (2) the content network generates content features, and (3) the Singular Squeeze-and-Excitation (SSE) network produces channel feature weights.
To enhance the clarity of content features, the features from the N-th layer of the SSEnet are concatenated with the original image, where N is a parameter discussed in Section 4.3. Content B is combined with Style A to create the new image BA, while Content A is combined with Style B to produce the new image AB. The objective of the network is to ensure that the newly generated images BA and AB correspond as closely as possible to domain A and domain B, respectively. Discriminator 1 and Discriminator 2 determine whether the images belong to the generated set, thereby constraining images BA and AB to resemble their corresponding domain images. The original image receives the attention map through the SSEnet. Since the SENet only attends to feature-level data and lacks pixel-level constraints, the addition of SVD features helps the network obtain more refined feature data and approximate pixel-level constraints. Additionally, by increasing the number of decomposed style networks, the variety of styles can be expanded, enabling multimodal image translation: the original image is fed into multiple style networks, and different style and content features are fused to generate images with diverse stylistic characteristics.

3.2.2. The Singular Squeeze-and-Excitation Network (SSEnet) Structure

The SSEnet is divided into two modules: the SE module and the SVD module. The structure is shown in Figure 2. The SE module consists of Ftr(.), Fsq(.), and Ffc(.), while the SVD module consists of Fc(.). The feature map of the final SE module and the feature map output by the SVD module are fused through Fgap(.) to form new weight information. Ftr(.) in the SE module is a transformation structure that uses convolution operations; we use two convolution layers to expand the channels to N layers, and the sensitivity to N is examined in the ablation experiments. Fsq(.) is the compression operation, which, as in the SE model, encodes spatial features into a global feature. Since average pooling retains as much spatial information as possible, it is a suitable choice for obtaining this global descriptor. Ffc(.) needs to capture the weights between channels. If the gating mechanism of the SE model were used unchanged, the weights obtained by the SENet would once again overlap with, or separate from, the SVD feature-selection weights, reducing model stability. Therefore, to improve stability, the Ffc(.) operation is implemented with a fully connected layer. The formula for Ffc(.) is as follows:
S = F_{fc}(z, W) = \sigma(W \cdot z)
where σ is the sigmoid activation function.
The SVD module contains Fc(.). The function of Fc(.) is to transform the SVD features into weight information that matches the size of S. The SVD features are sorted by singular value, so the more important components appear first and the noise-like components appear last. How to extract only the required components is therefore the key to the Fc(.) operation. The Fc(.) formula is as follows:
feat = F_c(SVD, n) = large(ReLU(W_1 \cdot SVD), n)
Here, SVD refers to the extracted SVD feature vector, large(x, n) extracts the first n values of the array x, n is a hyperparameter, and ReLU is the activation function. Fgap(.) fuses the S and feat vectors to form the new weight information. The formula of Fgap(.) is as follows:
e = F_{gap}(S, feat, \alpha, \beta) = \sigma(\alpha \, (S \odot feat) \, \beta)
where α and β are the desensitization parameters, σ is the sigmoid function, and ⊙ denotes the element-wise multiplication of S and feat. Finally, the obtained weights are multiplied by the original image to obtain the attention map.
In summary, Fc(.) transfers the averaged SVD information into the main attention network, helping the channel attention adjust its weights more effectively, and Fgap(.) combines the information from Fc(.) and Ffc(.) through an effective non-linear combination, helping Fscale(.) extract the attention feature map.
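Putting the pieces together, the SSE weighting can be read schematically as in the sketch below. The layer sizes, the learned projection w1, and the way α and β enter the fusion are placeholders for illustration only; this is not the authors' exact implementation.

import tensorflow as tf

def ssenet_weights(feature_map, svd_values, w1, alpha, beta):
    # Schematic forward pass of the SSE weighting (sketch; shapes illustrative).
    # feature_map: [batch, H, W, N] output of the Ftr convolutions (N channels).
    # svd_values:  [batch, K] singular values of the input image, sorted descending.
    # w1:          learned projection applied to the SVD features inside Fc(.).
    n = feature_map.shape[-1]
    # Fsq: average pooling compresses each channel to a single global descriptor z.
    z = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
    # Ffc: a fully connected layer with a sigmoid produces the channel weights S.
    s = tf.keras.layers.Dense(n, activation="sigmoid")(z)
    # Fc: project the SVD features, gate them with ReLU, keep the leading n entries.
    feat = tf.nn.relu(tf.matmul(svd_values, w1))[:, :n]
    # Fgap: non-linear fusion of S and feat into the final weights e.
    e = tf.sigmoid(alpha * (s * feat) * beta)
    # Fscale: reweight the channels of the feature map with e.
    return feature_map * tf.reshape(e, (-1, 1, 1, n))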

3.2.3. The Feature Layer Insertion (FLI) Generator and Discriminator of MASSE

Regarding the generator, we incorporated the FLI method to enhance its capability to connect corresponding network features, thereby increasing sensitivity during network computations.
In traditional methods, features in the SSEnet are typically multiplied with convolutional features or cascaded into the network for operation. However, we believe such operations may reduce the network’s sensitivity to different channels. Therefore, we propose the FLI method to optimize network performance by aligning feature layers within the same channel. FLI arranges two feature maps sampled to the same depth in sequence along the channel direction and concatenates the features of the same channel to form feature fusion. This approach aims to mitigate low-quality results caused by discrepancies in channel fusion. Specifically, the convolution block and the SSEnet block embed their respective feature layers into each other, establishing cohesive connections between different features.
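One way to read this channel-wise interleaving is sketched below, assuming both feature maps have the same [batch, H, W, C] shape; the function name is ours and the code is illustrative rather than the exact implementation.

import tensorflow as tf

def feature_layer_insertion(conv_feat, sse_feat):
    # Interleave two feature maps channel by channel (sketch of the FLI idea).
    # Both inputs are [batch, H, W, C]; the output is [batch, H, W, 2C], with
    # channel i of conv_feat immediately followed by channel i of sse_feat,
    # instead of a plain concatenation that keeps the two blocks apart.
    h, w, c = conv_feat.shape[1], conv_feat.shape[2], conv_feat.shape[3]
    stacked = tf.stack([conv_feat, sse_feat], axis=-1)   # [batch, H, W, C, 2]
    return tf.reshape(stacked, (-1, h, w, 2 * c))        # [batch, H, W, 2C]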
Subsequently, style features and content features are extracted from the image by the style encoder and content encoder, respectively. These features are concatenated within the network to generate fused features, which pass through six residual and deconvolution layers to return to the original image size. The output merges style features from one image with content features from another, completing the image translation task as depicted in Figure 3.
In the discriminator, PatchGAN [10] is employed so that, rather than judging the entire image at once, the discriminator evaluates each patch of the generated image individually. The discriminator is a five-layer network in which each layer applies Convolution, Batch Normalization, and Leaky ReLU with a leak rate of 0.2. Finally, the features pass through a sigmoid activation function to determine the authenticity of each patch.
Table 1 and Table 2 describe the detailed parameters of the network.
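As one concrete reading of Table 2, the discriminator could be assembled roughly as follows; the padding choice and the Keras API are assumptions of this sketch rather than the authors' exact code.

import tensorflow as tf
from tensorflow.keras import layers

def build_patch_discriminator(input_shape=(256, 256, 3)):
    # PatchGAN-style discriminator following the layer sizes in Table 2 (sketch).
    # Each block is Conv -> BatchNorm -> LeakyReLU(0.2); the final 4x4 convolution
    # maps to a one-channel patch map squashed by a sigmoid.
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters, stride in [(64, 2), (128, 2), (256, 2), (512, 1)]:
        x = layers.Conv2D(filters, kernel_size=4, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    # Final patch prediction: one realness score per receptive-field patch.
    outputs = layers.Conv2D(1, kernel_size=4, strides=1, padding="same",
                            activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)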
The network pseudo code is as follows (see Algorithm 1):
Algorithm 1: MASSE
Step 1: Initialization
    Load the training images of domain A and domain B
    Svd_featureA = svd(imagesA)
    Svd_featureB = svd(imagesB)
Step 2: Training loop
    for i = 1 to iterations:
        SSEnet_featureA = SSEnet(imageA, Svd_featureA)
        ContentA = FLI(encode_content(imageA), SSEnet_featureA)
        StyleA = encode_style(imageA)
        SSEnet_featureB = SSEnet(imageB, Svd_featureB)
        ContentB = FLI(encode_content(imageB), SSEnet_featureB)
        StyleB = encode_style(imageB)
        fakeA = generator(ContentB, StyleA)
        fakeB = generator(ContentA, StyleB)
        Discriminator1(realA, fakeA)
        Discriminator2(realB, fakeB)

3.3. Global Loss Function

The key to solving translation problems lies in selecting an appropriate loss function. This study builds the loss function of the proposed network by integrating adversarial losses with reconstruction losses:
Adversarial loss. The adversarial loss is a key loss for generative adversarial networks. The adversarial training of the generator and discriminator enables the update of network parameters.
L_{GAN}^{A} = \mathbb{E}_{c_b \sim p_1(c_b),\, s_a \sim p_2(s_a)}\left[\log\left(1 - D_1\left(G_1(c_b, s_a)\right)\right)\right] + \mathbb{E}_{a \sim p_1(a)}\left[\log D_1(a)\right]
L_{GAN}^{B} = \mathbb{E}_{c_a \sim p_1(c_a),\, s_b \sim p_2(s_b)}\left[\log\left(1 - D_2\left(G_2(c_a, s_b)\right)\right)\right] + \mathbb{E}_{b \sim p_1(b)}\left[\log D_2(b)\right]
where A and B are datasets, a and b are data in dataset A and dataset B, D1 is a discriminator for distinguishing real and fake images in dataset A, and D2 is a discriminator for distinguishing real and fake images in dataset B.
Image reconstruction loss. An image sampled from the data distribution should be recoverable after being encoded and then decoded.
L_{recon}^{A} = \mathbb{E}_{a \sim p_1(a)}\left[\left\lVert G_1\left(E_c(a), E_s(a)\right) - a \right\rVert_1\right]
L_{recon}^{B} = \mathbb{E}_{b \sim p_1(b)}\left[\left\lVert G_2\left(E_c(b), E_s(b)\right) - b \right\rVert_1\right]
SVD reconstruction loss. The SVD features of a translated image should match the SVD features of the corresponding target image, so the SVD reconstruction loss is computed between the SVD features of the generated image and those of the original image.
L_{recon}^{svd_b} = \mathbb{E}_{c_a \sim p_1(c_a),\, s_b \sim p_2(s_b)}\left[\left\lVert \mathrm{SVD}\left(G_2\left(E_c(a), E_s(b)\right)\right) - \mathrm{SVD}(b) \right\rVert_1\right]
L_{recon}^{svd_a} = \mathbb{E}_{c_b \sim p_1(c_b),\, s_a \sim p_2(s_a)}\left[\left\lVert \mathrm{SVD}\left(G_1\left(E_c(b), E_s(a)\right)\right) - \mathrm{SVD}(a) \right\rVert_1\right]
Total loss. The generator and discriminator are trained to optimize the final objective, and the final total loss is shown as follows:
\min_{E_c, E_s, G_1, G_2}\, \max_{D_1, D_2}\, L(E_c, E_s, G_1, G_2, D_1, D_2) = L_{GAN}^{A} + L_{GAN}^{B} + \lambda_1\left(L_{recon}^{A} + L_{recon}^{B}\right) + \lambda_2\left(L_{recon}^{svd_a} + L_{recon}^{svd_b}\right)
where λ1 and λ2 are the parameters of different reconstruction losses.
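For illustration, the objective can be assembled from its parts roughly as in the following sketch. Tensor names are placeholders, and in practice the discriminator and generator terms are optimized in alternating updates rather than as a single scalar.

import tensorflow as tf

def total_loss(d1_real, d1_fake, d2_real, d2_fake,
               recon_a, real_a, recon_b, real_b,
               svd_fake_a, svd_real_a, svd_fake_b, svd_real_b,
               lambda_1=10.0, lambda_2=1.0):
    # Assemble the MASSE objective from its parts (sketch; inputs are batch tensors).
    eps = 1e-8
    # Adversarial terms for the two domains.
    gan_a = tf.reduce_mean(tf.math.log(d1_real + eps) + tf.math.log(1.0 - d1_fake + eps))
    gan_b = tf.reduce_mean(tf.math.log(d2_real + eps) + tf.math.log(1.0 - d2_fake + eps))
    # L1 image reconstruction terms, weighted by lambda_1.
    recon = tf.reduce_mean(tf.abs(recon_a - real_a)) + tf.reduce_mean(tf.abs(recon_b - real_b))
    # L1 reconstruction of the SVD features of translated images, weighted by lambda_2.
    recon_svd = (tf.reduce_mean(tf.abs(svd_fake_a - svd_real_a))
                 + tf.reduce_mean(tf.abs(svd_fake_b - svd_real_b)))
    return gan_a + gan_b + lambda_1 * recon + lambda_2 * recon_svd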

4. Training and Experimental Results

4.1. Parameter Details

For network training, we used the following setup: a Tiankuo server with an E5-2600 CPU and an RTX 2080 Ti GPU, running the TensorFlow framework. The learning rate is 1 × 10−4, λ1 is set to 10, λ2 is set to 1, and N is set to 128, where λ1 is the weight of the image reconstruction loss and λ2 is the weight of the SVD-feature reconstruction loss. The SVD feature length equals the rank of the original image matrix, and N selects the first N entries of the SVD feature vector.
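Collected as a configuration sketch below; the Adam optimizer is our assumption, while the numeric values follow the settings above.

import tensorflow as tf

# Training configuration (sketch). The optimizer choice is an assumption;
# the remaining values follow the settings reported in this section.
config = {
    "learning_rate": 1e-4,        # learning rate for generator and discriminator updates
    "lambda_1": 10.0,             # weight of the image reconstruction loss
    "lambda_2": 1.0,              # weight of the SVD-feature reconstruction loss
    "N": 128,                     # number of leading singular values kept by the SSEnet
    "image_size": (256, 256, 3),  # input resolution used for all datasets
}

gen_optimizer = tf.keras.optimizers.Adam(config["learning_rate"])
disc_optimizer = tf.keras.optimizers.Adam(config["learning_rate"])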

4.2. Dataset

We cropped all images to a size of 256 × 256 × 3. Four datasets, Summer2Winter, Apple2Orange, Cezanne2Photo, and Vangogh2Photo, were used for the comparative experiments.
Summer2Winter: the training set contains 1231 summer images and 962 winter images, and the test set contains 309 summer images and 238 winter images. It is a commonly used unpaired image translation dataset.
Apple2Orange: the training set contains 995 apple images and 1019 orange images, and the test set contains 266 apple images and 248 orange images. It is a dataset with clearly distinguishable contour and color features.
Cezanne2Photo: the training set contains 6287 photo images and 525 Cezanne images, and the test set contains 751 photo images and 58 Cezanne images.
Image evaluation index: FID. Compared with the Inception Score (IS), the Fréchet inception distance (FID) uses the feature-extraction layers of the Inception network as a feature extractor. These features are used to estimate the mean and covariance of the real distribution Pdata and the generated distribution Pz, and the distance between the two distributions is then computed from these statistics.
FID(x, z) = \left\lVert \mu_x - \mu_z \right\rVert_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_z - 2\left(\Sigma_x \Sigma_z\right)^{1/2}\right)
where μx and μz are the means of the real and generated distributions, Σx and Σz are their covariance matrices, Tr denotes the trace of a matrix (the sum of its diagonal elements), and x and z represent the real and generated images.
To sum up, the smaller the FID value, the closer the generated distribution is to the real distribution, and the closer the generated image is to the real image.
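Given feature means and covariances estimated from Inception features, the FID value can be computed as in this sketch (scipy.linalg.sqrtm supplies the matrix square root):

import numpy as np
from scipy import linalg

def fid(mu_x, sigma_x, mu_z, sigma_z):
    # Fréchet inception distance between two Gaussians fitted to Inception features.
    # mu_*: mean feature vectors; sigma_*: feature covariance matrices.
    diff = mu_x - mu_z
    # Matrix square root of the product of the covariances.
    covmean = linalg.sqrtm(sigma_x @ sigma_z)
    # Discard the tiny imaginary component introduced by numerical error.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma_x + sigma_z - 2.0 * covmean)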
Image evaluation index: EGF. The energy of an image can be characterized through its pixel intensity values: larger intensity variations between neighbouring pixels indicate more detailed or textured regions. EGF can therefore be understood as a way to evaluate image clarity or quality by examining the intensity gradients across the image. It is defined as follows:
D(f) = \frac{\sum_y \sum_x \left( \left| f(x+1, y) - f(x, y) \right|^2 + \left| f(x, y+1) - f(x, y) \right|^2 \right)}{256 \times 256}
where f(x, y) is a pixel in the image, f(x, y + 1) is the pixel below f(x, y), and f(x + 1, y) is the pixel to its right.
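A direct NumPy reading of this definition, assuming a 256 × 256 grayscale input as used in the experiments:

import numpy as np

def egf(image):
    # Energy-gradient sharpness measure from the formula above (sketch).
    # image: 2-D grayscale array; larger values indicate stronger local
    # intensity variation, i.e. a sharper, more detailed image.
    image = image.astype(np.float64)
    dx = image[1:, :-1] - image[:-1, :-1]   # f(x+1, y) - f(x, y)
    dy = image[:-1, 1:] - image[:-1, :-1]   # f(x, y+1) - f(x, y)
    return np.sum(dx ** 2 + dy ** 2) / (256 * 256)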

4.3. Ablation Experiment

In Section 3.2, we discussed the presence of both image-information features and noise features within the SVD feature. The singular value feature is a diagonal matrix sorted by the magnitude of the singular values. We therefore default to retaining the information in the higher-weighted positions and discarding the less critical information, which minimally alters the distribution pattern relative to the original matrix. The parameter N denotes the length of the selected features, so choosing an appropriate N ensures that the crucial high-weight information remains intact, a critical consideration for model performance. To explore this, we evaluated the network with N values of 16, 32, 64, 128, and 256. The comparative results are illustrated in Figure 4.
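As a quick worked illustration of what keeping the first N singular values means, the following sketch reconstructs one 256 × 256 channel from its leading N components; the function name is ours.

import numpy as np

def truncate_svd(channel, n):
    # Reconstruct one image channel from its leading n singular values (sketch).
    # Keeping only the top-n components preserves the dominant structure while
    # dropping the small trailing singular values that behave like noise.
    u, s, vt = np.linalg.svd(channel, full_matrices=False)
    return u[:, :n] @ np.diag(s[:n]) @ vt[:n, :]

With N = 128, half of the spectrum of a 256 × 256 channel is retained, which is the setting found to be optimal below.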
The experimental task involves image conversion across different seasons. We observed that, at N = 256, the algorithm translated images effectively, yet encountered instances of feature imbalance, where the feature span within the same scene varied significantly.
As depicted in Figure 4, the second image at N = 256 accurately portrays a winter lake scene, whereas images 3 to 6 primarily enhance background colors, falling short of the winter scene’s requirements. This phenomenon may stem from the inclusion of noise features in SVD, which enhance the translation network’s diversity. Consequently, N = 256 exhibits a broad array of stylistic variations. By contrast, results at N = 64, N = 32, and N = 16 highlight structural image characteristics with distinct boundaries and complete structures. The SSEnet effectively provides robust content features aiding the network in achieving accurate translations. However, these three groups display varied types of translations, introducing colors and styles not originally present, as shown sequentially in Figure 4. This discrepancy arises when the network lacks essential SVD image features, resulting in feature disorder where various color features are added to different content.
In addition, we found that when N is either too large or too small, the network generates varied style images at random. When N is too large, the translation results remain orderly, producing organized and regular style images; when N is too small, the results become disordered and messy in style. We therefore posit that the network is especially sensitive to the SVD information when translating style features. The value N = 128 proves to be the most effective setting in this study: it keeps the image features orderly while the style translation still determines the domain of the translated image. Table 3 reports the FID and EGF indices for the different N values; the FID values indicate that the distribution of images generated by MASSE at N = 128 surpasses the other parameter settings. Hence, N = 128 is taken as the optimal parameter for the network.

4.4. Analysis of Results

Style transfer
The validity of the algorithm was verified on the Summer2winter and Apple2orange datasets. As shown in Figure 5, on Summer2winter the UNIT [33] algorithm only translated structural features and failed to transfer winter information to the generated images. Although CycleGAN learned some of the characteristics of winter, it failed to fully express them on the required objects. GEN [34] translated semantic features well but was less effective at translating style features. While the UNTF algorithm [35] could translate winter images well, its winter features are not as rich as those of MASSE. On Apple2orange, since the features of apples and oranges are easier to obtain, the translation effect of UNIT is better than that of UNTF and CycleGAN, but slightly inferior to MASSE. The reason is that CycleGAN has no separate networks for extracting content and style features, while MASSE, UNTF, and UNIT all have dedicated networks for extracting content or style features. However, the features extracted by UNIT are relatively unstable and prone to imbalance, which creates differences in translation across datasets. UNTF extracts global SVD features, whereas MASSE combines SVD with the attention mechanism so that the network learns SVD and convolutional features in a regular way; UNTF is therefore slightly inferior to MASSE in feature recognition. In addition, the FID and EGF indicators are shown in Table 4, where MASSE outperforms the other comparison algorithms.
Image illustration translation
The validity of the algorithm was verified on the Cezanne2Photo dataset. As shown in Figure 6, some images produced by UNTF show good painterly translation, such as the stones in the second row and the lighthouse in the fourth row, vividly conveying the hazy artistic mood of the painting, but there are also imbalances. For example, the railroad tracks in the first picture are not clear and, in the fifth and sixth pictures, strange image blocks appear, which may be because the noise features in SVD slightly outweigh the detail features. GANILLA [36] produces good boundaries for the translated image content, but the illustration quality is not well expressed; it mainly deepens the characteristics of the existing images. The results obtained by UNITG [37] have certain advantages: its coloring reflects an artist's consideration of color and has practical significance. MASSE improves on the problems of both. It renders the rails more clearly and completely, with no strange feature blocks, thanks to the SSEnet module, which strengthens the separation of content and style so that the final generated image has good content and semantic features. Similarly, in Table 5, MASSE outperforms the other algorithms, while UNITG does not. GANILLA's diversity index is better than UNTF's, but its clarity index is slightly lower.
Multimodal Image Translation
In multimodal image translation there is no fixed target domain for an image, so the network is required to separate multiple modality images, as in virtual dressing: a user only needs a photo of themselves, and the network transfers the selected clothes onto that photo. Since the network input does not define the specific characteristics of the different styles, this places high demands on the network. This section compares MASSE with MUNIT to verify the effectiveness of the network. The results are shown in Figure 7. From the figure, both algorithms complete multimodal image translation reasonably well, but in the images translated by MUNIT the boundaries between different modalities are less obvious. For example, in the fourth image of the swan column, the swan's background changes only in intensity, as if a filter of varying strength had been applied, and is not converted into genuinely different modes; in MASSE, the swan backgrounds differ essentially, not merely as filter changes. Furthermore, in the seventh and eighth columns, the three modal images produced by MUNIT are almost identical, with no obvious modal boundaries, whereas MASSE translates the image into three clearly distinct modalities. Therefore, MASSE offers a clear improvement in multimodal translation. The FID and EGF indicators for the three modality categories are shown in Table 6.

5. Conclusions

In this study, MASSE is proposed by combining deep learning methods with traditional methods. Specifically, the SENet uses SVD-extracted features to improve the excitation operation, which helps the network obtain new channel attention weights and form attention feature maps. The attention feature maps and convolutional features are then integrated to complete the image content features. Finally, the content features and style features are combined to obtain a new style image and complete the translation. The ablation experiments show that the network translates images best when the SVD parameter N is 128. Moreover, according to the FID and EGF indicators, MASSE performs better than existing methods in image diversity and clarity. Comparisons with different algorithms verify that MASSE has good translation performance in multimodal image translation.

Author Contributions

Methodology, H.T.; Writing—review & editing, Z.W.; Supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 51875524), the key Research and Development Program of Zhejiang Province (No. 2023C003189), and Research Incubation Foundation of Hangzhou City University (No. J202316).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

  1. Zhang, Y.; Hu, B.; Huang, Y.; Gao, C.; Yin, J.; Wang, Q. HQ-I2IT: Redesign the optimization scheme to improve image quality in CycleGAN-based image translation systems. IET Image Process. 2024, 18, 507–522. [Google Scholar] [CrossRef]
  2. Tu, H.Y.; Wang, Z.; Zhao, Y.W. Unpaired Image-to-Image Translation with Diffusion Adversarial Network. Mathematics 2024, 12, 3178. [Google Scholar] [CrossRef]
  3. Hu, X.; Zhou, X.; Huang, Q.; Shi, Z.; Sun, L.; Li, Q. Qs-attn: Query-selected attention for contrastive learning in i2i translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18291–18300. [Google Scholar]
  4. Wang, C.; Xu, C.; Wang, C.; Tao, D. Perceptual adversarial networks for image-to-image transformation. IEEE Trans. Image Process. 2018, 27, 4066–4079. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, C.; Zheng, H.; Yu, Z.; Zheng, Z.; Gu, Z.; Zheng, B. Discriminative region proposal adversarial networks for high-quality image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 770–785. [Google Scholar]
  6. Dou, H.; Chen, C.; Hu, X.; Peng, S. Asymmetric CycleGan for unpaired NIR-to-RGB face image translation. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 1757–1761. [Google Scholar]
  7. Fu, X. Digital Image Art Style Transfer Algorithm Based on CycleGAN. Comput. Intell. Neurosci. 2022, 2022, 6075398. [Google Scholar] [CrossRef] [PubMed]
  8. Dong, Y.; Tan, W.; Tao, D.; Zheng, L.; Li, X. CartoonLossGAN: Learning surface and coloring of images for cartoonization. IEEE Trans. Image Process. 2021, 31, 485–498. [Google Scholar] [CrossRef] [PubMed]
  9. Zhang, K.; Gool, L.V.; Timofte, R. Deep unfolding network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  10. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  11. Xie, Z.; Huang, Z.; Zhao, F.; Dong, H.; Kampffmeyer, M.; Liang, X. Towards scalable unpaired virtual try-on via patch-routed spatially-adaptive gan. Adv. Neural Inf. Process. Syst. 2021, 34, 2598–2610. [Google Scholar]
  12. Fang, H.; Deng, W.; Zhong, Y.; Hu, J. Triple-GAN: Progressive face aging with triple translation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 804–805. [Google Scholar]
  13. Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189. [Google Scholar]
  14. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  15. Andrews, H.; Patterson, C.L., III. Singular value decomposition (SVD) image coding. IEEE Trans. Commun. 1976, 24, 425–432. [Google Scholar] [CrossRef]
  16. Rosales; Achan; Frey. Unsupervised image translation. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 472–478. [Google Scholar]
  17. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  18. Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised cross-domain image generation. arXiv 2016, arXiv:1611.02200. [Google Scholar]
  19. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  20. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  21. Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2745–2754. [Google Scholar]
  22. Yang, J.; Kannan, A.; Batra, D.; Parikh, D. Lr-gan: Layered recursive generative adversarial networks for image generation. arXiv Preprint 2017, arXiv:1703.01560. [Google Scholar]
  23. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  24. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2849–2857. [Google Scholar]
  25. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1857–1865. [Google Scholar]
  26. Yu, C.; Hu, D.; Zheng, S.; Jiang, W.; Li, M.; Zhao, Z.Q. An improved steganography without embedding based on attention GAN. Peer-to-Peer Netw. Appl. 2021, 14, 1446–1457. [Google Scholar] [CrossRef]
  27. Tang, H.; Liu, H.; Xu, D.; Torr, P.H.; Sebe, N. Attentiongan: Unpaired image-to-image translation using attention-guided generative adversarial networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 1972–1987. [Google Scholar] [CrossRef] [PubMed]
  28. Wu, S.; Dong, C.; Qiao, Y. Blind image restoration based on cycle-consistent network. IEEE Trans. Multimed. 2022, 25, 1111–1124. [Google Scholar] [CrossRef]
  29. Liu, M.Y.; Huang, X.; Mallya, A.; Karras, T.; Aila, T.; Lehtinen, J.; Kautz, J. Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10551–10560. [Google Scholar]
  30. Saito, K.; Saenko, K.; Liu, M.Y. Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 382–398. [Google Scholar]
  31. Li, Y.; Liang, Q.; Han, Z.; Mai, W.; Wang, Z. Few-shot face sketch-to-photo synthesis via global-local asymmetric image-to-image translation. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–24. [Google Scholar] [CrossRef]
  32. Li, B.; Xue, K.; Liu, B.; Lai, Y.K. Bbdm: Image-to-image translation with brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1952–1961. [Google Scholar]
  33. Alami Mejjati, Y.; Richardt, C.; Tompkin, J.; Cosker, D.; Kim, K.I. Unsupervised attention-guided image-to-image translation. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  34. Shamsolmoali, P.; Zareapoor, M.; Das, S.; Garcia, S.; Granger, E.; Yang, J. GEN: Generative equivariant networks for diverse image-to-image translation. IEEE Trans. Cybern. 2022, 53, 874–886. [Google Scholar] [CrossRef] [PubMed]
  35. Tu, H.; Wang, W.; Chen, J.; Wu, F.; Li, G. Unpaired image-to-image translation with improved two-dimensional feature. Multimed. Tools Appl. 2022, 81, 43851–43872. [Google Scholar] [CrossRef]
  36. Hicsonmez, S.; Samet, N.; Akbas, E.; Duygulu, P. GANILLA: Generative adversarial networks for image to illustration translation. Image Vis. Comput. 2020, 95, 103886. [Google Scholar] [CrossRef]
  37. Yang, S.; Jiang, L.; Liu, Z.; Loy, C.C. Unsupervised image-to-image translation with generative prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18332–18341. [Google Scholar]
Figure 1. MASSE model structure. The algorithm starts with the original style images of Domain A and Domain B on the left. Each original style image is passed through the style encoder and content encoder to obtain the style feature map and content feature map. In order to enhance the feature information in the content feature map, the FLI fusion method is used to combine it with the SSEnet attention mechanism. Finally, different style feature maps and content feature maps are fused to obtain the required translated style image. In addition, Discriminator1 and Discriminator2 are used to identify whether the generated style image belongs to the required domain.
Figure 2. Structure of the SSE module. The SSE module consists of two parts, the SE module and the SVD module. The SE module extracts channel information, while the SVD module assists in deepening important information.
Figure 3. The generator structure of MASSE, which comprises two main parts: the encoder and the decoder. The encoder is further divided into a style encoder and a content encoder. The upper segment of the content encoder consists of a convolutional encoder, while the lower segment incorporates the singular value attention block extracted by the SSEnet. These components are cascaded at the same level to form a new content feature map.
Figure 4. Comparison of results with different N values.
Figure 5. Comparison of MASSE and different algorithms in style transfer.
Figure 6. Comparison of MASSE and different algorithms in image illustration translation.
Figure 7. Multimodal translation comparison of different algorithms.
Table 1. Style and Content Generator parameters.

Type | Kernel Size | Stride | Output
Conv Block | 7 | 1 | 64
Conv Block | 3 | 2 | 128
Conv Block | 3 | 2 | 128
Fusion Block | - | - | 256
Conv Block | 3 | 1 | 256
Conv Block | 3 | 1 | 256
Conv Block | 3 | 1 | 256
Conv Block | 3 | 1 | 256
Conv Block | 3 | 1 | 256
Conv Block | 3 | 1 | 256
Deconv Block | 3 | 2 | 128
Deconv Block | 3 | 2 | 64
Deconv | 7 | 1 | 3
Table 2. Discriminator parameters.

Type | Kernel Size | Stride | Output
Conv Block | 4 | 2 | 64
Conv Block | 4 | 2 | 128
Conv Block | 4 | 2 | 256
Conv Block | 4 | 1 | 512
Conv | 4 | 1 | 1
Table 3. FID and EGF values of results with different N values.

Cezanne2photo
N | 256 | 128 | 64 | 32 | 16
FID | 71.59 | 57.36 | 62.19 | 74.64 | 77.15
EGF | 1007.10 | 1197.49 | 1024.77 | 972.50 | 1001.83
Table 4. FID and EGF values of MASSE and different algorithms in style transfer.

Summer2winter
Metric | Ours | UNTF | GEN | CycleGAN | UNIT
FID | 64.36 | 67.44 | 69.43 | 75.71 | 97.96
EGF | 956.24 | 880.21 | 841.26 | 784.82 | 741.33

Apple2orange
Metric | Ours | UNTF | GEN | CycleGAN | UNIT
FID | 72.49 | 85.76 | 86.52 | 107.44 | 107.16
EGF | 891.58 | 826.22 | 779.08 | 727.01 | 784.71
Table 5. FID and EGF values of different algorithms in Cezanne2photo.

Metric | Ours | GANILLA | UNTF | UNITG
FID | 53.99 | 64.27 | 68.48 | 70.43
EGF | 1073.01 | 941.46 | 967.20 | 926.93
Table 6. FID and EGF values for multimodal translation of different algorithms.

Cezanne2photo
Category | FID (Ours) | FID (MUNIT) | EGF (Ours) | EGF (MUNIT)
Category 1 | 71.12 | 94.22 | 974.51 | 810.07
Category 2 | 79.66 | 84.84 | 1069.27 | 904.12
Category 3 | 76.46 | 104.69 | 997.66 | 878.88