1. Introduction
Medical imaging technology provides high-resolution images of internal structures and helps doctors make diagnoses and treatment plans. Medical imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI) have become standard tools for clinical diagnosis [1]. By using contrast agents to enhance the visibility of diseased tissue and comparing pre- and post-contrast medical images, the differences between diseased and healthy tissue can be shown more clearly. Contrast-enhanced CT (CECT) improves the visualization of blood vessels and tissues, and T1-weighted contrast-enhanced magnetic resonance imaging (T1CE) enhances soft tissue contrast and is widely used for tumor detection. However, it has been reported that the adverse reactions and side effects caused by contrast agents can harm the health of patients [2]. Iodine-based contrast agents can cause severe allergic reactions, whereas gadolinium-based contrast agents increase the risk of nephrogenic systemic fibrosis (NSF). The risk of adverse reactions to ionic contrast agents can be as high as 0.12 [3], and the European Medicines Agency recommends restricting some intravenous linear agents to prevent the deposition of contrast agents in human tissues [4], which may cause unknown health problems. Therefore, synthesizing post-contrast medical images without injecting contrast agents into the body is a valuable technique for practical diagnosis and treatment [5].
GAN-based medical image synthesis provides the ability to bypass contrast agent administration and obtain post-contrast medical images. Thus, several generative models have been proposed to synthesize high-quality and perceptually realistic post-contrast images from pre-contrast images. However, existing methods still face problems, including (1) insufficient attention paid to local contrast enhancement regions and (2) missing frequency information. Specifically, contrast-enhanced regions are often concentrated in specific anatomical structures or lesion areas. As shown in Figure 1, the contrast enhancement regions in post-contrast images obtained via previous methods [6,7,8,9,10,11,12] are often incomplete or ignored. These methods also tend to fit easy-to-synthesize frequencies, so important frequency information is easily ignored. Gaps between real and post-contrast medical images in the frequency domain cause important frequency information to be lost and image textures and details in the spatial domain to be blurred or even distorted. In addition, current methods enhance images by reducing pixel-level differences but struggle to capture contrast enhancement regions in biological tissues.
Inspired by the above problems, we propose an Interactive Frequency Generative Adversarial Network (IFGAN) for pre-to-post-contrast medical image synthesis. First, we propose an enhanced interaction module (EIM) to force the model to focus on the contrast enhancement region. Next, we introduce focal frequency loss (FFL) to ensure the consistency of real and post-contrast images in the frequency domain and to prevent the loss of important frequency information. In addition, we design feature interactions to achieve fine-grained control of local lesions and eliminate irrelevant details to promote synthesis. The main contributions of this paper can be summarized as follows:
We propose a novel pre-to-post-contrast medical image synthesis method that preserves frequency information and anatomical structure to avoid the risk of adverse reactions and side effects caused by contrast agents.
We propose an enhanced interaction module that focuses on the contrast enhancement region, in which the features of the target and reconstruction branches interact to control contrast enhancement feature synthesis and maintain the anatomical structure.
We introduce focal frequency loss to narrow the gap between the real and post-contrast images in the frequency domain and to prevent the loss of frequency information, further maintaining clinically relevant features and texture structure.
Experiments show that our method achieves satisfactory post-contrast synthesis and substantial performance improvement compared with recent state-of-the-art (SOTA) methods.
2. Related Work
Existing deep medical image synthesis methods include CNN-, UNet-, GAN-, transformer-, and diffusion-based methods [13]. Considering that GANs have been widely used in image synthesis [14], data augmentation [15], and cross-modal medical image synthesis [16], we focus mainly on GAN-based methods. Essentially, any nonlinear GAN trained on paired source and target images can be used to achieve pre-to-post-contrast medical image synthesis. Existing GAN-based medical image synthesis methods can generally be divided into cross-modal generation, high-quality reconstruction, and contrast enhancement approaches, which are summarized as follows.
The cross-modal generation approach regards the desired image as the target image and builds a GAN framework to output a target image from the source image; a post-contrast medical image can thus be generated with the trained GAN by treating pre- and post-contrast medical images as source and target images, respectively. Along these lines, pGAN [17] employs a cycle-consistency strategy for multi-contrast MRI synthesis. BPGAN [7] proposes an end-to-end bidirectional prediction method, facilitating flexible cross-modal synthesis between CT and MRI images. Similarly, Bi-MGAN [10] integrates deep and handcrafted features to constrain feature generation and achieves multi-modal MRI image generation. DC-cycleGAN [11] regards source samples as negative samples and applies a dual-contrast loss to map learned samples away from source images. The authors of [18] fused radiological features from CT to MRI and identified lesion areas by selecting anchor boxes with the greatest differences in radiomic features across various scales in CT images. FACGAN [19] generates CT images from MRI images by incorporating residual-frequency channel attention with a frequency cycle strategy to extract more comprehensive tissue structure information. MGDGAN [20] employs a mask estimation network to guide the generation of different tissues in CT images, resulting in more accurate brain lesion synthesis. The authors of [21] utilized a cycle-consistent structure with perceptual loss to eliminate the need for paired data and further highlight high-frequency texture details. SC-GAN [22] presents a truncation loss based on a segmentation model to address missing anatomical structures in truncated regions during synthetic CT (sCT) generation. GAN-based cross-modal synthesis methods usually consider a direct mapping from the source domain to the target domain [23], which requires supervised learning of prior knowledge [24] to ensure the consistency of cross-modal translation.
The high-quality reconstruction approach regards the desired image as a high-quality image and builds a GAN framework to output a high-quality image from a low-quality image. In this way, post-contrast medical images can be reconstructed with the trained GAN by treating pre- and post-contrast medical images as low- and high-quality images, respectively. To this end, AR-GAN [25] introduces a two-stage learning model to determine and dynamically adjust correction parameters for each pixel, generating high-quality SPET images from LPET images. Similarly, the authors of [26] applied a two-stage GAN to map low-quality ultrasound images to their high-quality counterparts. Ea-GAN [27] enhances edge information perception using the Sobel detection operator. Vessel-GAN [28] utilizes expert knowledge to design filters based on the structure of blood vessels, allowing the GAN framework to generate more credible coronary CT angiography (CTA) images from myocardial CT perfusion (CTP) data. The authors of [29] introduced a multiscale generator architecture combined with a channel-mask attention module, which significantly improves the quality of synthesized contrast-enhanced CT (CECT) images. RG-GAN [30] designs a specific data augmentation module using low-cost, non-real, labeled data to improve lesion preservation in PET images.
The contrast enhancement approach regards the desired image as a post-contrast image and outputs a post-contrast image from a pre-contrast image. Along these lines, DCE-MRI [12] integrates perceptual and pixel-level features to transform non-contrast breast MRIs into corresponding contrast-enhanced sequences. BICEPS [31] uses feature decoupling to improve the alignment of pre- and post-contrast MRI sequences. Considering image misalignment in contrast enhancement, RegGAN [9] adaptively fits the noise distributions of unpaired images using a registration network. The authors of [32] utilized self-supervised learning and dual-energy CT (DECT) to achieve high-quality contrast enhancement with embedded registered non-contrast CT (NCCT) and contrast-enhanced CT (CECT) image pairs. The authors of [33] integrated a deformation field learning network with a 3D generator to reduce misalignment, realizing the joint synthesis and deformation registration of abdominal CECT images generated from NCCT images. SGCDD-GAN [34] emphasizes key areas by adopting multi-task learning and a dual-decoder generator, which ensures that the generator focuses more on lesion areas during NCCT-to-CECT enhancement.
Although some progress has been made by the above methods, several limitations remain. First, almost all of the mentioned GAN-based synthesis methods lack attention to the features of local contrast enhancement regions, which leads to blurred details and even missing key regions in the generated post-contrast images. Moreover, most current methods do not consider information in the frequency domain. To address these issues, we propose IFGAN for pre-to-post-contrast medical image synthesis. Specifically, we first propose the EIM to focus on the local contrast enhancement region. Then, we introduce focal frequency loss to narrow the gap between the post-contrast and real images in the frequency domain and to prevent the loss of important frequency information so as to maintain texture and edge details. A subsequent analysis demonstrates that the optimization is differentiable and that focal frequency loss is effective in improving synthesis.
3. The Proposed Method
In this section, we first define the notation and the goal of this research. Then, we present the deep architecture of the proposed IFGAN in Figure 2, which involves one generator and one discriminator. The former encodes pre-contrast images into feature representations and fuses different task information to map them to different images, i.e., post-contrast images and reconstructed images. The latter determines whether the post-contrast and real images belong to the corresponding domain. Finally, we provide the loss function and training algorithm.
3.1. Problem Definition and Notations
In this section, we introduce the necessary notation and the problem definition. Given a pre-contrast image, its paired real post-contrast image, and a target label (t), the goal is to train the generator to synthesize a post-contrast image from the pre-contrast image and the target label (t) while keeping the distortion between the pre-contrast image and its reconstruction small.
During training, the encoder in the generator (G) first encodes the spatial information feature (z) of the pre-contrast image and then feeds it into the dual-branch decoder, which decodes the target post-contrast image and the reconstructed image according to the target label (t) and the reconstruction label (s), respectively. The discriminator (D) is then used to determine whether the post-contrast image and the real image belong to the target domain. Finally, the generator (G) and the discriminator (D) are jointly trained in an adversarial min-max optimization over the synthesized post-contrast image; the concrete loss terms are given in Section 3.3. Note that the goal is to obtain a high-quality post-contrast image, so we mainly train the target branch decoder to be optimal. Through the EIM and weight sharing, the target branch decoder interacts with the reconstruction branch decoder to improve attention on the local contrast enhancement region and keep the anatomical structure unchanged.
3.2. Deep Architecture of IFGAN
3.2.1. Generator
The generator, which contains one encoder and a dual-branch decoder, is designed to synthesize post-contrast images from pre-contrast images. The encoder encodes the pre-contrast image into a low-resolution feature map (z) with dimensions of 256 × 64 × 64, and the dual-branch decoder decodes z back to pixel space according to the task labels (t and s) to obtain the post-contrast image and the reconstructed image.
The encoder includes one convolution layer and four residual blocks. The first two residual blocks are embedded with a max pooling layer for downsampling, and the last two residual blocks preserve the shape of the feature tensor. The decoder consists of one deconvolution layer and four residual blocks. The first two residual blocks exchange feature information through the EIM, and the last two residual blocks are upsampled by nearest-neighbor interpolation to avoid the checkerboard effect that can arise from transposed convolution; the result is finally fed into the deconvolution layer to synthesize a post-contrast medical image. Residual blocks promote information flow in the deep network through identity mappings, which have been shown to be effective in alleviating gradient vanishing. Adaptive instance normalization (AdaIN) [35] is also adopted in the decoder.
3.2.2. Discriminator
The discriminator distinguishes whether an input image is a post-contrast image or a real image, and it consists of three convolutional layers and six residual blocks. The residual blocks are downsampled by max pooling layers to obtain an intermediate tensor of 512 × 4 × 4. Then, convolutional layers with kernel sizes of 4 × 4 and 1 × 1 output a vector with dimensions of 2 × 1 × 1, indicating the probability that the input image belongs to the real image class. Spectral normalization is introduced into each residual block to constrain discriminative ability and avoid unstable training and sub-optimal performance.
3.2.3. Enhanced Interaction Module
Considering that label guidance and paired supervised learning cannot ensure detailed local features, i.e., the contrast enhancement region, in post-contrast images, we define the enhanced interaction module (EIM) to focus on local contrast-enhanced features. The interaction between the target and reconstruction branch features controls the local contrast enhancement region, with the label serving only as guidance that tells the model the mapping direction without providing additional information that could affect image synthesis. As shown in Figure 2, the EIM calculates the difference between the post-contrast features and the reconstructed features and maintains the post-contrast features through the target label (t). This difference highlights the regions that require attention, and the target label is fused through AdaIN. A convolution layer with a kernel size of 3 × 3 then maps the result to obtain local features, which guide the focus on the local contrast enhancement region. The resulting features are then concatenated (⊗ denotes the concatenation operation), and the concatenated feature is mapped to the output feature through another convolution layer with a kernel size of 3 × 3. The EIM enhances feature expression through timely information interaction and enables the generator to control local contrast-enhanced feature synthesis, as verified by subsequent experiments.
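To make the described interaction concrete, the following is a minimal PyTorch sketch of an EIM-style block under our reading of the text. All module and argument names (e.g., EnhancedInteractionModule, label_dim) are illustrative placeholders, and the choice of concatenating the local features with the target-branch features is an assumption; the exact layer configuration of the released model may differ.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: a label embedding predicts the affine parameters."""
    def __init__(self, channels, label_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(label_dim, channels * 2)

    def forward(self, x, label):
        gamma, beta = self.affine(label).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

class EnhancedInteractionModule(nn.Module):
    """Sketch of an EIM-style block: the difference between target- and
    reconstruction-branch features highlights the contrast enhancement region,
    the target label is fused via AdaIN, local features are extracted with a
    3x3 convolution, and the concatenated result is mapped back to feature space."""
    def __init__(self, channels, label_dim):
        super().__init__()
        self.adain = AdaIN(channels, label_dim)
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fuse_conv = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, feat_target, feat_recon, target_label):
        diff = feat_target - feat_recon            # differences that need attention
        diff = self.adain(diff, target_label)      # fuse the target label
        local = self.local_conv(diff)              # 3x3 conv extracts local features
        fused = torch.cat([feat_target, local], dim=1)  # concatenation ("⊗")
        return self.fuse_conv(fused)               # 3x3 conv maps to the output feature
```

In the generator, two such blocks would sit in the first two decoder residual blocks, exchanging features between the target and reconstruction branches.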
3.3. Loss Function
To obtain post-contrast medical images with realistic visual perception, we introduce adversarial loss, pixel-wise mean absolute error, focal frequency loss, and reconstruction loss. The first term drives the generated post-contrast medical images toward the manifold of real images. The second and third terms reduce the gap between the post-contrast and real images in the spatial and frequency domains, respectively. Finally, the reconstruction loss guides the generation of reconstructed images to help maintain the anatomical structure of post-contrast images.
3.3.1. Adversarial Loss
The adversarial loss trains the generator (G) and the discriminator (D) in an adversarial fashion and ultimately enables the generator to synthesize realistic post-contrast images from pre-contrast images. During training, the encoder extracts the high-dimensional latent features (z) from the pre-contrast images. Synthesized post-contrast medical images with target features are then obtained through the post-contrast synthesis branch decoder under the guidance of the target label t. The adversarial loss is defined over the real post-contrast image, the post-contrast synthesis branch decoder, and the target label t that guides post-contrast synthesis. During training, the discriminator distinguishes whether post-contrast and real images belong to the same domain, which indirectly enhances the fitting and generation ability of the generator.
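For reference, a standard conditional GAN objective consistent with the description above can be written as follows; the symbols (x for the pre-contrast image, y for the real post-contrast image, E for the encoder, G_t for the post-contrast branch decoder) are placeholders of ours rather than the paper's notation.

```latex
\min_{G}\max_{D}\;\mathcal{L}_{adv}
  = \mathbb{E}_{y}\big[\log D(y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D\big(G_t(E(x),\, t)\big)\big)\big]
```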
3.3.2. Pixel-Wise Mean Absolute Error
The pixel-wise mean absolute error reduces content differences between the post-contrast images and the real images. Calculating and reducing the pixel-level Manhattan distance between them allows IFGAN to learn the corresponding content relationship and to capture and retain clinical information, including the location and shape of lesions as well as normal anatomical structures. The loss averages the absolute differences between the pixel values of the real and synthesized post-contrast images over all coordinates, where W and H denote the width and height of the post-contrast image, respectively.
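A plausible form of this loss, using placeholder symbols y(i, j) and \hat{y}(i, j) for the real and synthesized post-contrast pixel values (not the paper's notation), is:

```latex
\mathcal{L}_{pmae}
  = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H}
    \big|\, y(i,j) - \hat{y}(i,j) \,\big|
```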
3.3.3. Focal Frequency Loss
The focal frequency loss preserves frequency information in the target post-contrast image, which avoids the loss of high-frequency information and frequency-spectrum region shifting. High-frequency components generally correspond to fast-changing regions. By minimizing the gap in the frequency domain, the synthesis of edges and details can be better controlled, which improves visual perception and preserves texture structure. The loss operates on the frequency representations of the real and post-contrast images, which are obtained from the spatial-domain pixel values via the discrete Fourier transform (with the Euler number e and the imaginary unit appearing in the complex exponential of the Euler formula), and each frequency coordinate is assigned a spatial-frequency weight. The detailed formulation and analysis are given in Section 4.
3.3.4. Reconstruction Loss
The reconstruction loss is designed to maintain the anatomical structure of the pre-contrast image. The reconstruction branch decoder provides timely interactive information to the EIM to focus on local contrast-enhanced features and shares anatomical structure feature weights with the post-contrast branch decoder, which embeds the anatomical structure of the pre-contrast image into the post-contrast image. The reconstruction loss penalizes the difference between the pre-contrast image and the reconstructed image decoded under the reconstruction label (s).
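Under the same placeholder notation as above (x for the pre-contrast image, E for the encoder, G_r for the reconstruction branch decoder, s for the reconstruction label), a plausible L1 form of this loss is:

```latex
\mathcal{L}_{rec}
  = \mathbb{E}_{x}\big[\, \lVert\, x - G_r(E(x),\, s) \,\rVert_{1} \,\big]
```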
3.3.5. Objective Function and Algorithm
By merging all of the abovementioned losses, i.e., adversarial loss, pixel-wise mean absolute error, focal frequency loss, and reconstruction loss, the overall adversarial optimization is obtained as a weighted sum of these terms, in which the weights of the non-adversarial terms are weight-balance hyper-parameters. We use alternated learning and a gradient descent strategy to update the parameters of IFGAN until convergence. The training process of IFGAN is summarized in Algorithm 1.
Algorithm 1 The learning algorithm of IFGAN
Require: Input image set X; real image set Y; target label t; reconstruction label s; mini-batch sampler.
Ensure: Generator G and discriminator D with trained parameters.
Initialization: batch size K; iteration count T; learning rate.
1: for iter < T do
2:    Randomly sample a batch of input images and their paired real images;
3:    Generate the target image and the reconstructed image of the input image;
4:    Apply spectral normalization to the discriminator parameters;
5:    Compute the gradients of the discriminator parameters from the discriminator loss;
6:    Update the discriminator parameters;
7:    Compute the gradients of the generator parameters from the generator loss;
8:    Update the generator parameters;
9: end for
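As an illustration of Algorithm 1, the following PyTorch-style sketch shows one possible training iteration. The interfaces, loss weights, and helper functions (generator, discriminator, focal_frequency_loss) are assumptions for demonstration and not the released implementation; a sketch of focal_frequency_loss is given at the end of Section 4.

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d,
               x_pre, y_real, t_label, s_label,
               lambda_pmae=100.0, lambda_ffl=1.0, lambda_rec=1.0):
    """One IFGAN-style iteration: update D, then update G with the weighted sum
    of adversarial, pixel-wise MAE, focal frequency, and reconstruction losses."""
    # Generator forward: target (post-contrast) and reconstruction branches
    y_fake, x_rec = generator(x_pre, t_label, s_label)

    # Discriminator update (spectral normalization assumed to be built into D)
    opt_d.zero_grad()
    d_real = discriminator(y_real)
    d_fake = discriminator(y_fake.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator update
    opt_g.zero_grad()
    d_fake = discriminator(y_fake)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_pmae = F.l1_loss(y_fake, y_real)            # pixel-wise mean absolute error
    loss_ffl = focal_frequency_loss(y_fake, y_real)  # focal frequency loss (see Section 4)
    loss_rec = F.l1_loss(x_rec, x_pre)               # reconstruction loss
    loss_g = (loss_adv + lambda_pmae * loss_pmae
              + lambda_ffl * loss_ffl + lambda_rec * loss_rec)
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```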
4. Frequency-Domain Optimization Analysis
As summarized in Section 2, almost all deep-learning-based medical image synthesis methods mainly optimize in the spatial domain, which can fail to capture differences in the frequency domain and lead to blurred regions and poor structure. As shown in Figure 3a,b, gaps between the real image and the generated image in the frequency domain lead to the loss of important frequency information, and the texture details of the image in the spatial domain become blurred or even distorted. To address this, we define and embed focal frequency loss to improve synthesis quality and optimize the differences in the frequency domain between post-contrast and real images. A detailed analysis of the frequency-domain optimization is provided below.
Due to spectral bias, deep models tend to learn low-frequency components first and neglect harder frequency components, which can make it difficult to synthesize fine frequencies and details [36]. To this end, we designed focal frequency loss to reduce the weight of easy frequencies using a dynamic-spectrum weight matrix, which prompts the model to focus on hard frequencies that are difficult to generate. First, the real image and the post-contrast image are converted to the frequency domain by the discrete Fourier transform (DFT), computed over the image width and height, where each frequency value involves the imaginary unit in a complex exponential. The resulting frequency representations of the real image and the target post-contrast image can therefore be expressed as complex numbers with real and imaginary parts, from which the amplitude and phase of each image at every frequency coordinate are obtained.
FFL treats the frequency value of the real image and that of the target post-contrast image at each coordinate as two vectors in the complex plane. According to the definitions above, the length of each vector corresponds to the amplitude and its angle corresponds to the phase. The frequency distance is therefore the distance between these two vectors, which takes both magnitude and angle into account, and is measured with the squared Euclidean distance; evaluating it at every coordinate yields the frequency distance between the real and target post-contrast images.
A dynamic-spectrum weight matrix (W) is used to focus on frequency components that are difficult to synthesize. Each matrix element is determined by the current frequency distance at that coordinate together with a scale factor that controls the flexibility of the weighting. The frequency distance matrix and the dynamic-spectrum weight matrix are combined by the Hadamard (element-wise) product, and the average value over all frequencies defines the focal frequency loss. The dynamic-spectrum weight matrix is updated according to the non-uniform distribution of the per-frequency losses during the current training process.
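For completeness, a formulation from the focal frequency loss literature that matches this description can be written as follows; the symbols (F_y and F_ŷ for the spectra of the real and synthesized images, α for the scale factor) are placeholders of ours and may differ from the paper's notation.

```latex
F(u,v) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} f(h,w)\,
         e^{-i 2\pi \left(\frac{uh}{H} + \frac{vw}{W}\right)}, \qquad
d(u,v) = \big|\, F_{y}(u,v) - F_{\hat{y}}(u,v) \,\big|^{2},
\\[4pt]
w(u,v) = \big|\, F_{y}(u,v) - F_{\hat{y}}(u,v) \,\big|^{\alpha}, \qquad
\mathcal{L}_{ffl} = \frac{1}{HW} \sum_{u=0}^{H-1}\sum_{v=0}^{W-1} w(u,v)\, d(u,v)
```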
From this formulation, we can conclude that the focal frequency loss is differentiable, since the DFT and the squared Euclidean distance are differentiable. Thus, the derivative of the loss with respect to the generator parameters can be computed, which means that any standard stochastic gradient descent algorithm can be used for model optimization. It has been reported that the loss of frequency information leads to blurred texture and edge details [25,32]; when the weight of each frequency is identical, the model still exhibits its inherent bias and produces blurred texture. In contrast, our method multiplies the dynamic-spectrum weight matrix and the frequency distance element-wise, which prevents IFGAN from favoring easy frequencies by assigning high weights to the hard frequencies. Therefore, it effectively prevents the loss of frequency information and further improves the quality of spatial-domain image synthesis, as verified by subsequent experimental results.
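A minimal PyTorch sketch of such a loss, assuming the power-weighting form given above with scale factor alpha, is shown below; the paper's exact implementation may differ.

```python
import torch

def focal_frequency_loss(pred, target, alpha=1.0):
    """Focal-frequency-style loss: weight each frequency by its own distance
    raised to a power so that hard-to-synthesize frequencies get larger weights.

    pred, target: tensors of shape (B, C, H, W).
    """
    # 2D DFT of both images (orthonormal scaling keeps magnitudes comparable)
    pred_freq = torch.fft.fft2(pred, norm="ortho")
    target_freq = torch.fft.fft2(target, norm="ortho")

    # Squared Euclidean distance between the complex spectra at every frequency
    distance = (pred_freq - target_freq).abs() ** 2

    # Dynamic spectrum weights: larger where the current distance is larger;
    # detached so the weights themselves are not optimized
    weight = distance.detach() ** (alpha / 2.0)
    weight = weight / (weight.max() + 1e-8)  # normalize weights to [0, 1]

    return (weight * distance).mean()
```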
5. Experiment
To verify the effectiveness of IFGAN, we conducted extensive experiments on two public datasets. First, we compare the proposed IFGAN with recent state-of-the-art (SOTA) synthesis methods and analyze the qualitative and quantitative results. Then, we analyze the constraints that FFL imposes on frequency information and provide the corresponding results. We also use ablation experiments to verify the effectiveness of the EIM and each loss term. The experiments are implemented in Python 3.9 with the PyTorch deep learning framework and OpenCV.
5.1. Training Datasets
BraTS (http://braintumorsegmentation.org/, accessed on 1 July 2021) [37], collected for brain tumor segmentation, consists of brain MRI scans provided by multiple medical centers. Each scan includes four modalities, namely Flair, T1, T1CE, and T2, where T1CE refers to T1-weighted contrast-enhanced images. The original 3D images and whole tumor (WT) masks were converted into paired axial 2D slices, which were randomly divided in a ratio of 7:2:1, resulting in 2352, 672, and 336 pairs for the training, testing, and validation sets, respectively.
SegRap (https://segrap2023.grand-challenge.org, accessed on 14 April 2023), a dataset for segmentation of organs at risk (OAR) and gross tumor volume (GTV) in patients with nasopharyngeal carcinoma, includes CT and CECT data from a total of 200 patients. Of these, 120 training cases are publicly provided with images and annotations; they were adjusted to an appropriate window width and level and standardized into paired, continuous two-dimensional slices. The slices with lesion areas were screened and randomly divided in a ratio of 7:2:1, resulting in 1344, 384, and 192 pairs for the training, testing, and validation sets, respectively.
5.2. Implementation Details and Metrics
All the experiments were conducted on an Intel(R) Xeon(R) Gold 6148 CPU @ 2.6 GHz (Intel Corporation, Santa Clara, CA, USA) with 20 cores and 7 Tesla V100-SXM2 GPUs (NVIDIA Corporation, Santa Clara, CA, USA) using the same settings to ensure impartiality and objectivity. An Adam optimizer was adopted. The batch size was set to 4, and all image resolutions were set to 256 × 256. The weight-balance hyper-parameters of the three auxiliary loss terms were set to 100, 1, and 1, respectively.
To evaluate the synthesis performance of IFGAN, we compared it with seven state-of-the-art methods, including the medical image synthesis methods RIED-Net [6], BPGAN [7], RegGAN [9], Bi-MGAN [10], DC-cycleGAN [11], and DCE-MRI [12], as well as the adaptive domain synthesis method StarGAN V2 [8]. All of these methods can be used to achieve pre-to-post-contrast medical image synthesis, and their source code or interfaces are available for training and standardized performance comparisons. We re-trained all the models on our datasets and platform.
We used structural similarity (SSIM) [13] and multiscale structural similarity (MSIM) [10] to evaluate structural similarity and used the peak signal-to-noise ratio (PSNR) [13], normalized root-mean-square error (NRMSE) [10], and learned perceptual image patch similarity (LPIPS) [38] to evaluate visual perception. Additionally, we used the Fréchet inception distance (FID) [10] and GAN-seg [10] to measure how well the synthesized images fit the real-image manifold and the feature similarity of contrast enhancement lesion regions. Lower NRMSE, LPIPS, and FID values indicate better performance, and higher SSIM, PSNR, MSIM, and GAN-seg values reflect greater similarity to real images. For the symbols in all tables, ↑ means higher is better and ↓ means lower is better.
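As an illustrative example, several of these metrics can be computed per image pair with standard libraries; the snippet below uses scikit-image and assumes the images are 2D arrays scaled to [0, 1] (FID, LPIPS, and GAN-seg require dedicated models and are omitted here).

```python
import numpy as np
from skimage.metrics import (structural_similarity,
                             peak_signal_noise_ratio,
                             normalized_root_mse)

def evaluate_pair(real, fake):
    """Compute SSIM, PSNR, and NRMSE between a real and a synthesized image.

    real, fake: 2D numpy arrays with values in [0, 1].
    """
    return {
        "SSIM": structural_similarity(real, fake, data_range=1.0),
        "PSNR": peak_signal_noise_ratio(real, fake, data_range=1.0),
        "NRMSE": normalized_root_mse(real, fake),
    }

# Example usage with random placeholder images
real = np.random.rand(256, 256)
fake = np.random.rand(256, 256)
print(evaluate_pair(real, fake))
```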
5.3. Training Time and Throughput
To evaluate learning efficiency and computational consumption, we considered training time and throughput. The former measures the time required for the model to reach convergence, and the latter represents the number of images that the model can process per unit of time. A short training time and high throughput show that the model possesses high efficiency.
As shown in Table 1, IFGAN requires less training time than BPGAN, StarGAN V2, and DC-cycleGAN. Compared with Bi-MGAN and DCE-MRI, our IFGAN has higher throughput. Although DCE-MRI, RIED-Net, and RegGAN have advantages in training time, our IFGAN demonstrates commendable performance on both BraTS and SegRap while using only a moderate level of processing resources.
To examine the training process, we plotted the training loss curves of the IFGAN model on the two datasets. In Figure 4, (a) shows the training loss on BraTS, and (b) shows the training loss on SegRap. In both (a) and (b), the generative loss shows a downward trend, while the discriminative loss shows an increasing trend, which is consistent with the loss optimization goal. The loss curves of the generator and the discriminator exhibit opposite fluctuation patterns and gradually approach an equilibrium state as the number of epochs increases, which indicates that our model converges effectively.
5.4. Qualitative Results
To evaluate the visual perception of synthesized post-contrast images, we provide qualitative results of all methods on the BraTS and SegRap datasets in Figure 5 and Figure 6, respectively. We enlarged local lesion regions to compare the differences and employed an MAE heat map to visualize structural and shape differences between the post-contrast and real images. Regions with large differences appear as bright colors on the heat map, and fewer colored pixels indicate less deformation. The color bar on the right side maps colors to MAE values ranging from 0 to 1.
As shown in Figure 5 and Figure 6, the post-contrast results synthesized by IFGAN are closer to the real images than those synthesized by the other methods. Specifically, it can be seen from Figure 5 that the introduction of diversity loss in StarGAN V2 leads the model to misinterpret the anatomical structure. The lack of a hard constraint on pixel intensity differences in RegGAN can produce obvious noise because gray values are reflected incorrectly. The synthesis results of DCE-MRI and RIED-Net are obviously blurred. Most importantly, except for RIED-Net, none of these methods correctly reflect the contrast enhancement region. In Figure 6, it can also be seen that DCE-MRI fails to delineate the contrast enhancement region and produces blurred edges. The other methods capture more of the tiny contrast enhancement regions, which may be attributed to the smaller target range in the image; however, these methods do not enhance the extraction of specific features, and there is still a risk of the contrast enhancement region being ignored. The anatomical structure features yielded by RIED-Net, RegGAN, and StarGAN V2 were not learned correctly.
Although IFGAN, BPGAN, Bi-MGAN, and DC-cycleGAN all successfully capture tiny contrast enhancement regions while maintaining the correct anatomical structure, the MAE heat maps of IFGAN show better detail preservation, with less deviation from the real images. Combining all the results in Figure 5 and Figure 6, we can generally conclude that IFGAN successfully maintains contrast enhancement regions and anatomical structure and that the visual perception of images produced by IFGAN is close to that of real images.
5.5. Quantitative Results
To measure the quality of the synthesized post-contrast images, we used SSIM, PSNR, MSIM, and NRMSE to evaluate the degree of image distortion and structural changes on the BraTS and SegRap datasets. The quantitative comparison results are shown in Table 2 and Table 3.
As shown in Table 2, compared with the other methods, IFGAN achieved the best scores on all evaluation indicators. Specifically, on the BraTS dataset, our method achieved average increments of 10.4% in SSIM, 39.7% in PSNR, and 14.7% in MSIM, along with an average decrement of 51.7% in NRMSE. As shown in Table 3, compared with the other advanced methods, IFGAN achieved better PSNR, MSIM, and NRMSE results on the SegRap dataset, with average increments of 32.8% in PSNR and 2.4% in MSIM and an average decrement of 54.2% in NRMSE. It is worth noting that our IFGAN shows a minor drop in SSIM compared with Bi-MGAN, which introduces two adversarial systems and requires more training time; otherwise, our method possesses clear advantages in terms of PSNR and NRMSE.
Combining all the results in Table 2 and Table 3, we conclude that our IFGAN achieves superior performance in PSNR and NRMSE, along with competitive results in SSIM and MSIM, owing to the introduction of the enhanced interaction module, which improves the detail retention ability of the model so that anatomical structure and visual quality are better maintained.
Apart from the above comparisons, we also used LPIPS, FID, and GAN-seg to explore feature similarity and manifold fitting in the latent space. Based on a U-net segmentation model, GAN-seg measures the Dice similarity of the segmentation results of the contrast enhancement region in the generated post-contrast images. The evaluations are shown in Table 4.
From Table 4, it can be seen that our method achieved average decrements of 59.9% in LPIPS and 34.4% in FID. Although it exhibits slight drops in LPIPS and FID in some cases, our method still achieves better overall scores than the other methods and a clear advantage in GAN-seg, which indicates that IFGAN successfully captures enhanced regional features.
5.6. Focal Frequency Analysis
To further demonstrate that focal frequency loss preserves important frequency information, we present visualization results and the corresponding average spectral images. In Figure 7 and Figure 8, the first row shows the real post-contrast images and the corresponding average spectral images, the second row shows the generated post-contrast images and the corresponding average spectral images without focal frequency loss, and the third row shows the results with focal frequency loss applied. For the SegRap dataset, images were cropped so that the frequency conversion focuses on the foreground physiological tissue. Applying FFL on the BraTS dataset yielded clearer textures, while the changes in the SegRap results are more subtle. The average spectral images indicate that focal frequency loss significantly narrows the frequency-domain gap between real and generated post-contrast images on both datasets.
In spectral images, the central area reflects low-frequency components (red and yellow pixels), while the surrounding regions correspond to high-frequency components (blue pixels). A lack of high-frequency information results in image blurring and loss of texture, while spectral shifts can cause distortion of details. Focal frequency loss helps reduce the gap in the frequency domain, preserving essential frequency information and maintaining texture details. Compared to the SegRap dataset, the brain tissue MRI images in the BraTS dataset offer richer anatomical structure information. The focal frequency loss demonstrates a more pronounced improvement in detail for the BraTS dataset, highlighting its effectiveness in preserving real textures.
5.7. Bidirectional Synthesis Analysis
IFGAN can achieve bidirectional synthesis of medical images using a single generator, demonstrating its generality. It can realize both pre-to-post-contrast and post-to-pre-contrast medical image synthesis without re-training for each mapping direction. The qualitative results comparing IFGAN with other bidirectional synthesis methods on the BraTS and SegRap datasets are shown in Figure 9.
As shown in Figure 9, IFGAN achieved generally satisfactory results compared with the other bidirectional synthesis methods. In pre-to-post-contrast synthesis, IFGAN maintains attention to the enhanced region and anatomical structure details. In post-to-pre-contrast synthesis, the results of IFGAN are also closer to the real images. To evaluate the proposed method comprehensively, we also provide quantitative results for the other mapping direction (post-to-pre-contrast) in Table 5, where the proposed method still achieves the highest scores.
Combining all the results in Figure 9 and Table 2, Table 3, Table 4 and Table 5 shows that IFGAN can achieve high-quality bidirectional synthesis of pre- and post-contrast medical images, with wide applicability and flexibility.
5.8. Discussion and Limitations
To evaluate the impact of the designed EIM and the various losses on the quality of the generated post-contrast images, we designed the ablation experiments presented in this section. Excluding each component separately helps explain its contribution to preserving anatomical structure and texture details. In Table 6, Figure 10, and Figure 11, 'w/o EIM' represents the IFGAN variant with the EIM removed, 'w/o ffl' represents the variant without focal frequency loss, 'w/o rec' represents the variant without reconstruction loss, and 'w/o pmae' represents the variant without the pixel-wise mean absolute error.
As shown in Table 6, the scores of all evaluation indicators decreased significantly for 'w/o pmae', followed by 'w/o EIM'. After removing the focal frequency loss and the reconstruction loss, the scores decreased slightly.
The qualitative results presented in Figure 10 and Figure 11 are consistent with the quantitative results. In Figure 10, the anatomical structure of the post-contrast image generated by 'w/o pmae' is blurred or even distorted. The capture of the enhanced region by 'w/o EIM' is clearly insufficient and slightly inferior to that achieved with the proposed method. Compared with 'w/o ffl' and 'w/o rec', the proposed method retains texture details more realistically. In Figure 11, the subjective influence of each loss function on the SegRap dataset is not obvious, but its necessity is shown in Table 6.
Figure 10 and Figure 11 demonstrate that the pixel-wise mean absolute error is crucial for preserving the image structure. The EIM component helps improve image quality by focusing on the local contrast enhancement region, while the reconstruction loss and focal frequency loss further ensure the integrity of the anatomical structure and details. This shows that the EIM and the various losses all contribute to improving the quality of the generated post-contrast images on the BraTS and SegRap datasets.
Although IFGAN has made significant progress in pre-to-post-contrast medical image synthesis, several problems remain to be solved. First, the EIM enhances attention to enhanced regional features through its structural design, but visual and quantifiable methods for explaining its extraction of specific features are still lacking. Second, IFGAN shows potential for flexible bidirectional mapping between two domains: instead of designing a separate generator for each domain, it achieves the mapping by sharing latent features and adjusting feature statistics, which makes it possible to extend the model to multi-domain adaptive synthesis in future work.