1. Introduction
Medical imaging scans, including Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), are routinely acquired and used clinically to macroscopically assess, diagnose, and monitor patients with brain abnormalities. MRI in particular can depict normal anatomy and apparent pathologies while providing data relating to the anatomical structure, tissue density, and microstructure, as well as tissue vascularization, depending on the acquired sequence [1,2,3]. Structural MRI sequences represent the basic scans acquired across comprehensive centers and community-based healthcare sites, comprising native T1-weighted (T1W), post-contrast T1-weighted (T1Gd), T2-weighted (T2W), and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR). T1W scans facilitate observation and analysis of the brain anatomy, with T1Gd scans in particular enabling easy identification of the boundaries of an active tumor, while T2-weighted scans (T2W and FLAIR) help in identifying brain abnormalities, both those related to vascular lesions (e.g., stroke) and vasogenic edema [4]. The simultaneous assessment of multiple varying MRI scans (also known as multi-parametric MRI, mpMRI) from the same patient is the standard clinical practice for the evaluation of patients suspected of stroke or diffuse glioma, as it offers the maximal available medical diagnostic information.
Acquisition of mpMRI might not be possible at all times, for numerous reasons, including but not limited to the patient's cooperation during a scanning session, which could result in motion-degraded scans, thereby hindering further diagnostic usage [5,6,7]. Towards this end, the artificial synthesis of specific MRI scans has been an active area of research [8,9], with the intention of either substituting specific MRI scans corrupted by various artifacts, or generating scans that were not acquired at all. Although such synthetic scans have been successfully used in many applications, and Generative Adversarial Networks (GANs) have significantly improved their realism, the final result may not always look realistic and/or may contain information that adversely affects downstream quantitative analyses [10,11]. Cross-domain synthesis of medical images has drawn significant interest in the medical imaging community and describes the artificial generation of a target-modality scan by learning the relationship between paired source-modality scans and their associated target-modality scans [12,13]. Of note, the data are described here as paired when they arise from the same individual, acquired at different points in time.
In recent years, deep-learning procedures, particularly Convolutional Neural Networks (CNNs) [14] and GANs, have rapidly dominated the domain of medical image synthesis [15,16]. GANs use two competing CNNs: one that generates new images and another that discriminates the generated images as either real or fake. To address the problem of unpaired cross-domain data, which is common in healthcare, the Cycle Generative Adversarial Network (CycleGAN) [17] is typically chosen to obtain high-quality information translatable across images. In CycleGAN, based on the image of a subject b1 in the source domain, the purpose is to estimate the relevant image of the same subject b2 in the target domain. In theory, the CycleGAN model entails two mapping functions, i.e., G1: X → Y and G2: Y → X, and associated adversarial discriminators DY and DX. DY encourages G1 to translate X into outputs indistinguishable from domain Y, and vice versa for DX and G2. Nie et al. [18] trained a fully convolutional network (FCN) to generate CT scans from corresponding MRI scans, specifically using an adversarial method to train their FCN. Welander et al. [19] evaluated two models, CycleGAN and UNIT [20], for image-to-image translation of T1W and T2W MRI slices by comparing synthetic MRI scans to real ones. They used paired T1W and T2W images, comprising 1113 axial images (slice 120 only). The scans were registered to a standard anatomical template, so they were in the same coordinate space and of the same size. The two models were compared using quantitative metrics, including mean absolute error (MAE), mutual information (MI), and peak signal-to-noise ratio (PSNR), and it was shown that the trained GAN models can synthesize visually realistic MRI slices. Dar et al. [21] proposed a method for multi-contrast MRI synthesis based on Conditional GANs (CGANs), demonstrating how CGANs could generate a T2W scan from a T1W scan. Theis et al. [22] found that GANs can generate more realistic training data to improve the classification performance of machine learning methods. They also showed that models creating more visually realistic synthetic images do not necessarily achieve better quantitative error measurements when compared to real images. Despite the mounting promise of GANs for healthcare, both optimal model selection and quantitative evaluation remain challenging tasks, and the solutions produced so far are use-specific and not generalizable. Specifically, for quantitative performance evaluation, several metrics (such as MAE, Mean Squared Error (MSE), and PSNR) have been proposed in the literature [23], albeit no consensus has been reached on an optimal evaluation metric for a particular domain.
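The two mappings G1 and G2 and the cycle-consistency constraint described above can be illustrated with a toy numpy sketch. Here the "generators" are simple invertible intensity maps standing in for trained CNNs; all names and functions are illustrative, not taken from the original CycleGAN code:

```python
import numpy as np

def cycle_consistency_loss(G1, G2, x, y):
    """L1 cycle-consistency loss: G2(G1(x)) should recover x,
    and G1(G2(y)) should recover y."""
    forward_cycle = np.mean(np.abs(G2(G1(x)) - x))   # x -> Y -> x
    backward_cycle = np.mean(np.abs(G1(G2(y)) - y))  # y -> X -> y
    return forward_cycle + backward_cycle

# Toy "generators": invertible intensity maps standing in for CNNs.
G1 = lambda img: 2.0 * img + 1.0    # X -> Y
G2 = lambda img: (img - 1.0) / 2.0  # Y -> X (exact inverse of G1)

x = np.random.rand(4, 4)       # a toy "source-domain" image
y = G1(np.random.rand(4, 4))   # a toy "target-domain" image

# Because G2 inverts G1 exactly, the cycle loss is (numerically) zero.
loss = cycle_consistency_loss(G1, G2, x, y)
```

In the full model this term is added to the adversarial losses of DY and DX, so that the generators are pushed to produce domain-realistic outputs while still preserving the subject's underlying anatomy.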
Radiomics describe a novel and rapidly advancing area in medical imaging. In contrast to the traditional clinical assessment of considering medical images as pictures intended only for visual interpretation, radiomics represent visual and sub-visual quantitative measurements (also called "features") extracted from acquired radiology scans [24,25], following specific mathematical formulations, and resulting in measurements that are not perceivable by the naked eye, i.e., sub-visual [26,27,28,29,30,31,32,33,34]. These features are widely used in both clinical and pre-clinical research studies attempting to identify associations between radiologic scans and clinical outcomes, or even molecular characteristics [35,36,37,38]. The hypothesis is that quantitative computational interrogation of medical scans can provide more and better information than the physician's visual assessment. This is particularly evident in observations related to texture analyses of different imaging modalities. Various open-source tools have been developed to facilitate the harmonized extraction of high-throughput radiomic features [39,40,41], contributing to the increasing evidence of their value. The primary purpose of these tools has been to expedite robust quantitative image analyses based on radiomics and to standardize both feature definitions and computation strategies, thereby supporting the reproducibility and reliability of radiomic features. In this study, considering the importance of T2-weighted scans (T2W and FLAIR), we focus on generating FLAIR from T2W MRI scans, and vice versa, based on the CycleGAN [17] and dual cycle-consistent adversarial network (DC2Anet) [42] architectures. We further consider radiomics as a novel way to quantify the dissimilarity between the distributions of the real and synthesized scans. We posit that radiomics can represent the first potential solution for the quantitative performance evaluation of GANs in the domain of radiology. We also compare against traditional metrics, including MSE, MAE, and PSNR [43].
3. Results
We evaluated the proposed CycleGAN and DC2Anet architectures on T2W and FLAIR brain tumor images. For the quantitative performance evaluation of the T2W and FLAIR synthesis, in line with the current literature, we considered the three metrics of MAE, MSE, and PSNR (Table 3).
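For reference, the three metrics can be computed from a real image and its synthetic counterpart as follows; this is a minimal numpy sketch, assuming co-registered slices of the same size and a known maximum intensity:

```python
import numpy as np

def mae(real, synth):
    """Mean absolute error between two images; lower is better."""
    return np.mean(np.abs(real - synth))

def mse(real, synth):
    """Mean squared error between two images; lower is better."""
    return np.mean((real - synth) ** 2)

def psnr(real, synth, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher is better.
    max_val is the maximum possible intensity of the images."""
    err = mse(real, synth)
    if err == 0:
        return np.inf  # identical images
    return 10.0 * np.log10(max_val ** 2 / err)
```

Smaller MAE and MSE values and larger PSNR values indicate a closer match between the synthetic and real image, consistent with the ordering used in Table 3.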
The CycleGAN model shows superior performance when compared with the DC2Anet model across all three quantitative metrics. In general, we observe better performance for FLAIR images. Of note, smaller values of MAE and MSE indicate better results, as opposed to PSNR, where larger values are better. The quantitative superiority of CycleGAN and DC2Anet for FLAIR images corresponds to the visual realism and the mapping in Figure 6 and Figure 7, which show axial views of five patients' synthetic and real images for CycleGAN and DC2Anet, for both the T2W and FLAIR images. Of note, in the perceptual study, CycleGAN showed the best performance for both the T2W and FLAIR images, in agreement with the quantitative analyses. Hence, the CycleGAN model, trained using adversarial and cycle-consistency losses, generates more realistic synthetic MR images. More examples of CycleGAN and DC2Anet FLAIR and T2W translations are provided in the Supplementary Material (Figures S1 and S2).
The differences observed between real and synthesized images can be attributed to multiple factors. First, synthesized images are computer-generated simulations of original images based on GAN models, while real images are directly acquired from patients using MRI scanners. Consequently, variations can arise due to the inherent limitations and assumptions of the synthesis process. Furthermore, physiological and technical factors, such as variances in tissue contrast, signal intensity, and image artifacts, can contribute to dissimilarities between synthesized and real images. To further investigate and address these differences, future studies should focus on refining the synthesis algorithms, incorporating more realistic training data, and exploring the impact of various imaging parameters on the synthesis process.
We then performed a secondary quantitative performance evaluation that considers the radiophenotypical properties of the images, by virtue of the extracted radiomic features. Unlike the MAE, MSE, and PSNR metrics, the significance levels and values of radiomic features vary depending on the type of feature and image. Following a comparison of the radiomic features for both the T2W and FLAIR images, our results indicated that for most radiomic features there was no significant difference between the real and synthetic T2W images, or between the real and synthetic FLAIR images. The mean and standard error (SE) of the GLCM features for both T2W and FLAIR images, as well as their statistically significant differences for the CycleGAN and DC2Anet models, are shown in Table 4 and Table 5. No significant differences were observed for any of the extracted GLCM features between real and synthetic T2W images using the CycleGAN model. However, this was not the case for three features (cluster prominence, contrast, and correlation) extracted from the synthetic images of the DC2Anet model (Table 4). Notably, for FLAIR images there was a significant difference for two features (cluster prominence and correlation) using the DC2Anet model, and for the cluster prominence feature using the CycleGAN model (Table 5).
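To make the GLCM comparison concrete, the contrast feature reported above is defined over the normalized grey-level co-occurrence matrix p as the sum over (i, j) of p(i, j)·(i − j)². The sketch below computes it for horizontally adjacent pixels on a tiny integer-valued image; this is an illustrative toy, whereas production radiomics work would use a standardized library such as PyRadiomics:

```python
import numpy as np

def glcm_contrast(image, levels):
    """GLCM contrast for horizontally adjacent pixels:
    sum over (i, j) of p(i, j) * (i - j)**2, where p is the
    normalized co-occurrence matrix of grey levels."""
    glcm = np.zeros((levels, levels))
    for row in image:
        for a, b in zip(row[:-1], row[1:]):
            glcm[a, b] += 1       # count each horizontal pair once
    p = glcm / glcm.sum()         # normalize to a probability matrix
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

img = np.array([[0, 0, 1],
                [1, 1, 0]])
# Pairs: (0,0), (0,1), (1,1), (1,0) -> contrast = 0.5
```

Higher contrast indicates more frequent large grey-level jumps between neighbouring pixels, which is why mismatches in this feature between real and synthetic images signal a difference in local texture rather than in overall intensity.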
As with the GLCM features, no significant differences were observed for any of the extracted GLRLM features between real and synthetic T2W images using the CycleGAN model (Table 6). However, for features extracted from the DC2Anet synthetic images, there was a significant difference for two features (High Grey Level Run Emphasis and Long Run Low Grey Level Emphasis), both between the real and synthetic T2W images and between the real and synthetic FLAIR images (Table 7).
The features extracted based on GLSZM for the two models are shown in Table 8 and Table 9. Comparing the real and synthetic T2W images for CycleGAN and DC2Anet, no significant differences were observed for the extracted GLSZM features, except for two features (Grey Level Nonuniformity and Large Zone Low Grey Level Emphasis) (Table 8). However, for FLAIR images using the DC2Anet model, significant differences were observed for two features (High Grey Level Emphasis and Large Zone Low Grey Level Emphasis) (Table 9).
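The per-feature significance tests summarized above can be sketched as a paired comparison of one feature's values on the real versus synthetic image of each patient. The snippet below computes the paired t statistic in pure Python on hypothetical values; the actual test and p-value computation used for the tables depend on the study's statistical software, which is not reproduced here:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(real_vals, synth_vals):
    """Paired t statistic for one radiomic feature measured on the real
    and synthetic image of each patient: t = mean(d) / (sd(d) / sqrt(n)),
    where d are the per-patient differences. The p-value would then be
    looked up in a t distribution with n - 1 degrees of freedom."""
    d = [r - s for r, s in zip(real_vals, synth_vals)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))

# Hypothetical per-patient values of a single feature (illustrative only).
real = [1.0, 2.0, 3.0, 4.0]
synth = [0.0, 2.0, 2.0, 4.0]
t = paired_t_statistic(real, synth)  # differences d = [1, 0, 1, 0]
```

A non-significant result for a feature means the distribution of that feature is statistically indistinguishable between real and synthetic images, which is the sense in which the radiomic evaluation assesses realism.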
4. Discussion
A within-modality synthesis strategy was presented for generating FLAIR images from T2W images, and vice versa, based on the CycleGAN and DC2Anet networks. Comprehensive evaluations were conducted for the two methods, with training images registered within single subjects. It has been shown, via a perceptual study, via quantitative assessments based on the MAE, MSE, and PSNR metrics, and via a novel radiomic evaluation, that CycleGAN and DC2Anet can be used to generate visually realistic MR images. While our synthesis approaches were primarily evaluated for two specific brain MRI sequences, they have the potential to be applied to image-to-image MRI synthesis in general, as well as to synthesis across imaging modalities (such as MRI, CT, and PET). The proposed CycleGAN technique uses adversarial loss functions and a cycle-consistency loss to learn to synthesize from registered images for improved synthesis. The DC2Anet model, in addition to the losses used in the CycleGAN model, employs four further loss functions: voxel-wise, gradient difference, perceptual, and structural similarity losses. In modern medical imaging, generating realistic medical images that are highly similar to their real counterparts remains a challenging objective, and synthetic images must be realistic enough to support a trustworthy diagnosis. Based on the quantitative evaluation, for all metrics, the CycleGAN model was found to be accurate and outperformed the DC2Anet model.
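Of the four auxiliary losses in the DC2Anet model, the gradient-difference term can be sketched as follows. This is a minimal numpy version that penalizes mismatched finite-difference edge magnitudes; the exact formulation and weighting used in DC2Anet may differ:

```python
import numpy as np

def gradient_difference_loss(real, synth):
    """Penalize differences between the image gradients of the real and
    synthetic images, encouraging the synthesis to preserve edges."""
    # Finite-difference gradient magnitudes along rows and columns.
    dy_r, dx_r = np.abs(np.diff(real, axis=0)), np.abs(np.diff(real, axis=1))
    dy_s, dx_s = np.abs(np.diff(synth, axis=0)), np.abs(np.diff(synth, axis=1))
    # Mean squared mismatch of the two gradient fields.
    return np.mean((dy_r - dy_s) ** 2) + np.mean((dx_r - dx_s) ** 2)
```

A constant intensity shift leaves the gradients unchanged, so this term is zero for it; the loss reacts to edge structure rather than absolute intensity, complementing the voxel-wise term.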
The CycleGAN and DC2Anet models learn the mapping directly from the space of T2W images to the corresponding FLAIR images, and vice versa. Moreover, the three metrics (MAE, MSE, and PSNR), applied to data from 510 paired T2W/FLAIR slices from 102 patients, compared favorably with other results reported in the brain imaging literature. For example, Krauss et al. [61] compared the assessment results of synthetic and conventional MRIs for patients with Multiple Sclerosis (MS), using images prospectively acquired from 52 patients with diagnosed MS. In the quantitative evaluations, our GAN-based approach using the CycleGAN model obtained better results than the study of Krauss et al. Han et al. [62] proposed a GAN-based approach to generate synthetic multi-sequence brain MRI using a Deep Convolutional GAN (DCGAN) and a Wasserstein GAN (WGAN). Their model was validated by an expert physician who performed a Visual Turing test. In agreement with our study, their results revealed that GANs can generate realistic multi-sequence brain MRI images. Nevertheless, the present study differs from these research attempts in terms of the employed quantitative evaluation and the proposed models, CycleGAN and DC2Anet: in addition to the MAE, MSE, and PSNR metrics, we also conducted a novel evaluation based on radiomic features to compare the real and synthetic MRI cases. Li et al. [63] proposed a procedure to synthesize brain MRI from CT images. They reported an MAE value of 87.73 and an MSE value of 1.392 × 10⁴ between real and synthesized MRI. Although their results differed from our findings, there was a common understanding that the application of the CycleGAN model remains subject to considerable error.
To evaluate generated images and compare them with real images, quantitative measures can be used. It is clear that even if a model achieves a relatively satisfactory score in quantitative measurements such as the MAE, MSE, and PSNR metrics, it does not necessarily generate visually realistic images. Although CycleGAN produced visually realistic images, there is not much difference between the two models, CycleGAN and DC2Anet, in quantitative measurements including the MAE, MSE, and PSNR metrics, as shown in Table 3. This implies that determining whether or not an image is visually realistic cannot be done based on these metrics alone. The current study therefore employed radiomic features as a new evaluation approach to compare real MRIs and their synthetic counterparts. Our results (Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9) revealed that for the vast majority of radiomic features, for both the T2W and FLAIR images, no significant difference was observed between real images and images synthesized using CycleGAN. Conversely, some radiomic features indicated significant differences for images synthesized by DC2Anet, suggesting that radiomic features are more successful in assessing the realism of generated images than traditional metrics such as MAE, MSE, and PSNR. Therefore, according to the metrics used in this study, it can be concluded that performing evaluations based on radiomic features is a viable option for assessing GAN models.
In this study, we used the ACRIN 6677/RTOG 0625 dataset, a multi-center dataset, which is one of the strengths of this study. As a limitation, the sample size was modest, and future studies with larger sample sizes are suggested. This study considered synthesis for two-contrast brain MRI; the proposed models could also be applied to other related tasks in medical image analysis, such as T1W synthesis and CT-PET or MR-CT translation. For future research, it is suggested that relevant evaluations based on radiomic features be carried out with larger datasets and in other anatomical areas.
Despite the demonstrated effectiveness of our method in generating a T2W MRI from a FLAIR, and vice versa, it is important to acknowledge that the applicability of our approach may have limitations in certain specific cases. The performance of our method may be influenced by factors such as extreme variations in tumor size, irregular tumor shapes, or cases with substantial edema or necrosis. While our methodology has shown promising results in brain tumor patients, further research is needed to investigate its robustness in challenging scenarios and to develop additional techniques to address these limitations. Future studies should also consider expanding the dataset to include a larger cohort of patients with a wider spectrum of brain pathologies to ensure the generalizability of our findings.