#### *3.1. Experimental Setup*

We used the CelebFaces Attributes (CelebA) dataset (Figure 4) [20], which consists of 202,599 facial images of celebrities cropped to 178 × 218. We cropped each image further to 170 × 170 and then resized it to 128 × 128. For progressive training, we downsampled the images to 32 × 32 and used them in the first stage; the height and width of each training image were then doubled every five training epochs. In our experiments, the coefficients of the objective functions in Equations (3) and (6) were set to γ = 0.5 and α = 0.1. We used the Adam [21] solver with β<sub>1</sub> = 0.5 and β<sub>2</sub> = 0.999, and the learning rate was initially set to 0.0005. The *L*<sub>1</sub> loss was adopted as the loss function for the auto-encoder. All other parameters were the same as in BEGAN. We used TensorFlow with cuDNN as the deep-learning framework and an NVIDIA GTX 1080 Ti graphics card.
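For clarity, the preprocessing and progressive-resolution schedule described above can be written as the following minimal, TensorFlow 2-style sketch. The helper names and the use of a center crop are illustrative assumptions, not the exact training code:

```python
import tensorflow as tf

def preprocess(image):
    # image: uint8 CelebA image of shape (218, 178, 3).
    image = tf.image.crop_to_bounding_box(image, 24, 4, 170, 170)  # assumed center 170 x 170 crop
    image = tf.image.resize(image, (128, 128))                     # final training resolution
    return image / 127.5 - 1.0                                     # scale pixel values to [-1, 1]

def stage_resolution(epoch, base=32, final=128, epochs_per_stage=5):
    # Start at 32 x 32 and double the side length every five epochs,
    # capped at the full 128 x 128 resolution.
    res = base * (2 ** (epoch // epochs_per_stage))
    return min(res, final)

# Optimizer settings reported above.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4, beta_1=0.5, beta_2=0.999)
```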

**Figure 4.** CelebFaces Attributes (CelebA) dataset.

#### *3.2. Qualitative Results*

We conducted a qualitative analysis by comparing the outputs of our model with those of two other auto-encoder-based GAN models: BEGAN [12] and BEGAN-CS [18], the latter of which adds a latent constraint to BEGAN. The results are shown in Figure 5. Columns (a) to (c) correspond to 5, 10, and 15 training epochs, respectively, and each row shows the results of one of the compared methods. The output images were produced by each generator from input vectors sampled randomly from a Gaussian distribution. Note that the output of our model in (a) (5 epochs) has a lower resolution than those of the other models because of the progressive learning strategy it employs. The visual quality of the images improves as training progresses in all three models. However, the results from BEGAN contain artifacts such as checkerboard patterns, while BEGAN-CS produces blurred and unstructured facial images (Figure 5d). Once the size of the training images increases, our model produces visual quality similar to that of the other models (Figure 5b) and clearer images than the other models after 15 epochs (Figure 5c,d).

**Figure 5.** Qualitative results for facial image generation. The rows from top to bottom present the results of BEGAN, BEGAN-CS, and the proposed model, respectively. Each column corresponds to a different number of training epochs: (**a**) epoch 5, (**b**) epoch 10, and (**c**) epoch 15. (**d**) An enlargement of the red box in (c).

#### *3.3. Quantitative Results*

It is difficult to verify the diversity of the output images by inspecting only a few samples. Therefore, we conducted quantitative experiments using the Fréchet inception distance (FID) [22], which measures both the quality and diversity of generated images. The FID score is calculated using Equation (11):

$$\text{FID} = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2\left(\Sigma_x \Sigma_y\right)^{\frac{1}{2}}\right),\tag{11}$$

where *x* and *y* denote the two image sets being compared; in our experiments, *x* consists of real images and *y* of generated (fake) images. Because the FID score compares the means (μ) and covariances (Σ) of the feature representations of the two image sets, it reflects both the visual quality and the diversity of the images. If the two sets have similar probability distributions, the FID score is low (Equation (11)); therefore, lower FID scores are better when comparing GAN models.
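For reference, a minimal NumPy/SciPy sketch of Equation (11) is given below; it assumes the Inception features of the real and fake image sets have already been extracted, and the function name is illustrative:

```python
import numpy as np
from scipy import linalg

def fid_score(feat_real, feat_fake):
    # feat_real, feat_fake: arrays of shape (N, D) holding Inception features
    # of the real and generated image sets, respectively.
    mu_x, mu_y = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    sigma_x = np.cov(feat_real, rowvar=False)
    sigma_y = np.cov(feat_fake, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_y, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(sigma_x + sigma_y - 2.0 * covmean))
```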

We measured the FID score using 5000 real samples and 5000 fake samples at epoch 15 for each model. The results are summarized in Table 1.

**Table 1.** Visual quality in terms of the Fréchet inception distance (FID) score, where a lower score is better.


Our model achieves the best (lowest) FID score, indicating that it is superior to the compared models in terms of image quality and diversity.

#### *3.4. Facial Synthesis Results*

We tested the synthesis of facial images using the proposed model. In Figure 6, the right-most image in each row represents the synthesis output of the two left-side images. The front module of our decoder takes a latent vector encoded from the left image, and the back module takes a latent vector encoded from the right image. The output image thus combines the characteristics of both inputs, but at different scales: it inherits the coarse-scale characteristics of the first image (e.g., the overall structure and locations of the facial attributes) and the fine-scale features of the second image (e.g., the eyes or the skin color). Note that facial synthesis is achieved without requiring additional information, such as binary attribute labels for each image.
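The following is a minimal sketch of this mixing procedure. The callables `encoder`, `decoder_front`, and `decoder_back` are hypothetical stand-ins for the discriminator's encoder and the front (coarse-scale) and back (fine-scale) halves of its decoder, not the exact interface of the implementation:

```python
def synthesize(encoder, decoder_front, decoder_back, image_a, image_b):
    # Encode both real images into latent vectors with the discriminator's encoder.
    z_a = encoder(image_a)  # latent of the first (left) image: coarse structure
    z_b = encoder(image_b)  # latent of the second (right) image: fine details

    # The front (coarse-scale) decoder blocks are conditioned on z_a via AdaIN,
    # and the back (fine-scale) blocks on z_b, so the output keeps the overall
    # facial layout of image_a and the fine attributes (eyes, skin color) of image_b.
    coarse_features = decoder_front(z_a)
    return decoder_back(coarse_features, z_b)
```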

**Figure 6.** Qualitative results of facial synthesis with the proposed model. The right-most image is the synthesis of the two left-side images.

#### **4. Conclusions**

In this paper, we proposed an enhanced GAN model for unsupervised facial image generation and synthesis. To overcome the limitations of GAN models (particularly auto-encoder-based models), we first introduced an enhanced generator and discriminator structure. Our generator and decoder take two input vectors, and every block incorporates the information from these vectors through adaptive instance normalization layers; each layer is responsible for producing scale-specific components of the facial image. We also applied a progressive learning method to the proposed auto-encoder-based model, in which the training process is divided into several stages according to the size of the training images. Consequently, our model can both generate and synthesize facial images via an auto-encoder structure. It can generate arbitrary images because it takes noise as input, and it can synthesize two existing images using the encoder and decoder within the discriminator; therefore, it requires neither additional training to encode existing images nor a pre-trained network. Using both qualitative and quantitative analyses, we demonstrated that the visual quality and diversity of the output images were higher than those of the baseline models. Additionally, we presented a method for synthesizing two existing images by exploiting the auto-encoder structure of the discriminator; our model does not need a separately trained subnetwork to encode the images for mixing. All of the networks in our model were trained in an end-to-end manner without image labels. In future research, we will further investigate this method from a variety of perspectives to enhance the visual quality of the output images and to ensure stable training for large-scale image generation. Furthermore, we will extend our model beyond unsupervised generation to conditional image generation and synthesis tasks, such as image-to-image translation.

**Author Contributions:** Conceptualization: J.g.K.; methodology: J.g.K.; software: J.g.K.; investigation: J.g.K.; writing—original draft preparation: J.g.K.; writing—review and editing: H.K.; supervision: H.K. All authors have read and agreed to the published version of the manuscript.

**Acknowledgments:** This research was supported by a National Research Foundation (NRF) grant funded by the MSIP of Korea (number 2019R1A2C2009480).

**Conflicts of Interest:** The authors declare no conflicts of interest.
