#### 2.2.2. Objective Function

A fake image generated from the input vectors *z*<sub>1</sub> and *z*<sub>2</sub> can be expressed as *G*(*z*<sub>1</sub>, *z*<sub>2</sub>). The goal of the discriminator is to distinguish the real image *x* from the fake image *G*(*z*<sub>1</sub>, *z*<sub>2</sub>). Therefore, the discriminator attempts to reconstruct only *x*, not *G*(*z*<sub>1</sub>, *z*<sub>2</sub>). The generator, on the other hand, attempts to produce an image that the discriminator reconstructs well. As a result of this adversarial training, the output images of the generator become increasingly realistic so as to deceive the discriminator. In other words, the generator is trained to reduce the Wasserstein distance between the auto-encoder loss distributions of real and fake samples. The adversarial losses of the discriminator and the generator can be expressed as

$$\begin{aligned} L\_{\mathrm{D}} &= L(x; \theta\_{\mathrm{D}}) - k\_t \, L(G(z\_{\mathrm{D}}; \theta\_{\mathrm{G}}); \theta\_{\mathrm{D}}), \\ k\_{t+1} &= k\_t + \lambda\_k \left( \gamma L(x) - L(G(z\_{\mathrm{G}})) \right) \text{ for each training step } t, \end{aligned} \tag{3}$$

and

$$L\_{\mathrm{G}} = L(G(z\_{\mathrm{G}}; \theta\_{\mathrm{G}}); \theta\_{\mathrm{D}}), \tag{4}$$

where *L*(·) denotes the *L*<sub>1</sub> reconstruction loss of the auto-encoder, and *k<sub>t</sub>* is the parameter, introduced in BEGAN, that controls the balance between the generator and discriminator losses. This is required because the discriminator cannot achieve a suitable reconstruction quality at the beginning of training. Accordingly, *k<sub>t</sub>* starts close to zero and gradually increases as training progresses.
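As a rough illustration, the losses in Equations (3) and (4) and the closed-loop update of *k<sub>t</sub>* might be implemented as in the following PyTorch-style sketch. Here `G` and `D` are hypothetical modules (the pair *z*<sub>1</sub>, *z*<sub>2</sub> is folded into a single latent argument for brevity), separate optimizers for θ<sub>G</sub> and θ<sub>D</sub> are assumed, and the default `gamma` and `lambda_k` are typical BEGAN settings, not values taken from this paper:

```python
import torch

def l1_recon(D, v):
    # L(v) = |v - D(v)|_1: mean-reduced L1 reconstruction error of the auto-encoder.
    return torch.mean(torch.abs(v - D(v)))

def began_step(G, D, x, z_d, z_g, k_t, gamma=0.5, lambda_k=0.001):
    """Losses of Equations (3) and (4) plus the closed-loop update of k_t."""
    loss_real = l1_recon(D, x)
    loss_fake_d = l1_recon(D, G(z_d).detach())  # block gradients into G for L_D
    loss_g = l1_recon(D, G(z_g))                # Equation (4)
    loss_d = loss_real - k_t * loss_fake_d      # Equation (3)

    # k_t starts near zero and grows while gamma * L(x) exceeds L(G(z_g)).
    k_next = k_t + lambda_k * (gamma * loss_real.item() - loss_g.item())
    k_next = min(max(k_next, 0.0), 1.0)         # BEGAN clips k_t to [0, 1]
    return loss_d, loss_g, k_next
```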

As mentioned in Section 2.2.1, *z*<sub>1</sub> and *z*<sub>2</sub> are each involved in generating areas of a different scale. To apply this principle to the decoder, we add a novel constraint on the encoded latent vectors, referred to as the double-constraint loss. It penalizes the difference between the input vectors and the encoded vectors, as defined by

$$\begin{aligned} L\_{dc} &= \|z\_1 - z\_1^\*\|\_1 + \|z\_2 - z\_2^\*\|\_1, \\ [z\_1^\*, z\_2^\*] &= Enc(G(z\_1, z\_2)), \end{aligned} \tag{5}$$

where *Enc*(·) denotes the output of the encoder. The double-constraint loss is designed to stabilize training because it keeps the inputs of the generator and the decoder similar. It can also be extended to the synthesis of existing images because real samples are mapped to a space similar to the latent space of the input. Hence, the generator loss can be modified as

$$L\_{\mathrm{G}} = L(G(z\_{\mathrm{G}}; \theta\_{\mathrm{G}}); \theta\_{\mathrm{D}}) + \alpha \cdot L\_{\mathrm{dc}}, \tag{6}$$

where the hyperparameter *α* is a weighting factor for the double-constraint loss.
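A minimal sketch of Equations (5) and (6), assuming a hypothetical `Enc` handle on the encoder half of the discriminator and a placeholder value for *α* (the text does not fix it here):

```python
import torch

def double_constraint_loss(G, Enc, z1, z2):
    # Equation (5): re-encode the generated image and compare with the inputs.
    fake = G(z1, z2)
    z1_star, z2_star = Enc(fake)
    l_dc = torch.mean(torch.abs(z1 - z1_star)) + torch.mean(torch.abs(z2 - z2_star))
    return fake, l_dc

def generator_loss(G, D, Enc, z1, z2, alpha=0.1):  # alpha is a placeholder value
    # Equation (6): adversarial reconstruction term plus the weighted L_dc.
    fake, l_dc = double_constraint_loss(G, Enc, z1, z2)
    recon_fake = torch.mean(torch.abs(fake - D(fake)))  # L(G(z); theta_D)
    return recon_fake + alpha * l_dc
```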

#### 2.2.3. Training Scheme

Unstable training is a major concern when using GANs, and it can occasionally result in mode collapse or low-quality output. In auto-encoder-based GAN models in particular, the reconstruction performance of the discriminator is a decisive factor in the visual quality of an output image. However, excellent reconstruction performance cannot be guaranteed because the importance of the reconstruction error for a real image decreases as *k<sub>t</sub>* increases, as can be seen in Equation (3). Training a discriminator on relatively large images (e.g., 128 × 128) is slow and difficult, and *k<sub>t</sub>* grows large because the discriminator does not function effectively as an auto-encoder. Motivated by PGGAN [17], our model attempts to overcome this problem by starting the training process with low-resolution images; that is, the size of the training images increases as training progresses (Figure 2). Whenever the image size grows, new layers are added to both the generator and the discriminator so that the input and output sizes match. While PGGAN starts training with 4 × 4 images, our model begins with 32 × 32 images, because it generates images of 32 × 32 or smaller reliably and without stability problems, thereby reducing the training time. After a few epochs of training, the size of the training images is doubled, and new layers are added to the generator, encoder, and decoder while the weights of the existing layers are retained. By progressively training the generator and the discriminator in this manner, our model achieves better reconstruction performance than when *k<sub>t</sub>* remains constant. Because the discriminator has already been trained to some extent in the previous stage, the training process is more stable and the visual quality is higher than when training directly with 128 × 128 images. In addition, the layers of the generator and the decoder can accurately reflect the spatial properties of their input.

**Figure 2.** Progressive training of the proposed generator and discriminator. Our model starts with a 32 × 32 image in the first stage, and the size of the training images is doubled in the next stage.
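The progressive schedule just described might be driven by a loop such as the following sketch, in which `generator`, `discriminator`, `loader` (assumed to yield image batches), `train_step`, and the `grow()` methods are all hypothetical stand-ins, and the number of epochs per stage is a placeholder:

```python
import torch.nn.functional as F

RESOLUTIONS = [32, 64, 128]   # start at 32 x 32, double the size per stage
EPOCHS_PER_STAGE = 10         # placeholder; the actual schedule may differ

for stage, res in enumerate(RESOLUTIONS):
    if stage > 0:
        # Append new layers; the weights of the existing layers are kept as-is.
        generator.grow()       # hypothetical: adds an upsampling block
        discriminator.grow()   # hypothetical: grows both encoder and decoder
    for epoch in range(EPOCHS_PER_STAGE):
        for x in loader:
            # Resize real images to the resolution of the current stage.
            x_res = F.interpolate(x, size=(res, res), mode='bilinear',
                                  align_corners=False)
            train_step(generator, discriminator, x_res)  # Eqs. (3)-(6)
```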

#### 2.2.4. Facial Synthesis Method

In addition to generating images from random noise, as with other unsupervised GANs, our model can also be used to synthesize two images. StyleGAN introduced style mixing, which exploits two or more input vectors, but it was applied only to random noise inputs, not to existing images. To mix two existing images, an additional encoder would need to be trained to map the images onto the latent space of the input. However, our model does not require an additional network because our discriminator already contains an encoder. By taking advantage of the auto-encoder structure of our discriminator, we present a method for mixing existing images. The encoder encodes an input image as two latent vectors, which are exploited in different layers of the decoder. When the decoder uses the two latent vectors from a single image, it reconstructs that image. However, if the decoder exploits a combination of the two latent vectors from two different images, its output is a mixed image.

Let *X* and *Y* denote the two images to be mixed; the outputs of the encoder for *X* and *Y* can be expressed as

$$[z\_{X\_1}^\*, z\_{X\_2}^\*] = Enc(X), \quad [z\_{Y\_1}^\*, z\_{Y\_2}^\*] = Enc(Y). \tag{7}$$

If the decoder decodes the image using *z*<sup>∗</sup><sub>*X*1</sub> and *z*<sup>∗</sup><sub>*X*2</sub>, it reconstructs *X*, and if it uses *z*<sup>∗</sup><sub>*Y*1</sub> and *z*<sup>∗</sup><sub>*Y*2</sub>, it reconstructs *Y*; i.e.,

$$X^\* = Dec(z\_{X\_1}^\*, z\_{X\_2}^\*) \to \text{Reconstruction of } X, \tag{8}$$

$$Y^\* = Dec(z\_{Y\_1}^\*, z\_{Y\_2}^\*) \to \text{Reconstruction of } Y, \tag{9}$$

where *X*<sup>∗</sup> and *Y*<sup>∗</sup> denote the reconstructed images of *X* and *Y*, respectively, and *Dec*(·) denotes our decoder. To synthesize *X* and *Y*, the decoder needs to take the latent vectors from the two images as input. A mixed image of *X* and *Y* is acquired by exploiting a combination of the latent vectors from the two images (e.g., *z*<sup>∗</sup><sub>*X*1</sub> and *z*<sup>∗</sup><sub>*Y*2</sub>), as illustrated in Figure 3. The two blue boxes in Figure 3 represent the two parts of the decoder; i.e., one is involved in generating a 32 × 32 feature map from a given latent vector, and the other is involved in generating a 128 × 128 image from a given 32 × 32 feature map. Therefore, the synthesis process can be expressed as

$$I\_{X,Y} = Dec(z\_{X\_1}^\*, z\_{Y\_2}^\*) \to \text{Synthesis of } X \text{ and } Y, \tag{10}$$

where *I*<sub>*X*,*Y*</sub> is an image that has the structural or coarse-scale characteristics of *X* and the detailed or fine-scale characteristics of *Y*.

**Figure 3.** The facial image synthesis process for our model. The decoder takes one encoded vector from each image.
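Equations (7)–(10) translate directly into a few lines, again assuming hypothetical `Enc` and `Dec` handles on the two halves of the trained discriminator:

```python
import torch

@torch.no_grad()
def mix_images(Enc, Dec, X, Y):
    """Equations (7)-(10): coarse structure from X, fine details from Y."""
    zx1, zx2 = Enc(X)           # Equation (7)
    zy1, zy2 = Enc(Y)
    x_recon = Dec(zx1, zx2)     # Equation (8): reconstruction of X
    y_recon = Dec(zy1, zy2)     # Equation (9): reconstruction of Y
    mixed = Dec(zx1, zy2)       # Equation (10): structure of X, details of Y
    return x_recon, y_recon, mixed
```

Swapping the combination, i.e., `Dec(zy1, zx2)`, would instead take the coarse structure of *Y* and the fine details of *X*.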
