*2.1. Conditional Variational Autoencoder Generative Adversarial Network (CVAE-GAN)*

The model structure is shown in Figure 1 and includes four parts: an encoder network, E; a generator network, G; a discriminator network, D; and a classifier network, C.

**Figure 1.** Model structure of CVAE-GAN.

The encoder network, E, maps a sample, *x*, to a latent representation, *z*, via a learnable distribution, *P*(*z*|*x*,*c*), where *c* denotes the class of the data. The gap between the prior, *P*(*z*), and the learned posterior distribution is reduced using the *KL* loss:

$$L\_{KL} = \frac{1}{2}(-\log \sigma^2 + \mu^2 + \sigma^2 - 1) \tag{1}$$

where *μ* and *σ* are the mean and standard deviation of the latent vector output by the encoder network, E.
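As an illustration, a minimal PyTorch-style sketch of the *KL* loss in Equation (1) could look as follows, assuming the encoder outputs the mean and the log-variance of a diagonal Gaussian (the function and tensor names are illustrative, not from the original work):

```python
import torch

def kl_loss(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # 0.5 * (-log(sigma^2) + mu^2 + sigma^2 - 1),
    # summed over latent dimensions and averaged over the batch.
    return 0.5 * torch.sum(-log_var + mu.pow(2) + log_var.exp() - 1, dim=1).mean()
```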

The generator network, *G*, generates the data, *x*′, by sampling from the learnable distribution, *P*(*x*′|*z*,*c*). The roles of *G* and *D* are the same as in a standard GAN: the network, *G*, attempts to learn the distribution of the real data by means of gradients from the discriminator network, *D*, which distinguishes between real and synthetic samples. The loss function of the discriminator network, *D*, is:

$$L\_D = -E\_{x \sim p\_r} [\log D(\mathbf{x})] - E\_{z \sim p\_z} [\log(1 - D(G(z)))] \tag{2}$$

where *x* is the input data and *z* is the latent vector from the encoder network, E.
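A minimal sketch of the discriminator loss in Equation (2) could be written as below, assuming D outputs a probability in (0, 1), e.g., via a final sigmoid (the names are illustrative):

```python
import torch

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    # -E[log D(x)] - E[log(1 - D(G(z)))], averaged over the batch.
    return -(torch.log(d_real + eps).mean()
             + torch.log(1.0 - d_fake + eps).mean())
```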

The generator uses a mean feature matching objective function, which requires the feature center of the synthetic samples to match the feature center of the real samples. The generator, G, tries to minimize the loss function:

$$L\_{\rm GD} = \frac{1}{2} \left\| E\_{\rm x \sim p\_r} f\_D(\mathbf{x}) - E\_{\rm z \sim p\_z} f\_D(\mathbf{G}(\mathbf{z})) \right\|\_2^2 \tag{3}$$

where *fD*(*x*) denotes the features in the middle layer of the discriminator, *D*.
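A minimal sketch of the mean feature matching loss in Equation (3) could look as follows, assuming feat_real = *fD*(*x*) and feat_fake = *fD*(*G*(*z*)) are intermediate-layer features of the discriminator with shape (batch, feature_dim); the names are illustrative:

```python
import torch

def feature_matching_loss_d(feat_real: torch.Tensor,
                            feat_fake: torch.Tensor) -> torch.Tensor:
    # 0.5 * || E[f_D(x)] - E[f_D(G(z))] ||_2^2, with means taken over the batch.
    return 0.5 * (feat_real.mean(dim=0) - feat_fake.mean(dim=0)).pow(2).sum()
```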

For the classifier network, *C*, the generator, *G*, uses the same mean feature matching objective, computed per class, and attempts to minimize:

$$L\_{\rm GC} = \frac{1}{2} \sum\_{c} \left\| E\_{\rm x \sim p\_{\rm r}} f\_{\mathbb{C}}(\mathbf{x}) - E\_{\rm z \sim p\_{\rm z}} f\_{\mathbb{C}}(G(\mathbf{z}, \mathbf{c})) \right\|\_{2}^{2} \tag{4}$$

where *fC*(*x*) denotes the intermediate layer outputs of the classifier and *c* denotes the label of the input data, *x*.
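A minimal sketch of the class-wise mean feature matching loss in Equation (4) could look as below, assuming feat_real = *fC*(*x*) and feat_fake = *fC*(*G*(*z*,*c*)) are classifier features of shape (batch, feature_dim) and labels holds the class index of each sample; the names are illustrative:

```python
import torch

def feature_matching_loss_c(feat_real: torch.Tensor, feat_fake: torch.Tensor,
                            labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    loss = feat_real.new_zeros(())
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            # 0.5 * || E[f_C(x) | c] - E[f_C(G(z, c)) | c] ||_2^2 per class.
            diff = feat_real[mask].mean(dim=0) - feat_fake[mask].mean(dim=0)
            loss = loss + 0.5 * diff.pow(2).sum()
    return loss
```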

Then, an *L*2 reconstruction loss and a pairwise feature matching loss are imposed between *x* and *x*′:

$$L\_G = \frac{1}{2} (\left\|\mathbf{x} - \mathbf{x}'\right\|\_2^2 + \left\|f\_D(\mathbf{x}) - f\_D(\mathbf{x}')\right\|\_2^2 + \left\|f\_\mathbb{C}(\mathbf{x}) - f\_\mathbb{C}(\mathbf{x}')\right\|\_2^2) \tag{5}$$

where *x* is the input data and *x*′ is the corresponding data generated by the generator, *G*.
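A minimal sketch of the pairwise loss in Equation (5) could look as follows, assuming x_rec is the sample reconstructed by G from the latent code of *x*, and fd_* / fc_* are the corresponding intermediate features from D and C; the names are illustrative:

```python
import torch

def reconstruction_loss(x: torch.Tensor, x_rec: torch.Tensor,
                        fd_x: torch.Tensor, fd_rec: torch.Tensor,
                        fc_x: torch.Tensor, fc_rec: torch.Tensor) -> torch.Tensor:
    # 0.5 * (||x - x'||^2 + ||f_D(x) - f_D(x')||^2 + ||f_C(x) - f_C(x')||^2)
    return 0.5 * ((x - x_rec).pow(2).sum()
                  + (fd_x - fd_rec).pow(2).sum()
                  + (fc_x - fc_rec).pow(2).sum())
```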

Network *C* takes *x* as input and outputs a *k*-dimensional vector, which is then converted to probability-like values using the softmax function. Each element of the output represents the posterior probability, *P*(*c*|*x*). In the training phase, the network, *C*, attempts to minimize the softmax loss. The function of the classifier network, *C*, is to measure the posterior *P*(*c*|*x*):

$$L\_{\mathbb{C}} = -E\_{\mathbf{x} \sim p\_r} [\log P(\mathbf{c}|\mathbf{x})] \tag{6}$$
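The classifier loss in Equation (6) is the standard cross-entropy (softmax) loss on real samples; a minimal sketch could look as follows (logits and labels are illustrative names):

```python
import torch
import torch.nn.functional as F

def classifier_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # -E[log P(c|x)], with P given by the softmax over the k-dimensional output.
    return F.cross_entropy(logits, labels)
```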

The total loss function is:

$$L = L\_{KL} + L\_{\text{G}} + L\_{\text{GD}} + L\_{\text{GC}} + L\_{\text{D}} + L\_{\text{C}} \tag{7}$$

*LKL* is relevant only to the encoder network, E, and indicates whether the distribution of the latent vectors matches the expected prior. *LG*, *LGD*, and *LGC* are relevant to the generator network, G, and measure how close the synthetic sample is to the input training sample, to the real samples overall, and to other samples of the same category, respectively. *LC* is relevant to the classifier network, *C*, and indicates how well the network classifies samples from different categories; *LD* is relevant to the discriminator network, *D*, and indicates how well the network distinguishes between real and synthetic samples. All these objective functions are complementary and together lead to the optimal result for the algorithm. A minimal sketch of this grouping is given below.
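The sketch groups the terms of Equation (7) by the sub-network whose parameters they update, following the description above; the function and argument names are illustrative, and the individual loss values are assumed to have been computed as in the earlier sketches:

```python
def per_network_losses(l_kl, l_g, l_gd, l_gc, l_d, l_c):
    return {
        "E": l_kl,               # latent-distribution regularization
        "G": l_g + l_gd + l_gc,  # reconstruction + feature matching terms
        "D": l_d,                # real/synthetic discrimination
        "C": l_c,                # classification of real samples
    }
```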
