2.1. A Brief Introduction to GAN
GAN [36], proposed in 2014, was originally used as a generative model that can generate images, audio, etc., and the quality of the generated objects has increased year by year. GAN originates from a game, and its learning framework is composed of a generator (G) and a discriminator (D), which play different roles in the game.
For a given training data set, the purpose of the G is to generate samples having the same probability distribution as the training data set. The D is a common binary classifier and is mainly responsible for two tasks. Firstly, it must determine whether its input comes from the real data distribution or from the G. Secondly, the D guides the G through the back-propagated gradient to create more realistic samples, which is the only way for the G to optimize its model parameters. During the game, the G takes random noise as input and outputs a fake sample, for which the G tries to maximize the probability that the D judges it as coming from the real training set.
During the training, the D takes in an image from the training set as input half of the time, and an image produced by the G the other half of the time. The D is trained to maximize the distance between the categories, that is, to distinguish the real images of the training set from the fake samples of the G. The training eventually reaches an equilibrium, the Nash equilibrium. Because this equilibrium is difficult to find, many research papers have addressed this problem [39,40,41].
The G should therefore make the generated probability distribution as close as possible to the real data distribution, so that the D cannot distinguish real samples from fake ones. In the adversarial process, the G's ability to learn the real data distribution becomes stronger and stronger, and the D's feature-learning and discriminative ability likewise keeps improving. This research has been applied in many real-life scenarios, such as image synthesis, scene synthesis, face synthesis, style transfer, image super-resolution, and image domain conversion.
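As an illustration of this alternating game, the following is a minimal PyTorch sketch of one training step; the network shapes, learning rates, and variable names are assumptions for the example, not taken from any cited work.

```python
# Minimal sketch of one alternating GAN training step (PyTorch).
# All layer sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):                 # real_images: [batch, 784]
    batch = real_images.size(0)
    z = torch.randn(batch, 100)              # random noise input for the G

    # Train D: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(z).detach()                     # detach so the G is not updated here
    loss_D = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Train G: try to make D judge the fakes as real.
    loss_G = bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```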
2.2. Deconvolution, Convolution, Normalization and Activation
Having briefly introduced GAN, we now discuss deconvolution, convolution, normalization, and activation.
2.2.1. Deconvolution and Convolution
Convolution transforms the input data and is commonly used to obtain compact, high-level latent features that lay a good foundation for the separation or discrimination performed in later steps.
The standard neural network structure consists of an input layer x, an output layer y, and some hidden layers h, where every layer has many units. Usually, every hidden unit receives all outputs of the previous layer and combines them non-linearly as follows:

$$h_j = F\Big(\sum_i w_{ij} x_i + b_j\Big),$$

where $w_{ij}$ is the weight value controlling the intensity of the connection between input unit $i$ and hidden unit $j$, $b_j$ is the bias of the hidden unit, and $F$ is a non-linear function, such as the sigmoid function. Multi-layer neural networks commonly need a large number of parameters; however, with the rapid development of hardware, the lack of computing resources is no longer a dilemma. Because the convolutional neural network relies on the feature-sharing principle, every feature map output through a channel is created by a filter of the same size, so convolutional neural networks depend on fewer model parameters than standard neural network structures. At the same time, the convolutional neural network uses pooling layers to ensure the translation invariance of the image. The pooling operation also broadens the receptive field so as to take in more of the input, and a larger receptive field helps the deeper layers learn better internal feature representations. Average pooling, one of the most common operations, averages the pixel values of the receptive field to comprehensively consider the characteristics of the surrounding pixels. Max pooling extracts the most important information from the pixel values of the receptive field, to avoid the model learning useless features.
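As a small illustration of the hidden-unit formula and of the two pooling operations, the following numpy sketch may help; all sizes are arbitrary choices, not values from the text.

```python
# Sketch of h = F(Wx + b) and of 2x2 average/max pooling (numpy).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.random.randn(8)             # 8 input units
W = np.random.randn(4, 8)          # weights between input and hidden units
b = np.zeros(4)                    # biases of the 4 hidden units
h = sigmoid(W @ x + b)             # non-linear combination F(Wx + b)

# 2x2 pooling with stride 2 over a 4x4 feature map.
fmap = np.arange(16, dtype=float).reshape(4, 4)
blocks = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
avg_pooled = blocks.mean(axis=-1)  # averages each 2x2 receptive field
max_pooled = blocks.max(axis=-1)   # keeps the strongest response per field
```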
Deconvolution and convolution are basically the same operation; the main difference is that deconvolution requires a padding process, and the output needs to be cropped after the last deconvolution. In this paper, the D is equivalent to an encoder with a classifier, and the G is equivalent to a decoder. The convolution features are usually used as input data for the classifier. The performance of the classifier depends not only on the quality of the convolution features, but also on the methods used in classification, and even on the normalization of the intermediate stages. Normalization methods, activation layers, and classification layers are therefore introduced next.
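The shape behavior of the two operations can be sketched as follows; the kernel size, stride, and padding are illustrative assumptions, and nn.ConvTranspose2d is PyTorch's implementation of deconvolution.

```python
# Sketch: convolution compresses the spatial size, transposed convolution
# ("deconvolution") expands it back. Layer settings are illustrative.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)                       # input of shape [N, C, H, W]
conv = nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1)
deconv = nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1)

encoded = conv(x)          # -> [1, 16, 32, 32]: compact latent feature
decoded = deconv(encoded)  # -> [1, 3, 64, 64]: back to the input shape
```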
2.2.2. Normalization
In order to improve the stability and generalization of model training, an increasing number of normalization methods have become available. Normalization is a special function transformation applied to values: assuming there is a normalization function f, an original value x is converted so that a normalized value f(x) is obtained. The purpose of normalization is to make the values satisfy certain characteristics, so as to prevent the entire network from collapsing during training, especially in deep networks. The normalization methods currently applied to neural networks can be divided into three categories:
The first: normalizing the weights on the edges connecting neurons, for example, weight normalization; this adds an L1 or L2 regularization term to the loss function to avoid over-fitting of the model during training.
The second: normalizing the activation values of the layer neurons, such as Batch Normalization (BN) [42], Layer Normalization (LN) [43], Instance Normalization (IN) [44], Group Normalization (GN) [45], and Spectral Normalization (SN) [46].
The third: the fusion of the above methods; for example, switchable normalization [47] (2018) combines LN, BN, and IN, selecting the appropriate normalization in each layer by adding six weight parameters.
Denoting the input image as [N, C, H, W], the main difference between these methods is depicted in Figure 1.
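A minimal sketch of the axes over which each activation-normalization method computes its statistics, assuming an input of shape [N, C, H, W]; the learnable scale/shift parameters are omitted, and the group count for GN is an arbitrary choice.

```python
# Which axes each method normalizes over, for x of shape [N, C, H, W].
import torch

x = torch.randn(8, 16, 32, 32)   # [N, C, H, W]

def normalize(t, dims):
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, keepdim=True, unbiased=False)
    return (t - mean) / torch.sqrt(var + 1e-5)

bn = normalize(x, (0, 2, 3))                       # BN: per channel, over N, H, W
ln = normalize(x, (1, 2, 3))                       # LN: per sample, over C, H, W
inorm = normalize(x, (2, 3))                       # IN: per sample and channel
g = 4                                              # GN: split C into g groups
gn = normalize(x.view(8, g, -1), (2,)).view_as(x)  # normalize within each group
```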
2.2.3. Activation Layer
The activation function is a function that runs on the neurons of the neural network and is responsible for mapping the input of a neuron to its output. A non-linear activation function is introduced in order to improve the network's ability to express deep features of the input data. The commonly used activation layers are sigmoid, tanh, the rectified linear unit (ReLU), and the leaky rectified linear unit (LReLU).
The sigmoid function, also called the logistic function, is used for hidden-layer neuron output; it maps any real number into the interval from 0 to 1, which suits binary classification.
The tanh function, i.e., the hyperbolic tangent function, ranges from −1 to 1.
The sigmoid and tanh above are saturating activation functions, while ReLU and its variants are non-saturating activation functions. The advantage of using a non-saturating activation function is two-fold: first, it solves the so-called vanishing gradient problem to a certain extent; second, it speeds up convergence. ReLU outputs a positive input as it is, and sets a negative input directly to zero. The ReLU computation is performed after the convolution and, like the tanh and sigmoid functions, it is a non-linear activation function. When the input is negative, ReLU is not activated at all.
The formula of the ReLU function is as follows:

$$f(x) = \max(0, x).$$
In contrast, leaky ReLU assigns a non-zero slope to all negative values. The function formula is as follows:

$$f(x_i) = \begin{cases} x_i, & x_i \ge 0, \\ \dfrac{x_i}{a_i}, & x_i < 0, \end{cases}$$

where $a_i$ is a fixed parameter and $i$ indexes the different channels, each with its own $a_i$. The Softmax function is used for multi-class neural network output. It compresses the output of each class into the interval between 0 and 1 and divides by the sum of the outputs, so the result can be interpreted as the probability of the input belonging to each class. The Softmax function is best used at the output layer of a classifier. The function is as follows:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K.$$
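The activation functions above can be summarized in a short numpy sketch; the leaky-ReLU parameter value below is an arbitrary assumption.

```python
# Numpy sketch of the activation functions discussed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturating, output in (0, 1)

def tanh(x):
    return np.tanh(x)                      # saturating, output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # zero for all negative inputs

def leaky_relu(x, a=10.0):
    return np.where(x >= 0, x, x / a)      # non-zero slope 1/a for negatives

def softmax(z):
    e = np.exp(z - z.max())                # subtract max for numerical stability
    return e / e.sum()                     # class probabilities summing to 1

print(softmax(np.array([1.0, 2.0, 3.0])))  # -> approx. [0.090, 0.245, 0.665]
```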
2.3. Architecture of SSGAN
Semi-supervised learning is one of the most prominent applications of GAN. In a purely generative setting, the D computes the real/fake probability used to guide the training of the G, and may be discarded after training. However, in the semi-supervised setting, especially the multi-class one, the D in training provides the probabilities of the data coming from the G's generated data or from the real data. These probabilities then feed back into the G's improvement in learning the features of the real data, so the D and the G improve with each other. This was shown in [38], where the G generated realistic data to boost the D's classification accuracy under semi-supervised learning; conversely, the accurate classification of the D provided feedback on the realism of the data generated by the G.
The G, like a decoder, starts by initializing a random vector with a normal distribution, then maps the vector to a higher dimension by a decoding-like process (such as the decoder in VAE [47]), and finally generates fake data with the same shape as the input data of the D. An auto-encoder computes its loss function by comparing the pixel-wise differences between the two pictures, whereas in a generative adversarial network the loss function is computed through the adversarial process. Obviously, during the adversarial process, the G constantly improves itself to try to gain the trust of the D [48].
The G usually consists of some deconvolution, normalization, and activation layers. The G's input is a randomly generated vector, and the shape of its output is the same as the input of the D. The D of SSGAN is usually not a simple binary classifier: assuming the input data have K categories, the D acts as a K-class classifier in supervised learning and as a binary (real/fake) classifier in unsupervised learning, where the extra class corresponds to the fake data generated by the G. The D includes some convolution, normalization, and fully-connected layers, and ends with a Softmax layer.
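A minimal sketch of such a G and D in PyTorch, assuming 16 x 16 single-channel images and illustrative layer sizes, none of which come from the paper:

```python
# Sketch of SSGAN's G (decoder-like) and D (encoder ending in a
# (K+1)-way classifier). All sizes are illustrative assumptions.
import torch.nn as nn

K = 10                     # number of real classes; index K marks "fake"

G = nn.Sequential(         # decoder: random vector -> image-shaped output
    nn.ConvTranspose2d(100, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),   # -> [N, 1, 16, 16]
)

D = nn.Sequential(         # encoder: image -> K+1 logits for Softmax
    nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, K + 1),   # Softmax is applied inside the loss
)
```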
In unsupervised learning, fake data and real data are fed into the D, which is optimized to discriminate real data from fake data. For the fake data generated by the G, the D tries to judge them as fake; for the real unlabeled data, the D tries to judge them as real. In semi-supervised GAN, the real data usually consist of labeled data and unlabeled data. Here we need to explain the importance of unlabeled data in semi-supervised learning, as depicted in Figure 2. The white and black dots in Figure 2 are labeled data of different classes, and the gray dots are unlabeled data. The figure intuitively conveys the importance of unlabeled data in semi-supervised learning when labeled data are rare.
In the adversarial process, the data received by the D come mainly from three sources:
- (1)
Labeled real data: from the training data set. When training, the D just needs to try to classify them correctly.
- (2)
Unlabeled real data: from the training data set. When training, the D just needs to regard them as real data, and attempt to output a probability as close to 1 as possible.
- (3)
Unlabeled fake data: generated by the G. When training, the D tries to distinguish them from unlabeled real data by outputting a probability as close to 0 as possible.
The loss of the D includes two parts: (1) the loss of unsupervised learning, $L_{\text{unsupervised}}$, and (2) the loss of supervised learning, $L_{\text{supervised}}$. So, the total loss $L$ is the sum of the supervised and unsupervised losses: $L = L_{\text{supervised}} + L_{\text{unsupervised}}$.
Firstly, a standard Softmax classifier classifies a sample x into one of K possible categories. The D of SSGAN takes x as input and outputs a K-dimensional vector of logits $\{l_1, \dots, l_K\}$, which is finally converted into class probabilities by Softmax: $p_{\text{model}}(y = j \mid x) = \exp(l_j) / \sum_{k=1}^{K} \exp(l_k)$. In supervised learning, SSGAN is trained by minimizing the cross-entropy between the labels of the real labeled data and the predictive distribution $p_{\text{model}}(y \mid x)$. In unsupervised learning, the fake data are labeled with a new class $y = K + 1$, and $p_{\text{model}}(y = K + 1 \mid x)$ represents the probability that x is fake. Assuming the ratio between the fake data and the real data is 1:1, the two parts of the loss are

$$L_{\text{supervised}} = -\mathbb{E}_{x, y \sim p_{\text{data}}(x, y)} \log p_{\text{model}}(y \mid x, y < K + 1),$$
$$L_{\text{unsupervised}} = -\left\{ \mathbb{E}_{x \sim p_{\text{data}}(x)} \log\left[1 - p_{\text{model}}(y = K + 1 \mid x)\right] + \mathbb{E}_{x \sim G} \log p_{\text{model}}(y = K + 1 \mid x) \right\}. \quad (9)$$
For unsupervised learning, the D only needs to output real or fake. We therefore use $D(x)$ to denote $1 - p_{\text{model}}(y = K + 1 \mid x)$:

$$D(x) = 1 - p_{\text{model}}(y = K + 1 \mid x). \quad (10)$$
Substituting Formula (10) into the $L_{\text{unsupervised}}$ of Formula (9), we easily find that the unsupervised loss function of SSGAN is exactly the loss of the standard GAN:

$$L_{\text{unsupervised}} = -\left\{ \mathbb{E}_{x \sim p_{\text{data}}(x)} \log D(x) + \mathbb{E}_{z \sim \text{noise}} \log\left[1 - D(G(z))\right] \right\}. \quad (11)$$

Once this loss function is determined, training proceeds by minimizing it. It should be noted that Formulas (9)–(11) come from [38].
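As a sketch, Formulas (9)–(11) can be implemented directly; the function below is illustrative, assuming a D that outputs K + 1 logits with index K marking the fake class, and covering the three data sources listed above.

```python
# Sketch of the SSGAN discriminator loss from Formulas (9)-(11) (PyTorch).
# Tensor names and the (K+1)-logit convention are illustrative assumptions.
import torch
import torch.nn.functional as F

def ssgan_d_loss(logits_lab, y_lab, logits_unl, logits_fake, K):
    # Supervised part: cross-entropy over the K real classes, i.e.
    # p_model(y | x, y < K+1), using the labeled real data (source 1).
    loss_sup = F.cross_entropy(logits_lab[:, :K], y_lab)

    # p_model(y = K+1 | x): probability that the input is fake,
    # for unlabeled real data (source 2) and generated data (source 3).
    p_fake_unl = torch.softmax(logits_unl, dim=1)[:, K]
    p_fake_gen = torch.softmax(logits_fake, dim=1)[:, K]

    # Formula (10) sets D(x) = 1 - p_model(y = K+1 | x), so this is the
    # standard GAN loss of Formula (11).
    loss_unsup = -(torch.log(1.0 - p_fake_unl + 1e-8).mean()
                   + torch.log(p_fake_gen + 1e-8).mean())
    return loss_sup + loss_unsup   # total loss L = L_supervised + L_unsupervised
```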