Article

Learning Kernel Stein Discrepancy for Training Energy-Based Models

Lu Niu, Shaobo Li and Zhenping Li
1 Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China
2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
3 State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(22), 12293; https://doi.org/10.3390/app132212293
Submission received: 18 September 2023 / Revised: 26 October 2023 / Accepted: 31 October 2023 / Published: 14 November 2023

Abstract

The primary challenge in unsupervised learning is training unnormalized density models and then generating similar samples. Few traditional unnormalized models provide a measure of the quality of the trained model, as most models are evaluated by downstream tasks and often involve complex sampling processes. Kernel Stein Discrepancy (KSD), a goodness-of-fit test method, can measure the discrepancy between generated samples and a theoretical distribution; therefore, it can be employed to measure the quality of trained models. We first demonstrate that, under certain constraints, KSD is equal to Maximum Mean Discrepancy (MMD), a two-sample test method. We then propose PT KSD GAN (Kernel Stein Discrepancy Generative Adversarial Network with a Pulling-away Term), which compels generated samples to approximate the theoretical distribution. The generator, functioning as an implicit generative model, employs KSD as its loss to avoid tedious sampling processes. In contrast, the discriminator is trained to identify the data manifold, acting as an explicit energy-based model. To demonstrate the effectiveness of our approach, we conducted experiments on two-dimensional toy datasets. The results show that the generator adeptly captures the accurate density distribution, while the discriminator proficiently recognizes the unnormalized approximate distribution shape. When applied to linear Independent Component Analysis datasets, the log likelihoods of PT KSD GAN improve by about 5‰ over existing methods when the data dimension is less than 30. Furthermore, tests on image datasets reveal that PT KSD GAN handles high-dimensional challenges well, yielding genuinely realistic samples.

1. Introduction

In unsupervised learning, training unnormalized density models and generating similar samples are hot topics and fundamental problems in probability and statistics [1]. In density estimation, we often obtain an unnormalized model, which contains an unknown normalization constant. Computing this normalization constant has been proven to be NP-hard and is intractable for high-dimensional datasets [2]. Only with the density function of the given samples can we conduct downstream tasks such as sample generation and quickest change detection [3].
Traditional unnormalized density models approximate the theoretical probability distribution of the given datasets by maximum likelihood estimation (MLE), which gives rise to a variety of inference algorithms, such as variational inference and MCMC [4]. However, the initial probability distribution and Markov chain structures are hand-designed [5], so these algorithms lack generality. At the same time, the curse of dimensionality and intractable integral terms are barriers in the sampling process [6], greatly restricting the application scenarios.
Various unnormalized models have been proposed in recent years. Energy-based models [7] are widely used because of the freedom and simplicity they offer in designing the energy function. However, in order to normalize the model, the denominator of the density function is usually an integral term. To handle this tedious integral term and its gradient, diverse deep energy models [8,9,10] have been proposed. These models eliminate the integral term indirectly, and the process of generating samples from the Gibbs distribution is very complex. Few of these models provide a measure of the quality of the trained model, as most are evaluated via downstream tasks [11] or via the quality of samples generated from the trained model [12].
The generative adversarial network (GAN) [13] indirectly evaluates generative models using generated samples. GAN removes the probability density function estimation process in the unnormalized model and directly simulates the generation of samples. By avoiding the complex density estimation process and the bias caused by the estimation, GAN makes sample generation no longer a difficult task. GAN has brought about revolutionary evolution in image generation [14] and other domains [15], but the minimax loss function leads to mode collapse and a vanishing gradient during GAN training [16]. Various solutions have been proposed to alleviate these challenges in GAN, such as substituting the regularization terms and loss functions: WGAN [17], Fisher GAN [18], CLC-GAN [19], etc. GAN does not consider the probability distribution of given samples, nor can it provide an explicit estimate. As a result, GAN cannot conduct approximate inference and related downstream tasks, such as Independent Component Analysis.
From a probabilistic and statistical point of view, it is natural to introduce classical hypothesis testing methods into machine learning, and applying hypothesis testing to train unnormalized models is a novel approach favored by researchers. GMMN [20] replaces the critic of GAN with a two-sample test based on MMD [21]. DEAN [22] optimizes models using goodness-of-fit test statistics. χ² GAN [23] measures the discrepancy between two probability distributions via the χ² distance. However, because the density distribution of the samples needs to be given explicitly, KSD is rarely used in training unnormalized models.
In this paper, we propose a new unnormalized model to address the problems mentioned above without sampling from the model, and we can measure the quality of the model directly during the training process. We introduce the PT KSD GAN, an unnormalized model that merges Kernel Stein Discrepancy (KSD) [24,25] as a goodness-of-fit test method with the GAN framework. Our discriminator leverages a deep energy network to assign energy values to each sample, effectively distinguishing between genuine and synthetic samples. Simultaneously, our generator aims to minimize the Kernel Stein Discrepancy between generated samples and the theoretical density distribution. Our work has several main contributions:
(1) To summarize, this paper provides evidence that, under certain assumptions, Kernel Stein Discrepancy as a goodness-of-fit test method is equivalent to Maximum Mean Discrepancy as a two-sample test method.
(2) An adversarial learning algorithm, PT KSD GAN, is proposed. We select a deep energy network as the discriminator so that the density distribution of given samples is explicitly obtained through the energy model. The generator of PT KSD GAN can produce samples that minimize the Kernel Stein Discrepancy between generated samples and the obtained density distribution. We eliminate the tedious calculation of integral terms in the energy model and can easily generate the desired samples through the generator.
(3) For the first time, we demonstrate experimentally that KSD is valid in high-dimensional datasets. In linear Independent Component Analysis datasets, where the data dimension is less than 30, the performance of PT KSD GAN outperforms other unnormalized training methods. Our tests on image datasets mark a significant advancement in addressing high-dimensional data challenges.

2. Related Work

2.1. Score-Matching Generative Models

This paper is about density estimation, where the general problem description is as below:
Problem 1.
Given a finite set of samples $\{x_i\}_{i=1}^{n} \subset \mathcal{X} \subseteq \mathbb{R}^d$ (also named true samples) from an unknown theoretical density distribution $q(x)$, which we approximate by $p(\theta; z)$, where $z$ is a random variable sampled from a known simple distribution $q(z)$, we optimize the parameters $\theta$ so that the output values (also named fake samples) of $p(\theta; z)$ match the theoretical distribution $q(x)$.
The energy model is a commonly used explicit density estimation algorithm. In the energy model, an energy function $e$ transforms every observation into a single scalar, called its "energy". The Gibbs distribution can then be expressed as:
$$q(x) = \frac{\exp(-e(x))}{Z} \tag{1}$$
where $e: \mathbb{R}^d \to \mathbb{R}$ and $Z = \int_x \exp(-e(x))\,dx$ is the integral term, also known as the partition function. When using Formula (1) to compute the explicit density distribution of samples, the partition function is intractable or even unsolvable. To address this, Hyvärinen [26] proposed score matching, which eliminates computation of the partition function. The score is the derivative of the log density of the Gibbs distribution in Formula (1):
$$s_q(x) = \nabla_x \log q(x) = \nabla_x \log \frac{\exp(-e(x))}{Z} = -\nabla_x e(x) - \nabla_x \log Z \tag{2}$$
The second term $\nabla_x \log Z$ does not depend on $x$ and can be ignored; we only need to compute the first term $-\nabla_x e(x)$. Score matching minimizes the squared distance between the scores of the two density distributions, which drives the fake distribution toward the true theoretical distribution. Score-matching generative models [27,28,29] directly train a score network to approach the score of the given samples. When training is complete, the unnormalized density distribution is obtained, from which Langevin dynamics can sample. However, this generative approach has some limitations: the Langevin dynamics sampler is not suitable for mixed distributions, and results are inaccurate in low-density regions.
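As a concrete illustration of this relationship, the sketch below computes the score of an energy model by automatic differentiation; PyTorch is assumed purely for illustration (the paper does not name a framework), and the function name is ours.

```python
import torch

def score_from_energy(energy_net, x):
    """Score of the Gibbs density q(x) = exp(-e(x)) / Z, i.e. s_q(x) = -grad_x e(x);
    the log-partition term has zero gradient with respect to x."""
    if not x.requires_grad:
        x = x.detach().requires_grad_(True)
    e = energy_net(x).sum()                          # sum over the batch -> per-sample gradients
    grad_e, = torch.autograd.grad(e, x, create_graph=True)
    return -grad_e
```

For example, `score_from_energy(mlp, torch.randn(128, 2))` would return a 128 × 2 tensor of per-sample scores for a hypothetical energy network `mlp`.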

2.2. Goodness-of-Fit Test and Generative Models

In the goodness-of-fit test, we are given a finite set of samples $\{y_j\}_{j=1}^{n}$, which are samples from an unknown distribution model $p(\theta; z)$, and an unnormalized theoretical model $q(x)$, which we already know. The goal of the goodness-of-fit test is to determine whether the samples $\{y_j\}_{j=1}^{n}$ generated from the distribution $p(\theta; z)$ coincide with the theoretical density distribution $q(x)$. So, there are two hypotheses:
$$H_0: p = q \qquad\qquad H_1: p \neq q \tag{3}$$
We reject or retain hypothesis $H_0$ based on the selected threshold and statistic. It is rational to apply the goodness-of-fit test to generative models to measure the discrepancy between fake samples and the theoretical density distribution. References [22,30] both introduce the goodness-of-fit test into deep generative models, but they provide no direct measure of the quality of the trained model, which can only be assessed via the quality of the generated samples. At the same time, without an accurate density function, they cannot perform the related tasks of approximate inference. Also, they construct complicated autoencoder networks, whereas we only use the simple convolutional architectures of DCGAN [31], which makes sample generation and approximate inference simple tasks.

3. Maximum Mean Discrepancy and Kernel Stein Discrepancy

In this section, we first introduce two methods of hypothesis testing: Maximum Mean Discrepancy (MMD) and Kernel Stein Discrepancy (KSD). We then demonstrate the relationship between them.

3.1. Maximum Mean Discrepancy: A Two-Sample Test Method

From the perspective of hypothesis testing, the most popular solution paradigm for comparing density distributions is the two-sample test. During the two-sample test, samples are drawn from the two density distributions separately. Then, we verify whether $p(\theta; z)$ equals $q(x)$ by calculating the value of a statistic. Owing to its spatial mapping capability, Maximum Mean Discrepancy (MMD) [21] is a valid and widely used two-sample test method. MMD maps the samples into a high-dimensional, possibly infinite-dimensional, feature space (a Reproducing Kernel Hilbert Space, RKHS) to measure their distance. For all $f \in \mathcal{H}$, $\mathbb{E}_x[f] = \langle f, \mu_p \rangle$, where $\mu_p \in \mathcal{H}$ is called the mean embedding of $p$. If the distance between the mean embeddings of two density distributions is less than a certain threshold value, we can be confident that the two sets of samples are generated from the same distribution, namely $q(x) = p(\theta; z)$.
MMD employs kernel functions to transform the spatial coordinate basis. We denote by $k(x, x')$ a positive definite kernel, i.e., a function $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$. $\mathcal{H}_k$ is the Reproducing Kernel Hilbert Space (RKHS) defined by $k(x, x')$; an RKHS is an inner product space with the reproducing property, separability, and completeness. A finite set of samples $\{x_i\}_{i=1}^{n} \subset \mathcal{X} \subseteq \mathbb{R}^d$ is drawn from an unknown density distribution $q(x)$. Another set of samples is defined as $\{y_j\}_{j=1}^{m} \subset \mathcal{Y} \subseteq \mathbb{R}^d$ (also named fake samples), where each $y_j$ is an independent, identically distributed sample from the density distribution $p(\theta; z)$. The square of MMD can be represented by the mean embeddings $\mu_p$ and $\mu_q$ and written as follows:
$$\mathrm{MMD}_k^2 = \|\mu_p - \mu_q\|_k^2 = \mathbb{E}_q[k(x_i, x_i')] - 2\,\mathbb{E}_{q,p}[k(x_i, y_j)] + \mathbb{E}_p[k(y_j, y_j')] \tag{4}$$
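A small sketch of how Formula (4) is typically estimated from two sample sets with an RBF kernel is given below; PyTorch is used for illustration, the function names are ours, and the simple biased (V-statistic) estimator is shown.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return torch.exp(-torch.cdist(a, b) ** 2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of Formula (4):
    MMD_k^2 = E_q[k(x, x')] - 2 E_{q,p}[k(x, y)] + E_p[k(y, y')]."""
    return (rbf_kernel(x, x, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean()
            + rbf_kernel(y, y, sigma).mean())
```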

3.2. Kernel Stein Discrepancy: A Goodness-of-Fit Test Method

In Formula (2), $s_q(x) = \nabla_x \log q(x)$ is also called the Stein score function. The condition is that $q(x)$ is a continuous, differentiable density function, so that its derivative exists. In [25], a continuously differentiable function $f: \mathcal{X} \to \mathbb{R}$ is said to be in the Stein class of $p$ when it satisfies the following:
$$\int_{x \in \mathcal{X}} \nabla_x \big(f(x)\, p(x)\big)\, dx = 0 \tag{5}$$
When $F(x) = [f_1(x), \ldots, f_r(x)]$ is a vector-valued function, $F(x)$ is in the Stein class of $p(x)$ if, for every $i \le r$, $f_i$ meets condition (5). Assume $\mathcal{X} = \mathbb{R}^d$. Integrating the left-hand side of Equation (5) by parts, Formula (5) is transformed into the following:
$$\lim_{\|x\| \to \infty} f(x)\, p(x) = 0 \tag{6}$$
The Stein operator of $p$ is denoted $\mathcal{A}_p f(x)$; it is a linear operator acting on $f(x)$, where $f(x)$ is in the Stein class of $p(x)$. $\mathcal{A}_p f(x)$ is defined as follows:
$$\mathcal{A}_p f(x) = s_p(x) f(x) + \nabla_x f(x) \tag{7}$$
Inspired by the identity $\mathbb{E}_p[\mathcal{A}_q f(x)] = \mathbb{E}_p[(s_q(x) - s_p(x)) f(x)^T]$, Kernelized Stein Discrepancy (KSD) [24,25] was proposed to measure the distance between two distributions. KSD is defined as follows:
$$S(q, p) = \mathbb{E}_p\big[(s_q(y_j) - s_p(y_j))^T k(y_j, y_j')\,(s_q(y_j') - s_p(y_j'))\big] \tag{8}$$
In the above equation, $s_q(y_j)$ and $s_p(y_j)$ are the score functions of $q$ and $p$; they are unknown in most unnormalized models, where we only have samples in hand. In [25], the authors eliminated the explicit expression of $p$ by applying the constraints of the Stein class. The kernel function $k(y_j, y_j')$ is a particular continuously differentiable function $f$, and it is easy to prove that if $k(y_j, y_j')$ has continuous second-order partial derivatives, then $k(y_j, y_j')$ is in the Stein class of $p$; in other words, $k(y_j, y_j')$ meets the condition in Equation (5). Then:
$$\mathbb{E}_p\big[k(y_j, y_j')\,(s_q(y_j') - s_p(y_j'))\big] = \mathbb{E}_p\big[k(y_j, y_j')\, s_q(y_j') + \nabla_{y_j'} k(y_j, y_j')\big] \tag{9}$$
Let us plug Formula (9) into Formula (8) and simplify:
$$\begin{aligned}
S(q, p) &= \mathbb{E}_p\big[u_q(y_j, y_j')\big]\\
u_q(y_j, y_j') &= s_q(y_j)^T k(y_j, y_j')\, s_q(y_j') + s_q(y_j)^T \nabla_{y_j'} k(y_j, y_j')\\
&\quad + \nabla_{y_j} k(y_j, y_j')^T s_q(y_j') + \mathrm{trace}\big(\nabla_{y_j} \nabla_{y_j'} k(y_j, y_j')\big)
\end{aligned} \tag{10}$$
We can compute $S(q, p)$ without the density distribution $p$; we only need the score function $s_q(y_j)$ of the density distribution $q$. Moreover, when $k(y_j, y_j')$ is a strictly positive definite kernel, it is easy to show that $p = q$ if and only if $S(q, p) = 0$. In [25], KSD is applied to bootstrap goodness-of-fit testing to measure differences between two probability distributions. That is, we can calculate the KSD between the density distribution $q$ and samples from $p$ to determine whether $q$ equals $p$.
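For concreteness, the following sketch estimates $S(q, p)$ from Formula (10) with an RBF kernel, whose gradients and trace term have simple closed forms; PyTorch is used for illustration, and the function name and the Gaussian sanity check are ours, not the authors'.

```python
import torch

def ksd_vstat(y, score_q, sigma=1.0):
    """V-statistic estimate of S(q, p) = E_p[u_q(y, y')] (Formula (10)),
    given samples y ~ p and the score function s_q of the target q (RBF kernel)."""
    n, d = y.shape
    sq = score_q(y)                                   # (n, d): s_q(y_i)
    diff = y[:, None, :] - y[None, :, :]              # y_i - y_j
    r2 = (diff ** 2).sum(-1)                          # ||y_i - y_j||^2
    k = torch.exp(-r2 / (2 * sigma ** 2))             # kernel matrix
    grad1 = -diff / sigma ** 2 * k[..., None]         # grad_{y_i} k(y_i, y_j)
    grad2 = -grad1                                    # grad_{y_j} k(y_i, y_j)
    term1 = k * (sq @ sq.T)                           # s_q(y_i)^T k s_q(y_j)
    term2 = (sq[:, None, :] * grad2).sum(-1)          # s_q(y_i)^T grad_{y_j} k
    term3 = (grad1 * sq[None, :, :]).sum(-1)          # grad_{y_i} k^T s_q(y_j)
    term4 = k * (d / sigma ** 2 - r2 / sigma ** 4)    # trace(grad_{y_i} grad_{y_j} k)
    return (term1 + term2 + term3 + term4).mean()

# Sanity check: samples from a standard normal against the standard normal target,
# whose score is s_q(y) = -y; the statistic is small when p matches q.
samples = torch.randn(200, 2)
print(ksd_vstat(samples, lambda y: -y))
```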

3.3. The Connection between MMD and KSD

In this section, we will discuss and demonstrate the connection between MMD and KSD below.
Lemma 1.
If $k(y_j, y_j')$ is a strictly positive definite kernel, then the kernel $\tilde{k}(y_j, y_j')$ defined below is also strictly positive definite.
$$\tilde{k}(y_j, y_j') = (s_q(y_j) - s_p(y_j))^T k(y_j, y_j')\,(s_q(y_j') - s_p(y_j')) \tag{11}$$
Proof.
It is easy to demonstrate this using the definition of strictly positive definite kernel and eigenvalue decomposition. □
Assume $k(y_j, y_j')$ is a manually specified positive definite kernel. Replacing the kernel function $k$ in Formula (4) with $\tilde{k}(y_j, y_j')$, the square of MMD can be written as below:
$$\begin{aligned}
\mathrm{MMD}_{\tilde{k}}^2 &= \mathbb{E}_q\big[(s_q(x_i) - s_p(x_i))^T k(x_i, x_i')\,(s_q(x_i') - s_p(x_i'))\big]\\
&\quad - 2\,\mathbb{E}_{p,q}\big[(s_q(x_i) - s_p(x_i))^T k(x_i, y_j)\,(s_q(y_j) - s_p(y_j))\big]\\
&\quad + \mathbb{E}_p\big[(s_q(y_j) - s_p(y_j))^T k(y_j, y_j')\,(s_q(y_j') - s_p(y_j'))\big]
\end{aligned} \tag{12}$$
Lemma 2.
If $k(y_j, y_j')$ is in the Stein class of both $p$ and $q$, then:
$$\mathrm{MMD}_{\tilde{k}}^2 = \mathbb{E}_p\big[(s_q(y_j) - s_p(y_j))^T k(y_j, y_j')\,(s_q(y_j') - s_p(y_j'))\big] \tag{13}$$
Lemma 2 is proved in Appendix A. The right-hand sides of Formula (8) and Formula (13) are identical, so Maximum Mean Discrepancy is equivalent to Kernelized Stein Discrepancy when the kernel satisfies the conditions in Lemma 2.

4. Learning the Kernel Stein Discrepancy of Energy-Based Models

In the above section, we introduced two methods of hypothesis testing. MMD has been widely used in various unnormalized models [32,33]. However, because the theoretical density distribution of $q$ must be estimated, the use of KSD in unnormalized models is extremely limited, having been applied only to mixture-of-Gaussian models [34]. In this article, we propose an algorithm that solves Problem 1 by using Kernel Stein Discrepancy to train unnormalized density models.
The bottleneck in applying KSD is undoubtedly the explicit density estimate of $q$. In [35], two generative models were proposed for mutual calibration between an explicit and an implicit generative model, where the difference between the explicit energy-based model and the implicit generative model is measured through the Stein discrepancy. However, its objective function includes three kinds of statistical distances, which increases the computational complexity, and no experimental results on likelihood inference are given. Inspired by [35] and the right-hand side of the third equality in Formula (2), it is convenient to obtain the explicit Gibbs distribution of samples in explicit energy-based models. Accordingly, we replace $s_q(y_j)$ in Formula (10) with $-\nabla_{y_j} e(y_j)$. The Kernelized Stein Discrepancy of energy-based models is defined as follows:
$$\begin{aligned}
\mathrm{KSD}_e(q, p) &= \mathbb{E}_p\big[u_q(y_j, y_j')\big]\\
u_q(y_j, y_j') &= \nabla_{y_j} e(y_j)^T k(y_j, y_j')\, \nabla_{y_j'} e(y_j') - \nabla_{y_j} e(y_j)^T \nabla_{y_j'} k(y_j, y_j')\\
&\quad - \nabla_{y_j} k(y_j, y_j')^T \nabla_{y_j'} e(y_j') + \mathrm{trace}\big(\nabla_{y_j} \nabla_{y_j'} k(y_j, y_j')\big)
\end{aligned} \tag{14}$$
In the above formula, the explicit density estimate of $q$ is restricted to the Gibbs distribution of the energy-based model, and $e$ is the energy function evaluated on the input samples. Although the design of the energy function $e$ is unconstrained, it is hard to find the right energy function for an unknown density distribution. Because we know nothing about the distribution of the given samples, we have to search for the right energy function $e$ within a large function class. In the vast majority of cases, a deep neural network is selected as the energy function, owing to its strong nonlinear fitting ability and large range of representable values; our model likewise selects a deep energy network as the energy function $e$. The kernel $k(y_j, y_j')$ in Formula (14) is a strictly positive definite kernel, which maps input samples to the Reproducing Kernel Hilbert Space and returns their inner product. The choice of kernel function for high-dimensional data is always a difficult issue: different kernel functions map input samples to different feature spaces, and the sample features and their inner product values vary accordingly. Here, we select the RBF (Radial Basis Function) kernel, also known as the Gaussian kernel. The RBF kernel is defined as follows:
$$k(x, x') = \exp\!\left(-\frac{1}{2\sigma^2}\,\|x - x'\|^2\right) \tag{15}$$
The RBF kernel meets the conditions described in Lemma 2 and has continuous second-order partial derivatives. The hyperparameter $\sigma$ needs to be set manually. The RBF kernel maps samples to an infinite-dimensional space, has few parameters, and has low model complexity; it is therefore the kernel of choice in many application scenarios. We can obtain the kernel matrix of the input samples by evaluating the RBF kernel, and $\nabla_{y_j} k(y_j, y_j')$, $\nabla_{y_j'} k(y_j, y_j')$, and $\mathrm{trace}(\nabla_{y_j} \nabla_{y_j'} k(y_j, y_j'))$ can be obtained by differentiating the kernel matrix. Through reverse-mode differentiation of the deep neural network, the values of $\nabla_{y_j} e(y_j)$ and $\nabla_{y_j'} e(y_j')$ can be obtained.
So, each term of $\mathrm{KSD}_e(q, p)$ in Formula (14) can be calculated, and it measures the degree of matching between the input samples and the density distribution $q$ represented by the energy-based model. The Gibbs distribution of the energy-based model removes the integral term through differentiation and greatly simplifies the computation. We now analyze the application of KSD in training unnormalized models. The input samples $\{y_j\}_{j=1}^{m} \subset \mathcal{Y} \subseteq \mathbb{R}^d$ are a finite set of fake samples from a density distribution $p$, and samples from $p$ should be readily available. The density distribution $q$ is represented by the energy-based model, which $p$ aims to approximate. Since we know nothing about $q$ at the beginning, the initial $p$ is necessarily far from $q$. We calculate the $\mathrm{KSD}_e$ of the energy-based model via Formula (14) and iteratively update $p$ and its fake samples, which makes $p$ slowly approach $q$. Finally, when the value of $\mathrm{KSD}_e$ falls below a given small threshold, we consider $p$ and $q$ to be the same density distribution.
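Building on the score_from_energy and ksd_vstat sketches given earlier, the computation described above could be assembled as follows; this is an illustrative composition, not the authors' released code.

```python
def ksd_e(energy_net, y, sigma=1.0):
    """KSD_e(q, p) of Formula (14): the score of the Gibbs density
    q(y) proportional to exp(-e(y)) is -grad_y e(y), obtained by backpropagation
    through the energy network, then plugged into the KSD V-statistic."""
    return ksd_vstat(y, lambda s: score_from_energy(energy_net, s), sigma)
```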

5. Training Unnormalized Models with KSD

In Formula (14), the fake samples $y_j, y_j'$ and the energy model $e$ need to be provided. In this paper, to avoid the bias caused by samplers and to reduce the cost of sample generation, we adopt the generator network of a generative adversarial network to produce the fake samples $y_j, y_j'$. Meanwhile, we choose the explicit energy-based model as the estimate of the unnormalized density distribution $q$; it also serves as the discriminator of the generative adversarial network.

5.1. Generative Adversarial Network (GAN)

The generative adversarial network is a popular method for training generative models. A generative adversarial network (GAN) [13] involves a generator G and a discriminator (or critic) D. The goal of the generator is to transform latent codes $z \sim q(z)$ into generated samples that the discriminator cannot distinguish from real data samples, while the goal of the discriminator is to distinguish real data samples from generated samples as well as possible. With i.i.d. data samples, the optimal generator parameters $\theta_g$ and discriminator parameters $\theta_d$ are the solution to the following minimax game:
$$L = \min_{\theta_g} \max_{\theta_d} V(\theta_g, \theta_d) = \mathbb{E}_{x \sim q(x)}\big[\log D(x; \theta_d)\big] + \mathbb{E}_{z \sim q(z)}\big[\log\big(1 - D(G(z; \theta_g); \theta_d)\big)\big] \tag{16}$$
The minimax game highlights the adversarial nature of the training process. The discriminator, parameterized by $\theta_d$, aims to maximize its ability to differentiate between real and generated data, while the generator, parameterized by $\theta_g$, endeavors to minimize the discriminator's accuracy by producing increasingly realistic data. The term $\mathbb{E}_{x \sim q(x)}[\log D(x; \theta_d)]$ is the expected log probability, over real data samples $x$ drawn from the true distribution $q(x)$, that the discriminator $D$ correctly identifies them as genuine; the discriminator seeks to maximize this value, ensuring a high probability of correctly classifying real samples. The term $\mathbb{E}_{z \sim q(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$ is the expected log probability, for noise variables $z$ sampled from a distribution $q(z)$, that the discriminator $D$ correctly labels the data generated by the generator $G$ from the noise $z$ as fake. The discriminator aims to maximize this value so as to identify and label the artificially generated data, while the generator seeks to minimize it, thereby improving its ability to generate data that the discriminator misclassifies as real. Together, these terms formulate the objective function of the GAN, capturing the adversarial relationship between the generator and the discriminator during training.
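As a concrete reading of Formula (16), a minimal sketch of the two loss terms is given below; PyTorch is assumed for illustration, D is assumed to end in a sigmoid so its output lies in (0, 1), and the function name is ours.

```python
import torch

def vanilla_gan_losses(D, G, x_real, z):
    """Losses implied by Formula (16): the discriminator maximizes
    log D(x) + log(1 - D(G(z))); the generator minimizes log(1 - D(G(z)))."""
    x_fake = G(z)
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1.0 - D(x_fake.detach())).mean())   # maximize V  ->  minimize -V
    g_loss = torch.log(1.0 - D(x_fake)).mean()                  # generator's term of V
    return d_loss, g_loss
```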

5.2. Generative Adversarial Networks for Unnormalized Density Models

The discriminator of a traditional GAN solves a classification problem: it is a binary (or multi-class) classification network. Here, we study the use of the discriminator in unnormalized density estimation.
In [36], EBGAN was proposed to find the data manifold based on an energy model. It assigns low energy values to the data manifold and high energy values to other data regions. The mission of the discriminator is to discern whether the data come from the true sample manifold, while the generator is trained to approximate the data manifold to confuse the discriminator, so that the discriminator loses its discernment. The adversarial process is defined by the losses:
$$L_D = D(x_i) + \big[m - D(G(z))\big]^{+}, \qquad L_G = D(G(z)) \tag{17}$$
$L_D$ is the loss for the discriminator; $D(x_i)$ is the discriminator's output for a real data sample $x_i$; $D(G(z))$ is the discriminator's output for a generated sample $G(z)$; $m$ is a positive margin; and $[\cdot]^{+}$ denotes the positive part function, defined as $[y]^{+} = \max(0, y)$. Following this adaptation, the model is equipped to tackle unnormalized density problems. However, the autoencoder employed within the discriminator proved computationally intensive, and the Stein score function $s_q(x)$ of the autoencoder is hard to obtain.
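A minimal sketch of the EBGAN losses of Formula (17), assuming D outputs a scalar energy per sample (PyTorch, illustrative names only):

```python
import torch

def ebgan_losses(D, G, x_real, z, m=1.0):
    """Formula (17): L_D = D(x) + [m - D(G(z))]^+ and L_G = D(G(z))."""
    x_fake = G(z)
    d_loss = D(x_real).mean() + torch.relu(m - D(x_fake.detach())).mean()  # [.]^+ via relu
    g_loss = D(x_fake).mean()
    return d_loss, g_loss
```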

5.3. PT KSD GAN (Kernel Stein Discrepancy Generative Adversarial Network with a Pulling-Away Term)

In the above section, we analyzed the disadvantages of the autoencoder architecture of EBGAN. Consequently, we replace it with a standard neural network architecture. To ensure stability, we set the hyperparameter $m$ in Formula (17) to 1. The output of the discriminator is regarded as the energy value of a sample, which defines the numerator of the explicit Gibbs density estimate. As stated in [25], the Kernel Stein Discrepancy (KSD) can measure the distance between different density distributions; that is, minimizing the $\mathrm{KSD}_e$ of the generator is equivalent to driving the fake samples to coincide with the theoretical distribution of the true samples. The modified formulations of $L_D$ and $L_G$ are presented as follows:
$$L_D = D(x_i) + \big(1 - D(G(z))\big), \qquad L_G = \mathrm{KSD}_e(q, p) \tag{18}$$
With the above losses, the generator and the discriminator are iteratively updated until, finally, the discriminator cannot tell whether an input sample is real or fake. We call the adversarial process defined by Formula (18) KSD GAN. During training, the discriminator acts as an energy-based model that discovers the explicit Gibbs distribution of the real samples. At the beginning, fake samples generated by the generator are very far from the density region of the real samples, and the discriminator can easily tell fake samples from real ones. Over alternating iterations, the samples generated by the generator steadily narrow the $\mathrm{KSD}_e$ to the theoretical distribution of the real samples, which is represented by the energy model of the discriminator. As the number of iterations increases and the Kernel Stein Discrepancy shrinks, the generated samples approach the Gibbs distribution of the true samples, fooling the discriminator into treating them as real.
In fact, this iterative process of KSD GAN can also be regarded as a confrontation between an implicit generative model and an explicit generative model. The discriminator is trained as an explicit generative model whose density is the Gibbs distribution based on the energy model; a sampler could, of course, be used to draw samples from this Gibbs distribution if desired. As an implicit generative model, the generator bypasses density estimation and generates imitation samples directly by reducing $\mathrm{KSD}_e$.
GAN is notoriously difficult to train, and mode collapse often occurs. This is because the density distribution of high-dimensional datasets is widely dispersed, with large gaps between regions, so it is very possible for the generator to find a local optimum and then stop improving; the generator then produces only one or a few types of samples, and sample diversity is insufficient. Inspired by [36,37], reducing the similarity and increasing the orthogonality of generated samples in energy-based models can avoid mode collapse and speed up convergence. We therefore add a repelling regularizer, named the Pulling-away Term (PT), to the generator loss:
$$PT(S) = \frac{1}{M(M-1)} \sum_{i} \sum_{j \neq i} \left( \frac{S_i^T S_j}{\|S_i\|\,\|S_j\|} \right)^2 \tag{19}$$
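A short sketch of the PT of Formula (19), applied here directly to a batch of generated samples (the function name and the choice to use the raw samples as S are our assumptions):

```python
import torch
import torch.nn.functional as F

def pulling_away_term(S):
    """Formula (19): mean squared cosine similarity over all distinct pairs
    of the M rows of S."""
    s = F.normalize(S, dim=1)                      # S_i / ||S_i||
    cos2 = (s @ s.T) ** 2                          # squared cosine similarities
    M = S.shape[0]
    return (cos2.sum() - cos2.diagonal().sum()) / (M * (M - 1))  # drop i = j terms
```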
After adding the regularization term to the generator, the losses $L_D$ and $L_G$ of KSD GAN are defined as follows:
$$L_D = D(x_i) + \big(1 - D(G(z))\big), \qquad L_G = \mathrm{KSD}_e(q, p) + \alpha \cdot PT \tag{20}$$
In Formula (20), KSD GAN with the regular term PT added is called PT KSD GAN. Our model structure of PT KSD GAN is shown in Figure 1.
Figure 1 shows that our model is an enhanced GAN framework incorporating the Kernelized Stein Discrepancy. The data processing steps are described below, and a compact training-step sketch follows the list.
1. Two data streams, denoted as $x_i$ and $y_j$, enter the system. The former is directed towards the discriminator, while the latter interfaces with the generator.
2. Acting as the GAN's energy model, the discriminator $D(x)$ processes the data $x$. Its primary responsibility is to differentiate between the real samples $x_i$ and the fake samples $y_j$ produced by the generator.
3. Simultaneously, the generator $G(z)$ creates synthetic samples from an input noise vector $z$. This noise vector serves as a seed, ensuring diversity in the generated samples.
4. The fake samples are passed to the PT Kernel Stein Discrepancy module. The integration of KSD, a non-parametric measure of discrepancy between distributions, quantitatively evaluates the deviation of the generated samples from the target distribution. This evaluation provides valuable feedback to the generator, steering it towards optimal data synthesis.
5. The outcomes from the PT Kernel Stein Discrepancy and Mean Absolute Error modules are combined to furnish a composite gradient that guides the generator's training. By assimilating feedback not only from the discriminator but also from an external discrepancy measure (KSD), the generator's training becomes more informed, focused, and robust against mode collapse.
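Putting the pieces together, one alternating update of PT KSD GAN might look like the sketch below, reusing the ksd_e and pulling_away_term functions defined earlier; the default hyperparameters mirror the toy-data settings reported in Section 6.1, and everything else (the function name, the relu used to keep the 1 − D(G(z)) term non-negative) is our illustrative choice rather than the authors' code.

```python
import torch

def pt_ksd_gan_step(D, G, x_real, opt_d, opt_g, z_dim, alpha=0.8, sigma=1.0):
    """One alternating update of PT KSD GAN (Formula (20)). D outputs a scalar
    energy per sample; G maps noise z to fake samples."""
    batch = x_real.shape[0]

    # Discriminator step: L_D = D(x) + (1 - D(G(z))), clamped EBGAN-style at 0.
    z = torch.randn(batch, z_dim)
    y_fake = G(z).detach()
    d_loss = D(x_real).mean() + torch.relu(1.0 - D(y_fake)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: L_G = KSD_e(q, p) + alpha * PT, with q given by D's energies.
    z = torch.randn(batch, z_dim)
    y_fake = G(z)
    g_loss = ksd_e(D, y_fake, sigma) + alpha * pulling_away_term(y_fake)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```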

6. Experiment

We introduce the PT KSD GAN for training unnormalized models (Figure 1). Using an energy-based discriminator, it assigns energy values to samples, while the generator, minimizing Kernel Stein Discrepancy, produces samples reflecting true density. Initial tests on 2D toy datasets (Figure 2) validate the generator’s alignment with true density and sample diversity.
We further validate using linear Independent Component Analysis (ICA), a standard for unnormalized model training. ICA decomposes signals into independent sources. Our focus was on retrieving original latent variables from mixtures, with the results shown in Table 1.
Finally, PT KSD GAN’s performance on image datasets like MNIST and CIFAR-10, using a DCGAN framework, is depicted in Figure 3, indicating realistic image generation with room for refinement.

6.1. Toy Datasets

We initiate our study by assessing PT KSD GAN's performance on two-dimensional toy datasets. These datasets, derived from Sklearn's library, provide a range of shapes in two-dimensional data. For the generator, we use inputs $\{z_1, z_2, \ldots, z_N\}$ sampled from a standard normal distribution. Both the generator and the discriminator are structured as two-layer multi-layer perceptrons, producing two-dimensional outputs. Key hyperparameters include an RBF kernel bandwidth $\sigma$ of 1.0; a regularization coefficient $\alpha$ of 0.8; and Adam Optimizer settings with a learning rate of 0.002, $\beta_1$ of 0.5, and $\beta_2$ of 0.99. We standardize the batch size to 1000 across all toy datasets and cap training at 10,000 iterations. The outcomes of this training are illustrated in Figure 2.
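A possible setup matching the description above is sketched below, reusing the training step from Section 5.3; the hidden width, the latent dimension of z, the choice of toy dataset, and reading the discriminator's output as a scalar energy are our assumptions, since the paper does not report these details.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05)          # one of Sklearn's 2D toy shapes
x_real = torch.tensor(X, dtype=torch.float32)

G = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 2))   # two-layer MLP generator
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))   # two-layer MLP energy net

opt_g = torch.optim.Adam(G.parameters(), lr=0.002, betas=(0.5, 0.99))
opt_d = torch.optim.Adam(D.parameters(), lr=0.002, betas=(0.5, 0.99))

for step in range(10000):
    pt_ksd_gan_step(D, G, x_real, opt_d, opt_g, z_dim=2, alpha=0.8, sigma=1.0)
```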
Figure 2 delineates the density distributions: real samples are shown on the left, the energy-based model’s output (or the discriminator) in the center, and the generator-produced sample density on the right. Notably, the energy-based model provides unnormalized energy values, leading to density estimates that capture only the approximate shape of the true samples’ density distribution. While potential exists to use samplers like the Hamiltonian Monte Carlo, biases from initial probability distributions and other constraints deterred us from this approach.
Our approach employs specialized networks to produce fake samples, aiming to minimize the Kernel Stein Discrepancy against the true distribution. As evident in Figure 2, the generator’s outputs align closely with the true density distribution, ensuring sample diversity and authenticity. The PT KSD GAN training process can be viewed as a competitive interaction between two generative networks. While the discriminator captures the true samples’ explicit Gibbs distribution, the generator focuses on producing realistic samples without explicit density distribution searches.

6.2. Linear ICA

Linear Independent Component Analysis (ICA) [26] is a statistical method designed to separate a multivariate signal into independent non-Gaussian signals. It is particularly useful in scenarios where multiple signals overlap, and the objective is to discern the original signals from their mixtures. Building on this, we further explored the efficacy of PT KSD GAN within the context of the linear ICA model. This model serves as a benchmark in the evaluation of algorithms tailored for training unnormalized models.
The generative process of the true samples is relatively simple. The latent variables $s_i$, $i = 1, \ldots, n$, are independent and identically distributed, sampled from the standard Laplace distribution, that is, $s \sim \mathrm{Laplace}(0, 1)$. Then, a linear transformation is applied:
$$x = W s \tag{21}$$
The model parameter $W$ is an $N \times N$ matrix whose determinant is nonzero. Under Formula (21), we can deduce the log density of $x$:
$$\log p(x; W) = \log p_s(W^{-1} x) - \log\lvert\det W\rvert \tag{22}$$
Here, $p_s(\cdot)$ is the probability density function of the latent variables $s$; thus, (22) can be used as an unnormalized likelihood to evaluate generative models. There are two hyperparameters: the bandwidth parameter $\sigma$ and the regularization coefficient $\alpha$. The bandwidth parameter $\sigma$ of the RBF kernel is self-adapting according to the number of input samples and the sample feature dimension. Sample dimensions $N$ range from 2 to 50. The regularization coefficient is set to 0.8. We set the learning rate of the Adam Optimizer to 0.001 and its two hyperparameters to $\beta_1 = 0.3$ and $\beta_2 = 0.9$. The number of iterations is 100,000, and the batch size is set to 1000. See Table 1 for the results of linear ICA with PT KSD GAN.
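For reference, the benchmark data of Formula (21) and the evaluation likelihood of Formula (22) could be set up as follows (an illustrative sketch; the function names are ours):

```python
import math
import torch

def make_ica_data(W, n):
    """True samples of the linear ICA model: s ~ Laplace(0, 1) i.i.d., x = W s."""
    d = W.shape[0]
    s = torch.distributions.Laplace(0.0, 1.0).sample((n, d))
    return s @ W.T

def ica_log_likelihood(x, W):
    """Formula (22): log p(x; W) = log p_s(W^{-1} x) - log|det W|,
    with the standard Laplace log density log p_s(s) = -|s| - log 2 per coordinate."""
    s = x @ torch.inverse(W).T
    log_ps = (-s.abs() - math.log(2.0)).sum(dim=1)
    return log_ps - torch.slogdet(W).logabsdet
```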
Table 1 shows the log likelihoods of various algorithms: score matching [26], maximum likelihood [38], Learned Stein Discrepancy (LSD) [39], and Noise-Contrastive Estimation (NCE) [40]. For low-dimensional samples, we find that PT KSD GAN performs best. As the dimension increases, the performance of PT KSD GAN is slightly inferior to that of maximum likelihood and LSD; the reason is that the manual selection of kernel functions makes it impossible to choose the optimal kernel for the current dataset. On the whole, our approach achieves performance comparable to maximum likelihood.

6.3. Preliminary Image Results

To validate our approach's applicability to training more expressive unnormalized models and to generating samples, we conducted experiments assessing PT KSD GAN's performance on high-dimensional image datasets, specifically MNIST and CIFAR-10, each with a training set of 50K images. Our experimental setup adhered to the DCGAN [31] model architecture, employing simple convolutional neural networks for both the generator and the discriminator. The discriminator computes an energy value for each image vector, and we average these values to obtain the energy of an image batch. Optimization was performed with the Adam Optimizer, using a learning rate of 0.00005 and hyperparameters $\beta_1 = 0.5$ and $\beta_2 = 0.999$. We standardized the batch size to 64 across all datasets.
Given the intricate nature of the image datasets, we restricted our regularization term calculations to the initial twenty images, enhancing computational efficiency. This means that the value of M in Formula (20) remains consistent at 20 for each training batch. This approach significantly elevates the diversity of the generated images. The results, as depicted in Figure 3, show that the images produced from MNIST and CIFAR-10 training are convincingly realistic, though there is potential for refining image precision in future iterations.
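In code, the restriction described above amounts to something like the following fragment; `images`, `z_dim`, and the DCGAN-style noise shape are assumptions, and pulling_away_term is the sketch from Section 5.3.

```python
import torch

def image_batch_losses(D, G, images, z_dim=100):
    """Illustrative fragment: the batch energy is the mean per-image energy, and the
    PT of Formula (20) is computed only on the first 20 generated images (M = 20)."""
    batch_energy = D(images).mean()                         # average energy of the real batch
    fake = G(torch.randn(images.shape[0], z_dim, 1, 1))     # DCGAN-style noise shape (assumed)
    pt = pulling_away_term(fake[:20].flatten(start_dim=1))  # flatten images to vectors
    return batch_energy, fake, pt
```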
The images displayed in the figure represent the generator’s outputs within the adversarial network. While it is feasible to use a sampler to derive samples from the discriminator’s trained energy model, we prioritize efficiency in large-scale sample generation and, thus, opt for the generation network over the sampler. Admittedly, the quality of these generated samples is not the best in the field. However, they mark a pioneering demonstration of KSD’s ability to manage high-dimensional data evaluation and generation. Future endeavors could explore intricate network architectures and sophisticated samplers to handle even more complex image datasets.

7. Conclusions

In this paper, we demonstrated that Kernel Stein Discrepancy is an efficient way to train unnormalized density models and to generate samples from these models. In Section 3, we connected the two-sample test method and the goodness-of-fit test by demonstrating that Kernel Stein Discrepancy is equal to Maximum Mean Discrepancy under certain assumptions. We introduced a new adversarial learning paradigm to train the unnormalized density model, named PT KSD GAN. The discriminator was trained as an energy model to determine the explicit Gibbs density distribution of real samples. Low energies were assigned to regions within the true data manifold, while high energies were assigned to other areas. The generator was responsible for generating fake samples to fool the discriminator. The quality of the generated samples was measured using a goodness-of-fit test method, KSD.
Three experiments were conducted to prove the feasibility and robustness of our algorithm. On the toy datasets, the unnormalized energy model of PT KSD GAN captures the approximate, blurry shape of the density distribution of the true samples, while the generated samples lie strictly within the true density distribution; that is, the generator precisely identifies the true data manifold. In the linear Independent Component Analysis task, PT KSD GAN outperformed other unnormalized training methods when the data dimension was less than 30; the manual selection of kernel functions and hyperparameters limits the application of KSD to higher-dimensional datasets. On the more complex image datasets, the results showed that KSD is no longer cursed by high dimensionality; however, the sharpness of the generated pictures should be improved in future work.
Unsupervised models are a critical area of machine learning, where the estimation of the density distribution is important for unlabeled samples. With the density distribution, it is possible to generate similar samples and to determine whether the input sample is abnormal. In future work, we will study the application of adversarial generation networks and hypothesis testing methods in anomaly detection.

Author Contributions

Conceptualization, L.N. and S.L., methodology, L.N. and S.L., software, L.N. and Z.L.; validation, S.L.; formal analysis, L.N.; investigation, L.N.; resources, L.N.; data curation, L.N.; writing—original draft preparation, L.N.; writing—review and editing, S.L. and Z.L.; visualization, L.N. and Z.L.; supervision, S.L.; project administration, S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, grant number “52275480”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

You can find MNIST dataset at http://yann.lecun.com/exdb/mnist/index.html, CIFAR-10 dataset at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 9 October 2023). No new data were created.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Lemma 2.
We know the condition is as follows:
$$\int_{y_j \in \mathcal{Y}} \nabla_{y_j}\big(k(y_j, y_j')\, p(y_j)\big)\, dy_j = 0 \tag{A1}$$
$$\int_{x_i \in \mathcal{X}} \nabla_{x_i}\big(k(x_i, x_i')\, q(x_i)\big)\, dx_i = 0 \tag{A2}$$
Then, we can calculate the first term on the right-hand side of Equation (12) via integration, and the result is zero.
$$\begin{aligned}
&\mathbb{E}_q\big[(s_q(x_i) - s_p(x_i))^T k(x_i, x_i')\,(s_q(x_i') - s_p(x_i'))\big]\\
&= \mathbb{E}_q\big[(s_q(x_i) - s_p(x_i))^T \mathcal{A}_q k_{x_i}(x_i')\big]\\
&= \mathbb{E}_q\big[(s_q(x_i) - s_p(x_i))^T \big(k(x_i, x_i')\, s_q(x_i') + \nabla_{x_i'} k(x_i, x_i')\big)\big]\\
&= \int_{x_i} q(x_i) \int_{x_i'} q(x_i')\, (s_q(x_i) - s_p(x_i))^T \big(k(x_i, x_i')\, s_q(x_i') + \nabla_{x_i'} k(x_i, x_i')\big)\, dx_i'\, dx_i\\
&= \int_{x_i} q(x_i) \int_{x_i'} (s_q(x_i) - s_p(x_i))^T \big(q(x_i')\, k(x_i, x_i')\, s_q(x_i') + q(x_i')\, \nabla_{x_i'} k(x_i, x_i')\big)\, dx_i'\, dx_i\\
&= \int_{x_i} q(x_i) \int_{x_i'} (s_q(x_i) - s_p(x_i))^T \big(k(x_i, x_i')\, \nabla_{x_i'} q(x_i') + q(x_i')\, \nabla_{x_i'} k(x_i, x_i')\big)\, dx_i'\, dx_i\\
&= \int_{x_i} q(x_i) \int_{x_i'} (s_q(x_i) - s_p(x_i))^T \nabla_{x_i'}\big(k(x_i, x_i')\, q(x_i')\big)\, dx_i'\, dx_i\\
&= 0
\end{aligned} \tag{A3}$$
In a similar way, we can prove the following:
$$\mathbb{E}_{p,q}\big[(s_q(x_i) - s_p(x_i))^T k(x_i, y_j)\,(s_q(y_j) - s_p(y_j))\big] = 0 \tag{A4}$$

References

  1. Wasserman, L. All of Statistics: A Concise Course in Statistical Inference; Springer: Berlin/Heidelberg, Germany, 2004. [Google Scholar]
  2. Wu, S. Score-Based Approach to Analysis of Unnormalized Models and Applications. Doctoral Dissertation, Duke University, Durham, NC, USA, 2023. [Google Scholar]
  3. Wu, S.; Diao, E.; Ding, J.; Banerjee, T.; Tarokh, V. Robust quickest change detection for unnormalized models. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR, Pittsburgh, PA, USA, 31 July–4 August 2023. [Google Scholar]
  4. Andrieu, C.; De Freitas, N.; Doucet, A.; Jordan, M.I. An introduction to MCMC for machine learning. Mach. Learn. 2003, 50, 5–43. [Google Scholar] [CrossRef]
  5. Wang, D.; Liu, Q. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv 2016, arXiv:1611.01722. [Google Scholar]
  6. Grathwohl, W.S.; Kelly, J.J.; Hashemi, M.; Norouzi, M.; Swersky, K.; Duvenaud, D. No MCMC for Me: Amortized Sampling for Fast and Stable Training of Energy-Based Models, International Conference on Learning Representations. 2021. Available online: https://slideslive.com/38953930/no-mcmc-for-me-amortized-sampling-for-fast-and-stable-training-of-energybased-models (accessed on 3 May 2021).
  7. Lecun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.A.; Huang, F.J. A Tutorial on Energy-Based Learning. Predict. Struct. Data 2006, 1. Available online: http://yann.lecun.com/exdb/publis/orig/lecun-06.pdf (accessed on 15 July 2020).
  8. Saremi, S.; Mehrjou, A.; Schölkopf, B.; Hyvärinen, A. Deep energy estimator networks. arXiv 2018, arXiv:1805.08306. [Google Scholar]
  9. Du, Y.; Mordatch, I. Implicit generation and modeling with energy based models. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019/hash/378a063b8fdb1db941e34f4bde584c7d-Abstract.html (accessed on 3 April 2020).
  10. Xie, J.; Zhu, S.; Wu, Y.N. Learning energy-based spatial-temporal generative convnets for dynamic patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 516–531. [Google Scholar] [CrossRef]
  11. Wan, F.; Kontogiorgos-Heintz, D.; de la Fuente-Nunez, C. Deep generative models for peptide design. Digit. Discov. 2022, 1, 195–208. [Google Scholar] [CrossRef]
  12. Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7327–7347. [Google Scholar] [CrossRef]
  13. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html (accessed on 24 April 2018).
  14. Song, J.; Ermon, S. Bridging the gap between f-gans and wasserstein gans. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020. [Google Scholar]
  15. Chen, X.; Zhang, Z.; Sui, Y.; Chen, T. Gans can play lottery tickets too. arXiv 2021, arXiv:2106.00134. [Google Scholar]
  16. Weng, L. From GAN to WGAN. arXiv 2019, arXiv:1904.08994. [Google Scholar]
  17. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar]
  18. Mroueh, Y.; Sercu, T. Fisher gan. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/07042ac7d03d3b9911a00da43ce0079a-Abstract.html (accessed on 21 April 2019).
  19. Xu, K.; Li, C.; Zhu, J.; Zhang, B. Understanding and stabilizing GANs’ training dynamics using control theory. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020. [Google Scholar]
  20. Li, Y.; Swersky, K.; Zemel, R. Generative Moment Matching Networks. arXiv 2015, arXiv:1502.02761. [Google Scholar]
  21. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Scholkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  22. Ding, L.; Yu, M.; Liu, L.; Zhu, F.; Liu, Y.; Li, Y.; Shao, L. Two generator game: Learning to sample via linear goodness-of-fit test. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019/hash/b075703bbe07a50ddcccfaac424bb6d9-Abstract.html (accessed on 2 August 2020).
  23. Tao, C.; Chen, L.; Henao, R.; Feng, J.; Duke, L.C. Chi-square generative adversarial network. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  24. Chwialkowski, K.; Strathmann, H.; Gretton, A. A Kernel Test of Goodness of Fit. arXiv 2016, arXiv:1602.02964. [Google Scholar]
  25. Liu, Q.; Lee, J.D.; Jordan, M.I. A Kernelized Stein Discrepancy for Goodness-of-fit Tests. arXiv 2016, arXiv:1602.03253. [Google Scholar]
  26. Hyvärinen, A.; Hurri, J.; Hoyer, P.O. Estimation of non-normalized statistical models. In Natural Image Statistics: A Probabilistic Approach to Early Computational Vision; Springer: London, UK, 2009; pp. 419–426. [Google Scholar]
  27. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  28. Song, Y.; Ermon, S. Improved techniques for training score-based generative models. Adv. Neural Inf. Process. Syst. 2020, 33, 12438–12448. [Google Scholar]
  29. Yoon, J.; Hwang, S.J.; Lee, J. Adversarial purification with score-based generative models. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021. [Google Scholar]
  30. Palmer, A.; Dey, D.K.; Bi, J. Reforming Generative Autoencoders via Goodness-of-Fit Hypothesis Testing. UAI 2018. Available online: https://dblp.org/rec/conf/uai/PalmerDB18.html (accessed on 9 October 2023).
  31. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  32. Li, C.; Chang, W.; Cheng, Y.; Yang, Y.; Póczos, B. Mmd gan: Towards deeper understanding of moment matching network. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper/2017/file/dfd7468ac613286cdbb40872c8ef3b06-Paper.pdf (accessed on 10 August 2019).
  33. Sutherland, D.J.; Tung, H.; Strathmann, H.; De, S.; Ramdas, A.; Smola, A.; Gretton, A. Generative models and model criticism via optimized maximum mean discrepancy. arXiv 2016, arXiv:1611.04488. [Google Scholar]
  34. Hu, T.; Chen, Z.; Sun, H.; Bai, J.; Ye, M.; Cheng, G. Stein neural sampler. arXiv 2018, arXiv:1810.03545. [Google Scholar]
  35. Wu, Q.; Gao, R.; Zha, H. Bridging Explicit and Implicit Deep Generative Models via Neural Stein Estimators. Adv. Neural Inf. Process. Syst. 2021, 34, 11274–11286. [Google Scholar]
  36. Zhao, J.; Mathieu, M.; Lecun, Y. Energy-Based Generative Adversarial Networks. 2017. Available online: https://openreview.net/forum?id=ryh9pmcee (accessed on 3 May 2021).
  37. Laakom, F.; Raitoharju, J.; Iosifidis, A.; Gabbouj, M. On feature diversity in energy-based models. arXiv 2023, arXiv:2306.01489. [Google Scholar]
  38. Nijkamp, E.; Hill, M.; Han, T.; Zhu, S.; Wu, Y.N. On the anatomy of mcmc-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  39. Grathwohl, W.; Wang, K.; Jacobsen, J.; Duvenaud, D.; Zemel, R. Learning the stein discrepancy for training and evaluating energy-based models without sampling. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020. [Google Scholar]
  40. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
Figure 1. Model structure of PT KSD GAN.
Figure 2. Result of PT KSD GAN on two-dimensional toy datasets. (Left): data. (Middle): densities of energy-based models. (Right): learned densities of the generator.
Figure 3. Results of PT KSD GAN on MNIST, CIFAR-10.
Table 1. Linear ICA training results (log likelihood; columns give the data dimension).

Method            10        20        30        40        50
Max. Likelihood   −10.98    −18.48    −21.49    −23.43    −25.53
PT KSD GAN        −10.90    −18.35    −21.21    −25.18    −25.49
LSD               −10.95    −18.37    −21.23    −25.14    −25.36
Score Matching    −11.13    −27.20    −21.48    NaN       NaN
NCE               −10.92    −22.52    −30.33    −55.53    −73.62
CNCE              −11.00    −18.77    −24.47    −37.64    −36.31
