1. Introduction
In unsupervised learning, training unnormalized density models and generating similar samples are active research topics and fundamental problems in probability and statistics [1]. Density estimation often yields an unnormalized model, which contains a normalization constant. Computing this normalization constant has been proven to be NP-hard and is intractable for high-dimensional datasets [2]. Only with the density function of the given samples can we conduct downstream tasks such as sample generation and quickest change detection [3].
Traditional unnormalized density models approximate the theoretical probability distribution of a given dataset via maximum likelihood estimation (MLE), which gives rise to a variety of inference algorithms, such as variational inference and MCMC [4]. However, the initial probability distribution and the Markov chain structure are hand-designed [5], so these algorithms lack universality. At the same time, the curse of dimensionality and intractable integral terms are barriers in the sampling process [6], greatly restricting the application scenarios.
Various unnormalized models have been proposed in recent years. Energy-based models [7] are widely used because of the freedom and simplicity they offer in designing the energy function. However, to normalize the model, the denominator of the density function is usually an integral term. To handle this tedious integral and its gradient, diverse deep energy models [8,9,10] have been proposed. These models eliminate the integral term only indirectly, and generating samples from the resulting Gibbs distribution is very complex. Moreover, few of these models can assess the quality of the trained model directly: most are evaluated via downstream tasks [11] or by the quality of samples generated from the trained model [12].
The generative adversarial network (GAN) [13] indirectly evaluates generative models through the samples they produce. GAN removes the density estimation step of the unnormalized model and directly simulates sample generation. By avoiding the complex density estimation process and the bias it introduces, GAN makes sample generation tractable. GAN has brought revolutionary progress to image generation [14] and other domains [15], but its minimax loss function leads to mode collapse and vanishing gradients during training [16]. Various remedies have been proposed, such as substituting the regularization terms and loss functions: WGAN [17], Fisher GAN [18], CLC-GAN [19], etc. However, GAN neither considers the probability distribution of the given samples nor provides an explicit estimate of it. As a result, GAN cannot perform approximate inference or related downstream tasks, such as Independent Component Analysis.
From a probabilistic and statistical point of view, it is natural to introduce classical hypothesis testing methods into machine learning, and applying hypothesis testing to train unnormalized models is a novel approach favored by researchers. GMMN [20] replaces the critic of GAN with a two-sample test based on MMD [21]. DEAN [22] optimizes models using goodness-of-fit test statistics. The GAN of [23] measures the discrepancy between two probability distributions by a kernelized statistical distance. However, because the density distribution of the samples must be given explicitly, KSD is rarely used in training unnormalized models.
In this paper, we propose a new unnormalized model that addresses the problems mentioned above without sampling from the model, allowing us to measure the quality of the model directly during training. We introduce PT KSD GAN, an unnormalized model that merges Kernel Stein Discrepancy (KSD) [24,25] as a goodness-of-fit test method with the GAN framework. Our discriminator leverages a deep energy network that assigns an energy value to each sample, effectively distinguishing genuine from synthetic samples. Simultaneously, our generator minimizes the Kernel Stein Discrepancy between generated samples and the theoretical density distribution. Our work makes the following main contributions:
- (1) We provide evidence that, under certain assumptions, Kernel Stein Discrepancy as a goodness-of-fit test method is equivalent to Maximum Mean Discrepancy as a two-sample test method.
- (2) We propose an adversarial learning algorithm, PT KSD GAN. A deep energy network serves as the discriminator, so the density distribution of the given samples is obtained explicitly through the energy model. The generator of PT KSD GAN produces samples that minimize the Kernel Stein Discrepancy between the generated samples and the obtained density distribution. This eliminates the tedious calculation of the integral term in the energy model, and the desired samples can be generated easily through the generator.
- (3) For the first time, we demonstrate experimentally that KSD is effective on high-dimensional datasets. On linear Independent Component Analysis datasets with data dimension below 30, PT KSD GAN outperforms other unnormalized training methods. Our tests on image datasets mark a significant step in addressing high-dimensional data challenges.
4. Learning the Kernel Stein Discrepancy of Energy-Based Models
In the above section, we introduced two hypothesis testing methods. MMD has been widely used in various unnormalized models [32,33]. However, owing to the need for an explicit estimate of the theoretical density distribution $p(x)$, the use of KSD in unnormalized models is extremely limited, having been applied only to the mixed Gaussian model [34]. In this article, we propose an algorithm that solves Problem 1 by using Kernel Stein Discrepancy to train unnormalized density models.
The bottleneck in measuring KSD is unquestionably the explicit density estimation of $p(x)$. In [35], two generative models were proposed for mutual calibration between an explicit and an implicit generative model, with the difference between the explicit energy-based model and the implicit generative model measured through the Stein discrepancy. However, its objective function includes three kinds of statistical distances, which increases the computational complexity, and no experimental results on likelihood inference are reported. Inspired by [35] and the right-hand side of the third equation in Formula (2), it is convenient to obtain the explicit Gibbs distribution of samples in explicit energy-based models. Enlightened by this, we replaced the density $p(x)$ in Formula (10) with the Gibbs density $e^{-E(x)}/Z$ of the energy-based model. The Kernelized Stein Discrepancy of energy-based models is defined as follows:

$$\mathrm{KSD}^2(q, p) = \mathbb{E}_{x, x' \sim q}\left[ u_p(x, x') \right], \qquad (14)$$

$$u_p(x, x') = s_p(x)^\top k(x, x')\, s_p(x') + s_p(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_p(x') + \operatorname{tr}\!\left( \nabla_x \nabla_{x'} k(x, x') \right),$$

where $s_p(x) = \nabla_x \log p(x) = -\nabla_x E(x)$ is the Stein score of the Gibbs density $p(x) = e^{-E(x)}/Z$; the normalization constant $Z$ vanishes under the gradient.
In the above formula, the explicit density estimate of $p(x)$ is restricted to the Gibbs distribution of the energy-based model, and $E(x)$ is the energy value of an input sample. Because $\log p(x) = -E(x) - \log Z$, the score $-\nabla_x E(x)$ can be evaluated without ever computing $Z$. Although the design of the energy function $E(x)$ is unconstrained, it is hard to find the right energy function for an unknown density distribution. Since we know nothing about the distribution of the given samples, we have to search for the right energy function $E(x)$ within a large function class. In the vast majority of cases, a deep neural network is selected as the energy function, owing to its strong nonlinear fitting ability and large representational range; our model likewise selects a deep energy network as the energy function $E(x)$.
$k(x, x')$ in Formula (14) is a strictly positive definite kernel, which maps input samples into a Reproducing Kernel Hilbert Space and returns their inner product. The choice of kernel function for high-dimensional data is always a difficult issue: different kernel functions map input samples to different feature spaces, and the sample features and their inner products vary accordingly. Here, we select the RBF (Radial Basis Function) kernel, also known as the Gaussian kernel, defined as follows:

$$k(x, x') = \exp\!\left( -\frac{\| x - x' \|^2}{2\sigma^2} \right) \qquad (15)$$
The RBF kernel meets the conditions described in Lemma 2 and has continuous second-order partial derivatives. The hyperparameter σ needs to be set manually. The RBF kernel maps samples into an infinite-dimensional space, has few parameters, and has low model complexity; it is therefore the kernel of choice in a wide range of application scenarios. We can obtain the kernel matrix of the input samples by evaluating the RBF kernel, and the derivative terms $\nabla_x k(x, x')$, $\nabla_{x'} k(x, x')$, and $\nabla_x \nabla_{x'} k(x, x')$ can be established by differentiating this matrix. Through backpropagation of the deep neural network, the score values $-\nabla_x E(x)$ can be obtained.
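As an illustration (not the paper's original code), here is a minimal NumPy sketch of the RBF kernel and the closed-form derivative terms that appear in Formula (14); the array shapes and the default bandwidth are assumptions:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Kernel matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))  (Formula (15))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def rbf_grad_x(X, Y, sigma=1.0):
    """Gradient w.r.t. the first argument: grad_x k(x, y) = -(x - y) / sigma^2 * k(x, y).
    Returns an (n, m, d) array; negate it for the gradient w.r.t. y."""
    K = rbf_kernel(X, Y, sigma)
    diff = X[:, None, :] - Y[None, :, :]
    return -diff / sigma**2 * K[:, :, None]

def rbf_trace_grad_xy(X, Y, sigma=1.0):
    """Trace of the mixed second derivative:
    tr(grad_x grad_y k) = (d / sigma^2 - ||x - y||^2 / sigma^4) * k(x, y)."""
    K = rbf_kernel(X, Y, sigma)
    d = X.shape[1]
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=2)
    return (d / sigma**2 - sq / sigma**4) * K
```

All three quantities are available in closed form for the RBF kernel, which is one reason it is a convenient default choice here.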
So, each term of $u_p(x, x')$ in Formula (14) can be calculated, and $\mathrm{KSD}^2(q, p)$ measures the degree of matching between the input samples and the density distribution represented by the energy-based model. The Gibbs distribution of the energy-based model loses its integral term under differentiation, which greatly simplifies the computation. We now analyze the application of KSD to training unnormalized models. The input samples are a finite set of fake samples from a density distribution $q$, so samples from $q$ should be readily available, while the density distribution $p$ is represented by the energy-based model that $q$ aims to approximate. Since we know nothing about $p$ at the beginning, the initial $q$ must be far from $p$. We calculate the $\mathrm{KSD}^2(q, p)$ of the energy-based model in Formula (14) and iteratively update the fake samples of $q$ and the energy function $E(x)$, which makes $q$ slowly approach $p$. Finally, when the value of $\mathrm{KSD}^2(q, p)$ falls below a given small threshold, we regard $q$ and $p$ as the same density distribution.
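To make this concrete, here is a minimal PyTorch sketch of the U-statistic estimate of Formula (14) for an energy-based model; the energy network interface and bandwidth are illustrative assumptions:

```python
import torch

def ksd_u_statistic(x, energy_fn, sigma=1.0):
    """U-statistic estimate of KSD^2 (Formula (14)) between samples x ~ q and the
    Gibbs density p(x) ∝ exp(-E(x)). The score s_p(x) = -∇E(x) is obtained by
    backpropagation, so the normalization constant Z never appears."""
    if not x.requires_grad:
        x = x.clone().requires_grad_(True)
    n, d = x.shape
    score = -torch.autograd.grad(energy_fn(x).sum(), x, create_graph=True)[0]

    diff = x[:, None, :] - x[None, :, :]                 # (n, n, d) pairwise differences
    sq = (diff ** 2).sum(-1)
    K = torch.exp(-sq / (2.0 * sigma ** 2))              # RBF kernel matrix
    grad_Kx = -diff / sigma ** 2 * K[..., None]          # ∇_x  k(x, x')
    grad_Ky = -grad_Kx                                   # ∇_x' k(x, x')
    trace_Kxy = (d / sigma ** 2 - sq / sigma ** 4) * K   # tr ∇_x ∇_x' k(x, x')

    u = (score @ score.T) * K \
        + (score[:, None, :] * grad_Ky).sum(-1) \
        + (score[None, :, :] * grad_Kx).sum(-1) \
        + trace_Kxy
    return (u.sum() - u.diagonal().sum()) / (n * (n - 1))  # drop i = j terms
```

Because `create_graph=True` keeps the computation graph, this scalar can be backpropagated through either the samples (to update the generator) or the energy network's parameters.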
5. Training Unnormalized Models with KSD
In Formula (14), the fake samples $x \sim q$ and the energy model $E(x)$ need to be provided. In this paper, to avoid the bias caused by samplers and to reduce the cost of sample generation, we adapt the generative network of a generative adversarial network to generate the fake samples. Meanwhile, we choose the explicit energy-based model as the estimate of the unnormalized density distribution $p(x)$, which can also be employed as the discriminator in a generative adversarial network.
5.1. Generative Adversarial Network (GAN)
The generative adversarial network is a popular method for training generative models. The GAN [13] involves a generator G and a discriminator (or critic) D. The goal of the generator is to transform latent codes $z$ into generated samples that the discriminator cannot distinguish from real data samples, while the goal of the discriminator is to distinguish real samples from generated samples as well as possible. With i.i.d. data samples, the optimal generator parameter $\theta_G^*$ and discriminator parameter $\theta_D^*$ are the solution to the following minimax game:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\!\left[ \log\!\left( 1 - D(G(z)) \right) \right] \qquad (16)$$
The minimax game highlights the adversarial nature of the training process. The discriminator, parameterized by $\theta_D$, aims to maximize its ability to differentiate between real and generated data, while the generator, parameterized by $\theta_G$, endeavors to minimize the discriminator's accuracy by producing increasingly realistic data. The term $\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]$ is the expected log probability, over real samples drawn from the true distribution, that the discriminator D correctly identifies them as genuine; the discriminator maximizes this value, ensuring a high probability of correctly classifying real samples. The term $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ denotes the expected log probability, for noise variables z sampled from the distribution $p_z(z)$, that the discriminator correctly labels the data generated by the generator G as fake. The discriminator maximizes this value to identify the artificially generated data, while the generator minimizes it, thereby improving its ability to generate data that the discriminator misclassifies as real. Together, these terms formulate the objective function of the GAN, capturing the adversarial relationship between the generator and discriminator during training.
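For concreteness, a minimal PyTorch sketch of one alternating update of the minimax game in Formula (16); the network shapes, latent dimension, and optimizer settings are illustrative assumptions, not this paper's configuration:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))                 # generator
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # discriminator
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real):
    z = torch.randn(real.size(0), 8)
    fake = G(z)
    # Discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))]
    loss_D = -(torch.log(D(real)) + torch.log(1.0 - D(fake.detach()))).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator descends E[log(1 - D(G(z)))] (the non-saturating -log D(G(z)) is common too)
    loss_G = torch.log(1.0 - D(fake)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```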
5.2. Generative Adversarial Networks for Unnormalized Density Models
The discriminator of a traditional GAN solves a classification problem: it is a binary or multi-class classification network. Here, we study the use of the discriminator in unnormalized density problems.
In [36], EBGAN was proposed to find the data manifold based on an energy model, assigning low energy values to the data manifold and high energy values to other data regions. The mission of the discriminator is to discern whether the data come from the true sample manifold, while the generator is trained to approximate the data manifold so as to confuse the discriminator until it loses its discernment. The adversarial process can be defined by the losses:

$$L_D = D(x) + \left[ m - D(G(z)) \right]^+, \qquad L_G = D(G(z)) \qquad (17)$$

$L_D$ is the loss for the discriminator; $D(x)$ is the discriminator's output (energy) for a real data sample $x$; $D(G(z))$ is its output for a generated sample $G(z)$; $m$ is a positive margin; and $[\cdot]^+$ denotes the positive part function, defined as $[a]^+ = \max(0, a)$. Following this adaptation, the model is equipped to tackle unnormalized density problems. However, the autoencoder employed within the discriminator proved computationally intensive, and the Stein score function of an autoencoder is hard to obtain.
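A minimal sketch of this margin objective, reading D(·) as an energy value; the margin value and tensor shapes are illustrative:

```python
import torch

def ebgan_losses(D, G, real, z, m=10.0):
    """EBGAN-style losses (Formula (17)): L_D = D(x) + [m - D(G(z))]^+ ,  L_G = D(G(z))."""
    fake = G(z)
    loss_D = D(real).mean() + torch.clamp(m - D(fake.detach()), min=0).mean()
    loss_G = D(fake).mean()
    return loss_D, loss_G
```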
5.3. PT KSD GAN (Kernel Stein Discrepancy Generative Adversarial Network with a Pulling-Away Term)
In the above section, we analyzed the disadvantages of the autoencoder architecture in EBGAN. Consequently, we replace it with a standard neural network architecture. To ensure stability, we set the hyperparameter $m$ in Formula (17) to 1. The output of the discriminator is regarded as the energy value of a sample, which forms the numerator of the explicit Gibbs density estimate. As stated in [25], the Kernel Stein Discrepancy (KSD) can measure the distance between different density distributions; that is, minimizing the KSD of the generator is equivalent to making the fake samples coincide with the theoretical distribution of the true samples. The modified formulations of $L_D$ and $L_G$ are presented as follows:

$$L_D = D(x) + \left[ 1 - D(G(z)) \right]^+, \qquad L_G = \mathrm{KSD}^2(q_G, p_D) \qquad (18)$$

where $q_G$ denotes the distribution of the generated samples $G(z)$ and $p_D(x) \propto \exp(-D(x))$ is the Gibbs distribution defined by the discriminator's energy.
In the above formula, the generator and the discriminator are updated iteratively until, finally, the discriminator cannot tell whether an input sample is real or fake. We call the adversarial process in Formula (18) KSD GAN. During training, the discriminator acts as an energy-based model that discovers the explicit Gibbs distribution of the real samples. At the beginning, fake samples produced by the generator lie far from the density region of the real samples, and the discriminator can easily tell them apart. During alternating iterations, the samples produced by the generator draw ever closer to the theoretical distribution of the real samples, which is represented by the discriminator's energy model. As the number of iterations increases and the Kernel Stein Discrepancy shrinks, the generated samples approximate the Gibbs distribution of the true samples, fooling the discriminator into accepting them as real.
In fact, the iterative process of KSD GAN can also be regarded as a confrontation between an implicit generative model and an explicit generative model. The discriminator is trained as the explicit generative model, whose density is the Gibbs distribution based on the energy model; one could, of course, use a sampler to draw from this Gibbs distribution, which is what we ultimately want to obtain. As the implicit generative model, the generator bypasses the estimation of the density distribution and generates imitation samples directly by reducing the Kernel Stein Discrepancy.
GAN is notoriously difficult to train, and mode collapse often occurs. This is because the density distribution of a high-dimensional dataset is widely dispersed, with large spans between regions. It is very possible for the generator to find a local optimum and then stop optimizing; the generated samples then cover only one or a few modes, and sample diversity is insufficient. Inspired by [36,37], reducing the similarity and increasing the orthogonality of generated samples in energy-based models can avoid mode collapse and speed up convergence. We add a repelling regularizer named the Pulling-away Term (PT) to the generator loss:

$$f_{PT}(S) = \frac{1}{N(N - 1)} \sum_{i} \sum_{j \neq i} \left( \frac{S_i^\top S_j}{\| S_i \| \, \| S_j \|} \right)^2 \qquad (19)$$

where $S$ is a batch of $N$ generated samples and $S_i$ denotes the $i$-th sample.
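A minimal PyTorch sketch of Formula (19); note that EBGAN applies the PT to encoder representations, so applying it directly to a batch of samples, as here, is an assumption of this illustration:

```python
import torch

def pulling_away_term(S):
    """f_PT(S): mean squared cosine similarity over distinct pairs in a batch.
    S: (N, d) batch of generated samples (or their representations)."""
    N = S.size(0)
    Sn = S / S.norm(dim=1, keepdim=True).clamp_min(1e-12)   # row-normalize
    cos = Sn @ Sn.T                                          # pairwise cosine similarities
    off_diag = cos**2 - torch.eye(N, device=S.device)        # zero out the i = j terms
    return off_diag.sum() / (N * (N - 1))
```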
After adding the regularization term to the generator, the losses $L_D$ and $L_G$ of KSD GAN are defined as follows:

$$L_D = D(x) + \left[ 1 - D(G(z)) \right]^+, \qquad L_G = \mathrm{KSD}^2(q_G, p_D) + \alpha f_{PT}\big(G(z)\big) \qquad (20)$$

where $\alpha$ is the coefficient of the regularization term.
In Formula (20), KSD GAN with the regular term PT added is called PT KSD GAN. Our model structure of PT KSD GAN is shown in Figure 1.
The figure shows that our model is an enhanced GAN framework incorporating the Kernelized Stein Discrepancy. The data processing steps are as follows:
- 1 Two data streams, the real samples $x$ and the noise vectors $z$, enter the system. The former is directed towards the discriminator, while the latter feeds the generator.
- 2 Acting as the GAN's energy model, the discriminator $D$ processes the data $x$. Its primary responsibility is to differentiate between the real samples $x$ and the fake samples $G(z)$ produced by the generator.
- 3 Simultaneously, the generator $G$ creates synthetic samples based on an input noise vector z. This noise vector serves as a seed, ensuring diversity in the generated samples.
- 4 The fake samples are passed to the PT Kernel Stein Discrepancy module. The integration of KSD, a non-parametric measure of discrepancy between distributions, quantitatively evaluates the deviation of the generated samples from the target distribution. This evaluation provides valuable feedback to the generator, steering it towards optimal data synthesis.
- 5 The outcomes from the PT Kernel Stein Discrepancy and Mean Absolute Error modules are combined into a composite gradient that guides the generator's training (a condensed training-step sketch follows this list). By assimilating feedback not only from the discriminator but also from an external discrepancy measure (KSD), the generator's training becomes more informed, focused, and robust against mode collapse.
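Putting the pieces together, a condensed PyTorch sketch of one PT KSD GAN update per Formula (20). It reuses the `ksd_u_statistic` and `pulling_away_term` helpers sketched earlier (an assumption of this illustration); the margin of 1 and α = 0.8 follow the text, while the network sizes and optimizer settings echo the toy-data configuration:

```python
import torch
import torch.nn as nn

# Assumes ksd_u_statistic(...) and pulling_away_term(...) from the earlier sketches are in scope.
E = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))   # energy network (discriminator)
G = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 2))   # generator
opt_D = torch.optim.Adam(E.parameters(), lr=2e-3, betas=(0.5, 0.99))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-3, betas=(0.5, 0.99))

def pt_ksd_gan_step(real, alpha=0.8, margin=1.0):
    z = torch.randn(real.size(0), 8)
    # Discriminator: margin loss on energies (Formula (17) with m = 1)
    loss_D = E(real).mean() + torch.clamp(margin - E(G(z).detach()), min=0).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator: KSD to the discriminator's Gibbs density, plus the PT regularizer
    fake = G(z)
    loss_G = ksd_u_statistic(fake, E) + alpha * pulling_away_term(fake)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```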
6. Experiment
We introduce PT KSD GAN for training unnormalized models (Figure 1). Using an energy-based discriminator, it assigns energy values to samples, while the generator, by minimizing the Kernel Stein Discrepancy, produces samples that reflect the true density. Initial tests on 2D toy datasets (Figure 2) validate the generator's alignment with the true density and the diversity of its samples.
We further validate the approach using linear Independent Component Analysis (ICA), a standard benchmark for unnormalized model training. ICA decomposes signals into independent sources; our focus was on retrieving the original latent variables from their mixtures, with the results shown in Table 1.
Finally, PT KSD GAN's performance on image datasets like MNIST and CIFAR-10, using a DCGAN framework, is depicted in Figure 3, indicating realistic image generation with room for refinement.
6.1. Toy Datasets
We initiate our study by assessing PT KSD GAN's performance on two-dimensional toy datasets. These datasets, derived from Sklearn's library, provide a range of shapes in two-dimensional data. For the generator, we use inputs $z$ sampled from a standard normal distribution. Both the generator and the discriminator are structured as two-layer multi-layer perceptrons producing two-dimensional outputs. Key hyperparameters include an RBF kernel bandwidth σ of 1.0; a regularization coefficient α of 0.8; and Adam Optimizer settings with a learning rate of 0.002, $\beta_1$ of 0.5, and $\beta_2$ of 0.99. We standardize the batch size to 1000 across all toy datasets and cap training at 10,000 iterations. The outcomes of this training are illustrated in Figure 2.
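Before turning to the results, a short sketch of this setup; the paper only states that the shapes come from Sklearn, so the choice of `make_moons` as the representative dataset is an assumption:

```python
import torch
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05)    # one representative 2-D toy shape
real = torch.tensor(X, dtype=torch.float32)

# Hyperparameters from Section 6.1
sigma, alpha = 1.0, 0.8                          # RBF bandwidth, PT coefficient
opt_kwargs = dict(lr=2e-3, betas=(0.5, 0.99))    # Adam settings
batch_size, max_iters = 1000, 10_000
```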
Figure 2 delineates the density distributions: real samples on the left, the energy-based model's (discriminator's) output in the center, and the density of generator-produced samples on the right. Notably, the energy-based model provides unnormalized energy values, so its density estimates capture only the approximate shape of the true samples' density distribution. While samplers such as Hamiltonian Monte Carlo could have been used, biases from the initial probability distribution and other constraints deterred us from that approach.
Our approach employs a specialized network to produce fake samples, aiming to minimize the Kernel Stein Discrepancy against the true distribution. As is evident in Figure 2, the generator's outputs align closely with the true density distribution, ensuring sample diversity and authenticity. The PT KSD GAN training process can be viewed as a competitive interaction between two generative networks: the discriminator captures the true samples' explicit Gibbs distribution, while the generator focuses on producing realistic samples without an explicit search for the density distribution.
6.2. Linear ICA
Linear Independent Component Analysis (ICA) [26] is a statistical method designed to separate a multivariate signal into independent non-Gaussian signals. It is particularly useful when multiple signals overlap and the objective is to recover the original signals from their mixtures. Building on this, we explored the efficacy of PT KSD GAN within the linear ICA model, which serves as a benchmark for evaluating algorithms that train unnormalized models.
The generative process of the true samples is relatively simple. The latent variables $s$ are independent and identically distributed and are sampled from the standard Laplace distribution, that is, $s_i \sim \mathrm{Laplace}(0, 1)$. Then, we take a linear transformation:

$$x = W s \qquad (21)$$
The model parameter $W$ is an $n \times n$ matrix whose determinant is nonzero. Under Formula (21), we can deduce the log density of $x$:

$$\log p(x) = \log p_s\!\left( W^{-1} x \right) + \log \left| \det\!\left( W^{-1} \right) \right| \qquad (22)$$
$p_s$ is the probability density function of the latent variables $s$. Thus, Formula (22) can be used as an unnormalized likelihood to evaluate generative models. There are two hyperparameters: the bandwidth parameter σ and the regularization coefficient α. The bandwidth parameter σ of the RBF kernel is self-adapting according to the number of input samples and the sample feature dimension. Sample dimensions range from 2 to 50. The regularization coefficient α is set to 0.8. We use the Adam Optimizer, with its learning rate and its two momentum hyperparameters $\beta_1$ and $\beta_2$ fixed in advance. The number of iterations is 100,000, and the batch size is set to 1000. See Table 1 for the results of linear ICA with PT KSD GAN.
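Before the comparison, a minimal NumPy sketch of the ICA generative process (Formula (21)) and the evaluation log density (Formula (22)); the dimension, sample count, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
W = rng.normal(size=(d, d))              # mixing matrix, almost surely nonsingular
s = rng.laplace(size=(5000, d))          # latent sources, standard Laplace
x = s @ W.T                              # observed mixtures: x = W s   (Formula (21))

def ica_log_density(x, W):
    """Log density of x = W s with standard Laplace sources (Formula (22)):
    log p(x) = log p_s(W^{-1} x) + log |det(W^{-1})|."""
    s_hat = np.linalg.solve(W, x.T).T                       # recovered sources W^{-1} x
    log_ps = -np.abs(s_hat).sum(axis=1) - x.shape[1] * np.log(2)  # Laplace log density
    return log_ps - np.log(np.abs(np.linalg.det(W)))
```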
Table 1 shows the log likelihoods of various algorithms: score matching [26], maximum likelihood [38], Learned Stein Discrepancy (LSD) [39], and Noise-Contrastive Estimation (NCE) [40]. For low-dimensional samples, we find that PT KSD GAN performs best. As the dimension increases, PT KSD GAN becomes slightly inferior to ML and LSD; the reason is that manual selection of the kernel function makes it impossible to pick the optimal kernel for the current dataset. On the whole, our approach achieves performance comparable to maximum likelihood.
6.3. Preliminary Image Results
To validate our approach's applicability to training more expressive unnormalized models and generating samples, we conducted experiments assessing PT KSD GAN's performance on high-dimensional image datasets, specifically MNIST and CIFAR-10, with a training set of 50K images. Our experimental setup adhered to the DCGAN [31] model architecture, employing simple convolutional neural networks for both the generator and the discriminator. The discriminator computes the energy value of each image vector, and we average these values to determine the energy of an image batch. Optimization was performed using the Adam Optimizer with a learning rate of 0.00005 and manually set hyperparameters $\beta_1$ and $\beta_2$. We standardized the batch size to 64 across all datasets.
Given the intricate nature of the image datasets, we restricted the regularization term calculation to the first twenty images of each batch, enhancing computational efficiency. This means that the value of $N$ in Formula (20) remains fixed at 20 for each training batch. This approach significantly elevates the diversity of the generated images. The results, depicted in Figure 3, show that the images produced from MNIST and CIFAR-10 training are convincingly realistic, though there is room for refining image precision in future iterations.
The images displayed in the figure represent the generator’s outputs within the adversarial network. While it is feasible to use a sampler to derive samples from the discriminator’s trained energy model, we prioritize efficiency in large-scale sample generation and, thus, opt for the generation network over the sampler. Admittedly, the quality of these generated samples is not the best in the field. However, they mark a pioneering demonstration of KSD’s ability to manage high-dimensional data evaluation and generation. Future endeavors could explore intricate network architectures and sophisticated samplers to handle even more complex image datasets.
7. Conclusions
In this paper, we demonstrated that Kernel Stein Discrepancy is an efficient way to train unnormalized density models and to generate samples from them. In Section 3, we connected the two-sample test and the goodness-of-fit test by demonstrating that Kernel Stein Discrepancy is equal to Maximum Mean Discrepancy under certain assumptions. We introduced a new adversarial learning paradigm, PT KSD GAN, for training unnormalized density models. The discriminator was trained as an energy model to determine the explicit Gibbs density distribution of the real samples: low energies were assigned to regions within the true data manifold and high energies to other areas. The generator was responsible for generating fake samples to fool the discriminator, and the quality of the generated samples was measured with a goodness-of-fit test method, KSD.
Three experiments were conducted to prove the feasibility and robustness of our algorithm. On toy datasets, the unnormalized energy model of PT KSD GAN captures the approximate, blurry shape of the true samples' density distribution, while the generated samples lie strictly within the true density distribution; the generator precisely identifies the true data manifold. In the linear Independent Component Analysis task, PT KSD GAN outperformed other unnormalized training methods when the data dimension was below 30; the manual selection of kernel functions and hyperparameters limits the application of KSD to higher-dimensional settings. On more complex image datasets, the results showed that KSD is no longer cursed by high dimensionality, although the sharpness of the generated pictures should be improved in future work.
Unsupervised models are a critical area of machine learning, where estimating the density distribution of unlabeled samples is important. With the density distribution, it is possible to generate similar samples and to determine whether an input sample is abnormal. In future work, we will study the application of adversarial generative networks and hypothesis testing methods to anomaly detection.