
Dynamics of Fourier Modes in Torus Generative Adversarial Networks

by Ángel González-Prieto 1,*,†, Alberto Mozo 2,†, Edgar Talavera 2,† and Sandra Gómez-Canaval 2,†
1 Departamento de Matemáticas, Facultad de Ciencias, Universidad Autónoma de Madrid, 28049 Madrid, Spain
2 Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, 28031 Madrid, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2021, 9(4), 325; https://doi.org/10.3390/math9040325
Submission received: 15 December 2020 / Revised: 29 January 2021 / Accepted: 1 February 2021 / Published: 6 February 2021
(This article belongs to the Special Issue Bioinspired Computation: Recent Advances in Theory and Applications)

Abstract: Generative Adversarial Networks (GANs) are powerful machine learning models capable of generating fully synthetic samples of a desired phenomenon with a high resolution. Despite their success, the training process of a GAN is highly unstable, and typically it is necessary to implement several accessory heuristics to the networks to reach acceptable convergence of the model. In this paper, we introduce a novel method to analyze the convergence and stability of the training of generative adversarial networks. For this purpose, we propose to decompose the objective function of the adversarial min–max game defining a periodic GAN into its Fourier series. By studying the dynamics of the truncated Fourier series for the continuous alternating gradient descent algorithm, we are able to approximate the real flow and to identify the main features of the convergence of the GAN. This approach is confirmed empirically by studying the training flow in a 2-parametric GAN, aiming to generate an unknown exponential distribution. As a by-product, we show that convergent orbits in GANs are small perturbations of periodic orbits, so that the Nash equilibria are spiral attractors. This theoretically justifies the slow and unstable training observed in GANs.

1. Introduction

Since their very inception, Generative Adversarial Networks (GANs) have revolutionized the areas of machine learning and deep learning. They address very successfully one of the most outstanding problems in pattern recognition: given a collection of examples of a certain phenomenon that we want to replicate, construct a generative model able to create new, completely synthetic instances following the same patterns as the original ones. Ideally, the goal would be to capture the underlying pattern so subtly that no external critic would be able to distinguish between real samples and synthesized instances.
The proposal of Goodfellow et al. [1] is to confront two neural networks in an adversarial game to solve this problem. More precisely, they propose to consider a neural network G playing the role of a generator agent and a network D acting as the discriminator. The discriminator D is trained to distinguish as accurately as possible between real samples and fake/synthetic samples. On the other hand, G aims to generate synthetic instances of high quality in such a way that D is barely able to distinguish them from real data. The two networks are, thus, in effective competition. When, as a by-product of this competition, the agents reach an optimal point, we obtain a generator able to generate almost indistinguishable synthetic samples as well as a discriminator very proficient in classifying real and fake instances.
The way in which these networks are trained to reach this optimal point is through a common objective function. Explicitly, in [1], it is proposed to consider the following function:
$$\mathcal{F}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[\log D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G})\right)\right],$$
where $\theta_D$ are the inner weights of D, $\theta_G$ are the weights of G, $\Omega$ is the probability space of the real data, and $\Lambda$ is the latent probability space from which G samples the noise to be transformed into synthetic instances. In this manner, $\mathcal{F}$ essentially measures the accuracy of D in the classification problem between real and fake examples, so D tries to maximize it and G tries to minimize it. Hence, it gives rise to a non-convex min–max game, and the goal of the training process is to reach a Nash equilibrium.
Several training approaches have been proposed to reach these Nash equilibria, but the most widely used method is the so-called Alternating Gradient Descent (AGD). Roughly speaking, the idea is to alternately train D by tuning $\theta_D$ with cost function $\mathcal{F}$ and weights $\theta_G$ fixed and, after a certain number of epochs, to reverse the roles and update $\theta_G$ with cost function $\mathcal{F}$ and weights $\theta_D$ fixed. This optimization procedure has led to astonishing results, particularly in the domain of image processing and generation. Using several architectures and sophisticated multi-level training, GANs are able to generate images with such a high quality that a human eye is not capable of distinguishing them from real images [2].
Despite these achievements, the stability of the AGD algorithm for GANs is a major issue. In [3], the authors proved that the Nash equilibria for GANs are locally stable provided that some ideal conditions on the optimality of the equilibria are fulfilled. Nevertheless, these conditions may be unfeasible, as shown in [4], so actual convergence and stability are not guaranteed in real applications. In particular, one of the most challenging problems arising during the training of GANs is the so-called mode collapse [5]. This state is characterized by a generator that has degenerated into a network that is only able to generate a single synthetic sample (or a very small number of them) with almost no variation and such that the discriminator confuses it with a real sample (typically, because the synthetic sample is actually very close to a real one). In this state, the system is no longer a generative model but simply a copier of real data.
Furthermore, by construction, neural network-based GANs have some intrinsic constraints on their expressivity that lead to very unrealistic synthetic samples in contexts far from image generation. For instance, neural networks produce a smooth output function, which causes GANs to have lots of difficulties in dealing with the generation of real samples drawn from a distribution with a non-smooth density (e.g., an exponential distribution, whose density is discontinuous at the origin) [6] or with some drastic semantic restrictions (e.g., nonnegative values for counters) [7]. These scenarios do not typically appear in image generation but are common in other domains such as data augmentation for machine learning [8]. These problems lead to additional inconveniences for stable convergence and usually give rise to highly unstable models that require very handcrafted stopping criteria and optimization heuristics.
A multitude of works have been oriented towards a deeper understanding of the instability of the training of GANs, as well as towards proposing solutions. A thorough theoretical study of the sources of instability and their causes can be found in [9], and in [10,11], the authors analyzed the real capability of a GAN to learn the distribution through both a theoretical and an empirical approach. In addition, in order to mitigate the instability of the training, in [12], the authors proposed a collection of heuristic methods through variations of the standard backpropagation algorithm that contribute to stabilizing the training process of GANs. Moreover, in [13], the use of regularization procedures was proposed to speed up the convergence.
Another very active research line is the proposal of alternative models for GANs that guarantee better convergence. It is well known that the key reason why GANs should capture the original distribution is that they implicitly optimize the Jensen–Shannon divergence (JSD) between the real underlying distribution and the generated distribution of the synthetic data [1]. In order to change this framework, in [14], the authors proposed to modify the cost function in such a way that the new GAN did not optimize the JSD but an Earth-mover distance known as the Wasserstein distance, giving rise to the celebrated Wasserstein Generative Adversarial Networks (WGANs). In a similar vein, in [15], it was proposed to use the f-divergence (a divergence in the spirit of the Kullback–Leibler divergence) as the criterion for training GANs. Even genetic algorithms have been used to stabilize the training process, as in [16], where the authors applied genetic programming to optimize the use of different adversarial training objectives and evolved a population of generators to adapt to the discriminator, which acts as the hostile environment driving evolution. Nevertheless, despite all these efforts, no master method is currently available, and hence, assuring a fast, or even effective, convergence of GANs is an open problem.
Our contribution. In this paper, we propose a novel method to analyze the convergence of GANs through Fourier analysis. Concretely, we propose to approximate the objective function F by its Fourier series, truncated with enough precision that the local dynamics of F can be understood by means of a trigonometric polynomial.
Recall that any function $\mathcal{F}(\theta): T^n \to \mathbb{C}$ defined on the n-dimensional torus $T^n = (S^1)^n$ (equivalently, a $\mathbb{Z}^n$-periodic function on $\mathbb{R}^n$) can be decomposed into a series of complex exponential functions, known as its Fourier series:
$$\mathcal{F}(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e^{2\pi i\, m \cdot \theta},$$
where the series is indexed by the so-called Fourier modes or frequencies $m \in \mathbb{Z}^n \subset \mathbb{R}^n$. In principle, the previous equality must be understood as a decomposition in the Hilbert space of square-integrable functions, $L^2(T^n)$. However, if $\mathcal{F}$ has enough regularity, then the Fourier series on the right-hand side also converges uniformly to the original function $\mathcal{F}$. This implies that, taking enough Fourier modes, $\mathcal{F}$ can be effectively approximated by a truncated Fourier series. Moreover, if $\mathcal{F}$ is real-valued, expressing the complex exponentials as combinations of sine and cosine functions, we obtain an approximation of $\mathcal{F}$ by a trigonometric polynomial, $\Theta(\mathcal{F})$.
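For illustration, this truncation can be checked numerically; the following minimal Python sketch (an illustration we add here, with an arbitrary smooth test function, not the GAN cost) estimates the Fourier coefficients of a 1-periodic function with the FFT and measures the sup-norm error of the truncated series:

```python
# Minimal sketch: truncated Fourier approximation of a smooth 1-periodic
# function. The test function F is an arbitrary choice for illustration.
import numpy as np

def fourier_coefficients(F, N, samples=512):
    """Estimate alpha_m = int_0^1 F(t) exp(-2*pi*i*m*t) dt for |m| <= N."""
    t = np.arange(samples) / samples
    # The DFT divided by the sample count is a rectangular-rule
    # approximation of the Fourier integral.
    alpha = np.fft.fft(F(t)) / samples
    modes = np.arange(-N, N + 1)
    return modes, alpha[modes]              # negative indices wrap around

def truncated_series(modes, alpha, t):
    return sum(a * np.exp(2j * np.pi * m * t) for m, a in zip(modes, alpha))

F = lambda t: np.exp(np.sin(2 * np.pi * t))     # smooth and 1-periodic
t = np.linspace(0.0, 1.0, 1000)
for N in (1, 3, 5):
    modes, alpha = fourier_coefficients(F, N)
    err = np.max(np.abs(F(t) - truncated_series(modes, alpha, t)))
    print(f"N = {N}: sup-norm error ~ {err:.2e}")   # decays rapidly
```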
This approximation can be applied to the study of the convergence of GANs as follows. The continuous version of the AGD algorithm can be thought of as a path of weights, $(\theta_D(t), \theta_G(t))$, depending on the time parameter $t \in \mathbb{R}$. In particular, $(\theta_D(0), \theta_G(0))$ are the initial random weights of the GAN and $(\theta_D(t), \theta_G(t))$ determine the state of the networks after training for a time $t > 0$. In this manner, if we seek to increase $\mathcal{F}(\theta_D, \theta_G)$ in the direction $\theta_D$ and to decrease it in the direction $\theta_G$, the AGD gives rise to a system of Ordinary Differential Equations (ODEs) given by
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \mathcal{F}(\theta_D, \theta_G),$$
where $\theta_D'$ and $\theta_G'$ denote the derivatives of the functions $\theta_D(t)$ and $\theta_G(t)$ with respect to time t. This flow aims to converge to a Nash equilibrium of the objective function $\mathcal{F}$ of the GAN, and for this reason, we refer to it as the Nash flow.
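As a toy illustration of this flow (with a single Fourier mode as objective, not the cost of an actual GAN), the system can be integrated numerically:

```python
# Sketch: numerical integration of the continuous Nash flow for the toy
# periodic objective F(x, y) = sin(2*pi*x) * sin(2*pi*y).
import numpy as np
from scipy.integrate import solve_ivp

def nash_field(t, y):
    x, z = y
    dFdx = 2 * np.pi * np.cos(2 * np.pi * x) * np.sin(2 * np.pi * z)
    dFdz = 2 * np.pi * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * z)
    return [dFdx, -dFdz]        # ascend in theta_D, descend in theta_G

sol = solve_ivp(nash_field, (0.0, 10.0), y0=[0.3, 0.1])
print(sol.y[:, -1] % 1.0)       # state on the torus after time t = 10
```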
However, in many interesting cases, the function F may be very involved and lacks an analytic closed expression that would enable an explicit analysis (e.g., even in the toy example of Equation (13), the cost function is intractable analytically). To address this problem, we propose to approximate F by its truncated Fourier series, Θ ( F ) . In this way, at least locally, the dynamic of the original Nash flow can be read from the solutions to the simplified system
$$\theta_D' = \nabla_D \Theta(\mathcal{F})(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \Theta(\mathcal{F})(\theta_D, \theta_G).$$
In order to analyze this system of ODEs, we propose a novel method focused on studying the dynamics of the Nash flow on Fourier basis functions and on subsequent further approximations. As we will see, for the Nash flow of a basic trigonometric function, the Nash equilibria are not attractors of the flow but centers, that is, they are surrounded by periodic orbits that spin around the critical point. When we consider more Fourier modes in the Fourier expansion of $\mathcal{F}$, these periodic orbits may break, leading to spiral attractors or spiral repulsors. The conditions that bifurcate the centers into spiral sinks or sources can be given explicitly in terms of the combinatorics of the considered Fourier modes.
This provides a theoretical justification of the empirically observed instability of GAN training: the convergent orbits towards a Nash equilibrium are mere perturbations of periodic orbits, falling slowly and spirally to the optimal point. For this reason, small variations in the training hyperparameters, such as the learning rate, the number of epochs, or the batch size, may lead to very different dynamics, which confers on the training its characteristic instability. In addition, in this paper, we empirically evaluate this method against a GAN that aims to generate samples according to an unknown exponential distribution. To facilitate the visualization, we consider a simple GAN, with 1-dimensional parameter spaces in each network, in such a way that the Nash flow can be plotted as a planar path. We show that the proposed approach allows us to understand the simplified dynamics of the GAN and to extract qualitative information on the Nash flow.
It is worth mentioning that, in order to have a natural Fourier series, the considered objective function $\mathcal{F}$ of the GAN must be periodic. This may seem unrealistic in real-life GANs, but it is actually not a very strong condition. Usually, seeking to prove theoretical results about the convergence of GANs, most works force $\mathcal{F}$ to have compact support (for instance, to assure that it is Lipschitz, as in WGANs). In practice, this is accomplished by clipping the output of the generator and discriminator functions for large inputs. Artificially, this turns the objective function into a periodic function, and thus, it can be studied through the method introduced in this paper. We expect that this work will open the door to new methods for analyzing and quantifying the convergence of GANs by importing well-established techniques of harmonic analysis and dynamical systems on closed manifolds, as studied in global analysis.
The structure of this paper is as follows. In Section 2, we review the theoretical fundamentals of GANs and their associated objective function and training method. In Section 2.1, we briefly sketch some basic concepts of Morse theory, a very successful theory that allows us to relate the analytic properties of the function to be optimized with the topological properties of the underlying space. In Section 2.2, we introduce the Nash flow and discuss some of the arising problems for its convergence. In Section 3, we introduce torus GANs, and particularly, in Section 3.1, we explain how to perform Fourier analysis on the torus. Section 4 is devoted to the analysis of the Nash flow for truncated Fourier series, both for basic functions (Section 4.1 and Section 4.2) and for more complicated combinations (Section 4.3 and Section 4.4). In addition, in Section 5, the empirical testing of this method is performed, with comparisons between the real dynamics and the predicted ideal dynamics. Finally, in Section 7, we summarize some of the key ideas of this paper and sketch some lines of future work.

2. GANs Dynamics

As introduced by Goodfellow in [1], a GAN network is a competitive model in which two intelligent agents (typically two neural networks) compete to improve their performance and to generate very precise samples according to a given distribution.
To be precise, let $X: \Omega \to \mathbb{R}^d$ be a d-dimensional random vector, defined on a certain probability space $\Omega$. This random vector X should be understood as a very complex phenomenon whose samples we would like to replicate. For this purpose, we consider two functions:
$$D: \mathbb{R}^d \times \Theta_D \to \mathbb{R}, \qquad G: \Lambda \times \Theta_G \to \mathbb{R}^d,$$
called the discriminator and the generator, respectively. Here, $\Lambda$ is a probability space, called the latent space, and $\Theta_D, \Theta_G$ are two given topological spaces. These functions should be seen as parametric families of functions $D_{\theta_D}: \mathbb{R}^d \to \mathbb{R}$ and $G_{\theta_G}: \Lambda \to \mathbb{R}^d$, parametrized by $\theta_D \in \Theta_D$ and $\theta_G \in \Theta_G$.
The aim of the GAN is to tune the parameters $\theta_D$ and $\theta_G$ in such a way that, given $x \in \mathbb{R}^d$, $D_{\theta_D}(x)$ intends to predict whether $x = X(\omega)$ for some $\omega \in \Omega$, i.e., whether x is compatible with being a real instance or it is a fake datum. Observe that, throughout this paper, we follow the convention that $D_{\theta_D}(x)$ is the probability of being a real instance; thus, $D_{\theta_D}(x) = 1$ means that $D_{\theta_D}$ is sure that x is real, and $D_{\theta_D}(x) = 0$ means that $D_{\theta_D}$ is sure that x is fake. On the other hand, the generative function, $G_{\theta_G}$, is a d-dimensional random vector that seeks to converge in distribution to the original distribution X. Typically, the probability space $\Lambda$ is $\mathbb{R}^l$ with a certain standard probability distribution $\lambda$, such as the spherical normal distribution or a uniform distribution on the unit cube.
Remark 1.
In typical applications in machine learning, $\Omega$ is given by a finite set $\Omega = \{x_1, \ldots, x_N\}$, with $x_i \in \mathbb{R}^d$, endowed with a discrete probability (typically, the uniform one) so that X is just the identity function. In customary applications of GANs, the instances $x_i$ are images, represented by their pixel maps, so the objective of the GAN is to generate new images as similar as possible to the ones in the dataset $\Omega$.
The competition appears because the agents D and G pursue objectives that cannot be simultaneously satisfied. On the one hand, D tries to improve its performance in the classification problem, but on the other hand, G tries to generate results as good as possible to fool D. To be precise, recall that a perfect fit for the classification problem for $D_{\theta_D}$ is given by $D_{\theta_D}(x) = 1$ if x is an instance of X and $D_{\theta_D}(x) = 0$ otherwise. Hence, the $L^1$ error made by $D_{\theta_D}$ with respect to perfect classification is
$$\mathcal{E}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[1 - D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[D_{\theta_D}(G_{\theta_G})\right] = 1 - \mathbb{E}_{\Omega}\left[D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[D_{\theta_D}(G_{\theta_G})\right],$$
where $\mathbb{E}_\Omega$ and $\mathbb{E}_\Lambda$ denote the mathematical expectation on $\Omega$ and $\Lambda$, respectively. In this way, the objective of $D_{\theta_D}$ is to minimize $\mathcal{E}$, while the goal of $G_{\theta_G}$ is to maximize it. It is customary in the literature to consider the function $1 - \mathcal{E}$ as the objective and to weight the error with a certain smooth concave function $f: \mathbb{R} \to \mathbb{R}$. In this way, the final cost function is
$$\mathcal{F}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[f\left(D_{\theta_D}(X)\right)\right] + \mathbb{E}_{\Lambda}\left[f\left(-D_{\theta_D}(G_{\theta_G})\right)\right].$$
Remark 2.
Typical choices for the weight function f are $f(s) = -\log(1 + \exp(-s))$, as in the original paper of Goodfellow [1], or $f(s) = s$, as in the Wasserstein GAN [9].
However, in sharp contrast with what is typical in machine learning, the aim of the GAN is not to maximize/minimize F . The objectives of the D and G agents are opposing: while D tries to maximize F , the generator tries to minimize it. In this vein, the objective of the GAN is
$$\min_{\theta_G} \max_{\theta_D} \mathcal{F}(\theta_D, \theta_G) = \min_{\theta_G} \max_{\theta_D}\; \mathbb{E}_{\Omega}\left[f\left(D_{\theta_D}(X)\right)\right] + \mathbb{E}_{\Lambda}\left[f\left(-D_{\theta_D}(G_{\theta_G})\right)\right].$$
In the case that the latent space $\Lambda$ is naturally equipped with a topology (as in the case $\Lambda = (\mathbb{R}^l, \lambda)$), it is customary to require that $\mathcal{F}: \Theta_D \times \Theta_G \to \mathbb{R}$ is a continuous function. In addition, in our case, $\Theta_G$ and $\Theta_D$ are differentiable manifolds, so we require that both D and G are $C^2$ maps in both arguments, and thus, $\mathcal{F}$ is a differentiable function on $\Theta_D \times \Theta_G$.
To be precise, the algorithm proposed by Goodfellow [1] suggests freezing the internal weights of G and using it to generate a batch of fake examples from $\Lambda$. With this set of fake instances and another batch of real instances created using X (i.e., sampling randomly from the dataset of real instances), we train D to improve its accuracy in the classification problem with the usual backpropagation (i.e., gradient descent) method. Afterwards, we freeze the weights of D, sample a batch of latent data of $\Lambda$ (i.e., we randomly sample noise using the latent distribution), and use it to train G by gradient descent with objective function $\theta_G \mapsto \mathbb{E}_\Lambda\left[f\left(-D(G_{\theta_G})\right)\right]$. Finally, we can alternate this process as many times as needed until we reach the desired results. Several metrics have been proposed to quantify this performance, especially regarding the domain of image generation, such as the Inception Score (IS) [12], the Fréchet Inception Distance (FID) [17], or perceptual similarity measures [18]. For a survey of these techniques, please refer to [19].
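The following minimal sketch illustrates the alternating scheme just described for an abstract differentiable cost $\mathcal{F}(\theta_D, \theta_G)$; gradients are estimated by central finite differences, and minibatch sampling and the network internals are deliberately abstracted away (the cost, learning rate, and step counts are placeholder choices):

```python
# Schematic Alternating Gradient Descent (AGD): several discriminator
# ascent steps on F, followed by one generator descent step.
import numpy as np

def partials(F, theta_D, theta_G, h=1e-5):
    dD = (F(theta_D + h, theta_G) - F(theta_D - h, theta_G)) / (2 * h)
    dG = (F(theta_D, theta_G + h) - F(theta_D, theta_G - h)) / (2 * h)
    return dD, dG

def agd(F, theta_D, theta_G, lr=0.05, epochs=1000, d_steps=5):
    for _ in range(epochs):
        for _ in range(d_steps):                    # train D: maximize F
            dD, _ = partials(F, theta_D, theta_G)
            theta_D += lr * dD
        _, dG = partials(F, theta_D, theta_G)       # train G: minimize F
        theta_G -= lr * dG
    return theta_D, theta_G

# Toy periodic cost; in a real GAN, F would be estimated on minibatches.
F = lambda d, g: np.sin(2 * np.pi * d) * np.sin(2 * np.pi * g)
print(agd(F, 0.3, 0.1))
```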

2.1. Review of Morse Theory

Let us suppose for a moment that, instead of looking for solutions of (2), we were seeking the local maxima of $\mathcal{F}$. In this situation, the standard approach in machine learning is to consider the Morse flow, also known as the gradient ascent flow. For it, let us fix Riemannian metrics on $\Theta_D$ and $\Theta_G$. Using them, we can compute the gradient of $\mathcal{F}$, $\nabla\mathcal{F} = (\nabla_D\mathcal{F}, \nabla_G\mathcal{F})$, where $\nabla_D\mathcal{F}$ and $\nabla_G\mathcal{F}$ denote the gradients in the $\theta_D$ and $\theta_G$ directions, respectively. Then, the Morse flow is the differentiable flow on $\Theta_D \times \Theta_G$ generated by the vector field $\nabla\mathcal{F}$. Explicitly, it is given by the system of ODEs:
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = \nabla_G \mathcal{F}(\theta_D, \theta_G).$$
This flow has been the object of intense study in the context of differential geometry and geometric topology. For instance, it is the crucial tool used in Smale's proof of the Poincaré conjecture in high dimensions [20] and has been successfully used to understand the topology of moduli spaces of solutions to highly nonlinear partial differential equations coming from theoretical physics [21], among others.
Obviously, the critical points of the system (3) are exactly the critical points of $\mathcal{F}$, in the sense that the differential $d\mathcal{F}|_{(\theta_D^0, \theta_G^0)} = 0$. In order to control the dynamics of this ODE around a critical point, a key concept is the notion of the index of a point.
Definition 1.
Let $(\theta_D^0, \theta_G^0)$ be a critical point of $\mathcal{F}$. The Hessian of $\mathcal{F}$ at $(\theta_D^0, \theta_G^0)$ is the symmetric 2-form $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0} \in \mathrm{Sym}^2\left(T^*_{\theta_D^0}\Theta_D \oplus T^*_{\theta_G^0}\Theta_G\right)$ given by
$$\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}(v, w) = w\left(\tilde{v}(\mathcal{F})\right),$$
for $v \in T_{\theta_D^0}\Theta_D$, $w \in T_{\theta_G^0}\Theta_G$, and $\tilde{v}$ any extension of $v$ to a vector field in a small neighborhood of $(\theta_D^0, \theta_G^0)$.
The point $(\theta_D^0, \theta_G^0)$ is said to be non-degenerate if $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}$ is non-degenerate as a 2-form. In that case, the index of the point, denoted $\lambda(\theta_D^0, \theta_G^0)$, is the number of negative eigenvalues of $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}$. A function $\mathcal{F}$ is said to be Morse if all its critical points are non-degenerate.
More explicitly, let $D_1, \ldots, D_{d_D}$ be a basis of $T_{\theta_D^0}\Theta_D$ and $G_1, \ldots, G_{d_G}$ be a basis of $T_{\theta_G^0}\Theta_G$, where $d_D$ and $d_G$ are the dimensions of $\Theta_D$ and $\Theta_G$, respectively. Then, the Hessian is the matrix of second derivatives:
$$\mathrm{Hess}(\mathcal{F}) = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] \dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}$$
If $\Theta_D$ and $\Theta_G$ are compact, Morse functions are known to form a dense open set of the space of continuous functions on $\Theta_D \times \Theta_G$ [20]. Moreover, the critical points of a Morse function are isolated, in the sense that there exists an open neighborhood of each critical point that contains only that critical point. Indeed, the stability of a critical point $(\theta_D, \theta_G)$ is fully determined by its index: $(\theta_D, \theta_G)$ is a sink along a submanifold of dimension $\lambda(\theta_D, \theta_G)$, while it is a source along a submanifold of dimension $d_D + d_G - \lambda(\theta_D, \theta_G)$. In particular, the only sinks of the Morse flow are precisely the local maxima of $\mathcal{F}$, at which $\mathrm{Hess}(\mathcal{F})$ is negative-definite and, thus, $\lambda(\theta_D, \theta_G) = d_D + d_G$.
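Numerically, the index and the resulting classification can be read off from the eigenvalues of the Hessian; a small illustrative sketch (the matrix below is a placeholder):

```python
# Sketch: the index of a non-degenerate critical point is the number of
# negative eigenvalues of the Hessian; sinks of the Morse flow are the
# local maxima (all eigenvalues negative).
import numpy as np

def morse_index(hessian):
    eigvals = np.linalg.eigvalsh(hessian)          # symmetric matrix
    assert np.all(np.abs(eigvals) > 1e-12), "degenerate critical point"
    return int(np.sum(eigvals < 0))

H = np.array([[-2.0, 0.5], [0.5, -1.0]])           # negative-definite example
lam = morse_index(H)
print(lam, "-> local maximum" if lam == H.shape[0] else "-> saddle or minimum")
```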
Another important fact that we use is the following topological interpretation of the indices, known as the Poincaré–Hopf theorem. It claims that, if Θ D and Θ G are compact, then
$$\sum_{(\theta_D, \theta_G) \in \mathrm{Crit}(\mathcal{F})} (-1)^{\lambda(\theta_D, \theta_G)} = \chi(\Theta_D \times \Theta_G) = \chi(\Theta_D)\, \chi(\Theta_G).$$
Here, Crit ( F ) denotes the (finite) set of critical points of F and χ is the Euler characteristic of the space.

2.2. The Nash Flow

Now, let us come back to our optimization problem (2). Despite the simplicity of the formulation of the cost function, this problem is very far from being trivial. The best scenario would be to obtain a so-called Nash equilibrium.
Definition 2.
Let $\mathcal{F}: \Theta_D \times \Theta_G \to \mathbb{R}$ be a differentiable function. A point $(\theta_D^0, \theta_G^0) \in \Theta_D \times \Theta_G$ is said to be a Nash equilibrium if
  • the function $\theta_D \mapsto \mathcal{F}(\theta_D, \theta_G^0)$ has a maximum at $\theta_D^0$;
  • the function $\theta_G \mapsto \mathcal{F}(\theta_D^0, \theta_G)$ has a minimum at $\theta_G^0$.
Remark 3.
A Nash equilibrium is in particular a critical point of F .
In this vein, it is natural to consider a differentiable flow analogous to (3) but converging to Nash equilibria. For this purpose, fix Riemannian metrics on $\Theta_D$ and $\Theta_G$ as above and consider the gradient $\nabla\mathcal{F} = (\nabla_D\mathcal{F}, \nabla_G\mathcal{F})$. Now, we twist the gradient to consider the Nash vector field:
$$N(\mathcal{F}) = \left(\nabla_D \mathcal{F},\; -\nabla_G \mathcal{F}\right).$$
Definition 3.
The Nash flow is the differentiable flow on Θ D × Θ G generated by the Nash vector field N ( F ) . Explicitly, it is the system of ODEs:
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \mathcal{F}(\theta_D, \theta_G).$$
This flow (or, more precisely, the associated discrete-time version known as the AGD flow) has been intensively used for training GANs from their very inception. Already in Goodfellow's seminal paper [1], this flow was proposed as a method for seeking Nash equilibria of the game (2).
To understand the dynamics of the Nash flow, let us study it around a critical point. Working in a local chart around a critical point, with an adapted basis $D_1, \ldots, D_{d_D}, G_1, \ldots, G_{d_G}$ of $T_{\theta_D^0}\Theta_D \oplus T_{\theta_G^0}\Theta_G$, the differential of the Nash vector field is the Nash Hessian:
$$N\mathrm{Hess}(\mathcal{F}) = N(\mathcal{F})_* = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}$$
In this manner, in a small neighborhood of a critical point ( θ D 0 , θ G 0 ) Θ D × Θ G of F (in particular, around a Nash equilibrium), the dynamics are determined by the linearized version:
$$\begin{pmatrix} \theta_D' \\ \theta_G' \end{pmatrix} = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}_{(\theta_D^0, \theta_G^0)} \begin{pmatrix} \theta_D \\ \theta_G \end{pmatrix}$$
However, in sharp contrast with the Morse flow, even if F has non-degenerate critical points, it may happen that the Nash equilibria are not attractors. For instance, if the Nash Hessian has a vanishing diagonal (as in Section 4.2), then periodic orbits arise around the critical point and the flow is non-convergent.
Nonetheless, this behavior can be controlled. Suppose for simplicity that $d_D = d_G = 1$ (higher dimensional scenarios can be treated analogously by splitting the tangent space). In that case, the eigenvalues of $N\mathrm{Hess}(\mathcal{F})$ are either both real or complex conjugates.
  • If the eigenvalues are real around a Nash equilibrium, both eigenvalues must be nonpositive, since in the usual Hessian they have different signs. Hence, the Nash equilibrium is a non-repulsor of the Nash flow. Moreover, if $\mathcal{F}$ is Morse, then its eigenvalues do not vanish and, thus, the Nash equilibrium is an attractor.
  • If the eigenvalues are complex conjugates, say $\lambda, \bar{\lambda} \in \mathbb{C}$, then the dynamic is controlled by the real part of $\lambda$, $\mathrm{Re}(\lambda)$. There is an invariant way of computing this quantity through the trace of $N\mathrm{Hess}(\mathcal{F})$, since
    $$2\,\mathrm{Re}(\lambda) = \lambda + \bar{\lambda} = \mathrm{tr}\, N\mathrm{Hess}(\mathcal{F}) = \frac{\partial^2 \mathcal{F}}{\partial \theta_D^2} - \frac{\partial^2 \mathcal{F}}{\partial \theta_G^2}.$$
    Observe that this is nothing but the wave operator acting on $\mathcal{F}$. In the case that this trace is negative, the critical point is an attractor with spiral dynamics; if it is positive, it is a repulsor; and if it vanishes, it is a center with surrounding periodic orbits (see the sketch below).
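The following sketch implements this classification for $d_D = d_G = 1$: it assembles the Nash Hessian from given second derivatives of $\mathcal{F}$ at a critical point and distinguishes attractors, repulsors, spirals, and centers (the numerical inputs are illustrative placeholders):

```python
# Sketch: classify a critical point of the Nash flow from the twisted
# Hessian built out of the second derivatives of F.
import numpy as np

def classify_nash_critical_point(F_DD, F_DG, F_GD, F_GG, tol=1e-12):
    nhess = np.array([[F_DD, F_DG], [-F_GD, -F_GG]])   # Nash Hessian
    eigvals = np.linalg.eigvals(nhess)
    if np.all(np.abs(eigvals.imag) < tol):             # real eigenvalues
        return "attractor" if np.all(eigvals.real < 0) else "saddle/repulsor"
    trace = np.trace(nhess)                            # = 2 Re(lambda)
    if trace < -tol:
        return "spiral attractor"
    if trace > tol:
        return "spiral repulsor"
    return "center (periodic orbits)"

# Vanishing diagonal, as in Section 4.2: a center with periodic orbits.
print(classify_nash_critical_point(F_DD=0.0, F_DG=1.0, F_GD=1.0, F_GG=0.0))
```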
It is worth mentioning that, in the case of GANs, the function $\mathcal{F}$ of (2) to be optimized does not define a convex–concave game, so, in general, the convergence of the usual training methods through the Nash flow is not guaranteed [3]. Under some ideal assumptions on the behaviour of the game around the Nash equilibrium points, in [3], the authors proved that the Nash flow is locally asymptotically stable. However, the hypotheses needed to apply this result are quite strong and seem to be unfeasible in practice. For instance, in [4], the authors show an example of a very simple GAN, the so-called Dirac GAN, for which the usual gradient descent does not converge.

3. Torus GANs

From now on, let us focus on a very particular case of GAN that we call a torus GAN. Let us denote
$$T^n = \underbrace{S^1 \times \cdots \times S^1}_{n \text{ times}}$$
as the n-dimensional torus. Then, we take as parameter spaces $\Theta_D = T^{d_D}$ and $\Theta_G = T^{d_G}$. In this way, the cost functional becomes a function:
$$\mathcal{F}: T^{d_D} \times T^{d_G} = T^{d_D + d_G} \to \mathbb{R}.$$
Remark 4.
This particular choice is not as arbitrary as it may seem at first sight. In the end, a torus GAN is any GAN in which the generator and discriminator are periodic functions of their parameters $\theta_D$ and $\theta_G$ for some large enough period. In standard neural network-based GANs, it is customary to clip the output of the neural network in order to prevent the internal weights from becoming arbitrarily large. This is particularly important in Wasserstein GANs, where the objective function is required to be Lipschitz, and this is achieved by forcing the cost function to have compact support. In this way, after clipping, both the generator and the discriminator agents are periodic functions, and thus, they define a torus GAN.
Working on the torus has important consequences for the dynamics of the Morse flow. Some of them are the following:
  • Divergent orbits are not allowed. Since $T^n$ is compact, standard results on the prolongability of solutions for a short time show that the orbits of any vector flow cannot blow up. Intuitively, they cannot escape by tending to infinity. In particular, if $\mathcal{F}$ is a Morse function, all the orbits in the Morse flow must converge to a critical point. This is a consequence of the fact that, along a non-constant orbit of the Morse flow, the function $\mathcal{F}$ is strictly increasing since
    $$\frac{d}{dt}\mathcal{F}(\theta_D, \theta_G) = d\mathcal{F}(\theta_D', \theta_G') = d\mathcal{F}(\nabla\mathcal{F}) = ||\nabla\mathcal{F}||^2 > 0.$$
    Thus, since F is bounded, the flow is forced to converge to a constant orbit, that is, to a critical point of F . This prevents the appearance of periodic orbits in the Morse flow. In the Nash flow, this may no longer hold and periodic orbits may arise (as in Section 4.2).
  • Topological restrictions: the Euler characteristic of $T^n$ is $\chi(T^n) = \chi(S^1)^n = 0$. Hence, Equation (4) implies that
    $$\sum_{(\theta_D, \theta_G) \in \mathrm{Crit}(\mathcal{F})} (-1)^{\lambda(\theta_D, \theta_G)} = 0.$$
    In other words, there is the same number of critical points of even index as of odd index. In particular, if $d_D = d_G = 1$, there are as many saddle points (which are points of index 1) as maxima and minima (which are points of index 2 or 0).

3.1. Fourier Analysis in the Torus

In order to understand the cost function F of a torus GAN, we apply techniques of harmonic analysis to it. We suppose that the reader is familiar with basic notions of Fourier and harmonic analysis, such as Hilbert spaces and orthogonal Schauder basis on them. Otherwise, please refer to [22].
Let us consider $T^n = \mathbb{R}^n / \mathbb{Z}^n$, so that functions on $T^n$ are $\mathbb{Z}^n$-periodic functions on $\mathbb{R}^n$. Recall that a fundamental result of Fourier analysis is that the space $L^2(T^n)$ of complex-valued square-integrable functions on $T^n$ is a Hilbert space with product given by
$$\langle \mathcal{F}, \mathcal{G} \rangle = \int_{T^n} \mathcal{F}(\theta)\, \overline{\mathcal{G}(\theta)}\, d\theta.$$
Moreover, this space is spanned by the orthonormal basis of functions:
$$e_m(\theta) = e^{2\pi i\, m \cdot \theta},$$
where $m = (m_1, \ldots, m_n) \in \mathbb{Z}^n$, $\theta = (\theta_1, \ldots, \theta_n) \in T^n$, and $m \cdot \theta = m_1\theta_1 + \cdots + m_n\theta_n$ is the standard inner product. In other words, any $\mathcal{F} \in L^2(T^n)$ can be uniquely written as a sum:
$$\mathcal{F}(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e_m(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e^{2\pi i\, m \cdot \theta},$$
in the sense that this sum is convergent in L 2 ( T n ) and converges to F . This expression is referred to as the Fourier series of F . The coefficients α m are called the Fourier coefficients or the Fourier modes of F . Using the orthogonality of the functions e m ( θ ) , they can be obtained as
$$\alpha_m = \langle \mathcal{F}, e_m \rangle = \int_{T^n} \mathcal{F}(\theta)\, e^{-2\pi i\, m \cdot \theta}\, d\theta.$$
In principle, the convergence of the Fourier series to $\mathcal{F}$ is only in the $L^2$ sense (cf. [23] for a Fourier series of a continuous function not converging pointwise everywhere, or [24] for an everywhere divergent Fourier series of an $L^1$ function). However, if $\mathcal{F}$ is $C^1$, since we are working on a compact space, it is automatically Hölder and, thus, its Fourier series converges uniformly [25]. This means that, for every $\epsilon > 0$,
$$\left\| \mathcal{F} - \sum_{|m_i| \le N} \alpha_m\, e_m \right\|_\infty = \sup_{\theta \in T^n} \left| \mathcal{F}(\theta) - \sum_{|m_i| \le N} \alpha_m\, e^{2\pi i\, m \cdot \theta} \right| < \epsilon,$$
for all N large enough. Similar approximations can be obtained for the first k derivatives of $\mathcal{F}$ if it has enough regularity (concretely, if it is $C^{k+1}$).
This approximation is very useful for estimating the associated flow. Recall that, using the Gronwall inequality [26], if X , Y are two Lipschitz vector fields, then there exists a constant M > 0 such that their associated flows θ ( t ) and ϑ ( t ) satisfy
$$|\theta(t) - \vartheta(t)| \le \frac{e^{Mt} - 1}{M}\, ||X - Y||_\infty$$
for all t. In other words, for medium times, the flow of X may be approximated through the flow of Y.
Remark 5.
The previous estimation implies that, locally, the dynamics of the flows θ ( t ) and ϑ ( t ) are similar. In particular, this is useful for analyzing convergence around critical points. Nevertheless, the global dynamics of θ ( t ) and ϑ ( t ) may be quite different, say, they may have different numbers of critical points.
In our context, this idea can be exploited as follows. Let us denote by
$$\Theta_N(\mathcal{F}) = \sum_{|m_i| \le N} \alpha_m\, e_m$$
the truncated Fourier series of $\mathcal{F}$. If $\mathcal{F}$ is $C^2$, then $\nabla\mathcal{F}$ and $\nabla\Theta_N(\mathcal{F})$ are close vector fields and, thus,
$$|\theta(t) - \theta_N(t)| \le \frac{e^{Mt} - 1}{M}\, ||\nabla\mathcal{F} - \nabla\Theta_N(\mathcal{F})||_\infty \le \epsilon\left(e^{Mt} - 1\right)$$
for N large enough, where θ ( t ) is the Morse flow for F and θ N ( t ) is the Morse flow for Θ N ( F ) . Working verbatim with the Nash vector fields, we obtain similar estimates for the solutions of the Nash flow.

4. Dynamics of Fourier Basis

In this section, we focus on the Nash flow of truncated approximations of Fourier series of a C 2 function F . As we mentioned above, these solutions approximate quite well the real Nash flow of F for short times (particularly, around critical points).
For the sake of simplicity, in this section, we focus on the 2-dimensional case in which d D = d G = 1 so that F = F ( θ 1 , θ 2 ) is a function:
$$\mathcal{F}: T^2 \to \mathbb{R}.$$
Moreover, we truncate the Fourier series at the level $N = 2$. Similar arguments can be carried out in higher dimensions and with higher precision of the Fourier series with similar results, but the calculations become more involved.
First, let us rewrite the Fourier series of F as a trigonometric polynomial. Recall that the trigonometric functions can be obtained from the complex exponential as
$$\cos(2\pi\theta) = \frac{e^{2\pi i \theta} + e^{-2\pi i \theta}}{2}, \qquad \sin(2\pi\theta) = \frac{e^{2\pi i \theta} - e^{-2\pi i \theta}}{2i}.$$
Since the function $\mathcal{F}$ is real-valued, we can group the coefficients and obtain a formula for the Fourier series in terms of trigonometric functions as
$$\begin{aligned} \mathcal{F}(\theta_1, \theta_2) = {}&\sum_{m_1, m_2 = 0}^{\infty} a^{0,0}_{m_1, m_2}\, \sin(2\pi m_1 \theta_1)\sin(2\pi m_2 \theta_2) + \sum_{m_1, m_2 = 0}^{\infty} a^{0,1}_{m_1, m_2}\, \sin(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2) \\ + {}&\sum_{m_1, m_2 = 0}^{\infty} a^{1,0}_{m_1, m_2}\, \cos(2\pi m_1 \theta_1)\sin(2\pi m_2 \theta_2) + \sum_{m_1, m_2 = 0}^{\infty} a^{1,1}_{m_1, m_2}\, \cos(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2). \end{aligned}$$
The coefficients are real numbers that can be obtained as
$$\begin{aligned} a^{0,0}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{0,1}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{1,0}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{1,1}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \end{aligned}$$
where $\delta_{m_1,m_2}$ is the normalization coefficient given by $\delta_{m_1,m_2} = 1$ if $m_1 = m_2 = 0$; $\delta_{m_1,m_2} = 2$ if $m_1 = 0$ and $m_2 > 0$, or $m_1 > 0$ and $m_2 = 0$; and $\delta_{m_1,m_2} = 4$ if $m_1, m_2 > 0$.
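As an illustration, these coefficients can be approximated by rectangular quadrature on a uniform grid over $T^2$; a minimal sketch, with a placeholder periodic function in place of the GAN cost:

```python
# Sketch: rectangular-quadrature estimate of a^{alpha,beta}_{m1,m2} on T^2.
# Convention from the text: exponent 0 = sin, exponent 1 = cos.
import numpy as np

def trig_coefficient(F, m1, m2, alpha, beta, grid=200):
    t = np.arange(grid) / grid
    t1, t2 = np.meshgrid(t, t, indexing="ij")
    trig = [np.sin, np.cos]
    basis = trig[alpha](2 * np.pi * m1 * t1) * trig[beta](2 * np.pi * m2 * t2)
    # Normalization delta_{m1,m2}: 1, 2 or 4 depending on vanishing modes.
    delta = (1 + (m1 > 0)) * (1 + (m2 > 0))
    return delta * np.mean(F(t1, t2) * basis)   # rectangular rule on [0,1]^2

F = lambda t1, t2: np.cos(2 * np.pi * t1) * np.cos(2 * np.pi * t2) \
    + 0.3 * np.cos(4 * np.pi * t1) * np.cos(2 * np.pi * t2)
print(trig_coefficient(F, 1, 1, alpha=1, beta=1))   # ~ 1.0
print(trig_coefficient(F, 2, 1, alpha=1, beta=1))   # ~ 0.3
```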
To shorten notation, from now on, we denote
$$\begin{aligned} \Lambda^{0,0}_{m_1,m_2}(\theta_1, \theta_2) &= \sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2), & \Lambda^{0,1}_{m_1,m_2}(\theta_1, \theta_2) &= \sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2), \\ \Lambda^{1,0}_{m_1,m_2}(\theta_1, \theta_2) &= \cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2), & \Lambda^{1,1}_{m_1,m_2}(\theta_1, \theta_2) &= \cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2). \end{aligned}$$
This notation is particularly useful because, for any α , β Z 2 ,
$$\frac{\partial}{\partial \theta_1} \Lambda^{\alpha,\beta}_{m_1,m_2} = (-1)^{\alpha}\, 2\pi m_1\, \Lambda^{\alpha+1,\beta}_{m_1,m_2}, \qquad \frac{\partial}{\partial \theta_2} \Lambda^{\alpha,\beta}_{m_1,m_2} = (-1)^{\beta}\, 2\pi m_2\, \Lambda^{\alpha,\beta+1}_{m_1,m_2},$$
where the sums $\alpha + 1$ and $\beta + 1$ are interpreted in $\mathbb{Z}_2$.
From this expression of the Fourier series, we approximate the dynamics of the Nash flow for $\mathcal{F}$ by truncating the Fourier series. In particular, we sort the coefficients $a^{\alpha,\beta}_{m_1,m_2}$ by decreasing order of their absolute value. Looking only at the two largest coefficients and normalizing so that the leading coefficient is 1, we consider the approximation to $\mathcal{F}$:
$$\Theta(\mathcal{F}) = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
where $\alpha, \beta, \gamma, \delta \in \mathbb{Z}_2$, $(m_1, m_2)$ are the leading Fourier modes, $(n_1, n_2)$ are the second largest modes, and $|\mu| < 1$.

4.1. Nash Flow for Single Variable Fourier Basis

From now on, we aim to analyze the Nash flow for a truncated Fourier series. As we see in Section 5, from it, we can envisage the global dynamics of the Nash flow for the objective function of a GAN.
First, let us consider the simplest Fourier modes, namely those with $m_1 = 0$ or $m_2 = 0$. In this case, the dynamics are quite simple and, in most cases, decouple. In the case of $\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) \equiv 1$, the Nash flow equations amount to
$$\theta_1' = \frac{\partial}{\partial \theta_1}\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) = 0, \qquad \theta_2' = -\frac{\partial}{\partial \theta_2}\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) = 0.$$
Therefore, the solutions are constant orbits $(\theta_1(t), \theta_2(t)) = (\theta_1^0, \theta_2^0)$ for some fixed $(\theta_1^0, \theta_2^0) \in T^2$. For this reason, this mode does not contribute to the dynamics.
For Fourier modes of the form $\Lambda^{0,\beta}_{m_1,0}(\theta_1, \theta_2) = \sin(2\pi m_1\theta_1)$ or $\Lambda^{1,\beta}_{m_1,0}(\theta_1, \theta_2) = \cos(2\pi m_1\theta_1)$, the situation is also very simple. Now, the Nash flow is given by
$$\theta_1' = \frac{\partial}{\partial \theta_1}\Lambda^{\alpha,\beta}_{m_1,0}(\theta_1, \theta_2) = (-1)^{\alpha}\, 2\pi m_1\, \Lambda^{\alpha+1,\beta}_{m_1,0}(\theta_1, \theta_2), \qquad \theta_2' = -\frac{\partial}{\partial \theta_2}\Lambda^{\alpha,\beta}_{m_1,0}(\theta_1, \theta_2) = 0.$$
The solution to this system has the form $(\theta_1(t), \theta_2(t)) = (f^{\alpha}_{m_1}(t), \theta_2^0)$ for some fixed $\theta_2^0$, where $f^{\alpha}_{m_1}(t)$ is a differentiable function depending on $m_1$ and $\alpha$ (the explicit form of $f^{\alpha}_{m_1}(t)$ can be obtained by solving the 1-dimensional ODE for $\theta_1$ by separation of variables; see the worked computation below). Thus, the flow is completely horizontal, with $2m_1$ lines of critical points at the lines $\theta_1 = \frac{2k_1 - \alpha + 1}{4m_1}$ for $k_1 \in \mathbb{Z}$. Half of these critical lines are attractive, corresponding to the maxima of $\Lambda^{\alpha,\beta}_{m_1,0}$, and half of them are repulsive, corresponding to the minima.
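For instance, in the sine case $\alpha = 0$ (taking the constant $\theta_2$-factor equal to 1, so that $\theta_1' = 2\pi m_1 \cos(2\pi m_1\theta_1)$), a short derivation by separation of variables, which we include here for completeness, gives the explicit profile:

```latex
% Horizontal Nash flow for alpha = 0:  theta_1' = 2 pi m_1 cos(2 pi m_1 theta_1).
% Substituting u = 2 pi m_1 theta_1 and separating variables:
\begin{align*}
  u' = (2\pi m_1)^2 \cos u
  \quad &\Longrightarrow \quad
  \int \frac{du}{\cos u} = (2\pi m_1)^2\, t + C \\
  &\Longrightarrow \quad
  u(t) = 2\arctan\left(\tanh\left(\frac{(2\pi m_1)^2\, t + C}{2}\right)\right),
\end{align*}
% the inverse Gudermannian integral. As t -> infinity, u -> pi/2, so
% theta_1(t) -> 1/(4 m_1): the orbit converges to the attractive critical
% line through a maximum of sin(2 pi m_1 theta_1).
```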
The situation for the Fourier modes of the form $\Lambda^{\alpha,0}_{0,m_2}(\theta_1, \theta_2) = \sin(2\pi m_2\theta_2)$ or $\Lambda^{\alpha,1}_{0,m_2}(\theta_1, \theta_2) = \cos(2\pi m_2\theta_2)$ is completely symmetric. Now, the flow is vertical and the critical lines are at $\theta_2 = \frac{2k_2 - \beta + 1}{4m_2}$ for $k_2 \in \mathbb{Z}$ (but now the attractive lines correspond to the minima and the repulsive ones to the maxima).
Furthermore, we can collect all the Fourier modes with a vanishing frequency into a single function. To be precise, decompose the Fourier series of F as
$$\mathcal{F} = \underbrace{\left(\frac{a^{1,1}_{0,0}}{2} + \sum_{\substack{1 \le m_1 < \infty \\ \alpha = 0,1}} a^{\alpha,1}_{m_1,0}\, \Lambda^{\alpha,1}_{m_1,0}\right)}_{\Delta_1(\theta_1)} + \underbrace{\left(\frac{a^{1,1}_{0,0}}{2} + \sum_{\substack{1 \le m_2 < \infty \\ \beta = 0,1}} a^{1,\beta}_{0,m_2}\, \Lambda^{1,\beta}_{0,m_2}\right)}_{\Delta_2(\theta_2)} + \underbrace{\sum_{m_1, m_2 = 1}^{\infty} \sum_{\alpha, \beta \in \mathbb{Z}_2} a^{\alpha,\beta}_{m_1,m_2}\, \Lambda^{\alpha,\beta}_{m_1,m_2}}_{\Theta(\theta_1, \theta_2)}.$$
Now, the superposition principle applied to (5) implies that any solution to the Nash flow has the following form:
$$(\theta_1(t), \theta_2(t)) = (\hat{\theta}_1(t), \theta_2^0) + (\theta_1^0, \hat{\theta}_2(t)) + \Phi(t),$$
where $(\hat{\theta}_1(t), \theta_2^0)$ is a horizontal flow corresponding to the solution of (5) for $\Delta_1$ (explicitly, $\hat{\theta}_1$ is the solution to the equation $\hat{\theta}_1' = \frac{d}{d\theta_1}\Delta_1(\hat{\theta}_1)$), $(\theta_1^0, \hat{\theta}_2(t))$ is a vertical flow corresponding to the solution of (5) for $\Delta_2$ (i.e., $\hat{\theta}_2$ is the solution to $\hat{\theta}_2' = -\frac{d}{d\theta_2}\Delta_2(\hat{\theta}_2)$), and $\Phi$ is the solution to the (coupled) system of Equation (5) for $\Theta$.
For this reason, in many cases, the effect of the Δ 1 and the Δ 2 parts in the dynamics is negligible and can be ignored.

4.2. Nash Flow for Fourier Basis

In this section, we analyze the dynamics of the Nash flow for the remaining Fourier basis. For this purpose, let us consider the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$ for some $\alpha, \beta \in \mathbb{Z}_2$ with $m_1, m_2 \ge 1$. The Nash vector field associated with it is
$$N\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right) = 2\pi\left((-1)^{\alpha}\, m_1\, \Lambda^{\alpha+1,\beta}_{m_1,m_2},\; (-1)^{\beta+1}\, m_2\, \Lambda^{\alpha,\beta+1}_{m_1,m_2}\right).$$
Recall that, if $(\theta_1, \theta_2) \in T^2$ is a zero of $\Lambda^{\alpha,\beta}_{m_1,m_2}$, then it satisfies
$$4\theta_1 m_1 \equiv 2k_1 + \alpha \;\;\mathrm{mod}\; 4\mathbb{Z}, \quad \text{or} \quad 4\theta_2 m_2 \equiv 2k_2 + \beta \;\;\mathrm{mod}\; 4\mathbb{Z},$$
for some $k_1, k_2 \in \mathbb{Z}$. In other words, if we take into account the periodicity of the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$, the zeros are given by
$$\theta_1 = \frac{2k_1 + \alpha}{4m_1}, \quad \text{or} \quad \theta_2 = \frac{2k_2 + \beta}{4m_2},$$
for $0 \le k_1 < 2m_1$ and $0 \le k_2 < 2m_2$. Observe that all these values are different, so $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $4m_1m_2$ zeros.
Coming back to Equation (7), we observe that, if $(\theta_1, \theta_2) \in T^2$ is a critical point of the Nash vector field (i.e., a critical point of $\Lambda^{\alpha,\beta}_{m_1,m_2}$), then it satisfies one of the following two possibilities:
(I) $\left(4\theta_1 m_1,\, 4\theta_2 m_2\right) \equiv \left(2k_1 - \alpha + 1,\, 2k_2 - \beta + 1\right) \;\mathrm{mod}\; 4\mathbb{Z} \times 4\mathbb{Z}$,
(II) $\left(4\theta_1 m_1,\, 4\theta_2 m_2\right) \equiv \left(2k_1 + \alpha,\, 2k_2 + \beta\right) \;\mathrm{mod}\; 4\mathbb{Z} \times 4\mathbb{Z}$.
Beware of the change in sign of the coefficients of $\alpha$ and $\beta$ for points of type (I). This is just a matter of notational convenience, as shown below. Equivalently, these conditions can be written explicitly as
(I) $(\theta_1, \theta_2) = \left(\dfrac{2k_1 - \alpha + 1}{4m_1},\; \dfrac{2k_2 - \beta + 1}{4m_2}\right)$, for $k_1, k_2 \in \mathbb{Z}$,
(II) $(\theta_1, \theta_2) = \left(\dfrac{2k_1 + \alpha}{4m_1},\; \dfrac{2k_2 + \beta}{4m_2}\right)$, for $k_1, k_2 \in \mathbb{Z}$.
Thus, the Nash vector field has $8m_1m_2$ critical points: $4m_1m_2$ critical points of type (I) and $4m_1m_2$ of type (II).
Regarding the Nash Hessian, it is explicitly given by
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right) = 4\pi^2\begin{pmatrix} -m_1^2\, \Lambda^{\alpha,\beta}_{m_1,m_2} & (-1)^{\alpha+\beta}\, m_1 m_2\, \Lambda^{\alpha+1,\beta+1}_{m_1,m_2} \\ (-1)^{\alpha+\beta+1}\, m_1 m_2\, \Lambda^{\alpha+1,\beta+1}_{m_1,m_2} & m_2^2\, \Lambda^{\alpha,\beta}_{m_1,m_2} \end{pmatrix}$$
Therefore, evaluated at a critical point of the form (I), we get that
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right)\Big|_{(\mathrm{I})} = (-1)^{k_1+k_2}\, 4\pi^2 \begin{pmatrix} -m_1^2 & 0 \\ 0 & m_2^2 \end{pmatrix}.$$
These are all saddle points for the Nash flow, with an attractive direction and a repulsive direction.
On the other hand, the Nash Hessian evaluated at a critical point of the form (II) is
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right)\Big|_{(\mathrm{II})} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & m_1 m_2 \\ -m_1 m_2 & 0 \end{pmatrix} \sim (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2\, m_1 m_2 \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix}.$$
In this situation, we obtain a center critical point with periodic orbits around it and no convergent flow lines. This dynamic is depicted in Figure 1. Observe that, in this plot, the 2-dimensional torus $T^2$ is represented as the square $[0,1] \times [0,1]$ with the boundaries identified in pairs, i.e., the left boundary $\{0\} \times [0,1]$ is identified with the right boundary $\{1\} \times [0,1]$ preserving the orientation, and so are the bottom boundary $[0,1] \times \{0\}$ and the upper one $[0,1] \times \{1\}$.
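A phase portrait in the spirit of Figure 1 can be reproduced with a few lines of Python; the sketch below plots the Nash vector field of the illustrative basis function $\Lambda^{0,0}_{1,1}$, whose type (I) points are saddles and whose type (II) points are centers:

```python
# Sketch: stream plot of the Nash field of sin(2*pi*t1)*sin(2*pi*t2) on [0,1]^2.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 40)
T1, T2 = np.meshgrid(t, t)
# Nash field: (+d/dtheta_1, -d/dtheta_2) of Lambda.
U = 2 * np.pi * np.cos(2 * np.pi * T1) * np.sin(2 * np.pi * T2)
V = -2 * np.pi * np.sin(2 * np.pi * T1) * np.cos(2 * np.pi * T2)

plt.streamplot(T1, T2, U, V, density=1.4)
plt.xlabel(r"$\theta_1$"); plt.ylabel(r"$\theta_2$")
plt.title(r"Nash flow of $\Lambda^{0,0}_{1,1}$ on $T^2$")
plt.show()
```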
Putting together these calculations, we have proven the following result.
Proposition 1.
The Nash flow for the Fourier basis function $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $8m_1m_2$ critical points, for which the dynamics are as follows:
(I) $4m_1m_2$ points are saddle points for the flow, half of them corresponding to the maxima of $\Lambda^{\alpha,\beta}_{m_1,m_2}$ and half of them to the minima.
(II) $4m_1m_2$ points are center points for the flow, surrounded by periodic orbits and corresponding to the saddle points of $\Lambda^{\alpha,\beta}_{m_1,m_2}$.

4.3. Nash Flow for Simplified Truncated Fourier Series

In [3], it is proven that, under some ideal conditions, the Nash flow associated with the cost function of a GAN has stable Nash equilibria. For this reason, according to Proposition 1, these cost functions cannot be single basis functions of the Fourier series. In other words, their Fourier approximation (6) is nontrivial. Hence, in order to capture the actual dynamics of the GAN flow, let us consider a general truncated Fourier series of the following form:
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
for some $\alpha, \beta, \gamma, \delta \in \mathbb{Z}_2$, $-1 \le \mu \le 1$, and Fourier modes $m_1, m_2, n_1, n_2 \ge 1$.
In order to simplify the computations, in this section, we suppose that $m_1 = m_2 = 1$. After this case, the general setting is studied. In this simplified case, at a point $(\theta_1^0, \theta_2^0) = \left(\frac{k_1}{2} + \frac{\alpha}{4},\, \frac{k_2}{2} + \frac{\beta}{4}\right)$ of the form (II), we have
$$\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 2\pi\mu\left((-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0),\; (-1)^{\delta}\, n_2\, \Lambda^{\gamma,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right).$$
At this point, we have the following two options.
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, then $(\theta_1^0, \theta_2^0)$ is also a critical point of $\Theta$. Hence, the dynamic of the Nash flow near $(\theta_1^0, \theta_2^0)$ is determined by the Nash Hessian at that point. This Hessian is given by
    $$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} + \mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)}$$
    Suppose that $(\gamma, \delta) = (\alpha + 1, \beta + 1)$ in $\mathbb{Z}_2 \times \mathbb{Z}_2$. Set $\sigma = (-1)^{n_1 k_1 + n_2 k_2 + \alpha n_1/2 + \beta n_2/2}$. Observe that $\Lambda^{\alpha,\beta}_{n_1,n_2}(\theta_1^0, \theta_2^0) = 0$ and $\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0) = \sigma$, so we have that
    $$\mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \mu\sigma \begin{pmatrix} -n_1^2 & 0 \\ 0 & n_2^2 \end{pmatrix}$$
    With this calculation at hand, we observe the following. By continuity, for $|\mu|$ small, since $N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{1,1}\right)\big|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ also has complex eigenvalues. In particular, they must be conjugate, say $\lambda, \bar{\lambda} \in \mathbb{C}$. In that case, the stability of the critical point at $(\theta_1^0, \theta_2^0)$ is governed by the following trace:
    $$2\,\mathrm{Re}(\lambda) = \lambda + \bar{\lambda} = \mathrm{tr}\, N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \mu\sigma\left(n_2^2 - n_1^2\right).$$
    Hence, if $n_2 < n_1$ and $\mu\sigma > 0$, or $n_2 > n_1$ and $\mu\sigma < 0$ (respectively $n_2 > n_1$ and $\mu\sigma > 0$, or $n_2 < n_1$ and $\mu\sigma < 0$), any critical point nearby $(\theta_1, \theta_2) \in T^2$ is a spiral attractor (respectively repulsor). In the case that $n_1 = n_2$, the eigenvalues are multiples of $i$ and $-i$, so the point is still a center and the behaviour bifurcates depending on further Fourier modes.
    On the other hand, if $\gamma = \alpha$ or $\delta = \beta$ in $\mathbb{Z}_2$, then we have that
    $$\mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)} = \pm 4\pi^2 \mu \begin{pmatrix} 0 & n_1 n_2 \\ -n_1 n_2 & 0 \end{pmatrix}$$
    Therefore, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ is still an anti-diagonal matrix and the dynamics depend on further Fourier modes.
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$, then $(\theta_1^0, \theta_2^0)$ is no longer a critical point of $\Theta$. However, if $|\mu|$ is small, by the implicit function theorem, there must be a unique critical point $(\tilde{\theta}_1, \tilde{\theta}_2) \in T^2$ of $\Theta$ nearby $(\theta_1^0, \theta_2^0)$. Again, by continuity, since $N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{1,1}\right)\big|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues, $N\mathrm{Hess}(\Theta)|_{(\tilde{\theta}_1, \tilde{\theta}_2)}$ also has complex eigenvalues, and their real part can be controlled through the trace.
    Explicitly, the Nash Hessian is
    $$N\mathrm{Hess}(\Theta)|_{(\tilde{\theta}_1, \tilde{\theta}_2)} = 4\pi^2 \begin{pmatrix} -n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} & \pm 1 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} \\ \mp 1 \mp \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} & n_2^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} \end{pmatrix}_{(\tilde{\theta}_1, \tilde{\theta}_2)}$$
    Therefore, its trace is given by
    $$4\pi^2 \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\tilde{\theta}_1, \tilde{\theta}_2)\left(n_2^2 - n_1^2\right).$$
    In particular, if $n_1 = n_2$, then the new critical point $(\tilde{\theta}_1, \tilde{\theta}_2)$ is still a center. Otherwise, the behaviour is determined by the sign of $\Lambda^{\gamma,\delta}_{n_1,n_2}(\tilde{\theta}_1, \tilde{\theta}_2)$. This sign can be read from the gradient and the Nash Hessian at $(\theta_1^0, \theta_2^0)$.
    To illustrate this idea, we consider a particular combination of signs. The other cases can be obtained analogously. Suppose that the first component of the gradient satisfies
    $$\frac{\partial \Theta}{\partial \theta_1}(\theta_1^0, \theta_2^0) = 2\pi\mu\, (-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0) > 0.$$
    In addition, suppose that the entries of the first row of the Nash Hessian have signs
    $$\left(N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}\right)_{1,1} = -4\pi^2 n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0) > 0, \qquad \left(N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}\right)_{1,2} = 4\pi^2\left(\pm 1 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right) < 0.$$
    In that case, this means that $(\tilde{\theta}_1, \tilde{\theta}_2)$ has the form $(\tilde{\theta}_1, \tilde{\theta}_2) = (\theta_1^0 - \epsilon_1,\, \theta_2^0 + \epsilon_2)$ for small $\epsilon_1, \epsilon_2 > 0$. Therefore, the sign of (9) is determined by the sign of $\Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0 - \epsilon_1,\, \theta_2^0 + \epsilon_2)$, which is a well-defined quantity that only depends on the particular point $(\theta_1^0, \theta_2^0)$ and $\gamma, \delta \in \mathbb{Z}_2$.

4.4. Nash Flow for General Truncated Fourier Series

In the general case, the calculation is similar but more involved. To alleviate notation, let us consider the auxiliary functions:
$$\sigma_0(\theta) = \begin{cases} 0 & \text{if } \theta = 0 \text{ or } \tfrac{1}{2}, \\ 1 & \text{if } 0 < \theta < \tfrac{1}{2}, \\ -1 & \text{if } \tfrac{1}{2} < \theta < 1, \end{cases} \qquad \sigma_1(\theta) = \begin{cases} 0 & \text{if } \theta = \tfrac{1}{4} \text{ or } \tfrac{3}{4}, \\ 1 & \text{if } 0 \le \theta < \tfrac{1}{4} \text{ or } \tfrac{3}{4} < \theta < 1, \\ -1 & \text{if } \tfrac{1}{4} < \theta < \tfrac{3}{4}. \end{cases}$$
Notice that these maps are just the sign functions of the trigonometric functions, $\sigma_0(\theta) = \mathrm{sign}(\sin(2\pi\theta))$ and $\sigma_1(\theta) = \mathrm{sign}(\cos(2\pi\theta))$, with the customary assumption that the sign function vanishes at zero. If needed, we may extend them to the whole real line by periodicity.
Now, let us consider a truncated Fourier series with arbitrary frequencies m 1 , m 2 , n 1 , n 2 1 of the following form:
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}.$$
Analogously to the previous case, the gradient of Θ at a point
$$(\theta_1^0, \theta_2^0) = \left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right) \in T^2$$
of the form (II) is
$$\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 2\pi\mu\left((-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0),\; (-1)^{\delta}\, n_2\, \Lambda^{\gamma,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right).$$
Therefore, we again find a bifurcation of behaviour depending on whether $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$. If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, the Nash Hessian is given by
$$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & m_1 m_2 \\ -m_1 m_2 & 0 \end{pmatrix} + \mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)}$$
As above, the character of this matrix depends on the combinatorics of $(\alpha, \beta)$ and $(\gamma, \delta)$. Explicitly, we have that
$$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \begin{pmatrix} -n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} & \pm m_1 m_2 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} \\ \mp m_1 m_2 \mp \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} & n_2^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} \end{pmatrix}_{(\theta_1^0, \theta_2^0)}$$
When $|\mu|$ is small, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues $\lambda, \bar{\lambda} \in \mathbb{C}$. Since $\lambda + \bar{\lambda} = 2\,\mathrm{Re}(\lambda)$, the dynamics are ruled by the real part $\mathrm{Re}(\lambda)$, which is given by the following trace:
$$4\pi^2 \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0)\left(n_2^2 - n_1^2\right).$$
Its negativity (respectively positivity) can be controlled with the trigonometric sign functions as
$$\mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0).$$
Remark 6.
There are many cases in which this trace does not vanish. For instance, if $(\gamma, \delta) = (\alpha + 1, \beta + 1)$ in $\mathbb{Z}_2 \times \mathbb{Z}_2$, in general,
$$\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}\left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right) \neq 0.$$
To be precise, given $n \in \mathbb{N}$, let us denote by $\mathrm{par}(n)$ the unique integer such that $n = 2^{\mathrm{par}(n)} n'$ with $n'$ odd. In that case, we have that $\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}\left(\frac{2k_1+\alpha}{4m_1}, \frac{2k_2+\beta}{4m_2}\right) = 0$ for some $k_1, k_2 \in \mathbb{Z}$ if and only if $\mathrm{par}(m_1) = \mathrm{par}(n_1) + (-1)^{\alpha}$ or $\mathrm{par}(m_2) = \mathrm{par}(n_2) + (-1)^{\beta}$. It would be interesting to study the relation between this behavior and the small divisors phenomena observed in Kolmogorov–Arnold–Moser (KAM) theory [27].
The case with $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$ can be treated similarly, but now we must look at the Nash Hessian not exactly at $(\theta_1^0, \theta_2^0)$ but at a point nearby. Generalizing the argument of Section 4.3, set
$$A = (-1)^{\gamma}\, \mu\, n_1\, \sigma_{\gamma+1}(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2), \qquad B_1 = \mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2),$$
$$B_2 = (-1)^{k_1+k_2+\alpha+\beta}\, m_1 m_2 + (-1)^{\delta+\gamma}\, \mu\, n_1 n_2\, \sigma_{\gamma+1}(\theta_1^0 n_1)\, \sigma_{\delta+1}(\theta_2^0 n_2).$$
Then, the unique critical point ( θ ˜ 1 , θ ˜ 2 ) close to ( θ 1 0 , θ 2 0 ) has the following form:
$$(\tilde{\theta}_1, \tilde{\theta}_2) = \left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1,\; \theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right),$$
for small enough ϵ 1 , ϵ 2 > 0 . Therefore, the dynamic of the critical point ( θ ˜ 1 , θ ˜ 2 ) is determined by
$$\mu\, \sigma_\gamma\left(\left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1\right) n_1\right)\, \sigma_\delta\left(\left(\theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right) n_2\right)\left(n_2^2 - n_1^2\right).$$
This quantity controls the sign of the trace of the Nash Hessian, in analogy with the analysis of Section 4.3. Therefore, if this last quantity is negative, then $(\tilde{\theta}_1, \tilde{\theta}_2)$ is a spiral attractor, and, if it is positive, the point becomes a repulsor.
To illustrate the different bifurcation phenomena explained in this section, in Figure 2, we show the Nash flow of some truncated series of low frequencies. Finally, summarizing this discussion, we obtain the following result.
Theorem 1.
For μ small enough, the truncated Fourier series
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
has an attracting (respectively repulsive) spiral critical point at each of the points of the form (II),
$$\left(\theta_1^0, \theta_2^0\right) = \left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right),$$
for $k_1, k_2 \in \mathbb{Z}$, provided the following:
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, it must hold that
    $$\mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0).$$
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$, it must hold that
    $$\mu\, \sigma_\gamma\left(\left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1\right) n_1\right)\, \sigma_\delta\left(\left(\theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right) n_2\right)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0)$$
    for $\epsilon_1, \epsilon_2 > 0$ small enough.
Remark 7.
Even though half of the critical points near the points of the form (II) are attractors for the Nash flow of $\Theta$, the dynamic is a small perturbation of a center. In this manner, the convergence is slow, spiraling strongly towards the Nash equilibrium. This theoretically justifies the slow and badly conditioned convergence observed in GAN networks.

5. Empirical Analysis

In this section, we show empirically how these Fourier approximations can be useful for understanding the convergence in the training of GANs. For this purpose, in this section, we consider a simple model for a 2-parametric torus GAN (i.e., with d D = d G = 1 ) and we analyze its convergence by means of its truncated Fourier series.
In the notation of Section 3, we take $d = 1$ (1-dimensional real data) and the parameter spaces are $\Theta_D = \Theta_G = S^1$. The latent space is $\Lambda = [0,1] \subseteq \mathbb{R}$ with the uniform probability (standard Lebesgue measure). Fix a periodic function $\chi: S^1 \to \mathbb{R}$. Choose a 1-parametric continuous distribution $\mathcal{D}_\xi$ depending on the parameter $\xi \in \mathbb{R}$, with cumulative distribution function $F_\xi$ and probability density function $f_\xi$. Fix $\omega \in S^1$; the real data X is sampled according to the distribution $X \sim \mathcal{D}_{\chi(\omega)}$.
As the discriminator function, for $\theta_1 \in S^1$, we consider the function $D_{\theta_1}: \mathbb{R} \to \mathbb{R}$ given by
$$D_{\theta_1}(x) = \frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_{\chi(\theta_1)}(x)}.$$
On the other hand, for $\theta_2 \in S^1$, the generator is the function $G_{\theta_2}: \Lambda = [0,1] \to \mathbb{R}$ given by
$$G_{\theta_2}(\lambda) = F^{-1}_{\chi(\theta_2)}(\lambda),$$
where $F^{-1}_{\chi(\theta_2)}$ is the quantile function of $\mathcal{D}_{\chi(\theta_2)}$.
With these choices of generator and discriminator, and taking as weight function $f(t) = -\log(1 + \exp(-t))$, as in [1], the cost functional (1) is reduced to
$$\begin{aligned} \mathcal{F}(\theta_1, \theta_2) &= \mathbb{E}_{\Omega}\left[\log D_{\theta_1}(X)\right] + \mathbb{E}_{\Lambda}\left[\log\left(1 - D_{\theta_1}(G_{\theta_2})\right)\right] \\ &= \int_{\mathbb{R}} \log\left(\frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_{\chi(\theta_1)}(x)}\right) f_{\chi(\omega)}(x)\, dx + \int_0^1 \log\left(1 - \frac{f_{\chi(\omega)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right)}{f_{\chi(\omega)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right) + f_{\chi(\theta_1)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right)}\right) d\lambda. \end{aligned}$$
Remark 8.
These choices of shapes for the discriminator and generator functions are justified by [1] (Proposition 1). There, it is proven that, for a fixed generator G with transformed probability density function $f_G$, the optimal discriminator $D_{\theta_1^0}$ is given by
$$D_{\theta_1^0}(x) = \frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_G(x)}.$$
On the other hand, recall that, if $\Lambda = [0,1]$ with the uniform probability, then $F^{-1}_\xi: \Lambda = [0,1] \to \mathbb{R}$ is a random variable with distribution $\mathcal{D}_\xi$. Thus, in our case, $G_{\theta_2}$ is a random variable with distribution $\mathcal{D}_{\chi(\theta_2)}$ and, therefore, transformed density $f_{\chi(\theta_2)}$.
In this vein, the goal of the generator G given by (12) is to adjust $\theta_2$ to reach the value $\theta_2 = \omega$, for which G generates exactly the real data. On the other side, for a fixed parameter $\theta_2$ of G, the discriminator D given by (11) aims to tune $\theta_1$ to the value $\theta_1 = \theta_2$, for which D is the perfect discriminator (14).
For the purposes of these experiments, we fix the underlying distribution $\mathcal{D}_\xi$ to be the exponential distribution with mean $1/\xi$, and we take $\chi(\theta) = \sin^2(\pi\theta) + 1$. Recall that, in this situation, $f_\xi(x) = \xi e^{-\xi x}$ and $F_\xi(x) = 1 - e^{-\xi x}$. In this way, the discriminator function (11) and the generator (12) are given by
$$D_{\theta_1}(x) = \frac{e^{x\sin^2(\pi\theta_1)}}{\dfrac{\sin^2(\pi\theta_1)+1}{\sin^2(\pi\omega)+1}\, e^{x\sin^2(\pi\omega)} + e^{x\sin^2(\pi\theta_1)}}, \qquad G_{\theta_2}(\lambda) = \frac{1}{\sin^2(\pi\theta_2)+1}\log\!\left((1-\lambda)^{-1}\right).$$
Moreover, from now on, we fix $\omega = 1/4$, so that $\chi(\omega) = 3/2$. The resulting probability density and cumulative distribution functions of the real data are plotted in Figure 3.
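The following sketch implements this toy model with the choices fixed above; `cost_F` evaluates the functional (13) by numerical quadrature (here, scipy's Simpson rule over a truncated real line). The code and all names in it are ours, for illustration only.

```python
import numpy as np
from scipy.integrate import simpson

omega = 0.25
chi = lambda t: np.sin(np.pi * t) ** 2 + 1.0          # parameter curve on the circle
f = lambda xi, x: xi * np.exp(-xi * x)                # exponential pdf with rate xi
quantile = lambda xi, lam: -np.log(1.0 - lam) / xi    # F_xi^{-1}(lambda)

def D(theta1, x):
    """Discriminator (11): f_chi(omega) / (f_chi(omega) + f_chi(theta1))."""
    a, b = f(chi(omega), x), f(chi(theta1), x)
    return a / (a + b)

def G(theta2, lam):
    """Generator (12): quantile function of the distribution with rate chi(theta2)."""
    return quantile(chi(theta2), lam)

def cost_F(theta1, theta2, n=2001):
    """Cost functional (13), by Simpson quadrature."""
    x = np.linspace(1e-6, 15.0, n)          # the pdf decays fast: truncate the line
    i1 = simpson(np.log(D(theta1, x)) * f(chi(omega), x), x=x)
    lam = np.linspace(1e-6, 1.0 - 1e-6, n)  # latent space [0, 1]
    i2 = simpson(np.log(1.0 - D(theta1, G(theta2, lam))), x=lam)
    return i1 + i2

print(cost_F(0.25, 0.25))   # at a Nash equilibrium D = 1/2, so this is ~ -2 log 2
```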
With this choice of real distribution, the generator function as well as the transformed probability density function are plotted in Figure 4 and the discriminator function is shown in Figure 5.
In addition, in Figure 6, we show graphically the cost function $\mathcal{F}(\theta_1, \theta_2)$ of (13) on $T^2$. The numerical approximation of the integrals in (13) was carried out with the Simpson rule. The function was sampled at 225 knot points and subsequently interpolated by means of a multiquadric radial basis interpolation. Observe that one of the Nash equilibria of $\mathcal{F}$ is at $(\theta_1, \theta_2) = (1/4, 1/4)$ (bottom corner of the plot). Moreover, by the symmetries of $\chi$, the plot suggests that $(\theta_1, \theta_2) = (1/4, 3/4), (3/4, 1/4), (3/4, 3/4)$ are also Nash equilibria.
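Continuing the previous sketch, the surface of Figure 6 can be reproduced (under the same assumptions) by sampling `cost_F` on a $15 \times 15$ grid, i.e., 225 knots, and interpolating with a multiquadric radial basis function:

```python
import numpy as np
from scipy.interpolate import Rbf   # cost_F is the quadrature sketch above

grid = np.linspace(0.0, 1.0, 15)
T1, T2 = np.meshgrid(grid, grid)
Z = np.vectorize(cost_F)(T1, T2)                     # 225 evaluations of (13)
F_interp = Rbf(T1.ravel(), T2.ravel(), Z.ravel(), function="multiquadric")

dense = np.linspace(0.0, 1.0, 200)                   # dense grid for plotting
D1, D2 = np.meshgrid(dense, dense)
F_dense = F_interp(D1, D2)                           # interpolated landscape
```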
In Figure 7, we show the Nash flow associated with the cost function $\mathcal{F}: T^2 \to \mathbb{R}$. As can be checked in the image, the flow confirms that there exist four Nash equilibrium points, corresponding to $(\theta_1^0, \theta_2^0) = (1/4, 1/4), (1/4, 3/4), (3/4, 1/4)$, and $(3/4, 3/4)$, all of them attractors for the Nash flow. Another four critical points of $\mathcal{F}$ can be observed in the figure: the points $(0,0)$ and $(1/2, 1/2)$ correspond to the two maxima of $\mathcal{F}$, and the points $(0, 1/2)$ and $(1/2, 0)$ correspond to the two minima. Observe that these critical points are saddle points for the flow, with an attractive direction and a repulsive direction. Finally, notice that (4) is satisfied, since the maxima and minima have even indices (2 and 0, respectively) and the Nash equilibria have odd indices.
Now, let us decompose $\mathcal{F}$ according to its Fourier series. In Table 1, we show the modes with the largest absolute Fourier coefficients. These coefficients have been computed using the formulae of Section 4, applying rectangular quadrature as the numerical integration method and scanning the modes with $1 \leq m_1, m_2 \leq 10$.
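A minimal sketch of this computation (our code, assuming the $L^2$ normalization of the basis, i.e., a factor of 4 for $m_1, m_2 \geq 1$, and the convention $\tau_0 = \sin$, $\tau_1 = \cos$ consistent with the captions of Figure 1):

```python
import numpy as np

def fourier_coefficient(F_vals, m1, m2, alpha, beta):
    """Rectangular-quadrature estimate of a_{m1,m2}^{alpha,beta}.
    F_vals[i, j] = F(t_i, t_j) on an N x N midpoint grid of [0, 1)^2."""
    N = F_vals.shape[0]
    t = (np.arange(N) + 0.5) / N
    tau = lambda e, m, x: np.cos(2*np.pi*m*x) if e == 1 else np.sin(2*np.pi*m*x)
    basis = np.outer(tau(alpha, m1, t), tau(beta, m2, t))
    return 4.0 * np.mean(F_vals * basis)   # rectangle rule for the double integral

# scanning the range used for Table 1: 1 <= m1, m2 <= 10, cosine-cosine modes
# coeffs = {(m1, m2): fourier_coefficient(F_vals, m1, m2, 1, 1)
#           for m1 in range(1, 11) for m2 in range(1, 11)}
```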
From these results, we observe that the predominant Fourier modes of $\mathcal{F}$ are cosine basis functions, $\Lambda_{m_1, m_2}^{1,1}(\theta_1, \theta_2) = \cos(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2)$. The largest coefficient corresponds to the mode $(m_1, m_2) = (1,1)$. Observe that this is not surprising: $(m_1, m_2) = (1,1)$ is the unique mode with four critical points of type (II), which correspond to the four Nash equilibria of Figure 7 (in other words, the four saddle points of Figure 6).
For $s \geq 0$, let us order the first $s$ Fourier modes decreasingly according to the absolute value of their coefficients, $(m_1^0, m_2^0) = (1,1), (m_1^1, m_2^1), \ldots, (m_1^s, m_2^s)$. Denote by $b_{m_1^i, m_2^i}^{1,1} = a_{m_1^i, m_2^i}^{1,1} / a_{m_1^0, m_2^0}^{1,1}$ the ratio of the Fourier coefficients. We can approximate the Nash flow of the cost function $\mathcal{F}$ by the truncated Fourier series:
$$\Theta_s(\theta_1, \theta_2) = \Lambda_{m_1^0, m_2^0}^{1,1}(\theta_1, \theta_2) + \sum_{i=1}^{s} b_{m_1^i, m_2^i}^{1,1}\, \Lambda_{m_1^i, m_2^i}^{1,1}(\theta_1, \theta_2).$$
The associated Nash flow is depicted in Figure 8. As can be checked there, the critical points near the points of type (II) are (approximately) centers for $s \leq 3$. The reason for this behavior is twofold. In the following, let $(\theta_1^0, \theta_2^0) = (1/4, 1/4), (1/4, 3/4), (3/4, 1/4)$, or $(3/4, 3/4)$.
  • For $s \leq 2$, we have that $\nabla\Theta_s|_{(\theta_1^0, \theta_2^0)} = 0$ since, in the gradient, there is always a term with a factor $\cos(2\pi\theta)$ that vanishes at these points. Hence, the critical point of $\Theta_s$ is exactly at $(\theta_1^0, \theta_2^0)$. Nevertheless, since all the terms $\Lambda_{m_1, m_2}^{\alpha, \beta}$ appearing in the Fourier series have equal $(\alpha, \beta) = (1,1)$, as mentioned in Section 4.3, we still have that the Nash Hessian has the form in (8) with vanishing diagonal entries. Hence, the critical point $(\theta_1^0, \theta_2^0)$ is still a center.
  • For $s = 3$, we find that $\nabla\Theta_3|_{(\theta_1^0, \theta_2^0)} \neq 0$, so a new critical point $(\tilde\theta_1, \tilde\theta_2)$ appears near $(\theta_1^0, \theta_2^0)$. Nevertheless, for this new mode, we have that $m_1^3 = m_2^3 = 2$, so Equation (10) still vanishes, proving that the new critical point is still a center.
Finally, let us consider the case $s = 4$. In this situation, we also have $\nabla\Theta_4|_{(\theta_1^0, \theta_2^0)} \neq 0$, so a new critical point $(\tilde\theta_1, \tilde\theta_2)$ appears near $(\theta_1^0, \theta_2^0)$. The dynamic around it is governed by Equation (10). To evaluate it, we calculate the signs of the quantities $A$, $B_1$, and $B_2$ of Section 4.4, and we get
$$A > 0, \qquad B_1 < 0, \qquad B_2 < 0.$$
Hence, the new critical point has the form $(\tilde\theta_1, \tilde\theta_2) = (\theta_1^0 - \epsilon_1, \theta_2^0 - \epsilon_2)$ for small $\epsilon_1, \epsilon_2 > 0$. For these values, we have that
$$\sigma_1\big(2(\theta_1^0 - \epsilon_1)\big) = 1, \qquad \sigma_\delta\big(3(\theta_2^0 - \epsilon_2)\big) = 1.$$
Therefore, checking Equation (10), we get
$$\mu\, \sigma_1\big(n_1(\theta_1^0 - \epsilon_1)\big)\, \sigma_\delta\big(n_2(\theta_2^0 - \epsilon_2)\big) \cdot \big(n_2^2 - n_1^2\big) = -0.003 \cdot (1) \cdot (1) \cdot (3^2 - 2^2) < 0.$$
Therefore, for $s = 4$, the trend changes and the centers turn into spiral attractors. This is the attractive behavior observed in Figure 8e. Notice that this dynamic agrees with the real one observed in Figure 7, which empirically confirms the validity of our approach.
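This case analysis can be cross-checked numerically. The sketch below rebuilds $\Theta_s$ from the mode ratios of Table 1, locates the critical point of its Nash flow near $(1/4, 1/4)$, and inspects the eigenvalues of the linearization; the ascent/descent sign convention is again our assumption. The real parts should be negligible against the rotation frequency for $s \leq 3$ (center-like dynamics) and clearly nonzero for $s = 4$ (a spiral attractor).

```python
import numpy as np
from scipy.optimize import root

# (m1, m2, ratio) of the five leading cosine-cosine modes of Table 1
modes = [(1, 1, 1.0), (1, 2, 0.18), (2, 1, -0.0822), (2, 2, -0.0660), (2, 3, -0.0532)]

def vector_field(y, s):
    """Nash flow (ascent in theta1, descent in theta2) of Theta_s."""
    t1, t2 = y
    d1 = sum(-b*2*np.pi*m1*np.sin(2*np.pi*m1*t1)*np.cos(2*np.pi*m2*t2)
             for m1, m2, b in modes[:s+1])
    d2 = sum(-b*2*np.pi*m2*np.cos(2*np.pi*m1*t1)*np.sin(2*np.pi*m2*t2)
             for m1, m2, b in modes[:s+1])
    return np.array([d1, -d2])

def jacobian(y, s, h=1e-7):
    """Forward-difference Jacobian of the Nash vector field."""
    f0 = vector_field(y, s)
    cols = [(vector_field(y + h*np.eye(2)[j], s) - f0) / h for j in range(2)]
    return np.column_stack(cols)

for s in (2, 3, 4):
    crit = root(vector_field, x0=[0.25, 0.25], args=(s,)).x
    eig = np.linalg.eigvals(jacobian(crit, s))
    print(f"s = {s}: critical point near {np.round(crit, 3)}, eigenvalues {np.round(eig, 3)}")
```

Note that the overall scale of the ratios does not affect the sign of the real parts, so working with the ratios $b$ instead of the raw coefficients $a$ leaves the attractor/repulsor classification unchanged.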

6. Methodology for Practical Applications

The discussion of Section 4 and Section 5 opens the door to practical applications of the analysis techniques introduced in this paper for studying the convergence of real-world GANs. Observe that, in general, the knowledge of the underlying cost function $\mathcal{F}$ (cf. Equation (1)) of a GAN is very limited. Indeed, several metrics have been proposed in the literature to monitor the evolution of the training of a GAN. These metrics provide a way to measure the convergence of the GAN indirectly but definitely skip a thorough analysis of the cost function. Nevertheless, using the techniques introduced in this paper, we show that it is possible to methodically analyze the dynamics of the Nash flow of the GAN problem through partial sums of the Fourier series of the cost function. It is remarkable that this valuable information about the behaviour of the training process cannot be read off directly from $\mathcal{F}$ itself.
In this section, we aim to organize the previous analysis into a precise methodology that can be applied in practice. As will become clear, this process was already implicit in the reasoning of Section 5. The proposed analysis comprises the following steps (a schematic skeleton of the whole loop is sketched after the list):
  1. Evaluate the cost function $\mathcal{F}(\theta_D, \theta_G)$ on a uniform grid of the parameters $(\theta_D, \theta_G)$ (the weights of the two neural networks forming the GAN in the deep learning framework). Observe that, for these evaluations, it is not necessary to train the GAN: the sampling process amounts to fixing the weights of the networks and computing the mean prediction error of the discriminator against real and synthetic instances. No optimization of the weights needs to be carried out.
  2. Compute the Discrete Fourier Transform (DFT) of $\mathcal{F}$ from the obtained samples. This can be done efficiently through the Fast Fourier Transform (FFT) algorithm.
  3. Use the results of the DFT to estimate the Fourier modes and coefficients of $\mathcal{F}$. Sort the modes decreasingly according to the absolute value of their associated Fourier coefficients.
  4. Consider a truncation level $s \geq 0$ (starting with $s = 0$). Compute the critical points of $\Theta_s$, the truncated Fourier series of $\mathcal{F}$ with $s$ terms. Using the techniques developed in Section 4 (see also Section 5), analyze the local dynamics of the Nash flow around the critical points of $\Theta_s$.
  5. While some critical point of $\Theta_s$ is a center, increase the truncation level by one and repeat steps 4 and 5, until a truncation level $s_0$ is reached such that all the critical points of $\Theta_{s_0}$ are either attractors or repulsors.
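The following skeleton organizes these steps as code. Every helper here (`sample_cost_grid`, `estimate_coefficients`, `critical_points`, `nash_eigenvalues`) is a hypothetical placeholder for the corresponding step above, not an API from this work:

```python
def analyze_gan(sample_cost_grid, estimate_coefficients,
                critical_points, nash_eigenvalues, max_terms=20, tol=1e-3):
    F_samples = sample_cost_grid()                      # step 1: no training needed
    coeffs = estimate_coefficients(F_samples)           # step 2: DFT/FFT of the samples
    ranked = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))   # step 3
    for s in range(max_terms):                          # steps 4 and 5
        theta_s = ranked[: s + 1]                       # modes of the truncation Theta_s
        spectra = [nash_eigenvalues(theta_s, cp) for cp in critical_points(theta_s)]
        if all(max(abs(e.real) for e in eig) > tol for eig in spectra):
            return s, theta_s                           # s_0 reached: no center survives
    raise RuntimeError("no truncation level with only attractors/repulsors found")
```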
After this process, we obtain a truncation level $s_0$ such that the local dynamics of $\Theta_{s_0}$ around its critical points are conjugated to the local dynamics of $\mathcal{F}$ around its Nash equilibria. This information can be exploited to analyze the training process of the GAN. For instance, if the convergence to the critical point is very slow, in the sense that the trace of the Nash Hessian is close to zero, then a hard convergence of the training process should be expected. This leads to remarkable instabilities during the learning process that may prevent the system from converging under a raw gradient descent optimization procedure. In that case, the obtained results strongly suggest that heuristics for stabilizing the training process should be implemented. Additionally, since the equilibria are spiral attractors, if the learning rate of the gradient descent method is not small enough, the discrete-time approximation may not converge. In that case, the information about the convergence rate in the simplified Fourier model can be used to properly anneal the learning rate, leading to a much more stable convergence.
Despite its utility, the proposed methodology suffers from several issues that must be addressed in future work to obtain an efficient analysis procedure. The first is an obvious bottleneck: the sampling process of the cost function over the parameters $(\theta_D, \theta_G)$ may require a huge number of samples due to the curse of dimensionality. Nevertheless, it is important to mention that a very dense grid is not necessary, since we want to understand the Fourier modes of the cost function $\mathcal{F}$, not to obtain a detailed picture of the landscape of $\mathcal{F}$. This largely alleviates the sampling burden and makes the process feasible.
Another possible solution is to sample not on the whole $(\theta_D, \theta_G)$-space but on a lower-dimensional subspace concentrating the flow. For that purpose, the GAN can be trained and, after some epochs, the flow will have entered a certain “convergence subspace” that encloses the long-time evolution of the flow. This subspace can be estimated by several methods, for instance, by considering the subspace generated by the last $k \geq 1$ gradient vectors obtained in the training process. In that case, instead of working in the high-dimensional $(\theta_D, \theta_G)$-space, we can restrict our analysis to the $k$-dimensional affine space generated by these vectors, a much smaller subspace on which the sampling process can be carried out. Nevertheless, proposing other efficient sampling methods that enable accurate approximations of the Fourier series of $\mathcal{F}$ is an interesting topic for future work.
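A sketch of this reduction, under our assumptions: an orthonormal basis of the subspace spanned by the last $k$ gradients is obtained via a reduced QR factorization, and the cost is then sampled in the reduced coordinates (`evaluate_cost` is a hypothetical sampler of $\mathcal{F}$):

```python
import numpy as np

def convergence_subspace(gradients):
    """Orthonormal basis of the span of the last k gradient vectors."""
    G = np.stack(gradients, axis=1)   # (dim, k) matrix, one gradient per column
    Q, _ = np.linalg.qr(G)            # reduced QR: Q has k orthonormal columns
    return Q

def lift(theta_now, Q, coords):
    """Map k-dimensional coordinates back to the full (theta_D, theta_G)-space."""
    return theta_now + Q @ coords

# sampling F on a 2D slice of the convergence subspace, e.g.:
# F_samples[i, j] = evaluate_cost(lift(theta_now, Q, coords_grid[i, j]))
```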
Another important remark is that estimating the Fourier series through the FFT is much more efficient than the quadrature methods used in Section 5. However, it may also lead to poorer estimations of the Fourier coefficients. This inaccuracy may produce errors when choosing the leading Fourier modes if their importances (the absolute values of their Fourier coefficients) are similar. To avoid these problems, all the possible permutations of such similar modes (say, modes whose coefficients differ by less than a fixed threshold) should be considered during the analysis of the Nash flow of the Fourier series.
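To make the FFT route concrete: for a real signal sampled on an $N \times N$ uniform grid, the coefficient of $\cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)$ combines the DFT bins $(m_1, m_2)$ and $(m_1, -m_2)$; the sine-type coefficients follow analogously from the imaginary parts. A minimal sketch under our conventions (it assumes $N$ larger than twice the maximal mode):

```python
import numpy as np

def cos_cos_coefficients(F_samples, max_mode=10):
    """Coefficients of cos(2 pi m1 t1) cos(2 pi m2 t2) from a real N x N grid,
    with F_samples[i, j] = F(i/N, j/N)."""
    N = F_samples.shape[0]
    X = np.fft.fft2(F_samples) / N**2          # normalized 2D DFT
    return {(m1, m2): 2.0 * (X[m1, m2].real + X[m1, -m2].real)
            for m1 in range(1, max_mode + 1) for m2 in range(1, max_mode + 1)}

# sanity check: a pure mode is recovered exactly
t = np.arange(64) / 64
T1, T2 = np.meshgrid(t, t, indexing="ij")
F_test = 0.7 * np.cos(2*np.pi*1*T1) * np.cos(2*np.pi*2*T2)
print(cos_cos_coefficients(F_test)[(1, 2)])    # ~ 0.7
```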

7. Conclusions

In this paper, we studied a novel approach for analyzing in depth the convergence of GANs on tori. This is an outstanding open problem in machine learning and deep learning that prevents GANs from being suitable for use in arbitrary domains, such as feature generation outside the world of image processing.
Specifically, we proposed to decompose the cost function of a GAN into its Fourier modes and to study the dynamics around the Nash equilibria through its truncated Fourier approximation. For that purpose, we performed a thorough analysis of the dynamics of trigonometric series with one and two terms. Roughly speaking, this analysis showed that, if we truncate the Fourier series at its first mode, all the critical points are centers surrounded by periodic orbits. When we add subtler Fourier modes to the approximation, this dynamic may be preserved or may bifurcate to give rise to spiral attractors or repulsors. This dynamic is essentially determined by the trace of the Nash Hessian of the cost function. Hence, following this idea, we exhibited explicit bifurcation conditions for the Nash flow of the truncated Fourier approximations. These conditions have an involved shape, taking into account the monotonicity of the trigonometric functions in a neighborhood of the critical point, but eventually they are very explicit and can be easily checked. As a by-product of this analysis, we observed that, even though the Nash equilibria are stable points, as proven in [4], the dynamic of the training process is close to a center, and the convergence is slow and spiraling.
To test this idea, we conducted an experimental analysis with a torus GAN toy model. Through this example, we observed that the number and distribution of the critical points are determined by the first Fourier mode. Nevertheless, it was necessary to reach the fourth Fourier term to discover the attractive dynamics predicted in the GAN literature. Comparing the approximated flow with the real flow, we observed that the approximation is able to replicate not only the local but also the global dynamics of the real GAN.
We expect that this work will be useful for quantifying the complexity and convergence properties of GANs. To show how this theoretical analysis can be put into practice, in Section 6 we proposed a methodology that enables a characterization of the training dynamics of real-world GANs by means of the techniques developed in this work. From the obtained information about the convergence of the learning process of the networks, several improvements for stabilizing the training can be implemented, such as a progressive reduction of the learning rate to adapt to the geometry of the spiral flow.
It is worth mentioning that the results presented in this paper apply not only to torus toy models but also to more realistic networks. It may seem at first sight that standard GANs do not fulfil the periodicity requirement to be defined on a torus. However, in many cases, the outputs of the generator and discriminator networks are clipped for large enough inputs. This fix is crucial to maintain several required analytic properties, such as the Lipschitz condition for Wasserstein GANs [14]. After this clipping, the GAN actually turns into a torus GAN, since the generator and discriminator functions are periodic (with a large period). In this manner, most of the regular GANs used in image generation and feature generation fit in the framework introduced in this paper. This is crucial, since dynamics on a closed manifold are deeply related to the underlying topology, for instance, through the Poincaré–Hopf theorem or deeper Morse-like results.
Nevertheless, much work must be done before this project can be turned into reality. First, in order to compute the Fourier series of the cost function, we had to sample the cost function of the GAN on a dense mesh of weights. Using this sampling, we were able to estimate the Fourier coefficients through standard quadrature techniques, such as the Simpson rule. In shallow networks with few neurons, a similar approach can be applied, but for deeper networks, this dense sampling is unfeasible. For this reason, better methods for estimating the Fourier coefficients of the cost function are needed, maybe by exploiting the analytic and harmonic properties of the trigonometric functions. In addition, to illustrate the method, in this paper we carried out all the calculations on a 2-dimensional torus. The computation in higher-dimensional tori may follow similar lines, but a thorough analysis of the bifurcation conditions in the higher-dimensional setting is definitely not obvious.
Summarizing, in this paper we introduced a novel method for understanding the dynamics of GANs through harmonic analysis. We showed that, despite the Nash equilibria of the GAN being stable, the convergence is a perturbation of a center and, thus, slow and intricate. The method allowed us to identify a simplified model of the dynamics that may be useful for tuning several hyperparameters of the GAN, such as the learning rate or the number of epochs to be trained. We expect that this work will open the door to new methods for studying the dynamics of GANs using harmonic analysis and transcendental methods.

Author Contributions

Conceptualization, Á.G.-P.; methodology, Á.G.-P. and A.M.; software, E.T. and S.G.-C.; validation, E.T. and S.G.-C.; formal analysis, Á.G.-P.; writing—original draft preparation, Á.G.-P.; writing—review and editing, A.M., E.T., and S.G.-C.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the European Union’s Horizon 2020 Research and Innovation Programme under grant 833685 (SPIDER).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680.
  2. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196.
  3. Nagarajan, V.; Kolter, J.Z. Gradient descent GAN optimization is locally stable. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5585–5595.
  4. Mescheder, L.M.; Geiger, A.; Nowozin, S. Which Training Methods for GANs do actually Converge? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; pp. 3478–3487.
  5. Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv 2016, arXiv:1701.00160.
  6. Kusner, M.J.; Hernández-Lobato, J.M. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv 2016, arXiv:1611.04051.
  7. Diesendruck, M.; Elenberg, E.R.; Sen, R.; Cole, G.W.; Shakkottai, S.; Williamson, S.A. Importance weighted generative networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin, Germany, 2019; pp. 249–265.
  8. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340.
  9. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
  10. Arora, S.; Ge, R.; Liang, Y.; Ma, T.; Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). arXiv 2017, arXiv:1703.00573.
  11. Arora, S.; Risteski, A.; Zhang, Y. Do GANs learn the distribution? Some theory and empirics. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  12. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
  13. Roth, K.; Lucchi, A.; Nowozin, S.; Hofmann, T. Stabilizing training of generative adversarial networks through regularization. arXiv 2017, arXiv:1705.09367v2.
  14. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
  15. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv 2016, arXiv:1606.00709.
  16. Wang, C.; Xu, C.; Yao, X.; Tao, D. Evolutionary generative adversarial networks. IEEE Trans. Evol. Comput. 2019, 23, 921–934.
  17. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv 2017, arXiv:1706.08500.
  18. Snell, J.; Ridgeway, K.; Liao, R.; Roads, B.D.; Mozer, M.C.; Zemel, R.S. Learning to generate images with perceptual similarity metrics. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4277–4281.
  19. Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65.
  20. Milnor, J. Lectures on the H-Cobordism Theorem; Princeton University Press: Princeton, NJ, USA, 2015; Volume 2258.
  21. Atiyah, M.F.; Bott, R. The Yang–Mills equations over Riemann surfaces. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Sci. 1983, 308, 523–615.
  22. Rudin, W. Real and Complex Analysis; Tata McGraw-Hill Education: New York, NY, USA, 2006.
  23. Du Bois-Reymond, P. Ueber die Fourierschen Reihen. Nachrichten von der Königl. Gesellschaft der Wissenschaften und der Georg-Augusts-Universität zu Göttingen 1873, 1873, 571–584.
  24. Kolmogorov, A. Une série de Fourier–Lebesgue divergente partout. CR Acad. Sci. Paris 1926, 183, 1327–1328.
  25. Zygmund, A. Trigonometric Series; Cambridge University Press: Cambridge, UK, 2002; Volume 1.
  26. Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296.
  27. Arnol'd, V.I. Mathematical Methods of Classical Mechanics; Springer Science & Business Media: Berlin, Germany, 2013; Volume 60.
Figure 1. Nash flow dynamics of Fourier basis functions: (a) $\Lambda_{1,1}^{0,1} = \sin(2\pi\theta_1)\cos(2\pi\theta_2)$, (b) $\Lambda_{1,2}^{0,0} = \sin(2\pi\theta_1)\sin(4\pi\theta_2)$, and (c) $\Lambda_{2,3}^{1,1} = \cos(4\pi\theta_1)\cos(6\pi\theta_2)$.
Figure 2. Nash flow dynamics of truncated Fourier series: cases (a–d) show breaking of the periodic orbits into spiral flow, and cases (e,f) preserve the periodic orbits. (a) $\Theta = \Lambda_{1,1}^{0,0} + 0.03\,\Lambda_{3,5}^{1,1}$. (b) $\Theta = \Lambda_{1,1}^{0,1} + 0.02\,\Lambda_{3,5}^{1,0}$. (c) $\Theta = \Lambda_{1,2}^{0,0} + 0.1\,\Lambda_{2,3}^{1,1}$. (d) $\Theta = \Lambda_{2,2}^{0,0} + 0.1\,\Lambda_{3,5}^{1,1}$. (e) $\Theta = \Lambda_{2,2}^{0,0} + 0.02\,\Lambda_{4,4}^{1,1}$. (f) $\Theta = \Lambda_{1,2}^{0,0} + 0.1\,\Lambda_{3,5}^{0,0}$.
Figure 3. Distribution of the real data: (a) probability density function and (b) cumulative distribution function.
Figure 4. Generator functions for $0 \leq \theta_2 \leq 1/2$: the warmer the plot, the bigger the value of $\theta_2$. The dashed line corresponds to the real data. (a) Output of the function. (b) Transformed probability density function.
Figure 5. Discriminator functions for $0 \leq \theta_1 \leq 1/2$: the warmer the plot, the larger the value of $\theta_1$. For a fixed generator parameter $\theta_2$, the optimal value for $\theta_1$ corresponds to the line with $\theta_1 = \theta_2$.
Figure 6. Graphical representation of the landscape of the cost function $\mathcal{F}(\theta_1, \theta_2): T^2 \to \mathbb{R}$. (a) Plot of the function $\mathcal{F}(\theta_1, \theta_2)$. The four saddle points lie near each of the four corners of the frame. (b) Contour plot of $\mathcal{F}(\theta_1, \theta_2)$.
Figure 7. Dynamics of the Nash flow for the torus GAN: four attractive Nash equilibria can be observed.
Figure 8. Nash flow dynamics of truncated Fourier series approximations for the cost function of the torus GAN: (a) approximation $\Theta_0$, (b) approximation $\Theta_1$, (c) approximation $\Theta_2$, (d) approximation $\Theta_3$, and (e) approximation $\Theta_4$.
Table 1. Fourier modes of the cost function for the torus GAN. The ten modes with the largest absolute value of their associated coefficient are shown. The last column shows the ratio between each Fourier coefficient and the largest coefficient.
m1   m2   α   β   a_{m1,m2}^{α,β}   Ratio
1     1   1   1    0.06127           1.0000
1     2   1   1    0.01102           0.1800
2     1   1   1   −0.00503          −0.0822
2     2   1   1   −0.00404          −0.0660
2     3   1   1   −0.00325          −0.0532
2     4   1   1   −0.00308          −0.0504
2     5   1   1   −0.00305          −0.0499
2     7   1   1   −0.00304          −0.0497
2     9   1   1   −0.00304          −0.0496
2    10   1   1   −0.00304          −0.0496
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
