
Dynamics of Fourier Modes in Torus Generative Adversarial Networks

by Ángel González-Prieto 1,*,†, Alberto Mozo 2,†, Edgar Talavera 2,† and Sandra Gómez-Canaval 2,†
1 Departamento de Matemáticas, Facultad de Ciencias, Universidad Autónoma de Madrid, 28049 Madrid, Spain
2 Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, 28031 Madrid, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2021, 9(4), 325; https://doi.org/10.3390/math9040325
Submission received: 15 December 2020 / Revised: 29 January 2021 / Accepted: 1 February 2021 / Published: 6 February 2021
(This article belongs to the Special Issue Bioinspired Computation: Recent Advances in Theory and Applications)

Abstract: Generative Adversarial Networks (GANs) are powerful machine learning models capable of generating fully synthetic samples of a desired phenomenon with a high resolution. Despite their success, the training process of a GAN is highly unstable, and typically it is necessary to implement several accessory heuristics to the networks to reach acceptable convergence of the model. In this paper, we introduce a novel method to analyze the convergence and stability of the training of generative adversarial networks. For this purpose, we propose to decompose the objective function of the adversarial min–max game defining a periodic GAN into its Fourier series. By studying the dynamics of the truncated Fourier series for the continuous alternating gradient descent algorithm, we are able to approximate the real flow and to identify the main features of the convergence of the GAN. This approach is confirmed empirically by studying the training flow in a 2-parametric GAN, aiming to generate an unknown exponential distribution. As a by-product, we show that convergent orbits in GANs are small perturbations of periodic orbits, so that the Nash equilibria are spiral attractors. This theoretically justifies the slow and unstable training observed in GANs.

1. Introduction

Since their very inception, Generative Adversarial Networks (GANs) have revolutionized the areas of machine learning and deep learning. They address very successfully one of the most outstanding problems in pattern recognition: given a collection of examples of a certain phenomenon that we want to replicate, construct a generative model able to create new, completely synthetic instances following the same patterns as the original ones. Ideally, the goal would be to capture the underlying pattern so subtly that no external critic would be able to distinguish between real samples and synthesized instances.
The proposal of Goodfellow et al. [1] is to confront two neural networks in an adversarial game to solve this problem. More precisely, they propose to consider a neural network G playing the role of a generator agent and a network D acting as the discriminator. The discriminator D is trained to distinguish as accurately as possible between real samples and fake/synthetic samples. On the other hand, G aims to generate synthetic instances of high quality in such a way that D is barely able to distinguish them from real data. The two networks are, thus, in effective competition. When, as a by-product of this competition, the agents reach an optimal point, we obtain a generator able to generate almost indistinguishable synthetic samples as well as a discriminator very proficient in classifying real and fake instances.
The way in which these networks are trained to reach this optimal point is through a common objective function. Explicitly, in [1], it is proposed to consider the following function:
$$\mathcal{F}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[\log D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[\log\left(1 - D_{\theta_D}(G_{\theta_G})\right)\right],$$
where $\theta_D$ are the inner weights of D, $\theta_G$ are the weights of G, $\Omega$ is the probability space of the real data, and $\Lambda$ is the latent probability space from which G samples the noise to be transformed into synthetic instances. In this manner, $\mathcal{F}$ essentially measures the accuracy of D in the classification problem between real and fake examples, so D tries to maximize it and G tries to minimize it. Hence, it gives rise to a non-convex min–max game, and the goal of the training process is to reach a Nash equilibrium.
Several training approaches have been proposed to reach these Nash equilibria, but the most widely used method is the so-called Alternating Gradient Descent (AGD). Roughly speaking, the idea is to alternately train D by tuning $\theta_D$ with cost function $\mathcal{F}$ and weights $\theta_G$ fixed and, after a certain number of epochs, to reverse the roles and update $\theta_G$ with cost function $\mathcal{F}$ and weights $\theta_D$ fixed. This optimization procedure has led to astonishing results, particularly in the domain of image processing and generation. Using several architectures and sophisticated multi-level training, GANs are able to generate images with such a high quality that a human eye is not capable of distinguishing them from real images [2].
Despite these achievements, the stability of the AGD algorithm for GANs is a major issue. In [3], the authors proved that the Nash equilibria for GANs are locally stable provided that some ideal conditions on the optimality of the equilibria are fulfilled. Nevertheless, these conditions may be unfeasible, as shown in [4], so actual convergence and stability are not guaranteed in real applications. In particular, one of the most challenging problems arising during the training of GANs is the so-called mode collapse [5]. This state is characterized by a generator that has degenerated into a network that is only able to generate a single synthetic sample (or a very small number of them) with almost no variation and such that the discriminator confuses it with a real sample (typically, because the synthetic sample is actually very close to a real one). In this state, the system is no longer a generative model but simply a copier of real data.
Furthermore, by construction, neural network-based GANs have some intrinsic constraints on their expressivity that lead to very unrealistic synthetic samples in contexts far from image generation. For instance, neural networks produce a smooth output function, which causes GANs to have lots of difficulties in dealing with the generation of real samples drawn from a distribution with a non-smooth density (e.g., an exponential distribution, whose density is discontinuous at the origin) [6] or with some drastic semantic restrictions (e.g., nonnegative values for counters) [7]. These scenarios do not typically appear in image generation but are common in other domains such as data augmentation for machine learning [8]. These problems lead to additional inconveniences for stable convergence and usually give rise to highly unstable models that require very handcrafted stopping criteria and optimization heuristics.
A multitude of works have been oriented towards a deeper understanding of the instability of the training of GANs, as well as towards proposing solutions. A thorough theoretical study of the sources of instability and their causes can be found in [9], and in [10,11], the authors analyzed the real capability of a GAN to learn the distribution through both a theoretical and an empirical approach. In addition, in order to mitigate the instability of the training, in [12], the authors proposed a collection of heuristic methods through variations of the standard backpropagation algorithm that contribute to stabilizing the training process of GANs. Moreover, in [13], the use of regularization procedures was proposed to speed up the convergence.
Another very active research line is the proposal of alternative models for GANs that guarantee better convergence. It is well known that the key reason why GANs should capture the original distribution is that they implicitly optimize the Jensen–Shannon divergence (JSD) between the real underlying distribution and the generated distribution of the synthetic data [1]. In order to change this framework, in [14], the authors proposed to modify the cost function in such a way that the new GAN did not optimize the JSD but an Earth-mover distance known as the Wasserstein distance, giving rise to the celebrated Wasserstein Generative Adversarial Networks (WGANs). In a similar vein, in [15], it was proposed to use the f-divergence (a divergence in the spirit of the Kullback–Leibler divergence) as the criterion for training GANs. Even genetic algorithms have been used to stabilize the training process, as in [16], where the authors applied genetic programming to optimize the use of different adversarial training objectives and evolved a population of generators to adapt to the discriminator, which acts as the hostile environment driving evolution. Nevertheless, despite all these efforts, no master method is currently available, and hence, assuring a fast, or even effective, convergence of GANs is an open problem.
Our contribution. In this paper, we propose a novel method to analyze the convergence of GANs through Fourier analysis. Concretely, we propose to approximate the objective function F by its Fourier series, truncated with enough precision that the local dynamics of F can be understood by means of a trigonometric polynomial.
Recall that any function $\mathcal{F}(\theta): T^n \to \mathbb{C}$ defined on the n-dimensional torus $T^n = (S^1)^n$ (equivalently, a $\mathbb{Z}^n$-periodic function on $\mathbb{R}^n$) can be decomposed into a series of complex exponential functions, known as its Fourier series:
$$\mathcal{F}(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e^{2\pi i\, m \cdot \theta},$$
where the series is indexed by the so-called Fourier modes or frequencies $m \in \mathbb{Z}^n \subset \mathbb{R}^n$. In principle, the previous equality must be understood as a decomposition in the Hilbert space of square-integrable functions, $L^2(T^n)$. However, if $\mathcal{F}$ has enough regularity, then the Fourier series on the right-hand side also converges uniformly to the original function $\mathcal{F}$. This implies that, taking enough Fourier modes, $\mathcal{F}$ can be effectively approximated by a truncated Fourier series. Moreover, if $\mathcal{F}$ is real-valued, expressing the complex exponentials as combinations of sine and cosine functions, we obtain an approximation of $\mathcal{F}$ by a trigonometric polynomial, $\Theta(\mathcal{F})$.
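For illustration, this truncation can be checked numerically; the following minimal Python sketch (an illustration we add here, with an arbitrary smooth test function, not the GAN cost) estimates the Fourier coefficients of a 1-periodic function with the FFT and measures the sup-norm error of the truncated series:

```python
# Minimal sketch: truncated Fourier approximation of a smooth 1-periodic
# function. The test function F is an arbitrary choice for illustration.
import numpy as np

def fourier_coefficients(F, N, samples=512):
    """Estimate alpha_m = int_0^1 F(t) exp(-2*pi*i*m*t) dt for |m| <= N."""
    t = np.arange(samples) / samples
    # The DFT divided by the sample count is a rectangular-rule
    # approximation of the Fourier integral.
    alpha = np.fft.fft(F(t)) / samples
    modes = np.arange(-N, N + 1)
    return modes, alpha[modes]              # negative indices wrap around

def truncated_series(modes, alpha, t):
    return sum(a * np.exp(2j * np.pi * m * t) for m, a in zip(modes, alpha))

F = lambda t: np.exp(np.sin(2 * np.pi * t))     # smooth and 1-periodic
t = np.linspace(0.0, 1.0, 1000)
for N in (1, 3, 5):
    modes, alpha = fourier_coefficients(F, N)
    err = np.max(np.abs(F(t) - truncated_series(modes, alpha, t)))
    print(f"N = {N}: sup-norm error ~ {err:.2e}")   # decays rapidly
```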
This approximation can be applied to the study of the convergence of GANs as follows. The continuous version of the AGD algorithm can be thought of as a path of weights, $(\theta_D(t), \theta_G(t))$, depending on the time parameter $t \in \mathbb{R}$. In particular, $(\theta_D(0), \theta_G(0))$ are the initial random weights of the GAN and $(\theta_D(t), \theta_G(t))$ determine the state of the networks after training for a time $t > 0$. In this manner, if we seek to increase $\mathcal{F}(\theta_D, \theta_G)$ in the direction $\theta_D$ and to decrease it in the direction $\theta_G$, the AGD gives rise to a system of Ordinary Differential Equations (ODEs) given by
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \mathcal{F}(\theta_D, \theta_G),$$
where $\theta_D'$ and $\theta_G'$ denote the derivatives of the functions $\theta_D(t)$ and $\theta_G(t)$ with respect to time t. This flow aims to converge to a Nash equilibrium of the objective function $\mathcal{F}$ of the GAN, and for this reason, we refer to it as the Nash flow.
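As a toy illustration of this flow (with a single Fourier mode as objective, not the cost of an actual GAN), the system can be integrated numerically:

```python
# Sketch: numerical integration of the continuous Nash flow for the toy
# periodic objective F(x, y) = sin(2*pi*x) * sin(2*pi*y).
import numpy as np
from scipy.integrate import solve_ivp

def nash_field(t, y):
    x, z = y
    dFdx = 2 * np.pi * np.cos(2 * np.pi * x) * np.sin(2 * np.pi * z)
    dFdz = 2 * np.pi * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * z)
    return [dFdx, -dFdz]        # ascend in theta_D, descend in theta_G

sol = solve_ivp(nash_field, (0.0, 10.0), y0=[0.3, 0.1])
print(sol.y[:, -1] % 1.0)       # state on the torus after time t = 10
```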
However, in many interesting cases, the function F may be very involved and lacks an analytic closed expression that would enable an explicit analysis (e.g., even in the toy example of Equation (13), the cost function is intractable analytically). To address this problem, we propose to approximate F by its truncated Fourier series, Θ ( F ) . In this way, at least locally, the dynamic of the original Nash flow can be read from the solutions to the simplified system
$$\theta_D' = \nabla_D \Theta(\mathcal{F})(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \Theta(\mathcal{F})(\theta_D, \theta_G).$$
In order to analyze this system of ODEs, we propose a novel method focused on studying the dynamics of the Nash flow on Fourier basis functions and on subsequent further approximations. As we will see, for the Nash flow of a basic trigonometric function, the Nash equilibria are not attractors of the flow but centers, that is, they are surrounded by periodic orbits that spin around the critical point. When we consider more Fourier modes in the Fourier expansion of $\mathcal{F}$, these periodic orbits may break, leading to spiral attractors or spiral repulsors. The conditions that bifurcate the centers into spiral sinks or sources can be given explicitly in terms of the combinatorics of the considered Fourier modes.
This provides a theoretical justification of the empirically observed instability of GAN training: the convergent orbits towards a Nash equilibrium are mere perturbations of periodic orbits, falling slowly and spirally to the optimal point. For this reason, small variations in the training hyperparameters, such as the learning rate, the number of epochs, or the batch size, may lead to very different dynamics, which confers on the training its characteristic instability. In addition, in this paper, we empirically evaluate this method against a GAN that aims to generate samples according to an unknown exponential distribution. To facilitate the visualization, we consider a simple GAN, with 1-dimensional parameter spaces in each network, in such a way that the Nash flow can be plotted as a planar path. We show that the proposed approach allows us to understand the simplified dynamics of the GAN and to extract qualitative information on the Nash flow.
It is worth mentioning that, in order to have a natural Fourier series, the considered objective function $\mathcal{F}$ of the GAN must be periodic. This may seem unrealistic in real-life GANs, but it is actually not a very strong condition. Usually, seeking to prove theoretical results about the convergence of GANs, most works force $\mathcal{F}$ to have compact support (for instance, to assure that it is Lipschitz, as in WGANs). In practice, this is accomplished by clipping the output of the generator and discriminator functions for large inputs. Artificially, this turns the objective function into a periodic function, and thus, it can be studied through the method introduced in this paper. We expect that this work will open the door to new methods for analyzing and quantifying the convergence of GANs by importing well-established techniques of harmonic analysis and dynamical systems on closed manifolds, as studied in global analysis.
The structure of this paper is as follows. In Section 2, we review the theoretical fundamentals of GANs and their associated objective function and training method. In Section 2.1, we briefly sketch some basic concepts of Morse theory, a very successful theory that allows us to relate the analytic properties of the function to be optimized with the topological properties of the underlying space. In Section 2.2, we introduce the Nash flow and discuss some of the arising problems for its convergence. In Section 3, we introduce torus GANs, and particularly, in Section 3.1, we explain how to perform Fourier analysis on the torus. Section 4 is devoted to the analysis of the Nash flow for truncated Fourier series, both for basic functions (Section 4.1 and Section 4.2) and for more complicated combinations (Section 4.3 and Section 4.4). In addition, in Section 5, the empirical testing of this method is performed, with comparisons between the real dynamics and the predicted ideal dynamics. Finally, in Section 7, we summarize some of the key ideas of this paper and sketch some lines of future work.

2. GANs Dynamics

As introduced by Goodfellow in [1], a GAN network is a competitive model in which two intelligent agents (typically two neural networks) compete to improve their performance and to generate very precise samples according to a given distribution.
To be precise, let $X: \Omega \to \mathbb{R}^d$ be a d-dimensional random vector, defined on a certain probability space $\Omega$. This random vector X should be understood as a very complex phenomenon whose samples we would like to replicate. For this purpose, we consider two functions:
$$D: \mathbb{R}^d \times \Theta_D \to \mathbb{R}, \qquad G: \Lambda \times \Theta_G \to \mathbb{R}^d,$$
called the discriminator and the generator, respectively. Here, $\Lambda$ is a probability space, called the latent space, and $\Theta_D, \Theta_G$ are two given topological spaces. These functions should be seen as parametric families of functions $D_{\theta_D}: \mathbb{R}^d \to \mathbb{R}$ and $G_{\theta_G}: \Lambda \to \mathbb{R}^d$, parametrized by $\theta_D \in \Theta_D$ and $\theta_G \in \Theta_G$.
The aim of the GAN is to tune the parameters $\theta_D$ and $\theta_G$ in such a way that, given $x \in \mathbb{R}^d$, $D_{\theta_D}(x)$ intends to predict whether $x = X(\omega)$ for some $\omega \in \Omega$, i.e., whether x is compatible with being a real instance or it is a fake datum. Observe that, throughout this paper, we follow the convention that $D_{\theta_D}(x)$ is the probability of being a real instance; thus, $D_{\theta_D}(x) = 1$ means that $D_{\theta_D}$ is sure that x is real, and $D_{\theta_D}(x) = 0$ means that $D_{\theta_D}$ is sure that x is fake. On the other hand, the generative function, $G_{\theta_G}$, is a d-dimensional random vector that seeks to converge in distribution to the original distribution X. Typically, the probability space $\Lambda$ is $\mathbb{R}^l$ with a certain standard probability distribution $\lambda$, such as the spherical normal distribution or a uniform distribution on the unit cube.
Remark 1.
In typical applications in machine learning, $\Omega$ is given by a finite set $\Omega = \{x_1, \ldots, x_N\}$, with $x_i \in \mathbb{R}^d$, endowed with a discrete probability (typically, the uniform one) so that X is just the identity function. In customary applications of GANs, the instances $x_i$ are images, represented by their pixel maps, so the objective of the GAN is to generate new images as similar as possible to the ones in the dataset $\Omega$.
The competition appears because the agents D and G pursue objectives that cannot be simultaneously satisfied. On the one hand, D tries to improve its performance in the classification problem, but on the other hand, G tries to generate results as good as possible to fool D. To be precise, recall that a perfect fit for the classification problem for $D_{\theta_D}$ is given by $D_{\theta_D}(x) = 1$ if x is an instance of X and $D_{\theta_D}(x) = 0$ otherwise. Hence, the $L^1$ error made by $D_{\theta_D}$ with respect to perfect classification is
$$\mathcal{E}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[1 - D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[D_{\theta_D}(G_{\theta_G})\right] = 1 - \mathbb{E}_{\Omega}\left[D_{\theta_D}(X)\right] + \mathbb{E}_{\Lambda}\left[D_{\theta_D}(G_{\theta_G})\right],$$
where $\mathbb{E}_\Omega$ and $\mathbb{E}_\Lambda$ denote the mathematical expectation on $\Omega$ and $\Lambda$, respectively. In this way, the objective of $D_{\theta_D}$ is to minimize $\mathcal{E}$, while the goal of $G_{\theta_G}$ is to maximize it. It is customary in the literature to consider the function $1 - \mathcal{E}$ as the objective and to weight the error with a certain smooth concave function $f: \mathbb{R} \to \mathbb{R}$. In this way, the final cost function is
$$\mathcal{F}(\theta_D, \theta_G) = \mathbb{E}_{\Omega}\left[f\left(D_{\theta_D}(X)\right)\right] + \mathbb{E}_{\Lambda}\left[f\left(-D_{\theta_D}(G_{\theta_G})\right)\right].$$
Remark 2.
Typical choices for the weight function f are $f(s) = -\log(1 + \exp(-s))$, as in the original paper of Goodfellow [1], or $f(s) = s$, as in the Wasserstein GAN [9].
However, in sharp contrast with what is typical in machine learning, the aim of the GAN is not to maximize/minimize F . The objectives of the D and G agents are opposing: while D tries to maximize F , the generator tries to minimize it. In this vein, the objective of the GAN is
$$\min_{\theta_G} \max_{\theta_D} \mathcal{F}(\theta_D, \theta_G) = \min_{\theta_G} \max_{\theta_D}\; \mathbb{E}_{\Omega}\left[f\left(D_{\theta_D}(X)\right)\right] + \mathbb{E}_{\Lambda}\left[f\left(-D_{\theta_D}(G_{\theta_G})\right)\right].$$
In the case that the latent space $\Lambda$ is naturally equipped with a topology (as in the case $\Lambda = (\mathbb{R}^l, \lambda)$), it is customary to require that $\mathcal{F}: \Theta_D \times \Theta_G \to \mathbb{R}$ is a continuous function. In addition, in our case, $\Theta_G$ and $\Theta_D$ are differentiable manifolds, so we require that both D and G are $C^2$ maps in both arguments, and thus, $\mathcal{F}$ is a differentiable function on $\Theta_D \times \Theta_G$.
To be precise, the algorithm proposed by Goodfellow [1] suggests freezing the internal weights of G and using it to generate a batch of fake examples from $\Lambda$. With this set of fake instances and another batch of real instances created using X (i.e., sampling randomly from the dataset of real instances), we train D to improve its accuracy in the classification problem with the usual backpropagation (i.e., gradient descent) method. Afterwards, we freeze the weights of D, sample a batch of latent data of $\Lambda$ (i.e., we randomly sample noise using the latent distribution), and use it to train G by gradient descent with objective function $\theta_G \mapsto \mathbb{E}_\Lambda\left[f\left(-D(G_{\theta_G})\right)\right]$. Finally, we can alternate this process as many times as needed until we reach the desired results. Several metrics have been proposed to quantify this performance, especially regarding the domain of image generation, such as the Inception Score (IS) [12], the Fréchet Inception Distance (FID) [17], or perceptual similarity measures [18]. For a survey of these techniques, please refer to [19].
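The following minimal sketch illustrates the alternating scheme just described for an abstract differentiable cost $\mathcal{F}(\theta_D, \theta_G)$; gradients are estimated by central finite differences, and minibatch sampling and the network internals are deliberately abstracted away (the cost, learning rate, and step counts are placeholder choices):

```python
# Schematic Alternating Gradient Descent (AGD): several discriminator
# ascent steps on F, followed by one generator descent step.
import numpy as np

def partials(F, theta_D, theta_G, h=1e-5):
    dD = (F(theta_D + h, theta_G) - F(theta_D - h, theta_G)) / (2 * h)
    dG = (F(theta_D, theta_G + h) - F(theta_D, theta_G - h)) / (2 * h)
    return dD, dG

def agd(F, theta_D, theta_G, lr=0.05, epochs=1000, d_steps=5):
    for _ in range(epochs):
        for _ in range(d_steps):                    # train D: maximize F
            dD, _ = partials(F, theta_D, theta_G)
            theta_D += lr * dD
        _, dG = partials(F, theta_D, theta_G)       # train G: minimize F
        theta_G -= lr * dG
    return theta_D, theta_G

# Toy periodic cost; in a real GAN, F would be estimated on minibatches.
F = lambda d, g: np.sin(2 * np.pi * d) * np.sin(2 * np.pi * g)
print(agd(F, 0.3, 0.1))
```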

2.1. Review of Morse Theory

Let us suppose for a moment that, instead of looking for solutions of (2), we were seeking the local maxima of $\mathcal{F}$. In this situation, the standard approach in machine learning is to consider the Morse flow, also known as the gradient ascent flow. For it, let us fix Riemannian metrics on $\Theta_D$ and $\Theta_G$. Using them, we can compute the gradient of $\mathcal{F}$, $\nabla\mathcal{F} = (\nabla_D\mathcal{F}, \nabla_G\mathcal{F})$, where $\nabla_D\mathcal{F}$ and $\nabla_G\mathcal{F}$ denote the gradients in the $\theta_D$ and $\theta_G$ directions, respectively. Then, the Morse flow is the differentiable flow on $\Theta_D \times \Theta_G$ generated by the vector field $\nabla\mathcal{F}$. Explicitly, it is given by the system of ODEs:
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = \nabla_G \mathcal{F}(\theta_D, \theta_G).$$
This flow has been the object of intense study in the context of differential geometry and geometric topology. For instance, it is the crucial tool used in Smale's proof of the Poincaré conjecture in high dimensions [20] and has been successfully used to understand the topology of moduli spaces of solutions to highly nonlinear partial differential equations coming from theoretical physics [21], among others.
Obviously, the critical points of the system (3) are exactly the critical points of $\mathcal{F}$, in the sense that the differential $d\mathcal{F}|_{(\theta_D^0, \theta_G^0)} = 0$. In order to control the dynamics of this ODE around a critical point, a key concept is the notion of the index of a point.
Definition 1.
Let $(\theta_D^0, \theta_G^0)$ be a critical point of $\mathcal{F}$. The Hessian of $\mathcal{F}$ at $(\theta_D^0, \theta_G^0)$ is the symmetric 2-form $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0} \in \mathrm{Sym}^2\left(T^*_{\theta_D^0}\Theta_D \oplus T^*_{\theta_G^0}\Theta_G\right)$ given by
$$\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}(v, w) = w\left(\tilde{v}(\mathcal{F})\right),$$
for $v \in T_{\theta_D^0}\Theta_D$, $w \in T_{\theta_G^0}\Theta_G$, and $\tilde{v}$ any extension of $v$ to a vector field in a small neighborhood of $(\theta_D^0, \theta_G^0)$.
The point $(\theta_D^0, \theta_G^0)$ is said to be non-degenerate if $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}$ is non-degenerate as a 2-form. In that case, the index of the point, denoted $\lambda(\theta_D^0, \theta_G^0)$, is the number of negative eigenvalues of $\mathrm{Hess}(\mathcal{F})|_{\theta_D^0, \theta_G^0}$. A function $\mathcal{F}$ is said to be Morse if all its critical points are non-degenerate.
More explicitly, let $D_1, \ldots, D_{d_D}$ be a basis of $T_{\theta_D^0}\Theta_D$ and $G_1, \ldots, G_{d_G}$ be a basis of $T_{\theta_G^0}\Theta_G$, where $d_D$ and $d_G$ are the dimensions of $\Theta_D$ and $\Theta_G$, respectively. Then, the Hessian is the matrix of second derivatives:
$$\mathrm{Hess}(\mathcal{F}) = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] \dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}$$
If $\Theta_D$ and $\Theta_G$ are compact, Morse functions are known to form a dense open set of the space of continuous functions on $\Theta_D \times \Theta_G$ [20]. Moreover, the critical points of a Morse function are isolated, in the sense that there exists an open neighborhood of each critical point that contains only that critical point. Indeed, the stability of a critical point $(\theta_D, \theta_G)$ is fully determined by its index: $(\theta_D, \theta_G)$ is a sink along a submanifold of dimension $\lambda(\theta_D, \theta_G)$, while it is a source along a submanifold of dimension $d_D + d_G - \lambda(\theta_D, \theta_G)$. In particular, the only sinks of the Morse flow are precisely the local maxima of $\mathcal{F}$, at which $\mathrm{Hess}(\mathcal{F})$ is negative-definite and, thus, $\lambda(\theta_D, \theta_G) = d_D + d_G$.
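Numerically, the index and the resulting classification can be read off from the eigenvalues of the Hessian; a small illustrative sketch (the matrix below is a placeholder):

```python
# Sketch: the index of a non-degenerate critical point is the number of
# negative eigenvalues of the Hessian; sinks of the Morse flow are the
# local maxima (all eigenvalues negative).
import numpy as np

def morse_index(hessian):
    eigvals = np.linalg.eigvalsh(hessian)          # symmetric matrix
    assert np.all(np.abs(eigvals) > 1e-12), "degenerate critical point"
    return int(np.sum(eigvals < 0))

H = np.array([[-2.0, 0.5], [0.5, -1.0]])           # negative-definite example
lam = morse_index(H)
print(lam, "-> local maximum" if lam == H.shape[0] else "-> saddle or minimum")
```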
Another important fact that we use is the following topological interpretation of the indices, known as the Poincaré–Hopf theorem. It claims that, if Θ D and Θ G are compact, then
$$\sum_{(\theta_D, \theta_G) \in \mathrm{Crit}(\mathcal{F})} (-1)^{\lambda(\theta_D, \theta_G)} = \chi(\Theta_D \times \Theta_G) = \chi(\Theta_D)\, \chi(\Theta_G).$$
Here, Crit ( F ) denotes the (finite) set of critical points of F and χ is the Euler characteristic of the space.

2.2. The Nash Flow

Now, let us come back to our optimization problem (2). Despite the simplicity of the formulation of the cost function, this problem is very far from being trivial. The best scenario would be to obtain a so-called Nash equilibrium.
Definition 2.
Let $\mathcal{F}: \Theta_D \times \Theta_G \to \mathbb{R}$ be a differentiable function. A point $(\theta_D^0, \theta_G^0) \in \Theta_D \times \Theta_G$ is said to be a Nash equilibrium if
  • the function $\theta_D \mapsto \mathcal{F}(\theta_D, \theta_G^0)$ has a maximum at $\theta_D^0$;
  • the function $\theta_G \mapsto \mathcal{F}(\theta_D^0, \theta_G)$ has a minimum at $\theta_G^0$.
Remark 3.
A Nash equilibrium is in particular a critical point of F .
In this vein, it is natural to consider a differentiable flow analogous to (3) but converging to Nash equilibria. For this purpose, fix Riemannian metrics on $\Theta_D$ and $\Theta_G$ as above and consider the gradient $\nabla\mathcal{F} = (\nabla_D\mathcal{F}, \nabla_G\mathcal{F})$. Now, we twist the gradient to consider the Nash vector field:
$$N(\mathcal{F}) = \left(\nabla_D \mathcal{F},\; -\nabla_G \mathcal{F}\right).$$
Definition 3.
The Nash flow is the differentiable flow on Θ D × Θ G generated by the Nash vector field N ( F ) . Explicitly, it is the system of ODEs:
$$\theta_D' = \nabla_D \mathcal{F}(\theta_D, \theta_G), \qquad \theta_G' = -\nabla_G \mathcal{F}(\theta_D, \theta_G).$$
This flow (or, more precisely, the associated discrete-time version known as the AGD flow) has been intensively used for training GANs from their very inception. Already in Goodfellow's seminal paper [1], this flow was proposed as a method for seeking Nash equilibria of the game (2).
To understand the dynamics of the Nash flow, let us study it around a critical point. Working in a local chart around a critical point, with an adapted basis $D_1, \ldots, D_{d_D}, G_1, \ldots, G_{d_G}$ of $T_{\theta_D^0}\Theta_D \oplus T_{\theta_G^0}\Theta_G$, the differential of the Nash vector field is the Nash Hessian:
$$N\mathrm{Hess}(\mathcal{F}) = N(\mathcal{F})_* = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}$$
In this manner, in a small neighborhood of a critical point ( θ D 0 , θ G 0 ) Θ D × Θ G of F (in particular, around a Nash equilibrium), the dynamics are determined by the linearized version:
$$\begin{pmatrix} \theta_D' \\ \theta_G' \end{pmatrix} = \begin{pmatrix} \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_D^j} & \dfrac{\partial^2 \mathcal{F}}{\partial \theta_D^i \partial \theta_G^j} \\[2mm] -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_D^j} & -\dfrac{\partial^2 \mathcal{F}}{\partial \theta_G^i \partial \theta_G^j} \end{pmatrix}_{(\theta_D^0, \theta_G^0)} \begin{pmatrix} \theta_D \\ \theta_G \end{pmatrix}$$
However, in sharp contrast with the Morse flow, even if F has non-degenerate critical points, it may happen that the Nash equilibria are not attractors. For instance, if the Nash Hessian has a vanishing diagonal (as in Section 4.2), then periodic orbits arise around the critical point and the flow is non-convergent.
Nonetheless, this behavior can be controlled. Suppose for simplicity that $d_D = d_G = 1$ (higher dimensional scenarios can be treated analogously by splitting the tangent space). In that case, the eigenvalues of $N\mathrm{Hess}(\mathcal{F})$ are either both real or complex conjugates.
  • If the eigenvalues are real around a Nash equilibrium, both eigenvalues must be nonpositive, since in the usual Hessian they have different signs. Hence, the Nash equilibrium is a non-repulsor of the Nash flow. Moreover, if $\mathcal{F}$ is Morse, then its eigenvalues do not vanish and, thus, the Nash equilibrium is an attractor.
  • If the eigenvalues are complex conjugates, say $\lambda, \bar{\lambda} \in \mathbb{C}$, then the dynamic is controlled by the real part of $\lambda$, $\mathrm{Re}(\lambda)$. There is an invariant way of computing this quantity through the trace of $N\mathrm{Hess}(\mathcal{F})$, since
    $$2\,\mathrm{Re}(\lambda) = \lambda + \bar{\lambda} = \mathrm{tr}\, N\mathrm{Hess}(\mathcal{F}) = \frac{\partial^2 \mathcal{F}}{\partial \theta_D^2} - \frac{\partial^2 \mathcal{F}}{\partial \theta_G^2}.$$
    Observe that this is nothing but the wave operator acting on $\mathcal{F}$. In the case that this trace is negative, the critical point is an attractor with spiral dynamics; if it is positive, it is a repulsor; and if it vanishes, it is a center with surrounding periodic orbits (see the sketch below).
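The following sketch implements this classification for $d_D = d_G = 1$: it assembles the Nash Hessian from given second derivatives of $\mathcal{F}$ at a critical point and distinguishes attractors, repulsors, spirals, and centers (the numerical inputs are illustrative placeholders):

```python
# Sketch: classify a critical point of the Nash flow from the twisted
# Hessian built out of the second derivatives of F.
import numpy as np

def classify_nash_critical_point(F_DD, F_DG, F_GD, F_GG, tol=1e-12):
    nhess = np.array([[F_DD, F_DG], [-F_GD, -F_GG]])   # Nash Hessian
    eigvals = np.linalg.eigvals(nhess)
    if np.all(np.abs(eigvals.imag) < tol):             # real eigenvalues
        return "attractor" if np.all(eigvals.real < 0) else "saddle/repulsor"
    trace = np.trace(nhess)                            # = 2 Re(lambda)
    if trace < -tol:
        return "spiral attractor"
    if trace > tol:
        return "spiral repulsor"
    return "center (periodic orbits)"

# Vanishing diagonal, as in Section 4.2: a center with periodic orbits.
print(classify_nash_critical_point(F_DD=0.0, F_DG=1.0, F_GD=1.0, F_GG=0.0))
```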
It is worth mentioning that, in the case of GANs, the function $\mathcal{F}$ of (2) to be optimized does not define a convex–concave game, so, in general, the convergence of the usual training methods through the Nash flow is not guaranteed [3]. Under some ideal assumptions on the behaviour of the game around the Nash equilibrium points, in [3], the authors proved that the Nash flow is locally asymptotically stable. However, the hypotheses needed to apply this result are quite strong and seem to be unfeasible in practice. For instance, in [4], the authors show an example of a very simple GAN, the so-called Dirac GAN, for which the usual gradient descent does not converge.

3. Torus GANs

From now on, let us focus on a very particular case of GAN that we call a torus GAN. Let us denote
$$T^n = \underbrace{S^1 \times \cdots \times S^1}_{n \text{ times}}$$
as the n-dimensional torus. Then, we take as parameter spaces $\Theta_D = T^{d_D}$ and $\Theta_G = T^{d_G}$. In this way, the cost functional becomes a function:
$$\mathcal{F}: T^{d_D} \times T^{d_G} = T^{d_D + d_G} \to \mathbb{R}.$$
Remark 4.
This particular choice is not as arbitrary as it may seem at first sight. In the end, a torus GAN is any GAN in which the generator and discriminator are periodic functions of their parameters $\theta_D$ and $\theta_G$ for some large enough period. In standard neural network-based GANs, it is customary to clip the output of the neural network in order to prevent the internal weights from becoming arbitrarily large. This is particularly important in Wasserstein GANs, where the objective function is required to be Lipschitz, and this is achieved by forcing the cost function to have compact support. In this way, after clipping, both the generator and the discriminator agents are periodic functions, and thus, they define a torus GAN.
Working on the torus has important consequences for the dynamics of the Morse flow. Some of them are the following:
  • Divergent orbits are not allowed. Since $T^n$ is compact, standard results on the prolongability of solutions for a short time show that the orbits of any vector flow cannot blow up. Intuitively, they cannot escape by tending to infinity. In particular, if $\mathcal{F}$ is a Morse function, all the orbits in the Morse flow must converge to a critical point. This is a consequence of the fact that, along a non-constant orbit of the Morse flow, the function $\mathcal{F}$ is strictly increasing since
    $$\frac{d}{dt}\mathcal{F}(\theta_D, \theta_G) = d\mathcal{F}(\theta_D', \theta_G') = d\mathcal{F}(\nabla\mathcal{F}) = ||\nabla\mathcal{F}||^2 > 0.$$
    Thus, since F is bounded, the flow is forced to converge to a constant orbit, that is, to a critical point of F . This prevents the appearance of periodic orbits in the Morse flow. In the Nash flow, this may no longer hold and periodic orbits may arise (as in Section 4.2).
  • Topological restrictions: the Euler characteristic of $T^n$ is $\chi(T^n) = \chi(S^1)^n = 0$. Hence, Equation (4) implies that
    $$\sum_{(\theta_D, \theta_G) \in \mathrm{Crit}(\mathcal{F})} (-1)^{\lambda(\theta_D, \theta_G)} = 0.$$
    In other words, there is the same number of critical points of even index as of odd index. In particular, if $d_D = d_G = 1$, there are as many saddle points (which are points of index 1) as maxima and minima (which are points of index 2 or 0).

3.1. Fourier Analysis in the Torus

In order to understand the cost function F of a torus GAN, we apply techniques of harmonic analysis to it. We suppose that the reader is familiar with basic notions of Fourier and harmonic analysis, such as Hilbert spaces and orthogonal Schauder basis on them. Otherwise, please refer to [22].
Let us consider $T^n = \mathbb{R}^n / \mathbb{Z}^n$, so that functions on $T^n$ are $\mathbb{Z}^n$-periodic functions on $\mathbb{R}^n$. Recall that a fundamental result of Fourier analysis is that the space $L^2(T^n)$ of complex-valued square-integrable functions on $T^n$ is a Hilbert space with product given by
$$\langle \mathcal{F}, \mathcal{G} \rangle = \int_{T^n} \mathcal{F}(\theta)\, \overline{\mathcal{G}(\theta)}\, d\theta.$$
Moreover, this space is spanned by the orthonormal basis of functions:
$$e_m(\theta) = e^{2\pi i\, m \cdot \theta},$$
where $m = (m_1, \ldots, m_n) \in \mathbb{Z}^n$, $\theta = (\theta_1, \ldots, \theta_n) \in T^n$, and $m \cdot \theta = m_1\theta_1 + \cdots + m_n\theta_n$ is the standard inner product. In other words, any $\mathcal{F} \in L^2(T^n)$ can be uniquely written as a sum:
$$\mathcal{F}(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e_m(\theta) = \sum_{m \in \mathbb{Z}^n} \alpha_m\, e^{2\pi i\, m \cdot \theta},$$
in the sense that this sum is convergent in L 2 ( T n ) and converges to F . This expression is referred to as the Fourier series of F . The coefficients α m are called the Fourier coefficients or the Fourier modes of F . Using the orthogonality of the functions e m ( θ ) , they can be obtained as
$$\alpha_m = \langle \mathcal{F}, e_m \rangle = \int_{T^n} \mathcal{F}(\theta)\, e^{-2\pi i\, m \cdot \theta}\, d\theta.$$
In principle, the convergence of the Fourier series to $\mathcal{F}$ is only in the $L^2$ sense (cf. [23] for a Fourier series of a continuous function not converging pointwise everywhere, or [24] for an everywhere divergent Fourier series of an $L^1$ function). However, if $\mathcal{F}$ is $C^1$, since we are working on a compact space, it is automatically Hölder and, thus, its Fourier series converges uniformly [25]. This means that, for every $\epsilon > 0$,
$$\left\| \mathcal{F} - \sum_{|m_i| \le N} \alpha_m\, e_m \right\|_\infty = \sup_{\theta \in T^n} \left| \mathcal{F}(\theta) - \sum_{|m_i| \le N} \alpha_m\, e^{2\pi i\, m \cdot \theta} \right| < \epsilon,$$
for all N large enough. Similar approximations can be obtained for the first k derivatives of $\mathcal{F}$ if it has enough regularity (concretely, if it is $C^{k+1}$).
This approximation is very useful for estimating the associated flow. Recall that, using the Gronwall inequality [26], if X , Y are two Lipschitz vector fields, then there exists a constant M > 0 such that their associated flows θ ( t ) and ϑ ( t ) satisfy
$$|\theta(t) - \vartheta(t)| \le \frac{e^{Mt} - 1}{M}\, ||X - Y||_\infty$$
for all t. In other words, for medium times, the flow of X may be approximated through the flow of Y.
Remark 5.
The previous estimation implies that, locally, the dynamics of the flows θ ( t ) and ϑ ( t ) are similar. In particular, this is useful for analyzing convergence around critical points. Nevertheless, the global dynamics of θ ( t ) and ϑ ( t ) may be quite different, say, they may have different numbers of critical points.
In our context, this idea can be exploited as follows. Let us denote by
$$\Theta_N(\mathcal{F}) = \sum_{|m_i| \le N} \alpha_m\, e_m$$
the truncated Fourier series of $\mathcal{F}$. If $\mathcal{F}$ is $C^2$, then $\nabla\mathcal{F}$ and $\nabla\Theta_N(\mathcal{F})$ are close vector fields and, thus,
$$|\theta(t) - \theta_N(t)| \le \frac{e^{Mt} - 1}{M}\, ||\nabla\mathcal{F} - \nabla\Theta_N(\mathcal{F})||_\infty \le \epsilon\left(e^{Mt} - 1\right)$$
for N large enough, where θ ( t ) is the Morse flow for F and θ N ( t ) is the Morse flow for Θ N ( F ) . Working verbatim with the Nash vector fields, we obtain similar estimates for the solutions of the Nash flow.

4. Dynamics of Fourier Basis

In this section, we focus on the Nash flow of truncated approximations of Fourier series of a C 2 function F . As we mentioned above, these solutions approximate quite well the real Nash flow of F for short times (particularly, around critical points).
For the sake of simplicity, in this section, we focus on the 2-dimensional case in which d D = d G = 1 so that F = F ( θ 1 , θ 2 ) is a function:
$$\mathcal{F}: T^2 \to \mathbb{R}.$$
Moreover, we truncate the Fourier series at the level $N = 2$. Similar arguments can be carried out in higher dimensions and with higher precision of the Fourier series with similar results, but the calculations become more involved.
First, let us rewrite the Fourier series of F as a trigonometric polynomial. Recall that the trigonometric functions can be obtained from the complex exponential as
$$\cos(2\pi\theta) = \frac{e^{2\pi i \theta} + e^{-2\pi i \theta}}{2}, \qquad \sin(2\pi\theta) = \frac{e^{2\pi i \theta} - e^{-2\pi i \theta}}{2i}.$$
Since the function $\mathcal{F}$ is real-valued, we can group the coefficients and obtain a formula for the Fourier series in terms of trigonometric functions as
$$\begin{aligned} \mathcal{F}(\theta_1, \theta_2) = {}&\sum_{m_1, m_2 = 0}^{\infty} a^{0,0}_{m_1, m_2}\, \sin(2\pi m_1 \theta_1)\sin(2\pi m_2 \theta_2) + \sum_{m_1, m_2 = 0}^{\infty} a^{0,1}_{m_1, m_2}\, \sin(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2) \\ + {}&\sum_{m_1, m_2 = 0}^{\infty} a^{1,0}_{m_1, m_2}\, \cos(2\pi m_1 \theta_1)\sin(2\pi m_2 \theta_2) + \sum_{m_1, m_2 = 0}^{\infty} a^{1,1}_{m_1, m_2}\, \cos(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2). \end{aligned}$$
The coefficients are real numbers that can be obtained as
$$\begin{aligned} a^{0,0}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{0,1}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{1,0}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \\ a^{1,1}_{m_1,m_2} &= \delta_{m_1,m_2} \left\langle \mathcal{F}, \cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2) \right\rangle = \delta_{m_1,m_2} \int_{T^2} \mathcal{F}(\theta_1,\theta_2)\cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)\, d\theta_1\, d\theta_2, \end{aligned}$$
where $\delta_{m_1,m_2}$ is the normalization coefficient given by $\delta_{m_1,m_2} = 1$ if $m_1 = m_2 = 0$; $\delta_{m_1,m_2} = 2$ if $m_1 = 0$ and $m_2 > 0$, or $m_1 > 0$ and $m_2 = 0$; and $\delta_{m_1,m_2} = 4$ if $m_1, m_2 > 0$.
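As an illustration, these coefficients can be approximated by rectangular quadrature on a uniform grid over $T^2$; a minimal sketch, with a placeholder periodic function in place of the GAN cost:

```python
# Sketch: rectangular-quadrature estimate of a^{alpha,beta}_{m1,m2} on T^2.
# Convention from the text: exponent 0 = sin, exponent 1 = cos.
import numpy as np

def trig_coefficient(F, m1, m2, alpha, beta, grid=200):
    t = np.arange(grid) / grid
    t1, t2 = np.meshgrid(t, t, indexing="ij")
    trig = [np.sin, np.cos]
    basis = trig[alpha](2 * np.pi * m1 * t1) * trig[beta](2 * np.pi * m2 * t2)
    # Normalization delta_{m1,m2}: 1, 2 or 4 depending on vanishing modes.
    delta = (1 + (m1 > 0)) * (1 + (m2 > 0))
    return delta * np.mean(F(t1, t2) * basis)   # rectangular rule on [0,1]^2

F = lambda t1, t2: np.cos(2 * np.pi * t1) * np.cos(2 * np.pi * t2) \
    + 0.3 * np.cos(4 * np.pi * t1) * np.cos(2 * np.pi * t2)
print(trig_coefficient(F, 1, 1, alpha=1, beta=1))   # ~ 1.0
print(trig_coefficient(F, 2, 1, alpha=1, beta=1))   # ~ 0.3
```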
To shorten notation, from now on, we denote
$$\begin{aligned} \Lambda^{0,0}_{m_1,m_2}(\theta_1, \theta_2) &= \sin(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2), & \Lambda^{0,1}_{m_1,m_2}(\theta_1, \theta_2) &= \sin(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2), \\ \Lambda^{1,0}_{m_1,m_2}(\theta_1, \theta_2) &= \cos(2\pi m_1\theta_1)\sin(2\pi m_2\theta_2), & \Lambda^{1,1}_{m_1,m_2}(\theta_1, \theta_2) &= \cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2). \end{aligned}$$
This notation is particularly useful because, for any α , β Z 2 ,
$$\frac{\partial}{\partial \theta_1} \Lambda^{\alpha,\beta}_{m_1,m_2} = (-1)^{\alpha}\, 2\pi m_1\, \Lambda^{\alpha+1,\beta}_{m_1,m_2}, \qquad \frac{\partial}{\partial \theta_2} \Lambda^{\alpha,\beta}_{m_1,m_2} = (-1)^{\beta}\, 2\pi m_2\, \Lambda^{\alpha,\beta+1}_{m_1,m_2},$$
where the sums $\alpha + 1$ and $\beta + 1$ are interpreted in $\mathbb{Z}_2$.
From this expression of the Fourier series, we approximate the dynamics of the Nash flow for $\mathcal{F}$ by truncating the Fourier series. In particular, we sort the coefficients $a^{\alpha,\beta}_{m_1,m_2}$ by decreasing order of their absolute value. Looking only at the two largest coefficients and normalizing so that the leading coefficient is 1, we consider the approximation to $\mathcal{F}$:
$$\Theta(\mathcal{F}) = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
where $\alpha, \beta, \gamma, \delta \in \mathbb{Z}_2$, $(m_1, m_2)$ are the leading Fourier modes, $(n_1, n_2)$ are the second largest modes, and $|\mu| < 1$.

4.1. Nash Flow for Single Variable Fourier Basis

From now on, we aim to analyze the Nash flow for a truncated Fourier series. As we see in Section 5, from it, we can envisage the global dynamics of the Nash flow for the objective function of a GAN.
First, let us consider the simplest Fourier modes, namely those with $m_1 = 0$ or $m_2 = 0$. In this case, the dynamics are quite simple and, in most cases, decouple. In the case of $\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) \equiv 1$, the Nash flow equations amount to
$$\theta_1' = \frac{\partial}{\partial \theta_1}\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) = 0, \qquad \theta_2' = -\frac{\partial}{\partial \theta_2}\Lambda^{\alpha,\beta}_{0,0}(\theta_1, \theta_2) = 0.$$
Therefore, the solutions are constant orbits $(\theta_1(t), \theta_2(t)) = (\theta_1^0, \theta_2^0)$ for some fixed $(\theta_1^0, \theta_2^0) \in T^2$. For this reason, this mode does not contribute to the dynamics.
For Fourier modes of the form $\Lambda^{0,\beta}_{m_1,0}(\theta_1, \theta_2) = \sin(2\pi m_1\theta_1)$ or $\Lambda^{1,\beta}_{m_1,0}(\theta_1, \theta_2) = \cos(2\pi m_1\theta_1)$, the situation is also very simple. Now, the Nash flow is given by
$$\theta_1' = \frac{\partial}{\partial \theta_1}\Lambda^{\alpha,\beta}_{m_1,0}(\theta_1, \theta_2) = (-1)^{\alpha}\, 2\pi m_1\, \Lambda^{\alpha+1,\beta}_{m_1,0}(\theta_1, \theta_2), \qquad \theta_2' = -\frac{\partial}{\partial \theta_2}\Lambda^{\alpha,\beta}_{m_1,0}(\theta_1, \theta_2) = 0.$$
The solution to this system has the form $(\theta_1(t), \theta_2(t)) = (f^{\alpha}_{m_1}(t), \theta_2^0)$ for some fixed $\theta_2^0$, where $f^{\alpha}_{m_1}(t)$ is a differentiable function depending on $m_1$ and $\alpha$ (the explicit form of $f^{\alpha}_{m_1}(t)$ can be obtained by solving the 1-dimensional ODE for $\theta_1$ by separation of variables; see the worked computation below). Thus, the flow is completely horizontal, with $2m_1$ lines of critical points at the lines $\theta_1 = \frac{2k_1 - \alpha + 1}{4m_1}$ for $k_1 \in \mathbb{Z}$. Half of these critical lines are attractive, corresponding to the maxima of $\Lambda^{\alpha,\beta}_{m_1,0}$, and half of them are repulsive, corresponding to the minima.
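For instance, in the sine case $\alpha = 0$ (taking the constant $\theta_2$-factor equal to 1, so that $\theta_1' = 2\pi m_1 \cos(2\pi m_1\theta_1)$), a short derivation by separation of variables, which we include here for completeness, gives the explicit profile:

```latex
% Horizontal Nash flow for alpha = 0:  theta_1' = 2 pi m_1 cos(2 pi m_1 theta_1).
% Substituting u = 2 pi m_1 theta_1 and separating variables:
\begin{align*}
  u' = (2\pi m_1)^2 \cos u
  \quad &\Longrightarrow \quad
  \int \frac{du}{\cos u} = (2\pi m_1)^2\, t + C \\
  &\Longrightarrow \quad
  u(t) = 2\arctan\left(\tanh\left(\frac{(2\pi m_1)^2\, t + C}{2}\right)\right),
\end{align*}
% the inverse Gudermannian integral. As t -> infinity, u -> pi/2, so
% theta_1(t) -> 1/(4 m_1): the orbit converges to the attractive critical
% line through a maximum of sin(2 pi m_1 theta_1).
```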
The situation for the Fourier modes of the form $\Lambda^{\alpha,0}_{0,m_2}(\theta_1, \theta_2) = \sin(2\pi m_2\theta_2)$ or $\Lambda^{\alpha,1}_{0,m_2}(\theta_1, \theta_2) = \cos(2\pi m_2\theta_2)$ is completely symmetric. Now, the flow is vertical and the critical lines are at $\theta_2 = \frac{2k_2 - \beta + 1}{4m_2}$ for $k_2 \in \mathbb{Z}$ (but now the attractive lines correspond to the minima and the repulsive ones to the maxima).
Furthermore, we can collect all the Fourier modes with a vanishing frequency into a single function. To be precise, decompose the Fourier series of F as
$$\mathcal{F} = \underbrace{\left(\frac{a^{1,1}_{0,0}}{2} + \sum_{\substack{1 \le m_1 < \infty \\ \alpha = 0,1}} a^{\alpha,1}_{m_1,0}\, \Lambda^{\alpha,1}_{m_1,0}\right)}_{\Delta_1(\theta_1)} + \underbrace{\left(\frac{a^{1,1}_{0,0}}{2} + \sum_{\substack{1 \le m_2 < \infty \\ \beta = 0,1}} a^{1,\beta}_{0,m_2}\, \Lambda^{1,\beta}_{0,m_2}\right)}_{\Delta_2(\theta_2)} + \underbrace{\sum_{m_1, m_2 = 1}^{\infty} \sum_{\alpha, \beta \in \mathbb{Z}_2} a^{\alpha,\beta}_{m_1,m_2}\, \Lambda^{\alpha,\beta}_{m_1,m_2}}_{\Theta(\theta_1, \theta_2)}.$$
Now, the superposition principle applied to (5) implies that any solution to the Nash flow has the following form:
$$(\theta_1(t), \theta_2(t)) = (\hat{\theta}_1(t), \theta_2^0) + (\theta_1^0, \hat{\theta}_2(t)) + \Phi(t),$$
where $(\hat{\theta}_1(t), \theta_2^0)$ is a horizontal flow corresponding to the solution of (5) for $\Delta_1$ (explicitly, $\hat{\theta}_1$ is the solution to the equation $\hat{\theta}_1' = \frac{d}{d\theta_1}\Delta_1(\hat{\theta}_1)$), $(\theta_1^0, \hat{\theta}_2(t))$ is a vertical flow corresponding to the solution of (5) for $\Delta_2$ (i.e., $\hat{\theta}_2$ is the solution to $\hat{\theta}_2' = -\frac{d}{d\theta_2}\Delta_2(\hat{\theta}_2)$), and $\Phi$ is the solution to the (coupled) system of Equation (5) for $\Theta$.
For this reason, in many cases, the effect of the Δ 1 and the Δ 2 parts in the dynamics is negligible and can be ignored.

4.2. Nash Flow for Fourier Basis

In this section, we analyze the dynamics of the Nash flow for the remaining Fourier basis. For this purpose, let us consider the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$ for some $\alpha, \beta \in \mathbb{Z}_2$ with $m_1, m_2 \ge 1$. The Nash vector field associated with it is
$$N\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right) = 2\pi\left((-1)^{\alpha}\, m_1\, \Lambda^{\alpha+1,\beta}_{m_1,m_2},\; (-1)^{\beta+1}\, m_2\, \Lambda^{\alpha,\beta+1}_{m_1,m_2}\right).$$
Recall that, if $(\theta_1, \theta_2) \in T^2$ is a zero of $\Lambda^{\alpha,\beta}_{m_1,m_2}$, then it satisfies
$$4\theta_1 m_1 \equiv 2k_1 + \alpha \;\;\mathrm{mod}\; 4\mathbb{Z}, \quad \text{or} \quad 4\theta_2 m_2 \equiv 2k_2 + \beta \;\;\mathrm{mod}\; 4\mathbb{Z},$$
for some $k_1, k_2 \in \mathbb{Z}$. In other words, if we take into account the periodicity of the function $\Lambda^{\alpha,\beta}_{m_1,m_2}$, the zeros are given by
$$\theta_1 = \frac{2k_1 + \alpha}{4m_1}, \quad \text{or} \quad \theta_2 = \frac{2k_2 + \beta}{4m_2},$$
for $0 \le k_1 < 2m_1$ and $0 \le k_2 < 2m_2$. Observe that all these values are different, so $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $4m_1m_2$ zeros.
Coming back to Equation (7), we observe that, if $(\theta_1, \theta_2) \in T^2$ is a critical point of the Nash vector field (i.e., a critical point of $\Lambda^{\alpha,\beta}_{m_1,m_2}$), then it satisfies one of the following two possibilities:
(I) $\left(4\theta_1 m_1,\, 4\theta_2 m_2\right) \equiv \left(2k_1 - \alpha + 1,\, 2k_2 - \beta + 1\right) \;\mathrm{mod}\; 4\mathbb{Z} \times 4\mathbb{Z}$,
(II) $\left(4\theta_1 m_1,\, 4\theta_2 m_2\right) \equiv \left(2k_1 + \alpha,\, 2k_2 + \beta\right) \;\mathrm{mod}\; 4\mathbb{Z} \times 4\mathbb{Z}$.
Beware of the change in sign of the coefficients of $\alpha$ and $\beta$ for points of type (I). This is just a matter of notational convenience, as shown below. Equivalently, these conditions can be written explicitly as
(I) $(\theta_1, \theta_2) = \left(\dfrac{2k_1 - \alpha + 1}{4m_1},\; \dfrac{2k_2 - \beta + 1}{4m_2}\right)$, for $k_1, k_2 \in \mathbb{Z}$,
(II) $(\theta_1, \theta_2) = \left(\dfrac{2k_1 + \alpha}{4m_1},\; \dfrac{2k_2 + \beta}{4m_2}\right)$, for $k_1, k_2 \in \mathbb{Z}$.
Thus, the Nash vector field has $8m_1m_2$ critical points: $4m_1m_2$ critical points of type (I) and $4m_1m_2$ of type (II).
Regarding the Nash Hessian, it is explicitly given by
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right) = 4\pi^2\begin{pmatrix} -m_1^2\, \Lambda^{\alpha,\beta}_{m_1,m_2} & (-1)^{\alpha+\beta}\, m_1 m_2\, \Lambda^{\alpha+1,\beta+1}_{m_1,m_2} \\ (-1)^{\alpha+\beta+1}\, m_1 m_2\, \Lambda^{\alpha+1,\beta+1}_{m_1,m_2} & m_2^2\, \Lambda^{\alpha,\beta}_{m_1,m_2} \end{pmatrix}$$
Therefore, evaluated at a critical point of the form (I), we get that
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right)\Big|_{(\mathrm{I})} = (-1)^{k_1+k_2}\, 4\pi^2 \begin{pmatrix} -m_1^2 & 0 \\ 0 & m_2^2 \end{pmatrix}.$$
These are all saddle points for the Nash flow, with an attractive direction and a repulsive direction.
On the other hand, the Nash Hessian evaluated at a critical point of the form (II) is
$$N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{m_1,m_2}\right)\Big|_{(\mathrm{II})} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & m_1 m_2 \\ -m_1 m_2 & 0 \end{pmatrix} \sim (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2\, m_1 m_2 \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix}.$$
In this situation, we obtain a center critical point with periodic orbits around it and no convergent flow lines. This dynamic is depicted in Figure 1. Observe that, in this plot, the 2-dimensional torus $T^2$ is represented as the square $[0,1] \times [0,1]$ with the boundaries identified in pairs, i.e., the left boundary $\{0\} \times [0,1]$ is identified with the right boundary $\{1\} \times [0,1]$ preserving the orientation, and so are the bottom boundary $[0,1] \times \{0\}$ and the upper one $[0,1] \times \{1\}$.
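A phase portrait in the spirit of Figure 1 can be reproduced with a few lines of Python; the sketch below plots the Nash vector field of the illustrative basis function $\Lambda^{0,0}_{1,1}$, whose type (I) points are saddles and whose type (II) points are centers:

```python
# Sketch: stream plot of the Nash field of sin(2*pi*t1)*sin(2*pi*t2) on [0,1]^2.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 40)
T1, T2 = np.meshgrid(t, t)
# Nash field: (+d/dtheta_1, -d/dtheta_2) of Lambda.
U = 2 * np.pi * np.cos(2 * np.pi * T1) * np.sin(2 * np.pi * T2)
V = -2 * np.pi * np.sin(2 * np.pi * T1) * np.cos(2 * np.pi * T2)

plt.streamplot(T1, T2, U, V, density=1.4)
plt.xlabel(r"$\theta_1$"); plt.ylabel(r"$\theta_2$")
plt.title(r"Nash flow of $\Lambda^{0,0}_{1,1}$ on $T^2$")
plt.show()
```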
Putting together these calculations, we have proven the following result.
Proposition 1.
The Nash flow for the Fourier basis function $\Lambda^{\alpha,\beta}_{m_1,m_2}$ has $8m_1m_2$ critical points, for which the dynamics are as follows:
(I) $4m_1m_2$ points are saddle points for the flow, half of them corresponding to the maxima of $\Lambda^{\alpha,\beta}_{m_1,m_2}$ and half of them to the minima.
(II) $4m_1m_2$ points are center points for the flow, surrounded by periodic orbits and corresponding to the saddle points of $\Lambda^{\alpha,\beta}_{m_1,m_2}$.

4.3. Nash Flow for Simplified Truncated Fourier Series

In [3], it is proven that, under some ideal conditions, the Nash flow associated with the cost function of a GAN has stable Nash equilibria. For this reason, according to Proposition 1, these cost functions cannot be single basis functions of the Fourier series. In other words, their Fourier approximation (6) is nontrivial. Hence, in order to capture the actual dynamics of the GAN flow, let us consider a general truncated Fourier series of the following form:
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
for some $\alpha, \beta, \gamma, \delta \in \mathbb{Z}_2$, $-1 \le \mu \le 1$, and Fourier modes $m_1, m_2, n_1, n_2 \ge 1$.
In order to simplify the computations, in this section, we suppose that $m_1 = m_2 = 1$. After this case, the general setting is studied. In this simplified case, at a point $(\theta_1^0, \theta_2^0) = \left(\frac{k_1}{2} + \frac{\alpha}{4},\, \frac{k_2}{2} + \frac{\beta}{4}\right)$ of the form (II), we have
$$\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 2\pi\mu\left((-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0),\; (-1)^{\delta}\, n_2\, \Lambda^{\gamma,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right).$$
At this point, we have the following two options.
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, then $(\theta_1^0, \theta_2^0)$ is also a critical point of $\Theta$. Hence, the dynamic of the Nash flow near $(\theta_1^0, \theta_2^0)$ is determined by the Nash Hessian at that point. This Hessian is given by
    $$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} + \mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)}$$
    Suppose that $(\gamma, \delta) = (\alpha + 1, \beta + 1)$ in $\mathbb{Z}_2 \times \mathbb{Z}_2$. Set $\sigma = (-1)^{n_1 k_1 + n_2 k_2 + \alpha n_1/2 + \beta n_2/2}$. Observe that $\Lambda^{\alpha,\beta}_{n_1,n_2}(\theta_1^0, \theta_2^0) = 0$ and $\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0) = \sigma$, so we have that
    $$\mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \mu\sigma \begin{pmatrix} -n_1^2 & 0 \\ 0 & n_2^2 \end{pmatrix}$$
    With this calculation at hand, we observe the following. By continuity, for $|\mu|$ small, since $N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{1,1}\right)\big|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ also has complex eigenvalues. In particular, they must be conjugate, say $\lambda, \bar{\lambda} \in \mathbb{C}$. In that case, the stability of the critical point at $(\theta_1^0, \theta_2^0)$ is governed by the following trace:
    $$2\,\mathrm{Re}(\lambda) = \lambda + \bar{\lambda} = \mathrm{tr}\, N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \mu\sigma\left(n_2^2 - n_1^2\right).$$
    Hence, if $n_2 < n_1$ and $\mu\sigma > 0$, or $n_2 > n_1$ and $\mu\sigma < 0$ (respectively $n_2 > n_1$ and $\mu\sigma > 0$, or $n_2 < n_1$ and $\mu\sigma < 0$), any critical point nearby $(\theta_1, \theta_2) \in T^2$ is a spiral attractor (respectively repulsor). In the case that $n_1 = n_2$, the eigenvalues are multiples of $i$ and $-i$, so the point is still a center and the behaviour bifurcates depending on further Fourier modes.
    On the other hand, if $\gamma = \alpha$ or $\delta = \beta$ in $\mathbb{Z}_2$, then we have that
    $$\mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)} = \pm 4\pi^2 \mu \begin{pmatrix} 0 & n_1 n_2 \\ -n_1 n_2 & 0 \end{pmatrix}$$
    Therefore, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ is still an anti-diagonal matrix and the dynamics depend on further Fourier modes.
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$, then $(\theta_1^0, \theta_2^0)$ is no longer a critical point of $\Theta$. However, if $|\mu|$ is small, by the implicit function theorem, there must be a unique critical point $(\tilde{\theta}_1, \tilde{\theta}_2) \in T^2$ of $\Theta$ nearby $(\theta_1^0, \theta_2^0)$. Again, by continuity, since $N\mathrm{Hess}\left(\Lambda^{\alpha,\beta}_{1,1}\right)\big|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues, $N\mathrm{Hess}(\Theta)|_{(\tilde{\theta}_1, \tilde{\theta}_2)}$ also has complex eigenvalues, and their real part can be controlled through the trace.
    Explicitly, the Nash Hessian is
    $$N\mathrm{Hess}(\Theta)|_{(\tilde{\theta}_1, \tilde{\theta}_2)} = 4\pi^2 \begin{pmatrix} -n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} & \pm 1 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} \\ \mp 1 \mp \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} & n_2^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} \end{pmatrix}_{(\tilde{\theta}_1, \tilde{\theta}_2)}$$
    Therefore, its trace is given by
    $$4\pi^2 \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\tilde{\theta}_1, \tilde{\theta}_2)\left(n_2^2 - n_1^2\right).$$
    In particular, if $n_1 = n_2$, then the new critical point $(\tilde{\theta}_1, \tilde{\theta}_2)$ is still a center. Otherwise, the behaviour is determined by the sign of $\Lambda^{\gamma,\delta}_{n_1,n_2}(\tilde{\theta}_1, \tilde{\theta}_2)$. This sign can be read from the gradient and the Nash Hessian at $(\theta_1^0, \theta_2^0)$.
    To illustrate this idea, we consider a particular combination of signs. The other cases can be obtained analogously. Suppose that the first component of the gradient satisfies
    $$\frac{\partial \Theta}{\partial \theta_1}(\theta_1^0, \theta_2^0) = 2\pi\mu\, (-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0) > 0.$$
    In addition, suppose that the entries of the first row of the Nash Hessian have signs
    $$\left(N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}\right)_{1,1} = -4\pi^2 n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0) > 0, \qquad \left(N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}\right)_{1,2} = 4\pi^2\left(\pm 1 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right) < 0.$$
    In that case, this means that $(\tilde{\theta}_1, \tilde{\theta}_2)$ has the form $(\tilde{\theta}_1, \tilde{\theta}_2) = (\theta_1^0 - \epsilon_1,\, \theta_2^0 + \epsilon_2)$ for small $\epsilon_1, \epsilon_2 > 0$. Therefore, the sign of (9) is determined by the sign of $\Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0 - \epsilon_1,\, \theta_2^0 + \epsilon_2)$, which is a well-defined quantity that only depends on the particular point $(\theta_1^0, \theta_2^0)$ and $\gamma, \delta \in \mathbb{Z}_2$.

4.4. Nash Flow for General Truncated Fourier Series

In the general case, the calculation is similar but more involved. To alleviate notation, let us consider the auxiliary functions:
$$\sigma_0(\theta) = \begin{cases} 0 & \text{if } \theta = 0 \text{ or } \tfrac{1}{2}, \\ 1 & \text{if } 0 < \theta < \tfrac{1}{2}, \\ -1 & \text{if } \tfrac{1}{2} < \theta < 1, \end{cases} \qquad \sigma_1(\theta) = \begin{cases} 0 & \text{if } \theta = \tfrac{1}{4} \text{ or } \tfrac{3}{4}, \\ 1 & \text{if } 0 \le \theta < \tfrac{1}{4} \text{ or } \tfrac{3}{4} < \theta < 1, \\ -1 & \text{if } \tfrac{1}{4} < \theta < \tfrac{3}{4}. \end{cases}$$
Notice that these maps are just the sign functions of the trigonometric functions, $\sigma_0(\theta) = \mathrm{sign}(\sin(2\pi\theta))$ and $\sigma_1(\theta) = \mathrm{sign}(\cos(2\pi\theta))$, with the customary assumption that the sign function vanishes at zero. If needed, we may extend them to the whole real line by periodicity.
Now, let us consider a truncated Fourier series with arbitrary frequencies m 1 , m 2 , n 1 , n 2 1 of the following form:
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}.$$
Analogously to the previous case, the gradient of Θ at a point
$$(\theta_1^0, \theta_2^0) = \left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right) \in T^2$$
of the form (II) is
$$\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 2\pi\mu\left((-1)^{\gamma}\, n_1\, \Lambda^{\gamma+1,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0),\; (-1)^{\delta}\, n_2\, \Lambda^{\gamma,\delta+1}_{n_1,n_2}(\theta_1^0, \theta_2^0)\right).$$
Therefore, we again find a bifurcation of behaviour depending on whether $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$. If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, the Nash Hessian is given by
$$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = (-1)^{k_1+k_2+\alpha+\beta}\, 4\pi^2 \begin{pmatrix} 0 & m_1 m_2 \\ -m_1 m_2 & 0 \end{pmatrix} + \mu\, N\mathrm{Hess}\left(\Lambda^{\gamma,\delta}_{n_1,n_2}\right)\Big|_{(\theta_1^0, \theta_2^0)}$$
As above, the character of this matrix depends on the combinatorics of $(\alpha, \beta)$ and $(\gamma, \delta)$. Explicitly, we have that
$$N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)} = 4\pi^2 \begin{pmatrix} -n_1^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} & \pm m_1 m_2 \pm \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} \\ \mp m_1 m_2 \mp \mu\, n_1 n_2\, \Lambda^{\gamma+1,\delta+1}_{n_1,n_2} & n_2^2\, \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2} \end{pmatrix}_{(\theta_1^0, \theta_2^0)}$$
When $|\mu|$ is small, $N\mathrm{Hess}(\Theta)|_{(\theta_1^0, \theta_2^0)}$ has complex eigenvalues $\lambda, \bar{\lambda} \in \mathbb{C}$. Since $\lambda + \bar{\lambda} = 2\,\mathrm{Re}(\lambda)$, the dynamics are ruled by the real part $\mathrm{Re}(\lambda)$, which is given by the following trace:
$$4\pi^2 \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2}(\theta_1^0, \theta_2^0)\left(n_2^2 - n_1^2\right).$$
Its negativity (respectively positivity) can be controlled with the trigonometric sign functions as
$$\mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0).$$
Remark 6.
There are many cases in which this trace does not vanish. For instance, if $(\gamma, \delta) = (\alpha + 1, \beta + 1)$ in $\mathbb{Z}_2 \times \mathbb{Z}_2$, in general,
$$\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}\left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right) \neq 0.$$
To be precise, given $n \in \mathbb{N}$, let us denote by $\mathrm{par}(n)$ the unique integer such that $n = 2^{\mathrm{par}(n)} n'$ with $n'$ odd. In that case, we have that $\Lambda^{\alpha+1,\beta+1}_{n_1,n_2}\left(\frac{2k_1+\alpha}{4m_1}, \frac{2k_2+\beta}{4m_2}\right) = 0$ for some $k_1, k_2 \in \mathbb{Z}$ if and only if $\mathrm{par}(m_1) = \mathrm{par}(n_1) + (-1)^{\alpha}$ or $\mathrm{par}(m_2) = \mathrm{par}(n_2) + (-1)^{\beta}$. It would be interesting to study the relation between this behavior and the small divisors phenomena observed in Kolmogorov–Arnold–Moser (KAM) theory [27].
The case with $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$ can be treated similarly, but now we must look at the Nash Hessian not exactly at $(\theta_1^0, \theta_2^0)$ but at a point nearby. Generalizing the argument of Section 4.3, set
$$A = (-1)^{\gamma}\, \mu\, n_1\, \sigma_{\gamma+1}(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2), \qquad B_1 = \mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2),$$
$$B_2 = (-1)^{k_1+k_2+\alpha+\beta}\, m_1 m_2 + (-1)^{\delta+\gamma}\, \mu\, n_1 n_2\, \sigma_{\gamma+1}(\theta_1^0 n_1)\, \sigma_{\delta+1}(\theta_2^0 n_2).$$
Then, the unique critical point ( θ ˜ 1 , θ ˜ 2 ) close to ( θ 1 0 , θ 2 0 ) has the following form:
$$(\tilde{\theta}_1, \tilde{\theta}_2) = \left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1,\; \theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right),$$
for small enough ϵ 1 , ϵ 2 > 0 . Therefore, the dynamic of the critical point ( θ ˜ 1 , θ ˜ 2 ) is determined by
$$\mu\, \sigma_\gamma\left(\left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1\right) n_1\right)\, \sigma_\delta\left(\left(\theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right) n_2\right)\left(n_2^2 - n_1^2\right).$$
This quantity controls the sign of the trace of the Nash Hessian, in analogy with the analysis of Section 4.3. Therefore, if this last quantity is negative, then $(\tilde{\theta}_1, \tilde{\theta}_2)$ is a spiral attractor, and, if it is positive, the point becomes a repulsor.
To illustrate the different bifurcation phenomena explained in this section, in Figure 2, we show the Nash flow of some truncated series of low frequencies. Finally, summarizing this discussion, we obtain the following result.
Theorem 1.
For μ small enough, the truncated Fourier series
$$\Theta = \Lambda^{\alpha,\beta}_{m_1,m_2} + \mu\, \Lambda^{\gamma,\delta}_{n_1,n_2},$$
has an attracting (respectively repulsive) spiral critical point at each of the points of the form (II),
$$\left(\theta_1^0, \theta_2^0\right) = \left(\frac{2k_1 + \alpha}{4m_1},\; \frac{2k_2 + \beta}{4m_2}\right),$$
for $k_1, k_2 \in \mathbb{Z}$, provided the following:
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} = 0$, it must hold that
    $$\mu\, \sigma_\gamma(\theta_1^0 n_1)\, \sigma_\delta(\theta_2^0 n_2)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0).$$
  • If $\nabla\Theta|_{(\theta_1^0, \theta_2^0)} \neq 0$, it must hold that
    $$\mu\, \sigma_\gamma\left(\left(\theta_1^0 + \mathrm{sign}(A B_1)\, \epsilon_1\right) n_1\right)\, \sigma_\delta\left(\left(\theta_2^0 + \mathrm{sign}(A B_2)\, \epsilon_2\right) n_2\right)\left(n_2^2 - n_1^2\right) < 0 \quad (\text{respectively} > 0)$$
    for $\epsilon_1, \epsilon_2 > 0$ small enough.
Remark 7.
Even though half of the critical points near the points of the form (II) are attractors for the Nash flow of $\Theta$, the dynamic is a small perturbation of a center. In this manner, the convergence is slow, spiraling strongly towards the Nash equilibrium. This theoretically justifies the slow and badly conditioned convergence observed in GAN networks.

5. Empirical Analysis

In this section, we show empirically how these Fourier approximations can be useful for understanding the convergence in the training of GANs. For this purpose, in this section, we consider a simple model for a 2-parametric torus GAN (i.e., with d D = d G = 1 ) and we analyze its convergence by means of its truncated Fourier series.
In the notation of Section 3, we take $d = 1$ (1-dimensional real data) and the parameter spaces are $\Theta_D = \Theta_G = S^1$. The latent space is $\Lambda = [0,1] \subseteq \mathbb{R}$ with the uniform probability (standard Lebesgue measure). Fix a periodic function $\chi: S^1 \to \mathbb{R}$. Choose a 1-parametric continuous distribution $\mathcal{D}_\xi$ depending on the parameter $\xi \in \mathbb{R}$, with cumulative distribution function $F_\xi$ and probability density function $f_\xi$. Fix $\omega \in S^1$; the real data X is sampled according to the distribution $X \sim \mathcal{D}_{\chi(\omega)}$.
As the discriminator function, for $\theta_1 \in S^1$, we consider the function $D_{\theta_1}: \mathbb{R} \to \mathbb{R}$ given by
$$D_{\theta_1}(x) = \frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_{\chi(\theta_1)}(x)}.$$
On the other hand, for $\theta_2 \in S^1$, the generator is the function $G_{\theta_2}: \Lambda = [0,1] \to \mathbb{R}$ given by
$$G_{\theta_2}(\lambda) = F^{-1}_{\chi(\theta_2)}(\lambda),$$
where $F^{-1}_{\chi(\theta_2)}$ is the quantile function of $\mathcal{D}_{\chi(\theta_2)}$.
With these choices of generator and discriminator, and taking as weight function $f(t) = -\log(1 + \exp(-t))$, as in [1], the cost functional (1) is reduced to
$$\begin{aligned} \mathcal{F}(\theta_1, \theta_2) &= \mathbb{E}_{\Omega}\left[\log D_{\theta_1}(X)\right] + \mathbb{E}_{\Lambda}\left[\log\left(1 - D_{\theta_1}(G_{\theta_2})\right)\right] \\ &= \int_{\mathbb{R}} \log\left(\frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_{\chi(\theta_1)}(x)}\right) f_{\chi(\omega)}(x)\, dx + \int_0^1 \log\left(1 - \frac{f_{\chi(\omega)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right)}{f_{\chi(\omega)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right) + f_{\chi(\theta_1)}\left(F^{-1}_{\chi(\theta_2)}(\lambda)\right)}\right) d\lambda. \end{aligned}$$
Remark 8.
These choices of shapes for the discriminator and generator functions are justified by [1] (Proposition 1). There, it is proven that, for a fixed generator G with transformed probability density function $f_G$, the optimal discriminator $D_{\theta_1^0}$ is given by
$$D_{\theta_1^0}(x) = \frac{f_{\chi(\omega)}(x)}{f_{\chi(\omega)}(x) + f_G(x)}.$$
On the other hand, recall that, if $\Lambda = [0,1]$ with the uniform probability, then $F^{-1}_\xi: \Lambda = [0,1] \to \mathbb{R}$ is a random variable with distribution $\mathcal{D}_\xi$. Thus, in our case, $G_{\theta_2}$ is a random variable with distribution $\mathcal{D}_{\chi(\theta_2)}$ and, therefore, transformed density $f_{\chi(\theta_2)}$.
In this vein, the goal of the generator G given by (12) is to adjust $\theta_2$ to reach the value $\theta_2 = \omega$, for which G generates exactly the real data. On the other side, for a fixed parameter $\theta_2$ of G, the discriminator D given by (11) aims to tune $\theta_1$ to the value $\theta_1 = \theta_2$, for which D is the perfect discriminator (14).
For the purposes of these experiments, we fix the underlying distribution $\mathcal{D}_\xi$ to be the exponential distribution with mean $1/\xi$, and we take $\chi(\theta) = \sin^2(\pi\theta) + 1$. Recall that, in this situation, $f_\xi(x) = \xi e^{-\xi x}$ and $F_\xi(x) = 1 - e^{-\xi x}$. In this way, the discriminator function (11) and the generator (12) are given by
$$D_{\theta_1}(x) = \frac{e^{x\sin^2(\pi\theta_1)}}{\dfrac{\sin^2(\pi\theta_1)+1}{\sin^2(\pi\omega)+1}\, e^{x\sin^2(\pi\omega)} + e^{x\sin^2(\pi\theta_1)}}, \qquad G_{\theta_2}(\lambda) = \frac{1}{\sin^2(\pi\theta_2)+1}\log\!\left((1-\lambda)^{-1}\right).$$
Moreover, from now on, we fix $\omega = 1/4$, so that $\chi(\omega) = 3/2$. The resulting probability density and cumulative distribution functions of the real data are plotted in Figure 3.
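The following sketch implements this toy model with the choices fixed above; `cost_F` evaluates the functional (13) by numerical quadrature (here, scipy's Simpson rule over a truncated real line). The code and all names in it are ours, for illustration only.

```python
import numpy as np
from scipy.integrate import simpson

omega = 0.25
chi = lambda t: np.sin(np.pi * t) ** 2 + 1.0          # parameter curve on the circle
f = lambda xi, x: xi * np.exp(-xi * x)                # exponential pdf with rate xi
quantile = lambda xi, lam: -np.log(1.0 - lam) / xi    # F_xi^{-1}(lambda)

def D(theta1, x):
    """Discriminator (11): f_chi(omega) / (f_chi(omega) + f_chi(theta1))."""
    a, b = f(chi(omega), x), f(chi(theta1), x)
    return a / (a + b)

def G(theta2, lam):
    """Generator (12): quantile function of the distribution with rate chi(theta2)."""
    return quantile(chi(theta2), lam)

def cost_F(theta1, theta2, n=2001):
    """Cost functional (13), by Simpson quadrature."""
    x = np.linspace(1e-6, 15.0, n)          # the pdf decays fast: truncate the line
    i1 = simpson(np.log(D(theta1, x)) * f(chi(omega), x), x=x)
    lam = np.linspace(1e-6, 1.0 - 1e-6, n)  # latent space [0, 1]
    i2 = simpson(np.log(1.0 - D(theta1, G(theta2, lam))), x=lam)
    return i1 + i2

print(cost_F(0.25, 0.25))   # at a Nash equilibrium D = 1/2, so this is ~ -2 log 2
```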
With this choice of real distribution, the generator function as well as the transformed probability density function are plotted in Figure 4 and the discriminator function is shown in Figure 5.
In addition, in Figure 6, we show graphically the cost function $\mathcal{F}(\theta_1, \theta_2)$ of (13) on $T^2$. The numerical approximation of the integrals in (13) was carried out with the Simpson rule. The function was sampled at 225 knot points and subsequently interpolated by means of a multiquadric radial basis interpolation. Observe that one of the Nash equilibria of $\mathcal{F}$ is at $(\theta_1, \theta_2) = (1/4, 1/4)$ (bottom corner of the plot). Moreover, by the symmetries of $\chi$, the plot suggests that $(\theta_1, \theta_2) = (1/4, 3/4), (3/4, 1/4), (3/4, 3/4)$ are also Nash equilibria.
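Continuing the previous sketch, the surface of Figure 6 can be reproduced (under the same assumptions) by sampling `cost_F` on a $15 \times 15$ grid, i.e., 225 knots, and interpolating with a multiquadric radial basis function:

```python
import numpy as np
from scipy.interpolate import Rbf   # cost_F is the quadrature sketch above

grid = np.linspace(0.0, 1.0, 15)
T1, T2 = np.meshgrid(grid, grid)
Z = np.vectorize(cost_F)(T1, T2)                     # 225 evaluations of (13)
F_interp = Rbf(T1.ravel(), T2.ravel(), Z.ravel(), function="multiquadric")

dense = np.linspace(0.0, 1.0, 200)                   # dense grid for plotting
D1, D2 = np.meshgrid(dense, dense)
F_dense = F_interp(D1, D2)                           # interpolated landscape
```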
In Figure 7, we show the Nash flow associated with the cost function $\mathcal{F}: T^2 \to \mathbb{R}$. As can be checked in the image, the flow confirms that there exist four Nash equilibrium points, corresponding to $(\theta_1^0, \theta_2^0) = (1/4, 1/4), (1/4, 3/4), (3/4, 1/4)$, and $(3/4, 3/4)$, all of them attractors for the Nash flow. Another four critical points of $\mathcal{F}$ can be observed in the figure: the points $(0,0)$ and $(1/2, 1/2)$ correspond to the two maxima of $\mathcal{F}$, and the points $(0, 1/2)$ and $(1/2, 0)$ correspond to the two minima. Observe that these critical points are saddle points for the flow, with an attractive direction and a repulsive direction. Finally, notice that (4) is satisfied, since the maxima and minima have even indices (2 and 0, respectively) and the Nash equilibria have odd indices.
Now, let us decompose $\mathcal{F}$ according to its Fourier series. In Table 1, we show the modes with the largest absolute Fourier coefficients. These coefficients have been computed using the formulae of Section 4, applying rectangular quadrature as the numerical integration method and scanning the modes with $1 \leq m_1, m_2 \leq 10$.
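A minimal sketch of this computation (our code, assuming the $L^2$ normalization of the basis, i.e., a factor of 4 for $m_1, m_2 \geq 1$, and the convention $\tau_0 = \sin$, $\tau_1 = \cos$ consistent with the captions of Figure 1):

```python
import numpy as np

def fourier_coefficient(F_vals, m1, m2, alpha, beta):
    """Rectangular-quadrature estimate of a_{m1,m2}^{alpha,beta}.
    F_vals[i, j] = F(t_i, t_j) on an N x N midpoint grid of [0, 1)^2."""
    N = F_vals.shape[0]
    t = (np.arange(N) + 0.5) / N
    tau = lambda e, m, x: np.cos(2*np.pi*m*x) if e == 1 else np.sin(2*np.pi*m*x)
    basis = np.outer(tau(alpha, m1, t), tau(beta, m2, t))
    return 4.0 * np.mean(F_vals * basis)   # rectangle rule for the double integral

# scanning the range used for Table 1: 1 <= m1, m2 <= 10, cosine-cosine modes
# coeffs = {(m1, m2): fourier_coefficient(F_vals, m1, m2, 1, 1)
#           for m1 in range(1, 11) for m2 in range(1, 11)}
```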
From these results, we observe that the predominant Fourier modes of $\mathcal{F}$ are cosine basis functions, $\Lambda_{m_1, m_2}^{1,1}(\theta_1, \theta_2) = \cos(2\pi m_1 \theta_1)\cos(2\pi m_2 \theta_2)$. The largest coefficient corresponds to the mode $(m_1, m_2) = (1,1)$. Observe that this is not surprising: $(m_1, m_2) = (1,1)$ is the unique mode with four critical points of type (II), which correspond to the four Nash equilibria of Figure 7 (in other words, the four saddle points of Figure 6).
For $s \geq 0$, let us order the first $s$ Fourier modes decreasingly according to the absolute value of their coefficients, $(m_1^0, m_2^0) = (1,1), (m_1^1, m_2^1), \ldots, (m_1^s, m_2^s)$. Denote by $b_{m_1^i, m_2^i}^{1,1} = a_{m_1^i, m_2^i}^{1,1} / a_{m_1^0, m_2^0}^{1,1}$ the ratio of the Fourier coefficients. We can approximate the Nash flow of the cost function $\mathcal{F}$ by the truncated Fourier series:
$$\Theta_s(\theta_1, \theta_2) = \Lambda_{m_1^0, m_2^0}^{1,1}(\theta_1, \theta_2) + \sum_{i=1}^{s} b_{m_1^i, m_2^i}^{1,1}\, \Lambda_{m_1^i, m_2^i}^{1,1}(\theta_1, \theta_2).$$
The associated Nash flow is depicted in Figure 8. As can be checked there, the critical points near the points of type (II) are (approximately) centers for $s \leq 3$. The reason for this behavior is twofold. In the following, let $(\theta_1^0, \theta_2^0) = (1/4, 1/4), (1/4, 3/4), (3/4, 1/4)$, or $(3/4, 3/4)$.
  • For $s \leq 2$, we have that $\nabla\Theta_s|_{(\theta_1^0, \theta_2^0)} = 0$ since, in the gradient, there is always a term with a factor $\cos(2\pi\theta)$ that vanishes at these points. Hence, the critical point of $\Theta_s$ is exactly at $(\theta_1^0, \theta_2^0)$. Nevertheless, since all the terms $\Lambda_{m_1, m_2}^{\alpha, \beta}$ appearing in the Fourier series have equal $(\alpha, \beta) = (1,1)$, as mentioned in Section 4.3, we still have that the Nash Hessian has the form in (8) with vanishing diagonal entries. Hence, the critical point $(\theta_1^0, \theta_2^0)$ is still a center.
  • For $s = 3$, we find that $\nabla\Theta_3|_{(\theta_1^0, \theta_2^0)} \neq 0$, so a new critical point $(\tilde\theta_1, \tilde\theta_2)$ appears near $(\theta_1^0, \theta_2^0)$. Nevertheless, for this new mode, we have that $m_1^3 = m_2^3 = 2$, so Equation (10) still vanishes, proving that the new critical point is still a center.
Finally, let us consider the case $s = 4$. In this situation, we also have $\nabla\Theta_4|_{(\theta_1^0, \theta_2^0)} \neq 0$, so a new critical point $(\tilde\theta_1, \tilde\theta_2)$ appears near $(\theta_1^0, \theta_2^0)$. The dynamic around it is governed by Equation (10). To evaluate it, we calculate the signs of the quantities $A$, $B_1$, and $B_2$ of Section 4.4, and we get
$$A > 0, \qquad B_1 < 0, \qquad B_2 < 0.$$
Hence, the new critical point has the form $(\tilde\theta_1, \tilde\theta_2) = (\theta_1^0 - \epsilon_1, \theta_2^0 - \epsilon_2)$ for small $\epsilon_1, \epsilon_2 > 0$. For these values, we have that
$$\sigma_1\big(2(\theta_1^0 - \epsilon_1)\big) = 1, \qquad \sigma_\delta\big(3(\theta_2^0 - \epsilon_2)\big) = 1.$$
Therefore, checking Equation (10), we get
$$\mu\, \sigma_1\big(n_1(\theta_1^0 - \epsilon_1)\big)\, \sigma_\delta\big(n_2(\theta_2^0 - \epsilon_2)\big) \cdot \big(n_2^2 - n_1^2\big) = -0.003 \cdot (1) \cdot (1) \cdot (3^2 - 2^2) < 0.$$
Therefore, for $s = 4$, the trend changes and the centers turn into spiral attractors. This is the attractive behavior observed in Figure 8e. Notice that this dynamic agrees with the real one observed in Figure 7, which empirically confirms the validity of our approach.
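This case analysis can be cross-checked numerically. The sketch below rebuilds $\Theta_s$ from the mode ratios of Table 1, locates the critical point of its Nash flow near $(1/4, 1/4)$, and inspects the eigenvalues of the linearization; the ascent/descent sign convention is again our assumption. The real parts should be negligible against the rotation frequency for $s \leq 3$ (center-like dynamics) and clearly nonzero for $s = 4$ (a spiral attractor).

```python
import numpy as np
from scipy.optimize import root

# (m1, m2, ratio) of the five leading cosine-cosine modes of Table 1
modes = [(1, 1, 1.0), (1, 2, 0.18), (2, 1, -0.0822), (2, 2, -0.0660), (2, 3, -0.0532)]

def vector_field(y, s):
    """Nash flow (ascent in theta1, descent in theta2) of Theta_s."""
    t1, t2 = y
    d1 = sum(-b*2*np.pi*m1*np.sin(2*np.pi*m1*t1)*np.cos(2*np.pi*m2*t2)
             for m1, m2, b in modes[:s+1])
    d2 = sum(-b*2*np.pi*m2*np.cos(2*np.pi*m1*t1)*np.sin(2*np.pi*m2*t2)
             for m1, m2, b in modes[:s+1])
    return np.array([d1, -d2])

def jacobian(y, s, h=1e-7):
    """Forward-difference Jacobian of the Nash vector field."""
    f0 = vector_field(y, s)
    cols = [(vector_field(y + h*np.eye(2)[j], s) - f0) / h for j in range(2)]
    return np.column_stack(cols)

for s in (2, 3, 4):
    crit = root(vector_field, x0=[0.25, 0.25], args=(s,)).x
    eig = np.linalg.eigvals(jacobian(crit, s))
    print(f"s = {s}: critical point near {np.round(crit, 3)}, eigenvalues {np.round(eig, 3)}")
```

Note that the overall scale of the ratios does not affect the sign of the real parts, so working with the ratios $b$ instead of the raw coefficients $a$ leaves the attractor/repulsor classification unchanged.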

6. Methodology for Practical Applications

The discussion of Section 4 and Section 5 opens the door to practical applications of the analysis techniques introduced in this paper for studying the convergence of real-world GANs. Observe that, in general, the knowledge of the underlying cost function $\mathcal{F}$ (cf. Equation (1)) of a GAN is very limited. Indeed, several metrics have been proposed in the literature to monitor the evolution of the training of a GAN. These metrics provide a way to measure the convergence of the GAN indirectly but definitely skip a thorough analysis of the cost function. Nevertheless, using the techniques introduced in this paper, we show that it is possible to methodically analyze the dynamics of the Nash flow of the GAN problem through partial sums of the Fourier series of the cost function. It is remarkable that this valuable information about the behaviour of the training process cannot be read off directly from $\mathcal{F}$ itself.
In this section, we aim to organize the previous analysis into a precise methodology that can be applied in practice. As will become clear, this process was already implicit in the reasoning of Section 5. The proposed analysis comprises the following steps (a schematic skeleton of the whole loop is sketched after the list):
  1. Evaluate the cost function $\mathcal{F}(\theta_D, \theta_G)$ on a uniform grid of the parameters $(\theta_D, \theta_G)$ (the weights of the two neural networks forming the GAN in the deep learning framework). Observe that, for these evaluations, it is not necessary to train the GAN: the sampling process amounts to fixing the weights of the networks and computing the mean prediction error of the discriminator against real and synthetic instances. No optimization of the weights needs to be carried out.
  2. Compute the Discrete Fourier Transform (DFT) of $\mathcal{F}$ from the obtained samples. This can be done efficiently through the Fast Fourier Transform (FFT) algorithm.
  3. Use the results of the DFT to estimate the Fourier modes and coefficients of $\mathcal{F}$. Sort the modes decreasingly according to the absolute value of their associated Fourier coefficients.
  4. Consider a truncation level $s \geq 0$ (starting with $s = 0$). Compute the critical points of $\Theta_s$, the truncated Fourier series of $\mathcal{F}$ with $s$ terms. Using the techniques developed in Section 4 (see also Section 5), analyze the local dynamics of the Nash flow around the critical points of $\Theta_s$.
  5. While some critical point of $\Theta_s$ is a center, increase the truncation level by one and repeat steps 4 and 5, until a truncation level $s_0$ is reached such that all the critical points of $\Theta_{s_0}$ are either attractors or repulsors.
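The following skeleton organizes these steps as code. Every helper here (`sample_cost_grid`, `estimate_coefficients`, `critical_points`, `nash_eigenvalues`) is a hypothetical placeholder for the corresponding step above, not an API from this work:

```python
def analyze_gan(sample_cost_grid, estimate_coefficients,
                critical_points, nash_eigenvalues, max_terms=20, tol=1e-3):
    F_samples = sample_cost_grid()                      # step 1: no training needed
    coeffs = estimate_coefficients(F_samples)           # step 2: DFT/FFT of the samples
    ranked = sorted(coeffs.items(), key=lambda kv: -abs(kv[1]))   # step 3
    for s in range(max_terms):                          # steps 4 and 5
        theta_s = ranked[: s + 1]                       # modes of the truncation Theta_s
        spectra = [nash_eigenvalues(theta_s, cp) for cp in critical_points(theta_s)]
        if all(max(abs(e.real) for e in eig) > tol for eig in spectra):
            return s, theta_s                           # s_0 reached: no center survives
    raise RuntimeError("no truncation level with only attractors/repulsors found")
```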
After this process, we obtain a truncation level $s_0$ such that the local dynamics of $\Theta_{s_0}$ around its critical points are conjugated to the local dynamics of $\mathcal{F}$ around its Nash equilibria. This information can be exploited to analyze the training process of the GAN. For instance, if the convergence to the critical point is very slow, in the sense that the trace of the Nash Hessian is close to zero, then a hard convergence of the training process should be expected. This leads to remarkable instabilities during the learning process that may prevent the system from converging under a raw gradient descent optimization procedure. In that case, the obtained results strongly suggest that heuristics for stabilizing the training process should be implemented. Additionally, since the equilibria are spiral attractors, if the learning rate of the gradient descent method is not small enough, the discrete-time approximation may not converge. In that case, the information about the convergence rate in the simplified Fourier model can be used to properly anneal the learning rate, leading to a much more stable convergence.
Despite its utility, the proposed methodology suffers from several issues that must be addressed in future work to obtain an efficient analysis procedure. The first is an obvious bottleneck: the sampling process of the cost function over the parameters $(\theta_D, \theta_G)$ may require a huge number of samples due to the curse of dimensionality. Nevertheless, it is important to mention that a very dense grid is not necessary, since we want to understand the Fourier modes of the cost function $\mathcal{F}$, not to obtain a detailed picture of the landscape of $\mathcal{F}$. This largely alleviates the sampling burden and makes the process feasible.
Another possible solution is to sample not on the whole $(\theta_D, \theta_G)$-space but on a lower-dimensional subspace concentrating the flow. For that purpose, the GAN can be trained and, after some epochs, the flow will have entered a certain “convergence subspace” that encloses the long-time evolution of the flow. This subspace can be estimated by several methods, for instance, by considering the subspace generated by the last $k \geq 1$ gradient vectors obtained in the training process. In that case, instead of working in the high-dimensional $(\theta_D, \theta_G)$-space, we can restrict our analysis to the $k$-dimensional affine space generated by these vectors, a much smaller subspace on which the sampling process can be carried out. Nevertheless, proposing other efficient sampling methods that enable accurate approximations of the Fourier series of $\mathcal{F}$ is an interesting topic for future work.
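A sketch of this reduction, under our assumptions: an orthonormal basis of the subspace spanned by the last $k$ gradients is obtained via a reduced QR factorization, and the cost is then sampled in the reduced coordinates (`evaluate_cost` is a hypothetical sampler of $\mathcal{F}$):

```python
import numpy as np

def convergence_subspace(gradients):
    """Orthonormal basis of the span of the last k gradient vectors."""
    G = np.stack(gradients, axis=1)   # (dim, k) matrix, one gradient per column
    Q, _ = np.linalg.qr(G)            # reduced QR: Q has k orthonormal columns
    return Q

def lift(theta_now, Q, coords):
    """Map k-dimensional coordinates back to the full (theta_D, theta_G)-space."""
    return theta_now + Q @ coords

# sampling F on a 2D slice of the convergence subspace, e.g.:
# F_samples[i, j] = evaluate_cost(lift(theta_now, Q, coords_grid[i, j]))
```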
Another important remark is that estimating the Fourier series through the FFT is much more efficient than the quadrature methods used in Section 5. However, it may also lead to poorer estimations of the Fourier coefficients. This inaccuracy may produce errors when choosing the leading Fourier modes if their importances (the absolute values of their Fourier coefficients) are similar. To avoid these problems, all the possible permutations of such similar modes (say, modes whose coefficients differ by less than a fixed threshold) should be considered during the analysis of the Nash flow of the Fourier series.
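To make the FFT route concrete: for a real signal sampled on an $N \times N$ uniform grid, the coefficient of $\cos(2\pi m_1\theta_1)\cos(2\pi m_2\theta_2)$ combines the DFT bins $(m_1, m_2)$ and $(m_1, -m_2)$; the sine-type coefficients follow analogously from the imaginary parts. A minimal sketch under our conventions (it assumes $N$ larger than twice the maximal mode):

```python
import numpy as np

def cos_cos_coefficients(F_samples, max_mode=10):
    """Coefficients of cos(2 pi m1 t1) cos(2 pi m2 t2) from a real N x N grid,
    with F_samples[i, j] = F(i/N, j/N)."""
    N = F_samples.shape[0]
    X = np.fft.fft2(F_samples) / N**2          # normalized 2D DFT
    return {(m1, m2): 2.0 * (X[m1, m2].real + X[m1, -m2].real)
            for m1 in range(1, max_mode + 1) for m2 in range(1, max_mode + 1)}

# sanity check: a pure mode is recovered exactly
t = np.arange(64) / 64
T1, T2 = np.meshgrid(t, t, indexing="ij")
F_test = 0.7 * np.cos(2*np.pi*1*T1) * np.cos(2*np.pi*2*T2)
print(cos_cos_coefficients(F_test)[(1, 2)])    # ~ 0.7
```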

7. Conclusions

In this paper, we studied a novel approach for analyzing in depth the convergence of GANs on tori. This is an outstanding open problem in machine learning and deep learning that prevents GANs from being suitable for use in arbitrary domains, such as feature generation outside the world of image processing.
Specifically, we proposed to decompose the cost function of a GAN into its Fourier modes and to study the dynamics around the Nash equilibria through its truncated Fourier approximation. For that purpose, we performed a thorough analysis of the dynamics of trigonometric series with one and two terms. Roughly speaking, this analysis showed that, if we truncate the Fourier series at its first mode, all the critical points are centers surrounded by periodic orbits. When we add subtler Fourier modes to the approximation, this dynamic may be preserved or may bifurcate to give rise to spiral attractors or repulsors. This dynamic is essentially determined by the trace of the Nash Hessian of the cost function. Hence, following this idea, we exhibited explicit bifurcation conditions for the Nash flow of the truncated Fourier approximations. These conditions have an involved shape, taking into account the monotonicity of the trigonometric functions in a neighborhood of the critical point, but eventually they are very explicit and can be easily checked. As a by-product of this analysis, we observed that, even though the Nash equilibria are stable points, as proven in [4], the dynamic of the training process is close to a center, and the convergence is slow and spiraling.
To test this idea, we conducted an experimental analysis with a torus GAN toy model. Through this example, we observed that the number and distribution of the critical points are determined by the first Fourier mode. Nevertheless, it was necessary to reach the fourth Fourier term to discover the attractive dynamics predicted in the GAN literature. Comparing the approximated flow with the real flow, we observed that the approximation is able to replicate not only the local but also the global dynamics of the real GAN.
We expect that this work will be useful for quantifying the complexity and convergence properties of GANs. To show how this theoretical analysis can be put into practice, in Section 6 we proposed a methodology that enables a characterization of the training dynamics of real-world GANs by means of the techniques developed in this work. From the obtained information about the convergence of the learning process of the networks, several improvements for stabilizing the training can be implemented, such as a progressive reduction of the learning rate to adapt to the geometry of the spiral flow.
It is worth mentioning that the results presented in this paper apply not only to torus toy models but also to more realistic networks. It may seem at first sight that standard GANs do not fulfil the periodicity requirement to be defined on a torus. However, in many cases, the outputs of the generator and discriminator networks are clipped for large enough inputs. This fix is crucial to maintain several required analytic properties, such as the Lipschitz condition for Wasserstein GANs [14]. After this clipping, the GAN actually turns into a torus GAN, since the generator and discriminator functions are periodic (with a large period). In this manner, most of the regular GANs used in image generation and feature generation fit in the framework introduced in this paper. This is crucial, since dynamics on a closed manifold are deeply related to the underlying topology, for instance, through the Poincaré–Hopf theorem or deeper Morse-like results.
Nevertheless, much work must be done before this project can be turned into reality. First, in order to compute the Fourier series of the cost function, we had to sample the cost function of the GAN on a dense mesh of weights. Using this sampling, we were able to estimate the Fourier coefficients through standard quadrature techniques, such as the Simpson rule. In shallow networks with few neurons, a similar approach can be applied, but for deeper networks, this dense sampling is unfeasible. For this reason, better methods for estimating the Fourier coefficients of the cost function are needed, maybe by exploiting the analytic and harmonic properties of the trigonometric functions. In addition, to illustrate the method, in this paper we carried out all the calculations on a 2-dimensional torus. The computation in higher-dimensional tori may follow similar lines, but a thorough analysis of the bifurcation conditions in the higher-dimensional setting is definitely not obvious.
Summarizing, in this paper we introduced a novel method for understanding the dynamics of GANs through harmonic analysis. We showed that, despite the Nash equilibria of the GAN being stable, the convergence is a perturbation of a center and, thus, slow and intricate. The method allowed us to identify a simplified model of the dynamics that may be useful for tuning several hyperparameters of the GAN, such as the learning rate or the number of epochs to be trained. We expect that this work will open the door to new methods for studying the dynamics of GANs using harmonic analysis and transcendental methods.

Author Contributions

Conceptualization, Á.G.-P.; methodology, Á.G.-P. and A.M.; software, E.T. and S.G.-C.; validation, E.T. and S.G.-C.; formal analysis, Á.G.-P.; writing—original draft preparation, Á.G.-P.; writing—review and editing, A.M., E.T., and S.G.-C.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the European Union’s Horizon 2020 Research and Innovation Programme under grant 833685 (SPIDER).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 2672–2680.
  2. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196.
  3. Nagarajan, V.; Kolter, J.Z. Gradient descent GAN optimization is locally stable. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5585–5595.
  4. Mescheder, L.M.; Geiger, A.; Nowozin, S. Which Training Methods for GANs do actually Converge? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10–15 July 2018; pp. 3478–3487.
  5. Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv 2016, arXiv:1701.00160.
  6. Kusner, M.J.; Hernández-Lobato, J.M. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv 2016, arXiv:1611.04051.
  7. Diesendruck, M.; Elenberg, E.R.; Sen, R.; Cole, G.W.; Shakkottai, S.; Williamson, S.A. Importance weighted generative networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin, Germany, 2019; pp. 249–265.
  8. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340.
  9. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
  10. Arora, S.; Ge, R.; Liang, Y.; Ma, T.; Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). arXiv 2017, arXiv:1703.00573.
  11. Arora, S.; Risteski, A.; Zhang, Y. Do GANs learn the distribution? Some theory and empirics. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  12. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
  13. Roth, K.; Lucchi, A.; Nowozin, S.; Hofmann, T. Stabilizing training of generative adversarial networks through regularization. arXiv 2017, arXiv:1705.09367v2.
  14. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
  15. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv 2016, arXiv:1606.00709.
  16. Wang, C.; Xu, C.; Yao, X.; Tao, D. Evolutionary generative adversarial networks. IEEE Trans. Evol. Comput. 2019, 23, 921–934.
  17. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arXiv 2017, arXiv:1706.08500.
  18. Snell, J.; Ridgeway, K.; Liao, R.; Roads, B.D.; Mozer, M.C.; Zemel, R.S. Learning to generate images with perceptual similarity metrics. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4277–4281.
  19. Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65.
  20. Milnor, J. Lectures on the H-Cobordism Theorem; Princeton University Press: Princeton, NJ, USA, 2015; Volume 2258.
  21. Atiyah, M.F.; Bott, R. The Yang–Mills equations over Riemann surfaces. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Sci. 1983, 308, 523–615.
  22. Rudin, W. Real and Complex Analysis; Tata McGraw-Hill Education: New York, NY, USA, 2006.
  23. Du Bois-Reymond, P. Ueber die Fourierschen Reihen. Nachrichten von der Königl. Gesellschaft der Wissenschaften und der Georg-Augusts-Universität zu Göttingen 1873, 1873, 571–584.
  24. Kolmogorov, A. Une série de Fourier–Lebesgue divergente partout. CR Acad. Sci. Paris 1926, 183, 1327–1328.
  25. Zygmund, A. Trigonometric Series; Cambridge University Press: Cambridge, UK, 2002; Volume 1.
  26. Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296.
  27. Arnol'd, V.I. Mathematical Methods of Classical Mechanics; Springer Science & Business Media: Berlin, Germany, 2013; Volume 60.
Figure 1. Nash flow dynamics of Fourier basis functions: (a) $\Lambda_{1,1}^{0,1} = \sin(2\pi\theta_1)\cos(2\pi\theta_2)$, (b) $\Lambda_{1,2}^{0,0} = \sin(2\pi\theta_1)\sin(4\pi\theta_2)$, and (c) $\Lambda_{2,3}^{1,1} = \cos(4\pi\theta_1)\cos(6\pi\theta_2)$.
Figure 2. Nash flow dynamics of truncated Fourier series: cases (a–d) show breaking of the periodic orbits into spiral flow, and cases (e,f) preserve the periodic orbits. (a) $\Theta = \Lambda_{1,1}^{0,0} + 0.03\,\Lambda_{3,5}^{1,1}$. (b) $\Theta = \Lambda_{1,1}^{0,1} + 0.02\,\Lambda_{3,5}^{1,0}$. (c) $\Theta = \Lambda_{1,2}^{0,0} + 0.1\,\Lambda_{2,3}^{1,1}$. (d) $\Theta = \Lambda_{2,2}^{0,0} + 0.1\,\Lambda_{3,5}^{1,1}$. (e) $\Theta = \Lambda_{2,2}^{0,0} + 0.02\,\Lambda_{4,4}^{1,1}$. (f) $\Theta = \Lambda_{1,2}^{0,0} + 0.1\,\Lambda_{3,5}^{0,0}$.
Figure 3. Distribution of the real data: (a) probability density function and (b) cumulative distribution function.
Figure 4. Generator functions for $0 \leq \theta_2 \leq 1/2$: the warmer the plot, the bigger the value of $\theta_2$. The dashed line corresponds to the real data. (a) Output of the function. (b) Transformed probability density function.
Figure 5. Discriminator functions for $0 \leq \theta_1 \leq 1/2$: the warmer the plot, the larger the value of $\theta_1$. For a fixed generator parameter $\theta_2$, the optimal value for $\theta_1$ corresponds to the line with $\theta_1 = \theta_2$.
Figure 6. Graphical representation of the landscape of the cost function $\mathcal{F}(\theta_1, \theta_2): T^2 \to \mathbb{R}$. (a) Plot of the function $\mathcal{F}(\theta_1, \theta_2)$. The four saddle points lie near each of the four corners of the frame. (b) Contour plot of $\mathcal{F}(\theta_1, \theta_2)$.
Figure 7. Dynamics of the Nash flow for the torus GAN: four attractive Nash equilibria can be observed.
Figure 8. Nash flow dynamics of truncated Fourier series approximations for the cost function of the torus GAN: (a) approximation $\Theta_0$, (b) approximation $\Theta_1$, (c) approximation $\Theta_2$, (d) approximation $\Theta_3$, and (e) approximation $\Theta_4$.
Table 1. Fourier modes of the cost function for the torus GAN. The ten modes with the largest absolute value of their associated coefficient are shown. The last column shows the ratio between each Fourier coefficient and the largest coefficient.
m1   m2   α   β   a_{m1,m2}^{α,β}   Ratio
1     1   1   1    0.06127           1.0000
1     2   1   1    0.01102           0.1800
2     1   1   1   −0.00503          −0.0822
2     2   1   1   −0.00404          −0.0660
2     3   1   1   −0.00325          −0.0532
2     4   1   1   −0.00308          −0.0504
2     5   1   1   −0.00305          −0.0499
2     7   1   1   −0.00304          −0.0497
2     9   1   1   −0.00304          −0.0496
2    10   1   1   −0.00304          −0.0496
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
