Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective

Persio, Luca Di; Garbelli, Matteo

doi:10.3390/sym13010014

Open AccessFeature PaperArticle

Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective

by

Luca Di Persio

^1,*

and

Matteo Garbelli

²

¹

Department of Computer Science, University of Verona, 37134 Verona, Italy

²

Department of Mathematics, University of Trento, 38123 Trento, Italy

^*

Author to whom correspondence should be addressed.

Symmetry 2021, 13(1), 14; https://doi.org/10.3390/sym13010014

Submission received: 18 November 2020 / Revised: 9 December 2020 / Accepted: 17 December 2020 / Published: 23 December 2020

(This article belongs to the Special Issue Advances in Stochastic Differential Equations)

Download Versions Notes

Abstract

:

We provide a rigorous mathematical formulation of Deep Learning (DL) methodologies through an in-depth analysis of the learning procedures characterizing Neural Network (NN) models within the theoretical frameworks of Stochastic Optimal Control (SOC) and Mean-Field Games (MFGs). In particular, we show how the supervised learning approach can be translated in terms of a (stochastic) mean-field optimal control problem by applying the Hamilton–Jacobi–Bellman (HJB) approach and the mean-field Pontryagin maximum principle. Our contribution sheds new light on a possible theoretical connection between mean-field problems and DL, melting heterogeneous approaches and reporting the state-of-the-art within such fields to show how the latter different perspectives can be indeed fruitfully unified.

Keywords:

deep learning; neural networks; stochastic optimal control; mean-field games; Hamilton–Jacobi–Bellman equation; Pontryagin maximum principle

1. Introduction

Controlled stochastic processes, which naturally arise in a plethora of heterogeneous fields, spanning, e.g., from mathematical finance to industry, can be solved in the setting of continuous time stochastic control theory. In particular, when we have to analyse complex dynamics produced by the mutual interaction of a large set of indistinguishable players, an efficient approach to infer knowledge about the resulting behaviour, typical for example of a neuronal ensemble, is provided by Mean-Field Game (MFG) methods, as described in [1]. MFG theory generalizes classical models of interacting particle systems characterizing statistical mechanics. Intuitively, each particle is replaced by rational agents whose dynamics are represented by a Stochastic Differential Equation (SDE). The term mean-field refers to the highly symmetric form of interaction: the dynamics and the objective of each particle depend on an empirical measure capturing the global behaviour of the population. The solution of an MFG is analogous to a Nash equilibrium for a non-cooperative game [2]. The key idea is that the population limit can be effectively approximated by statistical features of the system corresponding to the behaviour of a typical group of agents, in a Wasserstein space sense [3]. On the other hand, Deep Learning (DL) is frequently used in several Machine Learning (ML) based applications, spanning from image classification and speech recognition to predictive maintenance and clustering. Therefore, it has become essential to provide a strong mathematical formulation and to analyse both the setting and the associated algorithms [4,5]. Commonly, Neural Networks (NNs) are trained through the Stochastic Gradient Descent (SGD) method. It updates the trainable parameters using gradient information computed randomly via a back-propagation algorithm with the disadvantage of being slow in the first steps of training. An alternative consists of expressing the learning procedure of an NN as a dynamical system (see [6]), which can be then analysed as an optimal control problem [7].

The present paper is structured as follows. In Section 2, we introduce the fundamentals about the Wasserstein space, Stochastic Optimal Control (SOC) and MFGs. In Section 3, the link between NNs and MFGs is deeply analysed in order to reflect the probabilistic nature of the learning process. Conclusions and prospects are outlined in Section 4.

2. Problem Formulation and Preliminaries

2.1. Wasserstein Metrics

Many of the main results contained in this paper will be stated in terms of convergence in the distribution of random variables, vectors, processes and measures. We refer to [8,9,10] concerning the basics about the theory of the weak convergence of probability measures in metric spaces equipped with the natural Borel

σ

-algebra; see also [11,12] and the references therein for further details. According to [8] (pp. 7, 8), let us recall the following two definitions, useful to highlight the strict connection between measure theory and probability. Given a probability measure

μ \in P (X)

over the metric space

X

and a sequence

(μ_{n}) \subset P (X)

with

n \in N

, we say that

μ_{n}

converges weakly to

μ

or

μ_{n} \to μ

if:

lim_{n \to \infty} \int_{X} f d μ_{n} = \int_{X} f d μ, \forall f \in C_{b} (X) .

Given a sequence of

X

-valued random variables

{X_{n}}_{n \geq 1}

, we say that

X_{n}

converges weakly (or in distribution) to a

X

-valued random variable X, if:

lim_{n \to \infty} E [f (X_{n})] = E [f (X)], \forall f \in C_{b} (X),

denoting this convergence by

X_{n} \Rightarrow X

. Focusing on the convergence of empirical measures, let

(X_{i})

be a sequence of independent and identically distributed (i.i.d.)

X

-random variables and define the

P (X)

-valued random variable:

μ_{n} = \frac{1}{n} \sum_{i = 1}^{n} δ_{X_{i}},

(1)

then we have a random probability measure, usually indicated as the empirical measure. According to [8] (pp. 12–16),

P (X)

is endowed with a metric (compatible with the notion of weak convergence) in order to consider

P (X)

as a metric space itself. Let us recall the Wasserstein metric, defined on

P (X)

, based on the idea of coupling. In particular, given

μ, ν \in P (X)

,

Π (μ, ν)

represents the set of Borel probability measures

π

on

X \times X

, with first, resp. second, marginal

μ

, resp.

ν

. Namely,

π (A \times X) = μ (A)

and

π (X \times A) = ν (A)

for every Borel set

A \subset X

. Then, we define

P^{p} (X)

for

p \geq 1

, as the set of probability measures

μ \in P (X)

satisfying:

\int_{X} d {(x, x_{0})}^{p} μ (d x) < \infty,

where

x_{0} \in X

is an arbitrary reference point. Consequently, the p-Wasserstein metric on

P^{p} (X)

is defined as:

W_{p} (μ, ν) = {(inf_{π \in Π (μ, ν)} \int_{X \times X} d {(x, y)}^{p} π (d x, d y))}^{1 / p} = = {(inf_{X \sim μ, Y \sim ν} E [d {(X, Y)}^{p}])}^{1 / p},

(2)

where the infimum is taken over all pairs of

X

-valued random variables

X, Y

with, respectively, given marginals

μ

and

ν

.

2.2. Stochastic Optimal Control Problem

Following [13] (see also [14] and the references therein), let

{Ω, F, (F_{t}), P}

be a filtered probability space, with filtration

F_{t} = {F_{t}, t \in [0, T]}

,

T > 0

, supporting:

a controlled state variable ${(X_{t}^{α})}_{t \in [0, T]}$ , where $X_{t}$ is an i.i.d. sequence of $R^{n}$ -valued $F_{0}$ -measurable random variables;
a sequence ${W_{t}^{i}}_{i \geq 1}$ of independent and $F_{t}$ -adapted Brownian motions.

We consider a control as enforced on the state variable by

α = {(α_{t})}_{t \in [0, T]}

, that is an adapted process defined on

A = {{(α_{t})}_{t \in [0, T]}}

, containing

F

-progressively measurable controls, which take values in A identified as a closed subset of

R^{d}

. The objective function J is then defined as:

J (α, X) = E [\int_{0}^{T} f (α_{s}, X_{s}) d s + g (X_{T})],

where

f : A \times R^{n} \to R

denotes the running objective function, while

g : R^{n} \to R

denotes the terminal objective function, and the expectation is taken with respect to (w.r.t.) the probability measure

P

.

The goal corresponds to identifying the control

α

solving the maximization problem:

max_{α \in A} J (α, X_{t}),

according to the n-dimensional stochastic controlled process

X_{s}

s.t.

\{\begin{matrix} d X_{s} = b (α_{s}, X_{s}) d s + σ (α_{s}, X_{s}) d W_{s} \\ X_{t} = x, s \in [t, T] . \end{matrix}

(3)

The drift function

b : A \times R^{n} \to R^{n}

and the volatility function

b : A \times R^{n} \to R^{d + n}

are measurable, and they satisfy a uniform Lipschitz condition in x, i.e., there exists

K > 0

such that, for all

x, y \in R^{n}

and

α_{s} \in A

, it holds that:

| b (x, α) - b (y, α) | + | σ (x, α) - σ (y, α | \leq K | x - y | .

Previous assumptions guarantee that the SDE (3) has a unique solution, denoted by

{(X_{s}^{t, x})}_{s \in [t, T]}

. Therefore, the objective function can be explicitly expressed in terms of x, t, namely:

J (α, x, t) = E [\int_{0}^{T} f (α_{s}, X_{s}^{t, x}) d s + g (X_{T}^{t, x})],

with

(t, x) \in [0, T] \times R^{n}

, and the process

α \in A

takes values

α_{s} \in A

. Let v be the value function:

v (t, x) = sup_{α \in A} J (t, x, α),

(4)

then the corresponding optimal control

\hat{α} : [0, T] \times R^{n} \to A

is defined by:

J (t, x, \hat{α}) = v (t, x), \forall (t, x), \in [0, T] \times R^{n},

whose solution can be found by exploiting two different and interconnected approaches respectively based on the Hamilton–Jacobi–Bellman (HJB) equation and on the stochastic Pontryagin Maximum Principle (PMP).

The first one moves from the Dynamic Programming Principle (DPP), then leading to the nonlinear second order Partial Differential Equation (PDE) known as the HJB equation (see [3]), namely:

\{\begin{matrix} \frac{\partial v}{\partial t} (t, x) + sup_{a \in A} [L^{α} v (t, x) + f (x, a)] = 0 \\ v (T, x) = g (x), \end{matrix}

(5)

that holds

\forall (x, t) \in [0, T] \times R^{n}

, where

L

defines a second order operator called the infinitesimal generator of the controlled diffusion:

L^{a} v = b (x, a) \cdot \nabla_{x} v (x) + \frac{1}{2} T r [(σ (x, a) σ {(x, a)}^{T} \nabla_{x}^{2} v (x)] .

It is possible to define the Hamiltonian for the SOC problem [15,16] as

H : R^{d} \times R^{d} \times S_{d} \to R \cup {\infty}

, written as follows:

H (x, y, z) = sup_{a \in A} (b (x, a) \cdot y + \frac{1}{2} T r [σ σ^{T} (x, a) z] + f (x, a)),

where

S_{d}

denotes the set of symmetric

d \times d

matrices and the variables y and z are called the adjoint variables. The previously defined Hamiltonian allows encapsulating the nonlinearity of the HJB equation, which can be rearranged as:

\partial_{t} v (t, x) + H (x, \nabla (t, x), \nabla^{2} (t, x)) = 0 .

On the other hand, the stochastic PMP leads to a system of coupled Forward-Backward SDEs (FBSDEs) plus an external optimality condition in terms of the Hamiltonian function; see, e.g., [2,17] and the references therein. We define the local Hamiltonian as:

H_{t} (x, u, p, q) = f_{t} (x, u) + b_{t} (x, u) p + σ_{t} (x) q,

(6)

where

(p, q)

are the adjoint variables.

By assuming that

f_{t} (x, u)

,

b_{t} (x, u)

and

σ (x, u)

are progressively measurable, bounded in

C^{1} (x, u)

and Lipschitz

\forall (x, u) \in R^{2}

, we have that if an arbitrary control maximizes the Hamiltonian, then (necessary condition for optimality) it is the optimal one:

H_{τ} ({\hat{x}}_{τ}, {\hat{ν}}_{τ}, {\hat{P}}_{τ}, {\hat{Q}}_{τ}) = max_{ν \in R} H_{t} ({\hat{x}}_{τ}, ν, {\hat{P}}_{τ}, {\hat{Q}}_{τ}) .

Moreover, by requiring that

x \mapsto g (x)

and

x \mapsto {\hat{H}}_{t} (x, {\hat{P}}_{τ}, {\hat{Q}}_{τ}) = {sup}_{ν} H_{τ} ({\hat{x}}_{τ}, ν, {\hat{P}}_{τ}, {\hat{Q}}_{τ})

are both concave functions, the PMP becomes also a sufficient condition to characterize the optimal control. In addition, by the envelope theorem, it follows that the optimal control

\hat{ν}

maximizing

H

also maximizes its derivative

\partial_{x} H

:

\partial_{x} H_{τ} ({\hat{x}}_{τ}, {\hat{ν}}_{τ}, {\hat{P}}_{τ}, {\hat{Q}}_{τ}) = \partial_{x} {\hat{H}}_{t} ({\hat{x}}_{τ}, {\hat{P}}_{τ}, {\hat{Q}}_{τ}) .

2.3. Mean-Field Games

In what follows, according to [1,3,5,18], we will exploit the theory of SOC to analyse Mean-Field Games (MFGs), to gain insights into the symmetric stochastic differential games characterized by a number of players that tends to infinity. In particular, consider n players

i = 1, \dots, n

each of which controls a private state

X^{i}

whose dynamics read as follows:

d X_{t}^{i} = \frac{1}{N} \sum_{j = 1}^{N} b (t, X_{t}^{i}, X_{t}^{j}, α_{t}^{i}) + σ d W_{t}^{i},

(7)

σ \in R

being a (common) constant multiplying a random noise component, i.e.,

d W_{t}^{i}

, steered by the i-th copy of a standard Brownian motion.

Each control

α_{t}^{i}

represents the strategy played by the ith player, while

X_{t}^{j}

models the influence of the state of the other j-players. The dynamics in Equation (7) can be reformulated as:

\frac{1}{N} \sum_{j = 1}^{N} b (t, X_{t}^{i}, X_{t}^{j}, α_{t}^{i}) = \int b (t, X_{t}^{i}, X_{t}^{j}, α_{t}^{i}) {\bar{μ}}_{t}^{N} d x,

where the empirical distribution

{\bar{μ}}_{t}^{N}

of private states is defined by:

{\bar{μ}}_{t}^{N} = \frac{1}{N} \sum_{j = 1}^{N} δ_{X_{t}^{j}} .

Hence, the same dynamics for each player are obtained, since each player depends only on the global behaviour of the entire system and not on the state of the single player:

d X_{t}^{i} = \frac{1}{N} \sum_{j = 1}^{N} b (t, X_{t}^{i}, {\bar{μ}}_{t}^{N}, α_{t}^{i}) + σ d W_{t}^{i} .

Each individual is minimizing a cost composed by a running cost and a terminal one. Therefore, the objective function

J^{i} (α)

,

\forall i

, is defined by:

J^{i} (α^{i}) = E [\int_{0}^{T} f (t, X_{t}^{i}, {\bar{μ}}_{t}^{N}, α_{t}^{i}) d t + g (X_{T}^{i}, {\bar{μ}}_{T}^{N})],

the goal being to select

ϵ

-Nash equilibria, the vector

(α^{1 ⋆}, \dots, α^{N ⋆})

being a

ε

-Nash equilibrium if:

\forall i = 1, \dots, N \land \forall α^{i} J (α^{⋆}) - ϵ \leq J (α^{⋆ - i}, α^{i}),

meaning that if the ith player is changing his/her behaviour for a small value

ϵ

, while other players follow a fixed strategy, there is no change in the profit of the game in terms of the cost J. Let us note that, if

ϵ = 0

, we are dealing with the standard Nash equilibria.

In this scenario, the solution of an MFG (see [3]) corresponds to a couple

(α^{⋆}, μ^{⋆})

, where

μ_{t}^{⋆} = L (X_{t}^{α^{⋆}})

models the optimal empirical distribution and

α^{⋆} = ϕ^{⋆} (t, X_{t}^{i})

the optimal strategies.

Due to the symmetry of the system, all the agents play the same strategy profile:

ϕ^{1 ⋆} (t, X_{t}^{1}) = \dots = ϕ^{N ⋆} (t, X_{t}^{N}) = ϕ^{⋆} (t), \forall i = 1, \dots, N;

hence, by fixing the flow of the probability measure

{(μ_{t})}_{t \in [0, T]}

, the solution

ϕ^{⋆}

solves the SOC problem parametrized by the choice of the family

(μ_{t})

as:

ϕ^{⋆} = \underset{ϕ {(t)}_{t \in [0, T]}}{argmin} E [\int_{0}^{T} f (t, X_{t}, μ_{t}, ϕ (t, X_{t})) d t + g (X_{T}, μ_{T})],

subject to:

d X_{t} = b (t, X_{t}, μ_{t}, ϕ (t, X_{t})) d t + σ d W_{t} .

(8)

The solution of the latter optimization problem returns the best response of a representative player to the empirical measure

(μ_{t})

in a scenario where no player is profitable in changing his/her own strategy.

Concerning the choice of the flow of the probability measure

{(μ_{t})}_{t \in [0, T]}

(see Equation (1)), if

N \to \infty

, the asymptotic independence implied by the law of large numbers ensures that the empirical measure

{\bar{μ}}_{t}^{N}

tends to the statistical distribution of

X_{t}

in the form of a fixed point equation, that is

L (X^{α^{μ}}) = μ

. Once the optimal feedback

ϕ^{⋆}

has been found, for each choice of the parameter

{(μ_{t})}_{0 \leq t \leq T}

, it is necessary to check that the optimal empirical measure

μ_{t}^{*}

can be recovered from the statistical distribution of the optimal states

X_{t}^{α^{*}}

, i.e.,

\forall t

,

L (X_{t}^{α^{⋆}}) = μ_{t}^{⋆}

must hold.

By freezing the family

μ

of probability, the Hamiltonian becomes:

H^{μ_{t}} (t, x, y, α) = y b (t, x, μ_{t}, α) + f (t, x, μ_{t}, α) .

(9)

Let us assume that there exists a regular function:

\hat{α} (t, x, y) : [0, T] \times R \times R \to \underset{α \in A}{argmin} H^{μ_{t}} (t, x, y, α),

then we denote the infimum over controls

α

of

H^{μ_{t}} (t, x, y, α)

by:

H^{μ_{t}} (t, x, y) : = inf_{α \in A} H^{μ_{t}} (t, x, y, α),

ending up with a stochastic control problem that can be solved (see, e.g., [3] (pp. 7–9)) by applying one of the following two methods:

The PDE approach through HJB Equation (21) and the Kolmogorov equation;
The Backward Stochastic Differential Equation (BSDE) approach based on the PMP.

We introduce the HJB value equation following [3], (p. 7), i.e.,:

v (t, x) = inf_{α \in A_{t}} E [\int_{t}^{T} f (s, X_{s}, μ_{s}, α_{s}) d s + g (X_{T}, μ_{T}) | X_{t} = x],

(10)

where

A_{t}

denotes the set of admissible controls over the interval

[t, T]

. It is expected that v is the solution, in the viscosity sense, of the HJB equation; see, e.g., [19,20] and the references therein. Hence:

\partial_{t} v + \frac{σ^{2}}{2} \partial_{x x}^{2} v + H^{μ_{t}} (t, x, \partial_{x} v (t, x)) = 0, (t, x) \in [0, T] \times R,

(11)

with terminal condition

v (T, x) = g (x, μ_{T}), x \in R

. Imposing

\bar{μ} = {(μ_{t})}_{0 \leq t \leq T}

as the flow of the optimally controlled state, the flow of statistical distributions should satisfy Kolmogorov’s equation, so that, introducing the notation

β (t, x) = b (t, x, μ_{t}, ϕ (t, x))

with

ϕ

the optimal feedback, the flow

{(ν_{t})}_{0 \leq t \leq T}

of measures given by

ν_{t} = L (X_{t})

satisfies the Kolmogorov’s equation:

\partial_{t} ν - \frac{σ^{2}}{2} \partial_{x x}^{2} ν - d i v (β (t, x) ν) = 0, (t, x) \in [0, T] \times R,

(12)

with initial condition

ν_{0} = μ_{0}

. Such a PDE holds in the sense of distributions, since

ν_{t}

represents a density and the derivatives involved must be considered in the Wasserstein sense. Setting

ν_{t} = L (X_{t}) = μ_{t}

, we end up with a system of coupled non-linear forward-backward PDEs (11) and (12), constituting the so-called MFG-PDE system. On the other hand, we can approximate the solution of the aforementioned stochastic control problem via the stochastic Pontryagin principle. In particular, for each open-loop adapted control

α = {(α_{t})}_{0 \leq t \leq T}

, we denote by

X^{α} = {(X_{t}^{α})}_{0 \leq t \leq T}

the associated state, and we introduce the adjoint equation:

\{\begin{matrix} d Y_{t} = - \partial_{x} H^{μ_{t}} (t, X_{t}, Y_{t}, α_{t}) d t + Z_{t} d W_{t}, t \in [0, T] \\ Y_{T} = \partial_{x} g (X_{T}, μ_{T}) . \end{matrix}

(13)

H^{μ_{t}}

corresponds to the Hamiltonian defined in Equation (9), and the solution

{(Y_{t}, Z_{t})}_{0 \leq t \leq T}

of the BSDE is called a set of adjoint processes.

The PMP necessary condition states that whenever

X^{α}

is an optimal state, it must hold that

H^{μ_{t}} (t, X_{t}, Y_{t}, α_{t}) = H^{μ_{t}} (t, X_{t}, Y_{t}), t \in [0, T]

. Moreover, if the Hamiltonian

H^{μ_{t}}

is convex w.r.t. the variables

(x, α)

and the terminal g is convex w.r.t. the variable x, then the system given by Equations (13) and (8) characterizes the optimal states of the MFG problem.

3. Main Result

In this section, we generalize the approach proposed in [5], providing an application of the Mean-Field (MF) optimal control to Deep Learning (DL). In particular, we start by considering the learning process characterizing supervised learning as a population risk minimization problem, hence considering its probabilistic nature in the sense that the optimal control parameters, corresponding to the trainable weights in the associated Neural Network (NN) model, depending on the population distribution of input-target pairs constituting the randomness source.

3.1. Neural Network as a Dynamical System

In order to study DL as an optimal control problem, it is necessary to express the NN learning process as a dynamical system [6,21]. In the simplest form, the feed-forward propagation in a T layer,

T \geq 1

, network can be expressed by the following difference equation:

x_{t + 1} = x_{t} + f (x_{t}, θ_{t}), t = 0, \dots, T - 1,

(14)

where

x_{0}

is the input, e.g., an image, several time-series, etc., while

x_{T}

is the final output, to be compared to some target

y_{T}

by means of a given loss function. By moving from a discrete time formulation to a continuous one, the forward dynamics we are interested in will be described by a differential equation that takes the role of (14). The learning aim is to tune the trainable parameters

θ_{0}, \dots, θ_{T - 1}

to have

x_{T}

as close as possible to

y_{T}

, according to a specified metric and knowing that the target

y_{T}

is joined to the input

x_{0}

by means of a probability measure

μ_{0}

.

Following the dynamical systems approach developed in [6], the supervised learning method aims to approximate some function

F

, usually called the oracle, denoted by

F : X \to Y

.

As stated before, the set

X \subset R^{d}

contains the d-dimensional array of inputs, e.g., images, financial time-series, sound recorded data, text, etc., while

Y

are the targets modelling the corresponding images, numerical forecast, or predicted texts.

In this setting, it is standard to define what is called a hypothesis space as:

H = {F_{θ} : X \to Y | θ \in Θ} .

Training moves from a collection of K samples of input-target pairs

{x^{i}, y^{i} = F (x^{i})}_{i = 1}^{K},

the goal being to approximate

F

exploiting these training data points.

Let

(Ω, F, P)

be a probability space supporting random variables

x_{0} \in R^{d}

and

y_{T} \in R^{l}

, jointly distributed according to

μ_{0} : = P_{(x_{0}, y_{T})}

with

μ_{0}

modelling the distribution of the input-target pairs. The set of controls

Θ \subseteq R^{m}

denotes the admissible training weights that are assumed to be essentially bounded, measurable

L^{\infty} ([0, T], Θ)

functions. The network depth, i.e., the number of layers, is denoted by

T > 0

. We also introduce the functions:

the feed-forward dynamics $f : R^{d} \times Θ \to R^{d}$ ;
the terminal loss function $Φ : R^{d} \times R^{l} \to R$ ;
the regularization term $L : R^{d} \times Θ \to R$ .

State dynamics are described by an Ordinary Differential Equation (ODE) of the form:

{\dot{x}}_{t} = f (x_{t}, θ_{t}),

(15)

representing the continuous version of Equation (14), equipped with an initial condition

x_{0}

, which is a random variable responsible for the randomness term characterizing Equation (15).

The population risk minimization problem in DL can be expressed by the following MF-optimal control problem (see [5] (p. 5)):

inf_{θ \in L^{\infty} ([0, T], Θ)} J (θ) : = E_{μ_{0}} [Φ (x_{T}, y_{T}) + \int_{0}^{T} L (x_{t}, θ_{t}) d t],

(16)

subject to the dynamics expressed by the stochastic ODE (15). Since the weights

θ

are shared by the distribution

μ_{0}

of random variable

(x_{0}, y_{T})

pairs, Equation (16) can be studied as an MF-optimal control problem.

On the other hand, the empirical risk minimization problem can be expressed by a sampled optimal control problem after computing i.i.d. samples

{x_{0}^{i}, y_{T}^{i}})_{i = 1}^{N}

modelled by

μ_{0} = P_{(x_{0}, y_{T})}

:

inf_{θ \in L^{\infty} ([0, T], Θ)} J_{N} (θ) : = \frac{1}{N} \sum_{i = 1}^{N} [Φ (x_{T}^{i}, y_{T}^{i}) + \int_{0}^{T} L (x_{t}^{i}, θ_{t}) d t],

(17)

subject to the dynamics:

{\dot{x}}_{t}^{i} = f (x_{t}^{i}, θ_{t}), i = 1, \dots, N,

whose solutions, moving from random initial conditions through a deterministic path, correspond to random variables.

As in classical optimal control theory, the previous problem can be solved following two inter-connected approaches: a global theory, based on the Dynamic Programming Principle (DPP) leading to the HJB equation, or considering the Pontryagin Maximum Principle (PMP) approach, hence expressing the solution by a system of Forward Backward SDEs (FBSDEs) plus a local optimality condition.

3.2. HJB Equation

The idea behind the HJB formalism is to define a value function corresponding to the optimal loss of the control problem w.r.t. the general starting time and state. For the population risk minimization formulation expressed by Equation (16), the state argument of the value function corresponds to an infinite-dimensional object that models a joint distribution of the input-target as an element of a suitable Wasserstein space.

As regards random variables and their distribution, a suitable space must be defined for the rigorous treatment of the optimal control problem. In particular, we use the shorthand notation

L^{2} (Ω, R^{d + l})

for

L^{2} ((Ω, F, P), R^{d + l})

to denote the set of

R^{d + l}

-valued square integrable random variables w.r.t. a given probability measure

P

. Then, we deal with a Hilbert space considering the norm:

{| | X | |}_{L^{2}} : = {(E (| | X | |^{2})}^{\frac{1}{2}}, X \in L^{2} (Ω, R^{d + l}),

The set

P_{2} (R^{d + l})

denotes the integrable probability measures defined on the Euclidean space

R^{d + l}

. Let us recall that the random variable X is square integrable in

L^{2} (Ω, R^{d + l})

if and only if its law

P_{X} \in P_{2} (R^{d + l})

. The space

P_{2} (R^{d + l})

can be endowed with a metric by considering the Wasserstein distance defined in Equation (2). For

p = 2

, the two-Wasserstein distance reads:

W_{2} (μ, ν) : = inf \{{(\int_{R^{d + l} \times R^{d + l}} | | w - {z | |}^{2} π (d w, d z))}^{\frac{1}{2}} | π \in P_{2} (R^{d + l} \times R^{d + l}) w i t h m a r g i n a l s μ a n d ν\},

according to the marginals introduced in Section 2.1, or equivalently:

W_{2} (μ, ν) : = inf \{| | X - {Y | |}_{L^{2}} | X, Y \in L^{2} (Ω, R^{d + l}) w i t h P_{X} = μ, P_{Y} = ν\},

see, e.g., [5] (p. 6). Moreover,

\forall μ \in P_{2} (R^{d + l})

, we define the associated norm:

{| | μ | |}_{L^{2}} = {(\int_{R^{d + l}} {| | w | |}^{2} μ (d w))}^{\frac{1}{2}} .

Given a measurable function

ψ : R^{d + l} \to R^{q}

that is square integrable w.r.t. the probability distribution

μ

, the following notation is introduced:

〈 ψ (\cdot), μ 〉 : = \int_{R^{d + l}} ψ (w) μ (d w) .

(18)

Concerning the dynamical evolution of probability measures, let us fix

ξ \in L^{2} (Ω, R^{d + l})

and the control process

θ \in L^{\infty} ([0, T], Θ)

. Then, the dynamics of the system can be written as:

W_{s}^{t, ξ, θ} = ξ + \int_{t}^{s} \bar{f} (W_{s}^{t, ξ, θ}, θ_{t}) d s, s \in [t, T],

μ

being the law associated with the variable

ψ

defined by

μ = P_{ξ} \in P_{2} (R^{d + l})

, and we can rewrite the law of

W_{s}^{t, ξ, θ}

as:

P_{s}^{t, μ, θ} : = P_{W_{s}^{t, ξ, θ}} .

(19)

Indeed, the law involving the dynamics

W_{s}^{t, ξ, θ}

depends only on the law of

ξ

and not on the random variable itself; see, e.g., [5] (p. 7).

It turns out that, to obtain the HJB Equation (5) corresponding to the above introduced formulation, it is necessary to define the concept of the derivative w.r.t. a probability measure. To begin with, it is useful to consider probability measures on

R^{d + l}

as laws expressing probabilistic features of the

R^{d + l}

-valued random variables defined over the probability space

(Ω, F, P)

. Then, we define the Banach space of random variables to define the derivatives. Moreover, if we define a function

u : P_{2} (R^{d + l}) \to R

, it is possible to lift it into its extension U defined on

L^{2} ([0, T], R^{d + l})

, as follows:

U (X) = u (P_{X}), \forall X \in L^{2} (Ω, R^{d + l}),

then the definition of the derivative w.r.t. a probability measure can be expressed in terms of U in the usual Banach space setting. In particular, we have that u is

C^{1} (P_{2} (R^{d + l}))

, if the lifted function U is Fréchet differentiable with continuous derivatives.

Since

L^{2} (Ω, R^{d + l})

can be identified with its dual, if the Fréchet derivative

D U (X)

exists, by Riesz’s theorem, it can be identified with an element of

L^{2} (Ω, R^{d + l})

, i.e.,

D U (X) (Y) = E [D U (X) \cdot Y], \forall Y \in L^{2} (Ω, R^{d + l}) .

It is worth underlining that

D U (X)

does not depend on X, but only on the law described by X; hence, the derivative of u at

μ = P_{X}

is described by

\partial_{μ} u (P_{X}) : R^{d + l} \to R^{d + l}

, defined as:

D U (X) = \partial_{μ} u (P_{X}) (X) .

By duality, we know that

\partial_{μ} u (P_{X})

is square integrable w.r.t.

μ

. To define a notion for the chain rule in

P_{2} (R^{d + l})

, a dynamical system is described by:

W_{t} = ξ + \int_{0}^{t} \bar{f} (W_{s}) d s, ξ \in L^{2} (Ω, R^{d + l}),

where

\bar{f}

denotes the feed-forward dynamics. If a function

u \in C^{1} (P_{2} (R^{d + l}))

, meaning that it is differentiable with a continuous derivative w.r.t. a probability measure, then, for all

t \in [0, T]

, we have:

u (P_{W_{t}}) = u (P_{W_{0}}) + \int_{0}^{t} 〈 \partial_{μ} u (P_{W_{s}}) (\cdot) \cdot \bar{f} (\cdot), P_{W_{s}} 〉 d s,

(20)

where · denotes the usual inner product between the vector in

R^{d + l}

. Equivalently, exploiting the lifted function of u, we can state:

U (W_{t}) = U (W_{0}) + \int_{0}^{t} E [D U (W_{s}) \cdot \bar{f} (W_{s})] d s .

Moreover, the variable w denotes the concatenated

(d + l)

-dimensional variable

(x, y)

, where

x \in R^{d}

and

y \in R^{l}

. Correspondingly,

\bar{f} (w, θ) : = (f (x, θ), 0)

is the extended

(d + l)

-dimensional Feed-Forward Function (FFF),

\bar{L} (w, θ) = L (x, θ)

is the extended

(d + l)

-dimensional regularization loss and

\bar{Φ} (w) : = Φ (x, y)

represents the terminal loss function.

Since the state variable is identified with a probability distribution

μ \in P_{2} (R^{d + l})

, the resulting objective functional can be defined as:

J (t, μ, θ) : = E_{(x_{t}, y_{T}) \sim μ} [Φ (x_{T}, y_{T}) + \int_{t}^{T} L (x_{t}, θ_{t}) d t],

which can be written, with the concatenated variable w and the bracket notation introduced in (18), as:

J (t, μ, θ) : = 〈 \bar{Φ} (w), P_{T}^{t, μ, θ} 〉 + \int_{t}^{T} 〈 \bar{L} (w, θ_{s}), P_{s}^{t, μ, θ} 〉 d s .

In this setting, some assumptions for the value function are needed to solve Equation (16). In particular:

f, L and $Φ$ are bounded;
f, L and $Φ$ are Lipschitz w.r.t. x, and the Lipschitz constant of f and L are independent of $θ$ ;
$μ_{0} \in P_{2} (R^{(d + l)})$ .

The value function

v^{*} (t, μ)

is defined as the real-valued function on

[0, T] \times P_{2} (R^{d + l})

, corresponding to the infimum of the functional J over the training parameters

θ

:

v^{*} (t, μ) = inf_{θ \in L^{\infty} ([0, T], Θ)} J (t, μ, θ) .

It is essential to observe how the value function satisfies a recursive relation based on the Dynamic Programming Principle (DPP). This implies that, for any optimal trajectory starting from any intermediate point, the remaining part of the trajectory still has to be optimal. The latter principle can be expressed by defining the value function as:

v^{*} (t, μ) = inf_{θ \in L^{\infty} ([0, T]), Θ} [\int_{t}^{\hat{t}} 〈 \bar{L} (\cdot, θ_{s}), P_{s}^{t, μ, θ} 〉 d s + v^{*} (\hat{t}, P_{\hat{t}}^{t, μ, θ})],

\forall 0 \leq t \leq \hat{t} \leq T

and

μ \in P_{2} (R^{d + l})

.

Considering a small increment of time

\hat{t} = t + δ t

with

δ > 0

, we can compute the Taylor expansion in the Wasserstein sense, hence obtaining:

0 = inf_{θ \in L^{\infty} ([0, T]), Θ} [v^{*} (t + δ t, P_{t + δ t}^{t, μ, θ}) - v^{*} (t, μ) + \int_{t}^{t + δ t} 〈 \bar{L} (w, θ_{s}), P_{s}^{t, μ, θ} 〉 d s] .

By the chain rule in

P_{2} (R^{d + l})

, we have:

0 \approx inf_{θ \in L^{\infty} ([0, T]), Θ} [\partial_{t} v (t, μ) δ t + \int_{t}^{t + δ t} 〈 \partial_{μ} v (t, μ) (w) \cdot \bar{f} (w, θ) + \bar{L} (w, θ_{s}), μ 〉 d s] .

Since the infinitesimal

δ t

does not affect the distribution

μ

and the controls

θ

(see [5] (p. 13)), integrating the second term, we have:

0 \approx δ t inf_{θ \in L^{\infty} ([0, T]), Θ} [\partial_{t} v (t, μ) + 〈 \partial_{μ} v (t, μ) (w) \cdot \bar{f} (w, θ) + \bar{L} (w, θ), μ 〉] .

Taking

δ t \to 0

, we have:

\{\begin{matrix} \frac{\partial v}{\partial t} + inf_{θ \in Θ} 〈\partial_{μ} v (t, μ) (w) \cdot \bar{f} (w, θ) + \bar{L} (w, θ), μ〉 = 0, o n [0, T) \times P_{2} (R^{d + l}) \\ v (T, μ) = 〈 \bar{Φ} (w), μ 〉, o n P_{2} (R^{d + l}) . \end{matrix}

(21)

Since the value function should solve the HJB equation, it is essential to find the precise link between the solution of this PDE and the value function obtained from the minimization of the functional J. To provide the result, we use a verification argument allowing the following consideration: if the solution of the HJB is smooth enough, then it corresponds to the value function

v^{*}

; moreover, it allows computing the optimal control

θ^{*}

.

Theorem 1

(The verification argument). Let v be a function in

C^{1, 1} ([0, T] \times P_{2} (R^{d + l}))

. If v is a solution of the HJB equation in (21) and there exists

\bar{θ}

that is mapping

(t, μ) \to Θ

attaining the infimum in the HJB equation, then

v (t, μ) = v^{*} (t, μ)

, and

\bar{θ}

is an optimal feedback control policy, i.e.,

θ = θ^{*}

is a solution of the population risk minimization problem expressed by Equation (16), where

θ_{t}^{*} : = \bar{θ} (t, P_{w_{t}^{⋆}})

with

P_{w_{0}^{⋆}} = μ_{0}

and

\frac{d w_{t}^{⋆}}{d t} = \bar{f} (w_{t}^{⋆}, θ_{t}^{⋆})

.

Proof of Theorem 1.

Given any control process

θ

, applying Formula (20) between

s = t

and

s = T

with explicit time dependence gives:

v (T, P_{T}^{t, μ, θ}) = v (t, μ) + \int_{t}^{T} \frac{\partial v}{\partial t} (s, P_{s}^{t, μ, θ}) + 〈 \partial_{μ} v (s, P_{s}^{t, μ, θ}) (\cdot) \cdot \bar{f} (\cdot, θ_{s}), P_{s}^{t, μ, θ}) 〉 d s,

Equivalently:

\begin{matrix} v (t, μ) & = v (T, P_{T}^{t, μ, θ}) - \int_{t}^{T} \frac{\partial v}{\partial t} (s, P_{s}^{t, μ, θ}) + 〈 \partial_{μ} v (s, P_{s}^{t, μ, θ}) (\cdot) \cdot \bar{f} (\cdot, θ_{s}), P_{s}^{t, μ, θ} 〉 d s \\ \leq v (T, P_{T}^{t, μ, θ}) + \int_{t}^{T} 〈 \bar{L} (\cdot, θ_{s}), P_{s}^{t, μ, θ} 〉 d s \\ = 〈 \bar{Φ} (\cdot), P_{t}^{T, μ, θ} + \int_{t}^{T} 〈 \bar{L} (\cdot, θ_{s}), P_{s}^{t, μ, θ} 〉 d s \\ = J (t, μ, θ), \end{matrix}

where the first inequality comes from the infimum condition in (21).

Since the control is arbitrary, we have:

v (t, μ) \leq v^{*} (t, μ),

then it can be substituted with

θ^{*}

where

θ_{t}^{*} = \bar{θ} (t, P_{s}^{t, μ, θ^{*}})

is computed by the optimal feedback control. Repeating the above argument, the inequality becomes an equality since the infimum is attained for

\bar{θ}

:

v (t, μ) = J (t, μ, θ^{*}) \geq v^{*} (t, μ) .

Thus,

v (t, μ) = v^{*} (t, μ)

, and

\bar{θ}

defines an optimal control policy. For more details, see [5] (Proposition 3, pp. 13–14). □

The importance of Theorem 1 consists of linking smooth solutions of the parabolic PDE to the solutions of the population risk minimization problem, becoming a natural candidate for the DL problem.

Moreover, the optimal control policy

\bar{θ} : [0, T] \times P_{2} (R^{d + l}) \to Θ

is identified by computing the infimum in (21). Hence, it turns out that the HJB equation strongly characterizes the learning problem’s solution for the feedback, or closed-loop networks: control weights are actively adjusted according to the outputs, and this is the essential feature of closed-loop control. Nevertheless, the solution comes from a PDE that is in general difficult to solve, even numerically. On the other hand, open-loop solutions can be obtained from the closed-loop control policy by sequentially setting

θ_{t}^{⋆} = \bar{θ} (t, P_{w_{t}^{⋆}})

,

w_{t}^{⋆}

being the solution of the feed-forward ODE describing the dynamics of the state variable, with

θ = θ^{⋆}

up to time t. Usually within DL settings, open-loops are used during training or to measure the inference of a trained model, since trained weights for each neuron will have a fixed value.

The great limit of such a formulation relies in assuming that the value function

v^{⋆} (t, μ)

is continuously differentiable. It is straightforward to study a more flexible characterization for

v^{⋆}

dealing with weak solutions, also denoted as viscosity solutions. Thus, it is worth considering a weaker formulation of the PDE to go beyond the concept of classical solutions, by introducing the notion of viscosity ones, hence allowing obtaining relevant results when dealing with weaker assumptions on the coefficients defining the (stochastic) differential problem we are interested in; see, e.g., [5] (Section 5 pp. 14–22) for more details.

The key idea relies on exploiting the lifting identification between measures and random variables and moving from the Wasserstein space

P_{2} (R^{d + l})

to the Hilbert space

L^{2} (Ω, R^{d + l})

, using tools developed to study viscosity solutions.

We introduce a functional defined as the Hamiltonian for viscosity formulation

H (ξ, P) : L^{2} (Ω, R^{d + l}) \times L^{2} (Ω, R^{d + l}) \to R

through:

H (ξ, P) = inf_{θ \in Θ} E [P \cdot \bar{f} (ξ, θ) + \bar{L} (ξ, θ)] .

(22)

Then, the lifted Bellman equation can be written w.r.t.

V (t, ξ) = v (t, P_{ξ})

as follows:

\{\begin{matrix} \frac{\partial V}{\partial t} + H (ξ, D V (t, ξ)) = 0 o n [0, T) \times L^{2} (Ω, R^{d + l}), \\ V (T, ξ) = E [\bar{Φ} (ξ)] o n L^{2} (Ω, R^{d + l}); \end{matrix}

(23)

hence, the PDE we are analysing is now set within a larger space corresponding to

L^{2} (Ω, R^{d + l})

.

We say that a bounded, uniformly continuous function

u : [0, T] \times P_{2} (R^{d + l}) \to R

is a viscosity solution of HJB Equation (21) if its lifted function

U : [0, T] \times L^{2} (Ω, R^{d + l}) \to R

defined by:

U (t, ξ) = u (t, P_{ξ}),

is a viscosity solution to the lifted Bellman Equation (23), namely:

$U (T, ξ) \leq E [\bar{Φ} (ξ)]$ and for any test function $ψ \in C^{1, 1} ([0, T] \times L^{2} (Ω, R^{d + l}))$ such that $(U - ψ)$ has a local maximum at $(t_{0}, ξ_{0}) \in [0, T) \times L^{2} (Ω, R^{d + l})$ , $ψ$ solves:

$\partial_{t} ψ (t_{0}, ξ_{0}) + H (ξ_{0}, D ψ (t_{0}, ξ_{0})) \geq 0 .$
$U (T, ξ) \geq E [\bar{Φ} (ξ)]$ and for any test function $ψ \in C^{1, 1} ([0, T] \times L^{2} (Ω, R^{d + l}))$ such that the map $(U - ψ)$ has a local minimum at $(t_{0}, ξ_{0}) \in [0, T) \times L^{2} (Ω, R^{d + l})$ , $ψ$ solves:

$\partial_{t} ψ (t_{0}, ξ_{0}) + H (ξ_{0}, D ψ (t_{0}, ξ_{0})) \leq 0 .$

Actually, the unique solution of this formulation corresponds to the value function

v^{*}

from the minimization problem; see, e.g., [5] (Theorem 1, p. 15). Therefore, the HJB equation provides both the necessary and sufficient condition for the optimality of the learning procedure.

Adopting the MF-optimal control viewpoint implies that the population risk minimization problem of DL can be studied as a variational problem, whose solution is characterized by a suitable HJB equation, in analogy with the classical calculus of variations. In other words, the HJB equation is a global characterization of the value function to be solved over the entire space

P_{2} (R^{d + l})

of input-target distributions. From the numerical point of view, it is a hard task to get a solution for the entire space; this is why the learning problem is typically locally solved, around some (small set of) trajectories generated according to the initial condition

μ_{0} \in P_{2} (R^{d + l})

, then applying the obtained feedback to nearby input-target distributions.

3.3. Mean-Field Pontryagin Maximum Principle

We have seen how the HJB approach provides a characterization of the optimal solution for the population risk minimization problem that holds globally in

P_{2} (R^{d + l})

, at the price of being difficult to handle in practice. Moving from this consideration, the MF-PMP aims to show a local condition for optimality, expressed in terms of

E [H]

, i.e., the expectation of the Hamiltonian function.

Starting from the population risk minimization problem defined in Equation (16) and given a collection of K sample input-target pairs, the single ith input sample is considered. The prediction of the network can be approximated by a deterministic transformation of the terminal state

g (X_{T}^{i})

for some

g : R^{d} \to Y

that models a function both of the initial input

x^{i}

and of the control parameters

θ

. Moreover, we define a loss function

Φ : Y \times Y \to R

, which is minimized when the arguments are equal. Therefore, the goal is to minimize:

\sum_{i = 1}^{K} Φ (g (X_{T}^{i}), y^{i}) .

Since g is fixed, it can be absorbed into the definition of the loss function by defining the array

Φ_{i} (\cdot) : = Φ (\cdot, y_{i})

.

Then, the supervised learning problem can be expressed as:

min_{θ \in U} [\sum_{i = 1}^{K} Φ_{i} (X_{T}^{i}) + \int_{0}^{T} L (θ_{t}) d t],

(24)

where

L : Θ \to R

acts as a regularizer term to model a running cost.

Input variables

x = (x^{1}, \dots, x^{K})

can be considered as the elements of a Euclidean space

R^{d \times K}

, representing the initial conditions of the following ODE system:

{\dot{X}}_{t}^{i} = f_{θ_{t}} (t, X_{t}^{i}), X_{0}^{i} = x^{i}, 0 \leq t \leq T, i = 1, \dots, K,

(25)

where

θ : [0, T] \to Θ

are the control parameters to be trained. The dynamics (25) are decoupled except for the control. A general space

U

for controls

θ

is then defined as:

U : = {θ : [0, T] \to Θ : θ i s L e b e s g u e m e a s u r a b l e}

, and we are aiming to choose

θ

in

U

to have

g (X_{T}^{i})

closer to

y^{i}

for

i = 1, \dots, K

.

To formulate the PMP as a set of necessary conditions for optimal solutions, it is useful to define the Hamiltonian

H : [0, T] \times R^{d} \times R^{d} \times Θ \to R

given by:

H (t, x, p, θ) : = p \cdot f (t, x, θ) - L (θ),

with p modelling an adjoint process as in Equation (6).

Let us underline that all input-target pairs

(x_{0}, y_{0})

, connected by the distribution

μ_{0}

, share a common control parameter, and this feature suggests the idea to develop a maximum condition that has to hold in the average sense. Indeed, the control is now enforced on the continuity equation that describes the evolution of probability densities.

The following assumptions are needed:

f is bounded and f, L are continuous w.r.t. $θ$ ;
f, L and $Φ$ are continuously differentiable w.r.t. x;
the distribution $μ_{0}$ has bounded support in $R^{d} \times R^{l}$ , which means there exists $M > 0$ such that $μ ({(x, y) \in R^{d} \times R^{l} : | | x | | + | | y | | \leq 1}) = 1$ .

Theorem 2 (Mean-Field Pontryagin Maximum Principle).

Let assumptions 1–3 hold and

θ^{*} \in L^{\infty} ([0, T], Θ)

be the minimizer for

J (θ)

corresponding to the optimal control of the population risk minimization problem (16). Define the Hamiltonian

H : R^{d} \times R^{d} \times Θ \to R

as

H (x, p, θ) = p \cdot f (x, θ) - L (x, θ) .

(26)

Then, there exist absolutely continuous stochastic processes

x^{*}

and

p^{*}

solving the following forward backward SDEs:

{\dot{x}}_{t}^{*} = f (x_{t}^{*}, θ_{t}^{*}), x_{t}^{*} = x_{0},

(27)

{\dot{p}}_{t}^{*} = - \nabla_{x} H (x_{t}^{*}, p_{t}^{*}, θ_{t}^{*}), p_{T}^{*} = - \nabla_{x} Φ (x_{T}^{*}, y_{0}),

(28)

and the related optimality condition expressed in terms of the expectation of the Hamiltonian function:

E_{μ_{0}} H (x_{t}^{*}, p_{t}^{*}, θ_{t}^{*}) \geq E_{μ_{0}} H (x_{t}^{*}, p_{t}^{*}, θ), \forall θ \in Θ, f o r a . e . t \in [0, T] .

(29)

Proof of Theorem 2.

For the sake of simplicity, let us introduce a new coordinate

x^{0}

satisfying the dynamics

{\dot{x}}_{t}^{0} = L (x_{t},^{*}, θ_{t}^{*})

with

x_{0}^{0} = 0

. Through this choice, the definition of the Hamiltonian in Equation (26) can be rewritten without running loss L by redefining:

x \to (x^{0}, x), f \to (L, f), Φ (x_{T}, y_{0}) \to Φ (x_{T}, y_{0}) + x_{T}^{0} .

Assumptions 1–3 are still preserved, but now we consider without loss of generality the case

L \equiv 0

.

Let some

τ \in (0, T]

be a Lebesgue point of

\hat{f} (t) : = f (x_{t}^{*}, θ_{t}^{*})

; in this setting, these points are dense in

[0, T]

. Now, for

ε \in (0, τ)

, define the family of perturbed controls:

θ_{t}^{τ, ε} = \{\begin{matrix} ω & t \in [τ - ε, τ] \\ θ_{t}^{*} & o t h e r w i s e \end{matrix},

where

ω \in Θ

is an admissible control; this kind of perturbation is called needle perturbation. Accordingly, define

x_{t}^{τ, ε}

by:

x_{t}^{τ, ε} = x_{0} + \int_{0}^{t} f (x_{s}^{τ, ε}, θ_{s}^{τ, ε}) d s,

that is the solution of the forward propagation equation with the perturbed control

θ^{τ, ε}

. Clearly,

x_{t}^{*} = x_{t}^{τ, ε}

for every

t \leq τ - ε

and every

x_{0}

since the perturbation is not present. At the limit point

t = τ

, the following holds:

\frac{1}{ε} (x_{τ}^{τ, ε} - x_{τ}^{*}) = \frac{1}{ε} \int_{τ - ε}^{τ} f (x_{s}^{τ, ε}, ω) - f (f_{s}^{*}), θ_{s}^{*}) d s,

and since

τ

is a Lebesgue point of F:

v_{τ} : = lim_{ε \to 0} \frac{1}{ε} (x_{τ}^{τ, ε} - x_{τ}^{*}) = f (x_{τ}^{*}, ω) - f (x_{τ}^{*}, θ_{τ}^{*}) .

It is possible to characterize

v_{τ}

as the leading order perturbation on the state due to the needle perturbation introduced in the infinitesimal interval

[τ - ε, τ]

. In interval

(τ, T]

, the dynamics is the same before applying the perturbation since the controls are the same.

Now, it is necessary to consider how the perturbation

v_{τ}

propagates. Thus, define for

t \geq τ

:

v_{t}^{ε} : = \frac{1}{ε} (x_{t}^{τ, ε} - x_{t}^{*}),

and:

v_{t} : = lim_{ε \to 0} v_{t}^{ε},

v_{t}

is well-defined for almost every t, which are every Lebesgue point of the map

t \mapsto x^{*} (t)

, and it satisfies the following linearised equation:

\{\begin{matrix} {\dot{v}}_{t} = \nabla_{x} f {(x_{t}^{*}, θ_{t}^{*})}^{T} v_{t}, t \in (τ, T] \\ v_{τ} = f (x_{τ}^{*}, ω) - f (x_{τ}^{*}, θ_{τ}^{*}) . \end{matrix}

(30)

In particular,

v (T)

represents the perturbation of the final state introduced by this control. By the optimality assumption of

θ^{*}

, it follows that:

E_{μ_{0}} Φ (x_{T}^{τ, ε}, y_{0}) \geq E_{μ_{0}} Φ (x_{T}^{*}, y_{0}) .

Assumptions 1 and 2 (p. 13) imply

\nabla_{x} Φ

is bounded. By the dominated convergence theorem, we know that:

0 \leq lim_{ε \to 0} \frac{1}{ε} E_{μ_{0}} [Φ (x_{T}^{τ, ε}, y_{0}) - Φ (x_{T}^{*}, y_{0})]

= E_{μ_{0}} \frac{d}{d ε} Φ (x_{T}^{ε, τ}, y_{0}) |_{ε = 0^{+}}

= E_{μ_{0}} \nabla_{x} Φ (x_{T}^{ε, τ}, y_{0}) \cdot v_{T} .

(31)

Let us define

p^{⋆}

as the solution of the adjoint of Equation (30), hence:

{\dot{p}}_{t}^{*} = - \nabla_{x} f (x_{s}^{*}, θ_{s}^{*}) p_{t}^{*}, p_{T}^{*} = - \nabla_{x} Φ (x_{T}^{*}, y_{0}) .

By Equation (31), it follows that

E_{μ_{0}} p_{T}^{*} \cdot v_{T} \leq 0

, and moreover, for all

t \in [τ, t]

:

\frac{d}{d t} (p_{t}^{*} \cdot v_{t}) = {\dot{p}}_{t}^{*} \cdot v_{t} + {\dot{v}}_{t} \cdot p_{t}^{*} = 0;

thus,

E_{μ_{0}} p_{τ}^{*} \cdot v_{t} = E_{μ_{0}} p_{T}^{*} \cdot v_{T} \leq 0, \forall t \in [τ, T],

so that taking

t = τ

:

E_{μ_{0}} p_{τ}^{*} \cdot f (x_{τ}^{*}, θ_{τ}^{*}) \geq E_{μ_{0}} p_{T}^{*} \cdot f (x_{τ}^{*}, ω) .

Since

ω

is arbitrarily chosen, this completes the proof by recalling that

H (x, p, θ) = p \cdot f (x, θ)

. See [5] (Theorem 3, pp. 23–24) for more details. □

MF-PMP refers only to the control of the open-loop type; Equation (27) is a feed-forward ODE, describing the state dynamics under optimal controls

θ^{*}

. Equation (28) defines the evolution of the co-state variable

p_{s}^{*}

, characterizing the evolution of an adjoint variational condition backwards in time. It is interesting to note how the optimality condition described in Equation (29) does not involve first order partial derivatives, being expressed in terms of expectations. In particular, it requires that optimal solutions must globally maximize the Hamiltonian function. This aspect allows considering also cases of non-differentiable dynamics w.r.t. the controls weights, as well as cases characterized by optimal weights lying on the boundary of the training set

Θ

. Moreover, the usual first order optimality condition can be derived from (29). Comparing this MF formulation to the classical PMP, we can see that the main difference lies in the fact that the maximization condition is expressed in terms of the expectation above a probability density

μ_{0}

. The latter result is not surprising, since the mean-field-optimal control must depend on the probability distribution of input-target pairs.

Let us also note that the mean-field PMP expressed in Theorem 2 can be written more compactly as follows. For each control process

θ \in L^{\infty} ([0, T], Θ)

, we denote by

x^{θ} : = {x_{t}^{θ} : 0 \leq t \leq T}

and

p^{θ} : = {p_{t}^{θ} : 0 \leq t \leq T}

the solutions of Hamilton’s Equations (27) and (28), and we enforce the control expressed by the random variables

(x_{0}, y_{T}) \sim μ_{0}

, through:

{\dot{x}}_{t}^{θ} = f (x_{t}^{θ}, θ_{t}), x_{0}^{θ} = x_{0},

{\dot{p}}_{t}^{θ} = - \nabla_{x} H (x^{θ}, p_{t}^{θ}, θ_{t}), p^{θ} = - \nabla_{x} Φ (x_{T}^{θ}, y_{t}) .

Then,

θ^{*}

satisfies the PMP if and only if:

E_{μ_{0}} H (x_{t}^{θ^{*}}, p_{t}^{θ^{*}}, θ_{t}^{*}) \geq E_{μ_{0}} H (x_{t}^{θ^{*}}, p_{t}^{θ^{*}}, θ), \forall θ \in Θ .

(32)

Furthermore, observe that the mean-field PMP includes, as a special case, the necessary conditions for the optimality of the sampled optimal control problem (17). In order to point out this aspect, define the empirical measure:

μ_{0}^{N} : = \frac{1}{N} \sum_{i = 1}^{N} δ_{(x_{0}^{i}, y_{T}^{i})},

and apply the mean-field PMP with

μ_{0}^{N}

in place of

μ_{0}

to obtain:

\frac{1}{N} \sum_{i = 1}^{N} H (x_{t}^{θ^{*}, i}, p_{t}^{θ^{*}, i}, θ_{t}^{*}) \geq \frac{1}{N} \sum_{i = 1}^{N} H (x_{t}^{θ^{*}, i}, p_{t}^{θ^{*}, i}, θ), \forall θ \in Θ,

(33)

where

x_{t}^{θ^{*}, i}

and

p_{t}^{θ^{*}, i}

are defined through the input-target pair

(x_{0}^{i} . y_{T}^{i})

. Moreover, since

μ_{0}^{N}

is a random measure, (33) is a random equation whose solutions correspond to random variables.

Concerning possible numerical analysis of the DL algorithm based on the maximum principle, we refer to [4] (pp. 13–15), where a comparison can be found between usual gradient approaches to the discrete formulation of the Mean-Field PMP stated in Theorem 2, with a loss function based on Equation (24). The test and train losses of some variants of SGD algorithms are compared to the mean-field algorithm based on the discrete PMP. It is possible to observe that the latter algorithm is characterized by a better convergence rate, being then faster. This improvement is mainly due to the fact that it allows avoiding possibly getting stuck, caused by the flat regions, as clearly shown by the graphs reported in [4] (p. 14).

3.4. Connection between the HJB Equation and the PMP

In what follows, we provide connections between the global and local formulation of the HJB formalism via the PMP, exploiting the connection between Hamilton’s canonical equations (ODEs) and the Hamilton–Jacobi equations (PDEs). The Hamiltonian dynamics of Equations (27) and (28) describe the trajectory of a random variable that is completely determined by random variables

(x_{0}, y_{T})

. On the other hand, the optimality condition described by Equation (29) does not depend on the particular probability measure of the initial input-target pairs. Notice that the maximum principle can be expressed in terms of a Hamiltonian flow that depends on a probability measure in a suitable Wasserstein space and where Equation (29) is the corresponding lifting version. Analogously, in order to have both solutions in the same functional space, HJB has to be lifted in the

L^{2} (Ω, R^{d + l})

space.

Starting from the lifted Bellman Equation (23), lying in

L^{2} (Ω, R^{d + l})

, it is possible to apply the method of the characteristics and define the following system of equations, after introducing

P_{t} = D V (t, ξ_{t})

:

\{\begin{matrix} {\dot{ξ}}_{t} = D_{P} H (ξ_{t}, P_{t}) \\ {\dot{P}}_{t} = - D_{ξ} H (ξ_{t}, P_{t}) . \end{matrix}

(34)

Suppose Equation (34) has a solution that satisfies an initial condition given by:

P_{ξ_{0}} = μ_{0},

and a terminal one involving Bellman equation given by:

P_{T} = \nabla_{w} \bar{Φ} (ξ_{T}) .

We also assume that

\bar{θ} (ξ, P)

is the optimal control achieving the infimum in (21), as an interior point of

Θ

, then we can explicitly write the equation of the Hamiltonian w.r.t. the optimal control as:

H = E [P \cdot \bar{f} (ξ, \bar{θ} (ξ, P)) + \bar{L} (ξ, \bar{θ} (ξ, P))] .

Therefore, by the first order condition, we have:

E [\nabla_{θ} \bar{f} (ξ, \bar{θ} (ξ, P)) P + \nabla_{θ} \bar{L} (ξ, \bar{θ} (ξ, P))] = 0,

so that, taking into consideration Equation (34), we obtain the Hamilton-type equations:

\{\begin{matrix} {\dot{ξ}}_{t} = \bar{f} (ξ_{t}, \bar{θ} (ξ_{t}, P_{t})) \\ {\dot{P}}_{t} = - \nabla_{w} \bar{f} (ξ_{t}, \bar{θ} (ξ_{t}, P_{t})) P_{t} - \nabla_{w} \bar{L} (ξ_{t}, \bar{θ} (ξ_{t}, P_{t})) . \end{matrix}

(35)

Use

w = (x, y)

as concatenated variable and

θ^{*} = \bar{θ} (ξ_{t}, P_{t})

to remark that the last l components of

\bar{f}

are zero and by considering only the first d components:

\{\begin{matrix} {\dot{x}}_{t} = f (ξ_{t}, θ^{*}) \\ {\dot{p}}_{t} = - \nabla_{x} f (x_{t}, θ^{*}) p_{t} - \nabla_{x} L (x_{t}, θ_{t}^{*}) . \end{matrix}

(36)

Summing up: Hamilton’s equation of the system (36) can be viewed as the characteristic equations of the HJB equation in its lifted formulation described by Equation (23). Essentially, the PMP gives a necessary condition for any characteristic of the HJB equation: any characteristic originating from

μ_{0}

, that is the initial law of the random variables, must satisfy a local necessary optimal condition constituted by the mean-field PMP. This justifies the claim that the PMP constitutes a local condition if compared to the HJB equation.

3.5. Small-Time Uniqueness

A natural question is to understand when the PMP solutions also provide sufficient conditions to have optimality. We start by considering that the uniqueness of the solution implies sufficiency, and we investigate which assumptions are needed to have a unique solution of the mean-field PMP equations. Equations (27) and (28) model highly non-linear two-point boundary value problems in terms of

x^{*}

and

p^{*}

, which are also coupled through their laws. In general, even without the coupling, this kind of PDE is known to not have a unique solution; see, e.g., Ch. 7 of [22]. In order to prove the uniqueness, strong assumptions are needed to prove the small-time case.

Theorem 3

(Small-time uniqueness). Suppose that:

f is bounded;
f, L and Φ are continuously differentiable w.r.t. both x and θ with bounded and Lipschitz partial derivatives;
the distribution $μ_{0}$ has bounded support in $R^{d} \times R^{l}$ , which means there exists $M > 0$ such that $μ {(x, y) \in R^{d} \times R^{l} : | | x | | + | | y | | \leq 1} = 1$ ;
$H (x, p, θ)$ is strongly concave in θ and uniformly in x, $p \in R^{d}$ , i.e., $\nabla_{x x}^{2} H (x, p, θ) + λ_{0} ⪯ 0$ for some $λ_{0} \geq 0$ .

Then, for sufficiently small T, if

θ^{* 1}

and

θ^{* 2}

are solutions of the mean-field PMP derived in Theorem 2, then

θ^{* 1} = θ^{* 2}

.

Before proving the theorem, we report the following lemma, which provides an estimate of the difference between flow-maps driven by two different controls in terms of the small-time parameter T:

Lemma 1.

Let

θ^{1}

,

θ^{2} \in L^{\infty} ([0, T], Θ)

. Then, there exists a constant

T_{0}

such that for all

T \in [0, T_{0})

, it holds that:

| | x^{θ^{1}} - x^{θ^{2}} {| |}_{L^{\infty}} + | | p^{θ^{1}} - p^{θ^{2}} {| |}_{L^{\infty}} \leq C (T) | | θ^{1} - θ^{2} {| |}_{L^{\infty}},

where

C (T) > 0

satisfies

C (T) \to 0

as

T \to 0

.

Proof of Lemma 1.

We denote

δ θ : = θ^{1} - θ^{2}

,

δ x : = x^{θ^{1}} - x^{θ^{2}}

and

δ p = p^{θ^{1}} - p^{θ^{2}}

, respectively. Since

x_{0}^{θ^{1}} = x_{2}^{θ^{1}} = x_{0}

, integrating the respective ODEs while considering the first two assumptions of Theorem 6 leads to:

| | δ x_{t} | | \leq \int_{0}^{t} | | f (x_{s}^{θ^{1}}, θ_{s}^{1}) - f (x_{s}^{θ^{1}}, θ_{s}^{2}) | | d s \leq K_{L} \int_{0}^{T} | | δ x_{s} | | d s + K_{L} \int_{0}^{T} | | δ p_{s} | | d s,

and so:

{| | δ x | |}_{L^{\infty}} \leq K_{L} {T | | δ x | |}_{\infty} + K_{L} {T | | δ θ | |}_{\infty} .

Now, if

T \leq T_{0} : = \frac{1}{K_{L}}

. We then have:

{| | δ x | |}_{L^{\infty}} \leq \frac{K_{L} T}{1 - K_{L} T} {| | δ θ | |}_{L^{\infty}} .

(37)

Similarly:

| | δ p_{t} | | \leq K_{L} | | δ x_{T} | | + K_{L} \int_{t}^{T} | | δ x_{s} | | d s + K_{L} \int_{t}^{T} | | δ p_{s} | | d s

| | δ p | | \leq (K_{L} + K_{L} T) {| | δ x | |}_{L^{\infty}} + K_{L} {T | | δ p | |}_{L^{\infty}},

and hence:

{| | δ p | |}_{L^{\infty}} \leq \frac{K_{L} (1 + T)}{1 - K_{L} T} {| | δ x | |}_{L^{\infty}},

which combined with Equation (37) proves the lemma. □

Through the above estimate, it is now possible to prove Theorem 3.

Proof of Theorem 3.

In what follows, we exploit [5] (p. 29). By uniform strong concavity, the function

θ \to E_{μ_{0}} H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ)

is strongly concave. Thus, consider a

λ_{0} > 0

such that:

\frac{λ_{0}}{2} | | θ_{t}^{1} - θ_{t}^{2} {| |}^{2} \leq [E_{μ_{0}} \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{2}) - E_{μ_{0}} \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{1})] \cdot (θ_{t}^{1} - θ_{t}^{2}) .

A similar expression holds for

θ \to E_{μ_{0}} H (x_{t}^{θ^{2}}, p_{t}^{θ^{2}}, θ)

, and by combining the two inequalities, also exploiting the smoothness of functions involved, we have:

\begin{matrix} λ_{0} | | θ_{t}^{1} - θ_{t}^{2} {| |}^{2} & \leq [E_{μ_{0}} \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{2}) - E_{μ_{0}} \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{1})] \cdot (θ_{t}^{1} - θ_{t}^{2}) \\ + [E_{μ_{0}} \nabla H (x_{t}^{θ^{2}}, p_{t}^{θ^{2}}, θ_{t}^{1}) - E_{μ_{0}} \nabla H (x_{t}^{θ^{2}}, p_{t}^{θ^{2}}, θ_{t}^{2})] \cdot (θ_{t}^{1} - θ_{t}^{2}) \\ \leq E_{μ_{0}} | | \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{1}) - \nabla H (x_{t}^{θ^{2}}, p_{t}^{θ^{2}}, θ_{t}^{1}) | | | | θ_{t}^{1} - θ_{t}^{2} | | \\ + E_{μ_{0}} | | \nabla H (x_{t}^{θ^{1}}, p_{t}^{θ^{1}}, θ_{t}^{2}) - \nabla H (x_{t}^{θ^{2}}, p_{t}^{θ^{2}}, θ_{t}^{2}) | | | | \leq θ_{t}^{1} - θ_{t}^{2} | | \\ K_{L} {| | δ θ | |}_{L^{\infty}} {(| | δ x | |}_{L^{\infty}} + {| | δ p | |}_{L^{\infty}}) . \end{matrix}

Combining the above inequality with Lemma 1, we have:

{| | δ θ | |}_{L^{\infty}}^{2} \leq \frac{K_{L}}{λ_{0}} {C (T) | | δ θ | |}_{L^{\infty}}^{2} .

However,

C (T) = o (1)

, and by taking T sufficiently small, so that

K_{L} C (T) < λ_{0}

, this implies that

{| | δ θ | |}_{L^{\infty}} = 0

. For more details, see [5] (Lemma 2, p. 28). □

Within the ML setting, the small-period T roughly corresponds to the regime where the reachable set of the forward dynamics is small. This can be interpreted as the case where the model has low capacity or low expressive power. In other words, Theorem 3 states that the optimal solution is unique if the network capacity is low, even when there is a huge number of parameters. It is essential to underline that the hypothesis of the strong concavity of the Hamiltonian H does not imply convexity for the loss function J, which can also be highly non-convex due to the non-linear transformation introduced by the activation functions

σ

.

4. Conclusions

In this paper, typical NN structures are considered as the evolution of a dynamical system to rigorously state the population risk minimization problem related to DL. Two parallel and connected perspectives are followed. The result expressed in Theorem 2 represents the generalization of the PMP in the calculus of variations and a local characterization of optimal trajectories derived from HJB Equation (21). Deriving the necessary condition for the optimality of the PMP provides several advantages: there is no reference to the derivative w.r.t. the probability measure in the Wasserstein sense; the parameter set

Θ

is highly general; maximization is point-wise in t, once

x^{⋆}

and

p^{⋆}

are known. Moreover, it allows the possibility to develop a learning algorithm without referring to the classical methods of DL such as the SGD. As an interest point with a general relevance within the optimal control theory, we remark that the controls, and then the weights

θ

, are only assumed to be measurable and essentially bounded in time. In particular, we allow for discontinuous terms. Even in the case where the number of parameters is infinite, as in the MFG setting, it is feasible to derive non-trivial stability estimates. This aspect results in contrast with generalized bounds based on classical statistical learning, where the increasing number of parameters adversely affects the generalization. On the other hand, it will be interesting to consider algorithms with weights having only discrete values, e.g., binary, values, requiring small memory amounts, with corresponding efficiency and speed improvements. A further direction of our ongoing research is based on the study of the PMP from a discrete-time perspective in order to relate the theoretical framework directly to the applications.

Author Contributions

Conceptualization, L.D.P. and M.G.; Investigation, L.D.P. and M.G.; Methodology, L.D.P. and M.G.; Writing—original draft, L.D.P. and M.G.; Writing—review and editing, L.D.P. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

Cardaliguet, P. Notes on Mean-Field Games; Technical Report, from P.-L. Lions’ lectures at Collège de France; Collège de France: Paris, France, 2012. [Google Scholar]
Touzi, N. Stochastic Control and Application to Finance; Chapter 1–4; Ecole Polytechnique: Paris, France, 2018. [Google Scholar]
Carmona, R.; Delarue, F.; Lachapelle, A. Control of Mc-Kean-Vlasov Dynamics Versus Mean-Field Games; Technical Report; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Li, Q.; Chen, L.; Tai, C.; E, W. Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 2018, 18, 1–29. [Google Scholar]
Weinan, E.; Han, J.; Li, Q. A Mean-Field Optimal Control Formulation of Deep Learning. arXiv 2018, arXiv:1807.01083. [Google Scholar]
Li, Q.; Lin, T.T.; Shen, Z. Deep Learning via Dynamical Systems: An Approximation Perspective. arXiv 2019, arXiv:1912.10382v1. [Google Scholar]
Athans, M.; Falb, P.L. Optimal Control: An Introduction to the Theory and Its Applications; Courier Corporation, Dover Publications, Inc.: Mineola, NY, USA, 2013. [Google Scholar]
Lacker, D. Mean-Field Games and Interacting Particle Systems. 2018. Available online: http://www.columbia.edu/~dl3133/MFGSpring2018.pdf (accessed on 17 November 2020).
Gangbo, W.; Kim, H.K.; Pacini, T. Differential forms on Wasserstein space and infinite-dimensional Hamiltonian systems. arXiv 2009, arXiv:0807.1065. [Google Scholar] [CrossRef]
Sagitov, S. Weak Convergence of Probability Measures; Chalmers University of Technology and Gothenburg University, 2015. Available online: http://www.math.chalmers.se/~serik/WeakConv/C-space.pdf (accessed on 17 November 2020).
Billingsley, P. Weak Convergence in Metric Spaces; Wiley Series in Probability and Statistics: Hoboken, NJ, USA, 1999. [Google Scholar]
Bergström, H. Weak Convergence of Measures. In A Volume in Probability and Mathematical Statistics: A Series of Monographs and Textbooks; Springer: New York, NY, USA, 1982. [Google Scholar]
Benazzoli, C.; Campi, L.; Di Persio, L. Mean field games with controlled jump–diffusion dynamics: Existence results and an illiquid interbank market model. Stoch. Process. Their Appl. 2020, 130, 6927–6964. [Google Scholar] [CrossRef]
Benazzoli, C.; Campi, L.; Di Persio, L. ϵ-Nash equilibrium in stochastic differential games with mean-field interaction and controlled jumps. Stat. Probab. Lett. 2019. [Google Scholar] [CrossRef]
Carrillo, J.A.; Pimentel, E.A.; Voskanyan, V.K. On a mean-field optimal control problem. Nonlinear Anal. Theory Methods Appl. 2020, 199, 112039. [Google Scholar] [CrossRef]
Liberzon, D. Calculus of Variations and Optimal Control Theory: A Concise Introduction; Princeton University Press: Princeton, NJ, USA, 2012. [Google Scholar]
Ma, J.; Yong, J. Forward-Backward Stochastic Differential Equations and Their Applications; Springer: New York, NY, USA, 2007. [Google Scholar]
Hadikhanloo, S. Learning in Mean-Field Games. Ph.D. Thesis, Dauphine Universitè Paris, Paris, France, 2018. [Google Scholar]
Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions; Stochastic Modelling and Applied Probability; Springer: New York, NY, USA, 2006. [Google Scholar]
Frankowska, H. Hamilton-Jacobi equations: Viscosity solutions and generalized gradients. J. Math. Anal. Appl. 1989, 141, 1. [Google Scholar] [CrossRef] [Green Version]
Weinan, E. A Proposal on Machine Learning via Dynamical Systems. arXiv 2019, arXiv:1912.10382v1. [Google Scholar]
Kelley, W.G.; Peterson, A.C. The Theory of Differential Equations: Classical and Qualitative; Springer Science and Business Media: Berlin, Germany, 2010. [Google Scholar]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Persio, L.D.; Garbelli, M. Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective. Symmetry 2021, 13, 14. https://doi.org/10.3390/sym13010014

AMA Style

Persio LD, Garbelli M. Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective. Symmetry. 2021; 13(1):14. https://doi.org/10.3390/sym13010014

Chicago/Turabian Style

Persio, Luca Di, and Matteo Garbelli. 2021. "Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective" Symmetry 13, no. 1: 14. https://doi.org/10.3390/sym13010014

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Learning and Mean-Field Games: A Stochastic Optimal Control Perspective

Abstract

1. Introduction

2. Problem Formulation and Preliminaries

2.1. Wasserstein Metrics

2.2. Stochastic Optimal Control Problem

2.3. Mean-Field Games

3. Main Result

3.1. Neural Network as a Dynamical System

3.2. HJB Equation

3.3. Mean-Field Pontryagin Maximum Principle

3.4. Connection between the HJB Equation and the PMP

3.5. Small-Time Uniqueness

4. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI