Article

The Finite-Time Turnpike Property in Machine Learning

by
Martin Gugat
Department Mathematik, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Cauerstr. 11, 91058 Erlangen, Germany
Machines 2024, 12(10), 705; https://doi.org/10.3390/machines12100705
Submission received: 14 August 2024 / Revised: 23 September 2024 / Accepted: 2 October 2024 / Published: 4 October 2024
(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)

Abstract

The finite-time turnpike property describes the situation in an optimal control problem where an optimal trajectory reaches the desired state before the end of the time interval and remains there. We consider a machine learning problem with a neural ordinary differential equation that can be seen as a homogenization of a deep ResNet. We show that with an appropriate scaling of the quadratic control cost and the non-smooth tracking term, the optimal control problem has the finite-time turnpike property; that is, the desired state is reached within the time interval and the optimal state remains there until the terminal time $T$. The time $t_0$ where the optimal trajectories reach the desired state can serve as an additional design parameter. Since ResNets can be viewed as discretizations of neural ODEs, the choice of $t_0$ corresponds to the choice of the number of layers, that is, the depth of the neural network. The choice of $t_0$ allows us to achieve a compromise between the depth of the network and the size of the optimal system parameters, which we hope will be useful for determining optimal depths of neural network architectures in the future.

1. Introduction

We study a system that is governed by a neural ODE that can be considered a continuous-time ResNet. Before we can outline the system, some notation is necessary.
The activation function $\sigma$ is assumed to be continuously differentiable and Lipschitz continuous with a Lipschitz constant that is less than or equal to 1, for example
$$ \sigma(z) = \tanh(z) $$
or $\sigma(z) = \frac{1}{1 + \exp(-z)}$, where for $z \in \mathbb{R}^d$ the function $\sigma$ acts component-wise; that is, $\sigma(z) \in \mathbb{R}^d$ with the $i$-th component given by, e.g., $(\sigma(z))_i = \tanh(z_i)$ ($i \in \{1, \ldots, d\}$).
Let a real number $T > 0$ and natural numbers $d$ and $p$ in $\{1, 2, 3, \ldots\}$ be given. For $i \in \{1, \ldots, p\}$ and almost every $t \in [0, T]$, let $w_i(t) \in \mathbb{R}^d$ and $a_i(t) \in \mathbb{R}^d$ be given. The $w_i(t)$ are the columns of the matrix $W(t) \in \mathbb{R}^{d \times p}$ and the $a_i(t)$ are the columns of the matrix $A(t) \in \mathbb{R}^{d \times p}$. For almost every $t \in [0, T]$, let the bias vector $b(t) \in \mathbb{R}^p$ with the components $b_i(t)$ ($i \in \{1, \ldots, p\}$) be given. In order to state the required regularity assumptions, we introduce the space
$$ X(T) = \Big\{ \text{measurable functions } (W(t), A(t), b(t)) \text{ defined on } (0, T) \ \text{such that} \ \int_0^T \|W(t)\|^2 + \|A(t)\|^2 + \|b(t)\|^2 \, dt < \infty \Big\}. $$
For parameters $(W, A, b) \in X(T)$, the system $\mathbf{S}$ is defined as follows:
$$ \mathbf{S}: \qquad x(0) = x_0 \in \mathbb{R}^d, \qquad x'(t) = \sum_{i=1}^p \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t) \tag{1} $$
(see for example [1,2]).
The motivation to study (1) is that a time-discrete version can be considered as a residual neural network (ResNet), an architecture that has been very useful in many applications; see [3] for identification problems with physics-informed neural ordinary differential equations, [4] for applications in image classification for the detection of colorectal cancer, and [5] for examples in image registration and classification problems. A time-discrete version can be obtained, for example, by an explicit Euler discretization of (1); see (28) in Section 4. The fact that ’ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations’ has been discussed in detail in [6].
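To make the connection to ResNets concrete, the following sketch (Python with NumPy; the step size, the tanh activation and all variable names are illustrative assumptions, not taken from the paper) implements one explicit Euler step of (1). The resulting update has the form of a ResNet layer: the identity (skip connection) plus a parameterized correction.

```python
import numpy as np

def neural_ode_rhs(x, W, A, b):
    """Right-hand side of (1): sum_i sigma(a_i . x + b_i) * w_i = W @ sigma(A^T x + b).

    W and A are d-by-p matrices whose columns are w_i and a_i; b is a vector in R^p.
    """
    return W @ np.tanh(A.T @ x + b)          # tanh as an example activation

def euler_step(x, W, A, b, h):
    """One explicit Euler step: x_k = x_{k-1} + h * f(x_{k-1}), i.e., a ResNet layer."""
    return x + h * neural_ode_rhs(x, W, A, b)

# Illustrative forward pass with random parameters (d = 3 states, p = 3 hidden units).
rng = np.random.default_rng(0)
d, p, h = 3, 3, 0.1
x = np.array([1.0, 1.0, 1.0])
for _ in range(10):                          # 10 Euler steps correspond to 10 layers
    W, A, b = rng.normal(size=(d, p)), rng.normal(size=(d, p)), rng.normal(size=p)
    x = euler_step(x, W, A, b, h)
```

In this picture, the depth of the ResNet corresponds to the number of Euler steps, which is why the time $t_0$ at which the turnpike is reached can later be read as a choice of the network depth.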
For the given time horizon $T > 0$, we study an optimal control problem on the time interval $[0, T]$. The desired state is given by $x_T \in \mathbb{R}^d$; that is, $x_T$ denotes the desired output of the system. Let $t_0 \in (0, T)$ be given. For the training of the system, we study the loss function with a tracking term
$$ Q(W, A, b) = \int_{t_0}^{T} |x(t) - x_T| + |x'(t)| \, dt \tag{2} $$
with the non-smooth norm $|z| = \sum_{i=1}^d |z_i|$. For our result, the inclusion of the derivative $x'$ in the definition of $Q$ in (2) is essential, since due to this inclusion the loss function multiplied by the factor $\frac{1}{T - t_0} + 1$ is an upper bound for the maximum norm of $x - x_T$ on $[t_0, T]$; see inequality (9) in Lemma 1. This allows us to prove the finite-time turnpike property in Theorem 1. We explain the turnpike phenomenon below.
We define the control cost (regularization term)
$$ R(W, A, b) = \int_0^T \tfrac{1}{2}\, \|W(t)\|^2 + \tfrac{1}{2}\, \|A(t)\|^2 + \tfrac{1}{2}\, \|b(t)\|^2 \, dt. \tag{3} $$
Here, $\|W(t)\|$ denotes the Frobenius norm of $W(t)$.
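As a reading aid, here is a small sketch (Python/NumPy; the trapezoidal quadrature, the finite-difference derivative and all variable names are illustrative assumptions, not part of the paper) of how the tracking term (2) and the control cost (3) could be evaluated for a trajectory and parameters sampled on a time grid.

```python
import numpy as np

def tracking_term_Q(t, x, x_T, t0):
    """Approximation of Q in (2): integral over [t0, T] of |x - x_T|_1 + |x'|_1."""
    dx_dt = np.gradient(x, t, axis=0)                  # finite-difference estimate of x'
    integrand = np.abs(x - x_T).sum(axis=1) + np.abs(dx_dt).sum(axis=1)
    mask = t >= t0
    return np.trapz(integrand[mask], t[mask])

def control_cost_R(t, W, A, b):
    """Approximation of R in (3): integral of 0.5 * (||W||_F^2 + ||A||_F^2 + |b|^2)."""
    integrand = 0.5 * ((W ** 2).sum(axis=(1, 2)) + (A ** 2).sum(axis=(1, 2)) + (b ** 2).sum(axis=1))
    return np.trapz(integrand, t)
```

Here `t` is a grid of time points, `x` has shape `(len(t), d)`, and `W`, `A`, `b` hold the parameter values on the grid with shapes `(len(t), d, p)`, `(len(t), d, p)` and `(len(t), p)`.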
Lemma 10 in [7] states that system (1) is exactly controllable; that is, the terminal condition
$$ x(t_0) = z \tag{4} $$
can be satisfied for all $t_0 > 0$. To be precise, for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $u_{exact} = (W_{exact}, A_{exact}, b_{exact})$ such that for the state $\tilde{x}$ that is generated by (1) with the initial condition $\tilde{x}(0) = x_0$, we have $\tilde{x}(t_0) = z$ and
$$ \|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|z - x_0\|. \tag{5} $$
Also, the linearized system is exactly controllable in the sense that for all $t_0 > 0$ there exists a constant $C_e > 0$ such that for all $z \in \mathbb{R}^d$ we can find a control $\delta u$ such that for the state $\delta x$ that is generated by the linearized system stated below with the initial condition $\delta x(0) = 0$, we have $\delta x(t_0) = z$ and
$$ \|\delta u\|_{L^2(0, t_0)} \le C_e\, \|z\|. \tag{6} $$
The linearized system at a given $u = (W, A, b)$, for the variation $\delta x$ of the state that is generated by a variation $\delta u = (\delta W, \delta A, \delta b)$ of the control, is
$$ \delta x'(t) = \sum_{i=1}^p \sigma\big( a_i(t)^\top x(t) + b_i(t) \big)\, \delta w_i(t) + \sum_{i=1}^p \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, x(t)^\top \delta a_i(t) $$
$$ \qquad + \sum_{i=1}^p \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, \delta b_i(t) + \sum_{i=1}^p \sigma'\big( a_i(t)^\top x(t) + b_i(t) \big)\, w_i(t)\, a_i(t)^\top \delta x(t) $$
with the initial condition $\delta x(0) = 0$.
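Since the linearized system is the main tool in the descent argument of Section 2, the following sketch (Python/NumPy; the explicit Euler time stepping, the tanh activation and all names are illustrative assumptions) checks it against a finite-difference directional derivative of the nonlinear flow (1).

```python
import numpy as np

def rhs(x, W, A, b):
    """Right-hand side of (1) with sigma = tanh: W @ sigma(A^T x + b)."""
    return W @ np.tanh(A.T @ x + b)

def rhs_linearized(x, dx, W, A, b, dW, dA, db):
    """Right-hand side of the linearized system at (W, A, b) along (dW, dA, db)."""
    z = A.T @ x + b
    s, ds = np.tanh(z), 1.0 - np.tanh(z) ** 2          # sigma(z) and sigma'(z)
    return dW @ s + W @ (ds * (dA.T @ x + A.T @ dx + db))

rng = np.random.default_rng(1)
d, p, T, n = 3, 3, 1.0, 200
h = T / n
W, A, b = rng.normal(size=(d, p)), rng.normal(size=(d, p)), rng.normal(size=p)
dW, dA, db = rng.normal(size=(d, p)), rng.normal(size=(d, p)), rng.normal(size=p)

eps = 1e-6
x, x_pert, dx = np.ones(d), np.ones(d), np.zeros(d)    # x(0) = x_0, delta x(0) = 0
for _ in range(n):                                     # explicit Euler in time
    dx = dx + h * rhs_linearized(x, dx, W, A, b, dW, dA, db)
    x_pert = x_pert + h * rhs(x_pert, W + eps * dW, A + eps * dA, b + eps * db)
    x = x + h * rhs(x, W, A, b)

print(np.max(np.abs((x_pert - x) / eps - dx)))         # small: linearization is consistent
```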
A universal approximation theorem for the corresponding time-discrete case with recurrent neural networks can be found in the seminal paper [8] by Cybenko; see also [9,10,11,12].
For a parameter $\gamma > 0$, define
$$ J(W, A, b) = \gamma\, Q(W, A, b) + R(W, A, b) \tag{7} $$
with $Q$ as defined in (2) and $R$ as defined in (3). We study the minimization (training) problem
$$ \mathbf{P}(T, \gamma): \qquad \min_{(W, A, b) \in X(T)} J(W, A, b). $$
Our main result is that if $\gamma$ is chosen sufficiently large, the optimal control problem $\mathbf{P}(T, \gamma)$ has the finite-time turnpike property; that is, the desired state is already reached within the time interval $[0, T]$ and the optimal state remains there until the end of the time interval.
Figure 1 shows an example of the graph of the function $t \mapsto |x(t)|$ ($t \in [0, T]$), where $x$ is an optimal trajectory that has the finite-time turnpike property. Our main result in Theorem 1 states that the optimal trajectories of $\mathbf{P}(T, \gamma)$ vanish on the interval $[t_0, T]$.
The finite-time turnpike property has been studied for example in [13,14,15]. In the first two references, the finite-time turnpike property is achieved by the non-smoothness of the objective functional. In this paper, we use a similar approach adapted to the framework of neural ordinary differential equations.
The finite-time turnpike property is an extremal case of the celebrated turnpike property that has originally been studied in economics. Turnpike analysis investigates how the solutions of dynamic optimal control problems with a time evolution are related to the solutions of the corresponding static problems, where the time derivatives are set to zero and the initial conditions are cancelled. It turns out that, often for large time horizons, on large parts of the time interval the solution of the dynamic problem is very close to the solution of the corresponding static problem. For an overview of the turnpike property, see [16,17,18,19] and the numerous references therein.
In the case of the finite-time turnpike property, after a finite time, the solution of the dynamic problem coincides with the solution of the static problem. The exponential turnpike property for ResNets and beyond has been studied for example in [20], but not the finite-time turnpike property.
Our approach yields an optimization problem with a non-smooth objective functional without terminal constraints. For the numerical solution of this type of problem, a relaxation approach can be used; see, for example, [21]. Due to the finite-time turnpike property, we obtain learning problems without terminal conditions (that are easier to solve than problems with terminal constraints), where the optimal trajectories still attain the desired terminal states exactly.
Since the objective functional contains a tracking term of $L^1$-norm type, our problem is related to studies in compressed sensing, where this type of objective functional is used to enhance sparsity; see [22]. For a study about sparsity in Bregman machine learning, see [23].
In Section 3, we discuss the well-posedness of $\mathbf{P}(T, \gamma)$. We present a result about the existence of solutions of $\mathbf{P}(T, \gamma)$ for a fixed matrix $A$. This implies the existence of a solution for the problem where the feasible set contains only constant matrices $A$ that are independent of $t$; see Remark 2.
In Section 4, numerical examples are presented that illustrate that the finite-time turnpike property is also visible in the time-discrete case.

2. The Finite-Time Turnpike Property

The following Theorem contains our main result, which states that the control cost entails the finite-time turnpike property.
Theorem 1.
For each sufficiently large $\gamma > 0$, each optimal trajectory for $\mathbf{P}(T, \gamma)$ satisfies
$$ x(t) = x_T, \qquad t \in [t_0, T]; \tag{8} $$
that is, $\mathbf{P}(T, \gamma)$ has the finite-time turnpike property. For $t \ge t_0$, the optimal parameters satisfy $W(t) = 0$, $A(t) = 0$ and $b(t) = 0$. The optimal parameters remain unchanged if $\gamma$ is enlarged further or if $T$ is enlarged further.
Remark 1.
The condition in Theorem 1 that $\gamma$ is sufficiently large is satisfied if inequality (19), which is stated below, holds. This requires a large value of $\gamma$ if $t_0$ is chosen very close to $T$. In fact, as $t_0 \in (0, T)$ approaches $T$, the lower bound for $\gamma$ converges to infinity.
For the proof of Theorem 1, we need a result about the embedding of the Sobolev space $W^{1,1}$ into the space of continuous functions; this space is related to our learning problem due to the definition of $Q$ in (2). Let
$$ L^1(0, T) = \Big\{ f : [0, T] \to \mathbb{R} \ : \ f \ \text{is measurable and} \ \int_0^T |f(t)| \, dt < \infty \Big\} $$
and consider the Sobolev space
$$ W^{1,1}(0, T) = \big\{ f \in L^1(0, T) : f' \in L^1(0, T) \big\}. $$
Lemma 1.
Let $t_0 \in [0, T)$. For all $x \in W^{1,1}(t_0, T)$, we have the inequality
$$ \max_{t \in [t_0, T]} |x(t)| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt. \tag{9} $$
Proof of Lemma 1.
For $t_1, t_2 \in [t_0, T]$ with $t_1 \le t_2$, we have
$$ |x(t_1) - x(t_2)| = \Big| \int_{t_1}^{t_2} x'(t) \, dt \Big| \le \int_{t_1}^{t_2} |x'(t)| \, dt. $$
Thus, $x$ is continuous on $[t_0, T]$. Hence, there exists a point $t_\ast \in [t_0, T]$ with
$$ |x(t_\ast)| = \min_{t \in [t_0, T]} |x(t)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt. $$
Thus, for all $\tau \in [t_0, T]$ the following inequality holds:
$$ |x(\tau)| \le |x(t_\ast)| + |x(t_\ast) - x(\tau)| \le \frac{1}{T - t_0} \int_{t_0}^{T} |x(t)| \, dt + \int_{t_0}^{T} |x'(t)| \, dt \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x(t)| + |x'(t)| \, dt. \qquad \square $$
Now, we are prepared to detail the proof of Theorem 1.
Proof of Theorem 1.
  • Case 1: If $x_0 = x_T$, the parameters $u = (W, A, b) = (0, 0, 0)$ generate the constant state $x(t) = x_T$. Hence, $u = 0$ solves $\mathbf{P}(T, \gamma)$ and the assertion follows.
  • Case 2: Now, we assume that $x_0 \ne x_T$. For $u = (W, A, b) \in X(T)$, define the cost
$$ C_{(0, t_0)}(u) = \int_0^{t_0} \tfrac{1}{2}\, \|W(t)\|^2 + \tfrac{1}{2}\, \|A(t)\|^2 + \tfrac{1}{2}\, \|b(t)\|^2 \, dt. \tag{10} $$
Consider the non-smooth tracking term $Q(u)$ as defined in (2). For $u \in X(T)$, define the objective functional
$$ K_T(u) = C_{(0, t_0)}(u) + \gamma\, Q(u). \tag{11} $$
We consider the auxiliary problem
$$ \mathbf{Q}(T): \qquad \min_{u \in X(T)} K_T(u). $$
We show by an indirect proof that for each solution $u^\ast$ of $\mathbf{Q}(T)$, we have
$$ Q(u^\ast) = 0. $$
Suppose that there exists a solution $u^\ast = (W^\ast, A^\ast, b^\ast)$ of $\mathbf{Q}(T)$ such that $Q(u^\ast) > 0$. Then, for the corresponding optimal state $x^\ast$ that is generated by system (1), we have $x^\ast(t_0) \ne x_T$; otherwise (that is, if $x^\ast(t_0) = x_T$), we could proceed as follows: switch off the control at $t_0$ and continue with the zero control $(0, 0, 0)$ for $t \in (t_0, T]$, which generates the constant state $x_T$ on $(t_0, T]$. This approach would strictly improve the performance, since for the concatenated control $u_M(t) = u^\ast(t)$ for $t \in [0, t_0]$, $u_M(t) = 0$ for $t \in (t_0, T]$, we have $C_{(t_0, T)}(u_M) = 0$, $Q(u_M) = 0$ and thus $K_T(u_M) = C_{(0, t_0)}(u^\ast) < C_{(0, t_0)}(u^\ast) + \gamma\, Q(u^\ast) = K_T(u^\ast)$, which is a contradiction, since $u^\ast$ is an optimal control for $\mathbf{Q}(T)$.
Define the auxiliary state
$$ \tilde{x}(t_0) = x_T + \frac{1}{Q(u^\ast)} \big( x^\ast(t_0) - x_T \big). \tag{12} $$
The exact controllability of the linearized system implies that we can find a control $\tilde{v} \in L^2(0, t_0)$ that, due to (6) and (12), satisfies the inequality
$$ \|\tilde{v}\|_{L^2(0, t_0)} \le C_e\, \|\tilde{x}(t_0) - x_T\| = C_e\, \frac{1}{Q(u^\ast)}\, \|x^\ast(t_0) - x_T\| \tag{13} $$
and that generates the state $\tilde{V}$ with $\tilde{V}(0) = 0$ and $\tilde{V}(t_0) = \tilde{x}(t_0) - x_T$.
Due to (9), we have
$$ \|x^\ast(t_0) - x_T\| \le \Big( \frac{1}{T - t_0} + 1 \Big) \int_{t_0}^{T} |x^\ast(t) - x_T| + |{x^\ast}'(t)| \, dt = \Big( \frac{1}{T - t_0} + 1 \Big) Q(u^\ast). \tag{14} $$
Thus, (13) implies
$$ \|\tilde{v}\|_{L^2(0, t_0)} \le C_e \Big( \frac{1}{T - t_0} + 1 \Big). $$
For a step size $\varepsilon \in (0, Q(u^\ast))$, define
$$ \lambda = 1 - \frac{\varepsilon}{Q(u^\ast)} \in (0, 1). \tag{15} $$
Consider the control $u$ that, for $t \in (0, t_0]$, is defined by the equation
$$ u(t) = u^\ast(t) - \varepsilon\, \tilde{v}(t) \tag{16} $$
and, for $t \in (t_0, T)$, we define $\tilde{v} = (\delta W, \delta A, \delta b)$ with
$$ \delta W(t) = \frac{1}{Q(u^\ast)}\, W^\ast(t), \qquad \delta A(t) = \frac{1}{Q(u^\ast)}\, A^\ast(t), \qquad \delta b(t) = \frac{1}{Q(u^\ast)}\, b^\ast(t). $$
We will show that if $\gamma > 0$ is sufficiently large, $\tilde{v}$ yields a descent direction, in the sense that a small step of the form $u = u^\ast - \varepsilon\, \tilde{v}$ improves the performance of the control $u^\ast$. This can be seen as follows.
For the state $x = x^\ast + \delta x$ that is generated with the solution $\delta x$ of the linearized system with the initial condition $\delta x(0) = 0$ and the control variation $-\varepsilon\, \tilde{v}$, we have, at $t_0$,
$$ \delta x(t_0) = -\varepsilon\, \big( \tilde{x}(t_0) - x_T \big) = -\frac{\varepsilon}{Q(u^\ast)} \big( x^\ast(t_0) - x_T \big), $$
where the last step follows from (12).
On the following time interval $[t_0, T]$, for the corresponding state $x = x^\ast + \delta x$ that is generated with the solution $\delta x$ of the linearized system starting with
$$ \delta x(t_0) = -\frac{\varepsilon}{Q(u^\ast)} \big( x^\ast(t_0) - x_T \big) $$
and the control variation $-\varepsilon\, \tilde{v}$, we have
$$ \delta x(t) = -\frac{\varepsilon}{Q(u^\ast)} \big( x^\ast(t) - x_T \big) + O\big( \varepsilon^2 \|\tilde{v}\|^2 \big). $$
For $t \in [t_0, T]$, due to the definition of $\lambda$ in (15), this yields
$$ x(t) - x_T = \lambda\, \big( x^\ast(t) - x_T \big) + O\big( \varepsilon^2 \|\tilde{v}\|^2 \big). $$
Thus, for the tracking term with $u$ defined in (16), we have the bound
$$ Q(u) = \lambda\, Q(u^\ast) + O(\varepsilon^2) = \Big( 1 - \frac{\varepsilon}{Q(u^\ast)} \Big)\, Q(u^\ast) + O(\varepsilon^2). $$
For the control cost defined in (10), we have
$$ C_{(0, t_0)}(u) = \langle u^\ast - \varepsilon\, \tilde{v},\, u^\ast - \varepsilon\, \tilde{v} \rangle_{L^2(0, t_0)} = C_{(0, t_0)}(u^\ast) - 2\, \varepsilon\, \langle u^\ast, \tilde{v} \rangle_{L^2(0, t_0)} + \varepsilon^2\, C_{(0, t_0)}(\tilde{v}). $$
With $K_T$ as defined in (11), consider the function
$$ p(\varepsilon) = K_T(u^\ast - \varepsilon\, \tilde{v}) = C_{(0, t_0)}(u^\ast) - 2\, \varepsilon\, \langle u^\ast, \tilde{v} \rangle_{L^2(0, t_0)} + \varepsilon^2\, C_{(0, t_0)}(\tilde{v}) + \gamma \Big[ \Big( 1 - \frac{\varepsilon}{Q(u^\ast)} \Big)\, Q(u^\ast) + O(\varepsilon^2) \Big]. $$
Then, we have
$$ p'(\varepsilon) = -2\, \langle u^\ast, \tilde{v} \rangle_{L^2(0, t_0)} + 2\, \varepsilon\, C_{(0, t_0)}(\tilde{v}) - \gamma + O(\varepsilon). $$
This yields
$$ p'(0) = -2\, \langle u^\ast, \tilde{v} \rangle_{L^2(0, t_0)} - \gamma. \tag{17} $$
The exact controllability of (1) implies that there is a control $u_{exact} \in L^2(0, t_0)$ with (due to (5))
$$ \|u_{exact}\|_{L^2(0, t_0)} \le C_e\, \|x_0 - x_T\| $$
that generates the state $V_{exact}$ with $V_{exact}(0) = x_0$ and $V_{exact}(t_0) = x_T$. For $t > t_0$, let $u_{exact}(t) = 0$. Since $u_{exact}$ is feasible for $\mathbf{Q}(T)$, this yields the inequality
$$ C_{(0, t_0)}(u^\ast) \le K_T(u^\ast) \le K_T(u_{exact}) = \|u_{exact}\|_{L^2(0, t_0)}^2 \le C_e^2\, \|x_0 - x_T\|^2. $$
Hence, we have
$$ \big| \langle u^\ast, \tilde{v} \rangle_{L^2(0, t_0)} \big| \le C_e\, \|x_0 - x_T\|\, \|\tilde{v}\|_{L^2(0, t_0)} \le \|x_0 - x_T\|\, C_e^2\, \Big( \frac{1}{T - t_0} + 1 \Big). \tag{18} $$
Thus, if
$$ \gamma > 2\, \|x_0 - x_T\|\, C_e^2\, \Big( \frac{1}{T - t_0} + 1 \Big), \tag{19} $$
then, due to (17) and (18), we have $p'(0) \le -\gamma + 2\, \|x_0 - x_T\|\, C_e^2\, \big( \frac{1}{T - t_0} + 1 \big) < 0$. This implies that for $\varepsilon > 0$ sufficiently small, we have
$$ K_T(u^\ast - \varepsilon\, \tilde{v}) < K_T(u^\ast), $$
which is a contradiction to the optimality of $u^\ast$.
Hence, for any optimal control $u^\ast$ of $\mathbf{Q}(T)$, we have
$$ Q(u^\ast) = 0. \tag{20} $$
With inequality (14), this implies that for each optimal state, we have
$$ x^\ast(t_0) = x_T. \tag{21} $$
Now, we come back to the problem
$$ \mathbf{P}(T, \gamma): \qquad \min_{u \in X(T)} J(u) $$
with $J$ defined in (7). Let $v_P(T)$ denote the optimal value of $\mathbf{P}(T, \gamma)$ and $v_Q(T)$ denote the optimal value of $\mathbf{Q}(T)$. Since $K_T(u) \le J(u)$, we have $v_Q(T) \le v_P(T)$.
Moreover, any optimal control $u^\ast$ for $\mathbf{Q}(T)$ is feasible for $\mathbf{P}(T, \gamma)$. Since (21) holds, i.e., $x^\ast(t_0) = x_T$, we have $C_{(t_0, T)}(u^\ast) = 0$. Hence, $v_P(T) \le J(u^\ast) = K_T(u^\ast) = v_Q(T)$, and thus $v_P(T) \le v_Q(T)$. Therefore, we have
$$ v_P(T) = v_Q(T). \tag{22} $$
Equation (22) implies that the parameters that are optimal for $\mathbf{P}(T, \gamma)$ are also optimal for $\mathbf{Q}(T)$. Hence, for each optimal state, we have (21) and (20), which implies (8), and the assertion follows. Thus, we have proved Theorem 1. □

3. The Existence of Solutions of P(T, γ) for Fixed A

For the sake of completeness of the analysis, we also state an existence result. However, we can only prove the existence of a solution for the problem where the matrix $A$ is fixed and not an optimization parameter of $\mathbf{P}(T, \gamma)$. Thus, for a given matrix-valued function $A(t)$, we consider the problem
$$ \mathbf{P}(T, \gamma, A): \qquad \min_{(\,\cdot\,,\, A,\, \cdot\,) \in X(T)} J(\,\cdot\,,\, A,\, \cdot\,). $$
In order to show the existence of a solution of $\mathbf{P}(T, \gamma, A)$, we assume that there exists a number $M > 0$ such that for almost every $t \in [0, T]$ we have $\max_{i \in \{1, \ldots, p\}} \|a_i(t)\| \le M$. This is the case if the $a_i$ are elements of the function space $L^\infty(0, T)$, for example, if they are step functions on $(0, T)$.
Theorem 2.
Assume that $\sup_x |\sigma(x)| \le 1$ and that the Lipschitz constant of $\sigma$ is less than or equal to 1. Assume that $A(t)$ is given such that we have
$$ \operatorname*{ess\,sup}_{i \in \{1, \ldots, p\},\ s \in [0, T]} \|a_i(s)\| < \infty. $$
Then, for each $T > 0$ and $\gamma > 0$, problem $\mathbf{P}(T, \gamma, A)$ has a solution $(W, b)$ such that $(W, A, b) \in X(T)$.
If $A(t) = 0$ for $t \ge t_0$, then for sufficiently large $\gamma$, each solution of $\mathbf{P}(T, \gamma, A)$ has the finite-time turnpike property stated in Theorem 1.
The proof of Theorem 2 uses Gronwall’s Lemma (see, for example, [24]). For the convenience of the reader, we state it here:
Lemma 2
(Gronwall’s Lemma). Let $L > 0$, $U_0 \ge 0$, $\varepsilon \ge 0$ and an integrable function $U$ on $[0, T]$ be given.
Assume that for almost every $t \in [0, T]$, the integral inequality
$$ 0 \le U(t) \le U_0 + \int_0^t L\, U(\tau) + \varepsilon \, d\tau \tag{23} $$
holds. Then, for almost every $t \in [0, T]$, the function $U$ satisfies the inequality
$$ U(t) \le U_0\, e^{L t} + \varepsilon\, \frac{e^{L t} - 1}{L}. \tag{24} $$
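As a quick plausibility check (not part of the paper), the bound (24) is attained by the extremal case $U'(t) = L\,U(t) + \varepsilon$, $U(0) = U_0$, for which the integral inequality (23) holds with equality. The following sketch (Python/NumPy; all constants are illustrative) compares a forward Euler solution of that ODE with the closed-form right-hand side of (24).

```python
import numpy as np

L_const, U0, eps, T = 2.0, 0.5, 0.1, 1.0
t = np.linspace(0.0, T, 1001)
dt = t[1] - t[0]

# Forward Euler for the extremal case U'(t) = L * U(t) + eps, U(0) = U0.
U = np.empty_like(t)
U[0] = U0
for k in range(len(t) - 1):
    U[k + 1] = U[k] + dt * (L_const * U[k] + eps)

# Closed-form right-hand side of the Gronwall bound (24).
bound = U0 * np.exp(L_const * t) + eps * (np.exp(L_const * t) - 1.0) / L_const

print(np.max(U - bound))   # close to zero (forward Euler slightly underestimates the bound)
```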
Now, we present the proof of Theorem 2.
Proof of Theorem 2.
Consider a minimizing sequence $(u_n)_{n=1}^{\infty}$ with $u_n = (W_n, A, b_n) \in X(T)$ for all $n \in \{1, 2, 3, \ldots\}$. Define the norm
$$ \|u_n\|_{X(T)}^2 = \int_0^T \|W_n(t)\|^2 + \|A(t)\|^2 + \|b_n(t)\|^2 \, dt $$
and the corresponding inner product that gives a Hilbert space structure to $X(T)$. Due to the definition of $J$, there exists a number $M > 0$ such that for all $n \in \{1, 2, \ldots\}$ we have
$$ \|u_n\|_{X(T)} \le M; \tag{25} $$
that is, the sequence is bounded in $X(T)$.
Hence, there exists a weakly converging subsequence with a limit
$$ u^\ast = (W^\ast, A, b^\ast) \in X(T). $$
Let $x^\ast$ denote the state generated by $u^\ast$. For the states $x_n$ generated by the $u_n$ as solutions of $\mathbf{S}$ defined in (1), we can assume, by increasing $M$ if necessary, that we have
$$ \sup_{s \in [0, T],\ n \in \{1, 2, 3, \ldots\}} \|x_n(s)\| \le M. $$
Due to Mazur’s Lemma (see, for example, [25,26]), there exists a sequence of convex combinations that converges strongly. To be precise, there exist convex combinations
$$ v_k = \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, u_m, \qquad \text{with} \quad \lambda_m^{(k)} \ge 0, \quad k \le m \le N(k) \quad \text{and} \quad \sum_{m=k}^{N(k)} \lambda_m^{(k)} = 1, $$
such that
$$ \lim_{k \to \infty} \|v_k - u^\ast\|_{X(T)} = 0. $$
This implies
$$ \lim_{k \to \infty} \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, W_m(t) - W^\ast(t) \Big\| + \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, b_m(t) - b^\ast(t) \Big| \, dt = 0. $$
Since $\sigma$ is Lipschitz continuous with a Lipschitz constant that is less than or equal to 1, this implies, for $i \in \{1, \ldots, p\}$,
$$ \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^\ast(t) + b_i^\ast(t) \big) \Big| \le \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^\ast(t) + b_i^\ast(t) \big) \Big|. \tag{26} $$
Thus, for almost every $t \in [0, T]$, we have
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| \le \sum_{i=1}^p \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) - w_i^\ast(s) \Big\|\, \Big| \sigma\big( a_i(s)^\top x^\ast(s) + b_i^\ast(s) \big) \Big| $$
$$ \qquad + \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) \Big\|\, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^\ast(s) + b_i^\ast(s) \big) \Big| \, ds. $$
Then, the fact that $\sup_x |\sigma(x)| \le 1$, the Cauchy–Schwarz inequality, (25) and (26) yield
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| \le \sum_{i=1}^p \Bigg[ \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) - w_i^\ast(s) \Big\| \, ds $$
$$ \qquad + \bigg( \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) \Big\|^2 ds \bigg)^{1/2} \bigg( \int_0^t \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] \Big) - \sigma\big( a_i(s)^\top x^\ast(s) + b_i^\ast(s) \big) \Big|^2 ds \bigg)^{1/2} \Bigg] $$
$$ \le \sum_{i=1}^p \Bigg[ \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) - w_i^\ast(s) \Big\| \, ds + M \bigg( \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(s)^\top x_m(s) + (b_i)_m(s) \big] - \big( a_i(s)^\top x^\ast(s) + b_i^\ast(s) \big) \Big|^2 ds \bigg)^{1/2} \Bigg] $$
$$ \le \sum_{i=1}^p \Bigg[ \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(s) - w_i^\ast(s) \Big\| \, ds + M \bigg( \int_0^t \|a_i(s)\|^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(s) - x^\ast(s) \Big\|^2 ds \bigg)^{1/2} + M \bigg( \int_0^t \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (b_i)_m(s) - b_i^\ast(s) \Big|^2 ds \bigg)^{1/2} \Bigg]. $$
Due to Mazur’s Lemma, this yields the existence of a sequence $(\varepsilon_k)_k$ with $\varepsilon_k \ge 0$ and $\lim_{k \to \infty} \varepsilon_k = 0$, such that
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| \le \varepsilon_k + \sum_{i=1}^p M \bigg( \int_0^t \operatorname*{ess\,sup}_{s \in (0, T)} \|a_i(s)\|^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(\tau) - x^\ast(\tau) \Big\|^2 d\tau \bigg)^{1/2}. $$
Thus, by increasing the value of $M$ if necessary, we obtain, for almost every $t \in [0, T]$,
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| \le \varepsilon_k + \sum_{i=1}^p M \bigg( \int_0^t M^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(s) - x^\ast(s) \Big\|^2 ds \bigg)^{1/2}. $$
Since $(|u| + |v|)^2 \le 2\, |u|^2 + 2\, |v|^2$ and
$\big( \sum_{i=1}^p |z_i| \big)^2 \le p \sum_{i=1}^p |z_i|^2$, this yields the integral inequality
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\|^2 \le 2\, (\varepsilon_k)^2 + 2\, p\, M^4 \sum_{i=1}^p \int_0^t \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(s) - x^\ast(s) \Big\|^2 ds. \tag{27} $$
The integral inequality (27) has the form of (23) in Lemma 2. Hence, (24) from Gronwall’s Lemma yields, for almost every $t \in [0, T]$,
$$ \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| = O(\varepsilon_k). $$
This yields
$$ \lim_{k \to \infty} \max_{t \in [0, T]} \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\| = 0. $$
For the time derivatives we obtain, again by increasing the value of $M$ if necessary,
$$ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m'(t) - {x^\ast}'(t) \Big\| \, dt $$
$$ \le \sum_{i=1}^p \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(t) - w_i^\ast(t) \Big\|\, \Big| \sigma\big( a_i(t)^\top x^\ast(t) + b_i^\ast(t) \big) \Big| + \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(t) \Big\|\, \Big| \sigma\Big( \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] \Big) - \sigma\big( a_i(t)^\top x^\ast(t) + b_i^\ast(t) \big) \Big| \, dt $$
$$ \le \sum_{i=1}^p \Bigg[ \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(t) - w_i^\ast(t) \Big\| \, dt + \bigg( \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (w_i)_m(t) \Big\|^2 dt \bigg)^{1/2} \bigg( \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)} \big[ a_i(t)^\top x_m(t) + (b_i)_m(t) \big] - \big( a_i(t)^\top x^\ast(t) + b_i^\ast(t) \big) \Big|^2 dt \bigg)^{1/2} \Bigg] $$
$$ \le \varepsilon_k + \sum_{i=1}^p \Bigg[ M \bigg( \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, a_i(t)^\top x_m(t) - a_i(t)^\top x^\ast(t) \Big|^2 dt \bigg)^{1/2} + M \bigg( \int_0^T \Big| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, (b_i)_m(t) - b_i^\ast(t) \Big|^2 dt \bigg)^{1/2} \Bigg] $$
$$ \le \varepsilon_k\, (1 + M) + M \sum_{i=1}^p \bigg( \int_0^T \|a_i(t)\|^2\, \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\|^2 dt \bigg)^{1/2} $$
$$ \le \varepsilon_k\, (1 + M) + M \sum_{i=1}^p \operatorname*{ess\,sup}_{s \in [0, T]} \|a_i(s)\| \bigg( \int_0^T \Big\| \sum_{m=k}^{N(k)} \lambda_m^{(k)}\, x_m(t) - x^\ast(t) \Big\|^2 dt \bigg)^{1/2} $$
$$ \le \varepsilon_k\, (1 + M) + M \sum_{i=1}^p \operatorname*{ess\,sup}_{s \in [0, T]} \|a_i(s)\|\; \varepsilon_k = O(\varepsilon_k). $$
Thus, we have
$$ \liminf_{k \to \infty} Q(v_k) \ge Q(u^\ast) \qquad \text{and} \qquad \liminf_{k \to \infty} R(v_k) \ge R(u^\ast). $$
This yields
$$ \liminf_{k \to \infty} J(u_k) \ge J(u^\ast). $$
Hence, $u^\ast$ is a solution of $\mathbf{P}(T, \gamma, A)$. This shows that solutions of $\mathbf{P}(T, \gamma, A)$ exist.
The exact controllability properties that have been used for the construction in the proof of Theorem 1 still hold if the matrix $A$ is fixed. Hence, the assertion about the finite-time turnpike property also follows. □
Remark 2.
The results in Theorem 1 can be adapted to the case where the feasible set only contains constant (that is, time-independent) matrices A and constant vectors b, since the exact controllability results that we have used in the proof still hold in this case. Since in this case A and b are in finite-dimensional spaces, Theorem 2 implies the existence of optimal parameters ( W , A , b ) . We consider a problem of this type in the numerical example that is presented in the next section.

4. Numerical Examples

To illustrate our findings, we present numerical experiments. Let a natural number $k_{\max}$ and $k_0 \in \{1, \ldots, k_{\max}\}$ be given. For $k \in \{1, 2, \ldots, k_{\max}\}$, we consider the time-discrete system
$$ \mathbf{S}_{dis}: \qquad x(0) = x_0 \in \mathbb{R}^d, \qquad x(k) = x(k-1) + \sum_{i=1}^p \sigma\big( a_i^\top x(k-1) + b_i \big)\, w_i(k). \tag{28} $$
Here, the $a_i \in \mathbb{R}^d$ and the real numbers $b_i$ are independent of the time step. Our results can be adapted to the case of a constant matrix $A$ and a constant vector $b$, since the exact controllability results that we have used in the proofs still hold. Since in this case $A$ and $b$ lie in finite-dimensional spaces, we also obtain the existence of optimal parameters in this case.
The matrices $W(k)$ depend on the current time step. To define the objective functional for the time-discrete case, let
$$ Q_{dis}(W, A, b) = \sum_{k=k_0}^{k_{\max}} |x(k) - x_T| + \sum_{k=k_0}^{k_{\max}-1} |x(k+1) - x(k)|, $$
where $x$ is the solution of (28),
$$ R_{dis}(W, A, b) = \tfrac{1}{2}\, \|A\|^2 + \tfrac{1}{2}\, \|b\|^2 + \sum_{k=1}^{k_{\max}} \tfrac{1}{2}\, \|W(k)\|^2 $$
and
$$ J_{dis}(W, A, b) = \gamma\, Q_{dis}(W, A, b) + R_{dis}(W, A, b). $$
We consider the minimization problem
$$ \min_{W(k) \in \mathbb{R}^{d \times p}\ (k \in \{1, \ldots, k_{\max}\}),\ A \in \mathbb{R}^{d \times p},\ b \in \mathbb{R}^p} J_{dis}(W, A, b). $$
For the numerical example, we have chosen $d = p = 3$, $k_{\max} = 10$, $\gamma = 10^5$, $x_0 = (1, 1, 1)^\top$ and $x_T = 0$. For the training, we have used the Global Optimization Toolbox in MATLAB.
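For readers who want to reproduce the qualitative behavior, the following sketch formulates the discrete problem in Python with NumPy and SciPy instead of the MATLAB Global Optimization Toolbox used in the paper; the local, derivative-free optimizer, the initialization and all variable names are illustrative assumptions, so it only approximates the computations reported below.

```python
import numpy as np
from scipy.optimize import minimize

d, p, k_max, k0, gamma = 3, 3, 10, 5, 1e5
x0, xT = np.ones(d), np.zeros(d)

def unpack(theta):
    """Split the flat parameter vector into W(1), ..., W(k_max), A and b."""
    W = theta[: k_max * d * p].reshape(k_max, d, p)
    A = theta[k_max * d * p : (k_max + 1) * d * p].reshape(d, p)
    b = theta[(k_max + 1) * d * p :]
    return W, A, b

def J_dis(theta):
    W, A, b = unpack(theta)
    x, traj = x0.copy(), [x0.copy()]
    for k in range(k_max):                               # forward pass of (28)
        x = x + W[k] @ np.tanh(A.T @ x + b)
        traj.append(x)
    traj = np.array(traj)                                # traj[k] = x(k), k = 0, ..., k_max
    Q = np.abs(traj[k0:] - xT).sum() + np.abs(np.diff(traj[k0:], axis=0)).sum()
    R = 0.5 * (W ** 2).sum() + 0.5 * (A ** 2).sum() + 0.5 * (b ** 2).sum()
    return gamma * Q + R

theta0 = 0.1 * np.random.default_rng(0).normal(size=(k_max + 1) * d * p + p)
result = minimize(J_dis, theta0, method="Powell", options={"maxiter": 200000})
W_opt, A_opt, b_opt = unpack(result.x)
```

The non-smooth $L^1$ terms in $Q_{dis}$ and the non-convexity of the problem make such a simple local search sensitive to the initialization; the experiments reported below were obtained with a global optimization method, as stated above.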
Table 1 contains the evolution of the norms $|x(k)|$ ($k \in \{1, \ldots, k_{\max}\}$) along the computed approximations of the optimal trajectories for different values of $k_0$. It clearly illustrates the finite-time turnpike behavior that is predicted by Theorem 1. The zeros in Table 1 represent numerical values smaller than $5 \times 10^{-5}$.
For $\gamma = 10^4$ and $k_0 = 5$, we have obtained the numerical result presented in Table 2. Also for this smaller value of $\gamma$, the turnpike structure is still visible. Here, we have given the size of the small norms (the numerical approximations of zero) in more detail.

5. Conclusions

We have shown that with a suitable non-smooth loss function, each solution of a learning problem has the finite-time turnpike property, which means that it reaches the desired state exactly after a finite time. Since the finite time $t_0$ can be considered a problem parameter, this situation allows us to choose $t_0$ in a convenient way. Thus, $t_0$ arises as an additional design parameter in the design of neural networks, and it corresponds to the number of layers. Since for $t \in [t_0, T]$ the optimal parameters are zero, system (1) has a constant state on $[t_0, T]$, and thus the time horizon can be cut off at $t_0$.
Therefore, in the setting of neural ODEs, the problem of finding the optimal number of layers in a neural network corresponds to a problem of time-optimal control (see, for example, [27]), which is defined as follows. Let a number $\rho > 0$ be given, which serves as a problem parameter. The optimization problem is to
$$ \text{find a minimal value of } t_0 $$
subject to (1), the terminal constraint $x(t_0) = x_T$ and, for $u = (W, A, b) \in X(t_0)$, the inequality constraint $C_{(0, t_0)}(u) \le \rho$.
Here, $C_{(0, t_0)}$ is as defined in (10). The solution of the time-optimal control problem is closely related to the solution of $\mathbf{P}(T, \gamma)$. This can be seen as follows. Let $\omega(T, \gamma)$ denote the optimal value of $\mathbf{P}(T, \gamma)$. Then, Theorem 1 implies that (if $\gamma$ is sufficiently large) for optimal parameters $u^\ast$ that solve $\mathbf{P}(T, \gamma)$, we have $Q(u^\ast) = 0$ and $C_{(0, t_0)}(u^\ast) = \omega(T, \gamma)$, and for the optimal state we have $x(t_0) = x_T$. Hence, we conclude that the optimal parameters for $\mathbf{P}(T, \gamma)$ also solve the time-optimal control problem with parameter $\rho = \omega(T, \gamma)$, and the optimal time is $t_0$.
This relation allows us to adapt the choice of $t_0$ to the desired value of $C_{(0, t_0)}(u^\ast)$. If $t_0$ is enlarged, this value can be decreased for the optimal parameters.
We have shown the existence of a solution of the nonlinear optimization problem for the case that one of the parameters, namely the matrix $A(t)$, is fixed. In order to show that a solution also exists with $A(t)$ as an additional (time-dependent) optimization parameter, we expect that an additional regularization term in the objective functional (for example, $\int_0^T \|A'(t)\|^2 \, dt$) is necessary. This is a topic for future research.
We expect that the finite-time turnpike property also holds in the case t 0 = 0 . However, the proof that is presented here does not apply to this case, so this is another topic for future research. As a possible application of our results, we have the numerical solution of shape inverse problems in mind, as described in [28]. Studying the finite-time turnpike phenomenon in a practical machine learning scenario will be of high value for future research.
Moreover, it would be interesting to combine the dynamics with an approach that allows for data assimilation, such as nudging induced neural networks (NINNs) that have been introduced recently in [29].

Funding

This research was funded by DFG in the framework of the Collaborative Research Centre CRC/Transregio 154, Mathematical Modelling, Simulation and Optimization Using the Example of Gas Networks, project C03 and project C05, Projektnummer 239904186 and the Bundesministerium für Bildung und Forschung (BMBF) and the Croatian Ministry of Science and Education under DAAD grant 57654073 ’Uncertain data in control of PDE systems’.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Marion, P. Generalization bounds for neural ordinary differential equations and deep residual networks. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: New York, NY, USA, 2024; Volume 36. [Google Scholar]
  2. Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
  3. Lai, Z.; Mylonas, C.; Nagarajaiah, S.; Chatzi, E. Structural identification with physics-informed neural ordinary differential equations. J. Sound Vib. 2021, 508, 116196. [Google Scholar] [CrossRef]
  4. Sarwinda, D.; Paradisa, R.H.; Bustamam, A.; Anggia, P. Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer. Procedia Comput. Sci. 2021, 179, 423–431. [Google Scholar] [CrossRef]
  5. Thorpe, M.; van Gennip, Y. Deep limits of residual neural networks. Res. Math. Sci. 2023, 10, 6. [Google Scholar] [CrossRef]
  6. Lu, Y.; Zhong, A.; Li, Q.; Dong, B. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 3276–3285. [Google Scholar]
  7. Álvarez López, A.; Slimane, A.H.; Zuazua, E. Interplay between depth and width for interpolation in neural ODEs. Neural Netw. 2024, 180, 106640. [Google Scholar] [CrossRef] [PubMed]
  8. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  9. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 1999, 8, 143–195. [Google Scholar] [CrossRef]
  10. Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. In Proceedings of the Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, 10–14 September 2006; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2006; pp. 632–640. [Google Scholar]
  11. Schäfer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long term dependencies with recurrent neural networks. In Proceedings of the Artificial Neural Networks–ICANN 2006: 16th International Conference, Athens, Greece, 10–14 September 2006; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2006; pp. 71–80. [Google Scholar]
  12. Schaefer, A.M.; Udluft, S.; Zimmermann, H.G. Learning long-term dependencies with recurrent neural networks. Neurocomputing 2008, 71, 2481–2488. [Google Scholar] [CrossRef]
  13. Gugat, M.; Schuster, M.; Zuazua, E. The finite-time turnpike phenomenon for optimal control problems: Stabilization by non-smooth tracking terms. In Proceedings of the Stabilization of Distributed Parameter Systems: Design Methods And Applications; Springer: Cham, Switzerland, 2021; pp. 17–41. [Google Scholar] [CrossRef]
  14. Gugat, M.; Schuster, M. Optimal Neumann control of the wave equation with L 1-control cost: The finite-time turnpike property. Optimization 2024, 1–28. [Google Scholar] [CrossRef]
  15. Gugat, M. Optimal Boundary Control of the Wave Equation: The Finite-Time Turnpike Phenomenon. In Mathematical Reports; Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU): Erlangen, Germany, 2022. [Google Scholar]
  16. Zaslavski, A.J. Turnpike Phenomenon in Metric Spaces; Springer Nature: Berlin/Heidelberg, Germany, 2023; Volume 201. [Google Scholar]
  17. Grüne, L.; Faulwasser, T. Turnpike properties in optimal control: An overview of discrete-time and continuous-time results. In Handbook of Numerical Analysis; Trelat, E., Zuazua, E., Eds.; Elsevier: Amsterdam, The Netherlands, 2022. [Google Scholar] [CrossRef]
  18. Grüne, L.; Guglielmi, R. Turnpike properties and strict dissipativity for discrete time linear quadratic optimal control problems. SIAM J. Control Optim. 2018, 56, 1282–1302. [Google Scholar] [CrossRef]
  19. Trélat, E.; Zuazua, E. The turnpike property in finite-dimensional nonlinear optimal control. J. Differ. Equ. 2015, 258, 81–114. [Google Scholar] [CrossRef]
  20. Geshkovski, B.; Zuazua, E. Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numer. 2022, 31, 135–263. [Google Scholar] [CrossRef]
  21. Steffensen, S. Relaxation approaches for nonlinear sparse optimization problems. Optimization 2023, 73, 3237–3258. [Google Scholar] [CrossRef]
  22. Burger, M.; Föcke, J.; Nickel, L.; Jung, P.; Augustin, S. Reconstruction methods in thz single-pixel imaging. In Proceedings of the Compressed Sensing and Its Applications: Third International MATHEON Conference 2017, Berlin, Germany, 4–8 December 2017; Springer: Cham, Switzerland, 2019; pp. 263–290. [Google Scholar]
  23. Bungert, L.; Roith, T.; Tenbrinck, D.; Burger, M. A Bregman learning framework for sparse neural networks. J. Mach. Learn. Res. 2022, 23, 1–43. [Google Scholar]
  24. Gugat, M. Optimal Boundary Control and Boundary Stabilization of Hyperbolic Systems; Birkhäuser: Basel, Switzerland, 2015. [Google Scholar] [CrossRef]
  25. Ciarlet, P.G. Mathematical Elasticity: Three-Dimensional Elasticity; SIAM: Philadelphia, PA, USA, 2021. [Google Scholar]
  26. Heuser, H.G. Functional Analysis; Horvath, J., Translator; A Wiley-Interscience Publication; Chichester etc.; John Wiley & Sons: Hoboken, NJ, USA, 1982. [Google Scholar]
  27. LaSalle, J. Time optimal control systems. Proc. Natl. Acad. Sci. USA 1959, 45, 573–577. [Google Scholar] [CrossRef] [PubMed]
  28. Jackowska-Strumillo, L.; Sokolowski, J.; Żochowski, A.; Henrot, A. On numerical solution of shape inverse problems. Comput. Optim. Appl. 2002, 23, 231–255. [Google Scholar] [CrossRef]
  29. Antil, H.; Löhner, R.; Price, R. NINNs: Nudging induced neural networks. Phys. Nonlinear Phenom. 2024, 470, 134364. [Google Scholar] [CrossRef]
Figure 1. A typical graph of $|x(t)|$, where $x$ is an optimal trajectory of $\mathbf{P}(T, \gamma)$ that satisfies the finite-time turnpike property. Here, $t_0 = 1$ and $T = 5$.
Table 1. The evolution of the norm of the state along the approximations of the optimal trajectories. The structure of the numerical solutions shows the finite-time turnpike structure that is indicated in Theorem 1. Here, $\gamma = 10^5$.

 γ = 10^5   |x(1)|  |x(2)|  |x(3)|  |x(4)|  |x(5)|  |x(6)|  |x(7)|  |x(8)|  |x(9)|  |x(10)|
 k_0 = 1    3       0       0       0       0       0       0       0       0       0
 k_0 = 2    3       2.2178  0       0       0       0       0       0       0       0
 k_0 = 3    3       3.2343  3.3124  0       0       0       0       0       0       0
 k_0 = 4    3       2.3303  0.9089  0.7560  0       0       0       0       0       0
 k_0 = 5    3       2.8638  2.6989  2.2098  0.0396  0       0       0       0       0
 k_0 = 6    3       1.8557  1.0519  1.2795  1.1156  0.7379  0.0002  0       0       0
Table 2. For $\gamma = 10^4$, the finite-time turnpike structure for $k_0 = 5$ is still visible.

 γ = 10^4   |x(1)|  |x(2)|  |x(3)|  |x(4)|  |x(5)|  |x(6)|   |x(7)|   |x(8)|   |x(9)|   |x(10)|
 k_0 = 5    3       3.2622  2.7861  2.0809  0.2694  3×10⁻⁸   2×10⁻⁸   3×10⁻⁸   2×10⁻⁸   2×10⁻⁸
