1. Introduction
The compatibility of penalized regularization with machine learning approaches allows for the successful treatment of various challenges in learning theory such as variable selection (see 
Tibshirani (
1996)) and dimension reduction (see 
Zou et al. (
2006)). The objective of many machine learning models used in mathematical finance is to predict asset prices by learning functions depending on stochastic inputs. In general, there is no guarantee that these stochastic factor models are consistent with no-arbitrage conditions. This paper introduces a novel penalized regularization approach to address this modelling difficulty in a manner consistent with financial theory. The incorporation of an arbitrage-penalty term allows various machine learning methods to be directly and coherently integrated into mathematical finance applications. We focus on regression-type model selection tasks in this article. However, the arbitrage-penalty can also be applied to other types of machine learning algorithms with financial applications.
To motivate our approach we first consider informally, similar to (
Björk 2009, Chapter 10), the following simple situation that will later be made more precise. Let 
 be a filtered probability space satisfying the usual conditions. Let 
 be an 
-adapted real-valued stochastic process with continuous paths representing the price of a financial asset. Let 
r be the constant risk-free interest rate and assume a fixed time interval 
. Existence of a martingale measure 
 equivalent to the underlying real-world measure 
 implies absence of arbitrage. The price at time 
 of a derivative security with integrable payoff 
 at time 
T is given by the risk-neutral pricing formula
      
With 
 we may express Equation (
1) under the real-world probability measure 
 as
      
Equivalently, the price given by Equation (
2) can be expressed under 
 by defining the state-price density process 
. If 
 is the minimal martingale measure of 
Schweizer (
1995) then the transformation
      
      can be interpreted as finding the process closest to 
 which is a (local) martingale under 
. The purpose of this paper is to find an analogue of the transformation (
3) in this setting when 
 is described by a stochastic factor model, as is the case with most machine learning approaches to mathematical finance. For example, 
 may be described by a deep neural network with stochastic inputs.
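For orientation, the pricing formulas referred to above take the following standard form. The notation ($V_t$ for the price, $\xi$ for the payoff, $Z_t$ for the density process of $\mathbb{Q}$ with respect to $\mathbb{P}$) is assumed here for illustration and may differ from the symbols used elsewhere in the paper.

```latex
% Risk-neutral pricing formula (standard form; notation assumed)
V_t = e^{-r(T-t)}\,\mathbb{E}^{\mathbb{Q}}\!\left[\xi \,\middle|\, \mathcal{F}_t\right],
\qquad 0 \le t \le T.
% The same price under the real-world measure, by the abstract Bayes formula
V_t = e^{-r(T-t)}\,\frac{\mathbb{E}^{\mathbb{P}}\!\left[Z_T\,\xi \,\middle|\, \mathcal{F}_t\right]}{Z_t},
\qquad Z_t := \left.\frac{d\mathbb{Q}}{d\mathbb{P}}\right|_{\mathcal{F}_t}.
```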
The discussion above ignored well-known results regarding the uniqueness of 
 (see 
Schweizer (
1995)) and other important generalizations of the martingale approach to arbitrage theory. In particular, the more general setting for the fundamental theorem of asset pricing of 
Delbaen and Schachermayer (
1998) implies that if “arbitrage”, in the sense of a free lunch with vanishing risk, exists then the transformation (
3) is undefined. However, many machine learning approaches to mathematical finance may admit arbitrage so it is necessary to consider the general case. The arbitrage-regularization framework introduced in this paper integrates machine learning methodologies with the general martingale approach to arbitrage theory.
We consider a general framework for learning the arbitrage-free factor model that is most similar to a factor model within a prespecified class of alternative factor models. This search is optimized by minimizing a loss-function measuring the distance of the alternative model to the original factor model with the additional constraint that the market described by the alternative model is a local martingale under a reference probability measure.
The main theoretical results rely on asymptotics for the arbitrage-regularization penalty for selecting the optimal arbitrage-free model from a class of stochastic factor models. Relaxations of the asymptotic results, which are necessary for practical implementation, are also presented. Throughout this paper, the bond market will serve as the primary example of our methods since no-arbitrage conditions for factor models are well understood; see 
Filipović (
2001) and the references therein. The arbitrage-regularization methodology is implemented numerically using real data.
The remainder of this paper is organized as follows. 
Section 2 states the arbitrage-regularization problem and presents an overview of relevant background on bond markets. 
Section 3 develops the arbitrage-penalty and establishes the main asymptotic optimality results. Non-asymptotic relaxations of these results are also considered and linked with transaction costs. 
Section 4 specializes the general results to bond markets, where a simplified expression for the arbitrage-penalty is obtained. Numerical implementations of the results are considered: the arbitrage-regularization methodology is used to generate new machine learning-based models consistent with no free lunch with vanishing risk (NFLVR), and the results are compared with classical term-structure models as benchmarks. 
Section 5 concludes and 
Appendix A contains supplementary material primarily required for the proofs, such as functional Itô calculus and 
-convergence results. Proofs of the main theorems of the paper are included in 
Appendix B.
  2. The Arbitrage-Regularization Problem
For the remainder of the paper, all stochastic processes are defined on a common stochastic basis 
. Let 
 be a probability measure equivalent to the reference probability measure 
 and 
 denote the risk-free rate in effect at time 
. Assume that there exists an asset whose price process, denoted by 
, is a strictly positive 
-martingale which serves as numéraire. Unless otherwise specified all the processes in this paper will be described under the martingale-measure for 
, denoted 
 and defined by
      
The choice of the numéraire can be used to encode or remove any trend from the price processes being modelled. Price processes which are local martingales under 
 or 
 are usually only semi-martingales under the objective measure 
. Further details on numéraires can be found in 
Shreve (
2004).
We consider a large financial market 
, indexed by a non-empty Borel subset 
, where 
D is a positive integer. For example, 
 may be used to represent a bond market where, using the parameterization of 
Musiela and Rutkowski (
1997), 
 represents the collection of all possible maturities and 
 represents the time 
t price of a zero-coupon bond with maturity 
u.
For each 
, the process 
 will be driven by a latent, possibly infinite-dimensional, factor process. In the case of the bond market, this latent process will be the forward-rate curve. Write
      
      where 
, 
 is a family of path-dependent functionals encoding the latent process into the asset price 
, 
 is the factor model for the latent process, and 
 are the 
-valued stochastic factors driving the latent process. Following 
Fournie (
2010), 
 will be allowed to depend on the local quadratic-variation of the factor process 
, denoted by 
 and defined by
      
      where 
 denotes the usual quadratic-variation of the factor process 
. It is instructive to note that the local quadratic-variation 
 is well-defined due to Assumption 1, imposed later.
In the case of the bond market, 
 will be the map taking a forward-rate curve, such as 
, to the time 
t price of a zero-coupon bond with maturity 
u, as defined by
      
It will often be convenient to use the reparameterization of 
Musiela and Rutkowski (
1997) and rewrite (
6) as
      
      where 
, for 
, and represents the time to maturity of the bond.
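The map from a forward-rate curve to a zero-coupon bond price can be illustrated numerically. The sketch below assumes the standard relation $P(t,u)=\exp\left(-\int_t^u f(t,s)\,ds\right)$ and approximates the integral with the trapezoidal rule; the function name `zero_coupon_price`, the step count, and the flat 3% curve are hypothetical choices made only for this example.

```python
import math

def zero_coupon_price(forward_curve, t, u, n_steps=1000):
    """Price at time t of a zero-coupon bond maturing at u >= t.

    Uses P(t, u) = exp(-integral_t^u f(t, s) ds), with the integral
    approximated by the composite trapezoidal rule.
    forward_curve(s) returns the instantaneous forward rate f(t, s).
    """
    if u < t:
        raise ValueError("maturity u must not precede t")
    if u == t:
        return 1.0
    h = (u - t) / n_steps
    total = 0.5 * (forward_curve(t) + forward_curve(u))
    for i in range(1, n_steps):
        total += forward_curve(t + i * h)
    return math.exp(-h * total)

# Hypothetical flat forward curve at 3% (an assumption for illustration)
flat = lambda s: 0.03
price = zero_coupon_price(flat, t=0.0, u=5.0)
```

Under the time-to-maturity reparameterization, the same routine applies with `t = 0` and `u` equal to the time to maturity.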
In general, 
 will be allowed to depend on the path of 
. Thus, 
 will be a path-dependent functional of regularity 
 in the sense of 
Fournie (
2010) as discussed in 
Appendix A.2. However, as in the bond market, if 
 depends only on the current value of 
 then the requirement that 
 be of class 
, in the sense of 
Fournie (
2010), is equivalent to it being of regularity 
 in the classical sense; where 
. Therefore, the classical Itô-calculus would apply to 
.
Analogously to 
Björk and Christensen (
1999), the factor model 
 for the latent process will always be assumed to be suitably integrable and suitably differentiable. Specifically, 
 will belong to a Banach subspace 
 of 
 which can be continuously embedded within the Fréchet space 
; where 
 is a Borel probability measure supported on 
I, 
 is a Borel probability measure supported on 
, and both 
 and 
 are equivalent to the corresponding Lebesgue measures restricted to their supports. Here, 
 is kept fixed.
An example from the bond modelling literature is the Nelson-Siegel model (see 
Nelson and Siegel (
1987) and 
Diebold and Rudebusch (
2013)), which expresses the forward-rate curve as a function of its level, slope, and curvature through the factor model. The Nelson-Siegel family is part of a larger class of affine term-structure models, in which, at any given time, the forward-rate curve is described in terms of a set of market factors as
      
      where 
d is a positive integer and 
 and 
 is a forward-rate curve typically calibrated to the data available at time 
. Note that the forward-rate curves in (
7) are parameterized according to the change of variables in (6). Nevertheless, since 
 represents all times to maturity, these are indeed traded assets. However, as shown in 
Filipović (
2001), the Nelson-Siegel model is typically not arbitrage-free; we would therefore like to learn the closest arbitrage-free factor model driven by the same stochastic factors. To this end, given a non-empty and unbounded hypothesis class 
 of plausible alternative models, we optimize
      
      where 
 is required to contain the (naive) factor model 
 and 
 is a continuous and coercive loss function. For example, 
ℓ may be taken to be the norm on 
. Geometrically, (
8) describes a projection of 
 onto the (possibly non-convex) subset of 
 of factor models making each 
 into a 
-local martingale for every 
. The requirement that 
 contains the (naive) factor model 
 is for consistency, in order to ensure that for any arbitrage-free factor model 
 the solution to problem (
8) is itself.
In general, the problem described by (
8) may be challenging to implement, as projections onto non-convex sets are generally intractable. In analogy with the regularization literature, such as 
Hastie et al. (
2015), we instead consider the following relaxation of problem (
8), which is more amenable to numerical implementation:
      
      where 
 is a family of functions from 
 to 
 taking value 0 if each
 is a 
-local martingale simultaneously for every value of 
u and 
 is a meta-parameter determining the amount of emphasis placed on penalizing factor models which fail to meet this requirement. Problem (
9) is called the 
arbitrage-regularization problem.
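The effect of replacing a hard constraint by a penalty can be seen in a one-dimensional toy problem. In the sketch below, everything is a stand-in assumed for illustration: the "factor model" is a real number, the loss is the squared distance to the naive model, the constraint set is $\{x \le 1\}$, and the penalty $\max(x-1,0)^2$ vanishes exactly on that set. As the meta-parameter grows, the penalized minimizer approaches the projection of the naive model onto the constraint set.

```python
def penalized_minimizer(naive, lam, lo=-10.0, hi=10.0, tol=1e-10):
    """Minimize (x - naive)^2 + lam * penalty(x) over [lo, hi] by ternary search.

    Toy stand-ins (assumptions): the constraint set is {x <= 1} and the
    penalty max(x - 1, 0)^2 is zero exactly on that set. The objective is
    strictly convex, so ternary search locates its unique minimizer.
    """
    def objective(x):
        return (x - naive) ** 2 + lam * max(x - 1.0, 0.0) ** 2

    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

naive_model = 2.0  # hypothetical naive model lying outside the constraint set
solutions = {lam: penalized_minimizer(naive_model, lam)
             for lam in (0.0, 1.0, 10.0, 1000.0)}
```

In this toy case the minimizer has the closed form $(2+\lambda)/(1+\lambda)$, which tends to the projection $1$ as $\lambda \to \infty$.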
At the moment, there are only two available lines of research which are comparable to the arbitrage-regularization problem. Results of the first kind, such as the arbitrage-free Nelson-Siegel model of 
Christensen et al. (
2011a), provide closed-form, case-by-case arbitrage-free variants of specific models only if they coincide with specific arbitrage-free HJM-type factor models, such as those studied by 
Björk and Christensen (
1999). However, the reliance on analytic methods typically limits this type of approach to simple or specific models and does not allow for a general or computationally viable solution to the problem. Moreover, arbitrage-free corrections derived in this way are not guaranteed to be optimal in the sense of (
8), or approximately optimal in the sense of (
9). This will be examined further in the numerics section of this paper.
The use of a penalty to capture no-arbitrage conditions has, to the best of the authors’ knowledge, thus far only been explored numerically by 
Chen et al. (
2019) within the discrete-time portfolio optimization setting. A similar problem has been treated in 
Chen et al. (
2006) for learning the equivalent martingale measure in the multinomial tree setting for stock prices. Our paper provides the first theoretical result in this direction, as well as the first such framework that applies to large financial markets, such as bond markets, or to the continuous-time setting.
Before presenting the main results we first state necessary assumptions.
Assumption 1. The following assumptions will be maintained throughout this paper.
- (i) 
  is an -valued diffusion process which is the unique strong solution towhere ,  is an -valued Brownian motion, the components  are continuous, the components  are measurable and such that the diffusion matrix is a continuous function of β for any fixed .
- (ii) 
 The stochastic differential equation (10) has a unique -valued solution for each .
- (iii) 
 For every ,  is a non-anticipative functional in  verifying the following “predictable-dependence” condition of Fournie (2010): for all  and all , where  is the set of -dimensional positive semi-definite matrices with real coefficients,
 The central problem of the paper will be addressed in full generality before turning to applications in term-structure models, in the next section.
  3. Main Results
In this section, we show the asymptotic equivalence of problems (
8) and (
9) for general asset classes. This requires the construction of the penalty term 
 measuring how far a given factor model is from being a 
-local martingale. The construction of 
 is made in two steps. First, a drift condition which guarantees that each 
 is simultaneously a 
-local martingale is obtained. This condition generalizes the drift condition of 
Heath et al. (
1992) and provides an analogue to the consistency condition of 
Filipović and Teichmann (
2004). Second, the drift condition is used to produce the penalty term in (
9). Subsequently, the optimizers of (
9) will be used to asymptotically solve problem (
8).
Proposition 1 (Drift Condition). 
The processes  are -local-martingales, for each  simultaneously, if and only if
      
is satisfied -a.s. for every  and every , where  and ∇ respectively denote the horizontal and vertical derivative of Fournie (2010) (see Appendix A.2 Equation (A3)).
The drift condition in Proposition 1 implies that if 
 is such that the difference of the left and right-hand sides of (
11) is equal to 0, 
-a.s. for all 
 then 
 is a 
-local martingale simultaneously for all 
. Thus, 
 is simultaneously a 
-local martingale for all 
 if for every 
 the 
-valued process 
 is equal to 0 
-a.s., where 
 is defined using (
11) by
      
      where 
. The arbitrage-penalty is defined as follows.
Definition 1 (Arbitrage-Penalty). 
Let  be a family of -adapted -valued stochastic processes for which
      
holds for all , , , and -almost every . Then, for every , the family  of functions
      
is said to define an arbitrage-penalty.
Remark 1. Whenever  fails to be integrable, we make the convention that .
 The convergence of the optimizers of (
9) to the optimizers of (
8) is demonstrated in the next theorem. The proof relies on the theory of 
-convergence, which is useful for interchanging the limit and arginf operations.
Assumption 2. Assume that
- (i) 
 For every  and -a.e. , the function  is continuous on ,
- (ii) 
  is closed and non-empty.
Please note that both statements (i) and (ii) are with respect to the relative topology on .
 Theorem 1. Under Assumption 2 the following hold:
- (i) 
 Equation (8) admits a minimizer on ,
- (ii) 
 
- (iii) 
 If for every  is lower-semi-continuous on  then where  is defined on  as 
 Theorem 1 provides a theoretical means of asymptotically computing the optimizer 
 of problem (
8). In practice, this limit cannot always be computed and only very large values of 
 can be used. In reality, however, trading does not occur in a frictionless market: every transaction placed at time 
t incurs a cost 
. Moreover, only a finite number of assets are traded.
Consider a market with frictions where only finitely many assets are traded. In this setting, an admissible strategy is an adapted, left-continuous process of finite variation 
 whose corresponding wealth process is 
-a.s. bounded below. In the context of this paper, the sub-market 
 with proportional transaction cost 
 is precisely such a market. Any such admissible strategy on this finite sub-market defines an admissible portfolio whose liquidation value, as defined by (
Guasoni 2006, Equation 2.2) and (
Guasoni 2006, Remark 2.4), is given by
      
      where 
 denotes the optimizer of (
9) for a fixed value of 
, 
 denotes the weak derivative of 
 in the sense of measures, and 
 denotes its variation. The first term on the right-hand side of (
17) represents the capital gains from trading, the second represents the incurred transaction costs, and the last term represents the cost of instantaneous liquidation at time 
t. Although more general transaction costs may be considered, the proportional transaction costs presented here are sufficient for the formulation of the next result.
The next result guarantees that the market model  is arbitrage-free, granted that  is large enough to cover the spread between  and . The following assumption quantifies the requirement that  be taken to be sufficiently large.
Assumption 3. There exists some  and some  such that for every , positive integer n, and every  the following holds:
- (i) 
 - (ii) 
 
 Proposition 2. If  for all times  then for any admissible strategy θ trading , , then  implies that 
 In the next section, we apply Theorem 1 and the arbitrage-regularization (
9) to the bond market.
  4. Arbitrage-Regularization for Bond Pricing
As discussed in 
Diebold and Rudebusch (
2013), affine term-structure models are commonly used in forward-rate curve modelling due to their tractability and interpretability. In the formulation of 
Björk and Christensen (
1999), as further developed in 
Filipović (
2000); 
Filipović et al. (
2010), affine term-structure models are characterized by (
7) together with the additional requirement that its stochastic factor process 
 follows an affine diffusion. By 
Cuchiero (
2011) this means that the dynamics of 
 are given by
      
      for some 
 matrices 
 and 
, and some vectors 
 and 
 in 
 such that there exists a solution 
 to the following Riccati system
      
      such that 
 has negative real part for all 
, 
, 
, and 
.
Fix meta-parameters 
 and 
. For the next result, all the factor models will be taken as belonging to the weighted Sobolev space 
 with weight function
      
      where 
C is the unique constant ensuring that 
 and its weighted integral is equal to 1. Fix measures 
 on 
 and 
 on 
. The space 
 consists of all 
-locally integrable, 
k-times weakly differentiable functions 
 equipped with the norm: 
      where 
 is a multi-index, 
, and 
 is the weak derivative of 
f of order 
 defined by
      
Here, 
 is the space of all compactly supported functions with infinitely many derivatives. Furthermore, 
k is required to satisfy
      
Remark 2. In the case where  and , the Sobolev space  is a reproducing kernel Hilbert space (see Nelson and Siegel (1987)); therefore, point evaluation is a continuous linear functional and, by the (weighted) Morrey-Sobolev theorem of Brown and Opic (1992), it can be embedded within a space of continuous functions. Therefore, given any  and any -valued stochastic process , the process  is a well-defined process in the following space of forward-rate curves of Filipović (2001) defined by
      
Analytic tractability is ensured by requiring that the factor models considered for the arbitrage-regularization (
9) belong to the class 
 defined by
      
      where 
. This class of functions generalizes the Nelson-Siegel family (
7) discussed in the introduction.
Under these conditions the following theorem characterizes the asymptotic behavior of (
9) in 
 as solving problem (
8), given fixed meta-parameters 
. Following 
Filipović (
2001), it will be convenient to denote
      
Theorem 2. Let φ be given by (7),  be as in (18), and fix . Then
- (i) 
 For every  there exists an element  in  minimizing where  is defined by where , , , and , for .
- (ii) 
 The following inclusion holds 
 It may be convenient to understand the 
 as a function of 
 when interpreting approximations of the limit (26) as a function of 
. The following result removes the challenges posed by the unbounded interval 
, in which 
 lies, by reparameterizing problem (
23) with a bounded meta-parameter 
.
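A bounded reparameterization of this kind can be sketched as follows. The specific map $\lambda = \tilde{\lambda}/(1-\tilde{\lambda})$ with $\tilde{\lambda} \in [0,1)$ is an assumption made here for illustration, as are all function names; under it, the convex combination $(1-\tilde{\lambda})\,\ell + \tilde{\lambda}\,\mathrm{penalty}$ equals $(1-\tilde{\lambda})$ times the penalized objective $\ell + \lambda\,\mathrm{penalty}$, so both share the same minimizers.

```python
def to_unbounded(lam_tilde):
    """Map a bounded meta-parameter in [0, 1) to lam in [0, inf) (assumed map)."""
    if not 0.0 <= lam_tilde < 1.0:
        raise ValueError("lam_tilde must lie in [0, 1)")
    return lam_tilde / (1.0 - lam_tilde)

def to_bounded(lam):
    """Inverse map from lam in [0, inf) to lam_tilde in [0, 1)."""
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    return lam / (1.0 + lam)

def convex_combination(loss, penalty, lam_tilde):
    """Bounded form of the objective: (1 - lam_tilde) * loss + lam_tilde * penalty."""
    return (1.0 - lam_tilde) * loss + lam_tilde * penalty

def penalized(loss, penalty, lam):
    """Unbounded form of the objective: loss + lam * penalty."""
    return loss + lam * penalty

lam_tilde = 0.9
lam = to_unbounded(lam_tilde)
# The two objectives differ only by the positive factor (1 - lam_tilde),
# so minimizing one is equivalent to minimizing the other.
ratio = convex_combination(0.4, 0.05, lam_tilde) / penalized(0.4, 0.05, lam)
```

Searching over $\tilde{\lambda}$ on a grid in $[0,1)$ is then a bounded problem, with values near 1 corresponding to very large $\lambda$.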
Corollary 1. Let φ be given by (7),  be as in (18), ϕ be in , and fix . For every , define . Then  minimizes (23) if and only if it minimizes
      
where  is as in (24). In particular, the following inclusion holds
      
where  is as in (16).
Next, the arbitrage-regularization of forward-rate curves will be considered using deep learning methods.
  4.1. A Deep Learning Approach to Arbitrage-Regularization
The flexibility of feed-forward neural networks (ffNNs), as described in the universal approximation theorems of 
Hornik (
1991); 
Kratsios (
2019b), makes the collection of ffNNs a well-suited class of alternative models for the arbitrage-regularization problem. In the context of this paper an ffNN is any function from 
 to 
 of the form
        
        where each 
 for some 
 dimensional matrix 
 and some 
, where 
 and 
, 
 is a smooth activation-function, and • denotes component-wise composition. Fix integers 
, and 
. The set of all feed-forward neural networks with 
 for 
, 
, and fixed activation function 
 will be denoted by 
.
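The structure of such a network, alternating affine maps with a component-wise activation, can be sketched in a few lines. The layer sizes and weights below are hypothetical, and the choice to leave the final affine map without an activation is an assumption of this example rather than something fixed by the definition above.

```python
import math

def affine(weights, bias, x):
    """Apply the affine map x -> A x + b, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def ffnn(layers, x, activation=math.tanh):
    """Evaluate a feed-forward network.

    layers is a list of (A, b) pairs. The activation is applied
    component-wise after every affine map except the last one.
    """
    for weights, bias in layers[:-1]:
        x = [activation(v) for v in affine(weights, bias, x)]
    weights, bias = layers[-1]
    return affine(weights, bias, x)

# Hypothetical network: input dimension 2, one hidden layer of height 3, output dimension 1
layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = ffnn(layers, [1.0, 2.0])
```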
To maintain analytic tractability, it will be required that our hypothesis class 
 consists of all 
 of the form
        
        where 
, 
 for all 
, and where 
 denotes the transpose of 
. The process 
 will be assumed to be a 
d-dimensional Ornstein-Uhlenbeck process and, in particular, will be of the form (
18). Therefore, the special class of models we consider here are of the form (
7).
It has been shown in 
Rahimi and Recht (
2008), among others, that if a network is appropriately designed, then training only the final layer and suitably initializing the matrices 
 performs comparably well to networks with all the layers trained. More recently, the approximation capabilities of neural networks whose first few layers are randomized have been shown in 
Gonon (
2020). This phenomenon was observed in numerous numerical studies, such as 
Jaeger and Haas (
2004), where the entries of the matrices 
 are chosen entirely randomly. This practice has also become fundamental to feasible implementations of recurrent neural network (RNN) theory and reservoir computing, as studied in 
Gelenbe (
1989), where training speed becomes a key factor in determining the feasibility of the RNN and reservoir computing paradigms.
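The idea of freezing randomly generated hidden layers and training only the final layer can be illustrated on a small regression task. The sketch below is a minimal example under assumed choices (one hidden layer of 20 `tanh` units, plain gradient descent, a sine curve as target); it is not the initialization scheme or optimizer used in the paper.

```python
import math
import random

def hidden_features(x, hidden):
    """Frozen random hidden layer: tanh(a * x + b) for each hidden unit."""
    return [math.tanh(a * x + b) for a, b in hidden]

def train_readout(xs, ys, hidden, steps=500, lr=0.02):
    """Fit only the final linear layer by gradient descent on the mean squared error."""
    feats = [hidden_features(x, hidden) + [1.0] for x in xs]  # append a bias feature
    w = [0.0] * len(feats[0])
    n = len(xs)
    history = []
    for _ in range(steps):
        preds = [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
        errs = [p - y for p, y in zip(preds, ys)]
        history.append(sum(e * e for e in errs) / n)
        grad = [2.0 / n * sum(e * f[j] for e, f in zip(errs, feats))
                for j in range(len(w))]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w, history

rng = random.Random(0)
# Random hidden weights: generated once and never updated during training
hidden = [(rng.uniform(-2, 2), rng.uniform(-1, 1)) for _ in range(20)]
frozen_copy = list(hidden)
xs = [i / 20.0 for i in range(-20, 21)]
ys = [math.sin(2.0 * x) for x in xs]  # hypothetical target curve
weights, history = train_readout(xs, ys, hidden)
```

Because only the readout weights are optimized, the training problem is a linear least-squares problem in those weights, which is what makes this approach fast.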
The hypothesis class of alternative factor models to be considered in the arbitrage-regularization problem effectively reduces from (
28) to
        
        where 
 and 
 is initialized by
        
 is a given factor model of the form (
21), and 
 is a uniform random sample on a non-empty compact subset of 
; 
. Thus, the optimization problem (
30) is random since it relies on randomly generated data points 
. However, instead of initializing 
 in an ad-hoc random manner, the initialization (
30) guarantees that the shapes generated by (
29) are close to those produced by the naive factor model (
7). In this case, a brief computation shows that 
 simplifies to
        
        where 
 with the integration is defined component-wise and 
 denotes the 
 entry of the vector 
.
  4.2. Numerical Implementations
The data-set for this implementation consists of German bond data for 31 maturities with observations obtained on 1273 trading days from January 4th, 2010 to December 30th, 2014. As is common practice in machine learning, further details of our code and implementation can be found in 
Kratsios (
2019a). The code is relatively flexible and could be adapted to other bond-data sets.
The arbitrage-regularization methodology will now be applied to two factor models of affine type and its performance will be evaluated numerically. The first factor model is the commonly used dynamic Nelson-Siegel model of 
Diebold and Rudebusch (
2013) and the second is a machine learning extension of the classical PCA approach to term-structure modeling. The performance of the arbitrage-regularization for each model will be benchmarked against both the original factor models and against the HJM-extension of the Vasiček model. The Vasiček model is a natural benchmark since, as shown in 
Björk and Christensen (
1999), it is consistent with a low-dimensional factor model. Therefore, each of the factor models contains roughly the same number of driving factors which ensures that the comparisons are fair. Moreover, the numéraire process 
 will be taken to be the money-market account and we take 
. The meta-parameter 
 is taken to be 
 so that it is approximately 1.
As described in (
29)–(
31), the solution to the arbitrage-regularization problem (
9) will be numerically approximated using randomly initialized deep feed-forward neural networks. The initialization network 
f of (
29) is selected to have fixed depth 
, fixed height 
 and its weights are learned using the ADAM algorithm. The meta-parameters 
 and 
 are chosen empirically, and the parameters of the Ornstein-Uhlenbeck process are estimated by maximum likelihood. Once the model parameters and the factor model optimizing (
9) have been learned, the day-ahead predictions of the stochastic factors are obtained through Kalman filter estimates of the hidden parameters 
 for each of the factor models. In the case of the Vasiček model the unobservable short-rate parameter is also estimated using the Kalman filter (see 
Bain and Crisan (
2009)). These day-ahead predictions are then fed into the factor model and used to compute the next-day bond prices. These predictions are then compared to the realized next-day bond prices.
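The filtering and day-ahead prediction step can be sketched for a single latent factor. The example below is a scalar Kalman filter under an assumed linear-Gaussian state-space model (a discretized one-dimensional Ornstein-Uhlenbeck factor observed with noise); the parameter values and function names are hypothetical and the multi-factor implementation used for the numerical results may differ.

```python
def kalman_step(x, p, y, a, c, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    State model:       x_next = a * x + c + noise with variance q
    Observation model: y = x_next + noise with variance r
    Returns the filtered state and its variance.
    """
    x_pred = a * x + c
    p_pred = a * a * p + q
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (y - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new

def day_ahead_forecasts(observations, a, c, q, r, x0=0.0, p0=1.0):
    """Filter the observations and return the one-step-ahead state forecasts."""
    x, p = x0, p0
    forecasts = []
    for y in observations:
        x, p = kalman_step(x, p, y, a, c, q, r)
        forecasts.append(a * x + c)
    return forecasts

# Hypothetical parameters; r = 0 means the factor is observed without noise,
# so each forecast reduces to a * y + c.
forecasts = day_ahead_forecasts([0.10, 0.12, 0.11], a=0.9, c=0.01, q=0.02, r=0.0)
```

The forecasts would then be passed through the factor model to obtain next-day forward-rate curves and bond prices.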
  4.2.1. Model 1: The Dynamic Nelson-Siegel Model (Practitioner Model)
The Nelson-Siegel family is a low-dimensional family of forward-rate curve models used by various central banks to produce forward-rate or yield curves. As discussed in 
Carmona (
2014), Finland, Italy, and Spain are such examples with other countries such as Canada, Belgium, and France relying on a slight extension of this model. The Nelson-Siegel model’s popularity is largely due to its interpretable factors and satisfactory empirical performance. It is defined by
          
          where, as discussed in 
Diebold and Rudebusch (
2013), the first factor represents the long-term level of the forward-rate curve, the second represents its slope, the third represents its curvature, and 
 is a shape parameter, typically kept fixed.
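The roles of the three factors can be checked numerically. The sketch below assumes the common parameterization of the Nelson-Siegel forward curve, $\beta_1 + \beta_2 e^{-\lambda x} + \beta_3 \lambda x e^{-\lambda x}$ in the time to maturity $x$; the exact form used in the paper may differ by a rescaling of the shape parameter, and the factor values are hypothetical.

```python
import math

def nelson_siegel_forward(x, beta, lam):
    """Nelson-Siegel forward rate at time to maturity x >= 0.

    beta = (level, slope, curvature); lam > 0 is the shape (decay) parameter.
    Assumed parameterization: b1 + b2 * exp(-lam * x) + b3 * lam * x * exp(-lam * x).
    """
    b1, b2, b3 = beta
    decay = math.exp(-lam * x)
    return b1 + b2 * decay + b3 * lam * x * decay

beta = (0.04, -0.02, 0.01)  # hypothetical factor values
short_end = nelson_siegel_forward(0.0, beta, lam=0.5)
long_end = nelson_siegel_forward(200.0, beta, lam=0.5)
```

At the short end the curve equals level plus slope, while at long maturities only the level survives; the curvature term contributes a hump at intermediate maturities.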
Since market conditions are continually changing, the Nelson-Siegel model is typically extended from a static model to a dynamic model by replacing the static choice of 
 with a three-dimensional Ornstein-Uhlenbeck process and fixing the shape parameter 
 as in 
Diebold and Rudebusch (
2013). However, as demonstrated in 
Filipović (
2001), the dynamic Nelson-Siegel model does not admit a measure equivalent to 
 that makes the entire bond market simultaneously into local martingales. It was then shown in 
Christensen et al. (
2011a) that a specific additive perturbation of the Nelson-Siegel family circumvents this problem, but empirically this is observed to come at the cost of reduced predictive accuracy. In our implementation, the parameters of the Ornstein-Uhlenbeck process driving 
 will be estimated using the maximum likelihood method described in 
Meucci (
2005).
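For a single factor, maximum-likelihood estimation of an Ornstein-Uhlenbeck process reduces to an AR(1) regression on equally spaced observations. The sketch below illustrates this one-dimensional case under assumed parameter values; it is a simplified stand-in and not the multivariate procedure of the cited reference.

```python
import math
import random

def simulate_ou(kappa, mu, sigma, dt, n, x0, rng):
    """Simulate dX = kappa * (mu - X) dt + sigma dW using the exact transition law."""
    a = math.exp(-kappa * dt)
    sd = sigma * math.sqrt((1.0 - a * a) / (2.0 * kappa))
    path = [x0]
    for _ in range(n):
        path.append(mu + a * (path[-1] - mu) + sd * rng.gauss(0.0, 1.0))
    return path

def fit_ou(path, dt):
    """Conditional maximum-likelihood estimates (kappa, mu, sigma) via AR(1) regression."""
    x, y = path[:-1], path[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx            # estimate of exp(-kappa * dt)
    b = my - a * mx          # estimate of mu * (1 - a)
    resid_var = sum((yi - a * xi - b) ** 2 for xi, yi in zip(x, y)) / n
    kappa = -math.log(a) / dt
    mu = b / (1.0 - a)
    sigma = math.sqrt(resid_var * 2.0 * kappa / (1.0 - a * a))
    return kappa, mu, sigma

rng = random.Random(1)
# Hypothetical true parameters used only to generate test data
path = simulate_ou(kappa=1.0, mu=0.03, sigma=0.5, dt=0.1, n=20000, x0=0.03, rng=rng)
kappa_hat, mu_hat, sigma_hat = fit_ou(path, dt=0.1)
```

The estimator requires the fitted autoregressive coefficient to lie strictly between 0 and 1, which holds for a mean-reverting series of reasonable length.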
  4.2.2. Model 2: dPCA (Machine-Learning Model)
The dynamic Nelson-Siegel model’s shape has been developed through practitioner experience. The second factor model considered here will be of a different type, with its factors learned algorithmically. As with (
32), consider a static three-factor model for the forward-rate curve of the form
          
          where 
 are the first three principal components of the forward-rate curve calibrated on the first 100 days of data.
Subsequently, a time-series for the 
 parameters is generated, using the first 100 days of data, where on each day the 
 are optimized according to the Elastic-Net (ENET) regression problem of 
Hastie et al. (
2015) defined by
          
          on rolling windows consisting of 100 data points, where 
 are the available data-points on the forward-rate curve at time 
t. The meta-parameters 
 and 
 are chosen by cross-validation on the first 100 training days and then fixed.
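The factor-selection behaviour of an elastic-net regression can be seen in a small coordinate-descent sketch. The objective below follows one common convention, $\frac{1}{2n}\|y - Xb\|_2^2 + \lambda\left(\alpha\|b\|_1 + \frac{1-\alpha}{2}\|b\|_2^2\right)$, which is an assumption and may be scaled differently from the problem stated above; the design matrix and responses are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso-type coordinate updates."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam, alpha, sweeps=100):
    """Coordinate descent for
        (1 / (2n)) * ||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).
    X is a list of rows; no intercept and no standardization are applied.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes its own contribution
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
    return b

# Hypothetical toy design with orthogonal columns
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
ols = elastic_net(X, y, lam=0.0, alpha=1.0)    # no penalty: least squares
lasso = elastic_net(X, y, lam=0.2, alpha=1.0)  # pure L1 penalty
enet = elastic_net(X, y, lam=0.2, alpha=0.5)   # mixed L1 and L2 penalty
```

In this example the weak second coefficient is set exactly to zero by the $\ell_1$ part of the penalty, while the $\ell_2$ part additionally shrinks the first coefficient.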
The ENET regression is used due to its factor selection abilities and computational efficiency. Next, analogously to the dynamic Nelson-Siegel model, an 
-valued Ornstein-Uhlenbeck process 
 is calibrated, using the maximum likelihood methodology outlined in 
Meucci (
2005) to the time-series 
. These will provide the hidden stochastic factors in the dynamic PCA model (
33). Thus, the dPCA model is the factor model with stochastic inputs defined by
          
The resulting model differs from the dynamic Nelson-Siegel model in that its factors and dynamics are not chosen by practitioner experience but learned through the data and implicitly encode some path-dependence. However, as with the dynamic Nelson-Siegel model it falls within the scope of Theorem 2.
  5. Discussion
The predictive performance of the Vasiček model (Vasiček), dPCA, A-Reg(dPCA), the dynamic Nelson-Siegel Model (dNS), the arbitrage-free Nelson-Siegel model of 
Christensen et al. (
2011a) (AFNS), and the arbitrage-regularization of the dynamic Nelson-Siegel Model (A-Reg(dNS)) is reported in the following tables. The predictive quality is quantified by the estimated mean-squared errors when making day-ahead predictions of the bond price for each maturity, for all but the first days in our data-set. The lowest estimated mean-squared errors recorded are highlighted using bold font and the second lowest estimated mean-squared errors on each maturity are emphasized using italics.
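The error metric described above can be computed as in the following sketch, which assumes that predicted and realized bond prices are stored as day-by-maturity arrays; the function names and the sample numbers are hypothetical.

```python
def mse_by_maturity(predicted, realized):
    """Mean-squared day-ahead error for each maturity.

    predicted[d][m] and realized[d][m] are bond prices on day d for maturity index m.
    """
    n_days = len(predicted)
    n_mats = len(predicted[0])
    return [
        sum((predicted[d][m] - realized[d][m]) ** 2 for d in range(n_days)) / n_days
        for m in range(n_mats)
    ]

def rank_models(errors):
    """For one maturity, return model names ordered from lowest to highest error."""
    return sorted(errors, key=errors.get)

# Hypothetical prices for two days and two maturities
predicted = [[0.99, 0.90], [0.98, 0.91]]
realized = [[1.00, 0.92], [0.98, 0.90]]
mse = mse_by_maturity(predicted, realized)
ranking = rank_models({"model_a": 0.2, "model_b": 0.1})
```

The first two entries of such a ranking correspond to the models that would be marked in bold and in italics for a given maturity.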
Table 1 evaluates the performance of the considered models on the short-mid end of the curve. Overall, the performance of all the models is generally comparable at the very short end, but the dPCA model quickly begins to outperform the rest as the maturity increases. The accuracy of the Vasiček model on small maturities is likely due to its being a short-rate model.
 In 
Table 2 the dPCA model outperforms the rest by progressively larger margins. Most notably, in 
Table 3 and  
Table 4, which summarize the performance of the models for very long bond maturities, the A-Reg(dPCA) model shows very low predictive error for a low number of factors while simultaneously being consistent with no-arbitrage conditions.
Even though arbitrage-free regularization does slightly reduce its accuracy, which is natural since it adds a constraint into an otherwise purely predictive process, the arbitrage-regularized dPCA model is still much more accurate than the rest.
An advantage of the A-Reg(dPCA) model is that it can accurately model the long-end of the forward-rate curve in an arbitrage-free manner. This fact is due to the dynamic factor selection properties of the dPCA model which otherwise could not have been used in a consistent manner if it were not for Theorem 2.
The numerical implementation highlights a few key facts about the arbitrage-regularization methodology. First, for nearly every maturity, the empirical performance of the arbitrage-regularization of a factor model is comparable to the original factor model. An analogous phenomenon was observed in 
Devin et al. (
2010) when projecting infinite-dimensional arbitrage-free HJM models onto the finite-dimensional manifold of Nelson-Siegel curves. Therefore, correcting for arbitrage does not come at a significant predictive cost. However, it does come with the benefit of making the model theoretically sound and compatible with the techniques of arbitrage-pricing theory.
Second, since (
9) incorporates an additional constraint into the modeling procedure, the arbitrage-regularization of a factor model shows a reduction in performance compared to the initial factor model. This phenomenon has also been observed empirically in 
Christensen et al. (
2011a) for the arbitrage-free Nelson-Siegel correction of the dynamic Nelson-Siegel model. Therefore, one should not expect to improve on the predictive performance of the initial factor model by correcting for the existence of arbitrage.
Third, the empirical performance of A-Reg(dPCA) was significantly better than that of the other arbitrage-free models, namely AFNS, A-Reg(dNS), and the Vasiček model, across nearly all maturities. This was especially true for mid- and long-maturity zero-coupon bonds. Moreover, the performance of A-Reg(dPCA) and dPCA was comparable. Similarly, for most maturities, the empirical performance of the AFNS, dNS, and A-Reg(dNS) models was similar and notably lower than the performance of the A-Reg(dPCA), dPCA, and Vasiček models. This emphasizes that the arbitrage-regularization methodology produces well-performing models only if the original model itself produces accurate predictions. Therefore, it is up to the practitioner to make an appropriate choice of model. However, the methodology used to develop dPCA and A-Reg(dPCA) could be used as a generic starting point.
Since the arbitrage-regularization methodology applies to nearly any factor model, one may use any methodology to produce an accurate reference factor model and then apply arbitrage-regularization to make it theoretically consistent at a small cost in performance. This opens the possibility of applying machine learning models, such as dPCA, to finance without the worry that they are not arbitrage-free, since their asymptotic arbitrage-regularization is well-defined. Furthermore, the flexibility of deep feed-forward neural networks allows for the efficient implementation of (
9).
The AFNS model proposes an arbitrage-free correction for the dynamic Nelson-Siegel. However, there is no guarantee that the AFNS corrects dNS optimally, and the predictive gap between these two models is documented in 
Christensen et al. (
2011a). This is echoed in both 
Table 2 and  
Table 3. Furthermore, this is also reflected by Theorem 2, which guarantees asymptotic optimality of the A-Reg(dNS) model.
Unlike most regularization problems, where there is a trade-off between the regularization term and the (un-regularized) objective function, the arbitrage-regularization requires $\lambda$ to be taken as close to 1 as possible. Since the limit (26) can only be approximated numerically, the limit itself cannot be evaluated; however, $\lambda$ can be taken to be arbitrarily close to, but less than, 1. This choice is justified by Figure 1 and Figure 2, which illustrate that for values of $\lambda$ near 1 there is little change in the model’s predictive performance.
Figure 1 and Figure 2 plot the change in the shape of the day-ahead predicted forward-rate curve and the change in the MSE of the day-ahead predicted bond prices as functions of $\lambda$. In those figures, the pink curves correspond to low values of $\lambda$ and the curves progressing towards blue correspond to high values of $\lambda$. Please note that in these plots, the reparameterization of Corollary 1 is used and an abuse of notation is made by using 
 to denote 
.
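The qualitative behavior described above, namely that the solution stabilizes as the regularization weight approaches 1, can be illustrated on a toy problem. The sketch below is not the paper's objective: it assumes a convex-combination weighting, (1 - lam) * fit + lam * penalty, as a stand-in for the reparameterization of Corollary 1 (which is not reproduced here), and it uses scalar quadratic terms with hypothetical targets `fit_target` and `arbitrage_free_target` so that the minimizer is available in closed form.

```python
import numpy as np

def regularized_minimizer(lam, fit_target, arbitrage_free_target):
    """Closed-form argmin over x of
        (1 - lam) * (x - fit_target)**2 + lam * (x - arbitrage_free_target)**2,
    a toy stand-in (assumption) for the arbitrage-regularized objective."""
    return (1.0 - lam) * fit_target + lam * arbitrage_free_target

def sweep(lams, fit_target=1.0, arbitrage_free_target=0.8):
    """Evaluate the minimizer, the fit loss, and the penalty on a grid of lam < 1."""
    lams = np.asarray(lams, dtype=float)
    x = regularized_minimizer(lams, fit_target, arbitrage_free_target)
    fit_loss = (x - fit_target) ** 2            # plays the role of predictive error
    penalty = (x - arbitrage_free_target) ** 2  # plays the role of the arbitrage penalty
    return x, fit_loss, penalty

# lam is taken close to, but strictly less than, 1.
x, fit_loss, penalty = sweep([0.0, 0.5, 0.9, 0.99, 0.999])
```

In this toy setting the penalty shrinks to zero as `lam` increases, while the fit loss changes less and less between successive values of `lam` near 1, which mirrors the flattening visible in the figures.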
In the case of the dNS model, an interesting property is that long-maturity bond prices do not change much, whereas short-maturity bond prices exhibit more dramatic changes. This property suggests that the dNS model is closer to being arbitrage-free on the long end of the curve than it is on the short end.
This paper introduced a novel model-selection problem and provided an asymptotic solution in the form of the penalized optimization problem (9). The problem was posed and solved in a generalized HJM-type setting in Theorem 1, and specialized to the term-structure of interest rates setting in Theorem 2, where simple expressions for the penalty term were derived.
The key innovation of the paper was the construction of the penalty term 
 defining the arbitrage-regularization problem (
9). The construction of this term in Proposition 1 relied on the structure of the generalized HJM-type setting proposed in 
Heath et al. (
1992) and generalized in (
4), which allowed one to encode the dynamics of a large class of factor models with stochastic inputs into the specific structure of any asset class.
The numerical feasibility of the proposed method was made possible by the flexibility of feed-forward neural networks, as demonstrated in 
Hornik (
1991) and 
Kratsios (
2019b), which allowed the optimizer of the arbitrage-regularization problem (
9) to be approximated to arbitrary precision. In the numerics section of this paper, it was found that the arbitrage-regularization of a factor model does not heavily impact its predictive performance but does make it approximately consistent with no-arbitrage requirements.
In particular, the compatibility of the proposed approach with generic factor models with stochastic inputs allowed for the consistent use of factor models generated from machine learning methods. The A-Reg(dPCA) model is a novel example of such an approximately arbitrage-free model where the dynamics and factors were generated algorithmically instead of through practitioner experience.
The precise quantification of the approximate arbitrage-free property was made in Proposition 2. Thus, factor models that are approximately arbitrage-free under the stylized assumption of no transaction costs are indeed arbitrage-free when proportional transaction costs are in place, which is a more realistic assumption.
Finally, the arbitrage-regularization approach introduced in this paper opens the door to using predictive machine-learning factor models in a manner compatible with the no free lunch with vanishing risk condition. The general treatment in Theorem 1 can be transferred to other asset classes and to models generated from other learning algorithms. This approach can be an important new avenue of research lying at the junction of predictive machine learning and mathematical finance.