**1. Introduction**

A risk measure is a functional on the probability distribution of the fluctuating returns of a security or a portfolio. Since it is impossible to condense all the information in a probability distribution into a single number, there is no unique way to choose the "best" risk measure. In Markowitz's groundbreaking portfolio selection theory [1], with the assumption of Gaussian distributed returns, variance offered itself as the natural risk measure. The crises of the late eighties and early nineties led both the industry and regulators to realize that the most dangerous risk lurked in the far tail of the return distribution. To capture this risk, a high quantile of the profit and loss distribution, called Value at Risk (VaR), was introduced by J.P. Morgan [2]. For a certain period, VaR became a kind of industry standard, and it was embraced by international financial regulation as the official risk measure in 1996 [3]. Value at Risk is a threshold which losses exceed only with a small probability (e.g., 0.05 or 0.01), corresponding to a confidence level of *α* = 0.95 or 0.99, respectively. (In this context, it is
customary to regard losses as positive and profits as negative). As a quantile, VaR is not sensitive to the distribution of losses above the confidence level and is not subadditive when two portfolios are combined. This triggered a search for alternatives and led Artzner et al. [4] to formulate a set of axioms that any coherent risk measure should satisfy. The simplest and most intuitive of these coherent measures is the Expected Shortfall (ES) [5,6]. ES is essentially the expected loss above a high quantile that can be chosen to be the VaR itself. After a long debate about the relative merits and drawbacks of ES, whose details are not pertinent to our present study, regulators adopted ES as the current official market risk measure to be used to assess the financial health of banks and determine the capital charge they are required to hold against their risks. The regulators and the industry settled on a confidence level of *α* = 0.975 [7].

ES is designed mainly as a diagnostic tool. At the same time, it is also a constraint that banks have to respect when choosing the composition of their portfolios. It is then in their best interest to optimize ES, in order to keep their capital charge as low as possible. However, the optimization of ES is fraught with problems of estimation error, which is quite natural if one considers that the number of different items *N* in a bank's portfolio can be very large, whereas the number of observations (the length of the available time series *T*) is always limited. In addition, at the regulatory confidence level, one has to throw away 97.5% of the data. Moreover, the estimation error increases with the ratio *r* = *N*/*T*, and at a critical value of *r* it actually diverges, growing beyond any limit. As shown in [8], the instability of the optimization of ES (as well as of all other coherent risk measures) follows directly from the coherence axioms [4].

The divergence of the estimation error is the signature of a phase transition. The critical *r* for ES is smaller than or equal to 1/2, with its precise value depending on the confidence level *α*. For ES, there is then a line of critical points, a phase diagram, on the *r* − *α* plane. A part of this phase diagram was traced out by numerical simulations in [9], while the full phase diagram was determined analytically by Ciliberti et al. [10]. Going beyond merely determining the phase diagram, a detailed study of the estimation error and other relevant quantities was performed inside the whole feasibility region in [11,12], and it was shown that, due to the nontrivial behavior of the contour lines of constant estimation error, especially in the vicinity of *α* = 1, the amount of data necessary to achieve a reasonably low estimation error is far above any *T* available in practice.

Because of the large sample fluctuations of ES, its optimization constitutes a problem in high-dimensional statistics [13]. A standard tool for taming these large fluctuations is to introduce regularizers, which penalize large excursions of the weights. Although the introduction of these penalties may seem an arbitrary statistical trick coming from outside finance, it was shown in [14] that these regularizers express liquidity considerations, taking into account, already at the construction of the portfolio, the expected market impact of a future liquidation. The regularizers are usually chosen to be constraints on the norm of the portfolio weight vector. In [15], we studied the effect of an ℓ₂ regularizer on ES and found that ℓ₂ suppresses the instability and, for sufficiently small *r* and a strong enough regularizer, extends the range where the estimation error is reasonably small by a factor of about 4.

It is interesting to see how an ℓ₁ regularizer works in combination with ES. (The importance of studying the effect of various regularizers in combination with different risk measures was emphasized in [16].) The ℓ₁ regularizer is known to produce sparse solutions, which means that, in order to rein in large fluctuations, it eliminates some of the securities from the portfolio. This obviously contradicts the principle of diversification, but considerations of transaction costs or the technical difficulties of managing large portfolios may make it desirable to remove the most volatile items from the portfolio, and this is precisely what a no-short constraint tends to do.

It has been known for 20 years now that the optimization of ES can be translated into a linear programming problem [17]. Accordingly, as realized in [18], the piecewise linear ℓ₁ penalty, with an infinite slope corresponding to an infinite penalty on short selling, can prevent the instability of ES. The purpose of this paper is to determine the effect of ℓ₁ regularization on the phase diagram and also on the behavior of the various quantities of interest inside the region where the optimization of ES is feasible and meaningful. (We will see that, as a result of regularization, new characteristic lines appear on the *r* − *α* plane, beyond which the optimization of ES is still mathematically feasible, but the results become meaningless, as they correspond to negative risk.) In [12], a detailed analytical investigation of the behavior of the estimation error, the in-sample cost, the sensitivity to small changes in the composition of the portfolio, and the distribution of optimal weights was carried out in the non-regularized case. Here, we derive the same quantities for an ℓ₁-regularized ES, including the special case where short selling is banned, that is, when the portfolio weights are constrained to be non-negative. The density of the items eliminated from the portfolio, to be referred to as the "condensate" in the following, is also determined. The most striking result of the present study is that the regularized solution can be mapped back onto the unregularized one. We are not aware of a similarly tight relationship between a regularized and an unregularized problem, either in a finance context or in the wider context of machine learning.
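
To make the linear programming formulation concrete, here is a minimal numerical sketch (not the analytic machinery used below). It assumes the usual ∑*i wi* = 1 budget normalization, and the function name `min_es_l1` and its parameters are illustrative. The ℓ₁ penalty stays piecewise linear by splitting each weight into long and short parts, so the problem remains an LP solvable with `scipy.optimize.linprog`; a very large `eta_minus` effectively bans short positions.

```python
import numpy as np
from scipy.optimize import linprog

def min_es_l1(x, alpha=0.975, eta_plus=0.0, eta_minus=0.0):
    """Minimize the sample Expected Shortfall via the linear-programming formulation,
    with an asymmetric l1 penalty (eta_plus on long, eta_minus on short positions).
    x: T x N array of returns; losses are -x @ w; budget normalization sum(w) = 1."""
    T, N = x.shape
    k = 1.0 / ((1.0 - alpha) * T)
    # variables: [w_plus (N), w_minus (N), epsilon (1), u (T)], with w = w_plus - w_minus
    c = np.concatenate([eta_plus * np.ones(N), eta_minus * np.ones(N), [1.0], k * np.ones(T)])
    # u_t >= loss_t - epsilon  <=>  -x_t.(w+ - w-) - epsilon - u_t <= 0
    A_ub = np.hstack([-x, x, -np.ones((T, 1)), -np.eye(T)])
    b_ub = np.zeros(T)
    # budget constraint: sum_i w_i = 1
    A_eq = np.concatenate([np.ones(N), -np.ones(N), [0.0], np.zeros(T)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (2 * N) + [(None, None)] + [(0, None)] * T
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    w = res.x[:N] - res.x[N:2 * N]
    es_in_sample = res.x[2 * N] + k * res.x[2 * N + 1:].sum()   # epsilon + average tail excess
    return w, es_in_sample

# usage (synthetic data): w, es = min_es_l1(np.random.default_rng(0).normal(size=(500, 20)),
#                                            eta_minus=1e6)   # huge eta_minus ~ no-short ban
```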

#### **2. Method and Preliminaries**

If the true probability distribution of returns were known, it would be easy to calculate the true value of Expected Shortfall and the optimal portfolio weights. However, the true distribution of returns is unknown, therefore one has to rely on finite samples of empirical data. This means one observes *N* time series of length *T* and estimates the optimal weights and ES on the basis of this information. It is clear that the weights and ES so obtained will deviate from their "true" values. (The latter would be obtained in an infinitely long stationary sample.) The deviation of the estimated values will be the larger, the shorter the length *T* and the larger the dimension *N*. Performing this measurement on different samples, one would obtain different estimates: there is a distribution of ES and of the optimal weights over the samples. In a real market, one cannot repeat such an experiment multiple times. Instead, one has to squeeze out as much information as possible from a single sample of limited size. There are well-known numerical methods for this, such as cross-validation or the bootstrap [19]. In contrast, in the present work we aim to obtain *analytic* results. In order to mimic empirical sampling, we choose a simple data generating process, such as a multivariate Gaussian. The true value of ES is easy to obtain in this case, which provides a benchmark against which finite-sample deviations can be measured. Then we determine ES for a large number of random samples of length *T* drawn from this underlying distribution, average it over the random samples, and finally compare this average to its true value. This procedure will give us an idea of how large the estimation error is for a given dimension *N*, sample size *T*, and confidence level *α*, under the idealized conditions of stationarity and Gaussian fluctuations, and of how much it is reduced when we apply an ℓ₁ regularizer of a given strength. It is reasonable to assume that the estimation error obtained under these idealized circumstances will be a lower bound to the estimation error for real-life processes.
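
As an illustration of this sampling experiment (only an illustration; the analytic route is taken below), the average ratio of the out-of-sample ES of the sample-optimal portfolio to its true minimal value can be estimated numerically. The sketch assumes independent Gaussian returns with heterogeneous volatilities, for which both ES values are proportional to the portfolio standard deviation, and it reuses the hypothetical `min_es_l1` helper from the previous snippet; all names and default parameters are illustrative.

```python
import numpy as np

def average_es_ratio(N=20, T=200, alpha=0.975, n_samples=50, eta_minus=0.0, seed=0):
    """Average ratio of the out-of-sample ES of the sample-optimal portfolio to the
    true minimal ES, over many random Gaussian samples (independent returns).
    Relies on min_es_l1 from the previous sketch."""
    rng = np.random.default_rng(seed)
    sigma = rng.uniform(0.5, 2.0, size=N)                  # heterogeneous volatilities
    true_min_std = np.sqrt(1.0 / np.sum(1.0 / sigma**2))   # true optimum has w_i ~ 1/sigma_i^2
    ratios = []
    for _ in range(n_samples):
        x = rng.normal(size=(T, N)) * sigma                # one sample of length T
        w, _ = min_es_l1(x, alpha=alpha, eta_minus=eta_minus)
        # Gaussian ES is proportional to the std dev, so the ES ratio is a ratio of std devs
        ratios.append(np.sqrt(np.sum((w * sigma) ** 2)) / true_min_std)
    return np.mean(ratios)                                  # grows as r = N/T increases
```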

Now we wish to implement this program via analytic calculations. The averaging over the random samples just described is analogous to the averaging over the random realizations of disorder in the statistical physics of random systems, which enables us to borrow methods from that field, in particular the replica method [20]. The method assumes that both *N* and *T* are large, with their ratio *r* = *N*/*T* kept finite (the thermodynamic, or Kolmogorov, limit). A small value of *r* corresponds to the classical setup in statistics where one has a large number of observations relative to the dimension. Estimates in this case are sharp and close to their true values. In contrast, when *r* is of order unity or larger, we are in the high-dimensional limit where fluctuations are large. It is here that the regularizer becomes important.

In the usual application of ℓ₁ in finite dimensional numerical studies, the regularizer eliminates the dimensions one by one, in a stepwise manner, as the strength of the regularizer is increased. In our present work, the large *N*, *T* limit and the averaging over infinitely many samples result in a continuous dependence of the "condensate" density (the relative number *N*0/*N* of the dimensions eliminated by ℓ₁) on the aspect ratio *r*, the confidence level *α*, and the strength of the ℓ₁ penalty. In a study of ℓ₁-regularized variance [21], we found that the stepwise increase of the density of eliminated weights in a numerical experiment nicely follows the continuous curve obtained analytically. The situation is similar in the case of ES, which we have also confirmed by numerical simulations.
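
For a single finite sample, the condensate density can be read off directly from the optimized weights. A minimal sketch, again relying on the hypothetical `min_es_l1` helper and using an illustrative numerical tolerance for deciding when a weight counts as zero:

```python
import numpy as np

def condensate_density(x, etas, alpha=0.975, tol=1e-8):
    """Fraction N0/N of portfolio weights eliminated by the l1 penalty, swept over
    its strength eta (applied symmetrically here). Relies on min_es_l1 from above."""
    densities = []
    for eta in etas:
        w, _ = min_es_l1(x, alpha=alpha, eta_plus=eta, eta_minus=eta)
        densities.append(np.mean(np.abs(w) < tol))   # share of weights numerically at zero
    return np.array(densities)

# e.g. condensate_density(x, etas=np.linspace(0.0, 0.5, 11)) gives a staircase for a single
# finite sample; averaged over samples it approaches the smooth analytic curve of N0/N
```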

For the sake of simplicity, we will also assume that the returns are independent, that is, the true covariance matrix is diagonal. This is not an innocent assumption: it will be seen, for example, that the maximum degree of sparsity that ℓ₁ can achieve in this scheme is one half of the total number of dimensions, whereas for correlated returns the maximum sparsity can be either larger or smaller than 1/2, according to whether correlations are predominantly positive or negative. Combining ℓ₁ with a non-diagonal covariance matrix poses additional technical difficulties that we wish to avoid in the present account. However, we do allow the individual volatilities *σi* (the square roots of the diagonal elements of the covariance matrix) to differ from each other.

As a further simplification, we do not impose any constraint on the optimization of ES besides the budget constraint and the ℓ₁ regularizer. In particular, we do not set a constraint on the expected return, and we seek the global minimum of the regularized ES. This is in line with a number of studies, [22–24] among others, which focus on the global minimum in variance optimization because estimates of the expected return are extremely noisy. Furthermore, the global minimum is precisely what one needs when minimizing tracking error, that is, when trying to follow, say, a market index as closely as possible [23].

The replica method used below has already been applied, with minor variations, to various portfolio optimization problems in a number of papers [10–12,14,18,21,25–28], where the replica derivation of the main formulae has been explained repeatedly, so we do not need to go through that exercise again here. The natural starting point for our present work is therefore the detailed study of the behavior of ES *without* regularization in [12]. The argument there leads to a relationship between ES and an effective cost, or free energy per asset, *f*, as follows:

$$\text{ES} = \frac{f\,r}{1 - \alpha}.\tag{1}$$

The free energy *f* itself is given by the minimum of a functional depending on six order parameters

$$\begin{split} f(\lambda, \epsilon, q\_0, \Delta, \hat{q}\_0, \hat{\Delta}) &= \quad \lambda + \frac{1}{r}(1 - \alpha)\epsilon - \Delta \hat{q}\_0 - \hat{\Delta} q\_0 \\ &+ \quad \langle \min\_{w} \left[ V(w, z, \sigma) \right] \rangle\_{\sigma, z} + \frac{\Delta}{2r\sqrt{\pi}} \int\_{-\infty}^{\infty} \mathrm{d}s \, e^{-s^2} g\!\left( \frac{\epsilon}{\Delta} + s\sqrt{\frac{2q\_0}{\Delta^2}} \right) \,, \end{split} \tag{2}$$

where

$$V(w, z, \sigma) = \hat{\Delta}\sigma^2 w^2 - \lambda w - zw\sigma \sqrt{-2\hat{q}\_0} + \eta^+\theta(w)\,w - \eta^-\theta(-w)\,w \tag{3}$$

and the double average ⟨…⟩*σ*,*z* means

$$\int\_0^\infty \mathrm{d}\sigma \, \frac{1}{N} \sum\_i \delta(\sigma - \sigma\_i) \int\_{-\infty}^\infty \frac{\mathrm{d}z}{\sqrt{2\pi}} e^{-z^2/2} \dots \tag{4}$$

Finally, the function *g* in the integral in (2) is defined as

$$g(x) = \begin{cases} 0, & x \ge 0 \\ x^2, & -1 \le x \le 0 \\ -2x - 1, & x < -1 \end{cases}. \tag{5}$$
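
For concreteness, the pieces (2)–(5) can be assembled into a routine that evaluates the functional numerically at trial values of the order parameters. The sketch below is not needed for the analytic solution, whose stationarity conditions are derived in the next section; it only makes the structure of (2) explicit. Names, quadrature cutoffs, and the requirement that the trial values respect the sign conventions discussed below are illustrative assumptions. The inner minimization over *w* can be done branch by branch, since *V* is quadratic on *w* ≥ 0 and on *w* ≤ 0.

```python
import numpy as np
from scipy.integrate import quad

def g(x):                                            # Eq. (5)
    return 0.0 if x >= 0 else (x * x if x >= -1 else -2.0 * x - 1.0)

def min_V(z, sigma, lam, qhat0, dhat, eta_p, eta_m):
    # closed-form minimum over w of the potential (3): on each branch V = a*w^2 + b*w
    a = dhat * sigma**2
    drift = lam + z * sigma * np.sqrt(-2.0 * qhat0)
    b_long, b_short = eta_p - drift, -eta_m - drift
    m_long = 0.0 if b_long >= 0 else -b_long**2 / (4 * a)     # best value with w >= 0
    m_short = 0.0 if b_short <= 0 else -b_short**2 / (4 * a)  # best value with w <= 0
    return min(m_long, m_short)                               # both branches contain w = 0

def free_energy(lam, eps, q0, delta, qhat0, dhat, r, alpha, sigmas, eta_p=0.0, eta_m=0.0):
    """Evaluate the functional (2) at trial order-parameter values
    (lam, q0, delta, dhat > 0, qhat0 < 0); the physical solution is its stationary point."""
    # double average (4): empirical average over sigma_i, Gaussian average over z
    avg = np.mean([quad(lambda z: np.exp(-z * z / 2) / np.sqrt(2 * np.pi)
                        * min_V(z, s, lam, qhat0, dhat, eta_p, eta_m), -8, 8)[0]
                   for s in sigmas])
    # last term of (2): Gaussian integral over g
    tail = delta / (2 * r * np.sqrt(np.pi)) * quad(
        lambda s: np.exp(-s * s) * g(eps / delta + s * np.sqrt(2 * q0) / delta),
        -np.inf, np.inf)[0]
    return lam + (1 - alpha) * eps / r - delta * qhat0 - dhat * q0 + avg + tail
```

In this sketch, the branch-by-branch minimum also makes the origin of the sparsity visible: whenever the effective drift *λ* + *zσ*√(−2*q̂*0) falls between −*η*<sup>−</sup> and *η*<sup>+</sup>, the slopes of both branches point away from zero and the optimal weight is exactly *w* = 0, which is the mechanism behind the condensate discussed above.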

The differences with respect to the setup in [12] are the following: a trivial change of notation (*τ* there is 1/*r* here); the variable *σ* has been introduced in (3), which, together with the recipe (4), allows us to consider assets with different volatilities *σi*; and the regularizer has been built into the effective potential (3). Note that the ℓ₁ term in (3) is asymmetric, in order to allow us to penalize long and short positions separately. The usual ℓ₁ corresponds to *η*<sup>+</sup> = *η*<sup>−</sup>, and the ban on short selling to *η*<sup>−</sup> → ∞. We will also use the arrangement where there is a finite penalty *η*<sup>−</sup> on short positions and none on long ones, *η*<sup>+</sup> = 0.

A note on signs: for consistency, the order parameters *λ*, Δ, *q*0, and Δ̂ must be positive, *q̂*0 must be negative, and *ε* can be of either sign. Furthermore, *λ* must be greater than or equal to the right slope of the regularizer: *λ* ≥ *η*<sup>+</sup>.

Before setting out to derive the stationarity conditions that determine the optimal value of the free energy and thence of ES, we spell out the meaning of the order parameters. The first of these is the Lagrange multiplier *λ* that enforces the budget constraint:

$$\sum\_{i=1}^{N} w\_i = N.\tag{6}$$

Note that the sum of portfolio weights is set to *N* here, instead of the usual 1. This is to keep the weights of order unity in the large *N* limit.

Because of the relationship between *λ* and the budget constraint, *λ* can be thought of as a kind of chemical potential. It is an important quantity because, as we shall see later, its value at the stationary point is equal to the free energy, hence directly related to the optimal value of ES. In [12], we argued that this optimal value of ES is, in fact, the in-sample estimate of Expected Shortfall. According to (1), ES is proportional to the product *f r*, which means that *f*, and hence *λ* too, must be inversely proportional to *r* when *r* = *N*/*T* → 0, because ES is certainly finite in this limit: a finite *N* and *T* → ∞ corresponds to the case of having complete information. This spurious divergence of *f* and *λ* is an artifact of our having absorbed a factor 1/*r* into their definition. This choice is purely a matter of convenience: we wish to stay as close to the convention of [12] as possible. The opposite limit, when *λ* − *η*<sup>+</sup> vanishes, is another important point: it signals the instability of the portfolio and the onset of the phase transition.

The next order parameter, *ε*, was suggested by [17] as a proxy for Value at Risk. Indeed, in the limit *r* → 0, where we know the true distribution of returns, *ε* will be seen to be equal to the known value of VaR for a Gaussian.

The third order parameter, *q*0, is of central importance: According to [12], the ratio of the out-of-sample estimate ES*out* and its true value ES(0) is given by the square root of *q*0. For the case of different *σi*s considered here, *q*0 has to be amended by a factor depending on the structure of the portfolio [21] as

$$\tilde{q}\_0 = q\_0 \, \frac{1}{N} \sum\_i \frac{1}{\sigma\_i^2}. \tag{7}$$

Then the ratio of the estimated and true ES will be

$$\frac{\text{ES}\_{\text{out}}}{\text{ES}^{(0)}} = \sqrt{\tilde{q}\_0}, \tag{8}$$

that is, the relative estimation error is √*q̃*0 − 1.
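
As a purely illustrative numerical example with hypothetical values (not a result of the present calculation): for *N* = 3 assets with volatilities *σ* = (0.5, 1, 2), one has (1/*N*)∑*i* 1/*σi*² = (4 + 1 + 0.25)/3 = 1.75, so a sample value *q*0 = 1.2 would give *q̃*0 = 1.2 × 1.75 = 2.1 and a relative estimation error of √2.1 − 1 ≈ 0.45, i.e., an out-of-sample ES about 45% above its true value. For identical unit volatilities, *q̃*0 reduces to *q*0.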

The fourth order parameter, Δ, measures the sensitivity to a small shift in the returns.

The remaining two order parameters, *q̂*0 and Δ̂, are auxiliary variables that do not have an obvious meaning; they enter the picture through the replica formalism and can be eliminated once the stationarity conditions have been established. The stationarity or saddle point conditions are derived by taking the derivatives of the free energy with respect to the order parameters and setting them to zero. They will be written down in the next Section.
