**1. Introduction**

Over the past several years, neural nets, particularly deep nets, have become the state-of-the-art in a remarkable number of machine learning problems, from mastering Go to image recognition/segmentation and machine translation (see the review article [1] for more background). Despite all their practical successes, a robust theory of why they work so well is in its infancy. Much of the work to date has focused on the problem of explaining and quantifying the expressivity (the ability to approximate a rich class of functions) of deep neural nets [2–11]. Expressivity can be seen as an effect of both depth and width. It has been known since at least the work of Cybenko [12] and Hornik-Stinchcombe-White [13] that if no constraint is placed on the width of a hidden layer, then a single hidden layer is enough to approximate essentially any function. The purpose of this article, in contrast, is to investigate the "effect of depth without the aid of width." More precisely, for each *d* ≥ 1, we would like to estimate:

$$w\_{\min}(d) := \min \left\{ w \in \mathbb{N} \; \middle| \; \begin{array}{c} \text{ReLU nets of width } w \text{ can approximate any} \\ \text{positive continuous function on } [0,1]^d \text{ arbitrarily well} \end{array} \right\}. \tag{1}$$

Here, N = {0, 1, 2, ...} are the natural numbers and ReLU is the so-called "rectified linear unit," ReLU(*t*) = max{0, *t*}, which is the most popular non-linearity used in practice (see (4) for the exact definition). In Theorem 1, we prove that *w*min(*d*) ≤ *d* + 2. This raises two questions:

**Q1.** Is the estimate *w*min(*d*) ≤ *d* + 2 sharp? That is, what is the exact value of *w*min(*d*)?

**Q2.** For a fixed function *f* and precision *ε*, how deep must a ReLU net of bounded width be in order to approximate *f* to within *ε*?


A priori, it is not clear how to estimate *w*min(*d*) and whether it is even finite. One of the contributions of this article is to provide reasonable bounds on *w*min(*d*) (see Theorem 1). Moreover, we also provide quantitative estimates on the corresponding rate of approximation. On the subject of Q1, we will prove in forthcoming work with M. Sellke [14] that in fact, *w*min(*d*) = *d* + 1. When *d* = 1, the lower bound is simple to check, and the upper bound follows, for example, from Theorem 3.1 in [5]. The main results in this article, however, concern Q1 and Q2 for convex functions. For instance, we prove in Theorem 1 that:

$$w\_{\text{min}}^{\text{conv}}(d) \le d + 1,\tag{2}$$

where:

$$w\_{\min}^{\text{conv}}(d) := \min \left\{ w \in \mathbb{N} \; \middle| \; \begin{array}{l} \text{ReLU nets of width } w \text{ can approximate any} \\ \text{positive convex function on } [0,1]^d \text{ arbitrarily well} \end{array} \right\}. \tag{3}$$

This illustrates a central point of the present paper: the convexity of the ReLU activation makes ReLU nets well-adapted to representing convex functions on [0, 1]*d*.

Theorem 1 also addresses Q2 by providing quantitative estimates on the depth of a ReLU net with width *d* + 1 that approximates a given convex function. We provide similar depth estimates for arbitrary continuous functions on [0, 1]*d*, but this time for nets of width *d* + 3. Several of our depth estimates are based on the work of Balázs-György-Szepesvári [15] on max-affine estimators in convex regression.

In order to prove Theorem 1, we must understand what functions can be exactly computed by a ReLU net. Such functions are always piecewise affine, and we prove in Theorem 2 the converse: every piecewise affine function on [0, 1]*d* can be exactly represented by a ReLU net with hidden layer width at most *d* + 3. Moreover, we prove that the depth of the network that computes such a function is bounded by the number of affine pieces it contains. This extends the results of Arora-Basu-Mianjy-Mukherjee (e.g., Theorem 2.1 and Corollary 2.2 in [2]).

Convex functions again play a special role. We show that every convex function on [0, 1]*d* that is piecewise affine with *N* pieces can be represented exactly by a ReLU net with width *d* + 1 and depth *N*.

#### **2. Statement of Results**

To state our results precisely, we set notation and recall several definitions. For *d* ≥ 1 and a continuous function *f* : [0, 1]*d* → R, write:

$$\|f\|\_{C^0} := \sup\_{x \in [0,1]^d} |f(x)|\,.$$

Further, denote by:

$$\omega\_f(\varepsilon) := \sup \{ |f(x) - f(y)| \; \mid \; |x - y| \le \varepsilon \}$$

the modulus of continuity of *f*, whose value at *ε* is the maximum amount that *f* can change when its argument moves by at most *ε*. Note that by the definition of a continuous function, *ωf*(*ε*) → 0 as *ε* → 0. Next, given *d*in, *d*out, and *w* ≥ 1, we define a feed-forward neural net with ReLU activations, input dimension *d*in, hidden layer width *w*, depth *n*, and output dimension *d*out to be any member of the finite-dimensional family of functions:

$$\operatorname{ReLU} \circ A\_n \circ \cdots \circ \operatorname{ReLU} \circ A\_2 \circ \operatorname{ReLU} \circ A\_1 \tag{4}$$

that map R*d*in to:

$$\mathbb{R}\_{+}^{d\_{\text{out}}} = \{ x = (x\_1, \dots, x\_{d\_{\text{out}}}) \in \mathbb{R}^{d\_{\text{out}}} \mid x\_i \ge 0 \}.$$

In (4),

$$A\_j: \mathbb{R}^w \to \mathbb{R}^w, \ j = 2, \dots, n - 1, \qquad A\_1: \mathbb{R}^{d\_{\text{in}}} \to \mathbb{R}^w, \ A\_n: \mathbb{R}^w \to \mathbb{R}^{d\_{\text{out}}}$$

are affine transformations, and for every *m* ≥ 1:

$$\operatorname{ReLU}(x\_1, \dots, x\_m) = (\max\{0, x\_1\}, \dots, \max\{0, x\_m\})\,.$$
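Concretely, the family (4) can be sketched in a few lines of plain Python (our own naming throughout, not code from the paper):

```python
# A minimal sketch of the family (4): a ReLU net alternates affine
# maps A_j with the coordinatewise ReLU non-linearity.

def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, bias, x):
    # y_i = sum_j W[i][j] x_j + bias_i
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, bias)]

def relu_net(layers, x):
    """layers = [(W_1, b_1), ..., (W_n, b_n)]; computes
    ReLU∘A_n∘···∘ReLU∘A_1(x) as in (4)."""
    for W, bias in layers:
        x = relu(affine(W, bias, x))
    return x

# Example with input dimension 2, width 2, output dimension 1: this
# net computes max{x_1, x_2} on the positive quadrant (cf. Lemma 2).
net = [
    ([[1.0, -1.0], [0.0, 1.0]], [0.0, 0.0]),   # A_1(x, y) = (x - y, y)
    ([[1.0, 1.0]], [0.0]),                     # A_2(z, w) = z + w
]
assert relu_net(net, [0.3, 0.7]) == [0.7]
```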

We often denote such a net by N and write:

$$f\_{\mathcal{N}}(\mathbf{x}) := \operatorname{ReLU} \circ A\_n \circ \cdots \circ \operatorname{ReLU} \circ A\_2 \circ \operatorname{ReLU} \circ A\_1(\mathbf{x})$$

for the function it computes. Our first result contrasts both the width and depth required to approximate continuous, convex, and smooth functions by ReLU nets.

**Theorem 1.** *Let d* ≥ 1 *and f* : [0, 1]*d* → R+ *be a positive function with* ‖*f*‖*C*0 = 1*. We have the following three cases:*

**1. (***f* **is continuous)** *There exists a sequence of feed-forward neural nets* N*k with ReLU activations, input dimension d*, *hidden layer width d* + 2, *and output dimension* 1, *such that:*

$$\lim\_{k \to \infty} \|f - f\_{\mathcal{N}\_k}\|\_{C^0} = 0. \tag{5}$$

*In particular, w*min(*d*) ≤ *d* + 2. *Moreover, write ωf for the modulus of continuity of f*, *and fix ε* > 0. *There exists a feed-forward neural net* N*ε with ReLU activations, input dimension d*, *hidden layer width d* + 3, *output dimension* 1, *and:*

$$\text{depth}\left(\mathcal{N}\_{\varepsilon}\right) = \frac{2 \cdot d!}{\omega\_f \left(\varepsilon\right)^d} \tag{6}$$

*such that:*

$$\|f - f\_{\mathcal{N}\_\varepsilon}\|\_{C^0} \le \varepsilon. \tag{7}$$

**2. (***f* **is convex)** *There exists a sequence of feed-forward neural nets* N*k with ReLU activations, input dimension d*, *hidden layer width d* + 1, *and output dimension* 1, *such that:*

$$\lim\_{k \to \infty} \|f - f\_{\mathcal{N}\_k}\|\_{C^0} = 0. \tag{8}$$

*Hence, w*conv min (*d*) ≤ *d* + 1. *Further, there exists C* > 0 *such that if f is both convex and Lipschitz with Lipschitz constant L*, *then the nets* N*k in* (8) *can be taken to satisfy:*

$$\text{depth}\left(\mathcal{N}\_k\right) = k + 1, \qquad \left\|f - f\_{\mathcal{N}\_k}\right\|\_{C^0} \le CLd^{3/2}k^{-2/d}.\tag{9}$$

**3. (***f* **is smooth)** *There exists a constant K depending only on d and a constant C depending only on the maximum of the first K derivatives of f such that for every k* ≥ 3*, the width d* + 2 *nets* N*k in* (5) *can be chosen so that:*

$$\text{depth}(\mathcal{N}\_k) = k, \qquad \left\| f - f\_{\mathcal{N}\_k} \right\|\_{C^0} \le C \left( k - 2 \right)^{-1/d}. \tag{10}$$

The main novelty of Theorem 1 is the width estimate *w*conv min (*d*) ≤ *d* + 1 and the quantitative depth estimates (9) for convex functions, as well as the analogous estimates (6) and (7) for continuous functions. Let us briefly explain the origin of the other estimates. The relation (5) and the corresponding estimate *w*min(*d*) ≤ *d* + 2 are a combination of the well-known fact that ReLU nets with one hidden layer can approximate any continuous function and a simple procedure by which a ReLU net with input dimension *d* and a single hidden layer of width *n* can be replaced by another ReLU net that computes the same function, but has depth *n* + 2 and width *d* + 2. For these width *d* + 2 nets, we are unaware of how to obtain quantitative estimates on the depth required to approximate a fixed continuous function to a given precision. At the expense of changing the width of our ReLU nets from *d* + 2 to *d* + 3, however, we furnish the estimates (6) and (7). On the other hand, using Theorem 3.1 in [5], when *f* is sufficiently smooth, we obtain the depth estimates (10) for width *d* + 2 ReLU nets. Indeed, since we are working on a compact set [0, 1]*d*, the smoothness classes *Ww*,*q*,*γ* from [5] reduce to classes of functions that have sufficiently many bounded derivatives.

Our next result concerns the exact representation of piecewise affine functions by ReLU nets. Instead of measuring the complexity of such a function by its Lipschitz constant or modulus of continuity, we measure it by the minimal number of affine pieces needed to define it.

**Theorem 2.** *Let d* ≥ 1 *and f* : [0, 1]*d* → R+ *be the function computed by some* ReLU *net with input dimension d, output dimension* 1, *and arbitrary width. There exist affine functions gα*, *hβ* : [0, 1]*d* → R *such that f can be written as the difference of positive convex functions:*

$$f = g - h, \qquad g := \max\_{1 \le \alpha \le N} g\_{\alpha}, \qquad h := \max\_{1 \le \beta \le M} h\_{\beta}. \tag{11}$$

*Moreover, there exists a feed-forward neural net* N *with ReLU activations, input dimension d*, *hidden layer width d* + 3, *output dimension* 1, *and:*

$$\text{depth}\left(\mathcal{N}\right) = 2(M+N) \tag{12}$$

*that computes f exactly. Finally, if f is convex (and hence, h vanishes), then the width of* N *can be taken to be d* + 1*, and the depth can be taken to be N*.

The fact that the function computed by a ReLU net can be written as (11) follows from Theorem 2.1 in [2]. The novelty in Theorem 2 is therefore the uniform width estimate *d* + 3 in the representation of any function computed by a ReLU net and the *d* + 1 width estimate for convex functions. Theorem 2 will be used in the proof of Theorem 1.

#### **3. Relation to Previous Work**

This article is related to several strands of prior work:


#### **4. Proof of Theorem 2**

**Proof of Theorem 2.** We first treat the case:

$$f = \sup\_{1 \le \alpha \le N} g\_{\alpha}, \qquad g\_{\alpha} : [0, 1]^d \to \mathbb{R} \quad \text{affine},$$

when *f* is convex. We seek to show that *f* can be exactly represented by a ReLU net with input dimension *d*, hidden layer width *d* + 1, and depth *N*. Our proof relies on the following observation.

**Lemma 1.** *Fix d* ≥ 1, *and let T* : R*d*+ → R *be an arbitrary function and L* : R*d* → R *be affine. Define an invertible affine transformation A* : R*d*+1 → R*d*+1 *by:*

$$A(\mathbf{x}, y) = (\mathbf{x}, L(\mathbf{x}) + y).$$

*Then, the image of the graph of T under:*

$$A \circ \text{ReLU} \circ A^{-1}$$

*is the graph of x* ↦ max{*T*(*x*), *L*(*x*)}, *viewed as a function on* R*d*+.

**Proof.** We have *A*−1(*x*, *y*) = (*x*, −*L*(*x*) + *y*). Hence, for each *x* ∈ R*d*+, we have:

$$A \circ \operatorname{ReLU} \circ A^{-1}(\mathbf{x}, T(\mathbf{x})) = \left(\mathbf{x}, \left(T(\mathbf{x}) - L(\mathbf{x})\right) \mathbf{1}\_{\{T(\mathbf{x}) - L(\mathbf{x}) > 0\}} + L(\mathbf{x})\right) = \left(\mathbf{x}, \max\{T(\mathbf{x}), L(\mathbf{x})\}\right).$$
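Lemma 1 is also easy to check numerically. The sketch below (with arbitrary choices of *T* and *L* that are our own, not taken from the paper) verifies the conclusion at a few sample points of R2+:

```python
# Numeric check of Lemma 1: with A(x, y) = (x, L(x) + y), the map
# A∘ReLU∘A^{-1} sends the point (x, T(x)) to (x, max{T(x), L(x)})
# whenever the coordinates of x are non-negative.

def relu(v):
    return [max(0.0, t) for t in v]

def lemma1_step(L, point):
    *x, y = point
    pre = x + [y - L(x)]                 # A^{-1}(x, y) = (x, y - L(x))
    post = relu(pre)                     # ReLU fixes x since x >= 0
    return post[:-1] + [L(post[:-1]) + post[-1]]   # A(x, y) = (x, L(x) + y)

T = lambda x: x[0] ** 2 - 0.5            # arbitrary function on R^2_+
L = lambda x: x[0] + 2 * x[1] - 1.0      # arbitrary affine map

for x in ([0.1, 0.2], [0.9, 0.4], [0.0, 1.0]):
    out = lemma1_step(L, x + [T(x)])
    assert out[:2] == x
    assert abs(out[2] - max(T(x), L(x))) < 1e-12
```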

We now construct a neural net that computes *f*. Define invertible affine functions *Aα* : R*d*+1 → R*d*+1 by:

$$A\_{\alpha}(\mathbf{x}, x\_{d+1}) := (\mathbf{x}, g\_{\alpha}(\mathbf{x}) + x\_{d+1}), \qquad \mathbf{x} = (x\_1, \dots, x\_d),$$

and set:

$$H\_{\alpha} := A\_{\alpha} \circ \operatorname{ReLU} \circ A\_{\alpha}^{-1}.$$

Further, define:

$$H\_{\text{out}} := \operatorname{ReLU} \circ \langle e\_{d+1}, \cdot \rangle, \tag{13}$$

where *ed*+1 is the (*d* + 1)th standard basis vector, so that ⟨*ed*+1, ·⟩ is the linear map from R*d*+1 to R that maps (*x*1, ..., *xd*+1) to *xd*+1. Finally, set:

$$H\_{\rm in} := \operatorname{ReLU} \circ (\operatorname{id}, 0) \,,$$

where (id, 0)(*x*) = (*x*, 0) maps [0, 1]*d* to the graph of the zero function. Note that the ReLU in this initial layer acts as the identity, since its inputs are non-negative. With this notation, repeatedly using Lemma 1, we find that the composition:

$$H\_{\text{out}} \circ H\_N \circ \cdots \circ H\_1 \circ H\_{\text{in}}$$

therefore has input dimension *d*, hidden layer width *d* + 1, depth *N*, and computes *f* exactly.
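This construction is concrete enough to simulate directly. The sketch below (with three arbitrary affine pieces of our own choosing) folds in one *gα* per layer, exactly as Lemma 1 prescribes:

```python
# Sketch of the width-(d+1), depth-N net H_out∘H_N∘···∘H_1∘H_in for a
# convex f = max_α g_α, with d = 2 and N = 3 illustrative pieces.

def relu(v):
    return [max(0.0, t) for t in v]

def convex_net(gs, x):
    state = list(x) + [0.0]                 # H_in: x ↦ (x, 0)
    for g in gs:                            # H_α = A_α ∘ ReLU ∘ A_α^{-1}
        *xs, y = state
        y = relu([y - g(xs)])[0] + g(xs)    # last coordinate ← max{y, g_α(x)}
        state = relu(xs) + [y]              # x ≥ 0, so relu(xs) = xs
    return relu([state[-1]])[0]             # H_out: ReLU(⟨e_{d+1}, ·⟩)

gs = [lambda x: 0.5,                        # f = max of three affine pieces,
      lambda x: x[0] + 0.1,                 # positive on [0, 1]^2
      lambda x: 0.3 * x[0] + 0.7 * x[1] + 0.2]

for x in ([0.0, 0.0], [1.0, 0.0], [0.2, 0.9]):
    assert abs(convex_net(gs, x) - max(g(x) for g in gs)) < 1e-12
```

Note that starting from the graph of the zero function means the net actually computes max{0, *g*1, ..., *gN*}, which agrees with *f* because *f* is positive.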

Next, consider the general case when *f* is given by:

$$f = g - h, \qquad g = \sup\_{1 \le \alpha \le N} g\_{\alpha}, \qquad h = \sup\_{1 \le \beta \le M} h\_{\beta}$$

as in (11). For this situation, we use a different way of computing the maximum using ReLU nets.

**Lemma 2.** *There exists a* ReLU *net* M *with input dimension* 2, *hidden layer width* 2*, output dimension* 1*, and depth* 2 *such that:*

$$\mathcal{M}\left(x, y\right) = \max\{x, y\}, \qquad x \in \mathbb{R},\ y \in \mathbb{R}\_+.$$

**Proof.** Set *A*1(*x*, *y*) := (*x* − *y*, *y*), *A*2(*z*, *w*) := *z* + *w*, and define:

$$\mathcal{M} = \operatorname{ReLU} \circ A\_2 \circ \operatorname{ReLU} \circ A\_1.$$

We have for each *y* ≥ 0, *x* ∈ R:

$$f\_{\mathcal{M}}(x, y) = \operatorname{ReLU}\left((x - y)\mathbf{1}\_{\{x - y > 0\}} + y\right) = \max\{x, y\},$$

as desired.
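This two-layer gadget can be tested directly (a sketch with our own naming):

```python
# Numeric check of Lemma 2: the depth-2, width-2 net ReLU∘A_2∘ReLU∘A_1,
# with A_1(x, y) = (x - y, y) and A_2(z, w) = z + w, computes
# max{x, y} whenever y >= 0.

def relu(v):
    return [max(0.0, t) for t in v]

def M(x, y):
    z, w = relu([x - y, y])        # ReLU ∘ A_1
    return relu([z + w])[0]        # ReLU ∘ A_2

for x, y in [(-1.5, 0.0), (0.25, 1.0), (2.0, 0.5)]:
    assert M(x, y) == max(x, y)
```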

We now describe how to construct a ReLU net N with input dimension *d*, hidden layer width *d* + 3, output dimension 1, and depth 2(*M* + *N*) that exactly computes *f*. We use width *d* to copy the input *x*, width 2 to compute successive maxima of the affine functions *gα*, *hβ* using the net M from Lemma 2 above, and width 1 as memory in which we store *g* = sup*α gα* while computing *h* = sup*β hβ*. The final layer computes the difference *f* = *g* − *h*.

#### **5. Proof of Theorem 1**

**Proof of Theorem 1.** We begin by showing (8) and (9). Suppose *f* : [0, 1]*d* → R+ is convex, and fix *ε* > 0. A simple discretization argument shows that there exists a piecewise affine convex function *g* : [0, 1]*d* → R+ such that ‖*f* − *g*‖*C*0 ≤ *ε*. By Theorem 2, *g* can be exactly represented by a ReLU net with hidden layer width *d* + 1. This proves (8). In the case that *f* is Lipschitz, we use the following, a special case of Lemma 4.1 in [15].

**Proposition 1.** *Suppose f* : [0, 1]*d* → R *is convex and Lipschitz with Lipschitz constant L. Then, for every k* ≥ 1*, there exist k affine maps Aj* : [0, 1]*d* → R *such that:*

$$\left\| f - \sup\_{1 \le j \le k} A\_j \right\|\_{C^0} \le 72L \, d^{3/2} k^{-2/d}.$$
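The rate in Proposition 1 can be seen concretely in *d* = 1. The sketch below (our own illustration, not the estimator of [15]) approximates the convex function *f*(*x*) = *x*2 on [0, 1] by the maximum of *k* tangent lines and checks that the sup error roughly quarters when *k* doubles, consistent with *k*−2/*d*:

```python
# Max-affine approximation of a convex function by k tangent lines.
# For f(x) = x^2 the sup error of the best k-tangent approximant
# scales like k^{-2}, i.e. k^{-2/d} with d = 1.

def max_affine_error(f, df, k, grid=10_000):
    anchors = [(i + 0.5) / k for i in range(k)]    # tangent points
    def approx(x):                                 # max of k tangent lines
        return max(f(a) + df(a) * (x - a) for a in anchors)
    return max(abs(f(i / grid) - approx(i / grid)) for i in range(grid + 1))

f, df = lambda x: x * x, lambda x: 2 * x
e4, e8 = max_affine_error(f, df, 4), max_affine_error(f, df, 8)
assert e8 < e4 / 3        # error roughly quarters when k doubles
```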

Combining this result with Theorem 2 proves (9). We turn to checking (5) and (10). We need the following observation, which seems to be well known but does not appear to be written down in the literature.

**Lemma 3.** *Let* N *be a* ReLU *net with input dimension d*, *a single hidden layer of width n*, *and output dimension* 1. *There exists another* ReLU *net* Ñ *that computes the same function as* N, *but has input dimension d and n* + 2 *hidden layers with width d* + 2.

**Proof.** Denote by {*Aj*}, *j* = 1, ..., *n*, the affine functions computed by each neuron in the hidden layer of N, so that:

$$f\_{\mathcal{N}}(\mathbf{x}) = \text{ReLU}\left(b + \sum\_{j=1}^{n} c\_j \text{ReLU}(A\_j(\mathbf{x}))\right).$$

Let *T* > 0 be sufficiently large so that:

$$T + \sum\_{j=1}^{k} c\_j \operatorname{ReLU}(A\_j(\mathbf{x})) > 0, \qquad \forall\, 1 \le k \le n,\ \mathbf{x} \in [0, 1]^d.$$


The affine transformations Ã*j* computed by the *j*th hidden layer of Ñ are then:

$$\tilde{A}\_1(\mathbf{x}) := (\mathbf{x}, A\_1(\mathbf{x}), T) \qquad \text{and} \qquad \tilde{A}\_{n+2}(\mathbf{x}, y, z) := z - T + b, \qquad \mathbf{x} \in \mathbb{R}^d,\ y, z \in \mathbb{R},$$

and:

$$\tilde{A}\_j(\mathbf{x}, y, z) = \left(\mathbf{x}, A\_j(\mathbf{x}), z + c\_{j-1}y\right), \qquad j = 2, \dots, n+1.$$

We are essentially using width *d* to copy the input variable, width 1 to compute each *Aj*, and width 1 to store the running partial sum of the output.
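The conversion in Lemma 3 can be sketched and tested directly (the weights below are random examples of our own; *T* is any constant dominating the partial sums):

```python
# Sketch of Lemma 3: a net with one hidden layer of width n rewritten
# as a deep net of width d + 2 that copies the input (width d),
# evaluates one A_j per layer (width 1), and accumulates the output
# (width 1). Inputs lie in [0, 1]^d, so copying through ReLU is exact.

import random

def relu(t):
    return max(0.0, t)

random.seed(0)
d, n = 2, 5
A = [([random.uniform(-1, 1) for _ in range(d)], random.uniform(-1, 1))
     for _ in range(n)]                       # A_j(x) = <w_j, x> + b_j
c = [random.uniform(-1, 1) for _ in range(n)]
b = 0.25

def aff(j, x):
    w, b_j = A[j]
    return sum(wi * xi for wi, xi in zip(w, x)) + b_j

def shallow(x):                               # one hidden layer of width n
    return relu(b + sum(c[j] * relu(aff(j, x)) for j in range(n)))

def deep(x, T=100.0):                         # width d + 2, depth n + 2
    # T is large enough that the accumulator stays positive, so the
    # ReLUs act as the identity on that coordinate.
    y, z = relu(aff(0, x)), T                 # layer 1: (x, A_1(x), T)
    for j in range(1, n):                     # layers 2, ..., n
        y, z = relu(aff(j, x)), relu(z + c[j - 1] * y)
    z = relu(z + c[n - 1] * y)                # layer n + 1
    return relu(z - T + b)                    # layer n + 2: z - T + b

for _ in range(5):
    x = [random.random() for _ in range(d)]
    assert abs(shallow(x) - deep(x)) < 1e-9
```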

Recall that positive continuous functions can be arbitrarily well approximated by smooth functions and hence by ReLU nets with a single hidden layer (see, e.g., Theorem 3.1 in [5]). The relation (5) therefore follows from Lemma 3. Similarly, by Theorem 3.1 in [5], if *f* is smooth, then there exists *K* = *K*(*d*) > 0 and a constant *Cf* depending only on the maximum value of the first *K* derivatives of *f* such that:

$$\inf\_{\mathcal{N}} \|f - f\_{\mathcal{N}}\|\_{C^0} \le C\_f\, n^{-1/d},$$

where the infimum is over ReLU nets N with a single hidden layer of width *n*. Combining this with Lemma 3 proves (10).

It remains to prove (6) and (7). To do this, fix a positive continuous function *f* : [0, 1]*d* → R+ with modulus of continuity *ωf*. Recall that the volume of the unit *d*-simplex is 1/*d*!, and fix *ε* > 0. Consider the partition:

$$[0,1]^d = \bigcup\_{j=1}^{d!/\omega\_f(\mathfrak{e})^d} \mathcal{P}\_j$$

of [0, 1]*d* into *d*!/*ωf*(*ε*)*d* copies of *ωf*(*ε*) times the standard *d*-simplex. Here, each P*j* denotes a single scaled copy of the unit simplex. To create this partition, we first subdivide [0, 1]*d* into at most *ωf*(*ε*)−*d* cubes of side length at most *ωf*(*ε*). Then, we subdivide each such smaller cube into *d*! copies of the standard simplex (which has volume 1/*d*!) rescaled to have side length *ωf*(*ε*). Define *fε* to be a piecewise linear approximation to *f* obtained by setting *fε* equal to *f* on the vertices of the P*j*'s and taking *fε* to be affine on their interiors. Since the diameter of each P*j* is *ωf*(*ε*), we have:

$$\|f - f\_{\varepsilon}\|\_{C^{0}} \le \varepsilon.$$
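The interpolation bound behind this step can be checked numerically in one dimension (a sketch; the choice *f*(*x*) = √*x*, whose modulus of continuity is *ωf*(*h*) = √*h*, is our own illustration): interpolating *f* at the vertices of a mesh of width *h* gives sup error at most *ωf*(*h*).

```python
# 1D sketch of the interpolation step: the piecewise-linear
# interpolant of f on a mesh of width h has sup error <= ω_f(h).

import math

def interp_error(f, h, grid=10_000):
    n = round(1 / h)
    knots = [i * h for i in range(n + 1)]
    def fe(x):                      # piecewise-linear interpolant of f
        i = min(int(x / h), n - 1)
        a, b = knots[i], knots[i + 1]
        t = (x - a) / (b - a)
        return (1 - t) * f(a) + t * f(b)
    return max(abs(f(i / grid) - fe(i / grid)) for i in range(grid + 1))

f = lambda x: math.sqrt(x)          # modulus of continuity: sqrt(h)
for h in (0.1, 0.01):
    assert interp_error(f, h) <= math.sqrt(h)   # error <= ω_f(mesh size)
```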

Next, since *fε* is a piecewise affine function, by Theorem 2.1 in [2] (see Theorem 2), we may write:

$$f\_{\varepsilon} = g\_{\varepsilon} - h\_{\varepsilon},$$

where *gε*, *hε* are convex, positive, and piecewise affine. Applying Theorem 2 completes the proof of (6) and (7).
