The Universal Approximation Property

Abstract The universal approximation property of various machine learning models is currently only understood on a case-by-case basis, limiting the rapid development of new theoretically justified neural network architectures and blurring our understanding of our current models' potential. This paper works towards overcoming these challenges by presenting a characterization, a representation, a construction method, and an existence result, each of which applies to any universal approximator on most function spaces of practical interest. Our characterization result is used to describe which activation functions allow the feed-forward architecture to maintain its universal approximation capabilities when multiple constraints are imposed on its final layers and its remaining layers are only sparsely connected. These include a rescaled and shifted Leaky ReLU activation function but not the ReLU activation function. Our construction and representation results are used to exhibit a simple modification of the feed-forward architecture, which can approximate any continuous function with non-pathological growth, uniformly on the entire Euclidean input space. This improves the known capabilities of the feed-forward architecture.

Keywords Universal approximation · Constrained approximation · Uniform approximation · Deep learning · Topological transitivity · Composition operators

Mathematics Subject Classification (2010) 68T07 · 47B33 · 47A16 · 68T05 · 30L05 · 46M40

Anastasis Kratsios (anastasis.kratsios@math.ethz.ch), Eidgenössische Technische Hochschule Zürich, Rämistrasse 101, CH-8092 Zürich, Switzerland

1 Introduction

Neural networks have their organic origins in [1] and in [2], wherein the authors pioneered a method for emulating the behavior of the human brain using digital computing. Their mathematical roots are traced back to Hilbert's 13th problem, which postulated that all high-dimensional continuous functions are a combination of univariate continuous functions.

Arguably the second major wave of innovation in the theory of neural networks happened following the universal approximation theorems of [3, 4] and of [5], which merged these two seemingly unrelated problems by demonstrating that the feed-forward architecture is capable of approximating any continuous function between any two Euclidean spaces, uniformly on compacts. This series of papers initiated the theoretical justification of the empirically observed performance of neural networks, which had up until that point only been justified by analogy with the Kolmogorov-Arnold Representation Theorem of [6].

Since then, the universal approximation capabilities of a limited number of neural network architectures, such as the feed-forward, residual, and convolutional neural networks, have been solidified as a cornerstone of their approximation success. This, coupled with numerous hardware advances, has led neural networks to find ubiquitous use in a number of areas, ranging from biology, see [7, 8], to computer vision and imaging, see [9, 10], and to mathematical finance, see [11-15]. As a result, a variety of neural network architectures have emerged, with the common thread between them being that they describe an algorithmically generated set of complicated functions built by combining elementary functions in some manner.
However, the case-by-case basis on which the universal approximation property is currently understood limits the rapid development of new theoretically justified architectures. This paper works towards overcoming this challenge by directly studying the universal approximation property itself, in the form of far-reaching characterizations, representations, construction methods, and existence results applicable to most situations encountered in practice.

The paper's contributions are organized as follows. Section 2 overviews the analytic, topological, and learning-theoretic background required in formulating the paper's results.

Section 3 contains the paper's main results. These include a characterization, a representation result, a construction theorem, and an existence result applicable to any universal approximator on most function spaces of practical interest. The characterization result shows that an architecture has the UAP on a function space if and only if that architecture implicitly decomposes the function space into a collection of separable Banach subspaces, whereon the architecture contains the orbit of a topologically transitive dynamical system. Next, the representation result shows that any universal approximator can always be approximately realized as a transformation of the feed-forward architecture. This result reduces the problem of constructing new universal architectures to identifying the correct transformation of the feed-forward architecture for the given learning task. The construction result gives conditions on a set of transformations of the feed-forward architecture guaranteeing that the resulting architecture is a universal approximator on the target function space. Lastly, we obtain a general existence and representation result for universal approximators generated by a small number of functions, applicable to many function spaces.

Section 4 then specializes the main theoretical results to the feed-forward architecture. Our characterization result is used to exhibit a dynamical system representation on the space of continuous functions, obtained by composing any function with an additional deep feed-forward layer whose activation function is continuous, injective, and has no fixed points. Using this representation, we show that the set of all deep feed-forward networks constructed through this dynamical system maintains its universal approximation property even when constraints are imposed on the network's final layers or when sparsity is imposed on the connections of the network's initial layers. In particular, we show that feed-forward networks with the ReLU activation function fail these requirements, but that a simple affine transformation of the Leaky-ReLU activation function is of this type. We provide a simple and explicit method for modifying most commonly used activation functions into this form. We also show that the conditions on the activation function are sharp for this dynamical system representation to have the desired topological transitivity properties.

As an application of our construction and representation results, we build a modification of the feed-forward architecture which can uniformly approximate a large class of continuous functions which need not vanish at infinity. This architecture approximates uniformly on the entire input space and not only on compact subsets thereof.
This refines the known guarantees for feed-forward networks (see [16, 17]), which only guarantee uniform approximation on compact subsets of the input space and, consequently, uniform approximation on the entire input space only for functions vanishing at infinity. As a final application of the results, the existence theorem is then used to provide a representation of a small universal approximator on L^∞(R), which provides the first concrete step towards obtaining a tractable universal approximator thereon.

2 Background and preliminaries

This section overviews the analytic, topological, and learning-theoretic background used in this paper.

2.1 Metric spaces

Typically, two points x, y ∈ R^m are thought of as being near to one another if y belongs to the open ball with radius δ > 0 centered about x, defined by Ball_{R^m}(x, δ) := {z ∈ R^m : ‖x − z‖ < δ}, where d(x, z) := ‖x − z‖ denotes the Euclidean distance function. The analogue can be said if we replace R^m by a set X on which there is a distance function d_X : X × X → [0, ∞) quantifying the closeness of any two members of X. Many familiar properties of the Euclidean distance function are axiomatically required of d_X in order to maintain many of the useful analytic properties of R^m; namely, d_X is required to satisfy the triangle inequality, to be symmetric in its arguments, and to vanish precisely when its arguments are identical. As before, two points x, y ∈ X are thought of as being close if they belong to the same open ball Ball_X(x, δ) := {z ∈ X : d_X(x, z) < δ}, where δ > 0. Together, the pair (X, d_X) is called a metric space, and this simple structure can be used to describe many familiar constructions prevalent throughout learning theory. We follow the convention of only denoting (X, d_X) by X whenever the context is clear.

Example 1 (Spaces of Continuous Functions) For instance, the universal approximation theorems of [16-19] describe conditions under which any continuous function from R^m to R^n can be approximated by a feed-forward neural network. The distance function used to formulate their approximation results is defined on any two continuous functions f, g : R^m → R^n via

d_{ucc}(f,g) := \sum_{k=1}^{\infty} 2^{-k} \frac{\sup_{x \in [-k,k]^m} \|f(x)-g(x)\|}{1+\sup_{x \in [-k,k]^m} \|f(x)-g(x)\|}.

In this way, the set of continuous functions from R^m to R^n, denoted by C(R^m, R^n), is made into a metric space when paired with d_{ucc}. In what follows, we make the convention of denoting C(X, R) by C(X).

Example 2 (Space of Integrable Functions) Not all functions encountered in practice are continuous, and the approximation of discontinuous functions by deep feed-forward networks is studied in [20, 21] for functions belonging to the space L^p_μ(R^m, R^n). Briefly, elements of L^p_μ(R^m, R^n) are equivalence classes of Borel measurable f : R^m → R^n, identified up to μ-null sets, for which the norm

\|f\|_{p,\mu} := \Big( \int_{x \in \mathbb{R}^m} \|f(x)\|^p \, d\mu(x) \Big)^{1/p}

is finite; here μ is a fixed Borel measure on R^m and 1 ≤ p < ∞. We follow the convention of denoting L^p_μ(R^m, R) by L^p(R^m) when μ is the Lebesgue measure on R^m.

Unlike C(R^m, R^n), the distance function on L^p_μ(R^m, R^n) is induced through a norm via d(f, g) := ‖f − g‖_{p,μ}. Spaces of this type simultaneously carry compatible metric and vector space structures. Moreover, if every sequence in such a space whose pairwise distances asymptotically tend to zero converges, then the space is called a Banach space. The prototypical Banach space is R^m. Unlike Banach spaces or the space of Example 1, general metric spaces are non-linear; before elaborating on this point, we illustrate the two distance functions above numerically.
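As a concrete illustration (not part of the paper), the following sketch numerically approximates both distance functions just introduced: the metric d_ucc of Example 1, by truncating its series and estimating each supremum on a finite grid, and the L^p distance of Example 2 for the Lebesgue measure restricted to an interval, by a Riemann sum. The truncation depth, grid sizes, and test functions are illustrative choices.

```python
import itertools
import numpy as np

def d_ucc(f, g, m=1, K=8, grid=101):
    """Approximate the metric of Example 1 by truncating the series at K cubes
    [-k, k]^m and estimating each supremum on a uniform grid."""
    total = 0.0
    for k in range(1, K + 1):
        axes = [np.linspace(-k, k, grid)] * m
        sup = 0.0
        for point in itertools.product(*axes):
            x = np.asarray(point)
            sup = max(sup, float(np.linalg.norm(f(x) - g(x))))
        total += 2.0 ** (-k) * sup / (1.0 + sup)
    return total

def lp_distance(f, g, p=1, lo=-10.0, hi=10.0, grid=100001):
    """Approximate ||f - g||_{p, mu} of Example 2 for mu = Lebesgue measure on
    an interval, via a Riemann sum (a crude stand-in for the integral)."""
    x = np.linspace(lo, hi, grid)
    dx = (hi - lo) / (grid - 1)
    return (np.sum(np.abs(f(x) - g(x)) ** p) * dx) ** (1.0 / p)

if __name__ == "__main__":
    f = lambda x: np.sin(x)          # a continuous function, viewed as a map R -> R
    g = lambda x: np.zeros_like(x)   # the zero function
    print("d_ucc(f, g)  ~", d_ucc(f, g))
    print("||f - g||_1 ~", lp_distance(f, g))
```

Note that each summand of d_ucc is bounded by 2^{-k}, so the truncation error is controlled by the discarded tail of the geometric series.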
By non-linear we mean that there is no meaningful notion of addition or scaling, and there is no singular reference point analogous to the 0 vector. Examples of non-linear metric spaces arising in machine learning are shape spaces used in neuroimaging applications (see [22]), graphs and trees arising in structured and hierarchical learning (see [23, 24]), and spaces of probability measures appearing in adversarial approaches to learning (see [25]).

The lack of a reference point may always be overcome by artificially declaring a fixed element of X, denoted by 0_X, to be the central point of reference in X. In this case, the triple (X, d_X, 0_X) is called a pointed metric space. We follow the convention of denoting the triple by X whenever the context is clear. For pointed metric spaces X and Y, the class of functions f : X → Y satisfying f(0_X) = 0_Y and d_Y(f(x_1), f(x_2)) ≤ L d_X(x_1, x_2), for some L > 0 and every x_1, x_2 ∈ X, is denoted by Lip_0(X, Y), and this class is understood as mapping the structure of X into Y without too large a distortion. In the extreme case where an f ∈ Lip_0(X, Y) perfectly respects the structure of X, i.e. when d_Y(f(x_1), f(x_2)) = d_X(x_1, x_2), we call f a pointed isometry. In this case, f(X) represents an exact copy of X within Y.

The remaining non-linear aspects of a general metric space pose no significant challenge, and this is due to the following linearization feature map of [26]. Since its inception, this method has found notable applications in clustering [27] and in optimal transport [28]. In particular, the latter connects this linearization procedure with the optimal transport approaches to adversarial learning of [29, 30].

Example 3 (Free-Space over X) We follow the formulation described in [28]. Let X be a metric space and, for any x ∈ X, let δ_x be the (Borel) probability measure assigning value 1 to any Ball_X ⊆ X containing x and 0 otherwise. The free-space over X is the Banach space B(X) obtained by completing the vector space \{\sum_{n=1}^{N} \alpha_n \delta_{x_n} : \alpha_n \in \mathbb{R},\, x_n \in X,\, n = 1,\dots,N,\, N \in \mathbb{N}_+\} with respect to the following norm

\Big\| \sum_{i=1}^{n} \alpha_i \delta_{x_i} \Big\|_{B(X)} := \sup_{\|f\|_{Lip} \le 1;\, f \in Lip_0(X,\mathbb{R})} \sum_{i=1}^{n} \alpha_i f(x_i).   (1)

As shown in [31, Proposition 2.1], the map δ_X : x → δ_x is a (non-linear) isometry from X to B(X). As shown in [32], the pair (B(X), δ_X) is characterized by the following linearization property: whenever f ∈ Lip_0(X, Y) and Y is a Banach space, then there exists a unique continuous linear map F : B(X) → Y satisfying

f = F ◦ δ.   (2)

Thus, δ : X → B(X) can be interpreted as a minimal isometric linearizing feature map.

Sometimes the feature map δ can be continuously inverted from the left. In [31], any continuous map ρ : B(X) → X is called a barycenter if it satisfies ρ ◦ δ = 1_X, where 1_X is the identity on X. Following [31], if a barycenter exists then X is called barycentric. Examples of barycentric spaces are Banach spaces [33], Cartan-Hadamard manifolds (see [34, Corollary 6.9.1]), and other structures described in [35]. Accordingly, many function spaces of potential interest contain a dense barycentric subspace. When the context is clear, we follow the convention of denoting δ_X simply by δ.

2.2 Topological Background

Rather than using open balls to quantify closeness, it is often more convenient to work with open subsets of X, where U ⊆ X is said to be open whenever every point x ∈ U belongs to some open ball Ball_X(x, δ) contained in U.
This is because open sets have many desirable properties; for example, a convergent sequence contained in the complement of an open set must also have its limit in that open set’s complement. Thus, the complement of open sets are often called closed sets since their limits cannot escape them. Unfortunately, many familiar situations arising in approximation theory cannot be described by a distance function. For example, there is no distance function describing the point-wise convergence of a sequence of functions {f } on R to any other such func- n n∈N tion f (for details [36, page 362]). In these cases, it is more convenient to work directly with topologies. A topology τ is a collection of subsets of a given set X whose members are declared as being open if τ satisfies certain algebraic conditions emulating the basic prop- erties of the typical open subsets of R (see [37, Chapter 2]). Explicitly, we require that τ contain the empty set ∅ as well as the entire space X, we require that the arbitrary union of subsets of X belonging to τ also belongs to τ , and we require that finite intersections of subsets of X belonging to τ also be a member of τ.A topological space isapairofaset X and a topology τ thereon. We follow the convention of denoting topological spaces with the same symbol as their underlying set. Most universal approximation theorems [4, 16, 17] guarantee that a particular subset of m n C(R , R ) is dense therein. In general, A ⊆ X is dense if the smallest closed subset of X containing A is X itself. Topological spaces containing a dense subset which can be put in a 1-1 correspondence with the natural numbers N is called a separable space. Many familiar m m spaces are separable, such as C(R ) and R . m n A function f : R → R is thought of as continuously depending on its inputs if small variations in its inputs can only produce small variations in its outputs; that is, for any x ∈ m −1 R 0 there exists some δ> 0 such that f [Ball n ] ⊆ Ball m (x, δ). It can R R −1 be shown, see [37], that this condition is equivalent to requiring that the pre-image f [U ] n m of any open subset U of R is open in R . This reformulation means that open sets are preserved under the inverse-image of continuous functions, and it lends itself more readily to abstraction. Thus, a function f : X → Y between arbitrary topological spaces X and Y is −1 continuous if f [U ] is open in X whenever U is open in Y .If f is a continuous bijection −1 and its inverse function f : Y → X is continuous, then f is called a homeomorphism 440 A. Kratsios and X and Y are thought of as being topologically identical. If f is a homeomorphism onto its image, f is an embedding. We illustrate the use of homeomorphisms with a learning theoretic example. Many learn- ing problems encountered empirically benefit from feature maps modifying the input a of learning model; for example, this is often the case with kernel methods (see [38–40]), in reservoir computing (see [41, 42]), and in geometric deep learning (see [23, 43]). Recently, in [44], it was shown that, a feature map φ : X → R is continuous and injective if and only if the set of all functions f ◦ φ ∈ C(X),where f ∈ C(R ) is a deep feed-forward net- work with ReLU activation, is dense in C(X). A key factor in this characterization is that the map Φ : C(R ) → C(X),given by f → f ◦ φ, is an embedding if φ is continuous and injective. 
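The characterization from [44] described above can be illustrated numerically: pre-composing networks defined on R^d with a continuous injective feature map φ transports their approximation power from C(R^d) to C(X). In the sketch below, X is the circle parametrized by [0, 2π), φ(θ) = (cos θ, sin θ), and a random-feature least-squares fit stands in for a trained feed-forward network; all of these concrete choices are illustrative assumptions rather than constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(theta):
    """Continuous injective feature map from the circle [0, 2*pi) into R^2."""
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def random_feature_net(X, Y, width=200):
    """A stand-in for a trained shallow network on R^2: a random hidden layer
    followed by a least-squares output layer."""
    W = rng.normal(size=(X.shape[1], width))
    b = rng.normal(size=width)
    H = np.tanh(X @ W + b)                       # hidden features
    coef, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return lambda Z: np.tanh(Z @ W + b) @ coef

# Target: a continuous function on the circle, written in the angle coordinate.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
target = np.sin(3.0 * theta) + 0.5 * np.cos(theta)

f_hat = random_feature_net(phi(theta), target)   # a network defined on R^2
approx = f_hat(phi(theta))                       # f_hat o phi, a function on the circle
print("max error on the circle:", np.abs(approx - target).max())
```

A small maximum error here is exactly the phenomenon the cited characterization formalizes: the family {f ∘ φ}, with f ranging over networks on R^2, is dense in C(X) because φ is continuous and injective.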
The preceding example suggests that our study of an architecture's approximation capabilities is valid on any topological space which can be mapped homeomorphically onto a well-behaved topological space. For us, a space will be well-behaved if it belongs to the broad class of Fréchet spaces. Briefly, these spaces have compatible topological and vector space structures, meaning that the basic vector space operations, such as addition, inversion, and scalar multiplication, are continuous; furthermore, their topology is induced by a complete distance function which is invariant under translation and satisfies an additional technical condition described in [45, Section 3.7]. The class of Fréchet spaces encompasses all Hilbert and Banach spaces, and they share many familiar properties with R^m. Relevant examples of Fréchet spaces are C(R^m, R^n), the free-space B(X) over any pointed metric space, and L^1_μ(R^m, R^n).

2.3 Universal approximation background

In the machine learning literature, universal approximation refers to a model class's ability to generically approximate any member of a large topological space whose elements are functions or, more rigorously, equivalence classes of functions. Accordingly, in this paper, we focus on a class of topological spaces which we call function spaces. In this paper, a function space 𝒳 is a topological space whose elements are equivalence classes of functions between two sets X and Y. For example, when X = R = Y, then 𝒳 may be C(R) or L^1(R). We refer to 𝒳 as a function space between X and Y, and we omit the dependence on X and Y if it is clear from the context. The elements of 𝒳 are called functions, whereas functions between sets are referred to as set-functions. By a partial function f : X → Y we mean a binary relation between the sets X and Y which attributes at most one output in Y to each input in X.

Notational Conventions The following notational conventions are maintained throughout this paper. Only non-empty outputs of any partial function f are specified. We denote the set of positive integers by N_+. We set N := N_+ ∪ {0}. For any n ∈ N_+, the n-fold Cartesian product of a set A with itself is denoted by A^n. For n ∈ N, we denote the n-fold composition of a function φ : X → X with itself by φ^n, and the 0-fold composition φ^0 is defined to be the identity map on X.

Definition 1 (Architecture) Let 𝒳 be a function space. An architecture on 𝒳 is a pair (F, ρ) of a set of set-functions F between (possibly different) sets and a partial function ρ : ⋃_{J ∈ N_+} F^J → 𝒳, satisfying the following non-triviality condition: there exist some f ∈ 𝒳, J ∈ N_+, and f_1, ..., f_J ∈ F satisfying

f = ρ((f_j)_{j=1}^{J}) ∈ 𝒳.   (3)

The set of all functions f in 𝒳 for which there are some J ∈ N_+ and some f_1, ..., f_J ∈ F satisfying the representation (3) is denoted by NN^{(F,ρ)}.

Many familiar structures in machine learning, such as convolutional neural networks, trees, radial basis functions, and various other structures, can be formulated as architectures. To fix notation and to illustrate the scope of our results, we express some familiar machine learning models in the language of Definition 1; a minimal computational rendering of Definition 1 is sketched first, before the examples.
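The following minimal Python sketch records the bookkeeping of Definition 1: an architecture is a collection of elementary set-functions together with a (possibly partial) combination rule ρ. The class name, the toy blocks, and the "pointwise sum" rule are illustrative choices made here, not constructions from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import math

@dataclass
class Architecture:
    """A pair (F, rho) in the sense of Definition 1: `blocks` plays the role of
    the set F of elementary set-functions and `rho` is a partial combination
    rule (it may raise ValueError on tuples it does not map into the space)."""
    blocks: Sequence[Callable[[float], float]]
    rho: Callable[[Sequence[Callable[[float], float]]], Callable[[float], float]]

    def realize(self, *chosen):
        """Return rho((f_1, ..., f_J)), an element of NN^(F, rho)."""
        return self.rho(chosen)

def rho(fs):
    # Toy combination rule: the pointwise sum f_1 + ... + f_J, which lies in C(R).
    if not fs:
        raise ValueError("rho is partial: it is undefined on the empty tuple")
    return lambda x: sum(f(x) for f in fs)

arch = Architecture(blocks=[math.sin, math.cos, math.exp], rho=rho)
g = arch.realize(math.sin, math.cos)   # g = sin + cos, an element of NN^(F, rho)
print(g(0.0))                          # 1.0
```

The non-triviality condition of Definition 1 is satisfied here because at least one tuple of blocks is mapped by ρ to an element of the function space.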
Example 4 (Deep Feed-Forward Networks) Fix a continuous function σ : d D R → R, denote component-wise composition by •,and letAff(R , R ) be d D m n the set of affine functions from R to R .Let X = C(R , R ), F d d i i+1 (W ,W ) : W ∈ Aff(R , R ), i = 1, 2 ,and set 2 1 1 d ,d ,d ∈N 1 2 3 ((W ,W ) ) W ◦ σ • W ◦ ··· ◦ W ◦ σ • W (4) j,2 j,1 2,J 1,J 2,1 1,1 j =1 whenever the right-hand side of (4) is well-defined. Since the composition of two affine (F , ) m functions is again affine then NN is the set of deep feed-forward networks from R to R with activation function σ . Remark 1 The construction of Example 4 parallels the formulation given in [46, 47]. How- (F , ) ever, in [47]elementsof F are referred to as neural networks and functions in NN are called their realizations. Example 5 (Trees) Let X = L (R), F {(a,b,c) : a ∈ R,b, c ∈ R,b ≤ c},andlet J F , J ( ) 1 ((a ,b ,c ) ) a I . Then, NN is the set of trees in L (R). j j j j (b ,c ) j =1 j j j =1 We are interested in architectures which can generically approximate any function on their associated function space. Paraphrasing [48, page 67], any such architecture is called a universal approximator. Definition 2 (The Universal Approximation Property) An architecture (F , ) is said to (F , ) have the universal approximation property (UAP) if NN is dense in X . 3 Main Results Our first result provides a correspondence between the apriori algebraic structure of uni- (F , ) versal approximators on X and decompositions of X into subspaces on which NN contains the orbit of a topologically generic dynamical system, which are a priori of a topo- logical nature. The interchangeability of algebraic and geometric structures is a common theme, notable examples include [49–52]. Theorem 1 (Characterization: Dynamical Systems Structure of Universal Approximators) Let X be a function space which is homeomorphic to an infinite-dimensional Frec ´ het space and let (F , ) be an architecture on X . Then, the following are equivalent: (i) (F , ) is a universal approximator, (ii) There exist subspaces {X } of X , continuous functions {φ } with φ : X → X , i i∈I i i∈I i i i (F , ) and {g } ⊆ NN such that: i i∈I (a) X is dense in X , i∈I 442 A. Kratsios (b) For each i ∈ I and every pair of non-empty open U, V ⊆ X ,thereissome N ∈ N satisfying i,U,V i,U,V φ (U ) ∩ (V ) =∅, n (F , ) (c) For every i ∈ I , g ∈ X and {φ (g )} is a dense subset of NN ∩ X , i i i n∈N i (d) For each i ∈ I , X is homeomorphic to C(R). (F , ) In particular, φ (g ) : i ∈ I, n ∈ N is dense in NN . Theorem 1 describes the structure of universal approximators, however, it does not describe an explicit means of constructing them. Nevertheless, Theorem 1 (ii.a) and (ii.d) suggest that universal approximators on most function spaces can be built by combining m n multiple, non-trivial, transformations of universal approximators on C(R , R ). This is type of transformation approach to architecture construction is common in geo- metric deep learning, whereby non-Euclidean data is mapped to the input of familiar d D architectures defined between R and R using specific feature maps and that model’s out- puts are then return to the manifold by inverting the feature map. Examples include the hyperbolic feed-forward architecture of [24], and the shape space regressors of [53], and the matrix-valued regressors of [54, 55], amongst others. This transformation procedure is a particular instance of the following general construction method, which extends [44]. 
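As an aside before stating that construction method, the realization map ρ of Example 4 can be written out directly in code: a tuple of pairs of affine maps is combined into the alternating composition W_{2,J} ∘ σ•W_{1,J} ∘ ⋯ ∘ W_{2,1} ∘ σ•W_{1,1}, with σ applied componentwise. The dimensions, random weights, and choice of activation below are placeholders for illustration only.

```python
import numpy as np

def affine(W, b):
    """An affine map x -> W x + b."""
    return lambda x: W @ x + b

def realize(layer_pairs, sigma):
    """Realization map of Example 4: given pairs (W_j1, W_j2) of affine maps,
    return W_J2 o sigma . W_J1 o ... o W_12 o sigma . W_11, where sigma is
    applied componentwise."""
    def network(x):
        for W1, W2 in layer_pairs:       # pairs applied in the order j = 1, ..., J
            x = W2(sigma(W1(x)))
        return x
    return network

rng = np.random.default_rng(1)
sigma = np.tanh                          # any fixed continuous activation

# Two blocks from F with dimensions 3 -> 5 -> 4 and 4 -> 6 -> 2, chosen arbitrarily.
pairs = [
    (affine(rng.normal(size=(5, 3)), rng.normal(size=5)),
     affine(rng.normal(size=(4, 5)), rng.normal(size=4))),
    (affine(rng.normal(size=(6, 4)), rng.normal(size=6)),
     affine(rng.normal(size=(2, 6)), rng.normal(size=2))),
]
net = realize(pairs, sigma)
print(net(np.ones(3)).shape)             # (2,): a deep feed-forward network R^3 -> R^2
```

Since the composition of consecutive affine maps is again affine, the resulting function is an ordinary deep feed-forward network. The general construction method referred to above, which the transformation procedure instantiates, is the following theorem.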
Theorem 2 (Construction: Universal Approximators by Transformation) Let n, m, ∈ N , m n X be a function space, (F , ) be a universal approximator on C(R , R ), and { } i i∈I m n be a non-empty set of continuous functions from C(R , R ) to X satisfying the following condition: m n Φ C(R , R ) is dense in X.(5) i∈I Then (F , ) has the UAP on X,where F F × I and {f ,i } Φ Φ Φ Φ j j j =1 Φ (f ) . I j j =1 The alternative approach to architecture development, subscribed to by authors such as [56–59], specifies the elementary functions F and the rule for combining them. Thus, this method explicitly specifies F and implicitly specifies . These competing approaches are in-fact equivalent since every universal approximator an approximately a transformation of the feed-forward architecture on C(R). Theorem 3 (Representation: Universal Approximators are Transformed Neural Networks) Let σ be a continuous, non-polynomial activation function, and let (F , ) denote the 0 0 architecture of Example 4. Let X be a function space which is homeomorphic to an infinite- dimensional Frec ´ het. If (F , ) has the UAP on X then, there exists a family { } of i i∈I (F , ) embeddings : C(R) → X such that for every 0, f ∈ NN there exists some (F , ) (F , ) 0 0 i ∈ I , g ∈ NN , and f ∈ NN satisfying −1 d ( (g )) and d g (f ) . X i ucc The previous two results describe the structure of universal approximators but they do not imply the existence of such architectures. Indeed, the existence of a universal approxi- mator on X can always be obtained by setting F = X and (f ) = f ;however,thisis (F , ) uninteresting since F is large, is trivial, and NN is intractable. Instead, the next The Universal Approximation Property 443 result shows that, for a broad range of function spaces, there are universal approximators for which F is a singleton, and the structure of is parameterized by any prespecified separable metric space. This description is possible by appealing to the free-space on X . Theorem 4 (Existence: Small Universal Approximators) Let X be a separable pointed met- ric space with at least two points, let X be a function space and a pointed metric space, and let X be a dense barycentric sub-space of X . Then, there exists a non-empty set I with pre-order ≤, {x } ⊆ X −{0 } there exist triples {(B ,φ )} of linear subspaces i i∈I X i i i i∈I B of B(X ), bounded linear isomorphisms : B(X) → B , and bounded linear maps i 0 i i φ : B(X) → B(X) satisfying: (i) B(X ) = B , 0 i i∈I (ii) For every i ≤ j , B ⊆ B , i j (iii) For every i ∈ I , ◦ φ (x ) is dense in B with respect to its subspace i i i n∈N i topology, (iv) The architecture F ={x } , and | : (x ,...,x ) ρ ◦ ◦ φ ◦ δ , i i∈I 1 J i x F i j whenever x = x for each j ≤ J , is a universal approximator on X . 1 j Furthermore, if X = X then the set I is a singleton and is the identity on B(X ). i 0 The rest of this paper is devoted to the concrete implications of these results in learning theory. 4 Applications The dynamical systems described by Theorem 1 (ii) can, in general, be complicated. How- ever, when (F , ) is the feed-forward architecture with certain specific activation functions then these dynamical systems explicitly describe the addition of deep layers to a shallow feed-forward network. We begin the next section by characterizing those activation function before outlining their approximation properties. 
4.1 Depth as a transitive dynamical system The impact of different activation functions on the expressiveness of neural network archi- tectures is an active research area. For example, [60] empirically studies the effect of different activation function on expressiveness and in [61] a characterization of the activa- tion functions for which shallow feed-forward networks are universal is also obtained. The next result characterizes the activation functions which produce feed-forward networks with the UAP even when no weight or bias is trained and the matrices {A } are sparse, and n=1 the final layers of the network are slightly perturbed. Fix an activation function σ : R → R.For every m × m matrix A and b ∈ R , define the associated composition operator : f → f ◦ σ • (A ·+b), with termi- A,b nology rooted in [62]. The family of composition operators { } creates depth within A,b A,b an architecture (F , ) by extending it to include any function of the form ◦· · ·◦ A ,b N N J N ((f ) ) , for some m × m matrices {A } , {b } in R , and each f ∈ F A ,b j n n j 1 1 j =1 n=1 for j = 1,...,J . In fact, many of the results only require the following smaller extension of (F , ), denoted by (F , ),where F F × N and where deep;σ deep;σ deep;σ J J J {(f ,n )} ((f ) ) , deep;σ j j j j =1 I ,b j =1 m 444 A. Kratsios and b is any fixed element of R with positive components and I is the m × m identity matrix. Theorem 5 (Characterization of Transitivity in Deep Feed-Forward Networks) Let (F , ) m n m be an architecture on C(R , R ), σ be a continuous activation function, fix any b ∈ R with strictly positive components. Then is a well-defined continuous linear map from I ,b m n C(R , R ) to itself and the following are equivalent: (i) σ is injective and has no fixed-points, (ii) Either σ(x) > x or σ(x) < x holds for every x ∈ R m n (iii) For every g ∈ (F , ) and every δ> 0, there exists some g ˜ ∈ C(R , R ) with m n d (g, g) ˜ < δ such that, for each f ∈ C(R , R ) and each 0 there is a ucc N ∈ N satisfying d ( ˜ ucc I ,b m n + (iv) For each 0 and every f, g ∈ C(R , R ) there is some N ∈ N such that U,V ˜ ˜ (g) ˜ : d (g, ˜ g) < δ ∩ f : d ( =∅. ucc ucc I ,b Remark 2 A characterization is given in Appendix B when A = I , however, this less technical formulation is sufficient for all our applications. We call an activation function transitive if it satisfies any of the conditions (i)-(ii) in Theorem 5. Example 6 The ReLU activation function σ(x) = max{0,x} does not satisfy Theorem 5 (i). Example 7 The following variant of the Leaky-ReLU activation of [63] does satisfy Theorem 5 (i) 1.1x + .1 x ≥ 0 σ(x) 0.1x + .1 x< 0. More generally, transitive activation functions also satisfying the conditions required by the central results of [17, 61] can be build via the following. Proposition 1 (Construction of Transitive Activation Functions) Let σ ˜ : R → R be a continuous and strictly increasing function satisfying σ( ˜ 0) = 0. Fix hyper-parameters 0 < α < 1, 0 <α such that α =˜ σ (0) − 1, and define 1 2 2 σ( ˜ x) + x + α : x ≥ 0 σ(x) α x + α : x< 0. 1 2 Then, σ is continuous, injective, has no fixed-points, is non-polynomial, and is continuously differentiable with non-zero derivative on infinitely many points. In particular, σ satisfies the requirements of Theorem 5. Transitive activation functions allow one to automatically conclude that m n (F , ) has the UAP on C(R , R ) if (F , ) is only a universal approximator σ ;deep σ ;deep on some non-empty open subset thereof. 
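The conditions of Theorem 5 and the examples above are easy to probe numerically. The sketch below implements the shifted Leaky-ReLU of Example 7, checks on a grid that it has no fixed points while the ReLU does, and iterates the composition operator Φ_{I_m,b} : f ↦ f ∘ σ•(· + b), which is how the extension (F_{deep;σ}, ρ_{deep;σ}) creates depth. The grid, the shift b, and the base function are illustrative choices.

```python
import numpy as np

def sigma(x):
    """Shifted Leaky-ReLU of Example 7: continuous, injective, no fixed points."""
    return np.where(x >= 0, 1.1 * x + 0.1, 0.1 * x + 0.1)

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-5, 5, 10001)
print("sigma has fixed points:", np.any(np.isclose(sigma(x), x)))   # expected: False
print("ReLU has fixed points: ", np.any(np.isclose(relu(x), x)))    # True (every x >= 0)

def compose_operator(f, b, n):
    """n-fold application of Phi_{I,b}: f -> f o sigma.(x + b), i.e. prepending
    n untrained layers sigma.(x + b) to the function f."""
    def g(x):
        for _ in range(n):
            x = sigma(x + b)
        return f(x)
    return g

f = np.sin                  # any fixed continuous function R -> R
b = 0.5                     # a fixed shift with positive components
for n in range(4):
    g = compose_operator(f, b, n)
    print(n, g(0.0))        # points along the orbit {Phi_{I,b}^n(f)} studied in Theorem 5
```

Corollary 1 below records the resulting local-to-global principle.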
The Universal Approximation Property 445 m n Corollary 1 (Local-to-Global UAP) Let X be a non-empty open subset of C(R , R ) and (F , ) be a universal approximator on X . If any of the conditions described by Lemma 3 m n (i)-(iii) hold, then (F , )[σ ; deep] is a universal approximator on C(R , R ). The function space affects which activation functions are transitive. Since most universal m n m approximation results hold in the space C(R , R ) or on L (R ), for suitable μ and p,we describe the integrable variant of transitive activation functions. 4.1.1 Integrable variants Some notation is required when expressing the integrable variants of the Theorem 5 and its consequences. Fix a σ -finite Borel measure μ on R . Unlike in the continuous case, the 1 m operators may not be well-defined or continuous from L (R ) to itself. We require A,b m m the notion of a push-forward measure by a measurable function is required. If S : R → R is Borel measurable and μ is a finite Borel measure on R , then its push-forward by S is the m −1 measure denoted by S μ and defined on Borel subsets B ⊆ R by S μ(B) μ S [B] . # # In particular, if μ is absolutely continuous with respect to the Lebesgue measure μ on R , then as discussed in [64, Chapter 2.1], S μ admits a Radon-Nikodym derivative with respect to the Lebesgue measure on R . We denote this Radon-Nikodym derivative dS μ by . A finite Borel measure μ on R is equivalent to the Lebesgue measure thereon, dμ denoted by μ if both μ and μ are absolutely continuous with one another. M M Recall that, if a function is monotone on R, then it is differentiable outside a μ -null set. We denote the μ -a.e. derivative of any such a function σ by σ . Lastly, we denote the 1 m essential supremum of any f ∈ L (R ) by f . The following Lemma is a rephrasing of [64, Corollary 2.1.2, Example 2.17]. Lemma 1 Fix a σ -finite Borel measure μ on R equivalent to the Lebesgue measure, let 1 ≤ p< ∞, b ∈ R , A be an m × m matrix, and let σ : R → R be a Borel measurable. 1 m n 1 m n Then, the composition operator : L (R ; R ) → L (R ; R ) is well-defined and A,b continuous if and only if (σ • (A ·+b)) μ is absolutely-continuous with respect to μ and d(σ • (A ·+b)) μ < ∞.(6) dμ In particular, when σ is monotone then is well-defined if and only if there exists some I ,b M> 0 such that for every x ∈ R, M ≤ σ (x + b). 1 m n 1 m n For g ∈ L (R , R ) and δ> 0, we denote the set of all functions f ∈ L (R , R ) μ μ satisfying f(x) − g(x) by Ball 1 m n (g, δ). A function is called Borel L (R ,R ) x∈R bi-measurable if both the image and pre-images of Borel sets, under that map, are again Borel sets. Corollary 2 (Transitive Activation Functions (Integrable Variant)) Let μ be a σ -finite mea- m m sure on R ,let b ∈ R with b > 0 for i = 1,...,m, and suppose that σ is injective, Borel bi-measurable, that σ(x) > x except on a Borel set of μ-measure 0, and assume that 1 m condition (6) holds. If (F , ) has the UAP on Ball(g, δ) for some f ∈ L (R ) and some 1 m (F , ) δ> 0 then, for every f ∈ L (R ) and every 0 there exists some f ∈ NN and N ∈ N such that f(x) − (f (x)) . I ,b x∈R 446 A. Kratsios We call activation functions satisfying the conditions of Corollary 2 L -transitive. The following is a sufficiency condition analogous to the characterization of Proposition 1. 
Corollary 3 (Construction of Transitive Activation Functions (Integrable Variant)) Let μ be a finite Borel measure on R which is equivalent to μ .Let σ ˜ :[0, ∞) →[0, ∞) be a surjective continuous and strictly increasing function satisfying σ( ˜ 0) = 0,let 0 <α < 1. Define the activation function σ( ˜ x) + x : x ≥ 0 σ(x) αx : x< 0. Then σ is Borel bi-measurable, σ(x) > x outside a μ -null-set, it is non-polynomial, and it is continuously differentiable with non-zero derivative for every x< 0. Different function spaces can have different transitive activation functions. By shifting the Leaky-ReLU variant of Example 7 we obtain an L -transitive activation function which fails to be transitive. Example 8 (Rescaled Leaky-ReLU is L -Transitive) The following variant of the Leaky- ReLU activation function 1.1xx ≥ 0 σ(x) 0.1xx< 0, is a continuous bijection on R with continuous inverse and therefore it is injective and bi- measurable. Since 0 is its only fixed point, then the set {σ(x) >x}={0} is of Lebesgue measure 0, and thus of μ measure 0 since μ and μ are equivalent. Hence, σ is injective, Borel bi-measurable, that σ(x) > x except on a Borel set of μ-measure 0, as required in (2). However, since 0 is a fixed point of σ then it does not meet the requirements of Theorem 5 (i). Our main interest with transitive activation functions is that they allow for refinements of classical universal approximation theorems, where a network’s last few layers satisfy constraints. This is interesting since constraints are common in most practical citations. 4.2 Deep networks with constrained final layers The requirement that the final few layers of a neural network to resemble the given function f is in effect a constraint on the network’s output possibilities. The next result shows that, if a transitive activation function is used, then a deep feed-forward network’s output layers may always be forced to approximately behave like f while maintaining that architecture’s universal approximation property. Moreover, the result holds even when the network’s initial layers are sparsely connected and have breadth less than the requirements of [17, 19]. Note that, the network’s final layers must be fully connected and are still required to satisfy the width constraints of [17]. For a matrix A (resp. vector b) the quantity A (resp. b ) 0 0 denotes the number of non-zero entries in A (resp. b). Corollary 4 (Feed-Forward Networks with Approximately Prescribed Output Behavior) m n Let f : R → R , 0, and let σ be a transitive activation function which is non-affine continuous and differentiable at-least at one point with non-zero derivative at that point. If m n there exists a continuous function f : R → R such that d (f , f )<δ, (7) ucc 0 0 The Universal Approximation Property 447 (F , ) + then there exists f ∈ NN , J, J ,J ∈ N , 0 ≤ J <J , and sets of composable 1 2 1 J 2 affine maps {W } , {W } such that f = W ◦σ •· · ·◦σ •W and the following hold: j j J 1 j =1 j =1 (i) d f,W ◦ σ •· · ·◦ σ • W <δ, ucc J J (ii) d f, f , ucc (iii) max A ≤ m, j =1,...,J 0 d d j j +1 (iv) W : R → R is such that d ≤ m + n + 2 if J <j ≤ J and d = m if j j 1 j 0 ≤ j ≤ J . If J = 0 we make the convention that W ◦ σ •· · ·◦ σ • W (x) = x. 1 J 1 Remark 3 Condition 7, for any δ> 0, whenever f is continuous. We consider an application of Corollary 4 to deep transfer learning. 
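Corollary 4 describes networks whose final layers stay within δ of a prescribed map while universality is recovered by prepending narrow, sparsely connected layers of the form σ•(x + b) built from a transitive activation. The sketch below only assembles that architectural shape; the prescribed head, the shift b, and the depth N are made-up placeholders, and actually achieving a target accuracy requires choosing them as in the corollary.

```python
import numpy as np

def sigma(x):
    """A transitive activation in the sense of Theorem 5 (shifted Leaky-ReLU)."""
    return np.where(x >= 0, 1.1 * x + 0.1, 0.1 * x + 0.1)

def prescribed_head(x, W1, b1, W2, b2):
    """The network hat{f} whose behaviour the final layers should (approximately) keep."""
    return np.tanh(np.outer(x, W1) + b1) @ W2 + b2

def constrained_network(x, head_params, b, N):
    """Corollary 4's shape: N initial layers x -> sigma(x + b), whose weight
    matrices are identities (hence sparse and untrained), followed by the
    prescribed final layers."""
    z = x
    for _ in range(N):
        z = sigma(z + b)
    return prescribed_head(z, *head_params)

rng = np.random.default_rng(3)
head = (rng.normal(size=32), rng.normal(size=32), rng.normal(size=32), 0.0)
x = np.linspace(-2, 2, 5)
for N in (0, 1, 3):
    print(N, constrained_network(x, head, b=0.5, N=N))
```

The same prepend-and-keep-the-head pattern underlies the transfer-learning application described next.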
As described in [65], deep transfer learning is the practice of transferring knowledge from a pre-trained model into a neural network architecture which is to be trained on a, possibly new, learning task. Various formalizations of this paradigm are described in [66] and the next example illustrates the commonly used approach, as outlined in [67], where one first learns a feed- m n forward network f : R → R and then uses this map to initialize the final portion of a deep feed-forward network. Here, given a neural network f , typically trained on a different learning task, we seek to find a deep feed-forward network whose final layers are arbitrarily close to f while simultaneously providing an arbitrarily precise approximation to a new learning task. Example 9 (Feed-Forward Networks with Pre-Trained Final Layers are Universal) Fix a continuous activation function σ,let N> 0 be given, let (F , ) as in Example 4, let K (F , ) be a non-empty compact subset of R ,and let f ∈ NN . Corollary 4 guarantees that there is a deep feed-forward neural network f = W ◦ σ •· · ·◦ σ • W satisfying J 1 −1 (i) sup f(x) − W ◦ σ •· · ·◦ σ • W (x) <N , J J x∈K 1 −1 (ii) sup f(x) − f (x) <N , x∈K (iii) max A ≤ m, j =1,...,J 0 d d j j +1 (iv) W : R → R is such that d ≤ m + n + 2if J <j ≤ J and d = m if j j 1 j 0 ≤ j ≤ J . The structure imposed on the architecture’s final layers can also be imposed by a set of constraints. The next result shows that, for a feed-forward network with a transitive activation function, the architecture’s output can always be made to satisfy a finite num- ber of compatible constraints. These constraints are described by a finite set of continuous N m n N functionals {F } on C(R , R ) together with a set of thresholds {C } , where each n n n=1 n=1 C > 0. Corollary 5 (Feed-Forward Networks with Constrained Final Layers are Universal) Let σ be a transitive activation function which is non-affine continuous and differentiable at- least at one point with non-zero derivative at that point, let (F , ) denote the feed-forward N m n architecture of Example 4, {F } be a set of continuous functions from C(R , R ) to n=1 N m n [0, ∞), and {C } be a set of positive real numbers. If there exists some f ∈ C(R , R ) n 0 n=1 448 A. Kratsios such that for each n = 1,...,N the following holds F (f )<C , (8) n 0 n (F , ) m n then for every f ∈ C(R , R ) and every 0,there exist f ,f ∈ NN , 1 2 J m diagonal m × m-matrices {A } and b ,...,b ∈ R satisfying: j 1 J j =1 (i) f ◦ f is well-defined, 2 1 (ii) d f, f ◦ f , ucc 2 1 −1 (iii) f ∈ F [[0,C )], 2 n n=1 n (iv) f (x) = σ • (A ·+b ) ◦ ··· ◦ σ • (A x + b ). 1 n n 1 1 Next, we show that transitive activation functions can be used to extend the currently- available approximation rates for shallow feed-forward networks to their deep counterparts. 4.3 Approximation bounds for networks with transitive activation function In [68, 69], it is shown that the set of feed-forward neural networks of breadth N ∈ N , can −1 approximate any function lying in their closed convex hull of at a rate of O (N ).These results do not incorporate the impact of depth into its estimates and the next result builds 1 m on them by incorporating that effect. As in [69], the convex-hull of a subset A ⊆ L (R ) n n is the set co (A) A α f : f ∈ A, α ∈[0, 1], α = 1 and the interior of i i i i i i=1 i=1 co (A) A, denoted int(co (A) A), is the largest open subset thereof. 
Corollary 6 (Approximation-Bounds for Deep Networks) Let μ be a finite Borel measure m 1 m on R which is equivalent to the Lebesgue measure, F ⊆ L (R ) for which int(co (A) F ) is non-empty and co (A) F ∩ int(co (A) F ) is dense therein. If σ is a continuous non- 1 m polynomial L -transitive activation function, b ∈ R have positive entries, and that (6)is satisfied, then the following hold: 1 m 1. For each f ∈ L (R ) and every n ∈ N,there issome N ∈ N such that the following bound holds d(σ •(·+b)) μ dμM N ∞ , inf α (f ) (x)−f(x) dμ(x) ≤ √ 1+ 2μ(R ) . i i I ,b n m f ∈F , α =1,α ∈[0,1] n i i i x∈R i=1 i=1 d(σ •(·+b) μ # N 2. There exists some κ> 1 such that >κ . In particular, dμ d(σ •(·+b)) μ lim =∞, dμ N →∞ ∞ n n 3. α (f ) : N, n ∈ N,f ∈ F,α ∈[0, 1], α = 1 is dense in i i i i i i=1 i=1 I ,b 1 m L (R ). Remark 4 Unlike in [69], Corollary 6(i) holds even when the function f does not lie in the closure of co (A) F . This is entirely due to the topological transitivity of the composition operator and is therefore entirely due to the depth present in the network. In particular, I ,b Corollary 6 (iii) implies that universal approximation can be achieved even if a feed-forward networks’ output weights are all constrained to satisfy α = 1and α =[0, 1] and i i i=1 even if all but the architecture’s final two layers are sparsely connected and not trainable. The Universal Approximation Property 449 To date, we have focused on the application and interpretation of Theorem 1. Next, Theorem 3 is used to modify and improve the approximation capabilities of universal approximators on C(R). 4.4 Improving the approximation capabilities of an architecture Most currently available universal approximation results for spaces of continuous functions, provide approximation guarantees for the topology of uniform convergence on compacts. Unfortunately, this is a very local form of approximation and there is no guarantee that the approximation quality holds outside a prespecified bounded set. For example, the sequence 1−(x−n) f (x) e I converges to the constant 0 function, uniformly on compacts n |x−n|≤1 while maintaining the constant error sup f (x) − 0 1. x∈R These approximation guarantees are strengthened by modifying any given universal m n approximator on C(R , R ) to obtain a universal approximator in a smaller space of continuous functions for a much finer topology. We introduce this space as follows. Let be a finite set of non-negative-valued, continuous functions ω from [0, ∞) to m n [0, ∞) for which there is some ω ∈ satisfying ω (·) = 1. Let C (R , R ) be the set 0 0 of all continuous functions whose asymptotic growth-rate is controlled by some ω ∈ ,in m n m n m n the sense that, C (R , R ) C (R , R ),where f ∈ C (R , R ) if f ω ω ω,∞ ω∈ f(x) m n < ∞. Each C (R , R ) is a special case of the weighted spaces studied in [70], ω( x )+1 m n which are Banach spaces when equipped with the norm . Accordingly, C (R , R ) ω,∞ m n is equipped with the finest topology making each C (R , R ) into a subspace. Indeed, such a topology exists by [71, Proposition 2.6]. i m n Example 10 If ={max{t, t }} then f ∈ C (R , R ) if and only if f has asymptot- i>0 m n ically sub-polynomial growth, in the sense that, there is a polynomial p : R → R with f(x) lim < ∞. 
( p(x) 1) m n Given an architecture (F , ) on C(R , R ), define its -modification to be the m n 2 architecture (F , ) on C (R , R ) given by F F × × (0, ∞) and where J 2 −|f(·)|( x b ) b J f ,α ,ω ,b ,a ω ( 1) fe + a I + a e I , j j j j j J J <b J b J J j =1 (f ,...,f ) J 1 (F , ) Therefore, the functions in NN are capable of adjusting to the different growth m n rates of functions in C (R , R ) into continuous functions of different growth rates; whereas those in (F , ) need not be. m n Theorem 6 ((F , ) is a Universal Approximator in C (R , R )) If (F , ) is a uni- m n (F , ) versal approximator on C(R , R ) for which each f ∈ NN satisfies the following growth condition sup f(x) e < ∞, (9) x∈R m n then (F , ) is a universal approximator on C (R , R ). 450 A. Kratsios Remark 5 Condition (9) is satisfied by any set of piecewise linear functions. For instance, (F , ) NN is comprised of piecewise linear functions if F is as in Example 4 and σ is the ReLU activation function. The architecture (F , ) often provides a strict improvement over (F , ). m n Proposition 2 Let (F , ) be a universal approximator on C(R , R ) such that each f ∈ (F , ) NN is either constant or sup m f(x) , and let {exp(−kt) : n ∈ N}. x∈R m n Then (F , ) is not a universal approximator on C (R , R ). 4.5 Representation of approximators on L There is currently no available universal approximation theorem describing a small archi- ∞ m n tecture on L (R , R ) with the UAP. Indeed, even trees are not dense therein since the Lebesgue measures is σ -finite and not finite. A direct consequence of Theorem 4 is the guarantee that a minimal architecture on L (R) exists and admits the following representation. Corollary 7 (Existence and Representation of Minimal Universal Approximator on L (R)) There exists a non-empty set I with pre-order ≤, a subset {x } ⊆ L (R) −{0}, i i∈I triples {(B ,φ )} of linear subspaces B of B(L ), bounded linear isomorphisms i i i i∈I i 1 1 1 : L (R) → B , and bounded linear maps φ : L (R) → L (R) such that: i i i (i) B(L ) = B , i∈I (ii) For every i ≤ j , B ⊆ B , i j (iii) For every i ∈ I , + ◦ φ (x ) is dense B for its subspace topology, i i i n∈N i (iv) The architecture (F , ) defined by F ={x } , | : (x ,...,x ) ρ ◦ ◦ φ ◦ η(x ) (10) i i∈I 1 j i i F i ∞ 1 if x = x ,foreach j ≤ J , has the UAP on L (R),where η : R → L and 1 j ∞ ∞ ρ : B(L ) → L are respectively defined as the linear extensions of the maps n n I : s> 0 1 [0,r) η(r) ρ α δ α f . i f i i −I : s< 0, n [−r,0) i=1 i=1 The contributions of this article are now summarized. 5 Conclusion In this paper, we studied the universal approximation property in a scope applicable to most architectures on most function spaces of practical interest. Our results were used to characterize, construct, and establish the existence of such structures both in many familiar and exotic function spaces. Our results were used to establish the universal approximation capabilities of deep and narrow networks with constraints on their final layers and sparsely connected initial layers. We derived approximation bounds for feed-forward networks with this activation function in terms of depth and height. We showed that the set of activation functions for which these p m m results hold is broader when the underlying functions space is L (R ) than if it is C(R ), The Universal Approximation Property 451 which showed that the choice of activation function depends on the underlying topologi- cal criterion quantifying the UAP. 
We characterized the activation functions for which these results hold as precisely being the set of injective, continuous, non-affine activation func- tions which are differentiable at at-least one point with non-zero derivative at that point and have no fixed points. We provided a simple direct way to construct these activation func- tions. We showed that a rescaled and shifted Leaky-ReLU activation is an example of such an activation function while the ReLU activation is not. We used our construction result to build a universal approximator in the space of continuous functions between Euclidean spaces, which have controlled growth, equipped with a uniform notion of convergence. This result strengthens the currently available guarantees for feed-forward networks, which state m n that this architecture is universal in C(R , R ) for the weaker uniform convergence on com- pacts topology. Finally, we obtained a representation of a small universal approximator on ∞ m L (R ). The results, structures, and methods introduced in this paper provide a flexible and broad toolbox to the machine learning community to build, improve, and understand uni- versal approximators. It is hoped that these tools will help others develop new, theoretically justified architectures for their learning tasks. Funding Open access funding provided by Swiss Federal Institute of Technology Zurich. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommonshorg/licenses/by/4.0/. Appendix A: Proofs of Main Results Theorem 1 is encompassed by the following broader but more technical result. Lemma 2 (Characterization of the Universal Approximation Property) Let X be a function space, E is an infinite-dimensional Frec ´ het space for which there exits some homeomor- phism : X → E, and F , be an architecture on X . Then the following are equivalent: (i) UAP: F , has the UAP, (ii) Decomposition of UAP via Subspaces: There exist subspaces {X } of X such that: i i∈I (a) X is dense in X , i∈I (b) For each i ∈ I , X ) is a separable infinite-dimensional Frec ´ het subspace (F , ) of E and NN ∩ X contains a countable, dense, and linearly- independent subset of X ), (c) For each i ∈ I , there exists a homeomorphism : X → L (R). i i (iii) Decomposition of UAP via Topologically Transitive Dynamics: There exist sub- spaces {X } of X and continuous functions {φ } with φ : X → X such i i∈I i i∈I i i i that: 452 A. 
Kratsios (a) X is dense in X , i∈I (b) For every pair of non-empty open subsets U, V of X and every i ∈ I,there is i,U,V some N ∈ N such that φ (U ∩ X ) ∩ (V ∩ X ) =∅, i,U,V i i (F , ) n (c) For every i ∈ I,thereis some g ∈ NN ∩ X such that {φ (g )} is a i i i n∈N (F , ) dense subset of NN ∩ X , and in particular, it is a dense subset of X , i i (d) For each i ∈ I , X is homeomorphic to C(R). (iv) Parameterization of UAP on Subspaces: There are triples {(X ,ψ )} of sep- i i i i∈I arable topological spaces X , non-constant continuous functions : X → X , and i i i functions ψ : X → X satisfying the following: i i i (a) (X ) is dense in X , i i i∈I (b) For every i ∈ I and every pair of non-empty open subsets U, V of X ,thereis i,U,V some N ∈ N such that ψ (U ∩ X ) ∩ (V ∩ X ) =∅, i,U,V i i (F , ) (c) For every i ∈ I,thereissome x ∈ NN ∩ X such that { ◦ ψ (x )} i i i i n∈N (F , ) is a dense subset of NN ∩ (X ), and in particular, it is a dense subset i i of (X ). i i Moreover, if X is separable, then I may be taken to be a singleton. Proof of Lemma 2 Suppose that (ii) holds. Since X is dense in X and since i∈I (F , ) (F , ) (F , ) NN ∩X ⊆ NN , then, it is sufficient to show that NN ∩X i i i∈I i∈I is dense in X to conclude that is is dense in X . Since each X is a subspace of X then, i i i∈I (F , ) by restriction, each X is a subspace of NN ∩ X with its relative topology. i i i∈I Let X denote the set X equipped with the finest topology making each X into a i i i∈I subspace, such a topology exists by [71, Proposition 2.6]. Since each X is also a subspace of X with its relative topology and since, by definition, that topology is no finer than i∈I (F , ) the topology of X then it is sufficient to show that NN ∩ X is dense in X to i∈I conclude that it is dense in X equipped with its relative topology. i∈I Indeed, by [71, Proposition 2.7] the space X is given by the (topological) quotient of the disjoint union X , in the sense of topological spaces (see [71, Example 3, Section 2.4]), i∈I i under the equivalence relation f ∼ f if f = f in X . Denote the corresponding quotient i j i j map by Q .Sinceasubset U of the quotient topology is open (see [71, Example 2, Section −1 2.4]) if and only if Q [U ] is an open subset of X and since a subset V of X is i∈I i i∈I i open if and only if V ∩ X is open for each i ∈ I in the topology of X then U ⊆ X is open i i −1 (F , ) if and only if Q [U]∩ X is open for each i ∈ I.Since {NN ∩ X } + is dense in i i n∈N X then for every open subset U ⊆ X i i (F , ) (F , ) ∅ = U ∩ NN ∩ X ⊆ U ∩ NN ∩ X . (11) i i i∈I In particular, (11) implies that for every open subset U ⊆ X (F , ) −1 (F , ) ∅ = NN ∩ X ∩ Q [U]∩ X ⊆ U ∩ NN ∩ X . (12) i i i i∈I (F , ) Therefore, NN ∩X is dense in X and therefore it is dense in X equipped i i i∈I i∈I with its relative topology. Hence, F has the UAP and therefore (i) holds. In the next portion of the proof, we denote the (linear algebraic) dimension of any vector space V by dim(V ). Recall, that this is the cardinality of the smallest basis for V . We follow The Universal Approximation Property 453 the Von Neumann convention and, whenever required by the context, we identify the natural number n with the ordinal {1,...,n}. Assume that (i) holds. For the first part of this proof, we would like to show that D contains a linearly independent and dense subset D .Since X is homeomorphic to some infinite-dimensional Frechet ´ space E, then there exists a homeomorphism : X → E (F , ) mapping NN to a dense subset D of E. We denote the metric on E by d . 
A con- sequence of [72, Theorem 3.1], discussed thereafter by the authors, implies that since E is an infinite dimensional Frechet ´ space then it has a dense Hamel basis, which we denote by {b } . By definition of the Hamel basis of E we may assume that the cardinality of A, a a∈A denoted by Card(A), is equal to dim(E).Next, we use {b } to produce a base of open a a∈A sets for the topology of E of cardinality equal to dim(E). Since E is a metric space, then its topology is generated by the open sets {Ball (b ,q)} ,where Ball (b ,r) {d(b ,x) < r} . Indeed, since Q is dense E a E a a a∈A,r∈(0,∞) in R, then for every a ∈ A and r ∈ (0, ∞) the basic open set Ball (b ,r) can be expressed E a by Ball (b ,r) = Ball (b ,q). Hence, {Ball (b ,q)} generates E a E a E a a∈A,q∈Q∩(0,∞) q∈Q∩(0,r) the topology on E. Moreover, the cardinality the indexing set A × Q is computed by Card(A×Q∩(0, ∞)) = max{Card(A), Card(Q)}= max{dim(E), Card(Q)}= dim(E), since E is infinite and therefore at-least countable. Therefore, {Ball (b ,q)} E a a∈A,q∈Q∩(0,∞) is a base for the topology on E of Cardinality equal to dim(E).Let ω be the smallest ordinal with Card(ω) = dim(E) = Card(A × Q ∩ (0, ∞)). In particular, there exists a bijection F : ω → A × Q ∩ (0, ∞) which allows us to canonically order the open sets {Ball (F (j ) ,F (j) )} , where for any j< ω we denote F(j) ∈ A and F(j) ∈ E 1 2 j ≤ω 1 2 Q ∩ (0, ∞). We construct D by transfinite induction using ω. Indeed since 1 <ω,thensince D is dense in E and {Ball (F (j ) ,F (j) )} defines a base for the topology of E, then there exists some E 1 2 j ≤ω U ∈{Ball (F (j ) ,F (j) )} containing some d ∈ D. For the inductive step, 1 E 1 2 j ≤ω 1 suppose that for all i ≤ j for some j< ω, we have constructed a linearly inde- pendent set {d } with d ∈{Ball (F (i) ,F (i) )} for every i ≤ j.Since j< i i<j i E 1 2 ω and {d } contains Card(j ) and {d } is a Hamel basis of span({x } ) then i i<j i i<j i i<j dim span({x } ) < dim(E). Hence, span({x } ) has empty interior and therefore i i<j i i<j it cannot contain any {Ball (F (j ) ,F (j) )} . In particular, there is an open subset E 1 2 j ≤ω V ⊆ Ball (F (j ) ,F (j) ) − span({x } ) and since D was assumed to be dense in E then E 1 2 i i<j theremustbesome d ∈ V ⊆ Ball (F (j ) ,F (j) ). This completes the inductive step and j E 1 2 therefore there is a linearly independent and dense subset D {d } contained in D of j j ≤ω cardinality Card(ω) = dim(E). Next, let I be the set of all countable sequences of distinct elements in ω.For every i ∈ I , let E span (d ),where A denotes the closure of a subset A ⊆ E in the topology of i j j ∈i E. Then, each E is a linear subspace of E with countable basis {d } . Since any Frechet ´ i j j ∈i space with countable basis is separable and therefore each E is a separable Frechet ´ space. Moreover, by construction, D ⊆ E ⊆ E (13) i∈I and therefore E is dense in E since D is dense in E.Since is a homeomorphism i∈I −1 then : E → X is a continuous surjection, and since the image of a dense set under any −1 continuous map is dense in the range of that map then (D ) is dense in X . Moreover, 454 A. Kratsios using the fact that inverse images commute with unions and the fact that that is a bijection, we compute that −1  −1 −1 (D ) ⊆ E = [E ] . (14) i i i∈I i∈I (F , ) Since as a bijection and D was defined as the image of NN in E under ,then (F , )  −1 D ⊂ NN and D is dense in X . 
In particular, (14) implies that [E ]⊆ i∈I (F , ) −1 (F , ) −1 (NN ∩ [E ]) and therefore (NN ∩ [E ]) is dense in X .In i i i∈I i∈I −1 −1 particular, [E ] is dense in X , and for each i ∈ I,ifwedefine X [E ] i i i i∈I then we obtain (ii.a). Since is a homeomorphism then it preserves dense sets and in particular since {d } i j ∈i −1 is a countable, dense, and linearly independent subset of [{d } ] then it is a dense j j ∈i countable subset of X . Hence, each X is separable. i i This gives (ii.b). Lastly, by [73] any two separable infinite-dimensional Frechet ´ space are homeomorphic. In particular, since L (R) is a separable Hilbert space is a separable Frechet space. Therefore, for each i ∈ I , there is a homeomorphism : E → L (R). i i In particular, : X → L (R) must be a homeomorphism and therefore (ii.b) holds. i i Therefore, (i) implies (ii). Suppose that (ii) holds. Then, (iii.a) holds by (ii.a). For each i ∈ I,let {d } be a n,i n∈N (F , ) countable dense subset of X ∩NN for which ({d } ) is a linearly independent, i n,i n∈N and let E = span({d } ).Let D {d } and D (D). Thus, for every i n,i n∈N n,i n∈N i∈I i ∈ I , D ∩ E is a countably infinite linearly independent and dense subset of E then by i i [74, Theorem 8.24] there exists a continuous linear operator T : D∩E → D∩E satisfying i i i T (d ) = d , n,i n+1,i for each n ∈ N and each i ∈ I . In particular, T (d ) is dense in E . For each i ∈ I , 0,i i −1 −1 define φ ◦ T ◦ and g (d ) and observe that for every n ∈ N i i i 0,i n −1 −1 −1 φ (g ) = ( ◦ T ◦ ) ◦· · ·◦ ( ◦ T ◦ )( (d )) i i i i,0 n−times −1 n ◦ T (d ). (15) 0,i Since {T (d )} is dense in E and is a homeomorphism from X to E then 0,i n∈N i i i −1 n n {T (d )} = φ (g ) 0,i n∈N i i i n∈N 2 2 is dense in X . Thus, (iii.c) holds. For any i ∈ I,definethemap ψ : L (R) → L (R) by i i −1 ψ ( ) ◦ φ ◦ ( ), i i i i and define the vector g ˜ ∈ L (R) by g ˜ (g ).Since and are homeomorphisms i i i i i and since φ is continuous then ψ is well-defined and continuous. Moreover, analogously i i n 2 2 to (15) we compute that ψ (g ˜ ) is dense in L (R).Since L (R) is a complete separa- n∈N ble metric space with no isolated points and ψ is continuous self-map of L (R) for which 2 n 2 thereisavector g ˜ ∈ L (R) such that the set of iterates {ψ (g ˜ )} is dense in L (R) then i i n∈N Birkhoff Transitivity Theorem, see the formulation of [74, Theorem 1.16], implies that for ˜ ˜ every pair of non-empty open subsets U, V ⊆ L (R) there is some n satisfying ˜ ˜ U,V ˜ ˜ U ,V ˜ ˜ φ (U) ∩ V =∅. (16) The Universal Approximation Property 455 Since is a homeomorphism, then [74, Proposition 1.13] and (16) imply that for every pair of non-empty open subsets U ,V ⊆ X there exists some n   ∈ N satisfying U ,V U ,V φ (U ) ∩ V =∅. (17) Since X is equipped with the subspace topology then every non-empty open subset U ⊆ X is of the form U ∩ X for some non-empty open subset U ⊆ X . Therefore, i i (17) implies (iii.b). Since both L (R) and C(R) are separable infinite-dimensional Frechet ´ spaces then the [73, Anderson-Kadec Theorem] implies that there exists a homeomor- phism  : L (R) → C(R). Therefore, for each i ∈ I ,  ◦ : X → C(R) is a homeomorphism and thus (ii.c) implies (iii.d). Suppose that (iii) holds. For every i ∈ I,set X X ,let 1 be the identity map i i i X on X ,set ψ φ ,and set x g . Therefore, (iv) holds. i i i i i (F , ) Suppose that (iv) holds. By (iv.c), for each i ∈ I , NN ∩ X is dense in X . i i Therefore, (F , ) (F , ) X = NN ∩ X ⊆ NN ∩ X ⊆ X . 
(18) i i i i∈I i∈I i∈I By (iv.a) since X is dense in X therefore its closure is X and therefore the smallest, i∈I and thus only, closed set containing X is X itself. Therefore, by (18) the smallest set i∈I (F , ) (F , ) containing NN ∩ X must be X . Therefore, NN is dense in X and (i) i∈I holds. This concludes the proof. Proof of Theorem 2 By the [73, Anderson-Kadec Theorem] there is no loss of general- m n ity in assuming that m = n = 1, since C(R , R ) and C(R) are homeomorphic. Let (C(R)).By(5), X is dense in X and since density is transitive, then it is i∈I (F , ) enough to show that (NN ) is dense in X to conclude that it is dense in X . i∈I Since each is continuous, then, the topology on X is no finer than the finest topology on (C(R)) making each continuous and by [71, Proposition 2.6] such a topol- i i i∈I ogy exists. Let X denote (C(R)) equipped with the finest topology making each i∈I (C(R)) into a subspace. By construction, if U ⊆ X is open then it is open in X and (F , ) therefore if (NN ) intersects each non-empty open subset of X then it must i∈I (F , ) do the same for X . Hence, it is enough to show that (NN ) is dense in X i∈I (F , ) to conclude that it is dense in X and therefore, (NN ) is dense in X . i∈I We proceed similarly to the proof of Lemma 2. Indeed, by [71, Proposition 2.7] the space X is given by the (topological) quotient of the disjoint union (C(R)), in the sense i∈I i of topological spaces (see [71, Example 3, Section 2.4]), under the equivalence relation f ∼ f if f = f in X . Denote the corresponding quotient map by Q . Since a subset U i j i j −1 of the quotient topology is open (see [71, Example 2, Section 2.4]) if and only if Q [U ] is an open subset of (C(R)) and since a subset V of (C(R)) is open if and only i∈I i i∈I i if V ∩ (C(R)) is open for each i ∈ I in the topology of (C(R)) then U ⊆ X is open if i i −1 (F , ) and only if Q [U ]∩ (C(R)) is open for each i ∈ I .Since {NN ∩ (C(R))} + i i n∈N is dense in (C(R)) then for every open subset U ⊆ (C(R)) i i (F , )  (F , ) ∅ = U ∩ NN ∩ (C(R)) ⊆ U ∩ NN ∩ (C(R)). (19) i i i∈I In particular, (19) implies that for every open subset U ⊆ X (F , ) −1 (F , ) ∅ = NN ∩ (C(R)) ∩ Q [U]∩ (C(R)) ⊆ U ∩ NN ∩ (C(R)). i  i i i∈I (20) 456 A. Kratsios (F , ) Therefore, NN ∩ (C(R)) is dense in X and therefore it is dense in i∈I (C(R)) equipped with its relative topology. Hence, (F , ) has the UAP on X i∈I and therefore it has the UAP on X itself. Proof of Theorem 3 Let σ be a continuous and non-polynomial activation function. Then [61] implies that the architecture F , , as defined in Example 4, is a universal 0 0 approximator on C(R). By Theorem 1, since F , has the UAP on X and since X is homeomorphic to an infinite-dimensional Frechet ´ space then there are homeomorphisms { } from C(R) onto i i∈I a family of subspaces {X } of X such that X is dense. Fix > 0and f ∈ X . i i∈I i i∈I Since X is dense in X there exists some i ∈ I and some f ∈ X such that i i i i∈I d (f, f )< . (21) X i Since is a homeomorphism then it must map dense sets to dense sets. Since F 0, 0 (F 0, 0) has the UAP on C(R) then NN is dense in C(R) and therefore, for each i ∈ I , (F 0, 0) (F 0, 0) (NN ) is dense in X . Hence, there exists some g ˜ ∈ (NN ) such that i i  i d (f , g ˜ )< .Since is a homeomorphism, it is a bijection, therefore there exists a X i  i (F 0, 0) unique g ∈ NN with (g ) =˜ g . Hence, the triangle inequality and (21)imply that d (f, (g )) ≤ d (f, f ) + d (f , (g )) <. 
(22) X i  X i X i i This yields the first inequality in the Theorem’s statement. (F , ) −1 By Theorem 1 since, for each i ∈ I , NN ∩ X is dense in X and since is a i i −1 (F , ) homeomorphism on X then NN ∩ X is dense in C(R). In particular, there i i −1 F , ( ) exits some f ∈ NN ∩ X satisfying d g (x), f (x) <. (23) ucc (F , ) −1 Since is a bijection then there exists a unique f ∈ NN such that (f ) = f . Therefore, (23) and the triangle inequality imply that −1 d g (x), (f )(x) <. ucc Therefore the conclusion holds. Remark 6 By the [73, Anderson-Kadec Theorem], since both L (R) and C(R) are separa- ble infinite-dimensional Frechet ´ spaces then there exists a homeomorphism : L (R) → C(R). Therefore, the proof of Corollary 3 holds (mutatis mutandis) with each replaced −1 2 by and with C(R) in place of L (R). The proof of the next result relies on some aspects of inductive limits of Banach spaces. Briefly, an inductive limit of Banach spaces is a locally convex space B for which there exists a pre-ordered set I , a set of Banach sub-spaces {B } with B ⊆ B if i ≤ j.The i i∈I i j inductive limit of this direct system is the subset B equipped with the finest topology i∈I which simultaneously makes each B into a subspace and makes B into a locally- i i i∈I convex spaces. Spaces constructed in this way are called ultrabornological spaces and more details about them can be found in [75, Chapter 6]. The Universal Approximation Property 457 Proof of Theorem 4 Since B(X ) and B(X) are both infinite-dimensional Banach spaces, then they are infinite-dimensional ultrabornological space, in the sense of [75, Defini- tion 6.1.1]. Since X is separable, then as observed in [33], B(X) is separable. Therefore, [75, Theorem 6.5.8] applies; hence, there exists a directed set I with pre-order ≤, a collec- tion of Banach subspaces {B } satisfying (i) and (ii), and a collection of continuous linear i i∈I isomorphisms : B(X) → B . Furthermore, the topology on B is coarser than the induc- i i tive limit topology lim B . Since each B(X) and B are Banach spaces, and in particular i i i∈I − → normed linear spaces, then by the results of [76, Section 2.7] the maps are bounded linear isomorphisms. Let i ∈ I ,and fixany x ∈ X −{0 } then since δ : X → B(X) is base-point preserving i X then δ = 0 and therefore there exists a linearly independent subset B of B(X) containing x i δ .Since B(X) is separable then B is countably infinite and therefore [74, Theorem 8.24] x i n X there exists a bounded linear map φ : B(X) → B(X) such that {φ (δ )} + is a dense i n∈N i x subset of B(X). Since is a continuous linear isomorphisms then it is in particular a surjective continu- ous map from B(X) onto B . Since the image of a dense set under a continuous surjection is itself dense then ◦ φ (δ ) is a dense subset of B . Moreover, this holds for each i x + i i i n∈N i ∈ I . By definition, the topology on lim B is at-least as fine as the Banach space topology − →i∈I on B(X ), since each B is a linear subspace of B(X ). Moreover, the topology on lim B 0 i 0 i i∈I − → is no finer than the finest topology on B making each B into a topological space (but i i i∈I not requiring that B be locally-convex), which exists by [77, Proposition 6]. Denote i∈I this latter space by B . Therefore, if ◦ φ (δ ) , (24) i x i i i∈I ; n∈N is dense in B then it is dense in lim B and in B(X ). Hence, we show that (24)isdense i 0 i∈I − → ˜ ˜ in B . That is, it is enough to show that every open subset of B contains an element of (24). 
By [71, Proposition 2.7] the space B is given by the topological quotient of the disjoint union  B , in the sense of topological spaces (see [71, Example 3, Section 2.4]), under i∈I i the equivalence relation x ∼ x for any i ≤ j if x = x in B . Denote the corresponding i j i j j quotient map by Q .Sinceasubset U of the quotient topology is open (see [71,Example −1 2, Section 2.4]) if and only if Q [U ] is an open subset of  B and since a subset V i∈I i of  B is open if and only if V ∩ B is open for each i ∈ I in the topology of B then i∈I i i i −1 U ⊆ B is open if and only if Q [U]∩ B is open for each i ∈ I.Since { ◦ φ (x )} + i i i n∈N is dense in B then for every open subset U ⊆ B i i n  n ∅ = U ∩{ ◦ φ (x )} + ⊆ U ∩ ◦ φ (δ ) . (25) i i i x n∈N i i i i∈I ; n∈N In particular, (25) implies that for every open subset U ⊆ B n −1 n ∅ ={ ◦ φ (x )} + ∩ Q [U]∩ B ⊆ ◦ φ (δ ) ∩ U . (26) i i n∈N i i x i i i i∈I ; n∈N Therefore, (24)isdense in B and, in particular, it is dense in B(X ). Since X was barycentric, then there exists a continuous linear map ρ : B(X ) → X 0 0 0 X 0 which is a left-inverse of δ . Thus, for every f ∈ X , ρ ◦ δ = f and therefore ρ is a f 458 A. Kratsios continuous surjection. Since the image of a dense set under a continuous surjection is dense and since (24) is dense then ρ ◦ ◦ φ (δ ) , (27) i x i i i∈I ; n∈N is a dense subset of X .Since X has assumed to be dense in X and since density is transitive 0 0 then (27)isdense in X . This concludes the main portion of the proof. The final remark follows from the fact that if X = X then the identity map 1 : X → 0 X X is an isometry and therefore the universal property of B(X) described in Theorem [32, Theorem 3.6] implies that 1 uniquely extends to a bounded linear isomorphism L between B(X) and B(X ) satisfying X X X −1 X X −1 X 0 0 0 L ◦ δ = δ ◦ 1 = δ and L ◦ δ = δ ◦ 1 = δ . Hence L must be the identity on B(X). Appendix B: Proof of Applications of Main Results Lemma 3 Fix some b ∈ R , and let σ : R → R be a continuous activation function. m n Then is a well-defined and continuous linear map from C(R , R ) to itself and the A,b following are equivalent: m n + (i) For each δ> 0, > 0 and each f, g ∈ C(R , R ) thereissome N ∈ N such U,V that U,V ˜ ˜ (g) ˜ : d (g, ˜ g) < δ ∩ f : d (f,f) <  =∅, ucc ucc (ii) σ is injective, A is of full-rank, and for every compact subset K ⊆[a, b] there is some N ∈ N such that S (K) ∩ K =∅, where S(x) = σ • (Ax + b). If A is the m × m-identity matrix I and b > 0 for i = 1,...,m then (i) and (ii) are m i equivalent to (iii) σ is injective and has no fixed-points. If A is the m × m-identity matrix I and b > 0 for i = 1,...,m then (iii) is equivalent to m i (iv) Either σ(x) > x or σ(x) < x for every x ∈ R. Proof Lemma 3 By [37, Theorem 46.8] the topology of uniform convergence on compacts m n is the compact-open topology on C(R , R ) and by [37, Theorem 46.11] composition is a continuous operation in the compact-open topology. Therefore, is well-defined and A,b continuous map. Its linearity follows from the fact that (af + g) = (af ) ◦ S = a(f ◦ S) + g ◦ S. A,b g Since the topology of uniform convergence on compacts is a metric topology, with met- ric d ,then ucc m n U : f ∈ C(R , R ),  > 0 defines a base for this topology, where U f, f, m n { } g ∈ C(R , R ) : d (f, g) <  . Therefore, Lemma 3 (i) is equivalent to the statement: ucc m n + for each pair of non-empty open subsets U, V ∈ C(R , R ) there is some N ∈ N such U,V U,V that (U ) ∩ V =∅. Without loss of generality, we prove this formulation instead. 
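The dichotomy in conditions (iii) and (iv) is easy to check numerically. The NumPy sketch below (helper names and constants are illustrative and not taken from the paper) shows that ReLU has fixed points and is not injective on (−∞, 0], whereas a rescaled and shifted Leaky-ReLU satisfies σ(x) > x everywhere on a fine grid and is strictly increasing, hence injective and fixed-point free.

```python
import numpy as np

# Hypothetical helpers; the constants are illustrative, not the paper's.
def relu(x):
    return np.maximum(x, 0.0)

def shifted_leaky_relu(x, alpha=0.3, scale=1.0, shift=0.2):
    # scale * leaky_relu(x) + shift; with scale >= 1, scale*alpha <= 1, shift > 0
    # this gives sigma(x) > x for every x, so: injective and no fixed points.
    return scale * np.where(x >= 0, x, alpha * x) + shift

xs = np.linspace(-10, 10, 2001)

# ReLU: every x >= 0 is a fixed point, and ReLU is constant (not injective) on (-inf, 0].
print(np.all(relu(xs[xs >= 0]) == xs[xs >= 0]))   # True  -> fixed points
print(len(np.unique(relu(xs[xs <= 0]))))          # 1     -> not injective

# Rescaled and shifted Leaky-ReLU: sigma(x) > x on the whole grid, strictly increasing.
sig = shifted_leaky_relu(xs)
print(np.all(sig > xs))                            # True  -> no fixed points
print(np.all(np.diff(sig) > 0))                    # True  -> injective
```

Any constants with scale ≥ 1, scale·alpha ≤ 1, and shift > 0 would do here; the particular values play no special role.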
I,b The Universal Approximation Property 459 Next, by [78, Corollary 4.1] satisfies Theorem 1 (ii.b) if and only if S(x) σ(Ax + A,b m + b) is injective and for every compact subset K ⊆ R there exists some N ∈ N such that S (K) ∩ K =∅. (28) Therefore, A must be injective which is only possible if A is of full-rank. This gives the equivalence between (i) and (ii). We consider the equivalence between (ii) and (iii) in the case where A is the identity matrix and b > 0for i = 1,...,m.Since S(x) = (σ (x + b ), . . . , σ (x + b )) it is i 1 m sufficient to verify condition (28) in the case where m = 1. Since b > 0for1,...,m then it is clear that S is injective and has no fixed points if and only if σ is injective and has no fixed points. We show that S is injective and has no fixed points if and only if (ii) holds. Indeed, note that if S has not fixed points, then since b > 0for i = 1,...,m then S has no fixed points if and only if σ no fixed points. From here, we proceed analogously to the proof of [79, Lemma 4.1]. If S hasafixed- + N point then for every N ∈ N , S (x) ={x} which is a non-empty compact subset of R. Therefore, (28) cannot hold. Conversely, suppose that S has no fixed points. The intermediate-value theorem and the fact that S has no fixed-points that either S(x) < x or S(x) > x. Mutatis mutandis, we proceed with the first case. Since σ is injective and S has not fixed points then S must be a strictly increasing function; thus S([a, b]) =[S(a), S(b)] for every a< b. Let K be a non-empty compact subset of R. By the Heine-Borel theorem K is closed and bounded, thus it is contained in some [a, b] for a< b. Therefore, it is sufficient to show the results for the case where K =[a, b].Since S is increasing then for every n ∈ N, n n n+1 the sequence {S (a)} satisfies S (a) < S (a). If this sequence is not unbounded then n∈N there would exist some a ∈ R such that a = lim S (a). Therefore, by the continuity of 0 0 n→∞ S we would find that n n+1 n n a = lim S (a) = lim S (a) = lim S(S (a)) = S lim S (a) = S(a ), 0 0 n→∞ n→∞ n→∞ n→∞ but since S has not fixed points then there cannot exist such an a since otherwise a = 0 0 S(a ). Therefore, a does not exist and thus {S (a)} is unbounded. Hence, for every 0 0 n∈N a< b there exists some N ∈ N such that [a,b] [a,b] S ([a, b]) ∩[a, b]=∅. Thus, (ii) and (iii) are equivalent when A = I . m n Next, assume that any of (i) to (iii) hold, that X is a non-empty subset of C(R , R ),and m n that F , has the UAP on X . Then for any other non-empty open subset U ⊆ C(R , R ) there exists some N ∈ N such that X ,U X ,U [X ]∩ U =∅. (29) A,b X ,U N −1 Since is continuous then so is and therefore ( ) [U ] is a non-empty open A,b A,b A,b m n subset of C(R , R ). Since the finite intersection of open sets is again open, then we have that N N N X ,U −1 X ,U X ,U [X ]∩ U = X ∩ [U ]. (30) A,b A,b A,b X ,U m n This implies that X ∩ [U ] is a non-empty open subset of C(R , R ) contained in X . I ,b X ,U (F , ) Since F , has te UAP on X , then there exists some f ∈ NN ∩[X ∩ [U ]]. A,b N N (F σ ;deep, σ ;deep) X ,U X ,U Thus, (f ) ∈ U and, by definition, (f ) ∈ NN . 460 A. Kratsios Thus, for each U in m n g ∈ C(R , R )d (g, f ) <  , (31) ucc m n f ∈C(R ,R ),>0 + (F , ) N there exists some N ∈ N and some f ∈ NN such that (f ) ∈ U . 
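The escape condition (28) can likewise be illustrated numerically: under the same shifted Leaky-ReLU (illustrative constants again), the iterates of S(x) = σ(x + b) push a grid discretizing the compact set K = [−2, 2] entirely above K after finitely many steps; this is only a finite-grid sanity check of the behaviour established above.

```python
import numpy as np

# A sketch of condition (28): iterates of S(x) = sigma(x + b) push a compact
# interval off of itself when sigma is injective with no fixed points.
# Constants and helper names are illustrative, not taken from the paper.
def sigma(x, alpha=0.3, shift=0.2):
    return np.where(x >= 0, x, alpha * x) + shift   # sigma(x) > x for all x

b = 0.5                                             # positive bias
S = lambda x: sigma(x + b)

K = np.linspace(-2.0, 2.0, 401)                     # grid for the compact K = [-2, 2]
x, N = K.copy(), 0
while x.min() <= K.max():                           # stop once S^N(K) lies above K
    x, N = S(x), N + 1
print(N, x.min())                                   # S^N(K) and K are disjoint for this N
```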
In particu- U U U m n lar, since (31) is a base for the topology on C(R , R ) and since the intersection of open sets is again open, then every non-empty open subset of U is contained an element of (31)which, (F σ ;deep, σ ;deep) in turn, contains an element of the form (f ). Thus, NN ∩ U =∅. (F σ ;deep, σ ;deep) m n Hence, NN has the UAP on C(R , R ). Proof of Theorem 5 The equivalence between (i), (ii), and (iv) follows from Lemma 3. The equivalence between (iii) and (iv) follows from the formulation of Birkhoff’s transitivity theorem described in [74, Theorem 2.19]. Proof of Proposition 1 Since α < 1then σ(x) > x for every x< 0. Since 0 <α then 1 2 σ(0) = 0 <α . Lastly, since σ ˜ is monotone increasing then for every x> 0wehavethat σ(x) > x + α >x. Therefore, σ cannot have a fixed point. Moreover, since σ ˜ is strictly increasing it must be injective, since if x< y then σ(x) < σ(y) and therefore σ(x) = σ(y) if x = y. Hence, σ is injective. Moreover, since the sum of continuous functions is again continuous, then σ is continuous. Since α x + α is affine then it is continuously differentiable. Thus σ is continuously 1 2 differentiable on any x< 0. Lastly, setting α not equal to σ ˜ (0) − 1 ensure that σ is not differentiable at 0 and therefore it cannot be polynomial. In particular, it cannot be affine. m n m n For convenience, we denote the collection of set-functions from R to R by [R , R ]. m n m n m n Proof of Corollary 4 Since d is a metric on [R , R ] and since C(R , R ) ⊆[R , R ], ucc m n m n then the map F : C(R , R ) → C(R , R ) defined by F(g) d (f ,g) is continuous. ucc 0 −1 m n Therefore, the set F [(−∞,δ)] is an open subset of C(R , R ). In particular, (7) guar- antees that it is non-empty. Since σ is non-affine and continuously differentiable at-least at one point with non-zero derivative at that point then [17, Theorem 3.2] applies, whence the m n set X of continuous functions h : R → R with representation h(x) = W ◦ σ • ··· ◦ σ • W , J 1 d d j j +1 where W : R → R ,for j = 1,...,J − 1, are affine and n + 2 ≥ d if j ∈{1,J } j m j m n −1 and d = m,and d = n,isdense in C(R , R ). Therefore, since F [(−∞,δ)] is an 1 J m n −1 −1 open subset of C(R , R ) then X ∩ F [(−∞,δ)] is dense in F [(−∞,δ)]. Fix some b ∈ R with b > 0for i = 1,...,m.Since σ is continuous, injective, and has no fixed-points then applying Lemma 3 implies that X { (f ) : f ∈ I ,b −1 + m n F [(−∞,δ)]∩ X ,N ∈ N }, is a dense subset of C(R , R ). This gives (i). Moreover, by construction, every g ∈ X admits a representation satisfying (iii) and (iv). Furthermore, since W ◦ σ •· · ·◦ σ • W ∈ X and by construction there exists some g ∈ X for which J 1 2 1 d (W ◦ σ •· · ·◦ σ • W ,g) <δ,; then (ii) holds. ucc J 1 The Universal Approximation Property 461 Proof of Corollary 5 Since each F ,for n = 1,...,N , is a continuous function from m n −1 m n C(R , R ) to [0, ∞] then each F [[0,C )] is an open subset of C(R , R ). Since the N −1 finite intersection of open sets is itself open, then ∩ F [[0,C )] is an open subset of n=1 m n m n C(R , R ). Since there exists some f ∈ C(R , R ) satisfying (8)then U is non-empty. m n Since F , has the UAP on C(R , R ) then F , ∩ U is dense in U . Fix b ∈ R with b > 0for i = 1,...,m and set A = I . i m Since σ is a transitive activation function then Corollary 1 applies and therefore the set (F , ) N m n (f ) : f ∈ NN ∩ U is dense in C(R , R ). Therefore (i)-(iv) hold. I ,b Proof of Corollary 2 Let S(x) = σ •(x +b) and let B {x ∈ R : σ(x) > x}. By hypothe- sis B is Borel and μ(B) > 0. 
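For orientation, the networks Φ^N_{I_m,b}(f) appearing in Corollaries 4 and 5 amount to pre-composing an ordinary feed-forward network f with N extra layers whose weight matrix is the identity, whose bias b has strictly positive entries, and whose activation is injective with no fixed points. The sketch below is a minimal reading of that construction with assumed shapes and an arbitrary one-hidden-layer base network; none of the names or sizes come from the paper.

```python
import numpy as np

def sigma(x, alpha=0.3, shift=0.2):
    # an injective, fixed-point-free activation (illustrative constants)
    return np.where(x >= 0, x, alpha * x) + shift

def transitive_layers(x, b, N):
    # x -> S^N(x), where S(x) = sigma(x + b) acts componentwise
    for _ in range(N):
        x = sigma(x + b)
    return x

def base_network(x, W1, b1, W2, b2):
    # an ordinary one-hidden-layer feed-forward network (the part on which the
    # corollaries allow constraints to be imposed)
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
m, width, n, N = 3, 16, 2, 4
W1, b1 = rng.normal(size=(m, width)), rng.normal(size=width)
W2, b2 = rng.normal(size=(width, n)), rng.normal(size=n)
b = np.full(m, 0.5)                                  # positive bias, identity weights

x = rng.normal(size=(5, m))
y = base_network(transitive_layers(x, b, N), W1, b1, W2, b2)   # Phi^N_{I,b}(f)(x)
print(y.shape)                                                 # (5, 2)
```

In this reading only the base network carries the constrained or trained parameters, while the prepended layers are fixed up to the choice of N; this separation is what allows universality to survive the constraints imposed in the corollaries.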
For each i = 1,...,m we compute σ •(x +b )>x +b ≥ x . i i i i i Therefore, for μ-a.e. every x ∈ B , N ∈ N and each i = 1,...,m S (x) ≥ x + Nb . i i i Since b > 0 then lim S (x) =∞. Therefore, the condition [80, Corollary 1.3 (C2)] is N →∞ met, and by the discussion following the result on [80, page 127], condition [80, Corollary 1 m n 1.3 (C1)] holds; i.e.: for every non-empty open subset U, V ⊆ L (R , R ) there exists some N ∈ N such that U,V U,V (U ) ∩ V =∅. (32) I ,b U,V By Lemma 1, the map and therefore the map is continuous. Thus, I ,b m I ,b N N U,V −1 1 m n X ,U −1 ) [V ] is a non-empty open subset of L (R , R ) and therefore U ∩( ) [V ] I ,b I ,b m m is a non-empty open subset of U.Taking U = Ball (g, δ) and V = 1 m n L (R ,R ) Ball (f, ) we obtain the conclusion. 1 m n L (R ,R ) Proof of Corollary 3 By Proposition 1 and the observation in its proof that σ(x) > x we only need to verify that σ is Borel bi-measurable. Indeed, since σ is continuous and injective −1 then by [81, Proposition 2.1], σ exists and is continuous on the image of σ .Since σ was −1 −1 assumed to be surjective then σ exists on all of R and is continuous thereon. Hence, σ and σ are measurable since any continuous function is measurable. Proof of Theorem 6 Fix A = I and b ∈ R with b > 0for i = 1,...,m.Since m i int (co (A) F ) is a non-empty open set then there exists some f ∈ int (co(F )) and some δ> 0forwhich 1 m Ball 1 m (f, δ) g ∈ L (R ) : f(x) − g(x)dμ(x) < δ L (R ) μ x∈R is an open subset of int (co (A) F ). Since co (A) F ∩ int(co (A) F ) is dense in int(co (A) F ) then its intersection with any non-empty open subset thereof is also dense; in particular, co(F ) ∩ Ball (f, δ) is dense in Ball (f, δ).Since σ is L -transitive 1 m 1 m L (R ) L (R ) μ μ then (iii) follows from Corollary 2. 1 1 m Since L is a metric space then Ball (g, δ) : g ∈ L (R ), δ > 0 is abasefor 1 m μ L (R ) μ the topology thereon. Therefore, Corollary 2 implies that for any two non-empty open sub- U,V 1 m sets U, V ∈ L (R ) there exists some N ∈ N satisfying (U ) ∩ V =∅. Hence, U,V I ,b 1 m is topologically transitive on L (R ), in the sense of [74, Definition 1.38]. Moreover, I ,b m μ since is a continuous linear map then Birkhoff’s transitivity theorem, as formulated I ,b 1 m in [74, Theorem 2.19], applies and therefore is a hypercylic operator on L (R ). I ,b μ 462 A. Kratsios Therefore, [74, Proposition 5.8] implies that > 1. Setting κ yields I ,b op I ,b op m m (ii). 1 m It remains to show the approximation bound of described by (i). Fix f ∈ L (R ). 1 m Since L (R ) is a Banach space then it has no isolated points and since is a hyper- I ,b μ m cylic operator then Birkhoff’s transitivity theorem, as formulated in [74, Theorem 2.19], 1 m implies that there exists a dense G -subset HC( ) ⊆ L (R ) such that for every δ I ,b m μ 1 m g ∈ HC( ) the set { (g)} is dense in L (R ). Therefore, every non-empty I ,b N ∈N m I ,b μ 1 m open subset of L (R ) contains some element of HC( ). In particular, there is some I ,b μ m 1 m g ∈ HC( ) ∩ int(co(F )) since int(co(F )) is a non-empty open subset of L (R ). I ,b m μ Since co (A) F ∩ int(co (A) F ) is dense in int(co (A) F ) then, in particular, g ∈ int(co (A) F ). 
Therefore, the conditions of [69, Theorem 2] and [69, Equation (23)] are met, hence, for each n ∈ N the following approximation bound holds 2μ(R ) inf α f (x) − g(x) dμ(x) ≤ √ , (33) i i f ∈F , α =1,α ∈[0,1] n i i i x∈R i=1 i=1 N 1 m Since { (g)} is dense in L (R ) then there exists some N ∈ N for which N ∈N I ,b N 1 (g) ∈ Ball 1 m f, . Thus, the following bound holds L (R ) I ,b m μ n f(x) − (g)(x)dμ(x) ≤ √ , (34) I ,b x∈R 1 m Since is a continuous linear map from the Banach space L (R ) to itself then I ,b m μ it is Lipschitz with constant ,where · denotes the operator norm, and by I ,b op op [64, Corollary 2.1.2] we have d(σ • (·+ b)) μ = . (35) I ,b m op dμ Moreover, by Lemma 1, we know that the right-hand side of (35) is finite. Therefore (34) implies that for every f ,...,f ∈ F , α ,...,α ∈[0, 1] with α = 1, the following 1 n 1 n i i=1 holds α f (x) − f(x) dμ(x) i i I ,b x∈R i=1 N N α f (x) − (g) (x) dμ(x) i i I ,b I ,b m m x∈R i=1 + f(x) − (g) (x) dμ(x) I ,b x∈R (36) α f (x) − g(x) dμ(x) i i I ,b op m x∈R i=1 (g) (x) − f(x) dμ(x) I ,b x∈R d(σ • (·+ b)) μ 1 ≤ α f (x) − g(x) dμ(x) + √ . i i dμ n M x∈R i=1 The Universal Approximation Property 463 Combining the estimates (33)–(36) we obtain inf α f (x) − f(x) dμ(x) i i I ,b n m f ∈F , α =1,α ∈[0,1] x∈R i i=1 i i i=1 d(σ • (·+ b)) μ 1 ≤ α f (x) − g(x) dμ(x) + i i dμ m n M x∈R i=1 d(σ • (·+ b)) μ 2μ(R ) 1 ≤ √ + √ dμ n n = √ 1 + 2μ(R ) . (37) Since is linear, then the right-hand side of (37) reduces and we obtain the following I ,b estimate inf α (f ) (x)−f(x) dμ(x) ≤ √ 1+ 2μ(R ) . i i I ,b n m f ∈F , α =1,α ∈[0,1] n x∈R i i=1 i i i=1 (38) Therefore, the estimate in (i) holds. For the statement of the next lemma concerns the Banach space of functions vanishing m n m at infinity. Denoted by C (R , R ), this is the set of continuous functions f from R to n m R such that, given any > 0 there exists some compact subset K ⊆ R for which m n sup f(x) <. As discussed in [82, VII], C (R , R ) is made into a Banach space x∈K by equipping with the supremum norm f  sup m f(x). x∈R Lemma 4 (Uniform Approximation of Functions Vanishing at Infinity) Suppose that m n m n F , is a universal approximator on C(R , R ), then for every f ∈ C (R , R ) and m n every > 0 there exists g ∈ C (R , R ) with representation | | 2 − g (·) (x−b) b−· f (·) = g e + a I + ae I , (39) ·<b ·≥b (F , ) the absolute value |·| is applied component-wise, g ∈ NN , and a, b > 0, and satisfying the uniform approximation bound f − f <. m n Proof of Lemma 4 Let F , be a universal approximator on C(R , R ),let f ∈ m n C (R , R ),and > 0. Since f vanishes at infinity then there exists some non-empty com- m −1 pact K ⊆ R for which f(x)≤ 2 for every x ∈ K . By the Heine-Borel theorem ,f ,f K is bounded and therefore there exists some b > 0 such that K ⊆ Ball m (0,b ) ,f ,f R {x ∈ R :x <b }. Therefore, −1 sup f(x) <2 . (40) x∈R −Ball m (0,b ) −1 1−x Since the bump function x → e I is continuous, affine functions are contin- |x|<1 m n uous, f ∈ C(R , R ), and the composition and multiplication of continuous functions is −1 b −x again continuous then the function x → f(x) − 2 e I is itself continu- x<b ous. Observe also that the set Ball(0, b ) = {x ∈ R :x≤ b } is closed and bounded, 464 A. Kratsios thus it is compact by the Heine-Borel theorem. Since F , is a universal approximator m n on C(R , R ) for the topology of uniform convergence on compacts then there exists some F , ( ) g ∈ NN satisfying −1  2 −1 b −x sup g (x) − f(x) − 2 e I <2 . 
(41) x<b x∈Ball(0,b ) b −x Since 0 ≤ e ≤ 1for every x ∈ R , then from (41) we compute −1 b −x sup g (x)e I  + 2 I  − f(x) x<b x<b x∈Ball(0,b ) 2 −1 b −x ≤ sup g (x)e + 2 − f(x) x∈Ball(0,b ) b b b − − 2 −1  2  2 b −x b −x b −x ≤ sup g (x)e + f(x) − 2 e e x∈Ball(0,b ) (42) b b 2 −1  2 b −x b −x ≤ sup e g (x) + f(x) − 2 e x∈Ball(0,b ) −1  2 b −x ≤ sup g (x) + f(x) − 2 e x∈Ball(0,b ) ≤ . Observe that, for every x ∈ R − Ball(0, b ) we have x− b ≥ 0, −|g (x)|≤ 0and therefore −1 −|g (x)|(x−b ) 0 ≤ 2 e ≤ . (43) Combining (40), (42), and (43) we compute the following bound 2 −1 −1 −|g (x)|(x−b) b −x sup g (x)e + 2 I + 2 e I − f(x) x<b x≥b x∈R 2 −1 −|g (x)|(x−b) b −x ≤ max sup g (x)e I  + 2 e I  − f(x) , x<b x<b x∈Ball(0,b ) 2 −1 −|g (x)|(x−b) b −x sup g (x)e I + 2 e I − f(x) x<b x<b x∈R −Ball(0,b ) 2 −1 −|g (x)|(x−b) b −x ≤ max , sup g (x)e I  + 2 e I  −f(x) x<b x<b x∈R −Ball(0,b ) −1 −|g (x)|(x−b) = max , sup 2 e I − f(x) x<b x∈R −Ball(0,b ) −1 −|g (x)|(x−b) ≤ max , sup 2 e + sup f(x) m  m x∈R −Ball(0,b ) x∈R −Ball(0,b ) −1 −1 = max{, 2 + 2 }= . (44) Thus, the result holds. The Universal Approximation Property 465 m n m n Proof of Theorem 6 For each ω ∈ ,definethemap : C (R , R ) → C (R , R ) by ω 0 ω m n (f ) (ω(·) + 1) f . For each f, g ∈ C (R , R ) we compute ω 0 (f ) − (g) ω ω (f ) − (g) = sup ω ω ω,∞ ω(·) + 1 x∈R (ω(·) + 1) f(x) − (ω(·) + 1) g(x) = sup m ω(·) + 1 (45) x∈R (ω(·) + 1) f(x) − g(x) = sup m ω(·) + 1 x∈R =f − g . Therefore, for each ω ∈ ,the map is an isometry. For each ω ∈ ,definethemap m n m m n ˜ ˜ ˜ : C (R , R ) → C (R , R) by  (f) f . For each f ∈ C (R , R ) and ω ω 0 ω ω ω(·)+1 compute 1 1 ˜ ˜ ˜ ˜ ◦  (f) = f = (ω(·) + 1) f =f . (46) ω ω ω ω(·) + 1 ω(·) + 1 Hence,  is a right-inverse of . Since every isometry is a homeomorphism onto ω ω its image and since is surjective isometry then defines a homeomorphism from ω ω m n m n m n m n C (R , R ) onto C (R , R ). In particular, (C (R , R )) = C (R , R ). Therefore, 0 ω ω 0 ω m n m n m n m n C (R , R ) = C (R , R ) = C (R , R ) = C (R , R ). ω ω 0 ω ω∈ ω∈ Hence, condition (5) holds. −x Since it was assumed that sup m f(x)e < ∞ holds, then Lemma 4 applies, x∈R whence, 2 −|f(·)|(x−b) (F , ) b−· fe + a I + ae I : 0 <b, a, f ∈ NN ·<b ·≥b m n is dense in C (R , R ). Therefore, the conditions for Theorem 2 are met. Hence, −|f(·)|(x−b) (F , ) b−· fe + a I + ae I : 0 <b, a, f ∈NN ω ·<b ·≥b ω∈ (47) (F , ) m n is dense in C (R , R ). By definition, (47) is a subset of NN and therefore (F , ) m n NN is dense in C (R , R ). Hence, F , is a universal approximator on m n C (R , R ). Proof of Proposition 2 For each k, m ∈ N with n ≤ m, we have that exp(−kt) > exp(−mt ) for every t ∈[0, ∞). Thus, m n m n C (R , R ) ⊆ C (R , R ), (48) exp(−k·) exp(−m·) and the inclusion is strict if n<m. Moreover, for n ≤ m, the inclu- k m n m n sion of each i : C (R , R ) into C (R , R ) is continuous. Thus, exp(−n·) exp(−m·) m n k C (R , R ), i is a strict inductive system of Banach spaces. Therefore, by exp(−k·) n∈N m n [83, Proposition 4.5.1] there exists a finest topology on C (R , R ) both mak- exp(−k·) k∈N m n ing it into a locally-convex space and ensuring that each C (R , R ) is a subspace. exp(−k·) m n LCS m n Denote C (R , R ) equipped with this topology by C (R , R ). exp(−k·) k∈N LCS m n If f ∈ C (R , R ) then by construction there must exist some K ∈ N such that m n f ∈ C (R , R ).By[84, Propositions 2 and 4], a sequence {f } converges exp(−K·) t t ∈N 466 A. 
Kratsios to some f if and only if there exists some K ∈ N and some N ∈ N such that for m n every t ≥ N every f ∈ C (R , R ) and the sub-sequence {f } converges in K t t t ≥N exp(−K·) m n m n the Banach topology of C (R , R ) to f . In particular, since C (R , R ) = exp(−K·) exp(−0·) m n m n C (R , R ) then the function f(x) (exp(−|x|),..., exp(−|x|)) ∈ C (R , R ). 0 exp(−0·) (F , ) Since each f ∈ NN is either constant of sup f(x)=∞ then for any x∈R (F , ) + sequence {f } ∈ NN there exists some N ∈ N for which the sub-sequence t t ∈N 0 m n m n {f } lies in C (R , R ) = C (R , R ) if and only if for each t ≥ N the map f t t ≥N exp(−0·) 0 0 t is constant. Therefore, for each t ≥ N we compute that f − f  =f − f  ≥ inf sup | exp(−|x|) − c| > . t exp(0·),∞ t ∞ c∈R m 2 x∈R m n Hence, f cannot converge to f in C (R , R ) and therefore F , does not have the m n UAP on C (R , R ). Proof of Corollary 7 Let X R and X X L (R). Since every Banach space is a pointed metric space with reference-point its zero vector and since R is separable then Theorem 4 applies. We only need to verify the form of η and of ρ. Indeed, the identification of B(R) with L (R) and explicit description of η is constructed in [32, Example 3.11]. The fact that L (R) is barycentric follows from the fact that it is a Banach space and by [31, Lemma 2.4]. References 1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943) 2. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psych. Rev. 65(6), 386 (1958) 3. Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990) 4. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989) 5. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251– 257 (1991) 6. Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953–956 (1957) 7. Webb, S.: Deep learning for biology. Nature 554(7693) (2018) 8. Eraslan, G., Avsec, Z., Gagneur, J., Theis, F.J.: Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20(7), 389–403 (2019) 9. Plis, S.M.: Deep learning for neuroimaging: a validation study. Front. Neurosci. 8, 229 (2014) 10. Zhang, W.E., Sheng, Q.Z., Alhazmi, A., Li, C.: Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Trans. Intell. Syst. Technol. 11(3) (2020) 11. Buehler, H., Gonon, L., Teichmann, J., Wood, B.: Deep hedging. Quant. Finance 19(8), 1271–1291 (2019) 12. Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. J. Mach. Learn. Res. 20, Paper No. 74, 25 (2019) 13. Cuchiero, C., Khosrawi, W., Teichmann, J.: A generative adversarial network approach to calibration of local stochastic volatility models. Risks 8(4), 101 (2020) 14. Kratsios, A., Hyndman, C.: Deep arbitrage-free learning in a generalized HJM framework via arbitrage- regularization. Risks 8(2), 40 (2020) 15. Horvath, B., Muguruza, A., Tomas, M.: Deep learning volatility: a deep neural network perspective on pricing and calibration in (rough) volatility models. Quant. Finance 0(0), 1–17 (2020) 16. 
Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993) The Universal Approximation Property 467 17. Kidger, P., Lyons, T. In: Abernethy, J., Agarwal, S. (eds.): Universal Approximation with Deep Narrow Networks, vol. 125, pp. 2306–2327. PMLR, USA (2020) 18. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989) 19. Park, S., Yun, C., Lee, J., Shin, J.: Minimum width for universal approximation. ICLR (2021) 20. Hanin, B.: Universal function approximation by deep neural nets with bounded width and relu activations. Math. - MDPI 7(10) (2019) 21. Lu, Z., Pu, H., Wang, F., Hu, Z., Wang, L.: The expressive power of neural networks: A view from the width. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6231–6239. Curran Associates, Inc. (2017) 22. Fletcher, P.T., Venkatasubramanian, S., Joshi, S.: The geometric median on riemannian manifolds with application to robust atlas estimation. Neuroimage 45(1), S143–S152 (2009). Mathematics in Brain Imaging 23. Keller-Ressel, M., Nargang, S.: Hydra: a method for strain-minimizing hyperbolic embedding of network- and distance-based data. J. Complex Netw. 8(1), cnaa002, 18 (2020) 24. Ganea, O., Becigneul, G., Hofmann, T.: Hyperbolic neural networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 5345–5355. Curran Associates, Inc. (2018) 25. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International Conference on Machine Learning, pp. 7354–7363. PMLR (2019) 26. Arens, R.F., Eells, J.: On embedding uniform and topological spaces. Pacific J. Math. 6, 397–403 (1956) 27. von Luxburg, U., Bousquet, O.: Distance-based classification with Lipschitz functions. J. Mach. Learn. Res. 5, 669–695 (2003/04) 28. Ambrosio, L., Puglisi, D.: Linear extension operators between spaces of Lipschitz maps and optimal transport. J. Reine Angew. Math. 764, 1–21 (2020) 29. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks, pp. 214–223. PMLR, International Convention Centre, Sydney, Australia (2017) 30. Xu, T., Le, W., Munn, M., Acciaio, B.: Cot-gan: Generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems 33 (2020) 31. Godefroy, G., Kalton, N.J.: Lipschitz-free Banach spaces. pp. 121–141. Dedicated to Professor Alek- sander Pełczynski ´ on the occasion of his 70th birthday (2003) 32. Weaver, N.: Lipschitz algebras. World Scientific Publishing Co. Pte. Ltd., Hackensack (2018) 33. Godefroy, G.: A survey on Lipschitz-free Banach spaces. Comment. Math. 55(2), 89–118 (2015) 34. Jost, J. Riemannian Geometry and Geometric Analysis, 6th edn. Universitext, Springer, Heidelberg (2011) 35. Basso, G.: Extending and improving conical bicombings. preprint 2005.13941 (2020) 36. Nagata, J. Modern general topology, revised. North-Holland Publishing Co., Amsterdam (1974). Wolters-Noordhoff Publishing, Groningen; American Elsevier Publishing Co., New York (1974). Bibliotheca Mathematica, Vol. VII 37. Munkres, J.R.: Topology. Prentice Hall, Inc., Upper Saddle River (2000). 2 38. Micchelli, C.A., Xu, Y., Zhang, H.: Universal kernels. J. Mach. Learn. Res. 7, 2651–2667 (2006) 39. 
Kontorovich, L., Nadler, B.: Universal kernel-based learning with applications to regular languages. J. Mach. Learn. Res. 10, 1095–1129 (2009) 40. Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal multi-task kernels. J. Mach. Learn. Res. 9, 1615–1646 (2008) 41. Grigoryeva, L., Ortega, J.-P.: Differentiable reservoir computing. J. Mach. Learn. Res. 20, Paper No. 179, 62 (2019) 42. Cuchiero, C., Gonon, L., Grigoryeva, L., Ortega, J.-P., Teichmann, J.: Discrete-time signatures and randomness in reservoir computing. pre-print 2010.14615 (2020) 43. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis. 105(2), 171–185 (2013) 44. Kratsios, A., Bilokopytov, E.: Non-euclidean universal approximation (2020) 45. Osborne, M.S.: Locally convex spaces, Graduate Texts in Mathematics, vol. 269. Springer, Cham (2014) 46. Petersen, P., Raslan, M., Voigtlaender, F.: Topological properties of the set of functions generated by neural networks of fixed size. Found Comput Math. https://doi.org/10.1007/s10208-020-09461-0 (2020) 47. Gribonval, R., Kutyniok, G., Nielsen, M., Voigtlaender, F.: Approximation spaces of deep neural networks. Constr. Approx forthcoming (2020) 48. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge (2016) 49. Gelfand, I.: Normierte Ringe. Rec. Math. N. S. 9(51), 3–24 (1941) 468 A. Kratsios 50. Isbell, J.R.: Structure of categories. Bull. Amer. Math. Soc. 72, 619–655 (1966) 51. Dimov, G.D.: Some generalizations of the Stone duality theorem. Publ. Math. Debrecen 80(3-4), 255– 293 (2012) 52. Tuitman, J.: A refinement of a mixed sparse effective Nullstellensatz. Int. Math. Res. Not. IMRN 7, 1560–1572 (2011) 53. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis. 105(2), 171–185 (2013) 54. Meyer, G., Bonnabel, S., Sepulchre, R.: Regression on fixed-rank positive semidefinite matrices: a Riemannian approach. J. Mach. Learn. Res. 12, 593–625 (2011) 55. Baes, M., Herrera, C., Neufeld, A., Ruyssen, P.: Low-rank plus sparse decomposition of covariance matrices using neural network parametrization. pre-print 1908.00461 (2019) 56. Hummel, J., Biederman, I.: Dynamic binding in a neural network for shape recognition. Psych. Rev. 99, 480–517 (1992) 57. Bishop, C.M.: Mixture density networks (1994) 58. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. ICLR (2017) 59. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. Neural Netw. Learn Syst. 20(1), 61–80 (2009) 60. PrajitRamachandran, Q.V.L.: Searching for activation functions. ICLR (2018) 61. Pinkus, A.: Approximation theory of the MLP model in neural networks 8, 143–195 (1999) 62. Koopman, B.O.: Hamiltonian systems and transformation in hilbert space. Proc. Natl. Acad. Sci. 17(5), 315–318 (1931) 63. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. ICML 30(1), 3 (2013) 64. Singh, R.K., Manhas, J.S.: Composition operators on function spaces, North-Holland Mathematics Studies, vol. 179. North-Holland Publishing Co., Amsterdam (1993) 65. Bengio, Y.: Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 17–36. JMLR Workshop and Conference Proceedings (2012) 66. 
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2018, pp. 270–279. Springer (2018)
67. Chollet, F., et al.: Keras. https://keras.io/guides/transfer_learning/ (2015)
68. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930–945 (1993)
69. Darken, C., Donahue, M., Gurvits, L., Sontag, E.: Rate of approximation results motivated by robust neural network learning. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 303–309. Association for Computing Machinery, New York (1993)
70. Prolla, J.B.: Weighted spaces of vector-valued continuous functions. Ann. Mat. Pura Appl. (4) 89, 145–157 (1971)
71. Bourbaki, N.: Éléments de mathématique. Topologie générale. Chapitres 1 à 4. Hermann, Paris (1971)
72. Phelps, R.R.: Subreflexive normed linear spaces. Arch. Math. (Basel) 8, 444–450 (1957)
73. Kadec, M.I.: A proof of the topological equivalence of all separable infinite-dimensional Banach spaces. Funkcional. Anal. i Priložen. 1, 61–70 (1967)
74. Grosse-Erdmann, K.-G., Peris Manguillot, A.: Linear chaos. Universitext, Springer, London (2011)
75. Pérez Carreras, P., Bonet, J.: Barrelled locally convex spaces, North-Holland Mathematics Studies, vol. 131. North-Holland Publishing Co., Amsterdam. Notas de Matemática [Mathematical Notes], 113 (1987)
76. Kreyszig, E.: Introductory functional analysis with applications, Wiley Classics Library. Wiley, New York (1989)
77. Bourbaki, N.: Espaces vectoriels topologiques. Chapitres 1 à 5, new edition. Masson, Paris (1981). Éléments de mathématique
78. Kalmes, T.: Dynamics of weighted composition operators on function spaces defined by local properties. Studia Math. 249(3), 259–301 (2019)
79. Przestacki, A.: Dynamical properties of weighted composition operators on the space of smooth functions. J. Math. Anal. Appl. 445(1), 1097–1113 (2017)
80. Bayart, F., Darji, U.B., Pires, B.: Topological transitivity and mixing of composition operators. J. Math. Anal. Appl. 465(1), 125–139 (2018)
81. Hoffmann, H.: On the continuity of the inverses of strictly monotonic functions. Irish Math. Soc. Bull. (75), 45–57 (2015)
82. Behrends, E., Schmidt-Bichler, U.: M-structure and the Banach-Stone theorem. Studia Math. 69(1), 33–40 (1980/81)
83. Jarchow, H.: Locally convex spaces. B. G. Teubner, Stuttgart. Mathematische Leitfäden [Mathematical Textbooks] (1981)
84. Dieudonné, J., Schwartz, L.: La dualité dans les espaces F et LF. Ann. Inst. Fourier (Grenoble) 1, 61–101 (1949)
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


As a final application of the results, the existence theorem is then used to pro- vide a representation of a small universal approximator on L (R), which provides the first concrete step towards obtaining a tractable universal approximator thereon. 2 Background and preliminaries This section overviews the analytic, topological, and learning-theoretic background used to in this paper. 2.1 Metric spaces Typically, two points x, y ∈ R are thought of as being near to one another if y belongs to the open ball with radius δ> 0 centered about x defined by Ball (x, δ) {z ∈ R : x − z <δ},where (x, z) x − z denotes the Euclidean distance function. The analogue can be said if we replace R by a set X on which there is a distance function d : X × X →[0, ∞) quantifying the closeness of any two members of X. Many familiar properties of the Euclidean distance function are axiomatically required of d in order to maintain many of the useful analytic properties of R ; namely, d is required to satisfy the triangle inequality, symmetry in its arguments, and it vanishes precisely when its arguments are identical. As before, two points x, y ∈ X are thought of as being close if they belong to the same open ball,Ball (x, δ) {z ∈ X : d (x, z) < δ} where δ> 0. Together, the X X pair (X, d ) is called a metric space, and this simple structure can be used to describe many familiar constructions prevalent throughout learning theory. We follow the convention of only denoting (X, d ) by X whenever the context is clear. Example 1 (Spaces of Continuous Functions) For instance, the universal approximation theorems of [16–19] describe conditions under which any continuous function from R to R can be approximated by a feed-forward neural network. The distance function used to formulate their approximation results is defined on any two continuous functions f, g : m n R → R via sup m f(x) − g(x) x∈[−k,k] d (f, g) . ucc 2 1 + sup f(x) − g(x) x∈[−k,k] k=1 m n m n In this way, the set of continuous functions from R to R by C(R , R ) is made into a metric space when paired with d . In what follows, we make the convention of denoting ucc C(X, R) by C(X). Example 2 (Space of Integrable Functions) Not all functions encountered in practice are continuous, and the approximation of discontinuous functions by deep feed-forward 438 A. Kratsios m n networks is studied in [20, 21] for functions belonging to the space L (R , R ).Briefly, m n m n elements of L (R , R ) are equivalence classes of Borel measurable f : R → R , identified up to μ-null sets, for which the norm f f(x) dμ(x) p,μ x∈R is finite; here μ is a fixed Borel measure on R and 1 ≤ p< ∞. We follow the convention m p m m of denoting L (R , R) by L (R ) when μ is the Lebesgue measure on R . m n m n Unlike C(R , R ), the distance function on L (R , R ) is induced through a norm via (f, g) f − g . Spaces of this type simultaneously carry compatible metric and p,μ vector spaces structures. Moreover, in such a space, if every sequence converges whenever its pairwise distances asymptotically tend to zero, then the space is called a Banach space. The prototypical Banach space is R . Unlike Banach spaces or the space of Example 1, general metric spaces are non-linear. That is, there is no meaningful notion of addition, scaling, and there is no singular ref- erence point analogous to the 0 vector. 
Examples of non-linear metric spaces arising in machine learning are shape spaces used in neuroimaging applications (see [22]), graphs and trees arising in structured and hierarchical learning (see [23, 24]), and spaces of probability measures appearing in adversarial approaches to learning (see [25]). The lack of a reference point may always be overcome by artificially declaring a fixed element of X, denoted by 0 , to be the central point of reference in X. In this case, the triple (X, d , 0 ), is called a pointed metric space. We follow the convention of denoting X X the triple by X, whenever the context is clear. For pointed metric spaces X and Y,the class of functions f : X → Y satisfying f(0 ) = 0 and f(x ) − f(x ) L x − X Y 1 2 1 x ,for some L> 0andevery x ,x ∈ X, is denoted by Lip (X, Y ) and this class is 2 1 2 understood as mapping the structure of X into Y without too large of a distortion. In the extreme case where an f ∈ Lip (X, Y ) perfectly respects the structure of X,i.e. : when f(x ) − f(x ) x − x , we call f a pointed isometry. In this case, f(X) represents 1 2 1 2 an exact copy of X within Y . The remaining non-linear aspects of a general metric space pose no significant challenge and this is due to the following linearization feature map of [26]. Since its inception, the following method has found notable applications in clustering [27] and in optimal transport [28]. In particular, the later connects this linearization procedure with optimal transport approaches to adversarial learning of [29, 30]. Example 3 (Free-Space over X) We follow the formulation described in [28]. Let X be a metric space and for any x ∈ X,let δ be the (Borel) probability measure assigning value 1 to any Ball ⊆ X if x ∈ Ball and 0 otherwise. X X The Free-space over X is the Banach space B(X) obtained by completing the vec- tor space α δ : a ∈ R,x ∈ X, n = 1,...,N, N ∈ N with respect to the n x n n + n=1 following n n α x sup α f(x ).(1) i i i i f 1; f ∈Lip (X,R) i=1 i=1 B(X) As shown in [31, Proposition 2.1], the map δ : x → δ is a (non-linear) isometry from X to B(X).Asshown in[32], the pair (B(X), δ ) is characterized by the following linearization The Universal Approximation Property 439 property: whenever f ∈ Lip (X, Y ) and Y is a Banach space then there exists a unique continuous linear map satisfying f = F ◦ δ.(2) Thus, δ : X → B(X) can be interpreted as a minimal isometric linearizing feature map. Sometimes the feature map δ can be continuously inverted from the left. In [31]any continuous map ρ : B(X) → X is called a barycenter if it satisfies ρ ◦ δ = 1 ,where 1 X X is the identity on X. Following [31], if a barycenter exists then X is called barcycentric.Examplesof barycentric spaces are Banach spaces [33], Cartan-Hadamard manifolds described (see [34, Corollary 6.9.1]), and other structures described in [35]. Accordingly, many function spaces of potential interest contain a dense barycentric subspace. When the context is clear, we follow the convention of denoting δ simply by δ. 2.2 Topological Background Rather than using open balls to quantify closeness, it is often more convenient to work with open subsets of X;where U ⊆ X is said to be open whenever every point x ∈ U belongs to some open ball B (x, δ) contained in U . This is because open sets have many desirable properties; for example, a convergent sequence contained in the complement of an open set must also have its limit in that open set’s complement. 
Thus, the complement of open sets are often called closed sets since their limits cannot escape them. Unfortunately, many familiar situations arising in approximation theory cannot be described by a distance function. For example, there is no distance function describing the point-wise convergence of a sequence of functions {f } on R to any other such func- n n∈N tion f (for details [36, page 362]). In these cases, it is more convenient to work directly with topologies. A topology τ is a collection of subsets of a given set X whose members are declared as being open if τ satisfies certain algebraic conditions emulating the basic prop- erties of the typical open subsets of R (see [37, Chapter 2]). Explicitly, we require that τ contain the empty set ∅ as well as the entire space X, we require that the arbitrary union of subsets of X belonging to τ also belongs to τ , and we require that finite intersections of subsets of X belonging to τ also be a member of τ.A topological space isapairofaset X and a topology τ thereon. We follow the convention of denoting topological spaces with the same symbol as their underlying set. Most universal approximation theorems [4, 16, 17] guarantee that a particular subset of m n C(R , R ) is dense therein. In general, A ⊆ X is dense if the smallest closed subset of X containing A is X itself. Topological spaces containing a dense subset which can be put in a 1-1 correspondence with the natural numbers N is called a separable space. Many familiar m m spaces are separable, such as C(R ) and R . m n A function f : R → R is thought of as continuously depending on its inputs if small variations in its inputs can only produce small variations in its outputs; that is, for any x ∈ m −1 R 0 there exists some δ> 0 such that f [Ball n ] ⊆ Ball m (x, δ). It can R R −1 be shown, see [37], that this condition is equivalent to requiring that the pre-image f [U ] n m of any open subset U of R is open in R . This reformulation means that open sets are preserved under the inverse-image of continuous functions, and it lends itself more readily to abstraction. Thus, a function f : X → Y between arbitrary topological spaces X and Y is −1 continuous if f [U ] is open in X whenever U is open in Y .If f is a continuous bijection −1 and its inverse function f : Y → X is continuous, then f is called a homeomorphism 440 A. Kratsios and X and Y are thought of as being topologically identical. If f is a homeomorphism onto its image, f is an embedding. We illustrate the use of homeomorphisms with a learning theoretic example. Many learn- ing problems encountered empirically benefit from feature maps modifying the input a of learning model; for example, this is often the case with kernel methods (see [38–40]), in reservoir computing (see [41, 42]), and in geometric deep learning (see [23, 43]). Recently, in [44], it was shown that, a feature map φ : X → R is continuous and injective if and only if the set of all functions f ◦ φ ∈ C(X),where f ∈ C(R ) is a deep feed-forward net- work with ReLU activation, is dense in C(X). A key factor in this characterization is that the map Φ : C(R ) → C(X),given by f → f ◦ φ, is an embedding if φ is continuous and injective. The above example suggests that our study of an architecture’s approximation capa- bilities is valid on any topological space which can be mapped homeomorphically onto a well-behaved topological space. For us, a space will be well-behaved if it belongs to the broad class of Frechet ´ spaces. 
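The characterization from [44] recalled above is easy to render in code: pre-composition with a continuous, injective feature map $\phi$ turns a family of ReLU networks on $\mathbb{R}^d$ into a family of models on $X$. In the sketch below, $X$ is taken to be the circle parametrized by an angle and $\phi$ its standard embedding into $\mathbb{R}^2$; both choices, and the random placeholder weights, serve only to illustrate the embedding $\Phi : f \mapsto f \circ \phi$.

```python
import numpy as np

def relu_net(params, x):
    # A generic deep feed-forward ReLU network; the weights used below are
    # random placeholders rather than trained values.
    *hidden, (W_out, b_out) = params
    for W, b in hidden:
        x = np.maximum(W @ x + b, 0.0)
    return W_out @ x + b_out

def phi(theta):
    # A continuous, injective feature map from angles in [0, 2*pi) into R^2;
    # continuity and injectivity are exactly what the characterization requires.
    return np.array([np.cos(theta), np.sin(theta)])

rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 2)), rng.normal(size=8)),
          (rng.normal(size=(1, 8)), rng.normal(size=1))]

# The embedding Phi: f -> f . phi sends ReLU networks on R^2 to models on the circle.
model_on_circle = lambda theta: relu_net(params, phi(theta))
print(model_on_circle(0.3))
```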
Briefly, these spaces have compatible topological space and vector space structures, meaning that the basic vector space operations such as addi- tion, inversion, and scalar multiplication are continuous; furthermore, their topology is induced by a complete distance function which is invariant under translation and satisfies an additional technical condition described in [45, Section 3.7]. The class of Frechet ´ spaces encompass all Hilbert and Banach spaces and they share many familiar properties with R . m n Relevant examples of a Frechet ´ space are C(R , R ), the free-space B(X) over any pointed 1 m n metric space, and L (R , R ). 2.3 Universal approximation background In the machine learning literature, universal approximation refers to a model class’ ability to generically approximate any member of a large topological space whose elements are functions, or more rigorously, equivalence classes of functions. Accordingly, in this paper, we focus on a class of topological spaces which we call function spaces. In this paper, a function space X is a topological space whose elements are equivalence classes of functions between two sets X and Y . For example, when X = R = Y then X may be C(R) or L (R). We refer to X as a function space between X and Y and we omit the dependence to X and Y if it is clear from the context. The elements in X are called functions, whereas functions between sets are referred to as set-functions. By a partial function f : X → Y we mean a binary relation between the sets X and Y which attributes at-most one output in Y to each input in X. Notational Conventions The following notational conventions are maintained throughout this paper. Only non-empty outputs of any partial function f are specified. We denote the + + + set of positive integers by N .Weset N N ∪{0}.Forany n ∈ N ,the n-fold Cartesian product of a set A with itself is denoted by A .For n ∈ N, we denote the n-fold composition n 0 of a function φ : X → X with itself by φ and the 0-fold composition φ is defined to be the identity map on X. Definition 1 (Architecture) Let X be a function space. An architecture on X is a pair (F , ) of a set of set-functions F between (possibly different) sets and a partial function : F → X , satisfying the following non-triviality condition: there exists some J ∈N f ∈ X , J ∈ N ,and f ,...,f ∈ F satisfying 1 J f = (f ) ∈ X.(3) j =1 The Universal Approximation Property 441 The set of all functions f in X for which there is some J ∈ N and some f ,...,f ∈ F 1 J (F , ) satisfying the representation (3) is denoted by NN . Many familiar structures in machine learning, such as convolutional neural networks, trees, radial basis functions, or various other structures can be formulated as architectures. To fix notation and to illustrate the scope of our results we express some familiar machine learning models in the language of Definition 1. Example 4 (Deep Feed-Forward Networks) Fix a continuous function σ : d D R → R, denote component-wise composition by •,and letAff(R , R ) be d D m n the set of affine functions from R to R .Let X = C(R , R ), F d d i i+1 (W ,W ) : W ∈ Aff(R , R ), i = 1, 2 ,and set 2 1 1 d ,d ,d ∈N 1 2 3 ((W ,W ) ) W ◦ σ • W ◦ ··· ◦ W ◦ σ • W (4) j,2 j,1 2,J 1,J 2,1 1,1 j =1 whenever the right-hand side of (4) is well-defined. Since the composition of two affine (F , ) m functions is again affine then NN is the set of deep feed-forward networks from R to R with activation function σ . 
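A minimal computational rendering of Definition 1 in the setting of Example 4: each element of $F$ is a pair of affine maps, and the partial map $\rho$ assembles a tuple of such pairs into a deep network, being undefined whenever the dimensions fail to compose. The activation, the block dimensions, and the random weights below are placeholder choices.

```python
import numpy as np

def sigma(x):
    # Any fixed continuous activation; tanh is only a placeholder choice.
    return np.tanh(x)

def rho(layer_pairs):
    # Sketch of the partial map rho of Example 4: each element of F is a pair of
    # affine maps ((W2, b2), (W1, b1)) contributing the block W2 . sigma . W1,
    # and rho composes the blocks.  The map is partial: if consecutive blocks
    # have incompatible dimensions the matrix products below fail, i.e. rho is
    # undefined on that tuple.
    def network(x):
        for (W2, b2), (W1, b1) in layer_pairs:
            x = sigma(W1 @ x + b1)
            x = W2 @ x + b2
        return x
    return network

rng = np.random.default_rng(1)
def block(d_in, d_hid, d_out):
    return ((rng.normal(size=(d_out, d_hid)), rng.normal(size=d_out)),
            (rng.normal(size=(d_hid, d_in)), rng.normal(size=d_hid)))

f = rho([block(3, 5, 4), block(4, 6, 2)])   # an element of NN^(F, rho) from R^3 to R^2
print(f(np.ones(3)))
```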
Remark 1 The construction of Example 4 parallels the formulation given in [46, 47]. How- (F , ) ever, in [47]elementsof F are referred to as neural networks and functions in NN are called their realizations. Example 5 (Trees) Let X = L (R), F {(a,b,c) : a ∈ R,b, c ∈ R,b ≤ c},andlet J F , J ( ) 1 ((a ,b ,c ) ) a I . Then, NN is the set of trees in L (R). j j j j (b ,c ) j =1 j j j =1 We are interested in architectures which can generically approximate any function on their associated function space. Paraphrasing [48, page 67], any such architecture is called a universal approximator. Definition 2 (The Universal Approximation Property) An architecture (F , ) is said to (F , ) have the universal approximation property (UAP) if NN is dense in X . 3 Main Results Our first result provides a correspondence between the apriori algebraic structure of uni- (F , ) versal approximators on X and decompositions of X into subspaces on which NN contains the orbit of a topologically generic dynamical system, which are a priori of a topo- logical nature. The interchangeability of algebraic and geometric structures is a common theme, notable examples include [49–52]. Theorem 1 (Characterization: Dynamical Systems Structure of Universal Approximators) Let X be a function space which is homeomorphic to an infinite-dimensional Frec ´ het space and let (F , ) be an architecture on X . Then, the following are equivalent: (i) (F , ) is a universal approximator, (ii) There exist subspaces {X } of X , continuous functions {φ } with φ : X → X , i i∈I i i∈I i i i (F , ) and {g } ⊆ NN such that: i i∈I (a) X is dense in X , i∈I 442 A. Kratsios (b) For each i ∈ I and every pair of non-empty open U, V ⊆ X ,thereissome N ∈ N satisfying i,U,V i,U,V φ (U ) ∩ (V ) =∅, n (F , ) (c) For every i ∈ I , g ∈ X and {φ (g )} is a dense subset of NN ∩ X , i i i n∈N i (d) For each i ∈ I , X is homeomorphic to C(R). (F , ) In particular, φ (g ) : i ∈ I, n ∈ N is dense in NN . Theorem 1 describes the structure of universal approximators, however, it does not describe an explicit means of constructing them. Nevertheless, Theorem 1 (ii.a) and (ii.d) suggest that universal approximators on most function spaces can be built by combining m n multiple, non-trivial, transformations of universal approximators on C(R , R ). This is type of transformation approach to architecture construction is common in geo- metric deep learning, whereby non-Euclidean data is mapped to the input of familiar d D architectures defined between R and R using specific feature maps and that model’s out- puts are then return to the manifold by inverting the feature map. Examples include the hyperbolic feed-forward architecture of [24], and the shape space regressors of [53], and the matrix-valued regressors of [54, 55], amongst others. This transformation procedure is a particular instance of the following general construction method, which extends [44]. Theorem 2 (Construction: Universal Approximators by Transformation) Let n, m, ∈ N , m n X be a function space, (F , ) be a universal approximator on C(R , R ), and { } i i∈I m n be a non-empty set of continuous functions from C(R , R ) to X satisfying the following condition: m n Φ C(R , R ) is dense in X.(5) i∈I Then (F , ) has the UAP on X,where F F × I and {f ,i } Φ Φ Φ Φ j j j =1 Φ (f ) . I j j =1 The alternative approach to architecture development, subscribed to by authors such as [56–59], specifies the elementary functions F and the rule for combining them. 
Thus, this method explicitly specifies F and implicitly specifies . These competing approaches are in-fact equivalent since every universal approximator an approximately a transformation of the feed-forward architecture on C(R). Theorem 3 (Representation: Universal Approximators are Transformed Neural Networks) Let σ be a continuous, non-polynomial activation function, and let (F , ) denote the 0 0 architecture of Example 4. Let X be a function space which is homeomorphic to an infinite- dimensional Frec ´ het. If (F , ) has the UAP on X then, there exists a family { } of i i∈I (F , ) embeddings : C(R) → X such that for every 0, f ∈ NN there exists some (F , ) (F , ) 0 0 i ∈ I , g ∈ NN , and f ∈ NN satisfying −1 d ( (g )) and d g (f ) . X i ucc The previous two results describe the structure of universal approximators but they do not imply the existence of such architectures. Indeed, the existence of a universal approxi- mator on X can always be obtained by setting F = X and (f ) = f ;however,thisis (F , ) uninteresting since F is large, is trivial, and NN is intractable. Instead, the next The Universal Approximation Property 443 result shows that, for a broad range of function spaces, there are universal approximators for which F is a singleton, and the structure of is parameterized by any prespecified separable metric space. This description is possible by appealing to the free-space on X . Theorem 4 (Existence: Small Universal Approximators) Let X be a separable pointed met- ric space with at least two points, let X be a function space and a pointed metric space, and let X be a dense barycentric sub-space of X . Then, there exists a non-empty set I with pre-order ≤, {x } ⊆ X −{0 } there exist triples {(B ,φ )} of linear subspaces i i∈I X i i i i∈I B of B(X ), bounded linear isomorphisms : B(X) → B , and bounded linear maps i 0 i i φ : B(X) → B(X) satisfying: (i) B(X ) = B , 0 i i∈I (ii) For every i ≤ j , B ⊆ B , i j (iii) For every i ∈ I , ◦ φ (x ) is dense in B with respect to its subspace i i i n∈N i topology, (iv) The architecture F ={x } , and | : (x ,...,x ) ρ ◦ ◦ φ ◦ δ , i i∈I 1 J i x F i j whenever x = x for each j ≤ J , is a universal approximator on X . 1 j Furthermore, if X = X then the set I is a singleton and is the identity on B(X ). i 0 The rest of this paper is devoted to the concrete implications of these results in learning theory. 4 Applications The dynamical systems described by Theorem 1 (ii) can, in general, be complicated. How- ever, when (F , ) is the feed-forward architecture with certain specific activation functions then these dynamical systems explicitly describe the addition of deep layers to a shallow feed-forward network. We begin the next section by characterizing those activation function before outlining their approximation properties. 4.1 Depth as a transitive dynamical system The impact of different activation functions on the expressiveness of neural network archi- tectures is an active research area. For example, [60] empirically studies the effect of different activation function on expressiveness and in [61] a characterization of the activa- tion functions for which shallow feed-forward networks are universal is also obtained. The next result characterizes the activation functions which produce feed-forward networks with the UAP even when no weight or bias is trained and the matrices {A } are sparse, and n=1 the final layers of the network are slightly perturbed. 
Fix an activation function σ : R → R.For every m × m matrix A and b ∈ R , define the associated composition operator : f → f ◦ σ • (A ·+b), with termi- A,b nology rooted in [62]. The family of composition operators { } creates depth within A,b A,b an architecture (F , ) by extending it to include any function of the form ◦· · ·◦ A ,b N N J N ((f ) ) , for some m × m matrices {A } , {b } in R , and each f ∈ F A ,b j n n j 1 1 j =1 n=1 for j = 1,...,J . In fact, many of the results only require the following smaller extension of (F , ), denoted by (F , ),where F F × N and where deep;σ deep;σ deep;σ J J J {(f ,n )} ((f ) ) , deep;σ j j j j =1 I ,b j =1 m 444 A. Kratsios and b is any fixed element of R with positive components and I is the m × m identity matrix. Theorem 5 (Characterization of Transitivity in Deep Feed-Forward Networks) Let (F , ) m n m be an architecture on C(R , R ), σ be a continuous activation function, fix any b ∈ R with strictly positive components. Then is a well-defined continuous linear map from I ,b m n C(R , R ) to itself and the following are equivalent: (i) σ is injective and has no fixed-points, (ii) Either σ(x) > x or σ(x) < x holds for every x ∈ R m n (iii) For every g ∈ (F , ) and every δ> 0, there exists some g ˜ ∈ C(R , R ) with m n d (g, g) ˜ < δ such that, for each f ∈ C(R , R ) and each 0 there is a ucc N ∈ N satisfying d ( ˜ ucc I ,b m n + (iv) For each 0 and every f, g ∈ C(R , R ) there is some N ∈ N such that U,V ˜ ˜ (g) ˜ : d (g, ˜ g) < δ ∩ f : d ( =∅. ucc ucc I ,b Remark 2 A characterization is given in Appendix B when A = I , however, this less technical formulation is sufficient for all our applications. We call an activation function transitive if it satisfies any of the conditions (i)-(ii) in Theorem 5. Example 6 The ReLU activation function σ(x) = max{0,x} does not satisfy Theorem 5 (i). Example 7 The following variant of the Leaky-ReLU activation of [63] does satisfy Theorem 5 (i) 1.1x + .1 x ≥ 0 σ(x) 0.1x + .1 x< 0. More generally, transitive activation functions also satisfying the conditions required by the central results of [17, 61] can be build via the following. Proposition 1 (Construction of Transitive Activation Functions) Let σ ˜ : R → R be a continuous and strictly increasing function satisfying σ( ˜ 0) = 0. Fix hyper-parameters 0 < α < 1, 0 <α such that α =˜ σ (0) − 1, and define 1 2 2 σ( ˜ x) + x + α : x ≥ 0 σ(x) α x + α : x< 0. 1 2 Then, σ is continuous, injective, has no fixed-points, is non-polynomial, and is continuously differentiable with non-zero derivative on infinitely many points. In particular, σ satisfies the requirements of Theorem 5. Transitive activation functions allow one to automatically conclude that m n (F , ) has the UAP on C(R , R ) if (F , ) is only a universal approximator σ ;deep σ ;deep on some non-empty open subset thereof. The Universal Approximation Property 445 m n Corollary 1 (Local-to-Global UAP) Let X be a non-empty open subset of C(R , R ) and (F , ) be a universal approximator on X . If any of the conditions described by Lemma 3 m n (i)-(iii) hold, then (F , )[σ ; deep] is a universal approximator on C(R , R ). The function space affects which activation functions are transitive. Since most universal m n m approximation results hold in the space C(R , R ) or on L (R ), for suitable μ and p,we describe the integrable variant of transitive activation functions. 4.1.1 Integrable variants Some notation is required when expressing the integrable variants of the Theorem 5 and its consequences. 
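A small numerical sketch of the objects of this subsection: the transitive activation of Example 7 and the depth operator $\Phi_{I_m,b}$ it generates. The bias, the base model $f$, and the number of iterates below are arbitrary illustrative choices.

```python
import numpy as np

def sigma(x):
    # The shifted Leaky-ReLU variant of Example 7: continuous, injective, and
    # satisfying sigma(x) > x for every x, hence without fixed points.
    return np.where(x >= 0, 1.1 * x + 0.1, 0.1 * x + 0.1)

def depth_operator(f, b):
    # The composition operator Phi_{I_m, b}: f -> f . sigma . (x + b); iterating
    # it prepends untrained, sparsely connected (identity-weight) deep layers.
    return lambda x: f(sigma(x + b))

f = lambda x: np.sin(x)             # an arbitrary "shallow" model on R
g, b = f, np.array([0.5])
for _ in range(4):                  # four extra untrained layers
    g = depth_operator(g, b)
print(f(np.array([0.2])), g(np.array([0.2])))

# Absence of fixed points, checked on a grid: sigma(x) - x stays strictly positive.
xs = np.linspace(-10.0, 10.0, 2001)
print((sigma(xs) - xs).min())
```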
Fix a σ -finite Borel measure μ on R . Unlike in the continuous case, the 1 m operators may not be well-defined or continuous from L (R ) to itself. We require A,b m m the notion of a push-forward measure by a measurable function is required. If S : R → R is Borel measurable and μ is a finite Borel measure on R , then its push-forward by S is the m −1 measure denoted by S μ and defined on Borel subsets B ⊆ R by S μ(B) μ S [B] . # # In particular, if μ is absolutely continuous with respect to the Lebesgue measure μ on R , then as discussed in [64, Chapter 2.1], S μ admits a Radon-Nikodym derivative with respect to the Lebesgue measure on R . We denote this Radon-Nikodym derivative dS μ by . A finite Borel measure μ on R is equivalent to the Lebesgue measure thereon, dμ denoted by μ if both μ and μ are absolutely continuous with one another. M M Recall that, if a function is monotone on R, then it is differentiable outside a μ -null set. We denote the μ -a.e. derivative of any such a function σ by σ . Lastly, we denote the 1 m essential supremum of any f ∈ L (R ) by f . The following Lemma is a rephrasing of [64, Corollary 2.1.2, Example 2.17]. Lemma 1 Fix a σ -finite Borel measure μ on R equivalent to the Lebesgue measure, let 1 ≤ p< ∞, b ∈ R , A be an m × m matrix, and let σ : R → R be a Borel measurable. 1 m n 1 m n Then, the composition operator : L (R ; R ) → L (R ; R ) is well-defined and A,b continuous if and only if (σ • (A ·+b)) μ is absolutely-continuous with respect to μ and d(σ • (A ·+b)) μ < ∞.(6) dμ In particular, when σ is monotone then is well-defined if and only if there exists some I ,b M> 0 such that for every x ∈ R, M ≤ σ (x + b). 1 m n 1 m n For g ∈ L (R , R ) and δ> 0, we denote the set of all functions f ∈ L (R , R ) μ μ satisfying f(x) − g(x) by Ball 1 m n (g, δ). A function is called Borel L (R ,R ) x∈R bi-measurable if both the image and pre-images of Borel sets, under that map, are again Borel sets. Corollary 2 (Transitive Activation Functions (Integrable Variant)) Let μ be a σ -finite mea- m m sure on R ,let b ∈ R with b > 0 for i = 1,...,m, and suppose that σ is injective, Borel bi-measurable, that σ(x) > x except on a Borel set of μ-measure 0, and assume that 1 m condition (6) holds. If (F , ) has the UAP on Ball(g, δ) for some f ∈ L (R ) and some 1 m (F , ) δ> 0 then, for every f ∈ L (R ) and every 0 there exists some f ∈ NN and N ∈ N such that f(x) − (f (x)) . I ,b x∈R 446 A. Kratsios We call activation functions satisfying the conditions of Corollary 2 L -transitive. The following is a sufficiency condition analogous to the characterization of Proposition 1. Corollary 3 (Construction of Transitive Activation Functions (Integrable Variant)) Let μ be a finite Borel measure on R which is equivalent to μ .Let σ ˜ :[0, ∞) →[0, ∞) be a surjective continuous and strictly increasing function satisfying σ( ˜ 0) = 0,let 0 <α < 1. Define the activation function σ( ˜ x) + x : x ≥ 0 σ(x) αx : x< 0. Then σ is Borel bi-measurable, σ(x) > x outside a μ -null-set, it is non-polynomial, and it is continuously differentiable with non-zero derivative for every x< 0. Different function spaces can have different transitive activation functions. By shifting the Leaky-ReLU variant of Example 7 we obtain an L -transitive activation function which fails to be transitive. 
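The recipe of Corollary 3 is straightforward to implement; in the sketch below the ingredients $\tilde\sigma(x) = x$ and $\alpha = 1/2$ are illustrative choices, and the printout checks that $\sigma(x) > x$ everywhere except at the single point $0$, a Lebesgue-null set.

```python
import numpy as np

def make_l1_transitive(sigma_tilde=lambda x: x, alpha=0.5):
    # Construction of Corollary 3: sigma_tilde should be a surjective, strictly
    # increasing map [0, inf) -> [0, inf) with sigma_tilde(0) = 0, and 0 < alpha < 1.
    def sigma(x):
        x = np.asarray(x, dtype=float)
        return np.where(x >= 0, sigma_tilde(x) + x, alpha * x)
    return sigma

sigma = make_l1_transitive()
xs = np.linspace(-5.0, 5.0, 11)
print(np.column_stack([xs, sigma(xs), sigma(xs) > xs]))
```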
Example 8 (Rescaled Leaky-ReLU is L -Transitive) The following variant of the Leaky- ReLU activation function 1.1xx ≥ 0 σ(x) 0.1xx< 0, is a continuous bijection on R with continuous inverse and therefore it is injective and bi- measurable. Since 0 is its only fixed point, then the set {σ(x) >x}={0} is of Lebesgue measure 0, and thus of μ measure 0 since μ and μ are equivalent. Hence, σ is injective, Borel bi-measurable, that σ(x) > x except on a Borel set of μ-measure 0, as required in (2). However, since 0 is a fixed point of σ then it does not meet the requirements of Theorem 5 (i). Our main interest with transitive activation functions is that they allow for refinements of classical universal approximation theorems, where a network’s last few layers satisfy constraints. This is interesting since constraints are common in most practical citations. 4.2 Deep networks with constrained final layers The requirement that the final few layers of a neural network to resemble the given function f is in effect a constraint on the network’s output possibilities. The next result shows that, if a transitive activation function is used, then a deep feed-forward network’s output layers may always be forced to approximately behave like f while maintaining that architecture’s universal approximation property. Moreover, the result holds even when the network’s initial layers are sparsely connected and have breadth less than the requirements of [17, 19]. Note that, the network’s final layers must be fully connected and are still required to satisfy the width constraints of [17]. For a matrix A (resp. vector b) the quantity A (resp. b ) 0 0 denotes the number of non-zero entries in A (resp. b). Corollary 4 (Feed-Forward Networks with Approximately Prescribed Output Behavior) m n Let f : R → R , 0, and let σ be a transitive activation function which is non-affine continuous and differentiable at-least at one point with non-zero derivative at that point. If m n there exists a continuous function f : R → R such that d (f , f )<δ, (7) ucc 0 0 The Universal Approximation Property 447 (F , ) + then there exists f ∈ NN , J, J ,J ∈ N , 0 ≤ J <J , and sets of composable 1 2 1 J 2 affine maps {W } , {W } such that f = W ◦σ •· · ·◦σ •W and the following hold: j j J 1 j =1 j =1 (i) d f,W ◦ σ •· · ·◦ σ • W <δ, ucc J J (ii) d f, f , ucc (iii) max A ≤ m, j =1,...,J 0 d d j j +1 (iv) W : R → R is such that d ≤ m + n + 2 if J <j ≤ J and d = m if j j 1 j 0 ≤ j ≤ J . If J = 0 we make the convention that W ◦ σ •· · ·◦ σ • W (x) = x. 1 J 1 Remark 3 Condition 7, for any δ> 0, whenever f is continuous. We consider an application of Corollary 4 to deep transfer learning. As described in [65], deep transfer learning is the practice of transferring knowledge from a pre-trained model into a neural network architecture which is to be trained on a, possibly new, learning task. Various formalizations of this paradigm are described in [66] and the next example illustrates the commonly used approach, as outlined in [67], where one first learns a feed- m n forward network f : R → R and then uses this map to initialize the final portion of a deep feed-forward network. Here, given a neural network f , typically trained on a different learning task, we seek to find a deep feed-forward network whose final layers are arbitrarily close to f while simultaneously providing an arbitrarily precise approximation to a new learning task. 
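Example 9 below spells out the transfer-learning reading of Corollary 4; the sketch that follows shows the corresponding construction in code. A pretrained map is kept as the network's final block while sparsely connected, untrained layers with identity weight matrices are prepended in front of it; the activation is the Example 7 variant, the pretrained map is a placeholder, and the subsequent training of the prepended part is omitted.

```python
import numpy as np

def sigma(x):
    # A transitive activation (the shifted Leaky-ReLU of Example 7).
    return np.where(x >= 0, 1.1 * x + 0.1, 0.1 * x + 0.1)

def extend_pretrained(pretrained, b, num_new_layers):
    # Keep `pretrained` as the final block and prepend layers x -> sigma(I x + b):
    # each prepended weight matrix is the identity, so it has at most m non-zero
    # entries, matching the sparsity constraint of Corollary 4 (iii).
    def network(x):
        for _ in range(num_new_layers):
            x = sigma(x + b)
        return pretrained(x)
    return network

pretrained = lambda x: np.tanh(x).sum()          # placeholder for a trained model
deep_model = extend_pretrained(pretrained, b=np.full(3, 0.5), num_new_layers=5)
print(deep_model(np.zeros(3)))
```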
Example 9 (Feed-Forward Networks with Pre-Trained Final Layers are Universal) Fix a continuous activation function σ,let N> 0 be given, let (F , ) as in Example 4, let K (F , ) be a non-empty compact subset of R ,and let f ∈ NN . Corollary 4 guarantees that there is a deep feed-forward neural network f = W ◦ σ •· · ·◦ σ • W satisfying J 1 −1 (i) sup f(x) − W ◦ σ •· · ·◦ σ • W (x) <N , J J x∈K 1 −1 (ii) sup f(x) − f (x) <N , x∈K (iii) max A ≤ m, j =1,...,J 0 d d j j +1 (iv) W : R → R is such that d ≤ m + n + 2if J <j ≤ J and d = m if j j 1 j 0 ≤ j ≤ J . The structure imposed on the architecture’s final layers can also be imposed by a set of constraints. The next result shows that, for a feed-forward network with a transitive activation function, the architecture’s output can always be made to satisfy a finite num- ber of compatible constraints. These constraints are described by a finite set of continuous N m n N functionals {F } on C(R , R ) together with a set of thresholds {C } , where each n n n=1 n=1 C > 0. Corollary 5 (Feed-Forward Networks with Constrained Final Layers are Universal) Let σ be a transitive activation function which is non-affine continuous and differentiable at- least at one point with non-zero derivative at that point, let (F , ) denote the feed-forward N m n architecture of Example 4, {F } be a set of continuous functions from C(R , R ) to n=1 N m n [0, ∞), and {C } be a set of positive real numbers. If there exists some f ∈ C(R , R ) n 0 n=1 448 A. Kratsios such that for each n = 1,...,N the following holds F (f )<C , (8) n 0 n (F , ) m n then for every f ∈ C(R , R ) and every 0,there exist f ,f ∈ NN , 1 2 J m diagonal m × m-matrices {A } and b ,...,b ∈ R satisfying: j 1 J j =1 (i) f ◦ f is well-defined, 2 1 (ii) d f, f ◦ f , ucc 2 1 −1 (iii) f ∈ F [[0,C )], 2 n n=1 n (iv) f (x) = σ • (A ·+b ) ◦ ··· ◦ σ • (A x + b ). 1 n n 1 1 Next, we show that transitive activation functions can be used to extend the currently- available approximation rates for shallow feed-forward networks to their deep counterparts. 4.3 Approximation bounds for networks with transitive activation function In [68, 69], it is shown that the set of feed-forward neural networks of breadth N ∈ N , can −1 approximate any function lying in their closed convex hull of at a rate of O (N ).These results do not incorporate the impact of depth into its estimates and the next result builds 1 m on them by incorporating that effect. As in [69], the convex-hull of a subset A ⊆ L (R ) n n is the set co (A) A α f : f ∈ A, α ∈[0, 1], α = 1 and the interior of i i i i i i=1 i=1 co (A) A, denoted int(co (A) A), is the largest open subset thereof. Corollary 6 (Approximation-Bounds for Deep Networks) Let μ be a finite Borel measure m 1 m on R which is equivalent to the Lebesgue measure, F ⊆ L (R ) for which int(co (A) F ) is non-empty and co (A) F ∩ int(co (A) F ) is dense therein. If σ is a continuous non- 1 m polynomial L -transitive activation function, b ∈ R have positive entries, and that (6)is satisfied, then the following hold: 1 m 1. For each f ∈ L (R ) and every n ∈ N,there issome N ∈ N such that the following bound holds d(σ •(·+b)) μ dμM N ∞ , inf α (f ) (x)−f(x) dμ(x) ≤ √ 1+ 2μ(R ) . i i I ,b n m f ∈F , α =1,α ∈[0,1] n i i i x∈R i=1 i=1 d(σ •(·+b) μ # N 2. There exists some κ> 1 such that >κ . In particular, dμ d(σ •(·+b)) μ lim =∞, dμ N →∞ ∞ n n 3. α (f ) : N, n ∈ N,f ∈ F,α ∈[0, 1], α = 1 is dense in i i i i i i=1 i=1 I ,b 1 m L (R ). 
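A small numerical reading of Corollary 6: a convex combination of dictionary elements is pushed through $N$ iterations of the depth operator, and its $L^1$ distance to a target is estimated by a Riemann sum. The dictionary, the convex weights, the bias, the depth $N$, and the target below are all placeholder choices.

```python
import numpy as np

def sigma(x):
    # The rescaled Leaky-ReLU of Example 8, an L^1-transitive activation.
    return np.where(x >= 0, 1.1 * x, 0.1 * x)

def depth_operator_pow(f, b, N):
    # Phi_{I, b}^N : f -> f . (sigma . (x + b))^N, i.e. N untrained deep layers.
    def g(x):
        for _ in range(N):
            x = sigma(x + b)
        return f(x)
    return g

dictionary = [lambda x: np.exp(-np.abs(x)), lambda x: np.exp(-x ** 2)]
alphas = [0.3, 0.7]                                   # convex weights summing to 1
model = lambda x: sum(a * depth_operator_pow(f, 0.5, 3)(x)
                      for a, f in zip(alphas, dictionary))
target = lambda x: np.exp(-np.abs(x - 1.0))

xs = np.linspace(-40.0, 40.0, 80001)
l1_error = np.abs(model(xs) - target(xs)).sum() * (xs[1] - xs[0])
print(l1_error)
```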
Remark 4 Unlike in [69], Corollary 6(i) holds even when the function f does not lie in the closure of co (A) F . This is entirely due to the topological transitivity of the composition operator and is therefore entirely due to the depth present in the network. In particular, I ,b Corollary 6 (iii) implies that universal approximation can be achieved even if a feed-forward networks’ output weights are all constrained to satisfy α = 1and α =[0, 1] and i i i=1 even if all but the architecture’s final two layers are sparsely connected and not trainable. The Universal Approximation Property 449 To date, we have focused on the application and interpretation of Theorem 1. Next, Theorem 3 is used to modify and improve the approximation capabilities of universal approximators on C(R). 4.4 Improving the approximation capabilities of an architecture Most currently available universal approximation results for spaces of continuous functions, provide approximation guarantees for the topology of uniform convergence on compacts. Unfortunately, this is a very local form of approximation and there is no guarantee that the approximation quality holds outside a prespecified bounded set. For example, the sequence 1−(x−n) f (x) e I converges to the constant 0 function, uniformly on compacts n |x−n|≤1 while maintaining the constant error sup f (x) − 0 1. x∈R These approximation guarantees are strengthened by modifying any given universal m n approximator on C(R , R ) to obtain a universal approximator in a smaller space of continuous functions for a much finer topology. We introduce this space as follows. Let be a finite set of non-negative-valued, continuous functions ω from [0, ∞) to m n [0, ∞) for which there is some ω ∈ satisfying ω (·) = 1. Let C (R , R ) be the set 0 0 of all continuous functions whose asymptotic growth-rate is controlled by some ω ∈ ,in m n m n m n the sense that, C (R , R ) C (R , R ),where f ∈ C (R , R ) if f ω ω ω,∞ ω∈ f(x) m n < ∞. Each C (R , R ) is a special case of the weighted spaces studied in [70], ω( x )+1 m n which are Banach spaces when equipped with the norm . Accordingly, C (R , R ) ω,∞ m n is equipped with the finest topology making each C (R , R ) into a subspace. Indeed, such a topology exists by [71, Proposition 2.6]. i m n Example 10 If ={max{t, t }} then f ∈ C (R , R ) if and only if f has asymptot- i>0 m n ically sub-polynomial growth, in the sense that, there is a polynomial p : R → R with f(x) lim < ∞. ( p(x) 1) m n Given an architecture (F , ) on C(R , R ), define its -modification to be the m n 2 architecture (F , ) on C (R , R ) given by F F × × (0, ∞) and where J 2 −|f(·)|( x b ) b J f ,α ,ω ,b ,a ω ( 1) fe + a I + a e I , j j j j j J J <b J b J J j =1 (f ,...,f ) J 1 (F , ) Therefore, the functions in NN are capable of adjusting to the different growth m n rates of functions in C (R , R ) into continuous functions of different growth rates; whereas those in (F , ) need not be. m n Theorem 6 ((F , ) is a Universal Approximator in C (R , R )) If (F , ) is a uni- m n (F , ) versal approximator on C(R , R ) for which each f ∈ NN satisfies the following growth condition sup f(x) e < ∞, (9) x∈R m n then (F , ) is a universal approximator on C (R , R ). 450 A. Kratsios Remark 5 Condition (9) is satisfied by any set of piecewise linear functions. For instance, (F , ) NN is comprised of piecewise linear functions if F is as in Example 4 and σ is the ReLU activation function. The architecture (F , ) often provides a strict improvement over (F , ). 
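The moving-bump phenomenon recalled at the start of this subsection is easy to verify numerically; the unit-height bump below is a concrete stand-in for the sequence $f_n$, and the two printed quantities contrast the error on a fixed compact set with the uniform error over a large interval.

```python
import numpy as np

def bump(n):
    # A unit-height bump supported on [n - 1, n + 1].
    def f(x):
        x = np.asarray(x, dtype=float)
        inside = np.abs(x - n) < 1
        out = np.zeros_like(x)
        out[inside] = np.exp(1.0 - 1.0 / (1.0 - (x[inside] - n) ** 2))
        return out
    return f

for n in [2, 10, 50]:
    f = bump(n)
    err_on_compact = np.abs(f(np.linspace(-3, 3, 4001))).max()      # tends to 0
    err_uniform = np.abs(f(np.linspace(-60, 60, 120001))).max()     # stays equal to 1
    print(n, err_on_compact, err_uniform)
```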
m n Proposition 2 Let (F , ) be a universal approximator on C(R , R ) such that each f ∈ (F , ) NN is either constant or sup m f(x) , and let {exp(−kt) : n ∈ N}. x∈R m n Then (F , ) is not a universal approximator on C (R , R ). 4.5 Representation of approximators on L There is currently no available universal approximation theorem describing a small archi- ∞ m n tecture on L (R , R ) with the UAP. Indeed, even trees are not dense therein since the Lebesgue measures is σ -finite and not finite. A direct consequence of Theorem 4 is the guarantee that a minimal architecture on L (R) exists and admits the following representation. Corollary 7 (Existence and Representation of Minimal Universal Approximator on L (R)) There exists a non-empty set I with pre-order ≤, a subset {x } ⊆ L (R) −{0}, i i∈I triples {(B ,φ )} of linear subspaces B of B(L ), bounded linear isomorphisms i i i i∈I i 1 1 1 : L (R) → B , and bounded linear maps φ : L (R) → L (R) such that: i i i (i) B(L ) = B , i∈I (ii) For every i ≤ j , B ⊆ B , i j (iii) For every i ∈ I , + ◦ φ (x ) is dense B for its subspace topology, i i i n∈N i (iv) The architecture (F , ) defined by F ={x } , | : (x ,...,x ) ρ ◦ ◦ φ ◦ η(x ) (10) i i∈I 1 j i i F i ∞ 1 if x = x ,foreach j ≤ J , has the UAP on L (R),where η : R → L and 1 j ∞ ∞ ρ : B(L ) → L are respectively defined as the linear extensions of the maps n n I : s> 0 1 [0,r) η(r) ρ α δ α f . i f i i −I : s< 0, n [−r,0) i=1 i=1 The contributions of this article are now summarized. 5 Conclusion In this paper, we studied the universal approximation property in a scope applicable to most architectures on most function spaces of practical interest. Our results were used to characterize, construct, and establish the existence of such structures both in many familiar and exotic function spaces. Our results were used to establish the universal approximation capabilities of deep and narrow networks with constraints on their final layers and sparsely connected initial layers. We derived approximation bounds for feed-forward networks with this activation function in terms of depth and height. We showed that the set of activation functions for which these p m m results hold is broader when the underlying functions space is L (R ) than if it is C(R ), The Universal Approximation Property 451 which showed that the choice of activation function depends on the underlying topologi- cal criterion quantifying the UAP. We characterized the activation functions for which these results hold as precisely being the set of injective, continuous, non-affine activation func- tions which are differentiable at at-least one point with non-zero derivative at that point and have no fixed points. We provided a simple direct way to construct these activation func- tions. We showed that a rescaled and shifted Leaky-ReLU activation is an example of such an activation function while the ReLU activation is not. We used our construction result to build a universal approximator in the space of continuous functions between Euclidean spaces, which have controlled growth, equipped with a uniform notion of convergence. This result strengthens the currently available guarantees for feed-forward networks, which state m n that this architecture is universal in C(R , R ) for the weaker uniform convergence on com- pacts topology. Finally, we obtained a representation of a small universal approximator on ∞ m L (R ). 
The results, structures, and methods introduced in this paper provide a flexible and broad toolbox to the machine learning community to build, improve, and understand uni- versal approximators. It is hoped that these tools will help others develop new, theoretically justified architectures for their learning tasks. Funding Open access funding provided by Swiss Federal Institute of Technology Zurich. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommonshorg/licenses/by/4.0/. Appendix A: Proofs of Main Results Theorem 1 is encompassed by the following broader but more technical result. Lemma 2 (Characterization of the Universal Approximation Property) Let X be a function space, E is an infinite-dimensional Frec ´ het space for which there exits some homeomor- phism : X → E, and F , be an architecture on X . Then the following are equivalent: (i) UAP: F , has the UAP, (ii) Decomposition of UAP via Subspaces: There exist subspaces {X } of X such that: i i∈I (a) X is dense in X , i∈I (b) For each i ∈ I , X ) is a separable infinite-dimensional Frec ´ het subspace (F , ) of E and NN ∩ X contains a countable, dense, and linearly- independent subset of X ), (c) For each i ∈ I , there exists a homeomorphism : X → L (R). i i (iii) Decomposition of UAP via Topologically Transitive Dynamics: There exist sub- spaces {X } of X and continuous functions {φ } with φ : X → X such i i∈I i i∈I i i i that: 452 A. Kratsios (a) X is dense in X , i∈I (b) For every pair of non-empty open subsets U, V of X and every i ∈ I,there is i,U,V some N ∈ N such that φ (U ∩ X ) ∩ (V ∩ X ) =∅, i,U,V i i (F , ) n (c) For every i ∈ I,thereis some g ∈ NN ∩ X such that {φ (g )} is a i i i n∈N (F , ) dense subset of NN ∩ X , and in particular, it is a dense subset of X , i i (d) For each i ∈ I , X is homeomorphic to C(R). (iv) Parameterization of UAP on Subspaces: There are triples {(X ,ψ )} of sep- i i i i∈I arable topological spaces X , non-constant continuous functions : X → X , and i i i functions ψ : X → X satisfying the following: i i i (a) (X ) is dense in X , i i i∈I (b) For every i ∈ I and every pair of non-empty open subsets U, V of X ,thereis i,U,V some N ∈ N such that ψ (U ∩ X ) ∩ (V ∩ X ) =∅, i,U,V i i (F , ) (c) For every i ∈ I,thereissome x ∈ NN ∩ X such that { ◦ ψ (x )} i i i i n∈N (F , ) is a dense subset of NN ∩ (X ), and in particular, it is a dense subset i i of (X ). i i Moreover, if X is separable, then I may be taken to be a singleton. Proof of Lemma 2 Suppose that (ii) holds. Since X is dense in X and since i∈I (F , ) (F , ) (F , ) NN ∩X ⊆ NN , then, it is sufficient to show that NN ∩X i i i∈I i∈I is dense in X to conclude that is is dense in X . 
Since each X is a subspace of X then, i i i∈I (F , ) by restriction, each X is a subspace of NN ∩ X with its relative topology. i i i∈I Let X denote the set X equipped with the finest topology making each X into a i i i∈I subspace, such a topology exists by [71, Proposition 2.6]. Since each X is also a subspace of X with its relative topology and since, by definition, that topology is no finer than i∈I (F , ) the topology of X then it is sufficient to show that NN ∩ X is dense in X to i∈I conclude that it is dense in X equipped with its relative topology. i∈I Indeed, by [71, Proposition 2.7] the space X is given by the (topological) quotient of the disjoint union X , in the sense of topological spaces (see [71, Example 3, Section 2.4]), i∈I i under the equivalence relation f ∼ f if f = f in X . Denote the corresponding quotient i j i j map by Q .Sinceasubset U of the quotient topology is open (see [71, Example 2, Section −1 2.4]) if and only if Q [U ] is an open subset of X and since a subset V of X is i∈I i i∈I i open if and only if V ∩ X is open for each i ∈ I in the topology of X then U ⊆ X is open i i −1 (F , ) if and only if Q [U]∩ X is open for each i ∈ I.Since {NN ∩ X } + is dense in i i n∈N X then for every open subset U ⊆ X i i (F , ) (F , ) ∅ = U ∩ NN ∩ X ⊆ U ∩ NN ∩ X . (11) i i i∈I In particular, (11) implies that for every open subset U ⊆ X (F , ) −1 (F , ) ∅ = NN ∩ X ∩ Q [U]∩ X ⊆ U ∩ NN ∩ X . (12) i i i i∈I (F , ) Therefore, NN ∩X is dense in X and therefore it is dense in X equipped i i i∈I i∈I with its relative topology. Hence, F has the UAP and therefore (i) holds. In the next portion of the proof, we denote the (linear algebraic) dimension of any vector space V by dim(V ). Recall, that this is the cardinality of the smallest basis for V . We follow The Universal Approximation Property 453 the Von Neumann convention and, whenever required by the context, we identify the natural number n with the ordinal {1,...,n}. Assume that (i) holds. For the first part of this proof, we would like to show that D contains a linearly independent and dense subset D .Since X is homeomorphic to some infinite-dimensional Frechet ´ space E, then there exists a homeomorphism : X → E (F , ) mapping NN to a dense subset D of E. We denote the metric on E by d . A con- sequence of [72, Theorem 3.1], discussed thereafter by the authors, implies that since E is an infinite dimensional Frechet ´ space then it has a dense Hamel basis, which we denote by {b } . By definition of the Hamel basis of E we may assume that the cardinality of A, a a∈A denoted by Card(A), is equal to dim(E).Next, we use {b } to produce a base of open a a∈A sets for the topology of E of cardinality equal to dim(E). Since E is a metric space, then its topology is generated by the open sets {Ball (b ,q)} ,where Ball (b ,r) {d(b ,x) < r} . Indeed, since Q is dense E a E a a a∈A,r∈(0,∞) in R, then for every a ∈ A and r ∈ (0, ∞) the basic open set Ball (b ,r) can be expressed E a by Ball (b ,r) = Ball (b ,q). Hence, {Ball (b ,q)} generates E a E a E a a∈A,q∈Q∩(0,∞) q∈Q∩(0,r) the topology on E. Moreover, the cardinality the indexing set A × Q is computed by Card(A×Q∩(0, ∞)) = max{Card(A), Card(Q)}= max{dim(E), Card(Q)}= dim(E), since E is infinite and therefore at-least countable. Therefore, {Ball (b ,q)} E a a∈A,q∈Q∩(0,∞) is a base for the topology on E of Cardinality equal to dim(E).Let ω be the smallest ordinal with Card(ω) = dim(E) = Card(A × Q ∩ (0, ∞)). 
In particular, there exists a bijection F : ω → A × Q ∩ (0, ∞) which allows us to canonically order the open sets {Ball (F (j ) ,F (j) )} , where for any j< ω we denote F(j) ∈ A and F(j) ∈ E 1 2 j ≤ω 1 2 Q ∩ (0, ∞). We construct D by transfinite induction using ω. Indeed since 1 <ω,thensince D is dense in E and {Ball (F (j ) ,F (j) )} defines a base for the topology of E, then there exists some E 1 2 j ≤ω U ∈{Ball (F (j ) ,F (j) )} containing some d ∈ D. For the inductive step, 1 E 1 2 j ≤ω 1 suppose that for all i ≤ j for some j< ω, we have constructed a linearly inde- pendent set {d } with d ∈{Ball (F (i) ,F (i) )} for every i ≤ j.Since j< i i<j i E 1 2 ω and {d } contains Card(j ) and {d } is a Hamel basis of span({x } ) then i i<j i i<j i i<j dim span({x } ) < dim(E). Hence, span({x } ) has empty interior and therefore i i<j i i<j it cannot contain any {Ball (F (j ) ,F (j) )} . In particular, there is an open subset E 1 2 j ≤ω V ⊆ Ball (F (j ) ,F (j) ) − span({x } ) and since D was assumed to be dense in E then E 1 2 i i<j theremustbesome d ∈ V ⊆ Ball (F (j ) ,F (j) ). This completes the inductive step and j E 1 2 therefore there is a linearly independent and dense subset D {d } contained in D of j j ≤ω cardinality Card(ω) = dim(E). Next, let I be the set of all countable sequences of distinct elements in ω.For every i ∈ I , let E span (d ),where A denotes the closure of a subset A ⊆ E in the topology of i j j ∈i E. Then, each E is a linear subspace of E with countable basis {d } . Since any Frechet ´ i j j ∈i space with countable basis is separable and therefore each E is a separable Frechet ´ space. Moreover, by construction, D ⊆ E ⊆ E (13) i∈I and therefore E is dense in E since D is dense in E.Since is a homeomorphism i∈I −1 then : E → X is a continuous surjection, and since the image of a dense set under any −1 continuous map is dense in the range of that map then (D ) is dense in X . Moreover, 454 A. Kratsios using the fact that inverse images commute with unions and the fact that that is a bijection, we compute that −1  −1 −1 (D ) ⊆ E = [E ] . (14) i i i∈I i∈I (F , ) Since as a bijection and D was defined as the image of NN in E under ,then (F , )  −1 D ⊂ NN and D is dense in X . In particular, (14) implies that [E ]⊆ i∈I (F , ) −1 (F , ) −1 (NN ∩ [E ]) and therefore (NN ∩ [E ]) is dense in X .In i i i∈I i∈I −1 −1 particular, [E ] is dense in X , and for each i ∈ I,ifwedefine X [E ] i i i i∈I then we obtain (ii.a). Since is a homeomorphism then it preserves dense sets and in particular since {d } i j ∈i −1 is a countable, dense, and linearly independent subset of [{d } ] then it is a dense j j ∈i countable subset of X . Hence, each X is separable. i i This gives (ii.b). Lastly, by [73] any two separable infinite-dimensional Frechet ´ space are homeomorphic. In particular, since L (R) is a separable Hilbert space is a separable Frechet space. Therefore, for each i ∈ I , there is a homeomorphism : E → L (R). i i In particular, : X → L (R) must be a homeomorphism and therefore (ii.b) holds. i i Therefore, (i) implies (ii). Suppose that (ii) holds. Then, (iii.a) holds by (ii.a). For each i ∈ I,let {d } be a n,i n∈N (F , ) countable dense subset of X ∩NN for which ({d } ) is a linearly independent, i n,i n∈N and let E = span({d } ).Let D {d } and D (D). 
Thus, for every i n,i n∈N n,i n∈N i∈I i ∈ I , D ∩ E is a countably infinite linearly independent and dense subset of E then by i i [74, Theorem 8.24] there exists a continuous linear operator T : D∩E → D∩E satisfying i i i T (d ) = d , n,i n+1,i for each n ∈ N and each i ∈ I . In particular, T (d ) is dense in E . For each i ∈ I , 0,i i −1 −1 define φ ◦ T ◦ and g (d ) and observe that for every n ∈ N i i i 0,i n −1 −1 −1 φ (g ) = ( ◦ T ◦ ) ◦· · ·◦ ( ◦ T ◦ )( (d )) i i i i,0 n−times −1 n ◦ T (d ). (15) 0,i Since {T (d )} is dense in E and is a homeomorphism from X to E then 0,i n∈N i i i −1 n n {T (d )} = φ (g ) 0,i n∈N i i i n∈N 2 2 is dense in X . Thus, (iii.c) holds. For any i ∈ I,definethemap ψ : L (R) → L (R) by i i −1 ψ ( ) ◦ φ ◦ ( ), i i i i and define the vector g ˜ ∈ L (R) by g ˜ (g ).Since and are homeomorphisms i i i i i and since φ is continuous then ψ is well-defined and continuous. Moreover, analogously i i n 2 2 to (15) we compute that ψ (g ˜ ) is dense in L (R).Since L (R) is a complete separa- n∈N ble metric space with no isolated points and ψ is continuous self-map of L (R) for which 2 n 2 thereisavector g ˜ ∈ L (R) such that the set of iterates {ψ (g ˜ )} is dense in L (R) then i i n∈N Birkhoff Transitivity Theorem, see the formulation of [74, Theorem 1.16], implies that for ˜ ˜ every pair of non-empty open subsets U, V ⊆ L (R) there is some n satisfying ˜ ˜ U,V ˜ ˜ U ,V ˜ ˜ φ (U) ∩ V =∅. (16) The Universal Approximation Property 455 Since is a homeomorphism, then [74, Proposition 1.13] and (16) imply that for every pair of non-empty open subsets U ,V ⊆ X there exists some n   ∈ N satisfying U ,V U ,V φ (U ) ∩ V =∅. (17) Since X is equipped with the subspace topology then every non-empty open subset U ⊆ X is of the form U ∩ X for some non-empty open subset U ⊆ X . Therefore, i i (17) implies (iii.b). Since both L (R) and C(R) are separable infinite-dimensional Frechet ´ spaces then the [73, Anderson-Kadec Theorem] implies that there exists a homeomor- phism  : L (R) → C(R). Therefore, for each i ∈ I ,  ◦ : X → C(R) is a homeomorphism and thus (ii.c) implies (iii.d). Suppose that (iii) holds. For every i ∈ I,set X X ,let 1 be the identity map i i i X on X ,set ψ φ ,and set x g . Therefore, (iv) holds. i i i i i (F , ) Suppose that (iv) holds. By (iv.c), for each i ∈ I , NN ∩ X is dense in X . i i Therefore, (F , ) (F , ) X = NN ∩ X ⊆ NN ∩ X ⊆ X . (18) i i i i∈I i∈I i∈I By (iv.a) since X is dense in X therefore its closure is X and therefore the smallest, i∈I and thus only, closed set containing X is X itself. Therefore, by (18) the smallest set i∈I (F , ) (F , ) containing NN ∩ X must be X . Therefore, NN is dense in X and (i) i∈I holds. This concludes the proof. Proof of Theorem 2 By the [73, Anderson-Kadec Theorem] there is no loss of general- m n ity in assuming that m = n = 1, since C(R , R ) and C(R) are homeomorphic. Let (C(R)).By(5), X is dense in X and since density is transitive, then it is i∈I (F , ) enough to show that (NN ) is dense in X to conclude that it is dense in X . i∈I Since each is continuous, then, the topology on X is no finer than the finest topology on (C(R)) making each continuous and by [71, Proposition 2.6] such a topol- i i i∈I ogy exists. Let X denote (C(R)) equipped with the finest topology making each i∈I (C(R)) into a subspace. By construction, if U ⊆ X is open then it is open in X and (F , ) therefore if (NN ) intersects each non-empty open subset of X then it must i∈I (F , ) do the same for X . 
Hence, it is enough to show that (NN ) is dense in X i∈I (F , ) to conclude that it is dense in X and therefore, (NN ) is dense in X . i∈I We proceed similarly to the proof of Lemma 2. Indeed, by [71, Proposition 2.7] the space X is given by the (topological) quotient of the disjoint union (C(R)), in the sense i∈I i of topological spaces (see [71, Example 3, Section 2.4]), under the equivalence relation f ∼ f if f = f in X . Denote the corresponding quotient map by Q . Since a subset U i j i j −1 of the quotient topology is open (see [71, Example 2, Section 2.4]) if and only if Q [U ] is an open subset of (C(R)) and since a subset V of (C(R)) is open if and only i∈I i i∈I i if V ∩ (C(R)) is open for each i ∈ I in the topology of (C(R)) then U ⊆ X is open if i i −1 (F , ) and only if Q [U ]∩ (C(R)) is open for each i ∈ I .Since {NN ∩ (C(R))} + i i n∈N is dense in (C(R)) then for every open subset U ⊆ (C(R)) i i (F , )  (F , ) ∅ = U ∩ NN ∩ (C(R)) ⊆ U ∩ NN ∩ (C(R)). (19) i i i∈I In particular, (19) implies that for every open subset U ⊆ X (F , ) −1 (F , ) ∅ = NN ∩ (C(R)) ∩ Q [U]∩ (C(R)) ⊆ U ∩ NN ∩ (C(R)). i  i i i∈I (20) 456 A. Kratsios (F , ) Therefore, NN ∩ (C(R)) is dense in X and therefore it is dense in i∈I (C(R)) equipped with its relative topology. Hence, (F , ) has the UAP on X i∈I and therefore it has the UAP on X itself. Proof of Theorem 3 Let σ be a continuous and non-polynomial activation function. Then [61] implies that the architecture F , , as defined in Example 4, is a universal 0 0 approximator on C(R). By Theorem 1, since F , has the UAP on X and since X is homeomorphic to an infinite-dimensional Frechet ´ space then there are homeomorphisms { } from C(R) onto i i∈I a family of subspaces {X } of X such that X is dense. Fix > 0and f ∈ X . i i∈I i i∈I Since X is dense in X there exists some i ∈ I and some f ∈ X such that i i i i∈I d (f, f )< . (21) X i Since is a homeomorphism then it must map dense sets to dense sets. Since F 0, 0 (F 0, 0) has the UAP on C(R) then NN is dense in C(R) and therefore, for each i ∈ I , (F 0, 0) (F 0, 0) (NN ) is dense in X . Hence, there exists some g ˜ ∈ (NN ) such that i i  i d (f , g ˜ )< .Since is a homeomorphism, it is a bijection, therefore there exists a X i  i (F 0, 0) unique g ∈ NN with (g ) =˜ g . Hence, the triangle inequality and (21)imply that d (f, (g )) ≤ d (f, f ) + d (f , (g )) <. (22) X i  X i X i i This yields the first inequality in the Theorem’s statement. (F , ) −1 By Theorem 1 since, for each i ∈ I , NN ∩ X is dense in X and since is a i i −1 (F , ) homeomorphism on X then NN ∩ X is dense in C(R). In particular, there i i −1 F , ( ) exits some f ∈ NN ∩ X satisfying d g (x), f (x) <. (23) ucc (F , ) −1 Since is a bijection then there exists a unique f ∈ NN such that (f ) = f . Therefore, (23) and the triangle inequality imply that −1 d g (x), (f )(x) <. ucc Therefore the conclusion holds. Remark 6 By the [73, Anderson-Kadec Theorem], since both L (R) and C(R) are separa- ble infinite-dimensional Frechet ´ spaces then there exists a homeomorphism : L (R) → C(R). Therefore, the proof of Corollary 3 holds (mutatis mutandis) with each replaced −1 2 by and with C(R) in place of L (R). The proof of the next result relies on some aspects of inductive limits of Banach spaces. 
Briefly, an inductive limit of Banach spaces is a locally convex space B for which there exists a pre-ordered set I , a set of Banach sub-spaces {B } with B ⊆ B if i ≤ j.The i i∈I i j inductive limit of this direct system is the subset B equipped with the finest topology i∈I which simultaneously makes each B into a subspace and makes B into a locally- i i i∈I convex spaces. Spaces constructed in this way are called ultrabornological spaces and more details about them can be found in [75, Chapter 6]. The Universal Approximation Property 457 Proof of Theorem 4 Since B(X ) and B(X) are both infinite-dimensional Banach spaces, then they are infinite-dimensional ultrabornological space, in the sense of [75, Defini- tion 6.1.1]. Since X is separable, then as observed in [33], B(X) is separable. Therefore, [75, Theorem 6.5.8] applies; hence, there exists a directed set I with pre-order ≤, a collec- tion of Banach subspaces {B } satisfying (i) and (ii), and a collection of continuous linear i i∈I isomorphisms : B(X) → B . Furthermore, the topology on B is coarser than the induc- i i tive limit topology lim B . Since each B(X) and B are Banach spaces, and in particular i i i∈I − → normed linear spaces, then by the results of [76, Section 2.7] the maps are bounded linear isomorphisms. Let i ∈ I ,and fixany x ∈ X −{0 } then since δ : X → B(X) is base-point preserving i X then δ = 0 and therefore there exists a linearly independent subset B of B(X) containing x i δ .Since B(X) is separable then B is countably infinite and therefore [74, Theorem 8.24] x i n X there exists a bounded linear map φ : B(X) → B(X) such that {φ (δ )} + is a dense i n∈N i x subset of B(X). Since is a continuous linear isomorphisms then it is in particular a surjective continu- ous map from B(X) onto B . Since the image of a dense set under a continuous surjection is itself dense then ◦ φ (δ ) is a dense subset of B . Moreover, this holds for each i x + i i i n∈N i ∈ I . By definition, the topology on lim B is at-least as fine as the Banach space topology − →i∈I on B(X ), since each B is a linear subspace of B(X ). Moreover, the topology on lim B 0 i 0 i i∈I − → is no finer than the finest topology on B making each B into a topological space (but i i i∈I not requiring that B be locally-convex), which exists by [77, Proposition 6]. Denote i∈I this latter space by B . Therefore, if ◦ φ (δ ) , (24) i x i i i∈I ; n∈N is dense in B then it is dense in lim B and in B(X ). Hence, we show that (24)isdense i 0 i∈I − → ˜ ˜ in B . That is, it is enough to show that every open subset of B contains an element of (24). By [71, Proposition 2.7] the space B is given by the topological quotient of the disjoint union  B , in the sense of topological spaces (see [71, Example 3, Section 2.4]), under i∈I i the equivalence relation x ∼ x for any i ≤ j if x = x in B . Denote the corresponding i j i j j quotient map by Q .Sinceasubset U of the quotient topology is open (see [71,Example −1 2, Section 2.4]) if and only if Q [U ] is an open subset of  B and since a subset V i∈I i of  B is open if and only if V ∩ B is open for each i ∈ I in the topology of B then i∈I i i i −1 U ⊆ B is open if and only if Q [U]∩ B is open for each i ∈ I.Since { ◦ φ (x )} + i i i n∈N is dense in B then for every open subset U ⊆ B i i n  n ∅ = U ∩{ ◦ φ (x )} + ⊆ U ∩ ◦ φ (δ ) . (25) i i i x n∈N i i i i∈I ; n∈N In particular, (25) implies that for every open subset U ⊆ B n −1 n ∅ ={ ◦ φ (x )} + ∩ Q [U]∩ B ⊆ ◦ φ (δ ) ∩ U . 
(26) i i n∈N i i x i i i i∈I ; n∈N Therefore, (24)isdense in B and, in particular, it is dense in B(X ). Since X was barycentric, then there exists a continuous linear map ρ : B(X ) → X 0 0 0 X 0 which is a left-inverse of δ . Thus, for every f ∈ X , ρ ◦ δ = f and therefore ρ is a f 458 A. Kratsios continuous surjection. Since the image of a dense set under a continuous surjection is dense and since (24) is dense then ρ ◦ ◦ φ (δ ) , (27) i x i i i∈I ; n∈N is a dense subset of X .Since X has assumed to be dense in X and since density is transitive 0 0 then (27)isdense in X . This concludes the main portion of the proof. The final remark follows from the fact that if X = X then the identity map 1 : X → 0 X X is an isometry and therefore the universal property of B(X) described in Theorem [32, Theorem 3.6] implies that 1 uniquely extends to a bounded linear isomorphism L between B(X) and B(X ) satisfying X X X −1 X X −1 X 0 0 0 L ◦ δ = δ ◦ 1 = δ and L ◦ δ = δ ◦ 1 = δ . Hence L must be the identity on B(X). Appendix B: Proof of Applications of Main Results Lemma 3 Fix some b ∈ R , and let σ : R → R be a continuous activation function. m n Then is a well-defined and continuous linear map from C(R , R ) to itself and the A,b following are equivalent: m n + (i) For each δ> 0, > 0 and each f, g ∈ C(R , R ) thereissome N ∈ N such U,V that U,V ˜ ˜ (g) ˜ : d (g, ˜ g) < δ ∩ f : d (f,f) <  =∅, ucc ucc (ii) σ is injective, A is of full-rank, and for every compact subset K ⊆[a, b] there is some N ∈ N such that S (K) ∩ K =∅, where S(x) = σ • (Ax + b). If A is the m × m-identity matrix I and b > 0 for i = 1,...,m then (i) and (ii) are m i equivalent to (iii) σ is injective and has no fixed-points. If A is the m × m-identity matrix I and b > 0 for i = 1,...,m then (iii) is equivalent to m i (iv) Either σ(x) > x or σ(x) < x for every x ∈ R. Proof Lemma 3 By [37, Theorem 46.8] the topology of uniform convergence on compacts m n is the compact-open topology on C(R , R ) and by [37, Theorem 46.11] composition is a continuous operation in the compact-open topology. Therefore, is well-defined and A,b continuous map. Its linearity follows from the fact that (af + g) = (af ) ◦ S = a(f ◦ S) + g ◦ S. A,b g Since the topology of uniform convergence on compacts is a metric topology, with met- ric d ,then ucc m n U : f ∈ C(R , R ),  > 0 defines a base for this topology, where U f, f, m n { } g ∈ C(R , R ) : d (f, g) <  . Therefore, Lemma 3 (i) is equivalent to the statement: ucc m n + for each pair of non-empty open subsets U, V ∈ C(R , R ) there is some N ∈ N such U,V U,V that (U ) ∩ V =∅. Without loss of generality, we prove this formulation instead. I,b The Universal Approximation Property 459 Next, by [78, Corollary 4.1] satisfies Theorem 1 (ii.b) if and only if S(x) σ(Ax + A,b m + b) is injective and for every compact subset K ⊆ R there exists some N ∈ N such that S (K) ∩ K =∅. (28) Therefore, A must be injective which is only possible if A is of full-rank. This gives the equivalence between (i) and (ii). We consider the equivalence between (ii) and (iii) in the case where A is the identity matrix and b > 0for i = 1,...,m.Since S(x) = (σ (x + b ), . . . , σ (x + b )) it is i 1 m sufficient to verify condition (28) in the case where m = 1. Since b > 0for1,...,m then it is clear that S is injective and has no fixed points if and only if σ is injective and has no fixed points. We show that S is injective and has no fixed points if and only if (ii) holds. 
Indeed, note that, since $b_i>0$ for $i=1,\dots,m$, the map $S$ has no fixed points if and only if $\sigma$ has no fixed points. From here, we proceed analogously to the proof of [79, Lemma 4.1]. If $S$ has a fixed point $x$, then for every $N\in\mathbb{N}_+$ the set $S^{N}(\{x\})=\{x\}$ is a non-empty compact subset of $\mathbb{R}$; therefore, (28) cannot hold. Conversely, suppose that $S$ has no fixed points. The intermediate value theorem and the fact that $S$ has no fixed points imply that either $S(x)>x$ for every $x$, or $S(x)<x$ for every $x$. Mutatis mutandis, we proceed with the first case. Since $\sigma$ is injective and $S$ has no fixed points, $S$ must be a strictly increasing function; thus $S([a,b'])=[S(a),S(b')]$ for every $a<b'$.

Let $K$ be a non-empty compact subset of $\mathbb{R}$. By the Heine-Borel theorem, $K$ is closed and bounded, thus it is contained in some $[a,b']$ with $a<b'$. Therefore, it is sufficient to show the result in the case where $K=[a,b']$. Since $S$ is increasing, the sequence $\{S^{n}(a)\}_{n\in\mathbb{N}}$ satisfies $S^{n}(a)<S^{n+1}(a)$ for every $n\in\mathbb{N}$. If this sequence were bounded, there would exist some $a_0\in\mathbb{R}$ such that $a_0=\lim_{n\to\infty}S^{n}(a)$. Therefore, by the continuity of $S$, we would find that

$a_0=\lim_{n\to\infty}S^{n}(a)=\lim_{n\to\infty}S^{n+1}(a)=\lim_{n\to\infty}S(S^{n}(a))=S\Big(\lim_{n\to\infty}S^{n}(a)\Big)=S(a_0);$

but since $S$ has no fixed points, such an $a_0$ cannot exist, since otherwise $a_0=S(a_0)$. Therefore, $\{S^{n}(a)\}_{n\in\mathbb{N}}$ is unbounded. Hence, for every $a<b'$ there exists some $N_{[a,b']}\in\mathbb{N}$ such that

$S^{N_{[a,b']}}([a,b'])\cap[a,b']=\varnothing.$

Thus, (ii) and (iii) are equivalent when $A=I_m$.

Next, assume that any of (i) to (iii) hold, that $\mathcal{X}$ is a non-empty open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$, and that the architecture $\mathcal{F}$ has the UAP on $\mathcal{X}$. Then, for any other non-empty open subset $U\subseteq C(\mathbb{R}^m,\mathbb{R}^n)$, there exists some $N_{\mathcal{X},U}\in\mathbb{N}$ such that

$\Phi_{I_m,b}^{N_{\mathcal{X},U}}[\mathcal{X}]\cap U\neq\varnothing.$   (29)

Since $\Phi_{I_m,b}$ is continuous, so is $\Phi_{I_m,b}^{N_{\mathcal{X},U}}$, and therefore $(\Phi_{I_m,b}^{N_{\mathcal{X},U}})^{-1}[U]$ is a non-empty open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$. Since the finite intersection of open sets is again open, we have that

$\Phi_{I_m,b}^{N_{\mathcal{X},U}}[\mathcal{X}]\cap U\neq\varnothing\quad\text{if and only if}\quad\mathcal{X}\cap\big(\Phi_{I_m,b}^{N_{\mathcal{X},U}}\big)^{-1}[U]\neq\varnothing.$   (30)

This implies that $\mathcal{X}\cap(\Phi_{I_m,b}^{N_{\mathcal{X},U}})^{-1}[U]$ is a non-empty open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$ contained in $\mathcal{X}$. Since $\mathcal{F}$ has the UAP on $\mathcal{X}$, there exists some $f\in\mathcal{NN}^{\mathcal{F}}\cap\big[\mathcal{X}\cap(\Phi_{I_m,b}^{N_{\mathcal{X},U}})^{-1}[U]\big]$. Thus, $\Phi_{I_m,b}^{N_{\mathcal{X},U}}(f)\in U$ and, by definition, $\Phi_{I_m,b}^{N_{\mathcal{X},U}}(f)\in\mathcal{NN}^{\sigma:\mathrm{deep}}$.

Thus, for each $U$ in

$\big\{\{g\in C(\mathbb{R}^m,\mathbb{R}^n):d_{ucc}(g,f)<\epsilon\}\big\}_{f\in C(\mathbb{R}^m,\mathbb{R}^n),\;\epsilon>0},$   (31)

there exist some $N_U\in\mathbb{N}_+$ and some $f_U\in\mathcal{NN}^{\mathcal{F}}$ such that $\Phi_{I_m,b}^{N_U}(f_U)\in U$. In particular, since (31) is a base for the topology on $C(\mathbb{R}^m,\mathbb{R}^n)$, and since the intersection of open sets is again open, every non-empty open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$ contains an element of (31) which, in turn, contains an element of the form $\Phi_{I_m,b}^{N_U}(f_U)$. Thus, $\mathcal{NN}^{\sigma:\mathrm{deep}}\cap U\neq\varnothing$ for every non-empty open subset $U$. Hence, $\mathcal{NN}^{\sigma:\mathrm{deep}}$ has the UAP on $C(\mathbb{R}^m,\mathbb{R}^n)$.

Proof of Theorem 5 The equivalence between (i), (ii), and (iv) follows from Lemma 3. The equivalence between (iii) and (iv) follows from the formulation of Birkhoff's transitivity theorem described in [74, Theorem 2.19].
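The fixed-point and escape conditions appearing in Lemma 3 and Theorem 5 are straightforward to probe numerically. The following Python sketch is purely illustrative: it assumes a shifted and rescaled Leaky-ReLU of the form $\sigma(x)=\max(\alpha_1 x,x)+\alpha_2$ with $0<\alpha_1<1$ and $\alpha_2>0$ (the constants, the grid, and the interval are arbitrary choices, not values taken from the paper), checks condition (iv) on a finite grid, and iterates $S(x)=\sigma(x+b)$ until the escape condition (28) holds for $K=[a,b']$.

```python
# Illustrative sanity check of Lemma 3 (iv) and of the escape condition (28);
# the activation constants, grid, and interval are hypothetical choices.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shifted_leaky_relu(x, alpha1=0.5, alpha2=0.1):
    # assumed form: max(alpha1 * x, x) + alpha2 with 0 < alpha1 < 1 and alpha2 > 0
    return np.maximum(alpha1 * x, x) + alpha2

def violates_iv(sigma, grid):
    """True if sigma(x) - x vanishes or changes sign somewhere on the grid,
    i.e. condition (iv) of Lemma 3 fails on this grid."""
    d = sigma(grid) - grid
    return bool(np.any(d == 0.0) or (np.any(d > 0) and np.any(d < 0)))

grid = np.linspace(-10.0, 10.0, 20001)
print("ReLU violates (iv):", violates_iv(relu, grid))                # True: ReLU(x) = x for x >= 0
print("shifted Leaky-ReLU violates (iv):", violates_iv(shifted_leaky_relu, grid))  # False on this grid

def escape_time(sigma, a, b_right, b_shift=0.1, max_iter=10_000):
    """Smallest N with S^N([a, b_right]) ∩ [a, b_right] = ∅, where S(x) = sigma(x + b_shift).
    Uses that S is increasing, so S^N([a, b_right]) = [S^N(a), S^N(b_right)]."""
    lo, hi = a, b_right
    for n in range(1, max_iter + 1):
        lo, hi = sigma(lo + b_shift), sigma(hi + b_shift)
        if lo > b_right:       # the iterated interval now lies strictly to the right of K
            return n
    return None

print("escape time for K = [-2, 2]:", escape_time(shifted_leaky_relu, -2.0, 2.0))
```

On this grid, the ReLU activation fails condition (iv) because $\mathrm{ReLU}(x)=x$ for every $x\ge0$, whereas the shifted Leaky-ReLU satisfies $\sigma(x)>x$ everywhere, so the iterated interval eventually leaves any fixed compact set, in line with (28).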
Proof of Proposition 1 Since $\alpha_1<1$, $\sigma(x)>x$ for every $x<0$. Since $0<\alpha_2$, $\sigma(0)=\alpha_2>0$. Lastly, since $\tilde\sigma$ is monotone increasing, for every $x>0$ we have that $\sigma(x)\ge x+\alpha_2>x$. Therefore, $\sigma$ cannot have a fixed point. Moreover, since $\tilde\sigma$ is strictly increasing, $\sigma$ is injective: if $x<y$ then $\sigma(x)<\sigma(y)$, and therefore $\sigma(x)\neq\sigma(y)$ whenever $x\neq y$. Moreover, since the sum of continuous functions is again continuous, $\sigma$ is continuous.

Since $\alpha_1 x+\alpha_2$ is affine, it is continuously differentiable; thus $\sigma$ is continuously differentiable at every $x<0$. Lastly, choosing $\alpha_1$ not equal to $\tilde\sigma'(0)$ ensures that $\sigma$ is not differentiable at $0$, and therefore it cannot be a polynomial; in particular, it cannot be affine.

For convenience, we denote the collection of set-functions from $\mathbb{R}^m$ to $\mathbb{R}^n$ by $[\mathbb{R}^m,\mathbb{R}^n]$.

Proof of Corollary 4 Since $d_{ucc}$ is a metric on $[\mathbb{R}^m,\mathbb{R}^n]$ and since $C(\mathbb{R}^m,\mathbb{R}^n)\subseteq[\mathbb{R}^m,\mathbb{R}^n]$, the map $F:C(\mathbb{R}^m,\mathbb{R}^n)\to[0,\infty)$ defined by $F(g)\triangleq d_{ucc}(f_0,g)$ is continuous. Therefore, the set $F^{-1}[(-\infty,\delta)]$ is an open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$; in particular, (7) guarantees that it is non-empty. Since $\sigma$ is non-affine and continuously differentiable at at least one point, with non-zero derivative at that point, [17, Theorem 3.2] applies; whence the set $\mathcal{X}$ of continuous functions $h:\mathbb{R}^m\to\mathbb{R}^n$ with representation

$h(x)=W_J\circ\sigma\bullet\cdots\circ\sigma\bullet W_1,$

where the $W_j:\mathbb{R}^{d_j}\to\mathbb{R}^{d_{j+1}}$, $j=1,\dots,J$, are affine, $d_1=m$, $d_{J+1}=n$, and $d_j\le m+n+2$ for the remaining $j$, is dense in $C(\mathbb{R}^m,\mathbb{R}^n)$. Therefore, since $F^{-1}[(-\infty,\delta)]$ is an open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$, $\mathcal{X}\cap F^{-1}[(-\infty,\delta)]$ is dense in $F^{-1}[(-\infty,\delta)]$.

Fix some $b\in\mathbb{R}^m$ with $b_i>0$ for $i=1,\dots,m$. Since $\sigma$ is continuous, injective, and has no fixed points, applying Lemma 3 implies that $\tilde{\mathcal{X}}\triangleq\{\Phi_{I_m,b}^{N}(f):f\in F^{-1}[(-\infty,\delta)]\cap\mathcal{X},\ N\in\mathbb{N}_+\}$ is a dense subset of $C(\mathbb{R}^m,\mathbb{R}^n)$. This gives (i). Moreover, by construction, every $g\in\tilde{\mathcal{X}}$ admits a representation satisfying (iii) and (iv). Furthermore, since $W_J\circ\sigma\bullet\cdots\circ\sigma\bullet W_1\in\mathcal{X}$, by construction there exists some $g\in\tilde{\mathcal{X}}$ for which $d_{ucc}(W_J\circ\sigma\bullet\cdots\circ\sigma\bullet W_1,g)<\delta$; thus (ii) holds.

Proof of Corollary 5 Since each $F_k$, for $k=1,\dots,N$, is a continuous function from $C(\mathbb{R}^m,\mathbb{R}^n)$ to $[0,\infty]$, each $F_k^{-1}[[0,C_k)]$ is an open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$. Since the finite intersection of open sets is itself open, $U\triangleq\bigcap_{k=1}^{N}F_k^{-1}[[0,C_k)]$ is an open subset of $C(\mathbb{R}^m,\mathbb{R}^n)$. Since there exists some $f\in C(\mathbb{R}^m,\mathbb{R}^n)$ satisfying (8), $U$ is non-empty. Since $\mathcal{F}$ has the UAP on $C(\mathbb{R}^m,\mathbb{R}^n)$, $\mathcal{NN}^{\mathcal{F}}\cap U$ is dense in $U$. Fix $b\in\mathbb{R}^m$ with $b_i>0$ for $i=1,\dots,m$ and set $A=I_m$. Since $\sigma$ is a transitive activation function, Corollary 1 applies, and therefore the set $\{\Phi_{I_m,b}^{N}(f):f\in\mathcal{NN}^{\mathcal{F}}\cap U,\ N\in\mathbb{N}\}$ is dense in $C(\mathbb{R}^m,\mathbb{R}^n)$. Therefore (i)-(iv) hold.

Proof of Corollary 2 Let $S(x)\triangleq\sigma\bullet(x+b)$ and let $B\triangleq\{x\in\mathbb{R}:\sigma(x)>x\}$. By hypothesis, $B$ is Borel and $\mu(B)>0$. For each $i=1,\dots,m$ we compute $\sigma(x_i+b_i)>x_i+b_i\ge x_i$. Therefore, for $\mu$-a.e. $x$, every $N\in\mathbb{N}$, and each $i=1,\dots,m$, $S^{N}(x)_i\ge x_i+Nb_i$. Since $b_i>0$, $\lim_{N\to\infty}S^{N}(x)_i=\infty$. Therefore, the condition [80, Corollary 1.3 (C2)] is met, and, by the discussion following that result on [80, page 127], condition [80, Corollary 1.3 (C1)] holds; that is, for every pair of non-empty open subsets $U,V\subseteq L^1_\mu(\mathbb{R}^m,\mathbb{R}^n)$ there exists some $N_{U,V}\in\mathbb{N}$ such that

$\Phi_{I_m,b}^{N_{U,V}}(U)\cap V\neq\varnothing.$   (32)

By Lemma 1, the map $\Phi_{I_m,b}$, and therefore the map $\Phi_{I_m,b}^{N_{U,V}}$, is continuous. Thus, $(\Phi_{I_m,b}^{N_{U,V}})^{-1}[V]$ is a non-empty open subset of $L^1_\mu(\mathbb{R}^m,\mathbb{R}^n)$, and therefore $U\cap(\Phi_{I_m,b}^{N_{U,V}})^{-1}[V]$ is a non-empty open subset of $U$. Taking $U=\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m,\mathbb{R}^n)}(g,\delta)$ and $V=\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m,\mathbb{R}^n)}(f,\epsilon)$, we obtain the conclusion.
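Corollaries 4 and 5 rest on the observation that applying $\Phi_{I_m,b}^{N}$ to a feed-forward network $f$ amounts to pre-composing $f$ with $N$ additional layers $x\mapsto\sigma\bullet(x+b)$, so that the resulting function is again a deep feed-forward network. A minimal NumPy sketch of this construction follows; the network weights, the shift $b$, and the activation constants are illustrative placeholders rather than quantities from the paper.

```python
# Illustrative realization of Phi^N_{I,b}(f) = f ∘ S^N as "prepending N layers".
import numpy as np

def sigma(x, alpha1=0.5, alpha2=0.1):
    # assumed injective, fixed-point-free activation (shifted Leaky-ReLU form)
    return np.maximum(alpha1 * x, x) + alpha2

def feedforward(x, weights, biases):
    """A generic deep feed-forward network W_J ∘ sigma • ... ∘ sigma • W_1."""
    h = x
    for W, c in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + c)
    return weights[-1] @ h + biases[-1]

def apply_Phi_N(f, b, N):
    """Return the map Phi^N_{I,b}(f) = f ∘ S^N, where S(x) = sigma(x + b).
    Pre-composing with S a total of N times is the same as prepending N layers
    whose weight matrix is the identity and whose bias vector is b."""
    def g(x):
        h = x
        for _ in range(N):
            h = sigma(h + b)
        return f(h)
    return g

# toy example: a random two-layer network R^3 -> R^2, composed with Phi^5
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
f = lambda x: feedforward(x, weights, biases)
g = apply_Phi_N(f, b=np.full(3, 0.1), N=5)
print(g(np.zeros(3)))   # g is itself a deep feed-forward network with 5 extra hidden layers
```

The design choice mirrors the text: the dense family produced by the corollaries is obtained from the original architecture purely by adding sparsely connected (identity-weight) layers at the input side, leaving the trained final layers untouched.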
Proof of Corollary 3 By Proposition 1 and the observation in its proof that $\sigma(x)>x$, we only need to verify that $\sigma$ is Borel bi-measurable. Indeed, since $\sigma$ is continuous and injective, [81, Proposition 2.1] implies that $\sigma^{-1}$ exists and is continuous on the image of $\sigma$. Since $\sigma$ was assumed to be surjective, $\sigma^{-1}$ exists on all of $\mathbb{R}$ and is continuous thereon. Hence, $\sigma$ and $\sigma^{-1}$ are measurable, since any continuous function is measurable.

Proof of Theorem 6 Fix $A=I_m$ and $b\in\mathbb{R}^m$ with $b_i>0$ for $i=1,\dots,m$. Since $\operatorname{int}(\overline{\operatorname{co}}(F))$ is a non-empty open set, there exist some $f\in\operatorname{int}(\overline{\operatorname{co}}(F))$ and some $\delta>0$ for which

$\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m)}(f,\delta)\triangleq\Big\{g\in L^1_\mu(\mathbb{R}^m):\int_{x\in\mathbb{R}^m}\|f(x)-g(x)\|\,d\mu(x)<\delta\Big\}$

is an open subset of $\operatorname{int}(\overline{\operatorname{co}}(F))$. Since $\operatorname{co}(F)\cap\operatorname{int}(\overline{\operatorname{co}}(F))$ is dense in $\operatorname{int}(\overline{\operatorname{co}}(F))$, its intersection with any non-empty open subset thereof is also dense; in particular, $\operatorname{co}(F)\cap\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m)}(f,\delta)$ is dense in $\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m)}(f,\delta)$. Since $\sigma$ is $L^1_\mu$-transitive, (iii) follows from Corollary 2.

Since $L^1_\mu$ is a metric space, $\{\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m)}(g,\delta):g\in L^1_\mu(\mathbb{R}^m),\ \delta>0\}$ is a base for the topology thereon. Therefore, Corollary 2 implies that, for any two non-empty open subsets $U,V\subseteq L^1_\mu(\mathbb{R}^m)$, there exists some $N_{U,V}\in\mathbb{N}$ satisfying $\Phi_{I_m,b}^{N_{U,V}}(U)\cap V\neq\varnothing$. Hence, $\Phi_{I_m,b}$ is topologically transitive on $L^1_\mu(\mathbb{R}^m)$, in the sense of [74, Definition 1.38]. Moreover, since $\Phi_{I_m,b}$ is a continuous linear map, Birkhoff's transitivity theorem, as formulated in [74, Theorem 2.19], applies, and therefore $\Phi_{I_m,b}$ is a hypercyclic operator on $L^1_\mu(\mathbb{R}^m)$. Therefore, [74, Proposition 5.8] implies that $\|\Phi_{I_m,b}\|_{op}>1$. Setting $\kappa\triangleq\|\Phi_{I_m,b}\|_{op}$ yields (ii).

It remains to show the approximation bound described by (i). Fix $f\in L^1_\mu(\mathbb{R}^m)$. Since $L^1_\mu(\mathbb{R}^m)$ is a Banach space, it has no isolated points, and since $\Phi_{I_m,b}$ is a hypercyclic operator, Birkhoff's transitivity theorem, as formulated in [74, Theorem 2.19], implies that there exists a dense $G_\delta$-subset $HC(\Phi_{I_m,b})\subseteq L^1_\mu(\mathbb{R}^m)$ such that, for every $g\in HC(\Phi_{I_m,b})$, the set $\{\Phi_{I_m,b}^{N}(g)\}_{N\in\mathbb{N}}$ is dense in $L^1_\mu(\mathbb{R}^m)$. Therefore, every non-empty open subset of $L^1_\mu(\mathbb{R}^m)$ contains some element of $HC(\Phi_{I_m,b})$. In particular, there is some $g\in HC(\Phi_{I_m,b})\cap\operatorname{int}(\overline{\operatorname{co}}(F))$, since $\operatorname{int}(\overline{\operatorname{co}}(F))$ is a non-empty open subset of $L^1_\mu(\mathbb{R}^m)$; in particular, $g\in\overline{\operatorname{co}}(F)$. Therefore, the conditions of [69, Theorem 2] and [69, Equation (23)] are met; hence, for each $n\in\mathbb{N}$ the following approximation bound holds:

$\inf_{f_i\in F,\ \sum_{i=1}^{n}\alpha_i=1,\ \alpha_i\in[0,1]}\ \int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i f_i(x)-g(x)\Big\|\,d\mu(x)\le\frac{2\mu(\mathbb{R}^m)}{\sqrt{n}}.$   (33)

Since $\{\Phi_{I_m,b}^{N}(g)\}_{N\in\mathbb{N}}$ is dense in $L^1_\mu(\mathbb{R}^m)$, there exists some $N\in\mathbb{N}$ for which $\Phi_{I_m,b}^{N}(g)\in\operatorname{Ball}_{L^1_\mu(\mathbb{R}^m)}\big(f,\tfrac{1}{\sqrt{n}}\big)$. Thus, the following bound holds:

$\int_{x\in\mathbb{R}^m}\big\|f(x)-\Phi_{I_m,b}^{N}(g)(x)\big\|\,d\mu(x)\le\frac{1}{\sqrt{n}}.$   (34)

Since $\Phi_{I_m,b}$ is a continuous linear map from the Banach space $L^1_\mu(\mathbb{R}^m)$ to itself, it is Lipschitz with constant $\|\Phi_{I_m,b}\|_{op}$, where $\|\cdot\|_{op}$ denotes the operator norm, and by [64, Corollary 2.1.2] we have

$\|\Phi_{I_m,b}\|_{op}=\Big\|\frac{d\big((\sigma\bullet(\cdot+b))_{*}\mu\big)}{d\mu}\Big\|_{\infty}.$   (35)

Moreover, by Lemma 1, we know that the right-hand side of (35) is finite. Therefore, (34) implies that, for every $f_1,\dots,f_n\in F$ and every $\alpha_1,\dots,\alpha_n\in[0,1]$ with $\sum_{i=1}^{n}\alpha_i=1$, the following holds:

$\int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i\,\Phi_{I_m,b}^{N}(f_i)(x)-f(x)\Big\|\,d\mu(x)
\le\int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i\,\Phi_{I_m,b}^{N}(f_i)(x)-\Phi_{I_m,b}^{N}(g)(x)\Big\|\,d\mu(x)+\int_{x\in\mathbb{R}^m}\big\|f(x)-\Phi_{I_m,b}^{N}(g)(x)\big\|\,d\mu(x)
\le\|\Phi_{I_m,b}\|_{op}^{N}\int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i f_i(x)-g(x)\Big\|\,d\mu(x)+\int_{x\in\mathbb{R}^m}\big\|\Phi_{I_m,b}^{N}(g)(x)-f(x)\big\|\,d\mu(x)
\le\Big\|\frac{d\big((\sigma\bullet(\cdot+b))_{*}\mu\big)}{d\mu}\Big\|_{\infty}^{N}\int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i f_i(x)-g(x)\Big\|\,d\mu(x)+\frac{1}{\sqrt{n}}.$   (36)
Combining the estimates (33)–(36), and using that $\kappa=\|\Phi_{I_m,b}\|_{op}>1$, we obtain

$\inf_{f_i\in F,\ \sum_{i=1}^{n}\alpha_i=1,\ \alpha_i\in[0,1]}\ \int_{x\in\mathbb{R}^m}\Big\|\sum_{i=1}^{n}\alpha_i\,\Phi_{I_m,b}^{N}(f_i)(x)-f(x)\Big\|\,d\mu(x)\le\kappa^{N}\,\frac{2\mu(\mathbb{R}^m)}{\sqrt{n}}+\frac{1}{\sqrt{n}}\le\frac{\kappa^{N}}{\sqrt{n}}\big(1+2\mu(\mathbb{R}^m)\big).$   (37)

Since $\Phi_{I_m,b}^{N}$ is linear, $\sum_{i=1}^{n}\alpha_i\,\Phi_{I_m,b}^{N}(f_i)=\Phi_{I_m,b}^{N}\big(\sum_{i=1}^{n}\alpha_i f_i\big)$, and (37) reduces to the following estimate:

$\inf_{f_i\in F,\ \sum_{i=1}^{n}\alpha_i=1,\ \alpha_i\in[0,1]}\ \int_{x\in\mathbb{R}^m}\Big\|\Phi_{I_m,b}^{N}\Big(\sum_{i=1}^{n}\alpha_i f_i\Big)(x)-f(x)\Big\|\,d\mu(x)\le\frac{\kappa^{N}}{\sqrt{n}}\big(1+2\mu(\mathbb{R}^m)\big).$   (38)

Therefore, the estimate in (i) holds.

The statement of the next lemma concerns the Banach space of functions vanishing at infinity. Denoted $C_0(\mathbb{R}^m,\mathbb{R}^n)$, this is the set of continuous functions $f$ from $\mathbb{R}^m$ to $\mathbb{R}^n$ such that, for every $\epsilon>0$, there exists some compact subset $K_\epsilon\subseteq\mathbb{R}^m$ for which $\sup_{x\notin K_\epsilon}\|f(x)\|<\epsilon$. As discussed in [82, VII], $C_0(\mathbb{R}^m,\mathbb{R}^n)$ is made into a Banach space by equipping it with the supremum norm $\|f\|_\infty\triangleq\sup_{x\in\mathbb{R}^m}\|f(x)\|$.

Lemma 4 (Uniform Approximation of Functions Vanishing at Infinity) Suppose that $\mathcal{F}$ is a universal approximator on $C(\mathbb{R}^m,\mathbb{R}^n)$. Then, for every $f\in C_0(\mathbb{R}^m,\mathbb{R}^n)$ and every $\epsilon>0$, there exists $f_\epsilon\in C_0(\mathbb{R}^m,\mathbb{R}^n)$ with representation

$f_\epsilon(\cdot)=g_\epsilon(\cdot)\,e^{\frac{\|\cdot\|^{2}}{\|\cdot\|-b}}\,\mathbb{I}_{\|\cdot\|<b}+a\,\mathbb{I}_{\|\cdot\|<b}+a\,e^{-|g_\epsilon(\cdot)|\,(\|\cdot\|-b)}\,\mathbb{I}_{\|\cdot\|\ge b},$   (39)

where the absolute value $|\cdot|$ is applied component-wise, $g_\epsilon\in\mathcal{NN}^{\mathcal{F}}$, and $a,b>0$, satisfying the uniform approximation bound $\|f-f_\epsilon\|_\infty<\epsilon$.

Proof of Lemma 4 Let $\mathcal{F}$ be a universal approximator on $C(\mathbb{R}^m,\mathbb{R}^n)$, let $f\in C_0(\mathbb{R}^m,\mathbb{R}^n)$, and let $\epsilon>0$. Since $f$ vanishes at infinity, there exists some non-empty compact $K_{\epsilon,f}\subseteq\mathbb{R}^m$ for which $\|f(x)\|\le\tfrac{\epsilon}{2}$ for every $x\notin K_{\epsilon,f}$. By the Heine-Borel theorem, $K_{\epsilon,f}$ is bounded, and therefore there exists some $b_\epsilon>0$ such that $K_{\epsilon,f}\subseteq\operatorname{Ball}_{\mathbb{R}^m}(0,b_\epsilon)\triangleq\{x\in\mathbb{R}^m:\|x\|<b_\epsilon\}$. Therefore,

$\sup_{x\in\mathbb{R}^m-\operatorname{Ball}_{\mathbb{R}^m}(0,b_\epsilon)}\|f(x)\|<\frac{\epsilon}{2}.$   (40)

Since the bump function $x\mapsto e^{\frac{-1}{1-\|x\|^{2}}}\mathbb{I}_{\|x\|<1}$ is continuous, affine functions are continuous, $f\in C(\mathbb{R}^m,\mathbb{R}^n)$, and the composition and multiplication of continuous functions are again continuous, the function $x\mapsto\big(f(x)-\tfrac{\epsilon}{2}\big)e^{\frac{\|x\|^{2}}{\|x\|-b_\epsilon}}\mathbb{I}_{\|x\|<b_\epsilon}$ is itself continuous. Observe also that the set $\overline{\operatorname{Ball}}(0,b_\epsilon)=\{x\in\mathbb{R}^m:\|x\|\le b_\epsilon\}$ is closed and bounded, thus compact by the Heine-Borel theorem. Since $\mathcal{F}$ is a universal approximator on $C(\mathbb{R}^m,\mathbb{R}^n)$ for the topology of uniform convergence on compacts, there exists some $g_\epsilon\in\mathcal{NN}^{\mathcal{F}}$ satisfying

$\sup_{x\in\overline{\operatorname{Ball}}(0,b_\epsilon)}\Big\|g_\epsilon(x)-\Big(f(x)-\frac{\epsilon}{2}\Big)e^{\frac{\|x\|^{2}}{\|x\|-b_\epsilon}}\Big\|<\frac{\epsilon}{2}.$   (41)

Since $0\le e^{\frac{\|x\|^{2}}{\|x\|-b_\epsilon}}\le1$ for every $x$ with $\|x\|<b_\epsilon$, we obtain from (40) and (41) that

$\sup_{x\in\overline{\operatorname{Ball}}(0,b_\epsilon)}\Big\|g_\epsilon(x)\,e^{\frac{\|x\|^{2}}{\|x\|-b_\epsilon}}\mathbb{I}_{\|x\|<b_\epsilon}+\frac{\epsilon}{2}\mathbb{I}_{\|x\|<b_\epsilon}-f(x)\Big\|\le\epsilon.$   (42)

Observe that, for every $x\in\mathbb{R}^m-\operatorname{Ball}(0,b_\epsilon)$, we have $\|x\|-b_\epsilon\ge0$ and $-|g_\epsilon(x)|\le0$; therefore

$0\le\frac{\epsilon}{2}e^{-|g_\epsilon(x)|(\|x\|-b_\epsilon)}\le\frac{\epsilon}{2}.$   (43)

Combining (40), (42), and (43), we compute the following bound:

$\sup_{x\in\mathbb{R}^m}\Big\|g_\epsilon(x)\,e^{\frac{\|x\|^{2}}{\|x\|-b_\epsilon}}\mathbb{I}_{\|x\|<b_\epsilon}+\frac{\epsilon}{2}\mathbb{I}_{\|x\|<b_\epsilon}+\frac{\epsilon}{2}e^{-|g_\epsilon(x)|(\|x\|-b_\epsilon)}\mathbb{I}_{\|x\|\ge b_\epsilon}-f(x)\Big\|\le\max\Big\{\epsilon,\ \sup_{x\in\mathbb{R}^m-\operatorname{Ball}(0,b_\epsilon)}\Big(\frac{\epsilon}{2}e^{-|g_\epsilon(x)|(\|x\|-b_\epsilon)}+\|f(x)\|\Big)\Big\}\le\max\Big\{\epsilon,\ \frac{\epsilon}{2}+\frac{\epsilon}{2}\Big\}=\epsilon.$   (44)

Thus, the result holds, with $a=\tfrac{\epsilon}{2}$ and $b=b_\epsilon$ in (39).
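For concreteness, the following sketch evaluates a function of the form (39) as reconstructed above. Everything in it is a placeholder: the callable g stands in for a trained feed-forward network, the constants a and b are arbitrary, and the two exponential weights are taken exactly as written in (39); the snippet only illustrates how the three pieces are glued together along the sphere $\|x\|=b$.

```python
# Illustrative evaluation of the representation (39); g, a, b are hypothetical placeholders.
import numpy as np

def f_eps(x, g, a=0.05, b=3.0):
    """Evaluate the representation (39) at a point x in R^m.
    g : R^m -> R^n is the underlying network (here any callable), a, b > 0,
    and |g(x)| is applied component-wise, as in the lemma."""
    r = np.linalg.norm(x)
    if r < b:
        bump = np.exp(r**2 / (r - b))            # lies in (0, 1] and tends to 0 as r -> b
        return g(x) * bump + a
    return a * np.exp(-np.abs(g(x)) * (r - b))   # equals a on the sphere r = b, then decays

# toy check that the two branches (approximately) match across the sphere ||x|| = b
g = lambda x: np.sin(x)                          # placeholder for a feed-forward network
b = 3.0
inside = f_eps(np.array([b - 1e-6, 0.0]), g, b=b)
outside = f_eps(np.array([b + 1e-6, 0.0]), g, b=b)
print(inside, outside)                           # both values are close to the constant a
```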
Proof of Theorem 7 For each $\omega\in\Omega$, define the map $\Phi_\omega:C_0(\mathbb{R}^m,\mathbb{R}^n)\to C_\omega(\mathbb{R}^m,\mathbb{R}^n)$ by $\Phi_\omega(f)\triangleq(\omega(\cdot)+1)f$. For each $f,g\in C_0(\mathbb{R}^m,\mathbb{R}^n)$ we compute

$\|\Phi_\omega(f)-\Phi_\omega(g)\|_{\omega,\infty}=\sup_{x\in\mathbb{R}^m}\frac{\|(\omega(x)+1)f(x)-(\omega(x)+1)g(x)\|}{\omega(x)+1}=\sup_{x\in\mathbb{R}^m}\frac{(\omega(x)+1)\,\|f(x)-g(x)\|}{\omega(x)+1}=\|f-g\|_\infty.$   (45)

Therefore, for each $\omega\in\Omega$, the map $\Phi_\omega$ is an isometry. For each $\omega\in\Omega$, define the map $\tilde\Phi_\omega:C_\omega(\mathbb{R}^m,\mathbb{R}^n)\to C_0(\mathbb{R}^m,\mathbb{R}^n)$ by $\tilde\Phi_\omega(f)\triangleq\frac{f}{\omega(\cdot)+1}$. For each $f\in C_\omega(\mathbb{R}^m,\mathbb{R}^n)$ we compute

$\Phi_\omega\circ\tilde\Phi_\omega(f)=(\omega(\cdot)+1)\,\frac{1}{\omega(\cdot)+1}\,f=f.$   (46)

Hence, $\tilde\Phi_\omega$ is a right-inverse of $\Phi_\omega$. Since every isometry is a homeomorphism onto its image, and since $\Phi_\omega$ is a surjective isometry, $\Phi_\omega$ defines a homeomorphism from $C_0(\mathbb{R}^m,\mathbb{R}^n)$ onto $C_\omega(\mathbb{R}^m,\mathbb{R}^n)$; in particular, $\Phi_\omega\big(C_0(\mathbb{R}^m,\mathbb{R}^n)\big)=C_\omega(\mathbb{R}^m,\mathbb{R}^n)$. Therefore,

$\bigcup_{\omega\in\Omega}C_\omega(\mathbb{R}^m,\mathbb{R}^n)=\bigcup_{\omega\in\Omega}\Phi_\omega\big(C_0(\mathbb{R}^m,\mathbb{R}^n)\big).$

Hence, condition (5) holds.

Since it was assumed that $\sup_{x\in\mathbb{R}^m}\|f(x)\|e^{-\|x\|}<\infty$ holds, Lemma 4 applies; whence,

$\Big\{f\,e^{\frac{\|\cdot\|^{2}}{\|\cdot\|-b}}\,\mathbb{I}_{\|\cdot\|<b}+a\,\mathbb{I}_{\|\cdot\|<b}+a\,e^{-|f(\cdot)|(\|\cdot\|-b)}\,\mathbb{I}_{\|\cdot\|\ge b}\ :\ 0<a,b,\ f\in\mathcal{NN}^{\mathcal{F}}\Big\}$

is dense in $C_0(\mathbb{R}^m,\mathbb{R}^n)$. Therefore, the conditions of Theorem 2 are met. Hence,

$\bigcup_{\omega\in\Omega}\Phi_\omega\Big[\Big\{f\,e^{\frac{\|\cdot\|^{2}}{\|\cdot\|-b}}\,\mathbb{I}_{\|\cdot\|<b}+a\,\mathbb{I}_{\|\cdot\|<b}+a\,e^{-|f(\cdot)|(\|\cdot\|-b)}\,\mathbb{I}_{\|\cdot\|\ge b}\ :\ 0<a,b,\ f\in\mathcal{NN}^{\mathcal{F}}\Big\}\Big]$   (47)

is dense in $\bigcup_{\omega\in\Omega}C_\omega(\mathbb{R}^m,\mathbb{R}^n)$. By definition, (47) is a subset of the set of functions implemented by the modified architecture; therefore, that set is dense in $\bigcup_{\omega\in\Omega}C_\omega(\mathbb{R}^m,\mathbb{R}^n)$. Hence, the modified architecture is a universal approximator on $\bigcup_{\omega\in\Omega}C_\omega(\mathbb{R}^m,\mathbb{R}^n)$.
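The isometry identity (45) for the weighting map $\Phi_\omega(f)=(\omega(\cdot)+1)f$ can be sanity-checked numerically on a grid. In the sketch below, the weight $\omega$, the grid, and the two test functions are arbitrary placeholder choices, and the weighted norm is the one appearing in the reconstruction of (45).

```python
# Illustrative numerical check of the isometry (45) on a one-dimensional grid.
import numpy as np

x = np.linspace(-50.0, 50.0, 100_001)           # grid standing in for R^m (here m = 1)
omega = np.exp(np.abs(x))                        # an illustrative weight function
f = np.sin(x) * np.exp(-x**2)                    # two test functions vanishing at infinity
g = np.cos(3 * x) / (1 + x**2)

def weighted_norm(h, omega):
    # the norm ||h||_{omega,∞} = sup_x ||h(x)|| / (omega(x) + 1), as used in (45)
    return np.max(np.abs(h) / (omega + 1.0))

Phi_f, Phi_g = (omega + 1.0) * f, (omega + 1.0) * g
lhs = weighted_norm(Phi_f - Phi_g, omega)        # ||Phi_omega(f) - Phi_omega(g)||_{omega,∞}
rhs = np.max(np.abs(f - g))                      # ||f - g||_∞ on the same grid
print(lhs, rhs)                                  # the two numbers agree up to rounding, as in (45)
```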
Proof of Proposition 2 For each $k,l\in\mathbb{N}$ with $k\le l$, we have that $\exp(-kt)\ge\exp(-lt)$ for every $t\in[0,\infty)$. Thus,

$C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)\subseteq C_{\exp(-l\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n),$   (48)

and the inclusion is strict if $k<l$. Moreover, for $k\le l$, the inclusion $i_k^l$ of $C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ into $C_{\exp(-l\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ is continuous. Thus, $\big(C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n),i_k^l\big)_{k\in\mathbb{N}}$ is a strict inductive system of Banach spaces. Therefore, by [83, Proposition 4.5.1], there exists a finest topology on $\bigcup_{k\in\mathbb{N}}C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ both making it into a locally convex space and ensuring that each $C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ is a subspace. Denote $\bigcup_{k\in\mathbb{N}}C_{\exp(-k\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ equipped with this topology by $C^{LCS}(\mathbb{R}^m,\mathbb{R}^n)$.

If $f\in C^{LCS}(\mathbb{R}^m,\mathbb{R}^n)$ then, by construction, there must exist some $K\in\mathbb{N}$ such that $f\in C_{\exp(-K\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$. By [84, Propositions 2 and 4], a sequence $\{f_t\}_{t\in\mathbb{N}}$ converges to some $f$ if and only if there exist some $K\in\mathbb{N}$ and some $N\in\mathbb{N}$ such that, for every $t\ge N$, $f_t\in C_{\exp(-K\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ and the sub-sequence $\{f_t\}_{t\ge N}$ converges to $f$ in the Banach topology of $C_{\exp(-K\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$. In particular, the function $f(x)\triangleq(\exp(-\|x\|),\dots,\exp(-\|x\|))$ belongs to $C_{\exp(-0\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$.

Since each $f\in\mathcal{NN}^{\mathcal{F}}$ is either constant or satisfies $\sup_{x\in\mathbb{R}^m}\|f(x)\|=\infty$, for any sequence $\{f_t\}_{t\in\mathbb{N}}\subseteq\mathcal{NN}^{\mathcal{F}}$ there exists some $N_0\in\mathbb{N}_+$ for which the sub-sequence $\{f_t\}_{t\ge N_0}$ lies in $C_{\exp(-0\|\cdot\|)}(\mathbb{R}^m,\mathbb{R}^n)$ if and only if, for each $t\ge N_0$, the map $f_t$ is constant. Therefore, for each $t\ge N_0$, we compute that

$\|f_t-f\|_{\exp(-0\|\cdot\|),\infty}=\|f_t-f\|_\infty\ge\inf_{c\in\mathbb{R}}\sup_{x\in\mathbb{R}^m}|\exp(-\|x\|)-c|=\frac{1}{2}>0.$

Hence, $\{f_t\}_{t\in\mathbb{N}}$ cannot converge to $f$ in $C^{LCS}(\mathbb{R}^m,\mathbb{R}^n)$, and therefore $\mathcal{F}$ does not have the UAP on $C^{LCS}(\mathbb{R}^m,\mathbb{R}^n)$.

Proof of Corollary 7 Let $X\triangleq\mathbb{R}$ and take $\mathcal{X}_0\triangleq\mathcal{X}\triangleq L^1(\mathbb{R})$. Since every Banach space is a pointed metric space with reference point its zero vector, and since $\mathbb{R}$ is separable, Theorem 4 applies. We only need to verify the form of $\eta$ and of $\rho$. Indeed, the identification of $B(\mathbb{R})$ with $L^1(\mathbb{R})$, together with the explicit description of $\eta$, is constructed in [32, Example 3.11]. The fact that $L^1(\mathbb{R})$ is barycentric follows from the fact that it is a Banach space and from [31, Lemma 2.4].

References

1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943)
2. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psych. Rev. 65(6), 386 (1958)
3. Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990)
4. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
5. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)
6. Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953–956 (1957)
7. Webb, S.: Deep learning for biology. Nature 554(7693) (2018)
8. Eraslan, G., Avsec, Ž., Gagneur, J., Theis, F.J.: Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20(7), 389–403 (2019)
9. Plis, S.M.: Deep learning for neuroimaging: a validation study. Front. Neurosci. 8, 229 (2014)
10. Zhang, W.E., Sheng, Q.Z., Alhazmi, A., Li, C.: Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Trans. Intell. Syst. Technol. 11(3) (2020)
11. Buehler, H., Gonon, L., Teichmann, J., Wood, B.: Deep hedging. Quant. Finance 19(8), 1271–1291 (2019)
12. Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. J. Mach. Learn. Res. 20, Paper No. 74, 25 pp. (2019)
13. Cuchiero, C., Khosrawi, W., Teichmann, J.: A generative adversarial network approach to calibration of local stochastic volatility models. Risks 8(4), 101 (2020)
14. Kratsios, A., Hyndman, C.: Deep arbitrage-free learning in a generalized HJM framework via arbitrage-regularization. Risks 8(2), 40 (2020)
15. Horvath, B., Muguruza, A., Tomas, M.: Deep learning volatility: a deep neural network perspective on pricing and calibration in (rough) volatility models. Quant. Finance, 1–17 (2020)
16. Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993)
17. Kidger, P., Lyons, T.: Universal approximation with deep narrow networks. In: Abernethy, J., Agarwal, S. (eds.) Proceedings of Machine Learning Research, vol. 125, pp. 2306–2327. PMLR (2020)
18. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
19. Park, S., Yun, C., Lee, J., Shin, J.: Minimum width for universal approximation. ICLR (2021)
20. Hanin, B.: Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics (MDPI) 7(10) (2019)
21. Lu, Z., Pu, H., Wang, F., Hu, Z., Wang, L.: The expressive power of neural networks: a view from the width. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6231–6239. Curran Associates, Inc. (2017)
22. Fletcher, P.T., Venkatasubramanian, S., Joshi, S.: The geometric median on Riemannian manifolds with application to robust atlas estimation. NeuroImage 45(1), S143–S152 (2009). Mathematics in Brain Imaging
23. Keller-Ressel, M., Nargang, S.: Hydra: a method for strain-minimizing hyperbolic embedding of network- and distance-based data. J. Complex Netw. 8(1), cnaa002, 18 pp. (2020)
24. Ganea, O., Bécigneul, G., Hofmann, T.: Hyperbolic neural networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 5345–5355. Curran Associates, Inc. (2018)
25. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International Conference on Machine Learning, pp. 7354–7363. PMLR (2019)
26. Arens, R.F., Eells, J.: On embedding uniform and topological spaces. Pacific J. Math. 6, 397–403 (1956)
27. von Luxburg, U., Bousquet, O.: Distance-based classification with Lipschitz functions. J. Mach. Learn. Res. 5, 669–695 (2003/04)
28. Ambrosio, L., Puglisi, D.: Linear extension operators between spaces of Lipschitz maps and optimal transport. J. Reine Angew. Math. 764, 1–21 (2020)
29. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks, pp. 214–223. PMLR, International Convention Centre, Sydney, Australia (2017)
30. Xu, T., Le, W., Munn, M., Acciaio, B.: COT-GAN: generating sequential data via causal optimal transport. Advances in Neural Information Processing Systems 33 (2020)
31. Godefroy, G., Kalton, N.J.: Lipschitz-free Banach spaces. Studia Math. 159(1), 121–141 (2003). Dedicated to Professor Aleksander Pełczyński on the occasion of his 70th birthday
32. Weaver, N.: Lipschitz Algebras. World Scientific Publishing Co. Pte. Ltd., Hackensack (2018)
33. Godefroy, G.: A survey on Lipschitz-free Banach spaces. Comment. Math. 55(2), 89–118 (2015)
34. Jost, J.: Riemannian Geometry and Geometric Analysis, 6th edn. Universitext, Springer, Heidelberg (2011)
35. Basso, G.: Extending and improving conical bicombings. Preprint arXiv:2005.13941 (2020)
36. Nagata, J.: Modern General Topology, revised edn. Bibliotheca Mathematica, vol. VII. North-Holland Publishing Co., Amsterdam; Wolters-Noordhoff Publishing, Groningen; American Elsevier Publishing Co., New York (1974)
37. Munkres, J.R.: Topology, 2nd edn. Prentice Hall, Inc., Upper Saddle River (2000)
38. Micchelli, C.A., Xu, Y., Zhang, H.: Universal kernels. J. Mach. Learn. Res. 7, 2651–2667 (2006)
39. Kontorovich, L., Nadler, B.: Universal kernel-based learning with applications to regular languages. J. Mach. Learn. Res. 10, 1095–1129 (2009)
40. Caponnetto, A., Micchelli, C.A., Pontil, M., Ying, Y.: Universal multi-task kernels. J. Mach. Learn. Res. 9, 1615–1646 (2008)
41. Grigoryeva, L., Ortega, J.-P.: Differentiable reservoir computing. J. Mach. Learn. Res. 20, Paper No. 179, 62 pp. (2019)
42. Cuchiero, C., Gonon, L., Grigoryeva, L., Ortega, J.-P., Teichmann, J.: Discrete-time signatures and randomness in reservoir computing. Preprint arXiv:2010.14615 (2020)
43. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis. 105(2), 171–185 (2013)
44. Kratsios, A., Bilokopytov, E.: Non-Euclidean universal approximation (2020)
45. Osborne, M.S.: Locally Convex Spaces. Graduate Texts in Mathematics, vol. 269. Springer, Cham (2014)
46. Petersen, P., Raslan, M., Voigtlaender, F.: Topological properties of the set of functions generated by neural networks of fixed size. Found. Comput. Math. https://doi.org/10.1007/s10208-020-09461-0 (2020)
47. Gribonval, R., Kutyniok, G., Nielsen, M., Voigtlaender, F.: Approximation spaces of deep neural networks. Constr. Approx., forthcoming (2020)
48. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge (2016)
49. Gelfand, I.: Normierte Ringe. Rec. Math. N. S. 9(51), 3–24 (1941)
50. Isbell, J.R.: Structure of categories. Bull. Amer. Math. Soc. 72, 619–655 (1966)
51. Dimov, G.D.: Some generalizations of the Stone duality theorem. Publ. Math. Debrecen 80(3-4), 255–293 (2012)
52. Tuitman, J.: A refinement of a mixed sparse effective Nullstellensatz. Int. Math. Res. Not. IMRN 7, 1560–1572 (2011)
53. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis. 105(2), 171–185 (2013)
54. Meyer, G., Bonnabel, S., Sepulchre, R.: Regression on fixed-rank positive semidefinite matrices: a Riemannian approach. J. Mach. Learn. Res. 12, 593–625 (2011)
55. Baes, M., Herrera, C., Neufeld, A., Ruyssen, P.: Low-rank plus sparse decomposition of covariance matrices using neural network parametrization. Preprint arXiv:1908.00461 (2019)
56. Hummel, J., Biederman, I.: Dynamic binding in a neural network for shape recognition. Psych. Rev. 99, 480–517 (1992)
57. Bishop, C.M.: Mixture density networks (1994)
58. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. ICLR (2017)
59. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2009)
60. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. ICLR (2018)
61. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999)
62. Koopman, B.O.: Hamiltonian systems and transformation in Hilbert space. Proc. Natl. Acad. Sci. 17(5), 315–318 (1931)
63. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. ICML 30(1), 3 (2013)
64. Singh, R.K., Manhas, J.S.: Composition Operators on Function Spaces. North-Holland Mathematics Studies, vol. 179. North-Holland Publishing Co., Amsterdam (1993)
65. Bengio, Y.: Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, vol. 27, pp. 17–36. JMLR Workshop and Conference Proceedings (2012)
66. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2018, pp. 270–279. Springer (2018)
67. Chollet, F., et al.: Keras. https://keras.io/guides/transfer_learning/ (2015)
68. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930–945 (1993)
69. Darken, C., Donahue, M., Gurvits, L., Sontag, E.: Rate of approximation results motivated by robust neural network learning. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 303–309. Association for Computing Machinery, New York (1993)
70. Prolla, J.B.: Weighted spaces of vector-valued continuous functions. Ann. Mat. Pura Appl. (4) 89, 145–157 (1971)
71. Bourbaki, N.: Éléments de mathématique. Topologie générale. Chapitres 1 à 4. Hermann, Paris (1971)
72. Phelps, R.R.: Subreflexive normed linear spaces. Arch. Math. (Basel) 8, 444–450 (1957)
73. Kadec, M.I.: A proof of the topological equivalence of all separable infinite-dimensional Banach spaces. Funkcional. Anal. i Priložen. 1, 61–70 (1967)
74. Grosse-Erdmann, K.-G., Peris Manguillot, A.: Linear Chaos. Universitext, Springer, London (2011)
75. Pérez Carreras, P., Bonet, J.: Barrelled Locally Convex Spaces. North-Holland Mathematics Studies, vol. 131; Notas de Matemática [Mathematical Notes], 113. North-Holland Publishing Co., Amsterdam (1987)
76. Kreyszig, E.: Introductory Functional Analysis with Applications. Wiley Classics Library. Wiley, New York (1989)
77. Bourbaki, N.: Espaces vectoriels topologiques. Chapitres 1 à 5, New edn. Éléments de mathématique. Masson, Paris (1981)
78. Kalmes, T.: Dynamics of weighted composition operators on function spaces defined by local properties. Studia Math. 249(3), 259–301 (2019)
79. Przestacki, A.: Dynamical properties of weighted composition operators on the space of smooth functions. J. Math. Anal. Appl. 445(1), 1097–1113 (2017)
80. Bayart, F., Darji, U.B., Pires, B.: Topological transitivity and mixing of composition operators. J. Math. Anal. Appl. 465(1), 125–139 (2018)
81. Hoffmann, H.: On the continuity of the inverses of strictly monotonic functions. Irish Math. Soc. Bull. 75, 45–57 (2015)
82. Behrends, E., Schmidt-Bichler, U.: M-structure and the Banach-Stone theorem. Studia Math. 69(1), 33–40 (1980/81)
83. Jarchow, H.: Locally Convex Spaces. Mathematische Leitfäden [Mathematical Textbooks]. B. G. Teubner, Stuttgart (1981)
84. Dieudonné, J., Schwartz, L.: La dualité dans les espaces F et LF. Ann. Inst. Fourier (Grenoble) 1, 61–101 (1949)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
