Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Deep solution operators for variational inequalities via proximal neural networks

Deep solution operators for variational inequalities via proximal neural networks andreas.stein@sam.math.ethz.ch Seminar for Applied Following Bauschke and Combettes (Convex analysis and monotone operator theory in Mathematics, ETH Zürich, Zürich, Hilbert spaces, Springer, Cham, 2017), we introduce ProxNet, a collection of deep neural Switzerland networks with ReLU activation which emulate numerical solution operators of variational inequalities (VIs). We analyze the expression rates of ProxNets in emulating solution operators for variational inequality problems posed on closed, convex cones in real, separable Hilbert spaces, covering the classical contact problems in mechanics, and early exercise problems as arise, e.g., in valuation of American-style contracts in Black–Scholes financial market models. In the finite-dimensional setting, the VIs reduce to matrix VIs in Euclidean space, and ProxNets emulate classical projected matrix iterations, such as projected Jacobi and projected SOR methods. 1 Introduction Variational Inequalities (VIs for short) in infinite-dimensional spaces arise in variational formulations of numerous models in the sciences. We refer only to [7,17,26] and the references there for models of contact problems in continuum mechanics, [20] and the references there for applications from optimal stopping in finance (mainly option pric- ing with “American-style,” early exercise features) and [4] and the references there for resource allocation and game theoretic models. Two broad classes of approaches toward numerical solution of VIs can be identified: deterministic approaches, which are based on discretization of the VI in function space, and probabilistic approaches, which exploit stochastic numerical simulation and an interpretation of the solution of the VI as condi- tional expectations of optimally stopped sample paths. The latter approach has been used to design ML algorithms for the approximation of the solution of one instance of the VI in [3]. Deep neural network structures arise naturally in abstract variational inequality prob- lems (VIs) posed on the product of (possibly infinite-dimensional) Hilbert spaces, as review, e.g., in [5]. Therein, the activation functions correspond to proximity operators of certain potentials that define the constraints of the VI. Weak convergence of this recurrent NN structure in the limit of infinite depth to feasible solutions of the VI is shown under suitable assumptions. An independent, but related, development in recent years has been the advent of DNN-based numerical approximations which are based on encoding known, © The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 0123456789().,–: volV 36 Page 2 of 35 Schwab, Stein Res Math Sci (2022) 9:36 iterative solvers for discretized partial differential equations, and certain fixed point itera- tions for nonlinear operator equations. We mention only [9], that developed DNNs which emulate the ISTA iteration of [6], or the more recently proposed generalization of “deep unrolling/unfolding” methodology [22]. Closer to PDE numerics, recently [11] proposed MGNet, a neural network emulation of multilevel, iterative solvers for linear, elliptic PDEs. The general idea behind these approaches is to emulate by a DNN a contractive map, say , which is assumed to satisfy the conditions of Banach’s Fixed Point Theorem (BFPT), and whose unique fixed point is the solution of the operator equation of interest. Let us denote the approximate map realized by emulating  with a DNN by . The universality theorem for DNNs in various function classes implies (see, e.g., [16,25] and the references there) that for any ε> 0 a DNN surrogate  to the contraction map exists, which is ε-close to , uniformly on the domain of attraction of . Iteration of the DNN  being realized by composition, any finite number K of steps of the fixed point iteration can be realized by K -fold composition of the DNN surrogate ˜ ˜ . Iterating , instead of , induces an error of order O(ε/(1 − L)), uniformly in the number of iterations K , where L ∈ (0, 1) denotes the contraction constant of . Due to the contraction property of , K may be chosen as O(| log(ε)|) in order to output an approximate fixed point with accuracy ε upon termination. The K -fold composition of the ˜ ˜ surrogate DNN  is, in turn, itself a DNN of depth O(depth()| log(ε)|). This reasoning is valid also in metric spaces, since the notions of continuity and contractivity of the map do not rely on availability of a norm. Hence, a (sufficiently large) DNN  exists which may be used likewise for the iterative solution of VIs in metric spaces. Furthermore, the resulting fixed-point-iteration nets obtained in this manner naturally exhibit a recurrent structure, in the case (considered here) that the surrogate  is fixed throughout the K -fold (k) K composition (more refined constructions with stage-dependent approximations{ } k=1 of increasing emulation accuracy could be considered, but shall not be addressed here). In summary, with the geometric error reduction in FPIs which is implied by the con- traction condition, finite truncation at a prescribed emulation precision ε> 0willimply O(| log(ε)|) iterations, and exact solution representation (of the fixed point of )inthe infinite depth limit. In DNN calculus, finitely terminated FPIs can be realized via finite concatenation of the DNN approximation  of the contraction map . The corresponding DNNs exhibit depth O(| log()|), and naturally a recurrent structure due to the repetition of the Net  in their construction. Thereby, recurrent DNNs can be built which encode numerical solution maps of fixed point iterations. This idea has appeared in various incar- nations in recent work; we refer to, e.g., MGNet for the realization of Multi-grid iterative solvers of discretized elliptic PDEs [11]. The presently proposed ProxNet architectures are, in fact, DNN emulations of corresponding fixed point iterations of (discretized) vari- ational inequalities. Recent work has promoted so-called Deep Operator Nets which emulate Data-to- Solution operators for classes of PDEs. We mention only [19] and the references there. To analyze expression rates of deep neural networks (DNNs) for emulating data-to-solution operators for VIs is the purpose of the present paper. In line with recent work (e.g., [19,21] and the references there), we take the perspective of infinite-dimensional VIs, which are set on closed cones in separable Hilbert spaces. The task at hand is then the analysis of rates of expression of the approximate data-to-solution map, which relates the input data (i.e., operator, cone, etc.) to the unique solution of the VI. Schwab, Stein Res Math Sci (2022) 9:36 Page 3 of 35 36 1.1 Layout The structure of this paper is as follows. In Sect. 2, we recapitulate basic notions and definitions of proximal neural networks in infinite-dimensional, separable Hilbert spaces. A particular role is taken by so-called proximal activations, and a calculus of ProxNets, which we shall use throughout the rest of the paper to build solution operators of VIs. Section 3 addresses the conceptual use of ProxNets in the constructive solution of VIs. We build in particular ProxNet emulators of convergent fixed point iterations to construct solutions of VIs. Section 3.2 introduces quantitative bounds for perturbations of ProxNets. Section 4 emphasizes that ProxNets may be regarded as (approximate) solution operators to unilateral obstacle problems in infinite-dimensional Hilbert spaces. Section 5 presents DNN emulations of iterative solvers of matrix LCPs which arise from discretization of unilateral problems for PDEs. Section 6 presents several numerical experiments, which illustrate the foregoing developments. More precisely, we consider the numerical solution of free boundary value problems arising in the valuation of American-style options, and in parametric obstacle problems. Section 7 provides a brief summary of the main results and indicates possible directions for further research. 1.2 Notation We use standard notation. By L(H, K), we denote the Banach space of bounded, linear operators from the Banach space H into K (surjectivity will not be required). Unless explicitly stated otherwise, all Hilbert and Banach-spaces are infinite-dimensional. By bold symbols, we denote matrices resp. linear maps between finite-dimensional spaces. 0 0 We use the notation conventions ·= 0and  ·= 1 for the empty sum and i=1 i=1 empty product, respectively. Vectors in finite-dimensional, Euclidean space are always understood as column vectors, with  denoting transposition of matrices and vectors. 2 Proximal neural networks (ProxNets) We consider the following model for an artificial neural network: For finite m ∈ N,let H and (H ) be real, separable Hilbert spaces. For every i ∈{1, ... ,m},let W ∈ i 0≤i≤m i L(H , H ) be a bounded linear operator, let b ∈ H ,let R : H → H be a nonlinear, i−1 i i i i i i continuous operator, and define T : H → H,x → R (W x + b ). (1) i i−1 i i i i Moreover, let W ∈ L(H , H), W ∈ L(H , H), b ∈ H and consider the neural 0 0 m+1 m m+1 network (NN) model : H → H,x → W x + W (T ◦ ··· ◦ T )(x) + b . (2) 0 0 m+1 m 1 m+1 The operator W ∈ L(H , H) allows to include skip connections in the model, similar to 0 0 deep residual neural networks as proposed in [12,13]. This article focuses in particular on NNs with identical input and output spaces as in [5, Model 1.1], that arise as special case of model (2)with H = H = H and are of the form 0 m : H → H,x → (1 − λ)x + λ(T ◦ ··· ◦ T )(x), (3) m 1 36 Page 4 of 35 Schwab, Stein Res Math Sci (2022) 9:36 for a relaxation parameter λ> 0 to be adjusted for each application. The relation H = H = H allows us to investigate fixed points of  : H → H, which are in turn solutions to variational inequalities. The nonlinear operators R act as activation operators of the NNs and are subsequently given by suitable proximity operators on H . We refer to and  as proximal neural networks or ProxNets for short and derive sufficient conditions on the operators T ,resp. W and R , so that  defines a contraction on H. Hence, the i i i ∗ ∗ unique fixed point x = (x ) ∈ H solves a variational inequality, that is turn uniquely determined by the network parameters W ,b and R for i ∈{1, ... ,m}. On the other i i i hand, any well-posed variational inequality on H may be recast as fixed-point problem for a suitable contractive ProxNet  : H → H. As an example, consider an elliptic variational inequality on H, with solution u ∈ K ⊂ H, where K is a closed, convex set. The set of contractive mappings on H is open; therefore, we may construct a one-layer ProxNet  : H → H, such that u is the unique fixed-point of . Therein, W ∈ L(H) stems from the bilinear form of the variational inequality, λ> 0 is a relaxation parameter chosen to ensure a Lipschitz constant below one, and R is the H-orthogonal projection onto K, see Sect. 4.1 for a detailed construction. This enables us to approximate solutions to variational inequality problems as fixed- point iterations of ProxNets and derive convergence rates. Due to the contraction property ∗ ∗ of , the fixed-point iteration x = (x ),n ∈ N converges to x = (x ) for any n n−1 x ∈ H at linear rate. Moreover, as the set of contractions on H is open, the iteration is stable under small perturbations of the network parameters. As we show in Sect. 5.3 below, the latter property allows us to solve entire classes of variational inequality problems using only one ProxNet with fixed parameters. 2.1 Proximal activations Definition 2.1 Let i ∈{0, ... ,m} be a fixed index, ψ : H → R ∪{∞} and dom(ψ ):= i i i {x ∈ H |ψ (x) < ∞}.Wedenoteby (H ) the set of all proper, convex, lower semi- i i 0 i continuous functions on H , that is (H ):= 0 i ψ : H → R∪{∞} lim inf ψ (y) ≥ ψ (x) for all x ∈ H and dom(ψ ) =∅ . i i i i i i y→x For any ψ ∈ (H ), the subdifferential of ψ at x ∈ H is i 0 i i i ∂ψ (x):={v ∈ H | (y − x, v) + f (x) ≤ f (y) for all y ∈ H }⊂ H,x ∈ H , i i i i i and the proximity operator of ψ is x − y prox : H → H,x → argmin ψ (y) + . (4) i i i y∈H It is well-known that prox is a firmly nonexpansive operator, i.e., 2prox − id is non- ψ ψ i i expansive, see, e.g., [2, Proposition 12.28]. As outlined in [5, Section 2], there is a natural relation between proximity operators and activation functions in neural networks: Virtu- ally any commonly used activation function such as rectified linear unit, tanh, softmax, etc., may be expressed as proximity operator on H = R , d ∈ N, for an appropriate i Schwab, Stein Res Math Sci (2022) 9:36 Page 5 of 35 36 ψ ∈ (H )(see[5, Section 2] for examples). We consider a set of particular proximity i 0 i operators given by A(H ):={R = prox | ψ ∈ (H ) such that ψ is minimal at 0 ∈ H }, (5) i i i 0 i i i cf. [5, Definition 2.20]. Apart from being continuous and nonexpansive, any R ∈ A(H ) i i satisfies R (0) = 0[5, Proposition 2.21]. Therefore, in the case H = R, the elements in i i A(R) are also referred to as stable activation functions,cf. [10, Lemma 5.1]. With this in mind, we formally define proximal neural networks, or ProxNets. Definition 2.2 Let  : H → H be the m-layer neural network model in (2). If R ∈ 0 i A(H ) holds for any i ∈{1, ... ,m},  is called a proximal neural network or ProxNet. 2.2 ProxNet calculus Before investigating the relation of  in (3) to variational inequality models, we record several useful definitions and results for NN calculus in the more general model  from Eq. (2). (j) (j) (j) Definition 2.3 Let j ∈{1, 2}, m ∈ N,let H , H , ... , H be separable Hilbert spaces j m (1) (2) such that H = H ,and let  be m -layer ProxNets as in (2) given by j j (j) (j) (j) (j) (j) (j) : H → H ,x → W T ◦ ··· ◦ T (x) + b . j m 0 m +1 1 m+1 The concatenation of  and  is defined by the map 1 2 (2) (1) •  : H → H ,x → ( ◦  )(x). (6) 1 2 1 2 (j) Remark 2.4 Due to W ≡ 0, there are no skip connections after the last proximal acti- vation in  ; hence,  •  is in fact a ProxNet as in (2)with2m layers and no skip j 1 2 connection. (j) (j) (j) Definition 2.5 Let m ∈ N, j ∈{1, 2},let H , H , ... , H be separable Hilbert spaces (1) (2) such that H = H ,and let  be m-layer ProxNets as in (2) given by 0 0 (j) (j) (j) (j) (j) (j) (j) : H → H ,x → W x + W T ◦ ··· ◦ T (x) + b . j m 0 0 m+1 1 m+1 (1) (2) The parallelization of  and  is given for H := H = H by 1 2 0 0 0 (1) (2) P( ,  ): H → H ⊕ H ,x → ( (x),  (x)). 1 2 0 1 2 Proposition 2.6 The parallelization P( ,  ) of two ProxNets  and  as in Defini- 1 2 1 2 tion 2.5 is a ProxNet. (j) (1) (2) (j) Proof We set H := H for j ∈{1, 2},fix i ∈{1, ... ,m} and observe that H ⊕ H m+1 i i equipped with the scalar product (·,·) := (·,·) + (·,·) is again a separable (1) (2) (1) (2) H ⊕H H H i i i i 36 Page 6 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Hilbert space. We define (1) (2) (1) (2) W : H → H ⊕ H,x → (W x, W x), 0 0 0 0 (1) (2) (1) (2) W : H → H ⊕ H,x → (W x, W x), 1 0 1 1 1 1 (1) (2) (1) (2) (1) (2) W : H ⊕ H → H ⊕ H , (x, y) → (W x, W y),i ∈{2, ... ,m + 1}, i−1 i−1 i i i i (1) (2) (1) (2) b := (b ,b ) ∈ H ⊕ H ,i ∈{1, ... ,m + 1}, i i i i (1) (2) (1) (2) (1) (2) R : H ⊕ H → H ⊕ H , (x, y) → (R x, R y),i ∈{0, 1, ... ,m}. i i i i i i (j) (j) Note that all W are bounded, linear operators. Moreover, if R = prox ∈ A(H ) (j) i i (j) (j) (1) (2) holds for ψ ∈ (H )and j ∈{1, 2}, then R = prox , where ψ ∈ (H ⊕ H )is 0 i i 0 i i i i i (1) (2) (1) (2) defined by ψ (x, y):= ψ (x) + ψ (y). Hence, R ∈ A(H ⊕ H ) and it holds that i i i i i i (1) (2) P( ,  ): H → H ⊕ H ,x → W x + W (T ◦ ··· ◦ T )(x) + b , 1 2 0 0 m+1 m 1 m+1 with T := R (W ·+b ) for i ∈{1, ... ,m}, which shows the claim. i i i i 3 ProxNets and variational inequalities 3.1 Contractive ProxNets We formulate sufficient conditions on the neural network model in (3) so that  : H → H is a contraction. The associated fixed-point iteration converges to the unique solution of a variational inequality, which is characterized in the following. Assumption 3.1 Let  be a ProxNet as in (3)with m ∈ N layers such that W ∈ L(H , H ), b ∈ H ,and R ∈ A(H ) for all i ∈{1, ... ,m}.Itholds that λ ∈ (0, 2) i−1 i i i i i and the operators W satisfy L := W  < min(1, 2/λ − 1). i L(H ,H ) i−1 i i=1 0 k+1 k Theorem 3.2 Let  be as in (3), let x ∈ H and define the iteration x := (x ),k ∈ N . k 0 Under Assumption 3.1,the sequence (x ,k ∈ N ) converges for any x ∈ H to the unique fixed-point x ∈ H. For any finite number k ∈ N, the error is bounded by 0 0 (x ) − x ∗ k k x − x  ≤ L ,L :=|1 − λ|+ λL ∈ [0, 1). (7) H ,λ ,λ 1 − L ,λ It holds that ∗ ∗ ∗ ∗ ∗ ∗ (x , ... ,x ):= (T x , (T ◦ T )x , ... , (T ◦ ··· ◦ T )x ,x ) ∈ H × ··· × H 1 2 1 m−1 1 1 m 1 m is the unique solution to the variational inequality problem: find x ∈ H , ... ,x = x ∈ 1 1 0 m H ,suchthat W x + b − x ∈ ∂ψ (x ),i ∈{1, ... ,m}. (8) i i−1 i i i i Schwab, Stein Res Math Sci (2022) 9:36 Page 7 of 35 36 Moreover, x is bounded by ⎛ ⎞ m m ∗ ∗ ⎝ ⎠ x  ≤ C W  b  , H j i H L(H ,H ) j−1 j i i=1 j=i+1 ⎨ 1 < ∞, λ ∈ (0, 1], ∗ 1−L C := < ∞, λ ∈ (1, 2). 2−λ(1+L ) Proof By the non-expansiveness of R : H → H for i ∈{1, ... ,m}, it follows for any i i i x, y ∈ H (x) − (y) ≤|1 − λ|x − y + λ(T ◦ ··· ◦ T )x − (T ◦ ··· ◦ T )y H H m 1 m 1 H ≤|1 − λ|x − y + λ(W ◦ (T ◦ ··· ◦ T ))x − (W ◦ (T ◦ ··· ◦ T ))y m m−1 1 m m−1 1 H ≤|1 − λ|x − y + λW  (T ◦ ··· ◦ T )x − (T ◦ ··· ◦ T )y m L(H ,H ) m−1 1 m−1 1 H m−1 m m−1 ≤|1 − λ|x − y + λ W  x − y H i L(H ,H ) H i−1 i 0 i=1 = (|1 − λ|+ λL )x − y . :=L ,λ As λ ∈ (0, 2) and L < min(1, 2/λ− 1) by Assumption 3.1, it follows that L < 1, hence, ,λ : H → H is a contraction. Existence and uniqueness of x ∈ H and the first part of the claim then follow by Banach’s fixed-point theorem for any initial value x ∈ H. By [2, Proposition 16.44], it holds for any i ∈{1, ... ,m}, x ,y ∈ H and ψ ∈ (H ) i i i i 0 i that x = prox (y ) ⇔ y − x ∈ ∂ψ (x ). i i i i i i ∗ ∗ ∗ ∗ ∗ Now, let x := x and x := (T ◦ ··· ◦ T )(x ) for i ∈{1, ... ,m}. This yields (x ) = i 1 0 i 0 ∗ ∗ ∗ ∗ ∗ (1 − λ)x + λx = x and hence, x = x . Recalling that R = prox with ψ ∈ (H ) i i 0 i m m ψ for all i ∈{1, ... ,m}, it hence follows that ∗ ∗ ∗ W x + b − x ∈ ∂ψ (x ), i i i i−1 i i cf. [5, Proposition 4.3]. Finally, to bound x ,weuse that ∗ ∗ ∗ x  ≤(x ) − (0) +(0) ≤ L x  + λ(T ◦ ··· ◦ T )(0) . H H H ,λ H m 1 H m 36 Page 8 of 35 Schwab, Stein Res Math Sci (2022) 9:36 As R ∈ A(H ), it holds R (0) = 0 and therefore, R (x) ≤x for all x ∈ H , which i i i i H H i i i in turn shows (T ◦ ··· ◦ T )(0) ≤W  (T ◦ ··· ◦ T )(0) +b m 1 H m L(H ,H ) m−1 1 H m H m m m m−1 m ≤W L(H ,H ) m m · W  (T ◦ ··· ◦ T )(0) +b  +b m−1 L(H ,H ) m−2 1 H m−1 H m H m−2 m−1 m−2 m−1 m ⎛ ⎞ m m ⎝ ⎠ ≤ W  b  . j L(H ,H ) i H j−1 j i=1 j=i+1 The claim follows with L < min(1, 2/λ − 1), since λ(1 − L ) > 0, λ ∈ (0, 1], 1 − L = ,λ 2 − λ(1 + L ) > 0, λ ∈ (1, 2). 3.2 Perturbation estimates for ProxNets We introduce a perturbed version of the ProxNet  in (3) in this subsection. Besides changing the network parameters W ,b and R , we also augment the input space H and i i i allow an architecture that approximates each nonlinear operator T itself by a multilayer network. These changes allow us to consider ProxNet as an approximate data-to-solution operator for infinite-dimensional variational inequalities and to control perturbations of the network parameters. For instance, we show in Example 3.4 that augmented ProxNets mimic the solution operator to Problem (8), that maps the bias vectors b , ... ,b to the 1 m solution x , ... ,x . 1 m Let H , ... , H be arbitrary separable Hilbert spaces and let H := H . Then, for 0 m−1 0 i ∈{0, ... ,m−1} the direct sum H ⊕ H equipped with the inner product (·,·) +(·,·) i i H i H is again a separable Hilbert space. For notational convenience, we set H :={0 ∈ H } m m and use the identification H ⊕ H = H = H. We consider the ProxNet m m m : H ⊕ H → H, (x, x) → (1 − λ)x + λ(T ◦ ··· ◦ T )(x, x), (9) m 1 where we allow that the operators T are itself multi-layer ProxNets: For any i ∈{1, ... ,m}, (i) (i) (i) (i) let m ∈ N and let H := H ⊕ H , H , ... , H ,H := H ⊕ H be separable i i−1 i−1 m i i 0 1 m −1 i (i) (i) (i) (i) Hilbert spaces. For j ∈{1, ... ,m }, consider the operators T (·) = R (W ·+b ) given i i j j j j i i i i by (i) (i) (i) (i) (i) (i) (i) R ∈ A(H ),W ∈ L(H , H ),b ∈ H . j j j j −1 j j j i i i i i i i We then define T as (i) (i) T : H ⊕ H → H ⊕ H , (x , x ) → (T ◦ ··· ◦ T )(x , x ), i i−1 i−1 i i i−1 i−1 i−1 i−1 m 1 which in turn determines  in (9). By construction,  is a ProxNet of the form (2)with m ≥ m layers. As compared to , we augmented the input and intermediate spaces i=1 by H . The composite structure of the maps T allows to choose input vectors x ∈ H i i i−1 i−1 Schwab, Stein Res Math Sci (2022) 9:36 Page 9 of 35 36 such that the first component of T (x , x ) approximates T (x ) uniformly on a subset i i−1 i−1 i i−1 of H . As we show in Sect. 5.3 below, this enables us to solve large classes of variational i−1 inequalities with only one fixed ProxNet , that in turn approximates a data-to-solution operator, instead of employing different fixed maps  : H → H for every problem. To formulate reasonable assumptions on , we denote for any i ∈{1, ... ,m − 1} by P : H ⊕ H → H , (x , x ) → x , H i i i i i i P : H ⊕ H → H , (x , x ) → x i i i i i i the projections to the first and second component for an element in H ⊕ H , respectively. i i (i) Moreover, we define the closed ball B :={x ∈ H |x  ≤ r}⊂ H with radius r > 0. r i i i H i Assumption 3.3 Let  and  be proximal neural networks defined as in Eqs. (3)and (9), respectively. There are constants L ∈ (0, 1), δ ≥ 0and ≥ ≥ > 0 such that 1 0 2 1.  satisfies Assumption 3.1 with λ ∈ (0, 1] and L ≤ L ∈ (0, 1). 2. It holds that ⎛ ⎞ ⎛ ⎞ i m m ⎝ ⎠ ⎝ ⎠ max W  + W  (b  + δ) ≤ , j L(H ,H ) 0 j L(H ,H ) i H 1 j−1 j j−1 j m i∈{0,1,...,m} j=1 i=1 j=i+1 ⎛ ⎞ m m ⎝ ⎠ W  b  ≤ (1 − L) , j L(H ,H ) i H 2 j−1 j i=1 j=i+1 as well as ⎛ ⎞ m m ⎝ ⎠ + W  ≤ . 2 j 0 L(H ,H ) j−1 j (1 − L) i=1 j=i+1 (i−1) 3. There is a vector x ∈ H , such that for i ∈{1, ... ,m},any x ∈ B ⊂ H and 0 0 i−1 i−1 x := P T (x , x )itholds i  i i−1 i−1 T (x ) − P T (x , x ) ≤ δ. i i−1 H i i−1 i−1 H i i Before we derive error bounds, we provide an example to motivate the construction of and Assumption 3.3. Example 3.4 (Bias-to-solution operator) Let  be as in Assumption 3.1 with m = 2layers and network parameters R ,W ,b for i ∈{1, 2}. We construct a ProxNet  that takes the i i i bias vectors b ,b of  as inputs to represent  for any choice of b ∈ H and therefore, 1 2 i i may be concatenated to map any choice of b ,b to the respective solution (x ,x )of(8). 1 2 1 2 In other words, we approximate the bias-to-solution operator O : H ⊕ H → H ⊕ H , (b ,b ) → (x ,x ). bias 1 2 1 2 1 2 1 2 36 Page 10 of 35 Schwab, Stein Res Math Sci (2022) 9:36 To this end, we set H = H ⊕ H , H = H , m = m = 1, b = 0 ∈ H ⊕ H and 0 1 2 1 2 1 2 i,1 i i (1) W : H ⊕ H ⊕ H → H ⊕ H , (x, x ,x ) → (W x + x ,x ) 1 2 1 2 1 2 1 1 2 (2) W : H ⊕ H → H , (x ,x ) → W x + x , 1 2 2 1 2 2 1 2 (1) R : H ⊕ H → H ⊕ H , (x ,x ) → R (x ) + x , 1 2 1 2 1 2 1 1 2 (2) R : H → H,x → R (x ). 2 2 2 2 2 (1) (1) Note that R = prox (1) with ψ (x ,x ):= ψ (x ) for any (x ,x ) ∈ H ⊕ H , where ψ 1 2 1 1 1 2 1 2 1 1 1 (1) determines R = prox . Hence, R ∈ A(H ⊕ H ), and it follows with x := (b ,b ) ∈ 1 1 1 0 1 2 ψ 1 H ⊕ H for any x ∈ H and x ∈ H that 1 2 1 1 (1) (1) T (x) = R (W x + b ) = P (R (W x + b ),b ) = P R (W (x, x )) = P T (x, x ), 1 1 1 1 H 1 1 1 2 H 0 H 1 0 1 1 1 1 1 (2) (2) (2) (2) T (x) = R (W x + b ) = R (W (x ,b )) = P R (W (x ,P T (x , x )). 2 2 2 1 2 1 2 H 1 1 1 0 1 1 2 1 1 H Therefore, the last part of Assumption 3.3 holds with δ = 0 for arbitrary large > 0and hence, the constants , , do not play any role in this example. The generalization 0 1 2 to m > 2 layers follows by a similar construction of . Now, let (x ,x ) be the solution to (8) for any choice (b ,b ) ∈ H ⊕ H . It follows from 1 2 1 2 1 2 Theorem 3.2 that the operator O : H ⊕ H → H, (b ,b ) → (·,b ,b ) • ··· • (·,b ,b )(x ) bias 1 2 1 2 1 2 1 2 k times satisfies x ≈ O (b ,b )and x ≈ T (O (b ,b )) for any fixed x ∈ H and any tuple 2 bias 1 2 1 1 bias 1 2 (b ,b ) ∈ H ⊕ H , for a sufficiently large number k of concatenations of (·,b ,b ). 1 2 1 2 1 2 The augmented ProxNet  may also be utilized to consider parametric families of obstacle problems, as shown in Example 4.4 below. Therein, the parametrization is with respect to the proximity operators R instead of the bias vectors b , and we construct an i i approximate obstacle-to-solution operator in the fashion of Example 3.4. In the finite- dimensional case (where the linear operators W correspond to matrices), the input of may even be augmented by a suitable space of operators, see Sect. 5.3 below for a detailed discussion. We conclude this section with a perturbation estimate that allows us to approximate the fixed-point of  by the augmented NN . Theorem 3.5 Let  and  be proximal neural networks as in Eqs. (3)and (9) that satisfy Assumption 3.3, and denote by x ∈ H the unique fixed-point of  from Theorem 3.2.Let (0) 0 k+1 x ∈ B be arbitrary, let  x be as in Assumption 3.3 and define the sequence  x := k 0 0 ( x , x ) for k ∈ N ,where x := x . Then, there is a constant C > 0 which is independent 0 0 of δ> 0 and x ,suchthatfor anyk ∈ N,itholds ∗ k k x − x  ≤ C L + δ , where L := (1 − λ) + λL < 1. λ Schwab, Stein Res Math Sci (2022) 9:36 Page 11 of 35 36 (0) Proof Let x ∈ B and let  x ∈ H be as in Assumption 3.3.Wedefine v := x, v := 0 0 0 i P (T ◦···◦ T )(x, x ) ∈ H for i ∈{1, ... ,m − 1},and v := (T ◦···◦ T )(x, x ) ∈ H. H i 1 0 i m m 1 0 With x := P T (x , x ) and the convention that P = id, we obtain the recursion i  i i−1 i−1 H formula v = P T (v , x ),i ∈{1, ... ,m}. (10) i H i i−1 i−1 We now show by induction that v  ≤ for i ∈{0, ... ,m}. By Assumption 3.3 it i H 1 holds v  =x 0 H H ⎛ ⎞ ⎛ ⎞ 0 0 0 ⎝ ⎠ ⎝ ⎠ = W  + W  (b  + δ) j L(H ,H ) 0 L(H ,H ) j H j−1 j −1 j j=1 j=1 =j+1 ≤ . Now, let ⎛ ⎞ ⎛ ⎞ i i i ⎝ ⎠ ⎝ ⎠ v  ≤ W  + W  (b  + δ) i H j 0 j H L(H ,H ) L(H ,H ) i j−1 j −1 j j=1 j=1 =j+1 hold for a fixed i ∈{0, ... ,m − 1}. Assumption 3.3 yields with Eq. (10) T (v ) − v  =T (v ) − P T (v , x ) ≤ δ. i+1 i i+1 H i+1 i H i+1 i 0 H i+1 i+1 i+1 Using R (x) ≤x for x ∈ H then yields together with the triangle i+1 H H i+1 i+1 i+1 inequality and the induction hypothesis v  ≤ δ +T (v ) i+1 H i+1 i H i+1 i+1 ≤ δ +W  v  +b i+1 i H i+1 H L(H ,H ) i i+1 i i+1 ⎛ ⎞ ⎛ ⎞ i+1 i+1 i+1 ⎝ ⎠ ⎝ ⎠ ≤ W  + W  (b  + δ) j L(H ,H ) 0 L(H ,H ) j H j−1 j −1 j j=1 l=1 j=l+1 ≤ , (i) and hence, v ∈ B for all i ∈{0, ... ,m}. With Assumption 3.3 and Eq. (10), we further (0) obtain for each x ∈ B (x) − (x, x ) 0 H =(T ◦ ··· ◦ T )(x) − v m 1 m H ≤(T ◦ ··· ◦ T )(x) − T (v ) +T (v ) − T (v , x ) m 1 m m−1 H m m−1 m m−1 m−1 H ≤W  (T ◦ ··· ◦ T )(x) − v  + δ, m L(H ,H ) m−1 1 m−1 H m−1 m m−1 36 Page 12 of 35 Schwab, Stein Res Math Sci (2022) 9:36 and by iterating this estimate over i ∈{1, ... ,m} ⎛ ⎞ m m ⎝ ⎠ (x) − (x, x ) ≤ λδ W  =: λδC . (11) 0 H j L(H ,H ) m j−1 j i=1 j=i+1 ∗ k k−1 Now, let x ∈ H be the unique fixed-point of  as in Theorem 3.2,let x = (x )and k k−1 0 0 0 x = ( x , x ) for any k ∈ N and a given initial value x =  x ∈ H with x  ≤ . 0 H 2 We obtain as in the proof of Theorem 3.2 1 0 x  ≤(x ) − (0) +(0) H H H ⎛ ⎞ m m ⎝ ⎠ ≤ L x  + λ W  b ,λ H j L(H ,H ) i H j−1 j i=1 j=i+1 ⎛ ⎛ ⎞ ⎞ (12) m m ⎝  ⎝ ⎠ ⎠ ≤ (1 − λ) + λ L + W  b 2 2 j i H L(H ,H ) i j−1 j i=1 j=i+1 ≤ , where we have used that L = (1− λ)+ λL ≤ (1− λ)+ λL and Assumption 3.3. Hence, ,λ k k we have x  ≤ inductively for all k ∈ N. In the next step, we show that  x  ≤ H 2 H 0 by induction over k. First, we obtain with x ≤ ≤ ,(11)and (12) that 2 0 1 0 0 0 0 x  =(x , x ) ≤(x , x ) − (x ) +(x ) ≤ λδC + . H 0 H 0 H H  2 Thus, x  ≤ follows with Assumption 3.3 on the relation of and as λ(1−L) < H 0 0 2 k−1 k k 1. Using the induction hypothesis  x − x  ≤ λδC L for a fixed k ∈ N, j=0 ,λ x  ≤ ,and L ≤ L := (1 − λ) + λL < 1 yields similarly H 2 ,λ λ k+1 k k k k k x  ≤( x , x ) − ( x ) +( x ) − (x ) +(x ) H 0 H H H k k ≤ λδC + L  x − x  + ,λ H 2 ≤ λδC L + , j=0 and hence,  x  ≤ λδC /(λ(1 − L)) + ≤ holds by induction for all k ∈ N.We H  2 0 apply the bounds from Theorem 3.2 and (11) and conclude the proof by deriving ∗ k ∗ k k−1 k−1 k−1 k−1 x − x ≤x − x +(x ) − ( x )+( x ) − ( x , x ) 1 0 x − x k k−1 k−1 ≤ L + L x − x  + λδC ,λ H ,λ 1 − L ,λ k−1 0 0 (x ) − x ≤ L + λδC L 1 − L j=0 max(2 , λC ) ≤ L + δ . 1 − L Schwab, Stein Res Math Sci (2022) 9:36 Page 13 of 35 36 4 Variational inequalities in Hilbert spaces In the previous sections, we have considered a ProxNet model and derived the associated variational inequalities. Now, we use the variational inequality as starting point and derive suitable ProxNets for its (numerical) solution. Let (H, (·,·) ) be a separable Hilbert space with topological dual space denoted by H ,and let ·.· be the associated dual pairing. H H Let a : H × H → R be a bilinear form, let f : H → R be a functional, and let K ⊂ H be a subset of H. We consider the variational inequality problem find u ∈ K : a(u, v − u) ≥ f (v − u), ∀v ∈ K. (13) Assumption 4.1 The bilinear form a : H × H → R is bounded and coercive on H, i.e., there exists constants C ,C > 0 such that for any v, w ∈ H it holds − + a(v, w) ≤ C v w and a(v, v) ≥ C v . + H H − Moreover, f ∈ H and K ⊂ H is nonempty, closed and convex. Problem (13) arises in various applications in the natural sciences, engineering and finance. It is well-known that there exists a unique solution u ∈ K under Assumption 4.1, see, e.g., [14, Theorem A.3.3] for a proof. We also mention that well-posedness of Prob- lem (13) is ensured under weaker conditions as Assumption 4.1; in particular, the coerciv- ity requirement may be relaxed as shown in [8]. For this article, however, we focus on the bounded and coercive case in order to obtain numerical convergence rates for ProxNet approximations. 4.1 Fixed-point approximation by ProxNets Theorem 4.2 Let Assumption 4.1 hold, and define H := H := H. Then, there exists 1 0 a one-layer ProxNet  as in Eq. (3)suchthatu ∈ K is the unique fixed-point of . 0 k k−1 Furthermore, for a given u ∈ H define the iteration u := (u ),k ∈ N. Then, there are constants L ∈ (0, 1) and C = C(u ) > 0 such that ,λ k k u − u ≤ CL ,k ∈ N. (14) ,λ Proof We recall the fixed-point argument, e.g., in [14, Theorem A.3.3], for proving exis- tence and uniqueness of u since it is the base for the ensuing ProxNet construction: Assumption 4.1 ensures that a(v,·),f ∈ H for any v ∈ H. The Riesz representation theorem yields the existence of A ∈ L(H)and F ∈ H such that for all v, w ∈ H (Av, w) = a(v, w)and (F, v) = f (v). H H Since K is closed convex, the H-orthogonal projection P : H → K onto K is well-defined and for any ω> 0 there holds u solves (13) ⇐⇒ u = P (ω(F − Au) + u). Hence, u is a fixed-point of the mapping T : H → H,v → P (ω(F − Av) + v). ω K 36 Page 14 of 35 Schwab, Stein Res Math Sci (2022) 9:36 By Assumption 4.1, it is now possible to choose ω> 0 sufficiently small, so that T is a contraction on H, which proves existence and uniqueness of u. The optimal relaxation ∗ 2 2 parameter in terms of the bounds C ,C is ω = C /C , leading to T ∗ = − + − ω L(H) 2 2 (1 − C /C ) < 1, see, e.g., [14, Theorem A.3.3]. 1 2 To transfer this constructive proof of existence and uniqueness of solutions to the ProxNet setting, we denote by ι the indicator function of K given by 0, if v ∈ K, ι : H → (−∞,∞],v → ∞, otherwise. Since K is closed convex, it holds that ι ∈ (H)and prox = P (cf. [2,Examples K 0 K 1.25 and 12.25]). Now, let m = 1, H = H, W := I − ωA ∈ L(H), b := ωF ∈ H,and 1 1 1 R := prox , where ω> 0 is such that I − ωA is a contraction. The ProxNet emulation  of the contraction map reads: for a parameter λ ∈ (0, 1], : H → H,v → (1 − λ)v + λ R (W v + b ) . 1 1 1 :=T (v) Since W  < 1, Assumption 3.1 is satisfied for every λ ∈ (0, 1]. Theorem 3.2 yields 1 L(H) k k−1 0 that the iteration u := (u ) converges for any u ∈ H to a unique fixed-point u ∈ H with error bounded by (14)and L := (1 − λ) + λW  ∈ (0, 1). Since ,λ 1 L(H) (v) = (1 − λ)v + λT (v), it follows that u is in turn the unique fixed-point of T , hence 1 1 u = u , which proves the claim. Remark 4.3 In the fashion of Example 3.4, we may construct an augmented ProxNet : H ⊗ H → H such that (v, F) = (v) for any v ∈ H, where F ∈ H is the Riesz representer of f ∈ H in Problem (13). The only difference is that F has to be multiplied with ω in the first linear transform to obtain b = ωF instead of F as bias vector. The parameters of  in this construction are independent of F; hence, Theorem 3.5 yields that for any f ∈ H (resp. F ∈ H)and x ∈ H it holds k k u − u ≤ CL ,k ∈ N, ,λ k k−1 where u := (u ,F). The previous remark shows that one fixed ProxNet is sufficient to solve Problem (13) for any f ∈ H . A similar result is achieved if the set K ⊂ H associated Problem (13)is parameterized by a suitable family of functions: Example 4.4 (Obstacle-to-solution operator) Let H be a Hilbert space of real-valued func- d 2 tions over a domain D ⊂ R such that C(D) ∩ H is a dense subset, e.g., H = L (D) or H = H (D), and let K :={v ∈ H| v ≥ g almost everywhere} for a sufficiently smooth function g : D → R. With this choice of K,(13)isan obstacle problem and P (v) = max(v, g) holds for any v ∈ H∩ C(D). We construct a ProxNet approximation to the obstacle-to-solution operator O : H → H,g → u corresponding to Problem (13) obs with K ={v ∈ H| v ≥ g almost everywhere}. Schwab, Stein Res Math Sci (2022) 9:36 Page 15 of 35 36 Assume (v) = P (W v + b ) for W ∈ L(H)and b ∈ H are as in Theorem 4.2 and K 1 1 1 1 let K :={v ∈ H| v ≥ 0 almost everywhere}.ToobtainaProxNetthatusesthe obstacle g ∈ H as input, we define (1) (1) : H ⊕ H → H, (v, g) → T (v, g) = (T ◦ T )(v, g) 2 1 (1) (1) (1) (1) via T (v, g):= R (W (v, g) + b ) which are, for j ∈{1, 2},definedby j j j j 1 1 1 1 (1) W : H ⊕ H → H ⊕ H, (v ,v ) → (W v − v ,v ), 1 2 1 1 2 2 (1) (1) (1) b := (b , 0) ∈ H ⊕ H,R := prox (1), ψ (v, g):= ι (v), 1 K 1 1 1 (1) (1) (1) W : H ⊕ H → H, (v ,v ) → v + v ,b := 0 ∈ H,R := id ∈ A(H). 1 2 1 2 2 2 2 (1) (1) (1) Note that this yields W ∈ L(H ⊕ H), W ∈ L(H), and R (v ,v ) = (P v ,v ) 1 2 K 1 2 1 2 1 for all v ,v ∈ H. It now follows for any given v, g ∈ H and K :={v ∈ H| v ≥ 1 2 g almost everywhere} (v) = P (W v + b ) K 1 1 = P (W v + b − g) + g K 1 1 (1) (1) (1) = R (W (P (W v + b − g),g) + b ) K 1 1 2 2 2 (1) = T (P (W v + b − g),g) K 1 1 2 0 (1) (1) (1) (1) = T ◦ (R (W (v, g) + b )) 2 1 1 1 = (v, g). As in Example 3.4, we concatenate  to obtain for a fixed choice of x ∈ H the operator O : H → H,g → (·,g) • ··· • (·,g) (x ). obs Convergence of O (g)to u for any g ∈ H (with arbitrary a-priori fixed x ∈ H)witha obs contraction rate that is uniform with respect to g ∈ H is again guaranteed as the number of concatenations tends to infinity. Therefore, as in Example 3.4, there exists one ProxNet that approximately solves a family of obstacle problems with obstacle ‘parameter’ g ∈ H. A combination of the ProxNets from Remark 4.3 and Example 4.4 enablesustoconsider both, f and K in (13), as input variables of a suitable NN  : H ⊕ H ⊕ H → H. This allows, in particular, to construct an approximation of the data-to-solution operator to Problem (13)thatmaps(F, g) ∈ H ⊕ H to u. 5 Example: linear matrix complementarity problems Common examples for Problem (13) arise in financial and engineering applications, where the bilinear form a : H × H → R stems from a second-order elliptic or parabolic differ- s s ential operator. In this case, H ⊂ H (D), where H (D) is the Sobolev space of smoothness s > 0 with respect to the spatial domain D ⊂ R , n ∈ N. Coercivity and boundedness of a as in Assumption 4.1 often arise naturally in this setting. To obtain a computationally tractable problem, it is necessary to discretize (13), for instance by a Galerkin approxima- tion with respect to a finite dimensional subspace H ⊂ H. To illustrate this, we assume d 36 Page 16 of 35 Schwab, Stein Res Math Sci (2022) 9:36 that dim(H ) = d ∈ N is a suitable finite-dimensional subspace with basis{v , ... ,v } and d 1 d consider an obstacle problem with K ={v ∈ H| v ≥ g almost everywhere} for a smooth function g ∈ H. Following Example 4.4, we introduce the set K :={v ∈ H| v ≥ 0 almost everywhere} and note that Problem (13) is equivalent to finding u = u + g ∈ K with u ∈ K such that: a(u ,v − u ) ≥ f (v − u ) − a(g, v − u ), ∀v ∈ K . (15) 0 0 0 0 0 0 0 5.1 Discretization and matrix LCP Any element v ∈ H may be expanded as v = w v for a coefficient vector w ∈ R . d i i i=1 To preserve non-negativity of the discrete approximation to (15), we assume that v ∈ K if and only if the basis coordinates are nonnegative, i.e., if w ∈ R . This property holds, for ≥0 instance, in finite element approaches. We write the discrete solution as u = x v . i i i=1 Then, u ∈ K if and only if x ∈ R . Consequently, the discrete version of (15)isto ≥0 d   d find x ∈ R :(y − x) Ax ≥ (y − x) c, ∀y ∈ R , (16) ≥0 ≥0 d×d d where the matrix A ∈ R and the vector c ∈ R are given by A := a(v ,v)and c := f, v  − a(g, v ),i,j ∈{1, ... ,d}. (17) ij j i i i H i Problem (16) is equivalent to the linear complementary problem (LCP) to find x ∈ R d×d d such that for A ∈ R and c ∈ R as in (17)itholds (18) Ax ≥ c, x ≥ 0,x (Ax − c) = 0, see, e.g., [14, Lemma 5.1.3]. If a : H × H → R is bounded and coercive as in Assump- tion 4.1, it readily follows that 2  2 d C x ≤ x Ax ≤ C x ,x ∈ R , (19) − + 2 2 where the constants C ≥ C > 0stemfromAssumption 4.1 and · is the Euclidean + − d d norm on R . This implies in particular that the LCP (18) has a unique solution x ∈ R , see [23, Theorem 4.2]. Equivalently, we may regard Problem (16), resp. (18), as varia- tional inequality on the finite-dimensional Hilbert space R equipped with the Euclidean scalar product (·,·) . Well-posedness then follows directly from Assumption 4.1 with the d d d identification H = R and the discrete bilinear form a : R × R → R, (x, y) → x Ay. 5.2 Solution of matrix LCPs by ProxNets The purpose of this section is to show that several well-known iterative algorithms to solve (finite-dimensional) LCPs may be recovered as particular cases of ProxNets in the setting of Sect. 2.Tothisend,wefix d ∈ N and use the notation H := R for convenience. We denote by {e , ... ,e }⊂ R the canonical basis of H. To approximately solve LCPs 1 d by ProxNets, and to introduce a numerical LCP solution map, we introduce the scalar and vector-valued Rectified Linear Unit (ReLU) activation function. Schwab, Stein Res Math Sci (2022) 9:36 Page 17 of 35 36 Definition 5.1 The scalar ReLU activation function  is defined as  : R → R,x → max(x, 0). The component-wise ReLU activation in R is given by (d) d d : R → R ,x → ((x, e ) )e . (20) i H i i=1 Remark 5.2 The scalar ReLU activation function  satisfies  = prox with ι ∈ [0,∞) [0,∞) (d) d (R)(see[5, Example 2.6]). This in turn yields  ∈ A(R ) for any d ∈ N by [5, Proposition 2.24]. Example 5.3 (PJORNet) Consider the LCP (18) with matrix A and triangular decompo- sition A = D + L + U, (21) d×d d×d where D ∈ R contains the diagonal entries of A,and L, U ∈ R are the (strict) lower and upper triangular parts of A, respectively. The projected Jacobi (PJOR) overrelaxation method to solve LCP (18) is given as: Algorithm 1 Projected Jacobi overrelaxation method 0 d Given: initial guess x ∈ R , relaxation parameter ω> 0 and tolerance ε> 0. 1: for k = 0, 1, 2, ... do k+1 −1 k −1 2: x = max (I − ωD A)x + ωD c, 0 k+1 k 3: if x − x  <ε then k+1 4: return x 5: end if 6: end for The max-function in Algorithm 1 acts component-wise on each entry of a vector in R . Hence, one iteration of the PJOR may be expressed as a ProxNet in Model (3)with m = 1, (d) λ = 1and  from Eq. (20)as d d (d) −1 −1 : R → R ,x → T (x):=  ((I − ωD A) x + ωD c). PJOR 1 d :=b =:W 1 If A satisfies (19) for constants C ≥ C > 0, it holds that + − 2 −1 2 W  =I − ωD A 1 d L(H) −1  2 −1  −1 = sup x x − ωx D (A + A)x + ω (xD A) D Ax x∈R ,x =1 1 1 2 2 ≤ 1 − 2ω min C + ω max A i∈{1,...,d} A i∈{1,...,d} ii A ii 2 2 ≤ 1 − 2ω + ω =: (ω). ∗ 3 2 ∗ The choice ω := C /(C A ) minimizes  such that (ω ) < 1. Moreover, (0) = 1, − 2 ∗ ∗ is strictly decreasing on [0, ω ], and increasing for ω> ω . Hence, there exists ω> 0 36 Page 18 of 35 Schwab, Stein Res Math Sci (2022) 9:36 d d such that for any ω ∈ (0, ω) the mapping  : R → R is a contraction. An application PJOR of Theorem 3.2 then shows that Algorithm (1) converges linearly for suitable ω> 0and any initial guess x . In the special case that A is strictly diagonally dominant, choosing ω = 1 is sufficient to ensure convergence, i.e., no relaxation before the activation is necessary. Example 5.4 (PSORNet) Another popular algorithm to numerically solve LCPs is the projected successive overrelaxation (PSOR) method in Algorithm 2. Algorithm 2 Projected successive overrelaxation algorithm 0 d Given: initial guess x ∈ R , relaxation parameter ω> 0 and tolerance ε> 0. 1: for k = 0, 1, 2, ... do 2: for i = 1, 2, ... ,d do k+1 1 i−1 k+1 d k 3: y = c − A x − A x i ij ij j=i+1 i A j=0 j j ii k+1 k k+1 4: x = max((1 − ω)x + ωy , 0) i i i 5: end for k+1 k 6: if x − x  <ε then k+1 7: return x 8: end if 9: end for To represent the PSOR-iteration by a ProxNet as in (3), we use the scalar ReLU activation from Definition 5.1 and define for i ∈{1, ... ,d} d d R : R → R ,x → ((x, e ) )e + x e . (22) i i H i j j j=1,j =i (d) In contrast to  in Eq. (20), the activation operator R takes the maximum only with respect to the ith entry of the input vector. Nevertheless, R ∈ A(R ) holds again by [5, d d×d Proposition 2.24]. Now, define b ∈ R and W ∈ R by i i 1 − ω l = j = i, 1 l = j ∈{1, ... ,d}\{i}, b = (0, ... , 0, ω , 0, ... , 0), (W ) = i i lj ij ii ⎪ −ω ,l = i, j ∈{1, ... ,d}\{i}, ii ith entry 0, elsewhere, k+1 k+1 d k and let T (x):= R (W x+ b ) for x ∈ R . Given the kth iterate x and x , ... ,x from i i i i 1 i−1 k+1 k+1 k,i−1 k k the inner loop of Algorithm 2, it follows for z := (x , ... ,x ,x , ... ,x ) that 1 i−1 i d k+1 k,i k,i k,i−1 x = z ,z = T (z ),i ∈{1, ... ,d},k ∈ N. (23) i i k−1,d k,0 k k+1 k As z = z = x for k ∈ N, this shows x =  (x ) for PSOR d d : R → R ,x → (T ◦ ··· ◦ T )(x). (24) PSOR d 1 Schwab, Stein Res Math Sci (2022) 9:36 Page 19 of 35 36 Provided (19) holds, we derive similarly to Example 5.3 ω ω 2    2 W  = sup x x − 2 x A x + (x A ) i [i] i [i] d A ii x∈R ,x =1 ii 1 ω ≤ 1 − 2ω C + A , ii ii ∗ 3 2 where A denotes the ith row of A. Hence, ω := C /(C A ) is sufficient to ensure that [i] − is a contraction, and convergence to a unique fixed-point follows as in Theorem 3.2. PSOR Remark 5.5 Both, the PJORNet and PSORNet from Examples 5.3 and 5.4, may be aug- mented as in 3.4 to take c ∈ R as additional input vector, and therefore to solve the LCP (18) for varying c. That is, concatenation of the PJORNet/PSORNet again yields an d d approximation to the solution operator O : R → R ,c → x associated with the RHS LCP (18) for fixed A. This is of particular interest, for instance, in the valuation of American options, where a collection of LCPs with varying model parameters has to be solved, see [14, Chapter 5] and the numerical examples in Sect. 6. Recall that c :=  f, v  −a(g, v )if i H i H i the matrix LCP stems from a discretized obstacle problem as introduced in the beginning of this section. Hence, by varying c it is possible to modify the right hand side f ,aswellas the obstacle g, of the underlying variational inequality (cf. Example 4.4 and Sect. 6.3). 5.3 Solution of parametric matrix LCPs by ProxNets In this section, we construct ProxNets that take arbitrary LCPs (A,c) in finite-dimensional, Euclidean space as input, and output approximations of the solution x to (18)withany prescribed accuracy. Consequently, these ProxNets realize approximate data-to-solution operators d d d O : {A ∈ R | there are C ,C > 0s.t. A satisfies (19)}× R → R , (A,c) → x. − + (25) Theideaistoconstruct aNNthatrealizesAlgorithm (1) that achieves prescribed error threshold ε> 0 uniformly for LCP data (A,c) from a set A , meaning the weights of the NN may not depend on A as in the previous section. To this end, we use that the multiplication of real numbers may be emulated by ReLU-NNs with controlled error and growth bounds on the layers and size of the ReLU NN. This was first shown in [27], and subsequently extended to the multiplication of an arbitrary number n ∈ N of real numbers in [24]. Proposition 5.6 [24, Proposition 2.6] For any δ ∈ (0, 1),n ∈ N and ≥ 1, there exists a ProxNet : R → R of the form (2)suchthat δ , sup  x − (x , ... ,x ) ≤ δ , i 1 n 0 δ , n 0 (x ,...,x )∈[− , ] 1 n i=1 (26) ess sup sup ∂ x − ∂ (x , ... ,x ) ≤ δ , x i x 1 n 0 j j δ , n 0 (x ,...,x )∈[− , ] j∈{1,...,n} 1 n i=1 36 Page 20 of 35 Schwab, Stein Res Math Sci (2022) 9:36 where ∂ denotes the weak derivative with respect to x . The neural network uses only x j j δ , ReLUs as in Definition 5.1 as proximal activations. There exists a constant C, independent of δ ∈ (0, 1),n ∈ N and ≥ 1, such that the number of layers m ∈ N of is 0 n,δ , 0 δ , bounded by m ≤ C 1 + log(n)log . (27) n,δ , Remark 5.7 For our purposes, it is sufficient to consider the cases n ∈{2, 3}; therefore, we assume without loss of generality that there is a constant C, independent of δ ∈ (0, 1) and ≥ 1, such that for n ∈{2, 3} it holds m ≤ C 1 + log . n,δ , Moreover, we may assume without loss of generality that m = m ,asitisalways 2,δ , 3,δ , 0 0 possible to add ReLU-layers that emulate the identity function to the shallower network (see [24, Section 2] for details). With this at hand, we are ready to prove a main result of this section. Theorem 5.8 Let ≥ 2 be a fixed constant, d ≥ 2 and define for any given ≥ 2 the set ! " −1 A satisfies (19) with ≥ C ≥ C ≥ > 0, + − d×d d A := (A,c) ∈ R × R . and c ≤ (28) −1 For the triangular decomposition A = D+L+U as in (21), define z := vec(D +L+U) ∈ 2 2 d d×d d d×d ∗ R ,where vec : R → R is the row-wise vectorization of a R -matrix. Let x be 0 d 0 the unique solution to the LCP (A,c), and let x ∈ R be arbitrary such that  x  ≤ . For any ε> 0, there exists a ProxNet d d d d : R ⊕ R ⊕ R → R (29) as in (9)andak ∈ N such that ∗ k x − x  ≤ ε k k−1 holds for the sequence  x := ( x ,z ,c) generated by  and any tuple (A,c) ∈ A . Moreover, k ≤ C (1 +| log(ε)|), where C > 0 only depends on and  has m ≤ ε 1 1 C (1+| log(ε)|+ log(d)) layers, where C > 0 is independent of . 2 2 Proof Our strategy is to approximate  from Example 5.3 for given (A,c)∈ A by PJOR (·,z ,c). We achieve this by constructing  based on the approximate multiplication NNs from Proposition 5.6 and show that  and  satisfy Assumption 3.3 to apply the PJOR error estimate from Theorem 3.5. Schwab, Stein Res Math Sci (2022) 9:36 Page 21 of 35 36 d d d d We start by defining the map  : R ⊕ R ⊕ R → R via (x, z ,c) = ⎛ ⎞ 3 2 1 1 ⎝ ⎠ max (1 − ω)x − ω x , , A + ω ,c , 0 , i j ij i δ , δ , 0 A 0 A ii ii j=1,j =i −6 − ∗ −3/2 for i ∈{1, ... ,d},0 <ω := ≤ = ω and δ ∈ (0,d ]. C A We show in the following that  is indeed a ProxNet. To bring the input into the (i) correct order for multiplication, we define for i ∈{1, ... ,d} the binary matrix W ∈ (2d+1)×(d +2d) R by 1 l = j ∈{1, ... ,d}, 1 l ∈{d + 1, ... , 2d},j = d + d(i − 1) + (l − d), (i) W := lj ⎪ 2 ⎪ 1 l = 2d + 1,j = d + d + i, 0 elsewhere. Hence, we obtain ⎛ ⎞ ⎜ ⎟ (i) W ⎝ z ⎠ = x , A , , A ,c . A ij ij i j<i j>i ii (i) 2d+1 2d+1 Now, let e , ... ,e ⊂ R be the canonical basis of R and define E := 2d+1 (i) (i) 1×(2d+1)  3×(2d+1) e ∈ R , E := [e e e ] ∈ R for j ∈{1, ... ,d}\{i} and E := d+i d+j i j d+1 3 2 2×(2d+1) [e e ] ∈ R .ByRemark 5.7, we may assume that and have d+i 2d+1 δ , δ , 0 0 an identical number of layers, denoted by m ∈ N. Moreover, it is straightforward to δ , construct a ProxNet Id : R → R with m layers that corresponds to the identity m δ , δ , 0 map, i.e., Id (x) = x for all x ∈ R. We use the concatenation from Definition 2.3 to δ , define (i) (i) (i) d +2d := Id • (E W ): R → R i δ , i (i) (i) (i) d +2d := • (E W ): R → R,j ∈{1, ... ,d}\{i}, j j δ , (i) (i) (i) d +2d := • (E W ): R → R. d+1 d+1 δ , Note that this yields 3 2 1 1 (i) (i) (i) (x, z ,c) = x ,  (x, z ,c) = x , , A ,  (x, z ,c) = ,c . A i A j ij A i i j d+1 δ , δ , 0 A 0 A ii ii (+,i) d +d (+,i) Furthermore, we set m := m + 1 and define T : R → R,x → (W x), 1 δ , m 0 1 (+,i) 1×(d+1) where  : R → R is the (scalar) ReLU activation and W ∈ R is given by 1 − ω j = i, (+,i) W := −ω j ∈{1, ... ,d}\{i}, ω j = d + 1. 36 Page 22 of 35 Schwab, Stein Res Math Sci (2022) 9:36 (i) (i) As  , ... ,  have the same input dimension, the same number of m layers, and δ , 1 d+1 no skip connections, we may parallelize as in Definition 2.5 to ensure ⎛ ⎞ 3 2 1 1 ⎝ ⎠ (x, z ,c) = max (1 − ω)x − ω x , , A + ω ,c , 0 i i j ij i δ , δ , 0 A 0 A ii ii j=1,j =i (i) (i) (+,i) = T • P  , ... ,  (x, z ,c). m 1 1 d+1 (+,i) (i) (i) It holds that  := T • P  , ... ,  is a ProxNet as in Eq. (9)with  : i m i 1 1 d+1 d +2d R → R and m = m + 1 layers for any i ∈{1, ... ,d}. We parallelize once more 1 δ , and obtain that  := P( , ... ,  ) is a ProxNet with m +1 layers that may be written 1 δ , d 0 (1) (1) (1) d d j−1 j as  = T ◦···◦T for suitable one-layer networks T : R → R and dimensions 1 1 1 d ∈ N for j ∈{0, 1, ... ,m } such that d = d + 2d and d = d. j 1 0 m −6 We now fix (A,c)∈ A and let  := R(W ·+ b )beasinExample 5.3 with ω = , PJOR 1 1 −1 −1 W = I − ωD A and b := ωD c. This shows that  has Lipschitz constant 1 d 1 PJOR −4 −8 −4 2 −4 L =W  ≤ 1 − 2 + = 1 − < 1and b  ≤ ω ≤ . 1 2 1 2 Note that |c |, 1/A ,|A |≤ for any i, j ∈{1, ... ,d}. Therefore, Proposition 5.6 yields i ii ij for x := (z ,c)and any x ∈ R with x ≤ that 0 A ∞ (x) − (x, z ,c) =T (x) − T (x, z ,c) 1 1 A ⎛ ⎞ d d 2 3 c 1 1 i  ij ⎝ ⎠ = ω − c , − x − A , ,x i j ij j δ , δ , A 0 A A 0 A ii ii ii ii i=1 j=1,j =i 2 3 2 ≤ ω d δ . −3/2 −6 Hence, since δ ∈ (0,d ]and ω = ,  and  satisfy Assumption 3.3 with 0 PJOR −4 3/2 L := 1 − ∈ (0, 1), δ := ωd δ ≥ 0, := ≥ 2, 0 1 −4 3/2 := −b  − δ ≥ − − ωd δ ≥ , 0 1 1 2 0 −6 123 1 := − δ/(1 − L) ≥ − ≥ − > 0. 2 0 0 −4 64 4 Theorem 3.5 then yields that there exists a constant C > 0 such that for all k, δ holds ∗ k k x − x  ≤ C L + δ . Here, C ≤ max(2 , 1)/(1 − L) ≤ 2 is independent of k. Given ε> 0, we choose % & ε ε min 1, min 1, log(ε) − log(2C) 2Cω 4 k =: , δ := ≥ ε 0 3/2 3/2 d d log(L) ∗ k to ensure x −  x ≤ ε. Hence, k ≤ C (1 +| log(ε)|), where C = C ( ) > 0is ε 1 1 1 −3/2 independent of d. Moreover, Inequality (27)inProposition 5.6 and the choice δ ≤ d shows that m ≤ C (1 +| log(ε)|+ log(d)), where C > 0 is independent of .The δ , 2 2 claim follows since  has m = m + 1 layers by construction. 1 δ , 0 Schwab, Stein Res Math Sci (2022) 9:36 Page 23 of 35 36 For fixed and ε,the ProxNets  emulate one step of the PJOR algorithm for any LCP (A,c) ∈ A and a given initial guess  x . This in turn allows to approximate the data-to-solution operator O from (25) to arbitrary accuracy by concatenation of suitable ProxNets. The precise statement is given in the main result of this section: Theorem 5.9 Let ≥ 2 be fixed, let A be givenasin(28), and let the data-to-solution operator O be given as in (25). Then, for any ε> 0, there is a ProxNet O : A → R such that for any LCP (A,c) ∈ A there holds O(A,c) − O (A,c) ≤ ε. ε 2 d×d Furthermore, let · denote the Frobenius norm on R . There is a constant C > 0, (1) (1) (2) (2) depending only on andd,suchthatforanyε> 0 and any two (A ,c ), (A ,c ) ∈ A there holds (1) (1) (2) (2) (1) (2) (1) (2) O (A ,c ) − O (A ,c ) ≤ C A − A  +c − c  . (30) ε ε 2 F 2 We give an explicit construction of the approximate data-to-solution operator O in the proof of Theorem 5.9 at the end of this section. To show the Lipschitz continuity of O with respect to the parametric LCPs in A , we derive an operator version of the so-called Strang Lemma: (1) (1) (2) (2) Lemma 5.10 Let ≥ 2,d ≥ 2, and let (A ,c ), (A ,c ) ∈ A .For l ∈{1, 2}, (l) (l) (l) (l) (l) let A = D + L + U be the decomposition of A as in (21) and define z := (l) (l) −1 (l) (l) d vec((D ) + L + U ) ∈ R . For target emulation accuracy ε> 0, let  be the 0 d 0 ProxNet as in (29), let x ∈ R be such that  x  ≤ and define the sequences (l),k (l),k−1 (l) (l),0 0 x := ( x ,z (l),c ),k ∈ N,  x := x ,l ∈{1, 2}. (31) Then, there is a constant C > 0, depending only on andd,suchthatfor anyk ∈ N and arbitrary, fixed ε> 0 it holds that (1),k (2),k (1) (2) (1) (2) x − x  ≤ C A − A  +c − c  . (32) 2 F 2 Proof By construction of  in Theorem 5.8,wehavefor x ∈ R , l ∈{1, 2},and i ∈ {1, ... ,d} that (l) (x, z ,c ) (l) i ⎛ ⎞ 3 2 1 1 (l) (l) ⎝ ⎠ = max (1 − ω)x − ω x , , A + ω ,c , 0 . i j ij i (l) (l) δ , δ , 0 0 A A j=1,j =i ii ii Therefore, we estimate by the triangle inequality (1) (2) |(x, z ,c ) − (x, z ,c ) | (1) (2) i i A A 3 3 1 1 (1)  (2) ≤ ω x , , A − x , , A j j ij ij (1) (2) δ , δ , 0 0 A A j=1,j =i ii ii 36 Page 24 of 35 Schwab, Stein Res Math Sci (2022) 9:36 2 2 1 1 (1) (2) + ω  ,c − ,c i i (1) (2) δ , δ , 0 0 A A ii ii 3 3 1 1 (1) (2) ≤ ω x , , A − x , , A j j ij ij (1) (1) δ , δ , 0 0 A A j=1,j =i ii ii 3 3 1 1 (2) (2) ω x , , A − x , , A j j ij ij (1) (2) δ , δ , 0 0 A A j=1,j =i ii ii 2 2 1 1 (1)  (2) + ω ,c − ,c i i (1) (1) δ , δ , 0 0 A A ii ii 2 2 1 1 (2) (2) + ω ,c − ,c . i i (1) (2) δ , δ , 0 0 A A ii ii (l) (l) (l) (l) Since (A ,c ) ∈ A for l ∈{1, 2}, it holds for any i, j ∈{1, ... ,d} that 1/A , A , ii ij (l) c ∈ [− , ]. Hence, for any x with x ≤ we obtain by ≥ 2 and the second estimate in (26) (2) (2) |(x, z ,c ) − (x, z ,c ) | (1) i (2) i A A (1) (2) ≤ ω δ + A − A ij ij (1) j=1,j =i ii 1 1 (2) + ω δ +|x A | − 0 j ij (1) (2) A A ii ii ⎛ ⎞ (1) (2) ⎝ ⎠ + ω δ + c − c i i (1) ij 1 1 (2) + ω δ +|c | − (1) (2) A A ii ii ⎛ ⎞ (1) (2)  (1) (2) 2 4 ⎝ ⎠ ≤ ω2(δ + ) A − A + c − c ij ij i i j=1 ⎛ ⎞ ⎛ ⎞ 1/2 ⎜  (1) (2)  (1) (2)⎟ 2 4 1/2 ⎝ ⎠ ≤ ω(δ + ) d A − A + c − c . ⎝    ⎠ ij ij i i j=1 We have used the mean-value theorem to obtain the bound 1 1 (1) (2) −  ≤ A − A ii ii (1) (2) A A ii ii in the second last inequality and the Cauchy–Schwarz inequality in the last step. We recall −6 −3/2 from the proof of Theorem 5.8 that ω = and δ ≤ d ; hence, there is a constant C = C( ,d) > 0, depending only on the indicated parameters, such that for any x ∈ R with x ≤ it holds (1) (2) (1) (2) (1) (2) (x, z ,c ) − (x, z ,c ) ≤ C A − A  +c − c  . (33) (1) (2) 2 F 2 A A Schwab, Stein Res Math Sci (2022) 9:36 Page 25 of 35 36 Moreover, for any x, y ∈ R such that x ,y ≤ , it holds by the mean-value ∞ ∞ theorem and the second estimate in (26) (1) (1) |(x, z (1),c ) − (y, z (1),c ) | i i A A (1) (1) −1 ≤ (x, z ,c ) − (y, z ,c ) − ((I − ωD A)(x − y)) (1) (1) i i d i A A −1 + ((I − ωD A)(x − y)) d i (1) 3 3 1 1 ij (1) (1) = ω x , , A − y , , A − (x − y ) j j j j ij ij (1) (1) (1) δ , δ , 0 0 A A A j=1,j =i ii ii ii −1 + ((I − ωD A)(x − y)) d i −1 ≤ ωδ |x − y |+ ((I − ωD A)(x − y)) . 0 j j d i j=1,j =i Hence, Young’s inequality yields for any > 0 that (1) (1) 2 (x, z ,c ) − (y, z ,c ) (1) (1) A A 2 ⎛ ⎞ d d 2 2 −1 2 ⎝ ⎠ ≤ 1 + ω δ |x − y | + (1 + )(I − ωD A)(x − y) j j d 0 2 (34) i=1 j=1,j =i 2 2 −1 2 2 ≤ 1 + ω δ d(d − 1) + (1 + )I − ωD A x − y , 0 2 2 where we have used the Cauchy–Schwarz inequality in the last step. From the proof of −6 −3/2 Theorem 5.8,wehaveasbeforethat ω = , δ ≤ d , and, furthermore I − −1 4 −4 (1) d d ωD A ≤ 1 − . Setting  := therefore shows that (·,z ,c ): R → R is 2 (1) a contraction on (R ,· ) with Lipschitz constant L > 0bounded by 2 1 −8 1/2 7 −12 −8 −1 −8 −12 −8 L ≤ + d + (1 − ) ≤ 1 + − ≤ 1 − ∈ (0, 1). 2 16 (35) Note that we have used d ≥ 2and ≥ 2 in the last two steps to obtain (35). Now, (l),k let ( x ) for l ∈{1, 2} and k ∈ N denote the iterates as defined in (31) and recall from (l),k (l),k the proof of Theorem 3.5 that  x  ≤ x  ≤ . Therefore, we may apply the ∞ 2 estimates in (33)and (34)toobtain (1),k (2),k (1),k (2),k−1 (1) (2),k (1) (2),k x − x  ≤ x − ( x ,z ,c ) +( x ,z ,c ) − x (1) (1) 2 2 2 A A (1),k−1 (2),k−1 (1) (2) (1) (2) ≤ L  x − x  + C A − A  +c − c 1 2 F 2 k−1 (1) (2) (1) (2) ≤ C A − A  +c − c  L F 2 j=1 (1) (2) (1) (2) ≤ A − A  +c − c  . F 2 1 − L The claim follows for C := < ∞,since C = C( ,d), and L is bounded indepen- 1−L dently with respect to ε and k by (35).  36 Page 26 of 35 Schwab, Stein Res Math Sci (2022) 9:36 d d d d Proof of Theorem 5.9 For fixed and ε, let the ProxNet  : R ⊕ R ⊕ R → R and k ∈ N be given as in Theorem 5.8. We define the operator O by concatenation of  via ε ε ⎡ ⎤ ⎢ ⎥ O (A,c):= (·,z ,c) • ··· • (·,z ,c) (0), (A,c) ∈ A . ⎣ ⎦ ε A A k -fold concatenation 0 d 0 1 Note that the initial value x := 0 ∈ R satisfies  x  ≤ for arbitrary > 0. Thus, 0 ∗ d applying Theorem 5.8 with x = 0 yields for any LCP (A,c) ∈ A with solution x ∈ R that ∗ k O(A,c) − O (A,c) =x − x  ≤ ε. ε 2 2 1,0 2,0 To show the second part of the claim, we set  x =  x := 0 and observe that (1),k (2),k (l),k (l) (l) ε ε ε x ,x in Lemma 5.10 are given by x = O (A ,c ) for l ∈{1, 2}. Hence, the (1) (1) (2) (2) estimate (30) follows immediately for any ε> 0and (A ,c ), (A ,c ) ∈ A from (32), by setting k = k . 6 Numerical experiments 6.1 Valuation of American options: Black–Scholes model To illustrate an application for ProxNets, we consider the valuation of an American option in the Black–Scholes model. The associated payoff function of the American option is denoted by g : R → R , and we assume a time horizon T = [0,T] for T > 0. In ≥0 ≥0 any time t ∈ T and for any spot price x ≥ 0 of the underlying stock, the value of the option is denoted by V (t, x) and defines a mapping V : T × R → R . Changing to ≥0 ≥0 time-to-maturity and log-price yields the map v : T × R → R , (t, x) → V (T − t, e ), ≥0 which is the solution to the free boundary value problem 2 2 σ σ ∂ v − ∂ v − r − ∂ v + rv ≥0in(0,T] × R, t xx x 2 2 v(t, x) ≥ g(e)in(0,T] × R, (36) 2 2 σ σ ∂ v − ∂ v − r − ∂ v + rv (g − v) =0in(0,T] × R, t xx x 2 2 x x v(0,e ) = g(e)in R, see, e.g., [14, Chapter 5.1]. The parameters σ> 0and r ∈ R are the volatility of the underlying stock and the interest rate, respectively. We assume that g ∈ H (R )and ≥0 construct in the following a ProxNet-approximation to the payoff-to-solution operator at time t ∈ T given by 1 1 O : H (R ) → H (R),g → v(t,·). (37) payoff,t ≥0 As V and v, and therefore O , are in general not known in closed-form, a common payoff,t approach to approximate v for a given payoff function g is to restrict Problem (36)to a bounded domain D ⊂ R and to discretize D by linear finite elements based on d 1 0 d 0 We could have also used any other x = 0 ∈ R such that  x  ≤ to define O for given and ε, but decided to ∞ ε fix the -independent initial guess x := 0 for simplicity. Schwab, Stein Res Math Sci (2022) 9:36 Page 27 of 35 36 equidistant nodal points. The payoff function is interpolated with respect to the nodal basis, and we collect the respective interpolation coefficients of g in the vector g ∈ R . Thetimedomain [0,T] is split by M ∈ N equidistant time steps and step size t = T /M, and the temporal derivative is approximated by a backward Euler approach. This space- time discretization of the free boundary problem (36) leads to a sequence of discrete d d d variational inequalities: Given g ∈ R and u := 0 ∈ R find u ∈ R such that for 0 m m ∈{1, ... ,M},itholds Au ≥ F ,u ≥ 0, (Au − F ) u = 0. (38) m+1 m m+1 m+1 m m+1 2 2 σ σ BS d×d BS The LCP (38) is defined by the matrices A := M + tA ∈ R , A := S + ( − 2 2 d×d BS  d r)B + rM ∈ R and right hand side F :=−t(A ) g + Mu ∈ R . The matrices m m d×d S, B, M ∈ R represent the finite element stiffness, advection and mass matrices; hence, A is tri-diagonal and asymmetric if = r. The true value of the options at time km is approximated at the nodal points via v(tm,·) ≈ u + g. This yields the discrete payoff- to-solution operator at time tm defined by d d O : R → R ,g → u + g,m ∈{1, ... ,M}. (39) payoff,tm m Problem (38) may be solved for all m using a shallow ProxNet d d d d : R ⊕ R ⊕ R → R ,x → R(W x + b ), 1 1 (d) d d with ReLU-activation R =  : R → R . The architecture of  allows to take g and u as additional inputs in each step; hence, we train only one shallow ProxNet that may be used for any payoff function g and every time horizon T. Therefore, we learn the payoff-to-solution operator O associated with Problem (36) by concatenating .The payoff,t d×3d d parameters W ∈ R and b ∈ R are learned in the training process and shall emulate 1 1 one step of the PJOR Algorithm 1, as well as the linear transformation (g,u ) → F to m m obtain the right hand side in (38). Therefore, a total of 3d + d parameters have to be learned in each example. For our experiments, we use the Python-based machine learning package PyTorch. All experiments are run on a notebook with 8 CPUs, each with 1.80 GHz, and 16 GB memory. (i) (i) (i) (i) 3d To train ,wesample N ∈ N input data points x := (x ,g ,u ) ∈ R , i ∈{1, ... ,N }, s s from a 3d-dimensional standard-normal distribution. The output-training data samples (i) (i) 0 y consist of one iteration of Algorithm 1 with ω = 1, initial value x := x ,with A as in BS  (i) (i) d (38) and right hand side given by c :=−t(A ) g + Mu ∈ R . We draw a total of N = 2· 10 input–output samples, use half of the data for training, and the other half for validation. In the training process, we use mini-batches of size N = 100 and the Adam batch −3 Optimizer [18] with initial learning rate 10 , which is reduced by 50% every 20 epochs. As error criterion, we use the mean-squared error (MSE) loss function, which is for each (i ) (i ) (i ) (i ) j j j j batch of inputs ((x ,g ,u ),j = 1, ... ,N ) and outputs (y ,j = 1, ... ,N ) batch batch https://pytorch.org/. 36 Page 28 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Fig. 1 Decay of the loss function for d = 600 (left) and d = 1000 (right). In all of our experiments, the −12 training loss falls below the threshold of 10 before the 250th epoch, and training is stopped early given by (i ) (i ) (i ) (i ) (i ) (i ) 1 1 1 N N N batch batch batch Loss (x ,g ,u ),··· , (x ,g ,u ) batch (i ) (i ) (i ) (i ) 2 j j j j := (x ,g ,u ) − y  . batch j=1 −12 We stop the training process if the loss function falls below the tolerance 10 or after a maximum of 300 epochs. The number of spatial nodal points d that determines the size of the matrix LCPs is varied throughout our experiments in d ∈{200, 400, ... , 1000}. We choose the Black–Scholes parameters σ = 0.1, r = 0.01 and T = 1. Spatial and temporal refinement are balanced by using M = d time steps of size t = T /M = 1/d. The decay of the loss-curves is depicted in Fig. 1, where the reduction in the learning rate every 20 epochs explains the characteristic “steps” in the decay. This stabilizes the training −12 procedure, and we reached a loss of O(10 ) for each d before the 250th epoch. Once training is terminated, we compress the resulting weight matrix of the trained −7 single-layer ProxNet by setting all entries with absolute value lower than 10 to zero. This speeds up evaluation of the trained network, while the resulting error is negligible. As the matrix W in the trained ProxNet is close to the “true” tri-diagonal matrix A from (38), this eliminates most of the ProxNet’s O(d ) parameters, and only O(d) non-trivial entries remain. The relative validation error is estimated based on the N := 10 validation samples val via (i ) (i ) (i ) (i ) val 2 j j j j (x ,g ,u ) − y j=1 2 err := . (40) val val (i ) 2 j=1 2 The validation errors and training times for each dimension are found in Table 1 and confirm the successful training of the ProxNet. Naturally, training time increases in d, −6 while the validation error is small of order O(10 ) for all d. To test the trained neural networks on Problem (38) for the valuation of an American option, we consider a basket of 20 put options with payoff function g (x):= max(K − x, 0) i i and strikes K = 10 + 90 for i ∈{1, ... , 20}. Hence, we use the same ProxNet for 20 different payoff vectors g . Note that we did not train our networks on payoff functions, i Schwab, Stein Res Math Sci (2022) 9:36 Page 29 of 35 36 Table 1 Training times and validation errors for the ProxNets in the Black–Scholes model in several dimensions, as estimated in (40) based on N = 10 samples val d 200 400 600 800 1000 Training time in s 6.06 39.38 90.69 311.04 466.87 −6 −6 −7 −6 −6 err 1.15 · 10 1.08 · 10 8.88 · 10 1.04 · 10 1.36 · 10 val The relative error remains stable with increasing problem dimension but on random samples, and thus, we could in principle consider an arbitrary basket containing different types of payoffs. The restriction to put options is for the sake of brevity only. We denote by u for m ∈{0, ... ,M} the sequence of solutions to (38)with m,i payoff vector g and u = 0 ∈ R for each i. 0,i Concatenating  k times yields an approximation to the discrete operator O payoff,tm in (39) for any m ∈{1, ... ,M} via ⎡ ⎤ ⎢ ⎥ d d d d O : R ⊕R ⊕R → R , (x, u ,g) → (·,g, u ) • ··· • (·,g, u ) (x). ⎣ ⎦ payoff,tm m m m k-fold concatenation An approximating sequence of (u ,m ∈{0, ... ,M}) is then in turn generated by m,i u := O ( u , u ,g),  u := u = 0 ∈ R . m+1,i payoff,tm m,i m,i 0,i 0,i 0 d That is, u is given by iterating  k times with initial input x =  u ∈ R and fixed m+1,i m,i inputs and g and u . We stop for each m after k iterations if two subsequent iterates x m,i k−1 k k−1 −3 and x satisfy x − x  < 10 . The reference solution u is calculated by a Python-implementation that uses the M,i Primal-Dual Active Set (PDAS) Algorithm from [15] to solve LCP (38)withtolerance ε = −6 10 in every time step. Compared to a fixed-point iteration, the PDAS method converges (locally) superlinear according to [15, Theorem 3.1], but has to be called separately for each payoff function g . In contrast, the ProxNet  may be iterated for the entire batch of 20 payoffs at once in PyTorch. We measure the relative error err := u − u  /u i,rel M,i M,i 2 M,i 2 for each payoff vector g at the end point T = tM = 1 and report the sample mean error err := err . (41) rel i,rel i=1 Sample mean errors and computational times are depicted for d ∈{200, 400, ... , 1000} in Table 2, where we also report the number of iterations k for each d to achieve the −3 desired tolerance of 10 . The results clearly show that ProxNets significantly accelerate the valuation of American option baskets, if compared to the standard, PDAS-based implementation. This holds true for any spatial resolution, i.e., the number of grid points d, −3 −4 while the relative error is small of magnitude O(10 )or O(10 ). For d ≥ 600, we actually find that the combined times for training and evaluation of ProxNets is below the runtime of the reference solution. We further observe that computational times scale similarly 36 Page 30 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Table 2 Relative errors and computational times of a ProxNet solver for a basket of American put options in the Black–Scholes model d 200 400 600 800 1000 −4 −4 −3 −3 −3 err 2.15 · 10 7.89 · 10 1.52 · 10 2.41 · 10 3.48 · 10 rel Iterations to tolerance 9 13 15 17 18 Time ProxNet in s 0.26 1.16 6.23 15.06 30.45 Time reference in s 4.37 33.17 142.01 350.86 761.10 ProxNets significantly reduce computational time, while their relative error remains sufficiently small for all d for both, ProxNet and reference solution, in d. Hence, in our experiments, ProxNets are computationally advantageous even for a fine resolution of d = 1000 nodal points. 6.2 Valuation of American options: jump-diffusion model We generalize the setting of the previous subsection from the Black–Scholes market to an exponential Lévy model. That is, the log-price of the stock evolves as a Lévy process, with jumps distributed with respect to the Lévy measure ν : B(R) → [0,∞). The option value v (in log-price and time-to-maturity) is now the solution of a partial integro-differential inequality given by ∂ v − ∂ v − γ∂ v + v(·+ z) − v − ∂ vν(dz) + rv ≥0in(0,T] × R, t xx x x v(t, x) ≥ g(e)in(0,T] × R, (42) ∂ v − ∂ v − γ∂ v + v(·+ z) − v − ∂ vν(dz) + rv (g − v) =0in(0,T] × R, t xx x x x x v(0,e ) = g(e)in R. Introducing jumps in the model hence adds a non-local integral term to Eq. (36). The 2/2 z driftisset to γ :=−σ − (e − 1 − z)ν(dz) ∈ R in order to eliminate arbitrage in the market. We discretize Problem (42) by an equidistant grid in space and time as in the previous subsection, for details, e.g., integration with respect to ν, we refer to [14,Chapter 10]. The space-time approximation yields again a sequence of LCPs of the form L L A u ≥ F ,u ≥ 0, (A u − F ) u = 0, (43) m+1 m m+1 m+1 m m+1 L Levy d×d Levy J J where A := M + tA ∈ R with A := S + A , and the matrix A stems from the integration of ν. A crucial difference to (38) is that A is not anymore tri- diagonal, but a dense matrix, due to the non-local integral term caused by the jumps. The drift γ and interest rate r are transformed into the right hand side, such that F := Levy  d −t(A ) g + Mu ∈ R , where g is the nodal interpolation of the transformed m m rkm payoff g (x):= ge (x− (γ + r)km). The inverse transformation gives an approximation −rT to the solution v of (42) at the nodal points via v(km,·− (γ + r)T) ≈ e u . We refer to [14, Chapter 10.6] for further details on the discretization of American options in Lévy models. The jumps are distributed according to the Lévy measure −β z −β z + − ν(dz) = λpβ e 1 (z) + λ(1 − p)β e 1 (z),z ∈ R. (44) + {z>0} − {z<0} That is, the jumps follow an asymmetric, double-sided exponential distribution with jump intensity λ = ν(R) ∈ (0,∞). We choose p = 0.7, β = 25, β = 20 to characterize the + − Schwab, Stein Res Math Sci (2022) 9:36 Page 31 of 35 36 Table 3 Training times and validation errors for the ProxNets in the jump-diffusion model, as estimated in (40) based on N = 10 samples val d 200 400 600 800 1000 Training time in s 6.59 37.03 88.22 300.40 461.79 −6 −6 −7 −6 −6 err 1.18 · 10 1.09 · 10 9.79 · 10 9.96 · 10 1.43 · 10 val The relative error remains stable with increasing problem dimension Table 4 Relative errors and computational times of a ProxNet solver for a basket of American put options in the jump-diffusion model d 200 400 600 800 1000 −4 −4 −4 −3 −3 err 1.55 · 10 4.97 · 10 9.62 · 10 1.52 · 10 2.09 · 10 rel Iterations to tolerance 6 7 7 7 6 Time ProxNet in s 0.21 1.04 4.81 11.62 34.27 Time reference in s 4.29 31.52 147.20 354.25 782.45 ProxNets significantly reduce computational time, while their relative error remains sufficiently small for all d tails of ν and set jump intensity to λ = 1. We further use σ = 0.1and r = 0.01 as in the Black–Scholes example. We use the same training procedure and parameters as in the previous subsection to train the shallow ProxNets. As only difference, we compress the weight matrix with −8 −7 L tolerance 10 instead of 10 (recall that A is dense). This yields slightly better relative errors in this example, while it does not affect the time to evaluate the ProxNets. Training times and validation errors are depicted in Table 3 and indicate again a successful training. The decay of the training loss is for each d very similar to Fig. 1, and training is again stopped in each case before the 300th epoch. After training, we again concatenate the shallow nets to approximate the operator O in (37), that maps the payoff function g to the corresponding option value v(t,·)at payoff,t any (discrete) point in time. We repeat the test from Sect. 6.1 in the jump-diffusion model with the identical basket of put options to test the trained ProxNets. The reference solution is again computed by a PDAS-based implementation. The results for American options in the jump-diffusion model are depicted in Table 4. Again, we see that the trained ProxNets −3 approximated the solution v to (42) for any g to an error of magnitude O(10 )orless. While keeping the relative error small, ProxNets again significantly reduce computational time and are therefore a valid alternative in more involved financial market models. We finally observe that the number of iterations to tolerance in the jump-diffusion model is stable at 6–7 for all d, whereas this number increases with d in the Black–Scholes mar- ket (compare the third row in Tables 2 and 4). The explanation for this effect is that the excess-to-payoff vector u has a smaller norm in the jump-diffusion case, but the iterations −3 terminate at the (absolute) threshold 10 in both, the Black–Scholes and jump-diffusion model. Therefore, we require less iterations in the latter scenario, although the option prices v and relative errors are of comparable magnitude in both examples. 6.3 Parametric obstacle problem To show an application for ProxNets beyond finance, we consider an elliptic obstacle 2 1 problem in the two-dimensional domain D := (−1, 1) .Wedefine H := H (D)and aim 0 36 Page 32 of 35 Schwab, Stein Res Math Sci (2022) 9:36 to find the solution u ∈ H to the partial differential inequality − u ≥ f in D,u ≥ g in D,u =0on ∂D. (45) Therein, f ∈ H is a given source term and g ∈ H is an obstacle function, for which we assume g ∈ C(D) ∩ H for simplicity in the following. We introduce the convex set K :={v ∈ H| v ≥ g almost everywhere} and the bilinear form a : H × H → R, (v, w) → ∇v ·∇w dx, and note that a, f and K satisfy Assumption 4.1. The variational inequality problem associated with (45) is then to find u ∈ K such that: a(u, v − u) ≥ f (v − u), ∀v ∈ K. (46) As for (15) at the beginning of Sect. 5, we introduce K :={v ∈ H| v ≥ 0 almost everywhere},and Problem(46) is equivalent to finding u = u + g ∈ K with u ∈ K such that: a(u ,v − u ) ≥ f (v − u ) − a(g, v − u ), ∀v ∈ K . (47) 0 0 0 0 0 0 0 As for the previous examples in this section, we use ProxNets to emulate the obstacle-to- solution operator O : H → H,g → u. (48) obs 2 2 We discretize D = [−1, 1] for d ∈ N by a (d + 2) -dimensional nodal basis of 0 0 linear finite elements, based on (d + 2) equidistant points in every dimension. Due to the homogeneous Dirichlet boundary conditions in (45), we only have to determine the discrete approximation of u within D and may restrict ourselves to a finite element basis {v , ... ,v }, for d := d , with respect to the interior nodal points. Following the procedure outlined in Sect. 5.1,wedenoteby g ∈ R again the nodal interpolation coefficients of d×d g (recall that we have assumed g ∈ C(D)) and by A ∈ R the finite element stiffness matrix with entries A := a(v ,v ) for i, j ∈{1, ... ,d} This leads to the matrix LCP to find ij j i u ∈ R such that Au ≥ c, u ≥ 0,u (Au − c) = 0, (49) d T where c ∈ R is in turn given by c := f (v )−(A g) for i ∈{1, ... ,d}. Given a fixed spatial i i i discretization based on d nodes, we again approximate the discrete obstacle-to-solution operator d d O : R → R ,g → u (50) obs d d d by concatenating shallow ProxNets  : R ⊕ R → R . The training process of the ProxNets in the obstacle problem is the same as in Sects. 6.1 and 6.2 and thus, is not further outlined here. The only difference is that we draw the input data for training now from a 2d-dimensional standard normal distribution. The output samples again correspond to one PJOR-Iteration with A and c as in (49)and Schwab, Stein Res Math Sci (2022) 9:36 Page 33 of 35 36 Table 5 Training times and validation errors for the ProxNets in the Obstacle Problem, as estimated in (40) based on N = 10 samples val d 100 400 900 1600 Training time in s 4.34 22.97 259.19 907.07 −6 −6 −7 −6 err 1.11 · 10 1.16 · 10 9.11 · 10 1.78 · 10 val The relative error remains stable with increasing problem dimension ω = 1, where the initial value and g are both replaced by the 2d-dimensional random input vector. After training, we again compress the weight matrices by setting all entries −7 with absolute value lower than 10 to zero. We test the ProxNets for LCPs of dimension 2 2 2 2 d ∈{10 , 20 , 30 , 40 } and report training times and validation errors in Table 5.As before, training is successful and aborted early for each d, since the loss function falls −12 below 10 before the 300th epoch. d d d d Once  : R ⊕ R → R is trained for given d, we use the initial value zero x = 0 ∈ R and concatenate  k times to obtain for any g the approximate discrete obstacle-to- solution operator ⎡ ⎤ ⎢ ⎥ d d O : R → R ,g → (·,g) • ··· • (·,g) (0). ⎣ ⎦ obs k-fold concatenation This yields u = O (g) ≈ u := O (g). We test the trained ProxNets on the parametric obs obs family of obstacles (g ,r > 0) ⊂ H, given by 2 1 1 −rx g (x):= min max e − , 0 , ,x ∈ D. (51) 2 4 For given r > 0, let g ∈ R denote the nodal interpolation of g ,andlet u be discrete solu- tion to the corresponding obstacle problem. We approximate the solutions u to (49) for 4i a basket of 100 obstacles g with r ∈ R :={1+ | i ∈{0, ... , 99}}. For this, we iterate the ProxNets  again on the entire batch of obstacles and denote by u the kth iterate for any k k−1 −4 r ∈ R. We stop the concatenation of  after k iterations if max u − u  < 10 , r∈R 2 r r andreportonthe valueof k for each d. The lower absolute tolerance is necessary in the obstacle problem, since the solutions u now have lower absolute magnitude as compared to the previous examples. The reference solution is again calculated by solving (49)with the PDAS algorithm, which has to be called separately for each obstacle in (g ,r ∈ R). A sample of g together with the associated discrete solution u and its ProxNet approxima- tion u is depicted in Fig. 2. The relative error of the ProxNet approximation, the number of iterations and the com- putational times are depicted in Table 6. ProxNets approximate the discrete solutions −4 well with relative errors of magnitude O(10 ) for all d. However, compared to the exam- ples in Sects. 6.1 and 6.2, we observe that significantly more iterations are necessary to −4 achieve the absolute tolerance of 10 . This is due to the larger contraction constants in the obstacle problem, which are very close to one for all d. The lower absolute tolerance −4 of 10 adds more iterations, but is not the main reason why we observe larger values of k in the obstacle problem. Nevertheless, ProxNets still outperform the reference solver in terms of computational time, with a relative error of at most 0.1% for large d. 36 Page 34 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Fig. 2 From left to right: Obstacle g as in (51) with scale parameter r = 1.7677, the corresponding discrete 2 2 solution u with refinement parameter h := in each spatial dimension (corresponds to d = 40 interior nodal points in D), and its ProxNet approximation u based on k = 698 iterations Table 6 Relative errors and computational times of a ProxNet solver for a family of parametric obstacle problems d 100 400 900 1600 −4 −4 −4 −3 err 3.69 · 10 5.89 · 10 9.20 · 10 1.14 · 10 rel Iterations to tolerance 56 206 416 698 Time ProxNet in s 0.01 0.07 0.50 2.71 Time reference in s 0.08 0.51 3.13 26.67 ProxNets again reduce computational time, while keeping the relative error sufficiently small for all d. The number of iterations to tolerance is now significantly larger as in the previous examples 7Conclusions We proposed deep neural networks which realize approximate input-to-solution opera- tors for unilateral, inequality problems in separable Hilbert spaces. Their construction was based on realizing approximate solution constructions in the continuous (infinite dimen- sional) setting, via proximinal and contractive maps. As particular cases, several classes of finite-dimensional projection maps (PSOR, PJOR) were shown to be representable by the proposed ProxNet DNN architecture. The general construction principle behind ProxNet introduced in the present paper can be employed to realize further DNN architectures, also in more general settings. We refer to [1] for multilevel and multigrid methods to solve (discretized) variational inequality problems. The algorithms in this reference may also be realized as concatenation of ProxNets, similarly to the PJOR-Net and PSOR-Net from Examples 5.3 and 5.4. The analysis and representation of multigrid methods as ProxNets will be considered in a forthcoming work. Acknowledgements The preparation of this work benefited from the participation of ChS in the thematic period “Mathematics of Deep Learning (MDL)” from 1 July to 17 December 2021, at the Isaac Newton Institute, Cambridge, UK. AS has been funded in part by ETH Foundations of Data Science (ETH-FDS), and it is greatly appreciated. Data availability The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request. Received: 1 October 2021 Accepted: 4 April 2022 References 1. Badea, L.: Convergence rate of some hybrid multigrid methods for variational inequalities. J. Numer. Math. 23(3), 195–210 (2015) 2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 2nd edn. Springer, Cham (2017) (With a foreword by Hédy Attouch) Schwab, Stein Res Math Sci (2022) 9:36 Page 35 of 35 36 3. Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. JMLR 20, 74 (2019) 4. Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization, volume 3 of CMS Books in Mathemat- ics/Ouvrages de Mathématiques de la SMC, 2nd edn. Springer, New York (2006) (Theory and examples) 5. Combettes, P.L., Pesquet, J.-C.: Deep neural network structures solving variational inequalities. Set-Valued Var. Anal. 28(3), 491–518 (2020) 6. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004) 7. Duvaut, G., Lions, J.-L.: Inequalities in Mechanics and Physics, volume 219 of Grundlehren der Mathematischen Wissenschaften. Springer, Berlin (1976) (Translated from the French by C. W. John) 8. Glas, S., Urban, K.: On noncoercive variational inequalities. SIAM J. Numer. Anal. 52(5), 2250–2271 (2014) 9. Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: International Conference on Machine Learning. PMLR, pp. 1–8 (2010) 10. Hasannasab, M., Hertrich, J., Neumayer, S., Plonka, G., Setzer, S., Steidl, G.: Parseval proximal neural networks. J. Fourier Anal. Appl. 26(4), 31 (2020) 11. He, J., Xu, J.: MgNet: a unified framework of multigrid and convolutional neural network. Sci. China Math. 62(7), 1331–1354 (2019) 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 13. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer (2016) 14. Hilber, N., Reichmann, O., Schwab, C., Winter, C.: Computational Methods for Quantitative Finance: Finite Element Methods for Derivative Pricing. Springer, Berlin (2013) 15. Hintermüller, M., Ito, K., Kunisch, K.: The primal-dual active set strategy as a semismooth Newton method. SIAM J. Optim. 13(3), 865–888 (2002) 16. Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990) 17. Kinderlehrer, D., Stampacchia, G.: An Introduction to Variational Inequalities and Their Applications, volume 31 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2000) (Reprint of the 1980 original) 18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 19. Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A.: Neural operator: learning maps between function spaces. arXiv preprint arXiv:2108.08481 (2021) 20. Lamberton, D., Lapeyre, B.: Introduction to Stochastic Calculus Applied to Finance. Chapman & Hall/CRC Financial Mathematics Series, 2nd edn. Chapman & Hall/CRC, Boca Raton, FL (2008) 21. Lu, L., Jin, P., Karniadakis, G.E.: Deeponet: learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193 (2019) 22. Monga, V., Li, Y., Eldar, Y.C.: Algorithm unrolling: interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag. 38(2), 18–44 (2021) 23. Murty, K.G.: On the number of solutions to the complementarity problem and spanning properties of complementary cones. Linear Algebra Appl. 5(1), 65–108 (1972) 24. Opschoor, J.A.A., Schwab, C., Zech, J.: Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation 55, 537–582 (2019) (Report SAM 2019-35 (revised)) 25. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999) 26. Wohlmuth, B.: Variationally consistent discretization schemes and numerical algorithms for contact problems. Acta Numer. 20, 569–734 (2011) 27. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017) Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Research in the Mathematical Sciences Springer Journals

Deep solution operators for variational inequalities via proximal neural networks

Loading next page...
 
/lp/springer-journals/deep-solution-operators-for-variational-inequalities-via-proximal-E5ORq8RvxT
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2022
eISSN
2197-9847
DOI
10.1007/s40687-022-00327-1
Publisher site
See Article on Publisher Site

Abstract

andreas.stein@sam.math.ethz.ch Seminar for Applied Following Bauschke and Combettes (Convex analysis and monotone operator theory in Mathematics, ETH Zürich, Zürich, Hilbert spaces, Springer, Cham, 2017), we introduce ProxNet, a collection of deep neural Switzerland networks with ReLU activation which emulate numerical solution operators of variational inequalities (VIs). We analyze the expression rates of ProxNets in emulating solution operators for variational inequality problems posed on closed, convex cones in real, separable Hilbert spaces, covering the classical contact problems in mechanics, and early exercise problems as arise, e.g., in valuation of American-style contracts in Black–Scholes financial market models. In the finite-dimensional setting, the VIs reduce to matrix VIs in Euclidean space, and ProxNets emulate classical projected matrix iterations, such as projected Jacobi and projected SOR methods. 1 Introduction Variational Inequalities (VIs for short) in infinite-dimensional spaces arise in variational formulations of numerous models in the sciences. We refer only to [7,17,26] and the references there for models of contact problems in continuum mechanics, [20] and the references there for applications from optimal stopping in finance (mainly option pric- ing with “American-style,” early exercise features) and [4] and the references there for resource allocation and game theoretic models. Two broad classes of approaches toward numerical solution of VIs can be identified: deterministic approaches, which are based on discretization of the VI in function space, and probabilistic approaches, which exploit stochastic numerical simulation and an interpretation of the solution of the VI as condi- tional expectations of optimally stopped sample paths. The latter approach has been used to design ML algorithms for the approximation of the solution of one instance of the VI in [3]. Deep neural network structures arise naturally in abstract variational inequality prob- lems (VIs) posed on the product of (possibly infinite-dimensional) Hilbert spaces, as review, e.g., in [5]. Therein, the activation functions correspond to proximity operators of certain potentials that define the constraints of the VI. Weak convergence of this recurrent NN structure in the limit of infinite depth to feasible solutions of the VI is shown under suitable assumptions. An independent, but related, development in recent years has been the advent of DNN-based numerical approximations which are based on encoding known, © The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 0123456789().,–: volV 36 Page 2 of 35 Schwab, Stein Res Math Sci (2022) 9:36 iterative solvers for discretized partial differential equations, and certain fixed point itera- tions for nonlinear operator equations. We mention only [9], that developed DNNs which emulate the ISTA iteration of [6], or the more recently proposed generalization of “deep unrolling/unfolding” methodology [22]. Closer to PDE numerics, recently [11] proposed MGNet, a neural network emulation of multilevel, iterative solvers for linear, elliptic PDEs. The general idea behind these approaches is to emulate by a DNN a contractive map, say , which is assumed to satisfy the conditions of Banach’s Fixed Point Theorem (BFPT), and whose unique fixed point is the solution of the operator equation of interest. Let us denote the approximate map realized by emulating  with a DNN by . The universality theorem for DNNs in various function classes implies (see, e.g., [16,25] and the references there) that for any ε> 0 a DNN surrogate  to the contraction map exists, which is ε-close to , uniformly on the domain of attraction of . Iteration of the DNN  being realized by composition, any finite number K of steps of the fixed point iteration can be realized by K -fold composition of the DNN surrogate ˜ ˜ . Iterating , instead of , induces an error of order O(ε/(1 − L)), uniformly in the number of iterations K , where L ∈ (0, 1) denotes the contraction constant of . Due to the contraction property of , K may be chosen as O(| log(ε)|) in order to output an approximate fixed point with accuracy ε upon termination. The K -fold composition of the ˜ ˜ surrogate DNN  is, in turn, itself a DNN of depth O(depth()| log(ε)|). This reasoning is valid also in metric spaces, since the notions of continuity and contractivity of the map do not rely on availability of a norm. Hence, a (sufficiently large) DNN  exists which may be used likewise for the iterative solution of VIs in metric spaces. Furthermore, the resulting fixed-point-iteration nets obtained in this manner naturally exhibit a recurrent structure, in the case (considered here) that the surrogate  is fixed throughout the K -fold (k) K composition (more refined constructions with stage-dependent approximations{ } k=1 of increasing emulation accuracy could be considered, but shall not be addressed here). In summary, with the geometric error reduction in FPIs which is implied by the con- traction condition, finite truncation at a prescribed emulation precision ε> 0willimply O(| log(ε)|) iterations, and exact solution representation (of the fixed point of )inthe infinite depth limit. In DNN calculus, finitely terminated FPIs can be realized via finite concatenation of the DNN approximation  of the contraction map . The corresponding DNNs exhibit depth O(| log()|), and naturally a recurrent structure due to the repetition of the Net  in their construction. Thereby, recurrent DNNs can be built which encode numerical solution maps of fixed point iterations. This idea has appeared in various incar- nations in recent work; we refer to, e.g., MGNet for the realization of Multi-grid iterative solvers of discretized elliptic PDEs [11]. The presently proposed ProxNet architectures are, in fact, DNN emulations of corresponding fixed point iterations of (discretized) vari- ational inequalities. Recent work has promoted so-called Deep Operator Nets which emulate Data-to- Solution operators for classes of PDEs. We mention only [19] and the references there. To analyze expression rates of deep neural networks (DNNs) for emulating data-to-solution operators for VIs is the purpose of the present paper. In line with recent work (e.g., [19,21] and the references there), we take the perspective of infinite-dimensional VIs, which are set on closed cones in separable Hilbert spaces. The task at hand is then the analysis of rates of expression of the approximate data-to-solution map, which relates the input data (i.e., operator, cone, etc.) to the unique solution of the VI. Schwab, Stein Res Math Sci (2022) 9:36 Page 3 of 35 36 1.1 Layout The structure of this paper is as follows. In Sect. 2, we recapitulate basic notions and definitions of proximal neural networks in infinite-dimensional, separable Hilbert spaces. A particular role is taken by so-called proximal activations, and a calculus of ProxNets, which we shall use throughout the rest of the paper to build solution operators of VIs. Section 3 addresses the conceptual use of ProxNets in the constructive solution of VIs. We build in particular ProxNet emulators of convergent fixed point iterations to construct solutions of VIs. Section 3.2 introduces quantitative bounds for perturbations of ProxNets. Section 4 emphasizes that ProxNets may be regarded as (approximate) solution operators to unilateral obstacle problems in infinite-dimensional Hilbert spaces. Section 5 presents DNN emulations of iterative solvers of matrix LCPs which arise from discretization of unilateral problems for PDEs. Section 6 presents several numerical experiments, which illustrate the foregoing developments. More precisely, we consider the numerical solution of free boundary value problems arising in the valuation of American-style options, and in parametric obstacle problems. Section 7 provides a brief summary of the main results and indicates possible directions for further research. 1.2 Notation We use standard notation. By L(H, K), we denote the Banach space of bounded, linear operators from the Banach space H into K (surjectivity will not be required). Unless explicitly stated otherwise, all Hilbert and Banach-spaces are infinite-dimensional. By bold symbols, we denote matrices resp. linear maps between finite-dimensional spaces. 0 0 We use the notation conventions ·= 0and  ·= 1 for the empty sum and i=1 i=1 empty product, respectively. Vectors in finite-dimensional, Euclidean space are always understood as column vectors, with  denoting transposition of matrices and vectors. 2 Proximal neural networks (ProxNets) We consider the following model for an artificial neural network: For finite m ∈ N,let H and (H ) be real, separable Hilbert spaces. For every i ∈{1, ... ,m},let W ∈ i 0≤i≤m i L(H , H ) be a bounded linear operator, let b ∈ H ,let R : H → H be a nonlinear, i−1 i i i i i i continuous operator, and define T : H → H,x → R (W x + b ). (1) i i−1 i i i i Moreover, let W ∈ L(H , H), W ∈ L(H , H), b ∈ H and consider the neural 0 0 m+1 m m+1 network (NN) model : H → H,x → W x + W (T ◦ ··· ◦ T )(x) + b . (2) 0 0 m+1 m 1 m+1 The operator W ∈ L(H , H) allows to include skip connections in the model, similar to 0 0 deep residual neural networks as proposed in [12,13]. This article focuses in particular on NNs with identical input and output spaces as in [5, Model 1.1], that arise as special case of model (2)with H = H = H and are of the form 0 m : H → H,x → (1 − λ)x + λ(T ◦ ··· ◦ T )(x), (3) m 1 36 Page 4 of 35 Schwab, Stein Res Math Sci (2022) 9:36 for a relaxation parameter λ> 0 to be adjusted for each application. The relation H = H = H allows us to investigate fixed points of  : H → H, which are in turn solutions to variational inequalities. The nonlinear operators R act as activation operators of the NNs and are subsequently given by suitable proximity operators on H . We refer to and  as proximal neural networks or ProxNets for short and derive sufficient conditions on the operators T ,resp. W and R , so that  defines a contraction on H. Hence, the i i i ∗ ∗ unique fixed point x = (x ) ∈ H solves a variational inequality, that is turn uniquely determined by the network parameters W ,b and R for i ∈{1, ... ,m}. On the other i i i hand, any well-posed variational inequality on H may be recast as fixed-point problem for a suitable contractive ProxNet  : H → H. As an example, consider an elliptic variational inequality on H, with solution u ∈ K ⊂ H, where K is a closed, convex set. The set of contractive mappings on H is open; therefore, we may construct a one-layer ProxNet  : H → H, such that u is the unique fixed-point of . Therein, W ∈ L(H) stems from the bilinear form of the variational inequality, λ> 0 is a relaxation parameter chosen to ensure a Lipschitz constant below one, and R is the H-orthogonal projection onto K, see Sect. 4.1 for a detailed construction. This enables us to approximate solutions to variational inequality problems as fixed- point iterations of ProxNets and derive convergence rates. Due to the contraction property ∗ ∗ of , the fixed-point iteration x = (x ),n ∈ N converges to x = (x ) for any n n−1 x ∈ H at linear rate. Moreover, as the set of contractions on H is open, the iteration is stable under small perturbations of the network parameters. As we show in Sect. 5.3 below, the latter property allows us to solve entire classes of variational inequality problems using only one ProxNet with fixed parameters. 2.1 Proximal activations Definition 2.1 Let i ∈{0, ... ,m} be a fixed index, ψ : H → R ∪{∞} and dom(ψ ):= i i i {x ∈ H |ψ (x) < ∞}.Wedenoteby (H ) the set of all proper, convex, lower semi- i i 0 i continuous functions on H , that is (H ):= 0 i ψ : H → R∪{∞} lim inf ψ (y) ≥ ψ (x) for all x ∈ H and dom(ψ ) =∅ . i i i i i i y→x For any ψ ∈ (H ), the subdifferential of ψ at x ∈ H is i 0 i i i ∂ψ (x):={v ∈ H | (y − x, v) + f (x) ≤ f (y) for all y ∈ H }⊂ H,x ∈ H , i i i i i and the proximity operator of ψ is x − y prox : H → H,x → argmin ψ (y) + . (4) i i i y∈H It is well-known that prox is a firmly nonexpansive operator, i.e., 2prox − id is non- ψ ψ i i expansive, see, e.g., [2, Proposition 12.28]. As outlined in [5, Section 2], there is a natural relation between proximity operators and activation functions in neural networks: Virtu- ally any commonly used activation function such as rectified linear unit, tanh, softmax, etc., may be expressed as proximity operator on H = R , d ∈ N, for an appropriate i Schwab, Stein Res Math Sci (2022) 9:36 Page 5 of 35 36 ψ ∈ (H )(see[5, Section 2] for examples). We consider a set of particular proximity i 0 i operators given by A(H ):={R = prox | ψ ∈ (H ) such that ψ is minimal at 0 ∈ H }, (5) i i i 0 i i i cf. [5, Definition 2.20]. Apart from being continuous and nonexpansive, any R ∈ A(H ) i i satisfies R (0) = 0[5, Proposition 2.21]. Therefore, in the case H = R, the elements in i i A(R) are also referred to as stable activation functions,cf. [10, Lemma 5.1]. With this in mind, we formally define proximal neural networks, or ProxNets. Definition 2.2 Let  : H → H be the m-layer neural network model in (2). If R ∈ 0 i A(H ) holds for any i ∈{1, ... ,m},  is called a proximal neural network or ProxNet. 2.2 ProxNet calculus Before investigating the relation of  in (3) to variational inequality models, we record several useful definitions and results for NN calculus in the more general model  from Eq. (2). (j) (j) (j) Definition 2.3 Let j ∈{1, 2}, m ∈ N,let H , H , ... , H be separable Hilbert spaces j m (1) (2) such that H = H ,and let  be m -layer ProxNets as in (2) given by j j (j) (j) (j) (j) (j) (j) : H → H ,x → W T ◦ ··· ◦ T (x) + b . j m 0 m +1 1 m+1 The concatenation of  and  is defined by the map 1 2 (2) (1) •  : H → H ,x → ( ◦  )(x). (6) 1 2 1 2 (j) Remark 2.4 Due to W ≡ 0, there are no skip connections after the last proximal acti- vation in  ; hence,  •  is in fact a ProxNet as in (2)with2m layers and no skip j 1 2 connection. (j) (j) (j) Definition 2.5 Let m ∈ N, j ∈{1, 2},let H , H , ... , H be separable Hilbert spaces (1) (2) such that H = H ,and let  be m-layer ProxNets as in (2) given by 0 0 (j) (j) (j) (j) (j) (j) (j) : H → H ,x → W x + W T ◦ ··· ◦ T (x) + b . j m 0 0 m+1 1 m+1 (1) (2) The parallelization of  and  is given for H := H = H by 1 2 0 0 0 (1) (2) P( ,  ): H → H ⊕ H ,x → ( (x),  (x)). 1 2 0 1 2 Proposition 2.6 The parallelization P( ,  ) of two ProxNets  and  as in Defini- 1 2 1 2 tion 2.5 is a ProxNet. (j) (1) (2) (j) Proof We set H := H for j ∈{1, 2},fix i ∈{1, ... ,m} and observe that H ⊕ H m+1 i i equipped with the scalar product (·,·) := (·,·) + (·,·) is again a separable (1) (2) (1) (2) H ⊕H H H i i i i 36 Page 6 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Hilbert space. We define (1) (2) (1) (2) W : H → H ⊕ H,x → (W x, W x), 0 0 0 0 (1) (2) (1) (2) W : H → H ⊕ H,x → (W x, W x), 1 0 1 1 1 1 (1) (2) (1) (2) (1) (2) W : H ⊕ H → H ⊕ H , (x, y) → (W x, W y),i ∈{2, ... ,m + 1}, i−1 i−1 i i i i (1) (2) (1) (2) b := (b ,b ) ∈ H ⊕ H ,i ∈{1, ... ,m + 1}, i i i i (1) (2) (1) (2) (1) (2) R : H ⊕ H → H ⊕ H , (x, y) → (R x, R y),i ∈{0, 1, ... ,m}. i i i i i i (j) (j) Note that all W are bounded, linear operators. Moreover, if R = prox ∈ A(H ) (j) i i (j) (j) (1) (2) holds for ψ ∈ (H )and j ∈{1, 2}, then R = prox , where ψ ∈ (H ⊕ H )is 0 i i 0 i i i i i (1) (2) (1) (2) defined by ψ (x, y):= ψ (x) + ψ (y). Hence, R ∈ A(H ⊕ H ) and it holds that i i i i i i (1) (2) P( ,  ): H → H ⊕ H ,x → W x + W (T ◦ ··· ◦ T )(x) + b , 1 2 0 0 m+1 m 1 m+1 with T := R (W ·+b ) for i ∈{1, ... ,m}, which shows the claim. i i i i 3 ProxNets and variational inequalities 3.1 Contractive ProxNets We formulate sufficient conditions on the neural network model in (3) so that  : H → H is a contraction. The associated fixed-point iteration converges to the unique solution of a variational inequality, which is characterized in the following. Assumption 3.1 Let  be a ProxNet as in (3)with m ∈ N layers such that W ∈ L(H , H ), b ∈ H ,and R ∈ A(H ) for all i ∈{1, ... ,m}.Itholds that λ ∈ (0, 2) i−1 i i i i i and the operators W satisfy L := W  < min(1, 2/λ − 1). i L(H ,H ) i−1 i i=1 0 k+1 k Theorem 3.2 Let  be as in (3), let x ∈ H and define the iteration x := (x ),k ∈ N . k 0 Under Assumption 3.1,the sequence (x ,k ∈ N ) converges for any x ∈ H to the unique fixed-point x ∈ H. For any finite number k ∈ N, the error is bounded by 0 0 (x ) − x ∗ k k x − x  ≤ L ,L :=|1 − λ|+ λL ∈ [0, 1). (7) H ,λ ,λ 1 − L ,λ It holds that ∗ ∗ ∗ ∗ ∗ ∗ (x , ... ,x ):= (T x , (T ◦ T )x , ... , (T ◦ ··· ◦ T )x ,x ) ∈ H × ··· × H 1 2 1 m−1 1 1 m 1 m is the unique solution to the variational inequality problem: find x ∈ H , ... ,x = x ∈ 1 1 0 m H ,suchthat W x + b − x ∈ ∂ψ (x ),i ∈{1, ... ,m}. (8) i i−1 i i i i Schwab, Stein Res Math Sci (2022) 9:36 Page 7 of 35 36 Moreover, x is bounded by ⎛ ⎞ m m ∗ ∗ ⎝ ⎠ x  ≤ C W  b  , H j i H L(H ,H ) j−1 j i i=1 j=i+1 ⎨ 1 < ∞, λ ∈ (0, 1], ∗ 1−L C := < ∞, λ ∈ (1, 2). 2−λ(1+L ) Proof By the non-expansiveness of R : H → H for i ∈{1, ... ,m}, it follows for any i i i x, y ∈ H (x) − (y) ≤|1 − λ|x − y + λ(T ◦ ··· ◦ T )x − (T ◦ ··· ◦ T )y H H m 1 m 1 H ≤|1 − λ|x − y + λ(W ◦ (T ◦ ··· ◦ T ))x − (W ◦ (T ◦ ··· ◦ T ))y m m−1 1 m m−1 1 H ≤|1 − λ|x − y + λW  (T ◦ ··· ◦ T )x − (T ◦ ··· ◦ T )y m L(H ,H ) m−1 1 m−1 1 H m−1 m m−1 ≤|1 − λ|x − y + λ W  x − y H i L(H ,H ) H i−1 i 0 i=1 = (|1 − λ|+ λL )x − y . :=L ,λ As λ ∈ (0, 2) and L < min(1, 2/λ− 1) by Assumption 3.1, it follows that L < 1, hence, ,λ : H → H is a contraction. Existence and uniqueness of x ∈ H and the first part of the claim then follow by Banach’s fixed-point theorem for any initial value x ∈ H. By [2, Proposition 16.44], it holds for any i ∈{1, ... ,m}, x ,y ∈ H and ψ ∈ (H ) i i i i 0 i that x = prox (y ) ⇔ y − x ∈ ∂ψ (x ). i i i i i i ∗ ∗ ∗ ∗ ∗ Now, let x := x and x := (T ◦ ··· ◦ T )(x ) for i ∈{1, ... ,m}. This yields (x ) = i 1 0 i 0 ∗ ∗ ∗ ∗ ∗ (1 − λ)x + λx = x and hence, x = x . Recalling that R = prox with ψ ∈ (H ) i i 0 i m m ψ for all i ∈{1, ... ,m}, it hence follows that ∗ ∗ ∗ W x + b − x ∈ ∂ψ (x ), i i i i−1 i i cf. [5, Proposition 4.3]. Finally, to bound x ,weuse that ∗ ∗ ∗ x  ≤(x ) − (0) +(0) ≤ L x  + λ(T ◦ ··· ◦ T )(0) . H H H ,λ H m 1 H m 36 Page 8 of 35 Schwab, Stein Res Math Sci (2022) 9:36 As R ∈ A(H ), it holds R (0) = 0 and therefore, R (x) ≤x for all x ∈ H , which i i i i H H i i i in turn shows (T ◦ ··· ◦ T )(0) ≤W  (T ◦ ··· ◦ T )(0) +b m 1 H m L(H ,H ) m−1 1 H m H m m m m−1 m ≤W L(H ,H ) m m · W  (T ◦ ··· ◦ T )(0) +b  +b m−1 L(H ,H ) m−2 1 H m−1 H m H m−2 m−1 m−2 m−1 m ⎛ ⎞ m m ⎝ ⎠ ≤ W  b  . j L(H ,H ) i H j−1 j i=1 j=i+1 The claim follows with L < min(1, 2/λ − 1), since λ(1 − L ) > 0, λ ∈ (0, 1], 1 − L = ,λ 2 − λ(1 + L ) > 0, λ ∈ (1, 2). 3.2 Perturbation estimates for ProxNets We introduce a perturbed version of the ProxNet  in (3) in this subsection. Besides changing the network parameters W ,b and R , we also augment the input space H and i i i allow an architecture that approximates each nonlinear operator T itself by a multilayer network. These changes allow us to consider ProxNet as an approximate data-to-solution operator for infinite-dimensional variational inequalities and to control perturbations of the network parameters. For instance, we show in Example 3.4 that augmented ProxNets mimic the solution operator to Problem (8), that maps the bias vectors b , ... ,b to the 1 m solution x , ... ,x . 1 m Let H , ... , H be arbitrary separable Hilbert spaces and let H := H . Then, for 0 m−1 0 i ∈{0, ... ,m−1} the direct sum H ⊕ H equipped with the inner product (·,·) +(·,·) i i H i H is again a separable Hilbert space. For notational convenience, we set H :={0 ∈ H } m m and use the identification H ⊕ H = H = H. We consider the ProxNet m m m : H ⊕ H → H, (x, x) → (1 − λ)x + λ(T ◦ ··· ◦ T )(x, x), (9) m 1 where we allow that the operators T are itself multi-layer ProxNets: For any i ∈{1, ... ,m}, (i) (i) (i) (i) let m ∈ N and let H := H ⊕ H , H , ... , H ,H := H ⊕ H be separable i i−1 i−1 m i i 0 1 m −1 i (i) (i) (i) (i) Hilbert spaces. For j ∈{1, ... ,m }, consider the operators T (·) = R (W ·+b ) given i i j j j j i i i i by (i) (i) (i) (i) (i) (i) (i) R ∈ A(H ),W ∈ L(H , H ),b ∈ H . j j j j −1 j j j i i i i i i i We then define T as (i) (i) T : H ⊕ H → H ⊕ H , (x , x ) → (T ◦ ··· ◦ T )(x , x ), i i−1 i−1 i i i−1 i−1 i−1 i−1 m 1 which in turn determines  in (9). By construction,  is a ProxNet of the form (2)with m ≥ m layers. As compared to , we augmented the input and intermediate spaces i=1 by H . The composite structure of the maps T allows to choose input vectors x ∈ H i i i−1 i−1 Schwab, Stein Res Math Sci (2022) 9:36 Page 9 of 35 36 such that the first component of T (x , x ) approximates T (x ) uniformly on a subset i i−1 i−1 i i−1 of H . As we show in Sect. 5.3 below, this enables us to solve large classes of variational i−1 inequalities with only one fixed ProxNet , that in turn approximates a data-to-solution operator, instead of employing different fixed maps  : H → H for every problem. To formulate reasonable assumptions on , we denote for any i ∈{1, ... ,m − 1} by P : H ⊕ H → H , (x , x ) → x , H i i i i i i P : H ⊕ H → H , (x , x ) → x i i i i i i the projections to the first and second component for an element in H ⊕ H , respectively. i i (i) Moreover, we define the closed ball B :={x ∈ H |x  ≤ r}⊂ H with radius r > 0. r i i i H i Assumption 3.3 Let  and  be proximal neural networks defined as in Eqs. (3)and (9), respectively. There are constants L ∈ (0, 1), δ ≥ 0and ≥ ≥ > 0 such that 1 0 2 1.  satisfies Assumption 3.1 with λ ∈ (0, 1] and L ≤ L ∈ (0, 1). 2. It holds that ⎛ ⎞ ⎛ ⎞ i m m ⎝ ⎠ ⎝ ⎠ max W  + W  (b  + δ) ≤ , j L(H ,H ) 0 j L(H ,H ) i H 1 j−1 j j−1 j m i∈{0,1,...,m} j=1 i=1 j=i+1 ⎛ ⎞ m m ⎝ ⎠ W  b  ≤ (1 − L) , j L(H ,H ) i H 2 j−1 j i=1 j=i+1 as well as ⎛ ⎞ m m ⎝ ⎠ + W  ≤ . 2 j 0 L(H ,H ) j−1 j (1 − L) i=1 j=i+1 (i−1) 3. There is a vector x ∈ H , such that for i ∈{1, ... ,m},any x ∈ B ⊂ H and 0 0 i−1 i−1 x := P T (x , x )itholds i  i i−1 i−1 T (x ) − P T (x , x ) ≤ δ. i i−1 H i i−1 i−1 H i i Before we derive error bounds, we provide an example to motivate the construction of and Assumption 3.3. Example 3.4 (Bias-to-solution operator) Let  be as in Assumption 3.1 with m = 2layers and network parameters R ,W ,b for i ∈{1, 2}. We construct a ProxNet  that takes the i i i bias vectors b ,b of  as inputs to represent  for any choice of b ∈ H and therefore, 1 2 i i may be concatenated to map any choice of b ,b to the respective solution (x ,x )of(8). 1 2 1 2 In other words, we approximate the bias-to-solution operator O : H ⊕ H → H ⊕ H , (b ,b ) → (x ,x ). bias 1 2 1 2 1 2 1 2 36 Page 10 of 35 Schwab, Stein Res Math Sci (2022) 9:36 To this end, we set H = H ⊕ H , H = H , m = m = 1, b = 0 ∈ H ⊕ H and 0 1 2 1 2 1 2 i,1 i i (1) W : H ⊕ H ⊕ H → H ⊕ H , (x, x ,x ) → (W x + x ,x ) 1 2 1 2 1 2 1 1 2 (2) W : H ⊕ H → H , (x ,x ) → W x + x , 1 2 2 1 2 2 1 2 (1) R : H ⊕ H → H ⊕ H , (x ,x ) → R (x ) + x , 1 2 1 2 1 2 1 1 2 (2) R : H → H,x → R (x ). 2 2 2 2 2 (1) (1) Note that R = prox (1) with ψ (x ,x ):= ψ (x ) for any (x ,x ) ∈ H ⊕ H , where ψ 1 2 1 1 1 2 1 2 1 1 1 (1) determines R = prox . Hence, R ∈ A(H ⊕ H ), and it follows with x := (b ,b ) ∈ 1 1 1 0 1 2 ψ 1 H ⊕ H for any x ∈ H and x ∈ H that 1 2 1 1 (1) (1) T (x) = R (W x + b ) = P (R (W x + b ),b ) = P R (W (x, x )) = P T (x, x ), 1 1 1 1 H 1 1 1 2 H 0 H 1 0 1 1 1 1 1 (2) (2) (2) (2) T (x) = R (W x + b ) = R (W (x ,b )) = P R (W (x ,P T (x , x )). 2 2 2 1 2 1 2 H 1 1 1 0 1 1 2 1 1 H Therefore, the last part of Assumption 3.3 holds with δ = 0 for arbitrary large > 0and hence, the constants , , do not play any role in this example. The generalization 0 1 2 to m > 2 layers follows by a similar construction of . Now, let (x ,x ) be the solution to (8) for any choice (b ,b ) ∈ H ⊕ H . It follows from 1 2 1 2 1 2 Theorem 3.2 that the operator O : H ⊕ H → H, (b ,b ) → (·,b ,b ) • ··· • (·,b ,b )(x ) bias 1 2 1 2 1 2 1 2 k times satisfies x ≈ O (b ,b )and x ≈ T (O (b ,b )) for any fixed x ∈ H and any tuple 2 bias 1 2 1 1 bias 1 2 (b ,b ) ∈ H ⊕ H , for a sufficiently large number k of concatenations of (·,b ,b ). 1 2 1 2 1 2 The augmented ProxNet  may also be utilized to consider parametric families of obstacle problems, as shown in Example 4.4 below. Therein, the parametrization is with respect to the proximity operators R instead of the bias vectors b , and we construct an i i approximate obstacle-to-solution operator in the fashion of Example 3.4. In the finite- dimensional case (where the linear operators W correspond to matrices), the input of may even be augmented by a suitable space of operators, see Sect. 5.3 below for a detailed discussion. We conclude this section with a perturbation estimate that allows us to approximate the fixed-point of  by the augmented NN . Theorem 3.5 Let  and  be proximal neural networks as in Eqs. (3)and (9) that satisfy Assumption 3.3, and denote by x ∈ H the unique fixed-point of  from Theorem 3.2.Let (0) 0 k+1 x ∈ B be arbitrary, let  x be as in Assumption 3.3 and define the sequence  x := k 0 0 ( x , x ) for k ∈ N ,where x := x . Then, there is a constant C > 0 which is independent 0 0 of δ> 0 and x ,suchthatfor anyk ∈ N,itholds ∗ k k x − x  ≤ C L + δ , where L := (1 − λ) + λL < 1. λ Schwab, Stein Res Math Sci (2022) 9:36 Page 11 of 35 36 (0) Proof Let x ∈ B and let  x ∈ H be as in Assumption 3.3.Wedefine v := x, v := 0 0 0 i P (T ◦···◦ T )(x, x ) ∈ H for i ∈{1, ... ,m − 1},and v := (T ◦···◦ T )(x, x ) ∈ H. H i 1 0 i m m 1 0 With x := P T (x , x ) and the convention that P = id, we obtain the recursion i  i i−1 i−1 H formula v = P T (v , x ),i ∈{1, ... ,m}. (10) i H i i−1 i−1 We now show by induction that v  ≤ for i ∈{0, ... ,m}. By Assumption 3.3 it i H 1 holds v  =x 0 H H ⎛ ⎞ ⎛ ⎞ 0 0 0 ⎝ ⎠ ⎝ ⎠ = W  + W  (b  + δ) j L(H ,H ) 0 L(H ,H ) j H j−1 j −1 j j=1 j=1 =j+1 ≤ . Now, let ⎛ ⎞ ⎛ ⎞ i i i ⎝ ⎠ ⎝ ⎠ v  ≤ W  + W  (b  + δ) i H j 0 j H L(H ,H ) L(H ,H ) i j−1 j −1 j j=1 j=1 =j+1 hold for a fixed i ∈{0, ... ,m − 1}. Assumption 3.3 yields with Eq. (10) T (v ) − v  =T (v ) − P T (v , x ) ≤ δ. i+1 i i+1 H i+1 i H i+1 i 0 H i+1 i+1 i+1 Using R (x) ≤x for x ∈ H then yields together with the triangle i+1 H H i+1 i+1 i+1 inequality and the induction hypothesis v  ≤ δ +T (v ) i+1 H i+1 i H i+1 i+1 ≤ δ +W  v  +b i+1 i H i+1 H L(H ,H ) i i+1 i i+1 ⎛ ⎞ ⎛ ⎞ i+1 i+1 i+1 ⎝ ⎠ ⎝ ⎠ ≤ W  + W  (b  + δ) j L(H ,H ) 0 L(H ,H ) j H j−1 j −1 j j=1 l=1 j=l+1 ≤ , (i) and hence, v ∈ B for all i ∈{0, ... ,m}. With Assumption 3.3 and Eq. (10), we further (0) obtain for each x ∈ B (x) − (x, x ) 0 H =(T ◦ ··· ◦ T )(x) − v m 1 m H ≤(T ◦ ··· ◦ T )(x) − T (v ) +T (v ) − T (v , x ) m 1 m m−1 H m m−1 m m−1 m−1 H ≤W  (T ◦ ··· ◦ T )(x) − v  + δ, m L(H ,H ) m−1 1 m−1 H m−1 m m−1 36 Page 12 of 35 Schwab, Stein Res Math Sci (2022) 9:36 and by iterating this estimate over i ∈{1, ... ,m} ⎛ ⎞ m m ⎝ ⎠ (x) − (x, x ) ≤ λδ W  =: λδC . (11) 0 H j L(H ,H ) m j−1 j i=1 j=i+1 ∗ k k−1 Now, let x ∈ H be the unique fixed-point of  as in Theorem 3.2,let x = (x )and k k−1 0 0 0 x = ( x , x ) for any k ∈ N and a given initial value x =  x ∈ H with x  ≤ . 0 H 2 We obtain as in the proof of Theorem 3.2 1 0 x  ≤(x ) − (0) +(0) H H H ⎛ ⎞ m m ⎝ ⎠ ≤ L x  + λ W  b ,λ H j L(H ,H ) i H j−1 j i=1 j=i+1 ⎛ ⎛ ⎞ ⎞ (12) m m ⎝  ⎝ ⎠ ⎠ ≤ (1 − λ) + λ L + W  b 2 2 j i H L(H ,H ) i j−1 j i=1 j=i+1 ≤ , where we have used that L = (1− λ)+ λL ≤ (1− λ)+ λL and Assumption 3.3. Hence, ,λ k k we have x  ≤ inductively for all k ∈ N. In the next step, we show that  x  ≤ H 2 H 0 by induction over k. First, we obtain with x ≤ ≤ ,(11)and (12) that 2 0 1 0 0 0 0 x  =(x , x ) ≤(x , x ) − (x ) +(x ) ≤ λδC + . H 0 H 0 H H  2 Thus, x  ≤ follows with Assumption 3.3 on the relation of and as λ(1−L) < H 0 0 2 k−1 k k 1. Using the induction hypothesis  x − x  ≤ λδC L for a fixed k ∈ N, j=0 ,λ x  ≤ ,and L ≤ L := (1 − λ) + λL < 1 yields similarly H 2 ,λ λ k+1 k k k k k x  ≤( x , x ) − ( x ) +( x ) − (x ) +(x ) H 0 H H H k k ≤ λδC + L  x − x  + ,λ H 2 ≤ λδC L + , j=0 and hence,  x  ≤ λδC /(λ(1 − L)) + ≤ holds by induction for all k ∈ N.We H  2 0 apply the bounds from Theorem 3.2 and (11) and conclude the proof by deriving ∗ k ∗ k k−1 k−1 k−1 k−1 x − x ≤x − x +(x ) − ( x )+( x ) − ( x , x ) 1 0 x − x k k−1 k−1 ≤ L + L x − x  + λδC ,λ H ,λ 1 − L ,λ k−1 0 0 (x ) − x ≤ L + λδC L 1 − L j=0 max(2 , λC ) ≤ L + δ . 1 − L Schwab, Stein Res Math Sci (2022) 9:36 Page 13 of 35 36 4 Variational inequalities in Hilbert spaces In the previous sections, we have considered a ProxNet model and derived the associated variational inequalities. Now, we use the variational inequality as starting point and derive suitable ProxNets for its (numerical) solution. Let (H, (·,·) ) be a separable Hilbert space with topological dual space denoted by H ,and let ·.· be the associated dual pairing. H H Let a : H × H → R be a bilinear form, let f : H → R be a functional, and let K ⊂ H be a subset of H. We consider the variational inequality problem find u ∈ K : a(u, v − u) ≥ f (v − u), ∀v ∈ K. (13) Assumption 4.1 The bilinear form a : H × H → R is bounded and coercive on H, i.e., there exists constants C ,C > 0 such that for any v, w ∈ H it holds − + a(v, w) ≤ C v w and a(v, v) ≥ C v . + H H − Moreover, f ∈ H and K ⊂ H is nonempty, closed and convex. Problem (13) arises in various applications in the natural sciences, engineering and finance. It is well-known that there exists a unique solution u ∈ K under Assumption 4.1, see, e.g., [14, Theorem A.3.3] for a proof. We also mention that well-posedness of Prob- lem (13) is ensured under weaker conditions as Assumption 4.1; in particular, the coerciv- ity requirement may be relaxed as shown in [8]. For this article, however, we focus on the bounded and coercive case in order to obtain numerical convergence rates for ProxNet approximations. 4.1 Fixed-point approximation by ProxNets Theorem 4.2 Let Assumption 4.1 hold, and define H := H := H. Then, there exists 1 0 a one-layer ProxNet  as in Eq. (3)suchthatu ∈ K is the unique fixed-point of . 0 k k−1 Furthermore, for a given u ∈ H define the iteration u := (u ),k ∈ N. Then, there are constants L ∈ (0, 1) and C = C(u ) > 0 such that ,λ k k u − u ≤ CL ,k ∈ N. (14) ,λ Proof We recall the fixed-point argument, e.g., in [14, Theorem A.3.3], for proving exis- tence and uniqueness of u since it is the base for the ensuing ProxNet construction: Assumption 4.1 ensures that a(v,·),f ∈ H for any v ∈ H. The Riesz representation theorem yields the existence of A ∈ L(H)and F ∈ H such that for all v, w ∈ H (Av, w) = a(v, w)and (F, v) = f (v). H H Since K is closed convex, the H-orthogonal projection P : H → K onto K is well-defined and for any ω> 0 there holds u solves (13) ⇐⇒ u = P (ω(F − Au) + u). Hence, u is a fixed-point of the mapping T : H → H,v → P (ω(F − Av) + v). ω K 36 Page 14 of 35 Schwab, Stein Res Math Sci (2022) 9:36 By Assumption 4.1, it is now possible to choose ω> 0 sufficiently small, so that T is a contraction on H, which proves existence and uniqueness of u. The optimal relaxation ∗ 2 2 parameter in terms of the bounds C ,C is ω = C /C , leading to T ∗ = − + − ω L(H) 2 2 (1 − C /C ) < 1, see, e.g., [14, Theorem A.3.3]. 1 2 To transfer this constructive proof of existence and uniqueness of solutions to the ProxNet setting, we denote by ι the indicator function of K given by 0, if v ∈ K, ι : H → (−∞,∞],v → ∞, otherwise. Since K is closed convex, it holds that ι ∈ (H)and prox = P (cf. [2,Examples K 0 K 1.25 and 12.25]). Now, let m = 1, H = H, W := I − ωA ∈ L(H), b := ωF ∈ H,and 1 1 1 R := prox , where ω> 0 is such that I − ωA is a contraction. The ProxNet emulation  of the contraction map reads: for a parameter λ ∈ (0, 1], : H → H,v → (1 − λ)v + λ R (W v + b ) . 1 1 1 :=T (v) Since W  < 1, Assumption 3.1 is satisfied for every λ ∈ (0, 1]. Theorem 3.2 yields 1 L(H) k k−1 0 that the iteration u := (u ) converges for any u ∈ H to a unique fixed-point u ∈ H with error bounded by (14)and L := (1 − λ) + λW  ∈ (0, 1). Since ,λ 1 L(H) (v) = (1 − λ)v + λT (v), it follows that u is in turn the unique fixed-point of T , hence 1 1 u = u , which proves the claim. Remark 4.3 In the fashion of Example 3.4, we may construct an augmented ProxNet : H ⊗ H → H such that (v, F) = (v) for any v ∈ H, where F ∈ H is the Riesz representer of f ∈ H in Problem (13). The only difference is that F has to be multiplied with ω in the first linear transform to obtain b = ωF instead of F as bias vector. The parameters of  in this construction are independent of F; hence, Theorem 3.5 yields that for any f ∈ H (resp. F ∈ H)and x ∈ H it holds k k u − u ≤ CL ,k ∈ N, ,λ k k−1 where u := (u ,F). The previous remark shows that one fixed ProxNet is sufficient to solve Problem (13) for any f ∈ H . A similar result is achieved if the set K ⊂ H associated Problem (13)is parameterized by a suitable family of functions: Example 4.4 (Obstacle-to-solution operator) Let H be a Hilbert space of real-valued func- d 2 tions over a domain D ⊂ R such that C(D) ∩ H is a dense subset, e.g., H = L (D) or H = H (D), and let K :={v ∈ H| v ≥ g almost everywhere} for a sufficiently smooth function g : D → R. With this choice of K,(13)isan obstacle problem and P (v) = max(v, g) holds for any v ∈ H∩ C(D). We construct a ProxNet approximation to the obstacle-to-solution operator O : H → H,g → u corresponding to Problem (13) obs with K ={v ∈ H| v ≥ g almost everywhere}. Schwab, Stein Res Math Sci (2022) 9:36 Page 15 of 35 36 Assume (v) = P (W v + b ) for W ∈ L(H)and b ∈ H are as in Theorem 4.2 and K 1 1 1 1 let K :={v ∈ H| v ≥ 0 almost everywhere}.ToobtainaProxNetthatusesthe obstacle g ∈ H as input, we define (1) (1) : H ⊕ H → H, (v, g) → T (v, g) = (T ◦ T )(v, g) 2 1 (1) (1) (1) (1) via T (v, g):= R (W (v, g) + b ) which are, for j ∈{1, 2},definedby j j j j 1 1 1 1 (1) W : H ⊕ H → H ⊕ H, (v ,v ) → (W v − v ,v ), 1 2 1 1 2 2 (1) (1) (1) b := (b , 0) ∈ H ⊕ H,R := prox (1), ψ (v, g):= ι (v), 1 K 1 1 1 (1) (1) (1) W : H ⊕ H → H, (v ,v ) → v + v ,b := 0 ∈ H,R := id ∈ A(H). 1 2 1 2 2 2 2 (1) (1) (1) Note that this yields W ∈ L(H ⊕ H), W ∈ L(H), and R (v ,v ) = (P v ,v ) 1 2 K 1 2 1 2 1 for all v ,v ∈ H. It now follows for any given v, g ∈ H and K :={v ∈ H| v ≥ 1 2 g almost everywhere} (v) = P (W v + b ) K 1 1 = P (W v + b − g) + g K 1 1 (1) (1) (1) = R (W (P (W v + b − g),g) + b ) K 1 1 2 2 2 (1) = T (P (W v + b − g),g) K 1 1 2 0 (1) (1) (1) (1) = T ◦ (R (W (v, g) + b )) 2 1 1 1 = (v, g). As in Example 3.4, we concatenate  to obtain for a fixed choice of x ∈ H the operator O : H → H,g → (·,g) • ··· • (·,g) (x ). obs Convergence of O (g)to u for any g ∈ H (with arbitrary a-priori fixed x ∈ H)witha obs contraction rate that is uniform with respect to g ∈ H is again guaranteed as the number of concatenations tends to infinity. Therefore, as in Example 3.4, there exists one ProxNet that approximately solves a family of obstacle problems with obstacle ‘parameter’ g ∈ H. A combination of the ProxNets from Remark 4.3 and Example 4.4 enablesustoconsider both, f and K in (13), as input variables of a suitable NN  : H ⊕ H ⊕ H → H. This allows, in particular, to construct an approximation of the data-to-solution operator to Problem (13)thatmaps(F, g) ∈ H ⊕ H to u. 5 Example: linear matrix complementarity problems Common examples for Problem (13) arise in financial and engineering applications, where the bilinear form a : H × H → R stems from a second-order elliptic or parabolic differ- s s ential operator. In this case, H ⊂ H (D), where H (D) is the Sobolev space of smoothness s > 0 with respect to the spatial domain D ⊂ R , n ∈ N. Coercivity and boundedness of a as in Assumption 4.1 often arise naturally in this setting. To obtain a computationally tractable problem, it is necessary to discretize (13), for instance by a Galerkin approxima- tion with respect to a finite dimensional subspace H ⊂ H. To illustrate this, we assume d 36 Page 16 of 35 Schwab, Stein Res Math Sci (2022) 9:36 that dim(H ) = d ∈ N is a suitable finite-dimensional subspace with basis{v , ... ,v } and d 1 d consider an obstacle problem with K ={v ∈ H| v ≥ g almost everywhere} for a smooth function g ∈ H. Following Example 4.4, we introduce the set K :={v ∈ H| v ≥ 0 almost everywhere} and note that Problem (13) is equivalent to finding u = u + g ∈ K with u ∈ K such that: a(u ,v − u ) ≥ f (v − u ) − a(g, v − u ), ∀v ∈ K . (15) 0 0 0 0 0 0 0 5.1 Discretization and matrix LCP Any element v ∈ H may be expanded as v = w v for a coefficient vector w ∈ R . d i i i=1 To preserve non-negativity of the discrete approximation to (15), we assume that v ∈ K if and only if the basis coordinates are nonnegative, i.e., if w ∈ R . This property holds, for ≥0 instance, in finite element approaches. We write the discrete solution as u = x v . i i i=1 Then, u ∈ K if and only if x ∈ R . Consequently, the discrete version of (15)isto ≥0 d   d find x ∈ R :(y − x) Ax ≥ (y − x) c, ∀y ∈ R , (16) ≥0 ≥0 d×d d where the matrix A ∈ R and the vector c ∈ R are given by A := a(v ,v)and c := f, v  − a(g, v ),i,j ∈{1, ... ,d}. (17) ij j i i i H i Problem (16) is equivalent to the linear complementary problem (LCP) to find x ∈ R d×d d such that for A ∈ R and c ∈ R as in (17)itholds (18) Ax ≥ c, x ≥ 0,x (Ax − c) = 0, see, e.g., [14, Lemma 5.1.3]. If a : H × H → R is bounded and coercive as in Assump- tion 4.1, it readily follows that 2  2 d C x ≤ x Ax ≤ C x ,x ∈ R , (19) − + 2 2 where the constants C ≥ C > 0stemfromAssumption 4.1 and · is the Euclidean + − d d norm on R . This implies in particular that the LCP (18) has a unique solution x ∈ R , see [23, Theorem 4.2]. Equivalently, we may regard Problem (16), resp. (18), as varia- tional inequality on the finite-dimensional Hilbert space R equipped with the Euclidean scalar product (·,·) . Well-posedness then follows directly from Assumption 4.1 with the d d d identification H = R and the discrete bilinear form a : R × R → R, (x, y) → x Ay. 5.2 Solution of matrix LCPs by ProxNets The purpose of this section is to show that several well-known iterative algorithms to solve (finite-dimensional) LCPs may be recovered as particular cases of ProxNets in the setting of Sect. 2.Tothisend,wefix d ∈ N and use the notation H := R for convenience. We denote by {e , ... ,e }⊂ R the canonical basis of H. To approximately solve LCPs 1 d by ProxNets, and to introduce a numerical LCP solution map, we introduce the scalar and vector-valued Rectified Linear Unit (ReLU) activation function. Schwab, Stein Res Math Sci (2022) 9:36 Page 17 of 35 36 Definition 5.1 The scalar ReLU activation function  is defined as  : R → R,x → max(x, 0). The component-wise ReLU activation in R is given by (d) d d : R → R ,x → ((x, e ) )e . (20) i H i i=1 Remark 5.2 The scalar ReLU activation function  satisfies  = prox with ι ∈ [0,∞) [0,∞) (d) d (R)(see[5, Example 2.6]). This in turn yields  ∈ A(R ) for any d ∈ N by [5, Proposition 2.24]. Example 5.3 (PJORNet) Consider the LCP (18) with matrix A and triangular decompo- sition A = D + L + U, (21) d×d d×d where D ∈ R contains the diagonal entries of A,and L, U ∈ R are the (strict) lower and upper triangular parts of A, respectively. The projected Jacobi (PJOR) overrelaxation method to solve LCP (18) is given as: Algorithm 1 Projected Jacobi overrelaxation method 0 d Given: initial guess x ∈ R , relaxation parameter ω> 0 and tolerance ε> 0. 1: for k = 0, 1, 2, ... do k+1 −1 k −1 2: x = max (I − ωD A)x + ωD c, 0 k+1 k 3: if x − x  <ε then k+1 4: return x 5: end if 6: end for The max-function in Algorithm 1 acts component-wise on each entry of a vector in R . Hence, one iteration of the PJOR may be expressed as a ProxNet in Model (3)with m = 1, (d) λ = 1and  from Eq. (20)as d d (d) −1 −1 : R → R ,x → T (x):=  ((I − ωD A) x + ωD c). PJOR 1 d :=b =:W 1 If A satisfies (19) for constants C ≥ C > 0, it holds that + − 2 −1 2 W  =I − ωD A 1 d L(H) −1  2 −1  −1 = sup x x − ωx D (A + A)x + ω (xD A) D Ax x∈R ,x =1 1 1 2 2 ≤ 1 − 2ω min C + ω max A i∈{1,...,d} A i∈{1,...,d} ii A ii 2 2 ≤ 1 − 2ω + ω =: (ω). ∗ 3 2 ∗ The choice ω := C /(C A ) minimizes  such that (ω ) < 1. Moreover, (0) = 1, − 2 ∗ ∗ is strictly decreasing on [0, ω ], and increasing for ω> ω . Hence, there exists ω> 0 36 Page 18 of 35 Schwab, Stein Res Math Sci (2022) 9:36 d d such that for any ω ∈ (0, ω) the mapping  : R → R is a contraction. An application PJOR of Theorem 3.2 then shows that Algorithm (1) converges linearly for suitable ω> 0and any initial guess x . In the special case that A is strictly diagonally dominant, choosing ω = 1 is sufficient to ensure convergence, i.e., no relaxation before the activation is necessary. Example 5.4 (PSORNet) Another popular algorithm to numerically solve LCPs is the projected successive overrelaxation (PSOR) method in Algorithm 2. Algorithm 2 Projected successive overrelaxation algorithm 0 d Given: initial guess x ∈ R , relaxation parameter ω> 0 and tolerance ε> 0. 1: for k = 0, 1, 2, ... do 2: for i = 1, 2, ... ,d do k+1 1 i−1 k+1 d k 3: y = c − A x − A x i ij ij j=i+1 i A j=0 j j ii k+1 k k+1 4: x = max((1 − ω)x + ωy , 0) i i i 5: end for k+1 k 6: if x − x  <ε then k+1 7: return x 8: end if 9: end for To represent the PSOR-iteration by a ProxNet as in (3), we use the scalar ReLU activation from Definition 5.1 and define for i ∈{1, ... ,d} d d R : R → R ,x → ((x, e ) )e + x e . (22) i i H i j j j=1,j =i (d) In contrast to  in Eq. (20), the activation operator R takes the maximum only with respect to the ith entry of the input vector. Nevertheless, R ∈ A(R ) holds again by [5, d d×d Proposition 2.24]. Now, define b ∈ R and W ∈ R by i i 1 − ω l = j = i, 1 l = j ∈{1, ... ,d}\{i}, b = (0, ... , 0, ω , 0, ... , 0), (W ) = i i lj ij ii ⎪ −ω ,l = i, j ∈{1, ... ,d}\{i}, ii ith entry 0, elsewhere, k+1 k+1 d k and let T (x):= R (W x+ b ) for x ∈ R . Given the kth iterate x and x , ... ,x from i i i i 1 i−1 k+1 k+1 k,i−1 k k the inner loop of Algorithm 2, it follows for z := (x , ... ,x ,x , ... ,x ) that 1 i−1 i d k+1 k,i k,i k,i−1 x = z ,z = T (z ),i ∈{1, ... ,d},k ∈ N. (23) i i k−1,d k,0 k k+1 k As z = z = x for k ∈ N, this shows x =  (x ) for PSOR d d : R → R ,x → (T ◦ ··· ◦ T )(x). (24) PSOR d 1 Schwab, Stein Res Math Sci (2022) 9:36 Page 19 of 35 36 Provided (19) holds, we derive similarly to Example 5.3 ω ω 2    2 W  = sup x x − 2 x A x + (x A ) i [i] i [i] d A ii x∈R ,x =1 ii 1 ω ≤ 1 − 2ω C + A , ii ii ∗ 3 2 where A denotes the ith row of A. Hence, ω := C /(C A ) is sufficient to ensure that [i] − is a contraction, and convergence to a unique fixed-point follows as in Theorem 3.2. PSOR Remark 5.5 Both, the PJORNet and PSORNet from Examples 5.3 and 5.4, may be aug- mented as in 3.4 to take c ∈ R as additional input vector, and therefore to solve the LCP (18) for varying c. That is, concatenation of the PJORNet/PSORNet again yields an d d approximation to the solution operator O : R → R ,c → x associated with the RHS LCP (18) for fixed A. This is of particular interest, for instance, in the valuation of American options, where a collection of LCPs with varying model parameters has to be solved, see [14, Chapter 5] and the numerical examples in Sect. 6. Recall that c :=  f, v  −a(g, v )if i H i H i the matrix LCP stems from a discretized obstacle problem as introduced in the beginning of this section. Hence, by varying c it is possible to modify the right hand side f ,aswellas the obstacle g, of the underlying variational inequality (cf. Example 4.4 and Sect. 6.3). 5.3 Solution of parametric matrix LCPs by ProxNets In this section, we construct ProxNets that take arbitrary LCPs (A,c) in finite-dimensional, Euclidean space as input, and output approximations of the solution x to (18)withany prescribed accuracy. Consequently, these ProxNets realize approximate data-to-solution operators d d d O : {A ∈ R | there are C ,C > 0s.t. A satisfies (19)}× R → R , (A,c) → x. − + (25) Theideaistoconstruct aNNthatrealizesAlgorithm (1) that achieves prescribed error threshold ε> 0 uniformly for LCP data (A,c) from a set A , meaning the weights of the NN may not depend on A as in the previous section. To this end, we use that the multiplication of real numbers may be emulated by ReLU-NNs with controlled error and growth bounds on the layers and size of the ReLU NN. This was first shown in [27], and subsequently extended to the multiplication of an arbitrary number n ∈ N of real numbers in [24]. Proposition 5.6 [24, Proposition 2.6] For any δ ∈ (0, 1),n ∈ N and ≥ 1, there exists a ProxNet : R → R of the form (2)suchthat δ , sup  x − (x , ... ,x ) ≤ δ , i 1 n 0 δ , n 0 (x ,...,x )∈[− , ] 1 n i=1 (26) ess sup sup ∂ x − ∂ (x , ... ,x ) ≤ δ , x i x 1 n 0 j j δ , n 0 (x ,...,x )∈[− , ] j∈{1,...,n} 1 n i=1 36 Page 20 of 35 Schwab, Stein Res Math Sci (2022) 9:36 where ∂ denotes the weak derivative with respect to x . The neural network uses only x j j δ , ReLUs as in Definition 5.1 as proximal activations. There exists a constant C, independent of δ ∈ (0, 1),n ∈ N and ≥ 1, such that the number of layers m ∈ N of is 0 n,δ , 0 δ , bounded by m ≤ C 1 + log(n)log . (27) n,δ , Remark 5.7 For our purposes, it is sufficient to consider the cases n ∈{2, 3}; therefore, we assume without loss of generality that there is a constant C, independent of δ ∈ (0, 1) and ≥ 1, such that for n ∈{2, 3} it holds m ≤ C 1 + log . n,δ , Moreover, we may assume without loss of generality that m = m ,asitisalways 2,δ , 3,δ , 0 0 possible to add ReLU-layers that emulate the identity function to the shallower network (see [24, Section 2] for details). With this at hand, we are ready to prove a main result of this section. Theorem 5.8 Let ≥ 2 be a fixed constant, d ≥ 2 and define for any given ≥ 2 the set ! " −1 A satisfies (19) with ≥ C ≥ C ≥ > 0, + − d×d d A := (A,c) ∈ R × R . and c ≤ (28) −1 For the triangular decomposition A = D+L+U as in (21), define z := vec(D +L+U) ∈ 2 2 d d×d d d×d ∗ R ,where vec : R → R is the row-wise vectorization of a R -matrix. Let x be 0 d 0 the unique solution to the LCP (A,c), and let x ∈ R be arbitrary such that  x  ≤ . For any ε> 0, there exists a ProxNet d d d d : R ⊕ R ⊕ R → R (29) as in (9)andak ∈ N such that ∗ k x − x  ≤ ε k k−1 holds for the sequence  x := ( x ,z ,c) generated by  and any tuple (A,c) ∈ A . Moreover, k ≤ C (1 +| log(ε)|), where C > 0 only depends on and  has m ≤ ε 1 1 C (1+| log(ε)|+ log(d)) layers, where C > 0 is independent of . 2 2 Proof Our strategy is to approximate  from Example 5.3 for given (A,c)∈ A by PJOR (·,z ,c). We achieve this by constructing  based on the approximate multiplication NNs from Proposition 5.6 and show that  and  satisfy Assumption 3.3 to apply the PJOR error estimate from Theorem 3.5. Schwab, Stein Res Math Sci (2022) 9:36 Page 21 of 35 36 d d d d We start by defining the map  : R ⊕ R ⊕ R → R via (x, z ,c) = ⎛ ⎞ 3 2 1 1 ⎝ ⎠ max (1 − ω)x − ω x , , A + ω ,c , 0 , i j ij i δ , δ , 0 A 0 A ii ii j=1,j =i −6 − ∗ −3/2 for i ∈{1, ... ,d},0 <ω := ≤ = ω and δ ∈ (0,d ]. C A We show in the following that  is indeed a ProxNet. To bring the input into the (i) correct order for multiplication, we define for i ∈{1, ... ,d} the binary matrix W ∈ (2d+1)×(d +2d) R by 1 l = j ∈{1, ... ,d}, 1 l ∈{d + 1, ... , 2d},j = d + d(i − 1) + (l − d), (i) W := lj ⎪ 2 ⎪ 1 l = 2d + 1,j = d + d + i, 0 elsewhere. Hence, we obtain ⎛ ⎞ ⎜ ⎟ (i) W ⎝ z ⎠ = x , A , , A ,c . A ij ij i j<i j>i ii (i) 2d+1 2d+1 Now, let e , ... ,e ⊂ R be the canonical basis of R and define E := 2d+1 (i) (i) 1×(2d+1)  3×(2d+1) e ∈ R , E := [e e e ] ∈ R for j ∈{1, ... ,d}\{i} and E := d+i d+j i j d+1 3 2 2×(2d+1) [e e ] ∈ R .ByRemark 5.7, we may assume that and have d+i 2d+1 δ , δ , 0 0 an identical number of layers, denoted by m ∈ N. Moreover, it is straightforward to δ , construct a ProxNet Id : R → R with m layers that corresponds to the identity m δ , δ , 0 map, i.e., Id (x) = x for all x ∈ R. We use the concatenation from Definition 2.3 to δ , define (i) (i) (i) d +2d := Id • (E W ): R → R i δ , i (i) (i) (i) d +2d := • (E W ): R → R,j ∈{1, ... ,d}\{i}, j j δ , (i) (i) (i) d +2d := • (E W ): R → R. d+1 d+1 δ , Note that this yields 3 2 1 1 (i) (i) (i) (x, z ,c) = x ,  (x, z ,c) = x , , A ,  (x, z ,c) = ,c . A i A j ij A i i j d+1 δ , δ , 0 A 0 A ii ii (+,i) d +d (+,i) Furthermore, we set m := m + 1 and define T : R → R,x → (W x), 1 δ , m 0 1 (+,i) 1×(d+1) where  : R → R is the (scalar) ReLU activation and W ∈ R is given by 1 − ω j = i, (+,i) W := −ω j ∈{1, ... ,d}\{i}, ω j = d + 1. 36 Page 22 of 35 Schwab, Stein Res Math Sci (2022) 9:36 (i) (i) As  , ... ,  have the same input dimension, the same number of m layers, and δ , 1 d+1 no skip connections, we may parallelize as in Definition 2.5 to ensure ⎛ ⎞ 3 2 1 1 ⎝ ⎠ (x, z ,c) = max (1 − ω)x − ω x , , A + ω ,c , 0 i i j ij i δ , δ , 0 A 0 A ii ii j=1,j =i (i) (i) (+,i) = T • P  , ... ,  (x, z ,c). m 1 1 d+1 (+,i) (i) (i) It holds that  := T • P  , ... ,  is a ProxNet as in Eq. (9)with  : i m i 1 1 d+1 d +2d R → R and m = m + 1 layers for any i ∈{1, ... ,d}. We parallelize once more 1 δ , and obtain that  := P( , ... ,  ) is a ProxNet with m +1 layers that may be written 1 δ , d 0 (1) (1) (1) d d j−1 j as  = T ◦···◦T for suitable one-layer networks T : R → R and dimensions 1 1 1 d ∈ N for j ∈{0, 1, ... ,m } such that d = d + 2d and d = d. j 1 0 m −6 We now fix (A,c)∈ A and let  := R(W ·+ b )beasinExample 5.3 with ω = , PJOR 1 1 −1 −1 W = I − ωD A and b := ωD c. This shows that  has Lipschitz constant 1 d 1 PJOR −4 −8 −4 2 −4 L =W  ≤ 1 − 2 + = 1 − < 1and b  ≤ ω ≤ . 1 2 1 2 Note that |c |, 1/A ,|A |≤ for any i, j ∈{1, ... ,d}. Therefore, Proposition 5.6 yields i ii ij for x := (z ,c)and any x ∈ R with x ≤ that 0 A ∞ (x) − (x, z ,c) =T (x) − T (x, z ,c) 1 1 A ⎛ ⎞ d d 2 3 c 1 1 i  ij ⎝ ⎠ = ω − c , − x − A , ,x i j ij j δ , δ , A 0 A A 0 A ii ii ii ii i=1 j=1,j =i 2 3 2 ≤ ω d δ . −3/2 −6 Hence, since δ ∈ (0,d ]and ω = ,  and  satisfy Assumption 3.3 with 0 PJOR −4 3/2 L := 1 − ∈ (0, 1), δ := ωd δ ≥ 0, := ≥ 2, 0 1 −4 3/2 := −b  − δ ≥ − − ωd δ ≥ , 0 1 1 2 0 −6 123 1 := − δ/(1 − L) ≥ − ≥ − > 0. 2 0 0 −4 64 4 Theorem 3.5 then yields that there exists a constant C > 0 such that for all k, δ holds ∗ k k x − x  ≤ C L + δ . Here, C ≤ max(2 , 1)/(1 − L) ≤ 2 is independent of k. Given ε> 0, we choose % & ε ε min 1, min 1, log(ε) − log(2C) 2Cω 4 k =: , δ := ≥ ε 0 3/2 3/2 d d log(L) ∗ k to ensure x −  x ≤ ε. Hence, k ≤ C (1 +| log(ε)|), where C = C ( ) > 0is ε 1 1 1 −3/2 independent of d. Moreover, Inequality (27)inProposition 5.6 and the choice δ ≤ d shows that m ≤ C (1 +| log(ε)|+ log(d)), where C > 0 is independent of .The δ , 2 2 claim follows since  has m = m + 1 layers by construction. 1 δ , 0 Schwab, Stein Res Math Sci (2022) 9:36 Page 23 of 35 36 For fixed and ε,the ProxNets  emulate one step of the PJOR algorithm for any LCP (A,c) ∈ A and a given initial guess  x . This in turn allows to approximate the data-to-solution operator O from (25) to arbitrary accuracy by concatenation of suitable ProxNets. The precise statement is given in the main result of this section: Theorem 5.9 Let ≥ 2 be fixed, let A be givenasin(28), and let the data-to-solution operator O be given as in (25). Then, for any ε> 0, there is a ProxNet O : A → R such that for any LCP (A,c) ∈ A there holds O(A,c) − O (A,c) ≤ ε. ε 2 d×d Furthermore, let · denote the Frobenius norm on R . There is a constant C > 0, (1) (1) (2) (2) depending only on andd,suchthatforanyε> 0 and any two (A ,c ), (A ,c ) ∈ A there holds (1) (1) (2) (2) (1) (2) (1) (2) O (A ,c ) − O (A ,c ) ≤ C A − A  +c − c  . (30) ε ε 2 F 2 We give an explicit construction of the approximate data-to-solution operator O in the proof of Theorem 5.9 at the end of this section. To show the Lipschitz continuity of O with respect to the parametric LCPs in A , we derive an operator version of the so-called Strang Lemma: (1) (1) (2) (2) Lemma 5.10 Let ≥ 2,d ≥ 2, and let (A ,c ), (A ,c ) ∈ A .For l ∈{1, 2}, (l) (l) (l) (l) (l) let A = D + L + U be the decomposition of A as in (21) and define z := (l) (l) −1 (l) (l) d vec((D ) + L + U ) ∈ R . For target emulation accuracy ε> 0, let  be the 0 d 0 ProxNet as in (29), let x ∈ R be such that  x  ≤ and define the sequences (l),k (l),k−1 (l) (l),0 0 x := ( x ,z (l),c ),k ∈ N,  x := x ,l ∈{1, 2}. (31) Then, there is a constant C > 0, depending only on andd,suchthatfor anyk ∈ N and arbitrary, fixed ε> 0 it holds that (1),k (2),k (1) (2) (1) (2) x − x  ≤ C A − A  +c − c  . (32) 2 F 2 Proof By construction of  in Theorem 5.8,wehavefor x ∈ R , l ∈{1, 2},and i ∈ {1, ... ,d} that (l) (x, z ,c ) (l) i ⎛ ⎞ 3 2 1 1 (l) (l) ⎝ ⎠ = max (1 − ω)x − ω x , , A + ω ,c , 0 . i j ij i (l) (l) δ , δ , 0 0 A A j=1,j =i ii ii Therefore, we estimate by the triangle inequality (1) (2) |(x, z ,c ) − (x, z ,c ) | (1) (2) i i A A 3 3 1 1 (1)  (2) ≤ ω x , , A − x , , A j j ij ij (1) (2) δ , δ , 0 0 A A j=1,j =i ii ii 36 Page 24 of 35 Schwab, Stein Res Math Sci (2022) 9:36 2 2 1 1 (1) (2) + ω  ,c − ,c i i (1) (2) δ , δ , 0 0 A A ii ii 3 3 1 1 (1) (2) ≤ ω x , , A − x , , A j j ij ij (1) (1) δ , δ , 0 0 A A j=1,j =i ii ii 3 3 1 1 (2) (2) ω x , , A − x , , A j j ij ij (1) (2) δ , δ , 0 0 A A j=1,j =i ii ii 2 2 1 1 (1)  (2) + ω ,c − ,c i i (1) (1) δ , δ , 0 0 A A ii ii 2 2 1 1 (2) (2) + ω ,c − ,c . i i (1) (2) δ , δ , 0 0 A A ii ii (l) (l) (l) (l) Since (A ,c ) ∈ A for l ∈{1, 2}, it holds for any i, j ∈{1, ... ,d} that 1/A , A , ii ij (l) c ∈ [− , ]. Hence, for any x with x ≤ we obtain by ≥ 2 and the second estimate in (26) (2) (2) |(x, z ,c ) − (x, z ,c ) | (1) i (2) i A A (1) (2) ≤ ω δ + A − A ij ij (1) j=1,j =i ii 1 1 (2) + ω δ +|x A | − 0 j ij (1) (2) A A ii ii ⎛ ⎞ (1) (2) ⎝ ⎠ + ω δ + c − c i i (1) ij 1 1 (2) + ω δ +|c | − (1) (2) A A ii ii ⎛ ⎞ (1) (2)  (1) (2) 2 4 ⎝ ⎠ ≤ ω2(δ + ) A − A + c − c ij ij i i j=1 ⎛ ⎞ ⎛ ⎞ 1/2 ⎜  (1) (2)  (1) (2)⎟ 2 4 1/2 ⎝ ⎠ ≤ ω(δ + ) d A − A + c − c . ⎝    ⎠ ij ij i i j=1 We have used the mean-value theorem to obtain the bound 1 1 (1) (2) −  ≤ A − A ii ii (1) (2) A A ii ii in the second last inequality and the Cauchy–Schwarz inequality in the last step. We recall −6 −3/2 from the proof of Theorem 5.8 that ω = and δ ≤ d ; hence, there is a constant C = C( ,d) > 0, depending only on the indicated parameters, such that for any x ∈ R with x ≤ it holds (1) (2) (1) (2) (1) (2) (x, z ,c ) − (x, z ,c ) ≤ C A − A  +c − c  . (33) (1) (2) 2 F 2 A A Schwab, Stein Res Math Sci (2022) 9:36 Page 25 of 35 36 Moreover, for any x, y ∈ R such that x ,y ≤ , it holds by the mean-value ∞ ∞ theorem and the second estimate in (26) (1) (1) |(x, z (1),c ) − (y, z (1),c ) | i i A A (1) (1) −1 ≤ (x, z ,c ) − (y, z ,c ) − ((I − ωD A)(x − y)) (1) (1) i i d i A A −1 + ((I − ωD A)(x − y)) d i (1) 3 3 1 1 ij (1) (1) = ω x , , A − y , , A − (x − y ) j j j j ij ij (1) (1) (1) δ , δ , 0 0 A A A j=1,j =i ii ii ii −1 + ((I − ωD A)(x − y)) d i −1 ≤ ωδ |x − y |+ ((I − ωD A)(x − y)) . 0 j j d i j=1,j =i Hence, Young’s inequality yields for any > 0 that (1) (1) 2 (x, z ,c ) − (y, z ,c ) (1) (1) A A 2 ⎛ ⎞ d d 2 2 −1 2 ⎝ ⎠ ≤ 1 + ω δ |x − y | + (1 + )(I − ωD A)(x − y) j j d 0 2 (34) i=1 j=1,j =i 2 2 −1 2 2 ≤ 1 + ω δ d(d − 1) + (1 + )I − ωD A x − y , 0 2 2 where we have used the Cauchy–Schwarz inequality in the last step. From the proof of −6 −3/2 Theorem 5.8,wehaveasbeforethat ω = , δ ≤ d , and, furthermore I − −1 4 −4 (1) d d ωD A ≤ 1 − . Setting  := therefore shows that (·,z ,c ): R → R is 2 (1) a contraction on (R ,· ) with Lipschitz constant L > 0bounded by 2 1 −8 1/2 7 −12 −8 −1 −8 −12 −8 L ≤ + d + (1 − ) ≤ 1 + − ≤ 1 − ∈ (0, 1). 2 16 (35) Note that we have used d ≥ 2and ≥ 2 in the last two steps to obtain (35). Now, (l),k let ( x ) for l ∈{1, 2} and k ∈ N denote the iterates as defined in (31) and recall from (l),k (l),k the proof of Theorem 3.5 that  x  ≤ x  ≤ . Therefore, we may apply the ∞ 2 estimates in (33)and (34)toobtain (1),k (2),k (1),k (2),k−1 (1) (2),k (1) (2),k x − x  ≤ x − ( x ,z ,c ) +( x ,z ,c ) − x (1) (1) 2 2 2 A A (1),k−1 (2),k−1 (1) (2) (1) (2) ≤ L  x − x  + C A − A  +c − c 1 2 F 2 k−1 (1) (2) (1) (2) ≤ C A − A  +c − c  L F 2 j=1 (1) (2) (1) (2) ≤ A − A  +c − c  . F 2 1 − L The claim follows for C := < ∞,since C = C( ,d), and L is bounded indepen- 1−L dently with respect to ε and k by (35).  36 Page 26 of 35 Schwab, Stein Res Math Sci (2022) 9:36 d d d d Proof of Theorem 5.9 For fixed and ε, let the ProxNet  : R ⊕ R ⊕ R → R and k ∈ N be given as in Theorem 5.8. We define the operator O by concatenation of  via ε ε ⎡ ⎤ ⎢ ⎥ O (A,c):= (·,z ,c) • ··· • (·,z ,c) (0), (A,c) ∈ A . ⎣ ⎦ ε A A k -fold concatenation 0 d 0 1 Note that the initial value x := 0 ∈ R satisfies  x  ≤ for arbitrary > 0. Thus, 0 ∗ d applying Theorem 5.8 with x = 0 yields for any LCP (A,c) ∈ A with solution x ∈ R that ∗ k O(A,c) − O (A,c) =x − x  ≤ ε. ε 2 2 1,0 2,0 To show the second part of the claim, we set  x =  x := 0 and observe that (1),k (2),k (l),k (l) (l) ε ε ε x ,x in Lemma 5.10 are given by x = O (A ,c ) for l ∈{1, 2}. Hence, the (1) (1) (2) (2) estimate (30) follows immediately for any ε> 0and (A ,c ), (A ,c ) ∈ A from (32), by setting k = k . 6 Numerical experiments 6.1 Valuation of American options: Black–Scholes model To illustrate an application for ProxNets, we consider the valuation of an American option in the Black–Scholes model. The associated payoff function of the American option is denoted by g : R → R , and we assume a time horizon T = [0,T] for T > 0. In ≥0 ≥0 any time t ∈ T and for any spot price x ≥ 0 of the underlying stock, the value of the option is denoted by V (t, x) and defines a mapping V : T × R → R . Changing to ≥0 ≥0 time-to-maturity and log-price yields the map v : T × R → R , (t, x) → V (T − t, e ), ≥0 which is the solution to the free boundary value problem 2 2 σ σ ∂ v − ∂ v − r − ∂ v + rv ≥0in(0,T] × R, t xx x 2 2 v(t, x) ≥ g(e)in(0,T] × R, (36) 2 2 σ σ ∂ v − ∂ v − r − ∂ v + rv (g − v) =0in(0,T] × R, t xx x 2 2 x x v(0,e ) = g(e)in R, see, e.g., [14, Chapter 5.1]. The parameters σ> 0and r ∈ R are the volatility of the underlying stock and the interest rate, respectively. We assume that g ∈ H (R )and ≥0 construct in the following a ProxNet-approximation to the payoff-to-solution operator at time t ∈ T given by 1 1 O : H (R ) → H (R),g → v(t,·). (37) payoff,t ≥0 As V and v, and therefore O , are in general not known in closed-form, a common payoff,t approach to approximate v for a given payoff function g is to restrict Problem (36)to a bounded domain D ⊂ R and to discretize D by linear finite elements based on d 1 0 d 0 We could have also used any other x = 0 ∈ R such that  x  ≤ to define O for given and ε, but decided to ∞ ε fix the -independent initial guess x := 0 for simplicity. Schwab, Stein Res Math Sci (2022) 9:36 Page 27 of 35 36 equidistant nodal points. The payoff function is interpolated with respect to the nodal basis, and we collect the respective interpolation coefficients of g in the vector g ∈ R . Thetimedomain [0,T] is split by M ∈ N equidistant time steps and step size t = T /M, and the temporal derivative is approximated by a backward Euler approach. This space- time discretization of the free boundary problem (36) leads to a sequence of discrete d d d variational inequalities: Given g ∈ R and u := 0 ∈ R find u ∈ R such that for 0 m m ∈{1, ... ,M},itholds Au ≥ F ,u ≥ 0, (Au − F ) u = 0. (38) m+1 m m+1 m+1 m m+1 2 2 σ σ BS d×d BS The LCP (38) is defined by the matrices A := M + tA ∈ R , A := S + ( − 2 2 d×d BS  d r)B + rM ∈ R and right hand side F :=−t(A ) g + Mu ∈ R . The matrices m m d×d S, B, M ∈ R represent the finite element stiffness, advection and mass matrices; hence, A is tri-diagonal and asymmetric if = r. The true value of the options at time km is approximated at the nodal points via v(tm,·) ≈ u + g. This yields the discrete payoff- to-solution operator at time tm defined by d d O : R → R ,g → u + g,m ∈{1, ... ,M}. (39) payoff,tm m Problem (38) may be solved for all m using a shallow ProxNet d d d d : R ⊕ R ⊕ R → R ,x → R(W x + b ), 1 1 (d) d d with ReLU-activation R =  : R → R . The architecture of  allows to take g and u as additional inputs in each step; hence, we train only one shallow ProxNet that may be used for any payoff function g and every time horizon T. Therefore, we learn the payoff-to-solution operator O associated with Problem (36) by concatenating .The payoff,t d×3d d parameters W ∈ R and b ∈ R are learned in the training process and shall emulate 1 1 one step of the PJOR Algorithm 1, as well as the linear transformation (g,u ) → F to m m obtain the right hand side in (38). Therefore, a total of 3d + d parameters have to be learned in each example. For our experiments, we use the Python-based machine learning package PyTorch. All experiments are run on a notebook with 8 CPUs, each with 1.80 GHz, and 16 GB memory. (i) (i) (i) (i) 3d To train ,wesample N ∈ N input data points x := (x ,g ,u ) ∈ R , i ∈{1, ... ,N }, s s from a 3d-dimensional standard-normal distribution. The output-training data samples (i) (i) 0 y consist of one iteration of Algorithm 1 with ω = 1, initial value x := x ,with A as in BS  (i) (i) d (38) and right hand side given by c :=−t(A ) g + Mu ∈ R . We draw a total of N = 2· 10 input–output samples, use half of the data for training, and the other half for validation. In the training process, we use mini-batches of size N = 100 and the Adam batch −3 Optimizer [18] with initial learning rate 10 , which is reduced by 50% every 20 epochs. As error criterion, we use the mean-squared error (MSE) loss function, which is for each (i ) (i ) (i ) (i ) j j j j batch of inputs ((x ,g ,u ),j = 1, ... ,N ) and outputs (y ,j = 1, ... ,N ) batch batch https://pytorch.org/. 36 Page 28 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Fig. 1 Decay of the loss function for d = 600 (left) and d = 1000 (right). In all of our experiments, the −12 training loss falls below the threshold of 10 before the 250th epoch, and training is stopped early given by (i ) (i ) (i ) (i ) (i ) (i ) 1 1 1 N N N batch batch batch Loss (x ,g ,u ),··· , (x ,g ,u ) batch (i ) (i ) (i ) (i ) 2 j j j j := (x ,g ,u ) − y  . batch j=1 −12 We stop the training process if the loss function falls below the tolerance 10 or after a maximum of 300 epochs. The number of spatial nodal points d that determines the size of the matrix LCPs is varied throughout our experiments in d ∈{200, 400, ... , 1000}. We choose the Black–Scholes parameters σ = 0.1, r = 0.01 and T = 1. Spatial and temporal refinement are balanced by using M = d time steps of size t = T /M = 1/d. The decay of the loss-curves is depicted in Fig. 1, where the reduction in the learning rate every 20 epochs explains the characteristic “steps” in the decay. This stabilizes the training −12 procedure, and we reached a loss of O(10 ) for each d before the 250th epoch. Once training is terminated, we compress the resulting weight matrix of the trained −7 single-layer ProxNet by setting all entries with absolute value lower than 10 to zero. This speeds up evaluation of the trained network, while the resulting error is negligible. As the matrix W in the trained ProxNet is close to the “true” tri-diagonal matrix A from (38), this eliminates most of the ProxNet’s O(d ) parameters, and only O(d) non-trivial entries remain. The relative validation error is estimated based on the N := 10 validation samples val via (i ) (i ) (i ) (i ) val 2 j j j j (x ,g ,u ) − y j=1 2 err := . (40) val val (i ) 2 j=1 2 The validation errors and training times for each dimension are found in Table 1 and confirm the successful training of the ProxNet. Naturally, training time increases in d, −6 while the validation error is small of order O(10 ) for all d. To test the trained neural networks on Problem (38) for the valuation of an American option, we consider a basket of 20 put options with payoff function g (x):= max(K − x, 0) i i and strikes K = 10 + 90 for i ∈{1, ... , 20}. Hence, we use the same ProxNet for 20 different payoff vectors g . Note that we did not train our networks on payoff functions, i Schwab, Stein Res Math Sci (2022) 9:36 Page 29 of 35 36 Table 1 Training times and validation errors for the ProxNets in the Black–Scholes model in several dimensions, as estimated in (40) based on N = 10 samples val d 200 400 600 800 1000 Training time in s 6.06 39.38 90.69 311.04 466.87 −6 −6 −7 −6 −6 err 1.15 · 10 1.08 · 10 8.88 · 10 1.04 · 10 1.36 · 10 val The relative error remains stable with increasing problem dimension but on random samples, and thus, we could in principle consider an arbitrary basket containing different types of payoffs. The restriction to put options is for the sake of brevity only. We denote by u for m ∈{0, ... ,M} the sequence of solutions to (38)with m,i payoff vector g and u = 0 ∈ R for each i. 0,i Concatenating  k times yields an approximation to the discrete operator O payoff,tm in (39) for any m ∈{1, ... ,M} via ⎡ ⎤ ⎢ ⎥ d d d d O : R ⊕R ⊕R → R , (x, u ,g) → (·,g, u ) • ··· • (·,g, u ) (x). ⎣ ⎦ payoff,tm m m m k-fold concatenation An approximating sequence of (u ,m ∈{0, ... ,M}) is then in turn generated by m,i u := O ( u , u ,g),  u := u = 0 ∈ R . m+1,i payoff,tm m,i m,i 0,i 0,i 0 d That is, u is given by iterating  k times with initial input x =  u ∈ R and fixed m+1,i m,i inputs and g and u . We stop for each m after k iterations if two subsequent iterates x m,i k−1 k k−1 −3 and x satisfy x − x  < 10 . The reference solution u is calculated by a Python-implementation that uses the M,i Primal-Dual Active Set (PDAS) Algorithm from [15] to solve LCP (38)withtolerance ε = −6 10 in every time step. Compared to a fixed-point iteration, the PDAS method converges (locally) superlinear according to [15, Theorem 3.1], but has to be called separately for each payoff function g . In contrast, the ProxNet  may be iterated for the entire batch of 20 payoffs at once in PyTorch. We measure the relative error err := u − u  /u i,rel M,i M,i 2 M,i 2 for each payoff vector g at the end point T = tM = 1 and report the sample mean error err := err . (41) rel i,rel i=1 Sample mean errors and computational times are depicted for d ∈{200, 400, ... , 1000} in Table 2, where we also report the number of iterations k for each d to achieve the −3 desired tolerance of 10 . The results clearly show that ProxNets significantly accelerate the valuation of American option baskets, if compared to the standard, PDAS-based implementation. This holds true for any spatial resolution, i.e., the number of grid points d, −3 −4 while the relative error is small of magnitude O(10 )or O(10 ). For d ≥ 600, we actually find that the combined times for training and evaluation of ProxNets is below the runtime of the reference solution. We further observe that computational times scale similarly 36 Page 30 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Table 2 Relative errors and computational times of a ProxNet solver for a basket of American put options in the Black–Scholes model d 200 400 600 800 1000 −4 −4 −3 −3 −3 err 2.15 · 10 7.89 · 10 1.52 · 10 2.41 · 10 3.48 · 10 rel Iterations to tolerance 9 13 15 17 18 Time ProxNet in s 0.26 1.16 6.23 15.06 30.45 Time reference in s 4.37 33.17 142.01 350.86 761.10 ProxNets significantly reduce computational time, while their relative error remains sufficiently small for all d for both, ProxNet and reference solution, in d. Hence, in our experiments, ProxNets are computationally advantageous even for a fine resolution of d = 1000 nodal points. 6.2 Valuation of American options: jump-diffusion model We generalize the setting of the previous subsection from the Black–Scholes market to an exponential Lévy model. That is, the log-price of the stock evolves as a Lévy process, with jumps distributed with respect to the Lévy measure ν : B(R) → [0,∞). The option value v (in log-price and time-to-maturity) is now the solution of a partial integro-differential inequality given by ∂ v − ∂ v − γ∂ v + v(·+ z) − v − ∂ vν(dz) + rv ≥0in(0,T] × R, t xx x x v(t, x) ≥ g(e)in(0,T] × R, (42) ∂ v − ∂ v − γ∂ v + v(·+ z) − v − ∂ vν(dz) + rv (g − v) =0in(0,T] × R, t xx x x x x v(0,e ) = g(e)in R. Introducing jumps in the model hence adds a non-local integral term to Eq. (36). The 2/2 z driftisset to γ :=−σ − (e − 1 − z)ν(dz) ∈ R in order to eliminate arbitrage in the market. We discretize Problem (42) by an equidistant grid in space and time as in the previous subsection, for details, e.g., integration with respect to ν, we refer to [14,Chapter 10]. The space-time approximation yields again a sequence of LCPs of the form L L A u ≥ F ,u ≥ 0, (A u − F ) u = 0, (43) m+1 m m+1 m+1 m m+1 L Levy d×d Levy J J where A := M + tA ∈ R with A := S + A , and the matrix A stems from the integration of ν. A crucial difference to (38) is that A is not anymore tri- diagonal, but a dense matrix, due to the non-local integral term caused by the jumps. The drift γ and interest rate r are transformed into the right hand side, such that F := Levy  d −t(A ) g + Mu ∈ R , where g is the nodal interpolation of the transformed m m rkm payoff g (x):= ge (x− (γ + r)km). The inverse transformation gives an approximation −rT to the solution v of (42) at the nodal points via v(km,·− (γ + r)T) ≈ e u . We refer to [14, Chapter 10.6] for further details on the discretization of American options in Lévy models. The jumps are distributed according to the Lévy measure −β z −β z + − ν(dz) = λpβ e 1 (z) + λ(1 − p)β e 1 (z),z ∈ R. (44) + {z>0} − {z<0} That is, the jumps follow an asymmetric, double-sided exponential distribution with jump intensity λ = ν(R) ∈ (0,∞). We choose p = 0.7, β = 25, β = 20 to characterize the + − Schwab, Stein Res Math Sci (2022) 9:36 Page 31 of 35 36 Table 3 Training times and validation errors for the ProxNets in the jump-diffusion model, as estimated in (40) based on N = 10 samples val d 200 400 600 800 1000 Training time in s 6.59 37.03 88.22 300.40 461.79 −6 −6 −7 −6 −6 err 1.18 · 10 1.09 · 10 9.79 · 10 9.96 · 10 1.43 · 10 val The relative error remains stable with increasing problem dimension Table 4 Relative errors and computational times of a ProxNet solver for a basket of American put options in the jump-diffusion model d 200 400 600 800 1000 −4 −4 −4 −3 −3 err 1.55 · 10 4.97 · 10 9.62 · 10 1.52 · 10 2.09 · 10 rel Iterations to tolerance 6 7 7 7 6 Time ProxNet in s 0.21 1.04 4.81 11.62 34.27 Time reference in s 4.29 31.52 147.20 354.25 782.45 ProxNets significantly reduce computational time, while their relative error remains sufficiently small for all d tails of ν and set jump intensity to λ = 1. We further use σ = 0.1and r = 0.01 as in the Black–Scholes example. We use the same training procedure and parameters as in the previous subsection to train the shallow ProxNets. As only difference, we compress the weight matrix with −8 −7 L tolerance 10 instead of 10 (recall that A is dense). This yields slightly better relative errors in this example, while it does not affect the time to evaluate the ProxNets. Training times and validation errors are depicted in Table 3 and indicate again a successful training. The decay of the training loss is for each d very similar to Fig. 1, and training is again stopped in each case before the 300th epoch. After training, we again concatenate the shallow nets to approximate the operator O in (37), that maps the payoff function g to the corresponding option value v(t,·)at payoff,t any (discrete) point in time. We repeat the test from Sect. 6.1 in the jump-diffusion model with the identical basket of put options to test the trained ProxNets. The reference solution is again computed by a PDAS-based implementation. The results for American options in the jump-diffusion model are depicted in Table 4. Again, we see that the trained ProxNets −3 approximated the solution v to (42) for any g to an error of magnitude O(10 )orless. While keeping the relative error small, ProxNets again significantly reduce computational time and are therefore a valid alternative in more involved financial market models. We finally observe that the number of iterations to tolerance in the jump-diffusion model is stable at 6–7 for all d, whereas this number increases with d in the Black–Scholes mar- ket (compare the third row in Tables 2 and 4). The explanation for this effect is that the excess-to-payoff vector u has a smaller norm in the jump-diffusion case, but the iterations −3 terminate at the (absolute) threshold 10 in both, the Black–Scholes and jump-diffusion model. Therefore, we require less iterations in the latter scenario, although the option prices v and relative errors are of comparable magnitude in both examples. 6.3 Parametric obstacle problem To show an application for ProxNets beyond finance, we consider an elliptic obstacle 2 1 problem in the two-dimensional domain D := (−1, 1) .Wedefine H := H (D)and aim 0 36 Page 32 of 35 Schwab, Stein Res Math Sci (2022) 9:36 to find the solution u ∈ H to the partial differential inequality − u ≥ f in D,u ≥ g in D,u =0on ∂D. (45) Therein, f ∈ H is a given source term and g ∈ H is an obstacle function, for which we assume g ∈ C(D) ∩ H for simplicity in the following. We introduce the convex set K :={v ∈ H| v ≥ g almost everywhere} and the bilinear form a : H × H → R, (v, w) → ∇v ·∇w dx, and note that a, f and K satisfy Assumption 4.1. The variational inequality problem associated with (45) is then to find u ∈ K such that: a(u, v − u) ≥ f (v − u), ∀v ∈ K. (46) As for (15) at the beginning of Sect. 5, we introduce K :={v ∈ H| v ≥ 0 almost everywhere},and Problem(46) is equivalent to finding u = u + g ∈ K with u ∈ K such that: a(u ,v − u ) ≥ f (v − u ) − a(g, v − u ), ∀v ∈ K . (47) 0 0 0 0 0 0 0 As for the previous examples in this section, we use ProxNets to emulate the obstacle-to- solution operator O : H → H,g → u. (48) obs 2 2 We discretize D = [−1, 1] for d ∈ N by a (d + 2) -dimensional nodal basis of 0 0 linear finite elements, based on (d + 2) equidistant points in every dimension. Due to the homogeneous Dirichlet boundary conditions in (45), we only have to determine the discrete approximation of u within D and may restrict ourselves to a finite element basis {v , ... ,v }, for d := d , with respect to the interior nodal points. Following the procedure outlined in Sect. 5.1,wedenoteby g ∈ R again the nodal interpolation coefficients of d×d g (recall that we have assumed g ∈ C(D)) and by A ∈ R the finite element stiffness matrix with entries A := a(v ,v ) for i, j ∈{1, ... ,d} This leads to the matrix LCP to find ij j i u ∈ R such that Au ≥ c, u ≥ 0,u (Au − c) = 0, (49) d T where c ∈ R is in turn given by c := f (v )−(A g) for i ∈{1, ... ,d}. Given a fixed spatial i i i discretization based on d nodes, we again approximate the discrete obstacle-to-solution operator d d O : R → R ,g → u (50) obs d d d by concatenating shallow ProxNets  : R ⊕ R → R . The training process of the ProxNets in the obstacle problem is the same as in Sects. 6.1 and 6.2 and thus, is not further outlined here. The only difference is that we draw the input data for training now from a 2d-dimensional standard normal distribution. The output samples again correspond to one PJOR-Iteration with A and c as in (49)and Schwab, Stein Res Math Sci (2022) 9:36 Page 33 of 35 36 Table 5 Training times and validation errors for the ProxNets in the Obstacle Problem, as estimated in (40) based on N = 10 samples val d 100 400 900 1600 Training time in s 4.34 22.97 259.19 907.07 −6 −6 −7 −6 err 1.11 · 10 1.16 · 10 9.11 · 10 1.78 · 10 val The relative error remains stable with increasing problem dimension ω = 1, where the initial value and g are both replaced by the 2d-dimensional random input vector. After training, we again compress the weight matrices by setting all entries −7 with absolute value lower than 10 to zero. We test the ProxNets for LCPs of dimension 2 2 2 2 d ∈{10 , 20 , 30 , 40 } and report training times and validation errors in Table 5.As before, training is successful and aborted early for each d, since the loss function falls −12 below 10 before the 300th epoch. d d d d Once  : R ⊕ R → R is trained for given d, we use the initial value zero x = 0 ∈ R and concatenate  k times to obtain for any g the approximate discrete obstacle-to- solution operator ⎡ ⎤ ⎢ ⎥ d d O : R → R ,g → (·,g) • ··· • (·,g) (0). ⎣ ⎦ obs k-fold concatenation This yields u = O (g) ≈ u := O (g). We test the trained ProxNets on the parametric obs obs family of obstacles (g ,r > 0) ⊂ H, given by 2 1 1 −rx g (x):= min max e − , 0 , ,x ∈ D. (51) 2 4 For given r > 0, let g ∈ R denote the nodal interpolation of g ,andlet u be discrete solu- tion to the corresponding obstacle problem. We approximate the solutions u to (49) for 4i a basket of 100 obstacles g with r ∈ R :={1+ | i ∈{0, ... , 99}}. For this, we iterate the ProxNets  again on the entire batch of obstacles and denote by u the kth iterate for any k k−1 −4 r ∈ R. We stop the concatenation of  after k iterations if max u − u  < 10 , r∈R 2 r r andreportonthe valueof k for each d. The lower absolute tolerance is necessary in the obstacle problem, since the solutions u now have lower absolute magnitude as compared to the previous examples. The reference solution is again calculated by solving (49)with the PDAS algorithm, which has to be called separately for each obstacle in (g ,r ∈ R). A sample of g together with the associated discrete solution u and its ProxNet approxima- tion u is depicted in Fig. 2. The relative error of the ProxNet approximation, the number of iterations and the com- putational times are depicted in Table 6. ProxNets approximate the discrete solutions −4 well with relative errors of magnitude O(10 ) for all d. However, compared to the exam- ples in Sects. 6.1 and 6.2, we observe that significantly more iterations are necessary to −4 achieve the absolute tolerance of 10 . This is due to the larger contraction constants in the obstacle problem, which are very close to one for all d. The lower absolute tolerance −4 of 10 adds more iterations, but is not the main reason why we observe larger values of k in the obstacle problem. Nevertheless, ProxNets still outperform the reference solver in terms of computational time, with a relative error of at most 0.1% for large d. 36 Page 34 of 35 Schwab, Stein Res Math Sci (2022) 9:36 Fig. 2 From left to right: Obstacle g as in (51) with scale parameter r = 1.7677, the corresponding discrete 2 2 solution u with refinement parameter h := in each spatial dimension (corresponds to d = 40 interior nodal points in D), and its ProxNet approximation u based on k = 698 iterations Table 6 Relative errors and computational times of a ProxNet solver for a family of parametric obstacle problems d 100 400 900 1600 −4 −4 −4 −3 err 3.69 · 10 5.89 · 10 9.20 · 10 1.14 · 10 rel Iterations to tolerance 56 206 416 698 Time ProxNet in s 0.01 0.07 0.50 2.71 Time reference in s 0.08 0.51 3.13 26.67 ProxNets again reduce computational time, while keeping the relative error sufficiently small for all d. The number of iterations to tolerance is now significantly larger as in the previous examples 7Conclusions We proposed deep neural networks which realize approximate input-to-solution opera- tors for unilateral, inequality problems in separable Hilbert spaces. Their construction was based on realizing approximate solution constructions in the continuous (infinite dimen- sional) setting, via proximinal and contractive maps. As particular cases, several classes of finite-dimensional projection maps (PSOR, PJOR) were shown to be representable by the proposed ProxNet DNN architecture. The general construction principle behind ProxNet introduced in the present paper can be employed to realize further DNN architectures, also in more general settings. We refer to [1] for multilevel and multigrid methods to solve (discretized) variational inequality problems. The algorithms in this reference may also be realized as concatenation of ProxNets, similarly to the PJOR-Net and PSOR-Net from Examples 5.3 and 5.4. The analysis and representation of multigrid methods as ProxNets will be considered in a forthcoming work. Acknowledgements The preparation of this work benefited from the participation of ChS in the thematic period “Mathematics of Deep Learning (MDL)” from 1 July to 17 December 2021, at the Isaac Newton Institute, Cambridge, UK. AS has been funded in part by ETH Foundations of Data Science (ETH-FDS), and it is greatly appreciated. Data availability The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request. Received: 1 October 2021 Accepted: 4 April 2022 References 1. Badea, L.: Convergence rate of some hybrid multigrid methods for variational inequalities. J. Numer. Math. 23(3), 195–210 (2015) 2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, 2nd edn. Springer, Cham (2017) (With a foreword by Hédy Attouch) Schwab, Stein Res Math Sci (2022) 9:36 Page 35 of 35 36 3. Becker, S., Cheridito, P., Jentzen, A.: Deep optimal stopping. JMLR 20, 74 (2019) 4. Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization, volume 3 of CMS Books in Mathemat- ics/Ouvrages de Mathématiques de la SMC, 2nd edn. Springer, New York (2006) (Theory and examples) 5. Combettes, P.L., Pesquet, J.-C.: Deep neural network structures solving variational inequalities. Set-Valued Var. Anal. 28(3), 491–518 (2020) 6. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004) 7. Duvaut, G., Lions, J.-L.: Inequalities in Mechanics and Physics, volume 219 of Grundlehren der Mathematischen Wissenschaften. Springer, Berlin (1976) (Translated from the French by C. W. John) 8. Glas, S., Urban, K.: On noncoercive variational inequalities. SIAM J. Numer. Anal. 52(5), 2250–2271 (2014) 9. Gregor, K., LeCun, Y.: Learning fast approximations of sparse coding. In: International Conference on Machine Learning. PMLR, pp. 1–8 (2010) 10. Hasannasab, M., Hertrich, J., Neumayer, S., Plonka, G., Setzer, S., Steidl, G.: Parseval proximal neural networks. J. Fourier Anal. Appl. 26(4), 31 (2020) 11. He, J., Xu, J.: MgNet: a unified framework of multigrid and convolutional neural network. Sci. China Math. 62(7), 1331–1354 (2019) 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 13. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, pp. 630–645. Springer (2016) 14. Hilber, N., Reichmann, O., Schwab, C., Winter, C.: Computational Methods for Quantitative Finance: Finite Element Methods for Derivative Pricing. Springer, Berlin (2013) 15. Hintermüller, M., Ito, K., Kunisch, K.: The primal-dual active set strategy as a semismooth Newton method. SIAM J. Optim. 13(3), 865–888 (2002) 16. Hornik, K., Stinchcombe, M., White, H.: Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 3(5), 551–560 (1990) 17. Kinderlehrer, D., Stampacchia, G.: An Introduction to Variational Inequalities and Their Applications, volume 31 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2000) (Reprint of the 1980 original) 18. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 19. Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A.: Neural operator: learning maps between function spaces. arXiv preprint arXiv:2108.08481 (2021) 20. Lamberton, D., Lapeyre, B.: Introduction to Stochastic Calculus Applied to Finance. Chapman & Hall/CRC Financial Mathematics Series, 2nd edn. Chapman & Hall/CRC, Boca Raton, FL (2008) 21. Lu, L., Jin, P., Karniadakis, G.E.: Deeponet: learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193 (2019) 22. Monga, V., Li, Y., Eldar, Y.C.: Algorithm unrolling: interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag. 38(2), 18–44 (2021) 23. Murty, K.G.: On the number of solutions to the complementarity problem and spanning properties of complementary cones. Linear Algebra Appl. 5(1), 65–108 (1972) 24. Opschoor, J.A.A., Schwab, C., Zech, J.: Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation 55, 537–582 (2019) (Report SAM 2019-35 (revised)) 25. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–195 (1999) 26. Wohlmuth, B.: Variationally consistent discretization schemes and numerical algorithms for contact problems. Acta Numer. 20, 569–734 (2011) 27. Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017) Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Journal

Research in the Mathematical SciencesSpringer Journals

Published: Sep 1, 2022

References