Bayesian deep learning with hierarchical prior: Predictions from limited and noisy data

Xihaier Luo (xluo1@nd.edu), Ahsan Kareem (kareem@nd.edu)
Corresponding author address: 156 Fitzpatrick Hall, Notre Dame, IN 46556, USA.
Preprint submitted to Elsevier, July 10, 2019. arXiv:1907.04240v1 [stat.ML], 8 Jul 2019.

Datasets in engineering applications are often limited and contaminated, mainly due to unavoidable measurement noise and signal distortion. Thus, using conventional data-driven approaches to build a reliable discriminative model, and further applying this identified surrogate to uncertainty analysis, remains very challenging. In this paper, a deep learning (DL) based probabilistic model is presented to provide predictions based on limited and noisy data. To address noise perturbation, the Bayesian learning method, which naturally facilitates an automatic updating mechanism, is considered to quantify and propagate model uncertainties into predictive quantities. Specifically, hierarchical Bayesian modeling (HBM) is first adopted to describe model uncertainties, which allows the prior assumption to be less subjective while also making the proposed surrogate more robust. Next, the Bayesian inference is seamlessly integrated into the DL framework, which in turn supports probabilistic programming by yielding a probability distribution of the quantities of interest rather than their point estimates. Variational inference (VI) is implemented for the posterior distribution analysis, where the intractable marginalization of the likelihood function over the parameter space is framed as an optimization problem, and the stochastic gradient descent method is applied to solve this optimization problem. Finally, Monte Carlo simulation is used to obtain an unbiased estimator in the predictive phase of Bayesian inference, where the proposed Bayesian deep learning (BDL) scheme is able to offer confidence bounds for the output estimation by analyzing propagated uncertainties. The effectiveness of Bayesian shrinkage is demonstrated in improving predictive performance using contaminated data, and various examples are provided to illustrate concepts, methodologies, and algorithms of the proposed BDL modeling technique.

Keywords: Probabilistic modeling, Bayesian inference, Deep learning, Monte Carlo variational inference, Bayesian hierarchical modeling, Noisy data

1. Introduction

The application of data-driven approaches for learning the performance of engineering systems from limited experimental data is hindered by at least three factors. First, the original input-output patterns are often governed by a series of highly nonlinear and implicit partial differential equations (PDEs), and hence the approximation of their functional relationships may be proportionally computationally demanding [1, 2]. Secondly, interpolation and extrapolation techniques are usually needed to extract knowledge from acquired data, because only a limited number of sensors is used in practice and sensor malfunction often occurs in real time. Nevertheless, it is very difficult to establish an accurate discriminative model merely from data, especially when only a relatively small dataset is available [3, 4]. Thirdly, experimental data is inevitably contaminated by noise from different sources, for example, noisy sensing induced by signal perturbation during monitoring. The performance of conventional discriminative algorithms may be noticeably impaired if proper noise reduction has not been performed [5, 6]. In this context, we present a machine learning based predictive model that is capable of providing high-quality predictions from limited and noisy data.
To date, most machine learning models are deterministic, which implies that a given input sample $x$ is strictly bound to a point estimate $\hat{y}$ notwithstanding the existence of model uncertainties [7, 8]. Probabilistic modeling, on the other hand, emerges as an attractive alternative on account of its ability to quantify the uncertainty in model predictions, which can prevent a poorly trained model from being overconfident in its predictions and hence helps stakeholders make more reliable decisions [7, 9]. In the literature, Gaussian processes (GPs) and generalized polynomial chaos (gPC) are two representative members of the probabilistic modeling family [9, 10, 11, 12]. From a mathematical standpoint, GPs put a joint Gaussian distribution over input random variables by defining a mean function $\mathbb{E}[x]$ and a covariance function $\mathrm{Cov}[x_i, x_j]$ [10]. Then, GPs compute the hyperparameters of the spatial covariance function and propagate the inherent randomness of $x$ by virtue of Bayes' theorem. gPC, in turn, is an effective way to propagate uncertain quantities by means of a set of $n$ random coefficients and $n$ orthogonal polynomials. Approximation approaches (e.g. Galerkin projection) are usually used to determine the unknown coefficients and the polynomial basis [12]. Even though GPs and gPC are capable of computing empirical confidence intervals, the inference complexity may become overwhelming when the number of observations increases (e.g. a cubic scaling relationship $\mathcal{O}(N^3)$ between the computational complexity and the data $\{x_i, y_i\}_{i=1}^{N}$ is found in the GPs case [13]). Furthermore, scaling a GPs model or identifying the random coefficients of a gPC model for problems with high-dimensional data remains challenging [3, 9, 10]. In contrast, deep learning (DL) has differentiated itself within the realm of machine learning through its superior performance in handling large-scale complex systems.
With the strong support of high-performance computing (HPC), DL has made significant accomplishments in a wide range of applications such as image recognition, data compression, computer vision, and language processing [14, 15]. Nonetheless, the same level of application has not been observed for its probabilistic version [16, 17, 18]. This paper attempts to bridge the modeling gap between DL and Bayesian learning by presenting a new Bayesian deep learning (BDL) paradigm. With the aim of developing a surrogate model that can be used to accelerate the uncertainty analysis of engineering systems using noisy data, we focus on three main aspects of probabilistic modeling (see Fig. 1).

First, Bayesian statistical inference encodes subjective beliefs, whereby prior distributions are imposed on model parameters to represent the initial uncertainty. A conventional DL supported surrogate $\hat{\mathcal{M}}(\cdot)$, however, can be captivating and confusing in equal measure. A deeply structured network architecture allows $\hat{\mathcal{M}}(\cdot)$ to approximate a wide variety of functions, but also makes model parameters hard to interpret, which in turn increases the difficulty of choosing a reasonable prior distribution for $\hat{\mathcal{M}}(\cdot)$ [19, 20]. In [19], Lee shows that a noninformative prior (e.g. the Jeffreys prior) that bears objective Bayes properties is liable to be misled by the variability in data. Moreover, using the Fisher information matrix to compute a Jeffreys prior for a large network architecture can be computationally prohibitive [21]. On the informative prior side, the zero-centered Gaussian is, not surprisingly, extensively explored in early work on Bayesian neural networks (BDL) on account of its flexibility in implementation as well as its natural regularization mechanism [16, 17], where the quadratic penalty on BDL parameters alleviates the overfitting problem. Later in [18], Neal points out that employing a heavy-tailed distribution (e.g. the Cauchy distribution) to represent prior knowledge can provide a more robust shrinkage estimator and diminish the effects of outlying data. Nonetheless, the Cauchy distribution is difficult to implement because it does not have finite moments [20]. To develop a prior model that is easier to reformulate and work with, we investigate the efficacy of applying hierarchical Bayesian modeling (HBM) to define the prior distribution. It is found that HBM can efficiently reduce the variance in model performance induced by prior assumptions by diffusing the influences that are cascaded down from the top level, hence allowing a relatively more robust distribution model.

[Figure 1 (image not reproduced). Phase 1: train a surrogate using noisy data. Phase 2: conduct uncertainty analysis using the trained surrogate.]

Figure 1: The proposed BDL model $\hat{\mathcal{M}}(\cdot)$ aims at reducing the computational burden imposed by the repeated evaluations of the original high-fidelity model $\mathcal{M}(\cdot)$ in uncertainty analysis. Instead of a point estimate, the BDL model quantifies the model uncertainties by presenting a distribution of values.

Secondly, the determination of the posterior distribution $p(\omega \mid \mathcal{D})$ requires integrating the model parameters $\{\omega_1, \omega_2, \ldots, \omega_n\}$ out of the likelihood function. Unfortunately, performing numerical integration in BDL's parameter space is computationally intractable in practice, as a neural network model is commonly configured with hundreds or thousands of parameters [22, 23]. Approximation methods, which can be broadly classified into sampling methods and optimization methods, have been introduced to alleviate this computational bottleneck [24].
For the sampling method, Markov Chain Monte Carlo (MCMC) has been explored in early work to calculate the posterior probabilities [17, 20, 25]. The main idea of MCMC is to produce numerical samples from the posterior distribution by simulating a discrete but dependent Markov chain $\{\omega_i\}_{i=1,2,\ldots}$ on the state space, whose stationary distribution $\pi(\omega)$ approximates $p(\omega \mid \mathcal{D})$. A sufficient condition ensuring that the stationary distribution converges to the target posterior distribution is that the transition kernel $T(\cdot)$ satisfies detailed balance, $p(\omega \mid \mathcal{D})\, T(\omega \to \omega') = p(\omega' \mid \mathcal{D})\, T(\omega' \to \omega)$. However, the convergence of MCMC algorithms can be extremely slow in the presence of large datasets, since the burn-in process needed to eliminate the initialization bias is greatly extended [9, 23]. More recently, the variational inference (VI) method has been employed for inferring an intractable posterior distribution, where the probabilistic inference problem is cast in a deterministic optimization form [26, 27, 28]. A proxy probability distribution $q_\theta(\omega)$, which is formally explicit and computationally efficient, is introduced to approximate the true posterior distribution $p(\omega \mid \mathcal{D})$. Compared to the sampling method, VI has the advantage of approximating non-conjugate distributions by virtue of optimizing an explicit objective function [23, 24], and VI solves the intractable integrals more efficiently because off-the-shelf optimization algorithms can be seamlessly adapted to the minimization problem [26, 28, 29]. Unfortunately, differentiating the objective function with respect to the proxy posterior involves expectations that depend on the variational parameters, for which a Monte Carlo gradient estimator may exhibit high variance [30, 31]. To address this issue, we reparameterize our objective function by introducing a set of auxiliary variables.
It should be noted that such reparameterization not only yields an unbiased estimator of the objective function but also provides an efficient approximation of the variational parameters by permitting the use of stochastic gradient descent (SGD) in optimization.

Lastly, the Monte Carlo (MC) method is used in the predictive phase of Bayesian inference. The MC method draws numerical samples from the proxy probability distribution $q_\theta(\omega)$, builds a predictive probability distribution for new data, and assigns a confidence level to the model prediction to represent model uncertainty.

The remainder of this paper is organized as follows: Section 2 gives a brief introduction to surrogate modeling using deep neural networks. Section 3 describes the proposed BDL model in detail. In Section 4, various examples are provided to demonstrate the effectiveness of BDL in dealing with noisy data. Finally, Section 5 draws major conclusions.

2. Deterministic modeling: a deep learning framework

2.1. Neural networks based surrogate model

In the context of supervised learning [9, 15], let $x = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^m$ denote an input vector, and let the corresponding output vector $y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^n$ be estimated by a computationally intensive model $\mathcal{M}(\cdot)$ (e.g. a large finite element model). We are interested in using neural networks to approximate the functional relationship $F(\cdot)$ between $x$ and $y$:

$$y = F(x) \;\rightarrow\; \hat{y} = \hat{F}(x) \tag{1}$$

where $\hat{F}(\cdot)$ is the mathematical expression of the neural networks based surrogate $\hat{\mathcal{M}}(\cdot)$, which can be decomposed as:

$$\hat{F}(x) = \hat{f}^{K} \circ \hat{f}^{K-1} \circ \cdots \circ \hat{f}^{1}(x) \tag{2}$$

with $K$ denoting the number of layers and $\circ$ symbolizing the functional composition operation, which is defined as [32]:

$$\hat{f}^{i} \circ \hat{f}^{j} = \hat{f}^{i}\big(\hat{f}^{j}(\cdot)\big) \tag{3}$$

Each function in the sequence $\hat{f}^{i}(\cdot),\ i = 1, 2, \ldots, K$ contains two steps, where the first step is identical to a linear regression:

$$z^{i} = \omega^{i} x^{i} + b^{i} \tag{4}$$

In Eq. (4), $x^{i}$ is the input vector of the $i$-th layer, $\omega^{i}$ is the weight matrix, and $b^{i}$ is the $i$-th bias term. For the sake of brevity, $b^{i}$ can be integrated into $\omega^{i}$ by introducing an additional input variable fixed at 1. For the rest of the paper, let $\omega$ be a tensor containing all model parameters [32, 33]. Next, $\hat{f}^{i}(\cdot)$ applies an element-wise nonlinear transformation to the intermediate output $z^{i}$ in the second step:

$$y^{i} = \hat{f}^{i}(x^{i}) = \sigma(z^{i}) \tag{5}$$

where $\sigma(\cdot)$ is often referred to as the activation function [14, 15]. The selection of $\sigma(\cdot)$ directly depends on the characteristics of $\mathcal{M}(\cdot)$. Moreover, a network architecture is deemed to be deep when $K > 3$ [14]. Thus, a deep learning framework can be built by increasing the composition size $K$.

2.2. Probabilistic interpretation of $L_2$ loss function

Consider the parameterized deep neural network model $\hat{y} = \hat{F}(x)$ described in Section 2.1 and a training dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$. The next step is to find an optimal $\omega$ such that the surrogate $\hat{F}(x)$ best describes the data $\mathcal{D}$. In this regard, a loss function $\mathcal{L}(\cdot)$ that measures the error between the predicted value $\hat{y}$ and the expected result $y$ is defined. Then, $\hat{F}(x)$ is trained to approximate $F(x)$ by minimizing the empirical loss through tuning the model parameters $\omega$:

$$\omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N} \mathcal{L}\big(y_i, \hat{F}(x_i)\big) \tag{6}$$

From a probabilistic modeling perspective, the interest of the loss function $\mathcal{L}(\cdot)$ lies in the probability distribution of $y$ as a function of $x$:

$$\mathcal{L}\big(y_i, \hat{F}(x_i)\big) = p(\mathcal{D} \mid \omega) = p(y \mid x, \omega) \tag{7}$$

where the model parameters can be learned by the method of maximum likelihood estimation (MLE) [9, 22], which searches for the estimator of $\omega$ that maximizes the likelihood term, $\omega_{\mathrm{MLE}} = \arg\max_{\omega} p(\mathcal{D} \mid \omega)$. Let the noise term be independent and identically distributed (i.i.d.). The probability density associated with the paired observations under a Gaussian assumption with noise variance $\sigma^2$ can then be expressed as:

$$\mathcal{L}\big(y_i, \hat{F}(x_i)\big) = p(\mathcal{D} \mid \omega, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid \hat{F}(x_i), \sigma^2 I\big) = \prod_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2}\,\big\|y_i - \hat{F}(x_i)\big\|^2\right) \tag{8}$$

Usually, the numerical implementation of the MLE method solves the minimization problem of Eq. (6) on a logarithmic scale:

$$\omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N} \left[\frac{1}{2\sigma^2}\,\big\|y_i - \hat{F}(x_i)\big\|^2 + \frac{n}{2}\log\big(2\pi\sigma^2\big)\right] \tag{9}$$

where the noise variance $\sigma^2$ (the reciprocal of the precision term) is likewise determined by minimizing the negative log-likelihood:

$$\sigma^2_{\mathrm{MLE}} = \frac{1}{nN} \sum_{i=1}^{N} \big\|y_i - \hat{F}(x_i)\big\|^2 \tag{10}$$

Furthermore, the objective derived from Eq. (8) can be simplified under homoscedastic conditions, because $\sigma^2$ then only rescales and shifts the objective:

$$\omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N} \big\|y_i - \hat{F}(x_i)\big\|^2 \tag{11}$$

Hence, the probability density based loss function coincides with the well-known mean squared error (MSE) up to a constant factor.

2.3. Stochastic optimization for updating model parameters

Gradient based optimization is one of the most popular ways to optimize neural networks:

$$\omega_{t+1} \leftarrow \omega_t - \eta\, \nabla \mathcal{L}(\omega_t) \tag{12}$$

where $\eta$ is generally known as the learning rate, which follows the Robbins-Monro conditions. The objective function stated in Eq. (11) indicates that $\mathcal{O}(N)$ operations are required to compute $\mathcal{L}(\cdot)$ and $\nabla\mathcal{L}(\cdot)$ respectively, which may be computationally demanding for a large dataset. Therefore, stochastic gradient descent (SGD) is considered [14, 15, 32], where a random vector $g(x)$ is defined to calculate the gradients [34]. With a restricted magnitude of stochastic gradients $\mathbb{E}\|g(x)\|^2 \le N^2$ and a bounded variance $\mathbb{E}\|g(x) - \nabla\mathcal{L}(x)\|^2 \le \sigma^2$ [34], SGD can efficiently update model parameters by constructing a noisy but unbiased estimate of the gradient:

$$\nabla \mathcal{L}(\cdot) = \mathbb{E}[g(x)] \tag{13}$$

In particular, adaptive moment estimation (ADAM) [35], which computes adaptive learning rates for $\omega$, is adopted, and the corresponding update direction $g_t(x)$ takes the form:

$$g_t(x) = \frac{\hat{M}_t}{\sqrt{\hat{V}_t} + \epsilon} = \frac{M_t / (1 - \beta_1^t)}{\sqrt{V_t / (1 - \beta_2^t)} + \epsilon} \tag{14}$$

where $M_t$ and $V_t$ are estimates of the mean and the (uncentered) variance of the gradients respectively. In ADAM, they are updated as follows [35]:

$$M_t = \beta_1 M_{t-1} + (1 - \beta_1)\, \nabla\mathcal{L}_t(\cdot), \qquad V_t = \beta_2 V_{t-1} + (1 - \beta_2)\, \big(\nabla\mathcal{L}_t(\cdot)\big)^2 \tag{15}$$

It should be noted that the stochastic gradient $\nabla\mathcal{L}_t(\cdot)$ entering Eq. (15) is an unbiased estimate of the exact gradient, as required by Eq. (13), and that its calculation can depend on as little as one data point.

3. Probabilistic modeling: a Bayesian approach

In this section, the aforementioned deterministic DL surrogate is enhanced to account for model uncertainties by integrating Bayesian inference. Overall, Bayesian learning includes three steps: (1) establish prior beliefs about uncertain parameters; (2) compute the posterior distribution via Bayes' rule; and (3) use the predictive distribution to determine a yet unobserved data point.

3.1. Prior representation: Bayesian hierarchical modelling

To begin with, let $U_E$ and $U_A$ represent the epistemic uncertainty and the aleatory uncertainty respectively [8]. Prior information about $U_E$ and $U_A$ is initially encapsulated in the form of a probability distribution function. For the epistemic uncertainty, prior distributions are imposed on the model parameters $\omega$ [16, 17, 18]:

$$\omega \sim p(\omega) \tag{16}$$

Practical applications imply that the prior distribution $p(\omega)$ should not be too restrictive on account of the limited prior information about $\omega$ [20]. For this reason, the hierarchical Bayesian modeling (HBM) method, which introduces a vector of hyperparameters $\phi = [\phi_1, \phi_2, \ldots, \phi_n]$ to the prior distribution, is employed to reduce the undue influence of subjective information on $p(\omega)$ [9]. Consequently, the marginal prior can be obtained by integrating out $\phi$ through the sum rule:

$$p(\omega) = \int p(\omega, \phi)\, d\phi \tag{17}$$

where the joint probability distribution can be further expressed as a product of conditional distributions by applying the product rule:

$$p(\omega, \phi) = p(\omega \mid \phi)\, p(\phi) \tag{18}$$

For probabilistic modeling, the model parameters in each layer of a BDL model are often assumed to follow a factorized multivariate Gaussian distribution with mean $\mu_{\omega_i}$ and precision $\tau_{\omega_i}$:

$$p(\omega) = \prod_{i=1}^{K} p\big(\omega_i \mid \mu_{\omega_i}, \tau_{\omega_i}\big) = \prod_{i=1}^{K} \mathcal{N}\big(\mu_{\omega_i}, \tau_{\omega_i}^{-1} I\big) \tag{19}$$

At the first hierarchy stage, let $\mu_\omega = 0$ and let $\tau_\omega$ be, for instance, a Gamma random variable. Hence, HBM breaks the prior distribution down to:

$$p(\omega, \tau_\omega \mid \alpha, \beta) = p(\omega \mid \tau_\omega)\, p(\tau_\omega \mid \alpha, \beta), \qquad \alpha > 0,\ \beta > 0 \tag{20}$$

with $\alpha$ and $\beta$ denoting the shape parameter and the rate parameter of the Gamma distribution respectively. Using Eq. (17) and Eq. (18), the prior distribution can be reformulated as:

$$p(\omega \mid \alpha, \beta) = \int p(\omega, \tau_\omega \mid \alpha, \beta)\, d\tau_\omega = \mathrm{St}\!\left(0, \frac{\alpha}{\beta}, 2\alpha\right) \tag{21}$$

In Eq. (21), $\mathrm{St}(\cdot)$ denotes the Student's t-distribution with zero location, precision parameter $\alpha/\beta$, and $2\alpha$ degrees of freedom, which is capable of providing heavier tails than the Gaussian distribution.

Remark (1). HBM grants a more impartial prior distribution by allowing the data to speak for itself [9], and it admits a more general modeling framework in which the hierarchical prior reduces to a direct prior when the hyperparameters are modeled by a Dirac delta function (e.g. using $\delta(x - x_0)$ to describe the precision term in Eq. (19)). In addition, HBM offers the flexibility to work with a wide range of probability distributions, and even directly provides an analytical solution for some of the most popular choices such as the Laplace, Gaussian, and Student's t-distributions [20].

On the other hand, homoscedastic noise $\epsilon$ that is independent of the input data, $\mathrm{Var}[\epsilon \mid (x_i, y_i)] = \sigma_\epsilon^2\ \ \forall (x_i, y_i) \in \mathcal{D}$, is added to the output in consideration of the aleatory uncertainty $U_A$, which cannot be explained away by collecting more samples $\{x, y\}$. Besides, the additive noise term $\epsilon$ guarantees a tractable likelihood for the probabilistic model $\hat{\mathcal{M}}(\cdot)$, where $\epsilon$ is most commonly modeled as a Gaussian random variable:

$$p\big(\epsilon \mid \mu_\epsilon, \sigma_\epsilon^2\big) = \mathcal{N}\big(\mu_\epsilon, \sigma_\epsilon^2\big) \tag{22}$$

Letting $\mu_\epsilon = 0$ and $\sigma_\epsilon$ be a constant, the output vector follows:

$$y \sim \mathcal{N}\big(y \mid \mathbb{E}_{p(\omega \mid \mathcal{D})}[\hat{F}(x)],\ \sigma_\epsilon^2 I\big) \tag{23}$$

A graphical model representation of the hierarchical prior stated above, as well as an illustration of the BDL model, is given in Fig. 2.

[Figure 2 (image not reproduced).]

Figure 2: The architecture of Bayesian deep learning with hierarchical prior.

3.2. Posterior approximation: variational inference

After defining a hierarchical prior distribution for the proposed probabilistic model $\hat{\mathcal{M}}(\cdot)$, the next step is to infer the posterior distribution, which reflects the updated parameter information.
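Before turning to the posterior, the scale-mixture identity behind Eq. (21) can be checked numerically. The sketch below is illustrative only: it assumes a single scalar weight, a Gamma prior with shape `alpha` and rate `beta` on the precision, and the precision parameterization of the Student's t-density; the function names and the values `alpha = 3`, `beta = 2` are hypothetical choices, not taken from the paper.

```python
import math
import random


def sample_hierarchical_weight(alpha, beta, rng):
    """Draw tau ~ Gamma(shape=alpha, rate=beta), then w ~ N(0, 1/tau)."""
    # random.gammavariate expects (shape, scale), so the rate is inverted.
    tau = rng.gammavariate(alpha, 1.0 / beta)
    return rng.gauss(0.0, 1.0 / math.sqrt(tau))


def student_t_pdf(w, precision, dof):
    """Density of a zero-location Student's t with the given precision and dof."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                + 0.5 * math.log(precision / (math.pi * dof)))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(precision * w * w / dof))


alpha, beta = 3.0, 2.0  # illustrative hyperparameters
rng = random.Random(0)
samples = [sample_hierarchical_weight(alpha, beta, rng) for _ in range(100_000)]

# Probability of |w| < 1: Monte Carlo estimate from the hierarchy ...
empirical = sum(abs(w) < 1.0 for w in samples) / len(samples)

# ... versus midpoint-rule integration of St(0, alpha / beta, 2 * alpha).
grid = 2000
step = 2.0 / grid
analytic = step * sum(
    student_t_pdf(-1.0 + (i + 0.5) * step, alpha / beta, 2 * alpha)
    for i in range(grid)
)
```

If the two numbers agree up to Monte Carlo error, sampling from the two-level hierarchy is equivalent to sampling directly from the heavier-tailed marginal prior.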
In the Bayesian formalism, the joint posterior distribution $p(\omega, \phi \mid \mathcal{D})$ is calculated by:

$$p(\omega, \phi \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \omega, \phi)\, p(\omega \mid \phi)\, p(\phi)}{p(\mathcal{D})} \tag{24}$$

The marginal posterior distribution $p(\omega \mid \mathcal{D})$ can be further determined by integrating the hyperparameters out of the joint posterior distribution, and the denominator of Eq. (24) is often referred to as the model evidence, which takes the form:

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \omega)\, p(\omega)\, d\omega \tag{25}$$

In most cases, the estimation of Eq. (25) is computationally intractable, as numerical integration requires a considerable number of samples if the dimension of the parameter space $W$ is very high [18]. To overcome this integration problem, variational inference (VI) is adopted so that Bayesian inference can proceed efficiently [26, 28].

Remark (2). Different from the method of maximum a posteriori (MAP), which captures the mode of a posterior distribution [18], the objective of the posterior approximation here is to find a computationally efficient replacement of the true posterior distribution, so that numerical samples are easily accessible in the predictive analysis.

3.2.1. Objective function: evidence lower bound

In VI, a family of proxy distributions parameterized by $\theta$ is posited to approximate the true posterior distribution:

$$p(\omega \mid \mathcal{D}) \approx q_\theta(\omega) \in \mathcal{Q} \tag{26}$$

VI attempts to make $q_\theta(\omega)$ look as close as possible to $p(\omega \mid \mathcal{D})$ by refining $\theta$, and one typical measure of the closeness between two probability distributions is the Kullback-Leibler (KL) divergence [27, 28]. Therefore, VI casts the approximation problem in an optimization form, where the objective function can be expressed as:

$$p(\omega \mid \mathcal{D}) \simeq q_{\theta^\star}(\omega), \qquad \theta^\star = \arg\min_{\theta} \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid \mathcal{D})\big) = \arg\min_{\theta} \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid \mathcal{D})}\, d\omega \tag{27}$$

Instead of minimizing the KL divergence, we can equivalently maximize the evidence lower bound (ELBO) $\mathcal{L}(\omega)$ [28]:

$$\theta^\star = \arg\max_{\theta} \mathcal{L}(\omega) = \arg\max_{\theta} \mathbb{E}_{q_\theta(\omega)}\big[\log p(\omega, \mathcal{D}) - \log q_\theta(\omega)\big] = \arg\max_{\theta} \Big\{ \mathbb{E}_{q_\theta(\omega)}\big[\log p(\mathcal{D} \mid \omega)\big] - \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) \Big\} \tag{28}$$

The first term in Eq. (28), the expected conditional log-likelihood, is usually referred to as the data term [28, 29].
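The relation between Eq. (25), Eq. (27), and Eq. (28) can be made concrete on a toy model where every quantity is available in closed form. The sketch below is not the BDL model: it assumes a scalar parameter with prior $\mathcal{N}(0, 1)$, a likelihood $y_i \sim \mathcal{N}(\omega, \texttt{noise\_var})$, and a Gaussian proxy $\mathcal{N}(m, v)$; all names and numbers are hypothetical. Under these assumptions the ELBO equals the log evidence when the proxy is the exact posterior and is strictly smaller otherwise.

```python
import math


def log_evidence(ys, noise_var):
    """log p(D) for prior w ~ N(0, 1) and likelihood y_i ~ N(w, noise_var)."""
    n = len(ys)
    s1 = sum(ys)
    s2 = sum(y * y for y in ys)
    # Marginally y ~ N(0, noise_var * I + 1 1^T); use its determinant and inverse.
    log_det = (n - 1) * math.log(noise_var) + math.log(noise_var + n)
    quad = (s2 - s1 * s1 / (noise_var + n)) / noise_var
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)


def elbo(ys, noise_var, m, v):
    """ELBO of the proxy q(w) = N(m, v): data term minus KL(q || prior)."""
    data_term = sum(
        -0.5 * math.log(2 * math.pi * noise_var)
        - ((y - m) ** 2 + v) / (2 * noise_var)
        for y in ys
    )
    kl_to_prior = 0.5 * (v + m * m - 1.0 - math.log(v))
    return data_term - kl_to_prior


def exact_posterior(ys, noise_var):
    """Mean and variance of the conjugate Gaussian posterior."""
    v = 1.0 / (1.0 + len(ys) / noise_var)
    return v * sum(ys) / noise_var, v


ys = [1.2, 0.7, 2.1, 1.5]  # illustrative observations
m_star, v_star = exact_posterior(ys, noise_var=0.5)
gap_at_posterior = log_evidence(ys, 0.5) - elbo(ys, 0.5, m_star, v_star)
gap_at_prior = log_evidence(ys, 0.5) - elbo(ys, 0.5, 0.0, 1.0)
```

The two gaps are the KL divergences of Eq. (27) for the respective proxies, which is why maximizing the ELBO and minimizing the KL divergence select the same $\theta$.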
The data term compels the approximate posterior to explain the data $\mathcal{D}$ by maximizing the expected log-likelihood. A mini-batch optimization method is implemented to efficiently provide an unbiased stochastic estimator of the log-likelihood:

$$\log p(\mathcal{D} \mid \omega) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \omega) \approx \frac{N}{M} \sum_{i=1}^{M} \log p(y_i \mid x_i, \omega) \tag{29}$$

where the $M$ samples form a mini-batch drawn from the $N$ training samples. Noticeably, besides accelerating the computation, mini-batch optimization updates the model more frequently, which allows for more robust convergence and hence increases the chance of escaping local minima [34, 35]. Meanwhile, the mean field variational inference (MFVI) method [26, 28, 29] is adopted to control the computational complexity of the second term in Eq. (28):

$$q_\theta(\omega) = \prod_{i=1}^{K} q_{\theta_i}(\omega_i) \tag{30}$$

The variational distribution is represented by a layer-wise factorized distribution where each factor is determined by its own variational parameter:

$$q_{\theta_i}(\omega_i) = \frac{\exp\!\Big(\int \log p(\mathcal{D}, \omega) \prod_{j \neq i} q_{\theta_j}(\omega_j)\, d\omega_j\Big)}{\int \exp\!\Big(\int \log p(\mathcal{D}, \omega) \prod_{j \neq i} q_{\theta_j}(\omega_j)\, d\omega_j\Big)\, d\omega_i} \tag{31}$$

Substituting Eq. (29) and Eq. (30) back into Eq. (28), and using the layer-wise factorization of the prior in Eq. (19), the ELBO objective can be rewritten as:

$$\mathcal{L}(\omega, \theta) = \frac{N}{M} \sum_{i=1}^{M} \int \log p(y_i \mid x_i, \omega) \prod_{j=1}^{K} q_{\theta_j}(\omega_j)\, d\omega \;-\; \sum_{j=1}^{K} \int q_{\theta_j}(\omega_j) \log \frac{q_{\theta_j}(\omega_j)}{p(\omega_j)}\, d\omega_j \tag{32}$$

where the iteration of the variational distribution for the model parameters terminates when the convergence criterion is satisfied.

3.2.2. Gradients computation: stochastic gradient variational Bayes

Among the many techniques developed for solving optimization problems, gradient-based optimization tackles the ELBO maximization problem stated in Eq. (28) reliably and efficiently [31]. For the sake of brevity, let:

$$A(\omega, \theta) = \log p(\omega, \mathcal{D}) - \log q_\theta(\omega) \tag{33}$$

Using the log-derivative trick [28], the objective function $\mathcal{L}(\omega, \theta)$ can be differentiated with respect to the variational parameters $\theta$:

$$\nabla_\theta \mathcal{L}(\omega, \theta) = \nabla_\theta \int q_\theta(\omega)\, A(\omega, \theta)\, d\omega = \int \left[ q_\theta(\omega)\, \frac{\partial \log q_\theta(\omega)}{\partial \theta}\, A(\omega, \theta) + q_\theta(\omega)\, \frac{\partial A(\omega, \theta)}{\partial \theta} \right] d\omega \tag{34}$$

To estimate this integral quickly, we can write Eq. (34) in its expectation form and use the Monte Carlo method to compute the stochastic gradients:

$$\nabla_\theta \mathcal{L}(\omega, \theta) = \mathbb{E}_{q_\theta(\omega)}\!\left[ \frac{\partial \log q_\theta(\omega)}{\partial \theta}\, A(\omega, \theta) + \frac{\partial A(\omega, \theta)}{\partial \theta} \right] \tag{35}$$

However, it is observed that the crude MC estimator of $\nabla_\theta \mathcal{L}(\omega, \theta)$ usually exhibits large variance [30, 31]. For this reason, the stochastic gradient variational Bayes (SGVB) method is adopted to reduce the variance of the estimates [30]. Simply put, SGVB introduces an auxiliary variable $\epsilon$ to the proxy distribution:

$$q_\theta(\omega) = \int q_\theta(\omega, \epsilon)\, d\epsilon = \int q_\theta(\omega \mid \epsilon)\, p(\epsilon)\, d\epsilon \tag{36}$$

where the conditional probability density function $q_\theta(\omega \mid \epsilon)$ is formally defined as a Dirac delta function:

$$q_\theta(\omega \mid \epsilon) = \delta\big(\omega - g(\theta, \epsilon)\big) \tag{37}$$

and $g(\theta, \epsilon)$ is a differentiable transformation function that connects $\omega$ and $\epsilon$:

$$\omega = g(\theta, \epsilon) \tag{38}$$

For instance, a simple choice for $p(\epsilon)$ is the isotropic Gaussian distribution, $\epsilon \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$, and the reparameterization can be achieved through $\omega = \mu + \sigma \epsilon$, where $\mu$ and $\sigma$ are the variational mean and standard deviation. Therefore, substituting Eq. (36) and Eq. (37) into Eq. (35), the pathwise estimator can be expressed as [30]:

$$\nabla_\theta \mathcal{L}(\omega, \theta) = \mathbb{E}_{p(\epsilon)}\!\left[ \frac{\partial \log p(\epsilon)}{\partial \theta}\, A\big(g(\theta, \epsilon), \theta\big) + \frac{\partial A\big(g(\theta, \epsilon), \theta\big)}{\partial \theta} \right] \tag{39}$$

The first term vanishes because $p(\epsilon)$ does not depend on $\theta$. Combining Eq. (33) and Eq. (39) and applying the chain rule, the final Monte Carlo estimator of the gradients can be written as:

$$\nabla_\theta \mathcal{L}(\omega, \theta) = \mathbb{E}_{p(\epsilon)}\!\left[ \frac{\partial}{\partial \omega}\big[\log p(\omega, \mathcal{D}) - \log q_\theta(\omega)\big]\, \frac{\partial g(\theta, \epsilon)}{\partial \theta} \right] \tag{40}$$

With this reparameterized estimator, the variance of the stochastic gradients can be reduced by orders of magnitude [30], and the VI-based optimization problem can be solved efficiently by the stochastic gradient descent algorithm described in Section 2.3.

3.3. Predictive evaluation: Monte Carlo sampling

The last, and most important, step of Bayesian computation concerns making predictions for new data samples $(x^\ast, y^\ast)$, where the predictive distribution can be expressed as:

$$p(y^\ast \mid x^\ast, \mathcal{D}) = \int p(y^\ast \mid x^\ast, \omega)\, p(\omega \mid \mathcal{D})\, d\omega \tag{41}$$

The optimized proxy posterior $q_\theta(\omega)$, which is obtained by solving the ELBO optimization problem, takes the place of the true posterior distribution $p(\omega \mid \mathcal{D})$:

$$p(y^\ast \mid x^\ast, \mathcal{D}) \simeq \int p(y^\ast \mid x^\ast, \omega)\, q_\theta(\omega)\, d\omega \tag{42}$$

In the same vein, the predictive integral is evaluated numerically by drawing random samples from the proxy distribution. An unbiased estimator is given by:

$$p(y^\ast \mid x^\ast, \mathcal{D}) \approx \frac{1}{k} \sum_{i=1}^{k} p(y^\ast \mid x^\ast, \omega^i), \qquad \omega^i \sim q_\theta(\omega) \tag{43}$$

For the purpose of uncertainty representation, it is of great importance to compute the statistical moments of $y^\ast$, such as the mean:

$$\hat{y}_{\mathrm{mean}} = \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^\ast \mid \omega^i) \tag{44}$$

and the variance:

$$\hat{y}_{\mathrm{var}} = \sigma_\epsilon^2 I + \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^\ast \mid \omega^i)^{T} \hat{F}(x^\ast \mid \omega^i) - \left( \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^\ast \mid \omega^i) \right)^{\!T} \left( \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^\ast \mid \omega^i) \right) \tag{45}$$

These two moments are also essential elements for constructing the acquisition function that balances exploration and exploitation in the context of Bayesian optimization [10, 13].

4. Numerical examples and results

4.1. Example 1: nonlinear regression

The first example considers a nonlinear function that is commonly used as a test problem to assess the accuracy of a regression model [13, 14, 36]. Mathematically, it is written as:

$$y = x \sin(x) \tag{46}$$

To identify common features and differences between the proposed model and current approaches, the regression problem is solved numerically using four different surrogate modeling methods: polynomial regression (PR), support vector machine (SVM), neural networks (NN), and Bayesian deep learning (BDL). First, a polynomial $f(x) = a_0 + a_1 x + \cdots + a_n x^n$ of degree $n = 11$ is defined to fit the symmetric function [32]. The method of least squares is applied to find the best linear unbiased estimator (BLUE) of the regression coefficient vector by minimizing the sum of squared errors. Secondly, an SVM regression model with a Gaussian kernel function $G(x_i, x_j) = \exp\big(-\gamma \|x_i - x_j\|^2\big)$ is implemented to build a mapping between $x$ and $y$ [9, 10]. The default value of the kernel coefficient $\gamma$ is 1, and the sequential minimal optimization (SMO) algorithm is utilized to update the coefficients, where a Karush-Kuhn-Tucker (KKT) violation tolerance of 0.0001 is specified as the convergence criterion [37]. Thirdly, a feedforward neural network with one hidden layer of 20 neurons is built to learn the nonlinear transformation [14, 32]. The hyperbolic tangent function is adopted as the activation function for the hidden layer since its derivatives are steeper than those of the sigmoid function. For the output layer, a linear function that outputs the weighted sum of the hidden neurons is used. Stochastic gradient descent is performed for parameter optimization [34], where the learning rate is fixed at a constant $\eta = 0.0001$ and the default number of epochs is 100000. Lastly, a Bayesian surrogate that has the same network configuration is examined. To account for the model uncertainty, a normal prior $\mathcal{N}(0, 0.1)$ is directly imposed on the model parameters.

[Figure 3 (image not reproduced). Panels: (a.1) clean data; (b.1) noisy data: σ = 0.3; (c.1) noisy data: σ = 0.7; (a.2) clean data; (b.2) noisy data: σ = 0.3; (c.2) noisy data: σ = 0.7.]

Figure 3: Comparisons of regression results using various surrogate models. The training dataset is contaminated by a Gaussian noise with different standard deviations.

To train these models, we use a pseudorandom number generator to simulate a training dataset consisting of 30 samples that are uniformly distributed in the interval $(-10, 10)$. Zero-mean Gaussian noise with standard deviation $\sigma$ is added to each sample to make the problem more realistic [8].

Fig. 3 visualizes the regression models fitted by the different approaches. It should be noted that the mean value of the predictive distribution is selected as the model estimate in the case of BDL. BDL improves the generalization performance and mitigates the overfitting issue that is encountered in NN modeling. Meanwhile, BDL is capable of characterizing the model uncertainty associated with the prediction, in addition to achieving a regression result that is as accurate as those of the other methods. Table 1 and Table 2 summarize the coefficient of determination ($R^2$) and the root mean squared error (RMSE) for the different surrogates.
According to the results, BDL is more resistant to noisy data, since increasing the random noise level degrades the effectiveness and quality of the other three surrogates more noticeably.

Method   clean data   σ = 0.1   σ = 0.3   σ = 0.5   σ = 0.7   σ = 0.9
PR       0.9950       0.9937    0.9884    0.9807    0.9680    0.9533
SVM      0.9783       0.9784    0.9758    0.9604    0.9486    0.9388
NN       0.9890       0.9854    0.9828    0.9770    0.9526    0.9497
BDL      0.9883       0.9893    0.9928    0.9757    0.9672    0.9516

Table 1: Comparison of the coefficient of determination ($R^2$) of the different surrogate models where the training dataset is contaminated by different noise levels.

Method   clean data   σ = 0.1   σ = 0.3   σ = 0.5   σ = 0.7   σ = 0.9
PR       0.2483       0.2629    0.3917    0.4863    0.6227    0.7643
SVM      0.5397       0.5424    0.5642    0.6997    0.7934    0.9198
NN       0.3817       0.4068    0.4783    0.5276    0.7350    0.8077
BDL      0.3270       0.3098    0.2964    0.5425    0.6091    0.7933

Table 2: Comparison of the root mean squared error (RMSE) of the different surrogate models where the training dataset is contaminated by different noise levels.

4.2. Example 2: binary classification

To evaluate the classification performance of our proposed surrogate model, the second example applies the BDL to a synthetic dataset that holds a two-dimensional swirl pattern. As shown in Fig. 4, the synthetic dataset exhibits two intuitively separable manifolds, where each manifold resembles a crescent moon [9].

[Figure 4 (image not reproduced). Panels: (a) Two moons manifold; (b) Training dataset contaminated by ε ∼ N(0, 0.1); (c) Training dataset contaminated by ε ∼ N(0, 0.3).]

Figure 4: Classification problem: a highly nonlinear dataset.

A BDL surrogate arranged in a 2-5-5-2 form is developed as the neural network classifier. Specifically, two hidden layers are configured with the hyperbolic tangent activation function, and the softmax function $\mathrm{softmax}(x_i) = e^{x_i} / \sum_{j=1}^{J} e^{x_j}$ is implemented to represent the categorical distribution of the outputs by computing a probability row vector whose entries sum to 1 [14, 15].
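The softmax output layer described above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; subtracting the maximum logit before exponentiating is a standard numerical-stability choice added here and leaves the result unchanged.

```python
import math


def softmax(logits):
    """Map a row of real-valued logits to probabilities that sum to 1."""
    shift = max(logits)  # guards against overflow in exp for large logits
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two output units, one per class of the two-moons problem (logits are made up).
class_probabilities = softmax([1.3, -0.4])
```

The predicted class is the index of the largest probability, and the probability itself indicates how confident the classifier is for that sample.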
To understand the effects of different priors on the classification performance, we have considered three direct priors: Laplace $\mathcal{L}(0, 1)$, Gaussian $\mathcal{N}(0, 1)$, and Cauchy $\mathcal{C}(1, 1)$. We further construct three hyper priors by fixing the location parameter of the aforementioned probability distributions and treating their scale parameter as a random variable, which is described using an Inverse-Gamma distribution $\mathcal{IG}(1, 1)$. The basic probability density functions are given as:

$$\mathcal{L}(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$

$$\mathcal{C}(x \mid x_0, \gamma) = \frac{1}{\pi \gamma \left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]} \qquad (47)$$

$$\mathcal{IG}(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha - 1} \exp\left(-\frac{\beta}{x}\right)$$

Additionally, we conduct two trials to study the effects of noise on our neural network classifier. In the first trial, a BDL model is developed using 900 samples, where each sample is contaminated by Gaussian noise generated from $\epsilon \sim \mathcal{N}(0, 0.1)$. In the second trial, 1200 samples are utilized to build $\mathcal{M}(\cdot)$ as the external noise is amplified to $\epsilon \sim \mathcal{N}(0, 0.3)$. Following the 70/30 rule [14, 15, 32], the whole dataset $\mathcal{D}$ is divided into the training set $\mathcal{D}_t$ and the validation set $\mathcal{D}_v$. A first-order gradient-based optimization method, ADAM [35], is adopted to update model parameters, with a learning rate of 0.001 and exponential decay rates for the first and second moment estimates, $\beta_1$ and $\beta_2$, of 0.9 and 0.999, respectively. It should be noted that the reparametrization trick mentioned in Section 3.2 is applied automatically when taking the derivatives of the objective function with respect to the variational parameters [30]. Here, the proxy distribution takes a Gaussian form, so the variational posterior distribution is parameterized by two parameters, the mean and the standard deviation. To accelerate the training process, mini-batch optimization is used [14, 28] with a batch size of 30. Training stops after 50000 epochs. Fig. 5 provides a graphic illustration of the classification results.
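To make the difference between a direct prior and a hyper prior concrete, the sketch below draws weights from $\mathcal{N}(0, 1)$ and from a Gaussian whose scale is itself drawn from $\mathcal{IG}(1, 1)$. It is an illustration only: the function names are hypothetical, and treating the Inverse-Gamma variable as the variance of the Gaussian is an assumption, since the text does not state whether the second argument of $\mathcal{N}$ is a variance or a standard deviation.

```python
import numpy as np

def sample_direct_prior(n, scale=1.0, seed=0):
    """Draw n weights from the direct prior N(0, scale), scale used as std."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=scale, size=n)

def sample_hierarchical_prior(n, alpha=1.0, beta=1.0, seed=0):
    """Draw n weights from N(0, s) with s ~ IG(alpha, beta).

    Assumption: s is interpreted as the variance of the Gaussian.
    """
    rng = np.random.default_rng(seed)
    # If g ~ Gamma(shape=alpha, rate=beta), then 1/g ~ Inverse-Gamma(alpha, beta).
    g = rng.gamma(shape=alpha, scale=1.0 / beta, size=n)
    variance = 1.0 / g
    weights = rng.normal(loc=0.0, scale=np.sqrt(variance))
    return weights, variance

w_direct = sample_direct_prior(100_000)
w_hier, var_hier = sample_hierarchical_prior(100_000)

# Compare how far the tails of the two priors reach.
tail_direct = np.quantile(np.abs(w_direct), 0.999)
tail_hier = np.quantile(np.abs(w_hier), 0.999)
print(tail_direct, tail_hier)
```

Under this variance interpretation the marginal prior on each weight is heavy-tailed, so large weights are penalized less severely than under the fixed-scale Gaussian.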
According to these results, the BDL model becomes less confident about its predictions when validation samples lie closer to the true separation trajectory. This is because even small noise can severely distort the original manifold [6]. However, the proposed hyper priors are able to provide better predictions, especially in the second trial where the additive noise is stronger. This is credited to the nature of Bayesian hierarchical modeling, which relaxes the prior constraints by encoding prior belief using a series of hyperparameter values instead of fixed constants [9]. Lastly, Fig. 6 shows the variational posterior distributions of the weights and biases in the first hidden layer. Among the priors considered, a zero-centered Laplace prior is equivalent to L1 regularization and a zero-centered Gaussian prior is equivalent to L2 regularization [20, 25]. In Fig. 6, the posterior under $\mathcal{L}(0, 1)$ is approximately sparse, whereas the posterior $p(\omega)$ under $\mathcal{N}(0, 1)$ is not concentrated around zero, which aligns with the properties of L1 and L2 regularization, respectively [32].
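The correspondence between these priors and the two regularizers follows from maximum a posteriori (MAP) estimation. The short derivation below is a standard argument, not taken from the paper; it assumes independent, identically distributed priors on the weights $\omega_k$, and the scale symbols $b$ and $\sigma$ are illustrative notation.

```latex
\hat{\omega}_{\mathrm{MAP}}
  = \arg\max_{\omega} \big[ \log p(\mathcal{D} \mid \omega) + \log p(\omega) \big]
  = \arg\min_{\omega} \big[ -\log p(\mathcal{D} \mid \omega) - \log p(\omega) \big]

% Zero-centered Laplace prior: p(\omega_k) = \frac{1}{2b} \exp(-|\omega_k| / b)
-\log p(\omega) = \frac{1}{b} \sum_k |\omega_k| + \mathrm{const}
                = \frac{1}{b} \lVert \omega \rVert_1 + \mathrm{const}

% Zero-centered Gaussian prior: p(\omega_k) = \mathcal{N}(\omega_k \mid 0, \sigma^2)
-\log p(\omega) = \frac{1}{2\sigma^2} \sum_k \omega_k^2 + \mathrm{const}
                = \frac{1}{2\sigma^2} \lVert \omega \rVert_2^2 + \mathrm{const}
```

The Laplace prior therefore adds an L1 penalty with weight $1/b$ to the negative log-likelihood, and the Gaussian prior adds an L2 penalty with weight $1/(2\sigma^2)$. The equivalence holds for the MAP point estimate; the full variational posterior carries additional information about spread.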
Figure 5: Classification results: the predicted results are represented by the mean predictive probability of the lower crescent on the input domain $(-3, 3) \times (-3, 3)$, and the model uncertainty is quantified in terms of the variance associated with each prediction using Eq. (45). Columns: Case A, $\epsilon_i \sim \mathcal{N}(0, 0.1)$, and Case B, $\epsilon_i \sim \mathcal{N}(0, 0.3)$, each showing posterior probability predictions and model uncertainties. Panels: (a.1) Laplacian prior; (b.1) Laplacian prior and Inverse-gamma hyperprior; (a.2) Gaussian prior; (b.2) Gaussian prior and Inverse-gamma hyperprior; (a.3) Cauchy prior; (b.3) Cauchy prior and Inverse-gamma hyperprior.

Figure 6: Comparison of optimized variational posterior distributions of model parameters using different regularization techniques. $\omega_1$ and $b_1$ denote the weight and bias tensors associated with the first layer, respectively. Panels: (a.1) Laplace prior, posterior distribution over $\omega_1$; (b.1) Gaussian prior, posterior distribution over $\omega_1$; (a.2) Laplace prior, posterior distribution over $b_1$; (b.2) Gaussian prior, posterior distribution over $b_1$.

4.3. Example 3: structural analysis of a geometrically nonlinear membrane

This example addresses the computational cost of using a finite element (FE) model in structural analysis with uncertain inputs [38]. The target structure is a geometrically nonlinear membrane that is clamped at four edges [39, 40]. Fig. 7 (a.1) gives a sketch of the domain $\Omega = [0, l] \times [0, b] \subset \mathbb{R}^2$, where uniformly distributed pressure loads are applied on the upper surface. We are interested in using the BDL-based surrogate $\mathcal{M}(\cdot)$ to approximate the nonlinear mechanism between the uncertain structural parameters $x$ and the random responses $y$.

4.3.1. Membrane example: nonlinear analysis and surrogate modeling

Uncertainty analysis.
Vector $x$ covers geometric uncertainties, where $x_1 = l$, $x_2 = b$, and $x_3 = t$ are the length, breadth, and thickness of the target membrane, as well as material uncertainties, with $x_4 = E$ and $x_5 = \nu$ denoting the elastic modulus and Poisson's ratio, respectively. Table 3 summarizes the statistical properties of $x$. The quantities of interest $y$ are the z-direction displacements $w$ at locations $p_1(1, 0.5)$, $p_2(0.6, 0.2)$, and $p_3(1.6, 0.8)$. The load-displacement relationship is no longer a deterministic curve due to the input randomness (see Fig. 7 (a.2)).

Figure 7: Membrane example: problem statement and optimization results. Panels: (a.1) geometric illustration of the domain $\Omega$ with corners $(0, 0)$, $(l, 0)$, $(l, b)$, $(0, b)$ and axes $x\,(u)$, $y\,(v)$; (a.2) nonlinear behavior illustration; (b.1) 64 training samples; (b.2) 128 training samples; (b.3) 256 training samples; (b.4) 512 training samples; (c.1) evolution of direct prior; (c.2) evolution of hyper prior.

Nonlinear FE model. Because of the geometric nonlinearity, the in-plane strain is partitioned into two parts, $\epsilon = \epsilon^{l} + \epsilon^{non}$, where $\epsilon^{l}$ describes the linear strain and $\epsilon^{non}$ represents the nonlinear strain term:

$$
\epsilon^{non} =
\begin{bmatrix} \epsilon_x^{non} \\ \epsilon_y^{non} \\ \gamma_{xy}^{non} \end{bmatrix}
=
\begin{bmatrix}
\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial w}{\partial x}\right)^2 \\
\left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 + \left(\frac{\partial w}{\partial y}\right)^2 \\
2\frac{\partial u}{\partial x}\frac{\partial u}{\partial y} + 2\frac{\partial v}{\partial x}\frac{\partial v}{\partial y} + 2\frac{\partial w}{\partial x}\frac{\partial w}{\partial y}
\end{bmatrix}
\qquad (48)
$$

The solution $d = [u, v, w]$ of these nonlinear equilibrium equations is obtained by the Newton-Raphson (NR) method [39]. The iterative process terminates when the unbalanced force residual is smaller than the tolerance of 0.0001 or when the NR algorithm reaches the default maximum of 100 iterations. The force and displacement vectors are initialized to zero, and the load is applied in increments of $p/n$, where $p = 100$ and $n = 400$. It should be noted that the thickness of the membrane is small relative to the other two dimensions. Therefore, the FE model $\mathcal{M}(\cdot)$ can be built from 200 (20 along the x axis by 10 along the y axis) four-node quadrilateral (Q4) elements, and large deformation theory is adopted [39].

Table 3: Statistics of the uncertain input parameters for the flat membrane.

Basic variable  First parameter          Second parameter        Distribution type
l               μ = 2                    σ = 0.05                Normal
b               μ = 1                    σ = 0.05                Normal
t               min = 0.001              max = 0.002             Uniform
E               μ = 210                  σ = 10                  Normal
ν               log(0.3) (mean of log)   0.01 (std of log)       Lognormal

Surrogate model. We use the proposed BDL approach to provide an $\mathbb{R}^5 \to \mathbb{R}^3$ transformation. The network architecture has three hidden layers (30-15-10) besides the input and output layers [33], and probability distributions that account for model uncertainties are specified layer by layer. To investigate the efficacy of different priors, a direct zero-mean Gaussian $\mathcal{N}(0, 1)$ and a hierarchical prior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ have been tested. Furthermore, training datasets of size 64, 128, 256, and 512 have been considered to identify the influence of the amount of data on the accuracy of model predictions. For the posterior approximation, ADAM is adopted [35], with an initial learning rate of 0.005.
Notably, the learning rate decays every 100 epochs by a constant factor of 0.75, and the number of epochs is 1000. The batch size for the subsampling procedure of all trials is set to 16. In the variational inference stage, 200 numerical samples are employed to estimate the lower bound.

4.3.2. Results

In Fig. 7, the optimization results imply that the predictive distribution computed by Eq. (43) shrinks rapidly as the number of training samples increases. 128 samples already give a narrow-band distribution of the predicted displacement, indicating that the trained BDL model becomes sufficiently reliable as most model uncertainties have been explained away by data. Moreover, we compared the evolution of the predictive distribution $p(w)$ under different priors. In both trials, 128 training samples are used, and the same validation sample is randomly chosen, where $x = [2.0117, 1.0157, 0.0019, 213.1180, 0.3018]$ and $y = [2.3428, 1.1501, 1.0536]$. It is found that the predictive distribution under the hyper prior takes more epochs to shrink (see Fig. 7 (c.1) and (c.2)). An intuitive explanation is that $\mathcal{N}(0, \mathcal{IG}(1, 1))$ has a larger initial parameter space than $\mathcal{N}(0, 1)$. Despite the difference, both BDL surrogates provide a reliable input-output mapping, and the coefficient of determination is summarized in Table 4. To check the generalizability of the proposed surrogates, $\mathcal{M}(\cdot)$ is further applied to uncertainty analysis. First, 1000 samples were used to train the network. Next, another $1 \times 10^{[\ldots]}$ samples are used to develop the distributions of displacements at $p_1$, $p_2$, and $p_3$. Fig. 8 presents the UQ results, where the BDL-based surrogates accurately propagate uncertainties to the response distributions.

Table 4: Comparison of the coefficient of determination for training sets $\{x_i, y_i\}_{i=1}^{N}$ of different sizes $N$.

Prior type    N = 64   N = 128  N = 256  N = 512
Direct prior  0.9720   0.9982   0.9990   0.9993
Hyper prior   0.9935   0.9989   0.9983   0.9994
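The learning-rate schedule described for the membrane surrogate (initial rate 0.005, multiplied by 0.75 every 100 epochs, 1000 epochs in total) can be sketched as below. The function name is hypothetical, and applying the first decay exactly at epoch 100 (counting from zero) is an assumption; the text does not say whether the decay happens at the start or the end of each 100-epoch block.

```python
def step_decay(epoch, initial_lr=0.005, factor=0.75, step=100):
    """Learning rate at a zero-based epoch index under a step-decay schedule."""
    # Integer division counts how many full 100-epoch blocks have elapsed.
    return initial_lr * factor ** (epoch // step)

# Learning rate for each of the 1000 training epochs.
schedule = [step_decay(e) for e in range(1000)]
print(schedule[0], schedule[100], schedule[999])
```

With these settings the rate is reduced nine times over the run, so the final epochs use a step size more than an order of magnitude smaller than the initial one.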
Figure 8: Distribution estimates for $w(p_1)$, $w(p_2)$, and $w(p_3)$. The dashed black line is the ground truth, which is computed via the high-fidelity FE model using $5 \times 10^{[\ldots]}$ samples. The colored lines denote the surrogate predictions, which are kernel smoothing function estimates using the predictive mean. Panels: (a) distributions of $w(p_1)$; (b) distributions of $w(p_2)$; (c) distributions of $w(p_3)$.

4.4. Example 4: prediction of wind pressure

Obtaining detailed data on wind-induced pressure coefficients on building surfaces is of great practical importance in the design of high-rise buildings. However, wind tunnel test results are limited and may be contaminated by different sources. In this example, the proposed BDL model is applied to predict the mean and root-mean-square (RMS) pressure coefficients using limited experimental data.

4.4.1. Wind pressure database and predictive model

Wind tunnel data. The aerodynamic database considered for this example was developed by Tokyo Polytechnic University (TPU) [41]. For the wind tunnel experiment, a 1:400 scale rigid model was built to represent the target tall building of dimensions 200 m × 40 m × 40 m, and a power law exponent of 1/4 was used to describe the mean wind speed profile. A total of 500 pressure taps were used to collect data at a sampling frequency of 1000 Hz for a sample period of 32.768 s. The hourly average wind speed was 11.1438 m/s and the wind attack angle was 0°, indicating that the wind direction is perpendicular to the front face of the experimental model. The wind pressure of interest is characterized by a dimensionless number known as the pressure coefficient, $C_p(x) = \frac{p_x - p_\infty}{\rho U^2 / 2}$, where $p_\infty$ is the static pressure at freestream, $\rho$ is the air density, and $U$ is the mean wind speed at the reference height.
The predictive quantities are:

$$
C_p^{mean}(x) = E[C_p(x)], \qquad C_p^{rms}(x) = \left| C_p(x) - C_p^{mean}(x) \right| \qquad (49)
$$

To demonstrate the efficacy of the BDL surrogate in dealing with small datasets, biharmonic spline interpolation is performed based on the measured pressure data, $N_{old} = 500$. Specifically, there are 250 width interpolation points and 750 height interpolation points on each building face. As a result, the surface pressure fields are described by a total of $N_{new} = 250 \times 750 \times 4 = 750000$ synchronous pressure points. However, we use at most 1% of the total data to train the Bayesian model.

Prediction model. The Cartesian coordinates $x \in \mathbb{R}^2$ are selected as the input variables of the BDL model, and the output is a scalar, either $y = C_p^{mean}(x)$ or $y = C_p^{rms}(x)$. After an extensive search over hyperparameters and network architectures, it was found that BDL with network configurations of 2-15-10-1 and 2-30-15-1 provides superior performance in predicting $C_p^{mean}$ and $C_p^{rms}$, respectively. The hyperbolic tangent function $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ is adopted as the activation function since it produces steep derivatives, and the ADAM optimizer [35] is implemented with a learning rate initialized to 0.03, which follows a step decay to keep the parameter updates from becoming chaotic. An annealing strategy is adopted to improve the stochastic gradients, with the anneal rate fixed at 0.75. 1000 samples are used to obtain a reasonable estimate of the test log likelihood during the training phase. The number of epochs is set to 1000 and the testing frequency is 10. To validate the effectiveness of the HBM-based prior, a direct Gaussian prior $\mathcal{N}(0, 1)$ and a hyper prior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ are examined.

4.4.2. Results

Fig. 9 summarizes the $C_p^{mean}$ predictions and Fig. 10 provides the predictive results of $C_p^{rms}$.
The ground truth is numerically obtained via interpolation and extrapolation of the wind tunnel data; the predictions are defined as the mean value of the predictive distribution; the relative error measures the difference between the prediction and the ground truth; and model uncertainties are evaluated using the variance of the predictive distribution. The results reveal that the hyperprior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ not only allows $\mathcal{M}(\cdot)$ to be better fitted but also makes $\mathcal{M}(\cdot)$ less sensitive to the complex nature of the noise embedded in the experimental data. It can be seen from the limits of the colorbar that the magnitude of relative errors is reduced for the hyperprior trials. From a percent error perspective, most predictions lie within the 97% confidence interval and the maximum errors are all less than 10%; the largest errors are mainly gathered around the boundary, owing to the extrapolation algorithm performed at the data preprocessing stage. In Fig. 9, the uncertainty results confirm the propagation of uncertainty resulting from the extrapolation process and successfully identify the areas that are prone to it. In general, the hyperprior appears to improve the model performance more when predicting $C_p^{rms}$ than $C_p^{mean}$, which is reasonable since the second sample moment magnifies the differences between predicted and observed values, so that more degrees of freedom are needed to explain the variation. Table 5 summarizes the model performance using RMSE. To illustrate the robustness of the variational inference method in approximating intractable posterior distributions, Fig. 11 shows the optimization process of computing the variational posterior. It can be observed that VI is capable of capturing the main characteristics within the first 30 epochs, and that it is applicable to various situations where the true posterior distribution takes different forms.
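The way predictions and uncertainties are read off the predictive distribution (mean as the prediction, variance as the model uncertainty) can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: the tiny 2-3-1 tanh network, the flat parameter layout, and the placeholder posterior means and standard deviations, which are random numbers rather than trained values. It is not the paper's implementation.

```python
import numpy as np

def forward(x, w):
    """Hypothetical 2-3-1 tanh network; w is a flat vector of 13 parameters."""
    W1 = w[0:6].reshape(2, 3)   # input-to-hidden weights
    b1 = w[6:9]                 # hidden biases
    W2 = w[9:12].reshape(3, 1)  # hidden-to-output weights
    b2 = w[12]                  # output bias
    return (np.tanh(x @ W1 + b1) @ W2 + b2).ravel()

def mc_predict(x, post_mean, post_std, n_samples=200, seed=0):
    """Monte Carlo predictive mean and variance under a Gaussian posterior."""
    rng = np.random.default_rng(seed)
    draws = np.empty((n_samples, x.shape[0]))
    for s in range(n_samples):
        # Sample one set of weights from the mean-field Gaussian posterior.
        w = post_mean + post_std * rng.standard_normal(post_mean.shape)
        draws[s] = forward(x, w)
    return draws.mean(axis=0), draws.var(axis=0)

rng = np.random.default_rng(1)
post_mean = rng.normal(size=13)    # placeholder values, not a trained posterior
post_std = np.full(13, 0.1)        # placeholder posterior standard deviations
x_test = rng.uniform(-1.0, 1.0, size=(5, 2))

pred_mean, pred_var = mc_predict(x_test, post_mean, post_std)
print(pred_mean, pred_var)
```

The variance computed this way reflects only the spread induced by the weight posterior; it shrinks to zero when the posterior standard deviations do, which mirrors the narrowing predictive distributions reported as more data become available.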
Figure 9: Prediction results of $C_p^{mean}$ using different priors. The training dataset has 700 samples. Panels, for each of the windward, leftside, leeward, and rightside scenarios: (A) direct prior results, with (a.1) predictions, (a.2) errors, (a.3) uncertainties; (B) hyperprior results, with (b.1) predictions, (b.2) errors, (b.3) uncertainties; (c) ground truth.

5. Concluding Remarks

This paper presents a probabilistic modeling approach for learning hidden relationships from limited and noisy data using Bayesian deep learning (BDL) with a hierarchical prior. The proposed surrogate rigorously accounts for the model uncertainties by imposing prior distributions on model parameters. Meanwhile, it effectively propagates the preassigned prior belief to the predicted quantities.
Figure 10: Prediction results of $C_p^{rms}$ using different priors. The training dataset has 700 samples. Panels, for each of the windward, leftside, leeward, and rightside scenarios: (A) direct prior results, with (a.1) predictions, (a.2) errors, (a.3) uncertainties; (B) hyperprior results, with (b.1) predictions, (b.2) errors, (b.3) uncertainties; (c) ground truth.

In summary, the following conclusions are drawn:

(1) Bayesian inference has been successfully integrated into the current deterministic deep learning framework. Consequently, the proposed model is able to analyze uncertainties associated with model predictions and help stakeholders make more informed decisions by providing a confidence level for the predictive estimation.

(2) The hypothesis of using hierarchical Bayesian modeling to describe prior distributions of model parameters is tested. In both classification and regression problems, superior performance can be achieved using hyper priors, especially when the training data are seriously contaminated.
Moreover, a probabilistic surrogate with a hyper prior tends to have an improved learning ability from a small dataset.

(3) Intractable posterior distributions that arise from the multidimensional integration step of Bayesian analysis have been addressed by the state-of-the-art variational inference method. Compared to some advanced sampling-based methods, the variational inference method offers higher scalability by tackling the model learning problem in an objective-equivalent-transformed and gradients-effective-computed optimization form.

Table 5: Comparison of the root mean square error. Columns give the size $N$ of the training dataset $\{x_i, y_i\}_{i=1}^{N}$.

Scenario       N=100   N=200   N=300   N=400   N=500   N=600
Windward †     0.077   0.030   0.027   0.022   0.019   0.017
Windward ‡     0.051   0.021   0.021   0.020   0.018   0.015
Leftside †     0.039   0.030   0.027   0.019   0.013   0.012
Leftside ‡     0.037   0.030   0.024   0.016   0.012   0.012
Leeward †      0.014   0.013   0.011   0.011   0.010   0.010
Leeward ‡      0.014   0.011   0.011   0.010   0.010   0.009
Rightside †    0.037   0.025   0.025   0.021   0.019   0.015
Rightside ‡    0.033   0.025   0.023   0.020   0.014   0.011

† denotes the direct prior $\mathcal{N}(0, 1)$, and ‡ denotes the hyper prior $\mathcal{N}(0, \mathcal{IG}(1, 1))$.

(4) The examples provided have demonstrated the applicability of the proposed modeling scheme to both classification and regression tasks involving complex systems. In the membrane example in particular, BDL is capable of providing an accurate description of the highly nonlinear mapping between different design variables and various structural performance indicators, and it produces virtually identical uncertainty quantification results to the conventional Monte Carlo method.
Furthermore in the wind field prediction example, BDL model is trained and tested using very limited wind tunnel data, and it is shown that the probabilistic model is not only able to effectively recover the entire mapping of the mean and root-mean-square pressure fields with high precision using as small as 1% of the data but also quantifies the uncertainty level at every single point in the prediction domain, serving as a reliable surrogate for learning complex field distribution.

To improve the model performance, it is envisaged that the combination of the BDL model with information from underlying physics can not only further accelerate the training of neural networks but also holds the promise of interpreting the learning process.

[Figure 11: Evolution of the variational posterior distribution. Three model parameters $\omega_1$, $\omega_3$, and $\omega_{17}$ are randomly selected. The first row, panels (a.1) to (a.4), corresponds to the joint distribution $p(\omega_1, \omega_3)$, and the second row, panels (b.1) to (b.4), plots the joint distribution $p(\omega_3, \omega_{17})$.]

6. Acknowledgments

This work was supported by the National Science Foundation (NSF) under Grant No. 1520817 and No. 1612843. This support is gratefully acknowledged.

7. References

[1] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences (2016) 201517384.
[2] M. Raissi, G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics 357 (2018) 125–141.
[3] C. Soize, Identification of high-dimension polynomial chaos expansions with random coefficients for non-gaussian tensor-valued random fields using partial and limited experimental data, Computer methods in applied mechanics and engineering 199 (33-36) (2010) 2150–2164.
[4] Y. Yang, S. Nagarajaiah, Output-only modal identification with limited sensors using sparse component analysis, Journal of Sound and Vibration 332 (19) (2013) 4741–4765.
[5] J. Javh, J. Slavič, M. Boltežar, High frequency modal identification on noisy high-speed camera data, Mechanical Systems and Signal Processing 98 (2018) 344–351.
[6] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural information processing systems, 2013, pp. 1196–1204.
[7] Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature 521 (7553) (2015) 452.
[8] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, in: Advances in neural information processing systems, 2017, pp. 5574–5584.
[9] C. Robert, Machine learning, a probabilistic perspective (2014).
[10] C. E. Rasmussen, Gaussian processes in machine learning, in: Advanced lectures on machine learning, Springer, 2004, pp. 63–71.
[11] R. G. Ghanem, P. D. Spanos, Stochastic finite element method: Response statistics, in: Stochastic Finite Elements: A Spectral Approach, Springer, 1991, pp. 101–119.
[12] D. Xiu, G. E. Karniadakis, Modeling uncertainty in flow simulations via generalized polynomial chaos, Journal of computational physics 187 (1) (2003) 137–167.
[13] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, R. Adams, Scalable bayesian optimization using deep neural networks, in: International Conference on Machine Learning, 2015, pp. 2171–2180.
[14] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, Vol. 1, MIT press Cambridge, 2016.
[15] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[16] D. J. MacKay, A practical bayesian framework for backpropagation networks, Neural computation 4 (3) (1992) 448–472.
[17] D. J. MacKay, Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks, Network: Computation in Neural Systems 6 (3) (1995) 469–505.
[18] R. M. Neal, Bayesian learning for neural networks, Vol. 118, Springer Science & Business Media, 2012.
[19] H. K. Lee, Bayesian nonparametrics via neural networks, Vol. 13, SIAM.
[20] E. T. Nalisnick, On priors for bayesian neural networks, Ph.D. thesis, UC Irvine (2018).
[21] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. R. Soc. Lond. A 186 (1007) (1946) 453–461.
[22] Y. Gal, Uncertainty in deep learning, University of Cambridge.
[23] Y. Li, Approximate inference: New visions, Ph.D. thesis, University of Cambridge (2018).
[24] J. Zhu, J. Chen, W. Hu, B. Zhang, Big learning with bayesian methods, National Science Review 4 (4) (2017) 627–651.
[25] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424.
[26] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (518) (2017) 859–877.
[27] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul, An introduction to variational methods for graphical models, Machine learning 37 (2) (1999) 183–233.
[28] M. D. Hoffman, D. M. Blei, C. Wang, J. Paisley, Stochastic variational inference, The Journal of Machine Learning Research 14 (1) (2013) 1303–1347.
[29] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in: Artificial Intelligence and Statistics, 2014, pp. 814–822.
[30] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.
[31] D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082.
[32] N. M. Nasrabadi, Pattern recognition and machine learning, Journal of electronic imaging 16 (4) (2007) 049901.
[33] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: a system for large-scale machine learning, in: OSDI, Vol. 16, 2016, pp. 265–283.
[34] H. Robbins, S. Monro, A stochastic approximation method, in: Herbert Robbins Selected Papers, Springer, 1985, pp. 102–109.
[35] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[36] J. T. Springenberg, A. Klein, S. Falkner, F. Hutter, Bayesian optimization with robust bayesian neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 4134–4142.
[37] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines.
[38] X. Luo, A. Kareem, A convnet-based surrogate for uncertainty quantification of structural systems considering the spatial variability of material properties.
[39] N.-H. Kim, Introduction to nonlinear finite element analysis, Springer Science & Business Media, 2014.
[40] R. J. Kuether, M. S. Allen, A numerical approach to directly compute nonlinear normal modes of geometrically nonlinear finite element models, Mechanical Systems and Signal Processing 46 (1) (2014) 1–15.
[41] TPU aerodynamic database: Wind pressure database based on wind tunnel experiment for high-rise building (http://wind.arch.t-kougei.ac.jp/system/eng/contents/code/tpu).

Electrical Engineering and Systems Science arXiv (Cornell University)

Bayesian deep learning with hierarchical prior: Predictions from limited and noisy data

ISSN
0167-4730
eISSN
ARCH-3348
DOI
10.1016/j.strusafe.2019.101918

Abstract

Datasets in engineering applications are often limited and contaminated, mainly due to unavoidable measurement noise and signal distortion. Thus, using conventional data-driven approaches to build a reliable discriminative model, and further applying this identified surrogate to uncertainty analysis, remains very challenging. In this paper, a deep learning (DL) based probabilistic model is presented to provide predictions based on limited and noisy data. To address noise perturbation, the Bayesian learning method, which naturally facilitates an automatic updating mechanism, is considered to quantify and propagate model uncertainties into predictive quantities. Specifically, hierarchical Bayesian modeling (HBM) is first adopted to describe model uncertainties, which allows the prior assumption to be less subjective while also making the proposed surrogate more robust. Next, Bayesian inference is seamlessly integrated into the DL framework, which in turn supports probabilistic programming by yielding a probability distribution of the quantities of interest rather than their point estimates. Variational inference (VI) is implemented for the posterior distribution analysis, where the intractable marginalization of the likelihood function over the parameter space is framed in an optimization format, and the stochastic gradient descent method is applied to solve this optimization problem. Finally, Monte Carlo simulation is used to obtain an unbiased estimator in the predictive phase of Bayesian inference, where the proposed Bayesian deep learning (BDL) scheme is able to offer confidence bounds for the output estimation by analyzing propagated uncertainties. The effectiveness of Bayesian shrinkage is demonstrated in improving predictive performance using contaminated data, and various examples are provided to illustrate concepts, methodologies, and algorithms of this proposed BDL modeling technique.

Keywords: Probabilistic modeling, Bayesian inference, Deep learning, Monte Carlo variational inference, Bayesian hierarchical modeling, Noisy data

Corresponding author. 156 Fitzpatrick Hall, Notre Dame, IN 46556, USA. Email addresses: xluo1@nd.edu (Xihaier Luo), kareem@nd.edu (Ahsan Kareem)

Preprint submitted to Elsevier, July 10, 2019. arXiv:1907.04240v1 [stat.ML] 8 Jul 2019

1. Introduction

The application of data-driven approaches for learning the performance of engineering systems using limited experimental data is hindered by at least three factors. First, the original input-output patterns are often governed by a series of highly nonlinear and implicit partial differential equations (PDEs), and hence approximation of their functional relationships may be proportionally computationally demanding [1, 2]. Secondly, interpolation and extrapolation techniques are usually needed to extract knowledge from acquired data, since only a limited number of sensors are used in practice and sensor malfunction often occurs in real time. Nevertheless, it is very difficult to establish an accurate discriminative model merely from data, especially when only a relatively small dataset is available [3, 4]. Thirdly, experimental data is inevitably contaminated by noise from different sources, for example, noisy sensing induced by signal perturbation during monitoring. The performance of conventional discriminative algorithms may be noticeably impaired if proper noise reduction has not been performed [5, 6]. In this context, we present a machine learning based predictive model that is capable of providing high-quality predictions from limited and noisy data.
To date, most machine learning models are deterministic, which implies that a certain input sample $x$ is strictly bound to a point estimator $y$ notwithstanding the existence of model uncertainties [7, 8]. Probabilistic modeling, on the other hand, emerges as an attractive alternative on account of its ability to quantify the uncertainty in model predictions, which can prevent a poorly trained model from being overconfident in predictions, and hence helps stakeholders make a more reliable decision [7, 9]. In the literature, Gaussian processes (GPs) and generalized polynomial chaos (gPC) are two representative members of the probabilistic modeling family [9, 10, 11, 12]. From a mathematical standpoint, GPs put a joint Gaussian distribution over input random variables by defining a mean function $E[x]$ and a covariance function $\mathrm{Cov}[x_i, x_j]$ [10]. Then, GPs compute hyperparameters of the spatial covariance function and propagate the inherent randomness of $x$ by virtue of Bayes' theorem. Next, gPC is an effective way to propagate uncertain quantities by means of a set of random coefficients $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ and orthogonal polynomials $\{\phi_1, \phi_2, \ldots, \phi_n\}$. Approximation approaches (e.g. Galerkin projection) are usually used to determine the unknown coefficients and polynomial basis $\phi(x)$ [12]. Even though GPs and gPC are capable of computing empirical confidence intervals, the inference complexity may become overwhelming when the number of observations increases (e.g. a cubic scaling relationship $\mathcal{O}(N^3)$ between the computational complexity and the data $\{x_i, y_i\}_{i=1}^{N}$ is found in the GPs case [13]). Furthermore, scaling a GPs model or identifying random coefficients of a gPC model for problems with high-dimensional data remains challenging [3, 9, 10]. On the contrary, deep learning (DL) has differentiated itself within the realm of machine learning for its superior performance in handling large-scale complex systems.
With the strong support of high-performance computing (HPC), DL has made significant accomplishments in a wide range of applications such as image recognition, data compression, computer vision, and language processing [14, 15]. Nonetheless, the same level of application has not been observed for its probabilistic version [16, 17, 18]. This paper attempts to bridge the modeling gap between DL and Bayesian learning by presenting a new-paradigm Bayesian deep learning (BDL) model. With the aim of developing a surrogate model that can be used to accelerate the uncertainty analysis of engineering systems using noisy data, we focus on three main aspects in probabilistic modeling (see Fig. 1).

First, Bayesian statistical inference encodes subjective beliefs, whereby prior distributions are imposed on model parameters to represent the initial uncertainty. A conventional DL supported surrogate $\hat{\mathcal{M}}(\cdot)$, however, can be captivating and confusing in equal measure. Owning a deeply structured network architecture allows $\hat{\mathcal{M}}(\cdot)$ to approximate a wide variety of functions, but also makes model parameters hard to interpret, which in turn increases the difficulty of choosing a reasonable prior distribution for $\hat{\mathcal{M}}(\cdot)$ [19, 20]. In [19], Lee shows that a noninformative prior (e.g. the Jeffreys prior) that bears objective Bayes properties is liable to be misled by the variability in data. Moreover, using the Fisher information matrix to compute a Jeffreys prior for a large network architecture can be computationally prohibitive [21]. On the informative prior side, the zero-centered Gaussian is, not surprisingly, extensively explored in early work on Bayesian neural networks (BDL) on account of its flexibility in implementation as well as its natural regularization mechanism [16, 17], where the quadratic penalty for BDL parameters alleviates the overfitting problem. Later in [18], Neal points out that employing a heavy-tailed distribution (e.g. the Cauchy distribution) to represent prior knowledge can provide a more robust shrinkage estimator and diminish the effects of outlying data. Nonetheless, the Cauchy distribution is difficult to implement because it does not have finite moments [20]. To develop a prior model that is more amenable to reform and work with, we investigate the efficacy of applying hierarchical Bayesian modeling (HBM) to define the prior distribution. It is found that HBM can efficiently reduce the variance in model performance induced by prior assumptions by diffusing the influences that are cascaded down from the top level, hence allowing a relatively more robust distribution model.

[Figure 1: The proposed BDL model $\hat{\mathcal{M}}(\cdot)$ aims at reducing the computational burden imposed by the repeated evaluations of the original high-fidelity model $\mathcal{M}(\cdot)$ in uncertainty analysis. Instead of a point estimate, the BDL model quantifies the model uncertainties by presenting a distribution of values. Phase 1: train a surrogate using noisy data. Phase 2: conduct uncertainty analysis using trained surrogate.]

Secondly, the determination of the posterior distribution $p(\omega \mid \mathcal{D})$ requires integrating model parameters $\{\omega_1, \omega_2, \ldots, \omega_n\}$ out of the likelihood function. Unfortunately, performing numerical integration in BDL's parameter space is always computationally intractable as a neural network model is commonly configured with hundreds or thousands of parameters [22, 23]. Approximation methods, which can be broadly classified into sampling methods and optimization methods, have been introduced to alleviate this computational bottleneck [24].
For the sampling method, Markov Chain Monte Carlo (MCMC) has been explored in early work to calculate the posterior probabilities [17, 20, 25]. The main idea of MCMC is to produce numerical samples from the posterior distribution by simulating a discrete but dependent Markov chain $\{\omega_i\}_{i=1,2,\ldots}$ on the state space, where $\pi(\omega) \approx p(\omega \mid \mathcal{D})$. The sufficient condition that ensures the stationary distribution converges to the target posterior distribution requires the transition kernel $T(\cdot)$ to satisfy the detailed balance property $p(\omega \mid \mathcal{D})\, T(\omega \to \omega') = p(\omega' \mid \mathcal{D})\, T(\omega' \to \omega)$. However, the convergence of MCMC algorithms can be extremely slow in the presence of large datasets since the burn-in process to eliminate the initialization bias is greatly extended [9, 23]. More recently, the variational inference (VI) method has been employed for inferring an intractable posterior distribution, where the probabilistic inference problem is cast in a deterministic optimization form [26, 27, 28]. A proxy probability distribution $q(\omega)$, which is formally explicit and computationally efficient, is introduced to approximate the true posterior distribution $p(\omega \mid \mathcal{D})$. Compared to the sampling method, VI has the advantage of approximating non-conjugate distributions by virtue of optimizing an explicit objective function [23, 24], and VI solves the intractable integrals in a more efficient manner because off-the-peg optimization algorithms can be seamlessly adapted to the minimization problem [26, 28, 29]. Unfortunately, the differentiation of the objective function regarding the proxy posterior involves the determination of expectations with respect to the variational parameters, where the Monte Carlo gradient estimator may give a high variance [30, 31]. To address this issue, we reparameterize our objective function by introducing a set of auxiliary variables.
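The variance issue mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy objective $f(\omega) = \omega^2$, the Gaussian proxy $q(\omega) = \mathcal{N}(\mu, \sigma^2)$, and the function name `grad_estimates` are all assumptions chosen for illustration. Both estimators target the same gradient $\partial_\mu E_q[\omega^2] = 2\mu$; the score-function estimator weights $f$ by the gradient of $\log q$, while the reparameterized estimator writes $\omega = \mu + \sigma\epsilon$ with an auxiliary variable $\epsilon \sim \mathcal{N}(0, 1)$ and differentiates through the sample.

```python
import random
import statistics


def grad_estimates(mu, sigma, n_samples, seed=0):
    """Per-sample estimates of d/dmu E_q[omega^2] with q = N(mu, sigma^2).

    Hypothetical toy setting (not from the paper): f(omega) = omega^2,
    so the exact gradient is 2 * mu.
    """
    rng = random.Random(seed)
    score, reparam = [], []
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)          # auxiliary variable
        omega = mu + sigma * eps           # reparameterized sample from q
        f = omega ** 2
        # Score-function estimator: f(omega) * d log q(omega) / d mu
        score.append(f * (omega - mu) / sigma ** 2)
        # Reparameterized estimator: df/domega * domega/dmu, with domega/dmu = 1
        reparam.append(2.0 * omega)
    return score, reparam


if __name__ == "__main__":
    score, reparam = grad_estimates(mu=1.0, sigma=1.0, n_samples=20000)
    print("exact gradient:", 2.0)
    print("score-function mean/variance:",
          statistics.mean(score), statistics.variance(score))
    print("reparameterized mean/variance:",
          statistics.mean(reparam), statistics.variance(reparam))
```

Both sample means approach the exact gradient, but in this toy setting the reparameterized estimates are markedly less dispersed, which is the property that makes stochastic gradient optimization of the objective practical.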
It should be noted that such reparameterization not only yields an unbiased estimator of the objective function but also provides an efficient approximation of the variational parameters by permitting the use of stochastic gradient descent (SGD) in optimization.

Lastly, the Monte Carlo (MC) method is used in the predictive phase of Bayesian inference. The MC method draws numerical samples from the proxy probability distribution $q(\omega)$, builds a predictive probability distribution of new data, and assigns a confidence level to the model prediction for representing model uncertainty.

The remainder of this paper is organized as follows: Section 2 gives a brief introduction of surrogate modeling using deep neural networks. Section 3 describes the proposed BDL model in detail. In Section 4, various examples are provided to demonstrate the effectiveness of BDL in dealing with noisy data. Finally, Section 5 draws major conclusions.

2. Deterministic modeling: a deep learning framework

2.1. Neural networks based surrogate model

In the context of supervised learning [9, 15], let $x = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^m$ denote an input vector, and the corresponding output vector $y = [y_1, y_2, \ldots, y_n] \in \mathbb{R}^n$ is estimated by a computationally intensive model $\mathcal{M}(\cdot)$ (e.g. a large finite element model). We are interested in using neural networks to approximate the functional relationship $F(\cdot)$ between $x$ and $y$:

$$ y = F(x) \;\rightarrow\; \hat{y} = \hat{F}(x) \tag{1} $$

where $\hat{F}(\cdot)$ is the mathematical expression of the neural networks based surrogate $\hat{\mathcal{M}}(\cdot)$, and theoretically it can be broken down as:

$$ \hat{F}(x) = \hat{f}^{K} \circ \hat{f}^{K-1} \circ \cdots \circ \hat{f}^{1}(x) \tag{2} $$

with $K$ denoting the number of layers, and $\circ$ symbolizing the functional composition operation, which is defined as [32]:

$$ \hat{f}^{i} \circ \hat{f}^{j} = \hat{f}^{i}\left(\hat{f}^{j}(\cdot)\right) \tag{3} $$

Each function in the sequence $\hat{f}^{i}(\cdot),\; i = 1, 2, \ldots, K$ contains two steps, where the first step is identical to a linear regression:

$$ z^{i} = \hat{f}^{i}(x^{i}) = \omega^{i} x^{i} + b^{i} \tag{4} $$

In Eq. (4), $x^{i}$ is the input vector of the $i$-th layer, $\omega^{i}$
is the weight matrix, and $b^{i}$ is the $i$-th bias term. For the sake of brevity, $b^{i}$ can be integrated into $\omega^{i}$ by introducing an additional input variable $x = 1$. For the rest of the paper, let $\omega$ be a tensor containing all model parameters [32, 33]. Next, $\hat{f}^{i}(\cdot)$ applies an element-wise nonlinear transformation to the intermediate output $z^{i}$ in the second step:

$$ y^{i} = \hat{f}^{i}(x^{i}) = \sigma(z^{i}) \tag{5} $$

where $\sigma(\cdot)$ is often referred to as the activation function [14, 15]. The selection of $\sigma(\cdot)$ directly depends on the characteristics of $\mathcal{M}(\cdot)$. Moreover, a network architecture is deemed to be deep when $K > 3$ [14]. Thus, a deep learning framework can be effectively built by increasing the composition size $K$.

2.2. Probabilistic interpretation of the loss function $\mathcal{L}$

Consider a parameterised deep neural network model $\hat{y} = \hat{F}(x)$ described in Section 2.1 and a training dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$. The next step is to find an optimal $\omega$ such that the surrogate $\hat{F}(x)$ best describes the data $\mathcal{D}$. In this regard, a loss function $\mathcal{L}(\cdot)$ that measures the error between the predicted value $\hat{y}$ and the expected result $y$ is defined. Then, $\hat{F}(x)$ is trained to approximate $F(x)$ by minimizing the empirical loss through tuning the model parameters $\omega$:

$$ \omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N} \mathcal{L}\left(y_i, \hat{F}(x_i)\right) \tag{6} $$

From a probabilistic modeling perspective, the interest of the loss function $\mathcal{L}(\cdot)$ is in the probability distribution of $y$ as a function of $x$:

$$ \mathcal{L}\left(y, \hat{F}(x)\right) = p(\mathcal{D} \mid \omega) = p(y \mid x, \omega) \tag{7} $$

where the model parameters can be learned by the method of maximum likelihood estimation (MLE) [9, 22], which searches for an estimator of $\omega$ that maximizes the likelihood term, $\omega^{\mathrm{MLE}} = \arg\max_{\omega} p(\mathcal{D} \mid \omega)$. Let the noise term be independent and identically distributed (i.i.d.); the probability density associated with paired observations under the Gaussian assumption can then be expressed as:

$$ \mathcal{L}\left(y, \hat{F}(x)\right) = p(\mathcal{D} \mid \omega, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}\left(y_i \mid \hat{F}(x_i), \sigma^2 I\right) = \prod_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\left\|y_i - \hat{F}(x_i)\right\|^2\right) \tag{8} $$

Usually, the numerical implementation of the MLE method performs the minimization problem of Eq.
(6) in a logarithmic scale:

$$ \omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N} \left[\frac{1}{2\sigma^2}\left\|y_i - \hat{F}(x_i)\right\|^2 + \frac{n}{2}\log\left(2\pi\sigma^2\right)\right] \tag{9} $$

where the noise variance $\sigma^2$ (the inverse of the precision) is determined by minimizing the negative log likelihood:

$$ \sigma^2_{\mathrm{MLE}} = \frac{1}{nN}\sum_{i=1}^{N}\left\|y_i - \hat{F}(x_i)\right\|^2 \tag{10} $$

Furthermore, Eq. (8) can be further simplified under homoscedastic conditions:

$$ \omega^{\star} = \arg\min_{\omega} \sum_{i=1}^{N}\left\|y_i - \hat{F}(x_i)\right\|^2 \tag{11} $$

Hence, the probability density based loss function coincides with the well-known mean squared error (MSE).

2.3. Stochastic optimization for updating model parameters

Gradient based optimization is one of the most popular approaches for optimizing neural networks:

$$ \omega_{t+1} \leftarrow \omega_{t} - \eta\, \nabla\mathcal{L}(\omega_t) \tag{12} $$

where $\eta$ is generally known as the learning rate, which follows the Robbins-Monro conditions. The objective function stated in Eq. (11) indicates that $\mathcal{O}(N)$ operations are required to compute $\mathcal{L}(\cdot)$ and $\nabla\mathcal{L}(\cdot)$ respectively, which may be computationally demanding for a large dataset. Therefore, stochastic gradient descent (SGD) is considered [14, 15, 32], where a random vector $g(x)$ is defined to calculate the gradients [34]. With a restricted magnitude of stochastic gradients $E\|g(x)\|^2 \le N^2$ and a bounded variance $E\|g(x) - \nabla\mathcal{L}(x)\|^2 \le \sigma^2$ [34], SGD can efficiently update model parameters by constructing a noisy natural gradient:

$$ \nabla\mathcal{L}(\cdot) = E[g(x)] \tag{13} $$

In particular, adaptive moment estimation (ADAM) [35], which computes adaptive learning rates for $\omega$, is adopted, and the corresponding $g(x)$ takes the expression:

$$ g(x) = \frac{\hat{M}_t}{\sqrt{\hat{V}_t} + \epsilon}, \qquad \hat{M}_t = \frac{M_t}{1 - \beta_1^{t}}, \qquad \hat{V}_t = \frac{V_t}{1 - \beta_2^{t}} \tag{14} $$

where $M_t$ and $V_t$ are estimates of the mean and variance of the gradients respectively. In ADAM, they are updated as follows [35]:

$$ M_t = \beta_1 M_{t-1} + (1 - \beta_1)\,\nabla\mathcal{L}_t(\cdot), \qquad V_t = \beta_2 V_{t-1} + (1 - \beta_2)\left(\nabla\mathcal{L}_t(\cdot)\right)^2 \tag{15} $$

It should be noted that the expression of Eq. (14) is an unbiased estimation of the exact gradient, and its calculation only depends on one data point.

3.
Probabilistic modeling: a Bayesian approach

In this section, the aforementioned deterministic DL surrogate is enhanced to account for model uncertainties by the integration of Bayesian inference. Overall, Bayesian learning includes three steps: (1) establish prior beliefs about uncertain parameters; (2) compute the posterior distribution via Bayes' rule; and (3) use the predictive distribution to determine a yet unobserved data point.

3.1. Prior representation: Bayesian hierarchical modelling

To begin with, let $U_E$ and $U_A$ represent the epistemic uncertainty and aleatory uncertainty respectively [8]. Prior information on $U_E$ and $U_A$ is initially encapsulated in the form of a probability distribution function. For the epistemic uncertainty, prior distributions are imposed on the model parameters $\omega$ [16, 17, 18]:

$$ \omega \sim p(\omega) \tag{16} $$

Practical applications imply that the prior distribution $p(\omega)$ should not be too restrictive on account of the limited prior information about $\omega$ [20]. For this reason, the hierarchical Bayesian modeling (HBM) method, which introduces a vector of hyperparameters $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n]$ to the prior distribution, is employed to reduce the undue influence of subjective information on $p(\omega)$ [9]. Consequently, the marginal prior can be obtained by integrating out $\lambda$ through the sum rule:

$$ p(\omega) = \int p(\omega, \lambda)\, d\lambda \tag{17} $$

where the joint probability distribution can be further expressed as a product of conditional distributions by applying the product rule:

$$ p(\omega, \lambda) = p(\omega \mid \lambda)\, p(\lambda) \tag{18} $$

For probabilistic modeling, model parameters in each layer of a BDL model are often assumed to follow a factorized multivariate Gaussian distribution:

$$ p(\omega) = \prod_{i=1}^{K} p\left(\omega_i \mid \mu_{\omega_i}, \tau_{\omega_i}\right) = \prod_{i=1}^{K} \mathcal{N}\left(\mu_{\omega_i}, \tau_{\omega_i}^{-1} I\right) \tag{19} $$

At the first hierarchy stage, let $\mu_\omega = 0$ and let $\tau_\omega$ be a Gamma random variable, for instance. Hence, HBM breaks the prior distribution down to:

$$ p(\omega, \tau_\omega \mid \alpha, \beta) = p(\omega \mid \tau_\omega)\, p(\tau_\omega \mid \alpha, \beta), \quad \text{where } \alpha > 0,\; \beta > 0 \tag{20} $$
with $\alpha$ and $\beta$ denoting the shape parameter and rate parameter of the Gamma distribution respectively. Using Eq. (17) and Eq. (18), the prior distribution can be reformulated as:

$$ p(\omega \mid \alpha, \beta) = \int p(\omega, \tau_\omega \mid \alpha, \beta)\, d\tau_\omega = \mathrm{St}\left(0, \frac{\alpha}{\beta}, 2\alpha\right) \tag{21} $$

In Eq. (21), $\mathrm{St}(\cdot)$ characterizes the Student's t-distribution, which is capable of providing heavier tails than the Gaussian distribution.

Remark (1). HBM grants a more impartial prior distribution by allowing the data to speak for itself [9], and it admits a more general modeling framework where the hierarchical prior becomes a direct prior when the hyperparameters are modeled by a Dirac delta function (e.g. using $\delta(x - \tau_0)$ to describe the precision term in Eq. (19)). In addition, HBM offers the flexibility to work with a wide range of probability distributions, and even directly provides an analytical solution for some of the most popular choices such as the Laplace, Gaussian, and Student's t-distributions [20].

On the other hand, homoscedastic noise $\epsilon$ that is independent of the input data, $\mathrm{Var}[\epsilon \mid (x_i, y_i)] = \sigma_\epsilon^2 \;\; \forall (x_i, y_i) \in \mathcal{D}$, is added to the output in consideration of the aleatory uncertainty $U_A$, which cannot be explained away by accepting more samples $\{x, y\}$. Besides, the additive noise term $\epsilon$ guarantees a tractable likelihood for the probabilistic model $\hat{\mathcal{M}}(\cdot)$, where $\epsilon$ is most commonly modeled as a Gaussian process:

$$ p\left(\epsilon \mid \mu_\epsilon, \sigma_\epsilon^2\right) = \mathcal{N}\left(\mu_\epsilon, \sigma_\epsilon^2\right) \tag{22} $$

Let $\mu_\epsilon = 0$ and $\sigma_\epsilon$ be a constant; the output vector then follows:

$$ y \sim \mathcal{N}\left(y \mid E_{p(\omega \mid \mathcal{D})}\left[\hat{F}(x)\right], \sigma_\epsilon^2 I\right) \tag{23} $$

A graphical model representation of the aforestated hierarchical prior, as well as an illustration of the BDL model, is given in Fig. 2.

[Figure 2: The architecture of Bayesian deep learning with hierarchical prior.]

3.2. Posterior approximation: variational inference

After defining a hierarchical prior distribution for the proposed probabilistic model $\hat{\mathcal{M}}(\cdot)$, the next step is to infer the posterior distribution, which reflects the updated parameter information.
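As a numerical aside on the hierarchical prior of Section 3.1, the heavier tails of the Gamma-Gaussian construction in Eqs. (19) to (21) can be checked by sampling. The sketch below is illustrative only: the values $\alpha = \beta = 2$, the threshold of 3, and the function names are assumptions rather than settings taken from the paper. A precision $\tau_\omega$ is drawn from a Gamma distribution with shape $\alpha$ and rate $\beta$, then $\omega$ is drawn from $\mathcal{N}(0, \tau_\omega^{-1})$, and the tail mass is compared with that of a direct prior $\mathcal{N}(0, 1)$.

```python
import math
import random


def sample_hierarchical_prior(alpha, beta, n_samples, seed=0):
    """Draw omega ~ N(0, 1/tau) with tau ~ Gamma(shape=alpha, rate=beta)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate
        tau = rng.gammavariate(alpha, 1.0 / beta)
        samples.append(rng.gauss(0.0, 1.0 / math.sqrt(tau)))
    return samples


def sample_direct_prior(n_samples, seed=0):
    """Draw omega ~ N(0, 1), the direct (non-hierarchical) prior."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def tail_fraction(samples, threshold):
    """Fraction of samples whose magnitude exceeds the threshold."""
    return sum(abs(s) > threshold for s in samples) / len(samples)


if __name__ == "__main__":
    hier = sample_hierarchical_prior(alpha=2.0, beta=2.0, n_samples=50000)
    direct = sample_direct_prior(n_samples=50000)
    print("P(|omega| > 3), hierarchical prior:", tail_fraction(hier, 3.0))
    print("P(|omega| > 3), direct prior:      ", tail_fraction(direct, 3.0))
```

With these assumed values the marginal prior is a Student's t-distribution with $2\alpha = 4$ degrees of freedom, so noticeably more prior mass sits far from zero than under the direct Gaussian prior, which is what lets the hierarchical prior tolerate outlying data.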
In the Bayesian formalism, the joint posterior distribution $p(\omega, \theta \mid D)$ is calculated by:

\[ p(\omega, \theta \mid D) = \frac{p(D \mid \omega, \theta)\, p(\omega \mid \theta)\, p(\theta)}{p(D)} \tag{24} \]

The marginal posterior distribution $p(\omega \mid D)$ can be further determined by integrating $\theta$ out of the joint posterior distribution. The denominator of Eq. (24), often referred to as the model evidence, takes the form:

\[ p(D) = \int p(D \mid \omega)\, p(\omega)\, d\omega \tag{25} \]

In most cases, evaluating Eq. (25) is computationally intractable, as numerical integration requires a considerable number of samples when the dimension of the parameter space $\mathcal{W}$ is very high [18]. To overcome this integration problem, variational inference (VI) is adopted so that Bayesian inference can proceed efficiently [26, 28].

Remark (2). Unlike the method of maximum a posteriori (MAP), which captures only the mode of a posterior distribution [18], the objective of the posterior approximation here is to find a computationally efficient replacement for the true posterior distribution, so that numerical samples are easily accessible in the predictive analysis.

3.2.1. Objective function: evidence lower bound

In VI, a family of proxy distributions parameterized by $\phi$ is posited to approximate the true posterior distribution:

\[ p(\omega \mid D) \approx q_\phi(\omega) \in \mathcal{Q} \tag{26} \]

VI attempts to make $q_\phi(\omega)$ as close as possible to $p(\omega \mid D)$ by refining $\phi$, and one typical measure of the closeness between two probability distributions is the Kullback-Leibler (KL) divergence [27, 28]. Therefore, VI casts the approximation problem in an optimization form, where the objective function can be expressed as:

\[ p(\omega \mid D) \simeq q_{\phi^*}(\omega) = \arg\min_{q_\phi \in \mathcal{Q}} \mathrm{KL}\left(q_\phi(\omega)\, \|\, p(\omega \mid D)\right) = \arg\min_{q_\phi \in \mathcal{Q}} \int q_\phi(\omega) \log \frac{q_\phi(\omega)}{p(\omega \mid D)}\, d\omega \tag{27} \]

Instead of minimizing the KL divergence, we can equivalently maximize the evidence lower bound (ELBO) $\mathcal{L}(\omega, \phi)$ [28]:

\[ q_{\phi^*}(\omega) = \arg\max \mathcal{L}(\omega, \phi) = \arg\max E_{q_\phi(\omega)}\left[\log p(\omega, D) - \log q_\phi(\omega)\right] = \arg\max \left\{ E_{q_\phi(\omega)}\left[\log p(D \mid \omega)\right] - \mathrm{KL}\left(q_\phi(\omega)\, \|\, p(\omega)\right) \right\} \tag{28} \]

The first term in Eq. (28), the expected conditional log-likelihood, is usually referred to as the data term [28, 29].
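The decomposition in Eq. (28) can be checked on a conjugate toy problem in which every quantity is available in closed form. The sketch below assumes a scalar parameter with prior $\mathcal{N}(0, 1)$, observations $y_i \sim \mathcal{N}(\omega, \sigma^2)$ with known noise variance, and a Gaussian proxy $\mathcal{N}(m, s^2)$; this model, the data values in the usage example, and the function names are illustrative assumptions, not the network model of the paper.

```python
import math

def elbo_gaussian_mean(y, noise_var, m, s2):
    """ELBO of Eq. (28) for y_i ~ N(w, noise_var), w ~ N(0, 1), q(w) = N(m, s2)."""
    # Data term: E_q[log p(D | w)], closed form because q is Gaussian.
    expected_loglik = sum(
        -0.5 * math.log(2.0 * math.pi * noise_var)
        - ((yi - m) ** 2 + s2) / (2.0 * noise_var)
        for yi in y
    )
    # Regularization term: KL(N(m, s2) || N(0, 1)).
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return expected_loglik - kl

def exact_posterior(y, noise_var):
    """Mean and variance of the exact Gaussian posterior p(w | D)."""
    precision = 1.0 + len(y) / noise_var
    return sum(y) / noise_var / precision, 1.0 / precision

def log_evidence(y, noise_var):
    """log p(D), using y ~ N(0, noise_var * I + 1 1^T) after integrating out w."""
    n, s1, s2 = len(y), sum(y), sum(v * v for v in y)
    log_det = (n - 1) * math.log(noise_var) + math.log(noise_var + n)
    quad = (s2 - s1 * s1 / (noise_var + n)) / noise_var
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

if __name__ == "__main__":
    data, noise_var = [0.8, 1.3, 0.4, 1.1], 0.25
    m_star, s2_star = exact_posterior(data, noise_var)
    print(elbo_gaussian_mean(data, noise_var, m_star, s2_star))  # equals log evidence
    print(log_evidence(data, noise_var))
    print(elbo_gaussian_mean(data, noise_var, m_star + 0.3, s2_star))  # strictly lower
```

When the proxy coincides with the exact posterior the KL gap vanishes and the ELBO equals the log evidence, while any other choice of $(m, s^2)$ gives a smaller value. In this sketch the variable `expected_loglik` corresponds to the data term.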
The data term compels the posterior distribution to explain the data $D$ by maximizing the expected log-likelihood. Mini-batch optimization is implemented to efficiently provide an unbiased stochastic estimator of the log-likelihood:

\[ \log p(D \mid \omega) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \omega) \approx \frac{N}{M} \sum_{i=1}^{M} \log p(y_i \mid x_i, \omega) \tag{29} \]

where $M \le N$ is the size of a mini-batch drawn from the $N$ training samples, and the factor $N/M$ rescales the mini-batch sum so that the estimator remains unbiased. Notably, besides accelerating the computation, mini-batch optimization updates the model more frequently, which allows for more robust convergence and hence increases the chance of escaping local minima [34, 35]. Meanwhile, the mean-field variational inference (MFVI) method [26, 28, 29] is adopted to control the computational complexity of the second term in Eq. (28):

\[ q_\phi(\omega) = \prod_{i=1}^{K} q_{\phi_i}(\omega_i) \tag{30} \]

The variational distribution is represented by a layer-wise factorized distribution in which each factor is determined by its own variational parameter:

\[ q^*(\omega_i) = \frac{\exp\left( \int \log p(D, \omega) \prod_{j \neq i} q(\omega_j)\, d\omega_j \right)}{\int \exp\left( \int \log p(D, \omega) \prod_{j \neq i} q(\omega_j)\, d\omega_j \right) d\omega_i} \tag{31} \]

Substituting Eq. (29) and Eq. (30) back into Eq. (28), the ELBO objective can be rewritten as:

\[ \mathcal{L}(\omega, \phi) = \frac{N}{M} \sum_{i=1}^{M} \int \log p(y_i \mid x_i, \omega) \prod_{j=1}^{K} q_{\phi_j}(\omega_j)\, d\omega - \sum_{j=1}^{K} \int q_{\phi_j}(\omega_j) \log \frac{q_{\phi_j}(\omega_j)}{p(\omega_j)}\, d\omega_j \tag{32} \]

where the iteration of the variational distribution for the model parameters terminates when the convergence criterion is satisfied.

3.2.2. Gradients computation: stochastic gradient variational Bayes

Among the many techniques developed for solving optimization problems, gradient-based optimization tackles the ELBO maximization problem stated in Eq. (28) reliably and efficiently [31]. For the sake of brevity, let:

\[ A(\omega, \phi) = \log p(\omega, D) - \log q_\phi(\omega) \tag{33} \]

Using the log-derivative trick [28], the objective function $\mathcal{L}(\omega, \phi)$ can be differentiated with respect to the variational parameters $\phi$:

\[ \nabla_\phi \mathcal{L}(\omega, \phi) = \nabla_\phi \int q_\phi(\omega) A(\omega, \phi)\, d\omega = \int \left[ q_\phi(\omega) \frac{\partial \log q_\phi(\omega)}{\partial \phi} A(\omega, \phi) + q_\phi(\omega) \frac{\partial A(\omega, \phi)}{\partial \phi} \right] d\omega \tag{34} \]

To estimate the integral quickly, Eq. (34) can be written in its expectation form and the stochastic gradients computed with the Monte Carlo method:

\[ \nabla_\phi \mathcal{L}(\omega, \phi) = E_{q_\phi(\omega)}\left[ \frac{\partial \log q_\phi(\omega)}{\partial \phi} A(\omega, \phi) + \frac{\partial A(\omega, \phi)}{\partial \phi} \right] \tag{35} \]

However, it is observed that the crude MC estimator of $\nabla_\phi \mathcal{L}(\omega, \phi)$ usually exhibits large variance [30, 31]. For this reason, the stochastic gradient variational Bayes (SGVB) method is adopted to reduce the variance of the estimates [30]. Simply put, SGVB introduces an auxiliary variable $\varepsilon$ to the proxy distribution:

\[ q_\phi(\omega) = \int q_\phi(\omega, \varepsilon)\, d\varepsilon = \int q_\phi(\omega \mid \varepsilon)\, p(\varepsilon)\, d\varepsilon \tag{36} \]

where the conditional probability density function $q_\phi(\omega \mid \varepsilon)$ is formally defined as a Dirac delta function:

\[ q_\phi(\omega \mid \varepsilon) = \delta\left(\omega - g(\varepsilon, \phi)\right) \tag{37} \]

and $g(\varepsilon, \phi)$ is a differentiable transformation function that connects $\omega$ and $\varepsilon$:

\[ \omega = g(\varepsilon, \phi) \tag{38} \]

For instance, a simple choice for $p(\varepsilon)$ is the isotropic Gaussian distribution $p(\varepsilon) = \mathcal{N}(0, I)$ with i.i.d. components, and the reparameterization can be achieved through $\omega = \mu + \sigma \odot \varepsilon$. Therefore, substituting Eq. (36) and Eq. (37) into Eq. (35), the pathwise estimator can be expressed as [30]:

\[ \nabla_\phi \mathcal{L}(\omega, \phi) = E_{p(\varepsilon)}\left[ \frac{\partial \log p(\varepsilon)}{\partial \phi} A\left(g(\varepsilon, \phi), \phi\right) + \frac{\partial A\left(g(\varepsilon, \phi), \phi\right)}{\partial \phi} \right] \tag{39} \]

Since $p(\varepsilon)$ does not depend on $\phi$, the first term vanishes. Combining Eq. (33) and Eq. (39), the final Monte Carlo estimator for the gradients can be written as:

\[ \nabla_\phi \mathcal{L}(\omega, \phi) = E_{p(\varepsilon)}\left[ \frac{\partial}{\partial \omega} \left[ \log p(\omega, D) - \log q_\phi(\omega) \right] \frac{\partial g(\varepsilon, \phi)}{\partial \phi} \right] \tag{40} \]

With this reparameterized estimator, the variance of the stochastic gradients can be reduced by orders of magnitude [30], and the VI-based optimization problem can be solved efficiently by the stochastic gradient descent algorithm mentioned in the previous section.

3.3. Predictive evaluation: Monte Carlo sampling

The last but most important step of Bayesian computation concerns making predictions for new data samples $(x^*, y^*)$, where the predictive distribution can be expressed as:

\[ p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid D)\, d\omega \tag{41} \]

The optimized proxy posterior $q_{\phi^*}(\omega)$, which is obtained by solving the ELBO optimization problem, takes the place of the true posterior distribution $p(\omega \mid D)$:

\[ p(y^* \mid x^*, D) \simeq \int p(y^* \mid x^*, \omega)\, q_{\phi^*}(\omega)\, d\omega \tag{42} \]

In the same vein, the predictive integral is evaluated numerically by drawing random samples from the proxy distribution. An unbiased estimator is given by:

\[ p(y^* \mid x^*, D) \approx \frac{1}{k} \sum_{i=1}^{k} p(y^* \mid x^*, \omega^i), \quad \omega^i \sim q_{\phi^*}(\omega) \tag{43} \]

For the purpose of uncertainty representation, it is of great importance to compute the statistical moments of $y^*$, such as the mean:

\[ \hat{y}_{mean} = \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^* \mid \omega^i) \tag{44} \]

and the variance:

\[ \hat{y}_{var} = \sigma_\epsilon^2 I + \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^* \mid \omega^i)^T \hat{F}(x^* \mid \omega^i) - \left( \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^* \mid \omega^i) \right)^T \left( \frac{1}{k} \sum_{i=1}^{k} \hat{F}(x^* \mid \omega^i) \right) \tag{45} \]

These moments matter because $\hat{y}_{mean}$ and $\hat{y}_{var}$ are essential elements for constructing the acquisition function, which balances exploration and exploitation in the context of Bayesian optimization [10, 13].

4. Numerical examples and results

4.1. Example 1: nonlinear regression

The first example considers a nonlinear function that is commonly used as a test problem to assess the accuracy of a regression model [13, 14, 36]. Mathematically, it is written as:

\[ y = x \sin(x) \tag{46} \]

To identify common features and differences between the proposed model and current approaches, the regression problem is solved numerically using four different surrogate modeling methods: polynomial regression (PR), support vector machine (SVM), neural network (NN), and Bayesian deep learning (BDL). First, a polynomial $f(x) = \beta_0 + \beta_1 x + \cdots + \beta_n x^n$ of degree $n = 11$ is defined to fit the symmetric function [32]. The method of least squares is applied to find the best linear unbiased estimator (BLUE) of the regression coefficient vector by minimizing the sum of squared errors. Secondly, an SVM regression model with a Gaussian kernel function $G(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$ is implemented to build a mapping between $x$ and $y$ [9, 10]. The default value of the kernel coefficient $\gamma$ is 1, and the sequential minimal optimization (SMO) algorithm is utilized to update the coefficients, where a Karush-Kuhn-Tucker (KKT) violation tolerance of 0.0001 is specified as the convergence criterion [37].
Thirdly, a feedforward neural network with one hidden layer of 20 neurons is built to learn the nonlinear transformation [14, 32]. The hyperbolic tangent function is adopted as the activation function for the hidden layer since its derivatives are steeper than those of the sigmoid function. For the output layer, a linear function that outputs the weighted sum of the hidden neurons is used. Stochastic gradient descent is performed for parameter optimization [34], where the learning rate is fixed at $\eta = 0.0001$ and the default number of epochs is 100000. Lastly, a Bayesian surrogate with the same network configuration is examined. To account for the model uncertainty, a normal prior $\mathcal{N}(0, 0.1)$ is directly imposed on the model parameters.

Figure 3 panels: (a.1), (a.2) clean data; (b.1), (b.2) noisy data, σ = 0.3; (c.1), (c.2) noisy data, σ = 0.7.
Figure 3: Comparisons of regression results using various surrogate models. The training dataset is contaminated by Gaussian noise with different standard deviations.

To train these models, a pseudorandom number generator is used to simulate a training dataset of 30 samples that are uniformly distributed in the interval $(-10, 10)$. Gaussian noise drawn from $\mathcal{N}(0, \sigma)$ is added to each sample to make the problem more realistic [8]. Fig. 3 visualizes the regression models fitted by the different approaches. Note that the mean of the predictive distribution is taken as the model estimate in the case of BDL. BDL improves the generalization performance and mitigates the overfitting issue encountered in NN modeling. Meanwhile, BDL is capable of characterizing the model uncertainty associated with the prediction, in addition to achieving a regression accuracy comparable to that of the other methods. Table 1 and Table 2 summarize the coefficient of determination ($R^2$) and the root mean squared error (RMSE) for the different surrogates.
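For reference, the two scores reported in Table 1 and Table 2 can be computed as in the short sketch below. It uses the standard definitions of $R^2$ and RMSE, which is an assumption since the paper does not spell them out; the function names and the toy values are likewise illustrative.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

if __name__ == "__main__":
    # Targets from the test function of Eq. (46) on a small grid, and a
    # deliberately perturbed "prediction" standing in for a surrogate output.
    xs = [-10.0 + 0.5 * i for i in range(41)]
    truth = [x * math.sin(x) for x in xs]
    pred = [t + 0.1 * math.cos(3.0 * x) for t, x in zip(truth, xs)]
    print(r_squared(truth, pred), rmse(truth, pred))
```

A value of $R^2$ close to 1 together with a small RMSE indicates a close fit, which is how the entries of the two tables are read.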
According to the results, BDL is more resistant to noisy data, since increasing the random noise level degrades the quality of the other three surrogates much more clearly.

Method   clean data   σ = 0.1   σ = 0.3   σ = 0.5   σ = 0.7   σ = 0.9
PR       0.9950       0.9937    0.9884    0.9807    0.9680    0.9533
SVM      0.9783       0.9784    0.9758    0.9604    0.9486    0.9388
NN       0.9890       0.9854    0.9828    0.9770    0.9526    0.9497
BDL      0.9883       0.9893    0.9928    0.9757    0.9672    0.9516
Table 1: Comparison of the coefficient of determination ($R^2$) of the different surrogate models where the training dataset is contaminated by different noise levels.

Method   clean data   σ = 0.1   σ = 0.3   σ = 0.5   σ = 0.7   σ = 0.9
PR       0.2483       0.2629    0.3917    0.4863    0.6227    0.7643
SVM      0.5397       0.5424    0.5642    0.6997    0.7934    0.9198
NN       0.3817       0.4068    0.4783    0.5276    0.7350    0.8077
BDL      0.3270       0.3098    0.2964    0.5425    0.6091    0.7933
Table 2: Comparison of the root mean squared error (RMSE) of the different surrogate models where the training dataset is contaminated by different noise levels.

4.2. Example 2: binary classification

To evaluate the classification performance of the proposed surrogate model, the second example applies BDL to a synthetic dataset with a two-dimensional swirl pattern. As shown in Fig. 4, the synthetic dataset exhibits two intuitively separable manifolds, each resembling a crescent moon [9].

Figure 4 panels: (a) two moons manifold; (b) training dataset contaminated by ε ∼ N(0, 0.1); (c) training dataset contaminated by ε ∼ N(0, 0.3).
Figure 4: Classification problem: a highly nonlinear dataset.

A BDL surrogate arranged in a $2 \times 5 \times 5 \times 2$ form is developed as the neural network classifier. Specifically, the two hidden layers are configured with the hyperbolic tangent activation function, and the softmax function $\sigma(x_i) = e^{x_i} / \sum_{j=1}^{J} e^{x_j}$ is implemented to represent the categorical distribution of the outputs by computing a probability row vector whose entries sum to 1 [14, 15].
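The softmax output layer described above can be sketched in a few lines. Subtracting the largest logit before exponentiating is a common numerical safeguard that is assumed here and not mentioned in the paper; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(logits):
    """Map a row of real-valued logits to a probability row vector."""
    shift = max(logits)  # stabilizes exp() for large logits
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Two output units, as in the 2 x 5 x 5 x 2 classifier.
    probs = softmax([2.0, 0.0])
    print(probs, sum(probs))
```

For two classes the first entry reduces to the logistic function of the logit difference, so the output can be read directly as the predicted probability of one crescent.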
To understand the effects of different priors on the classification performance, three direct priors are considered: Laplace $\mathcal{L}(0, 1)$, Gaussian $\mathcal{N}(0, 1)$, and Cauchy $\mathcal{C}(1, 1)$. Three more hyperpriors are then constructed by fixing the location parameter of the aforementioned probability distributions and treating their scale parameter as a random variable, which is described using an Inverse-Gamma distribution $\mathcal{IG}(1, 1)$. The basic probability density functions are given as:

\[ \mathcal{L}(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right), \quad \mathcal{C}(x \mid x_0, \gamma) = \frac{1}{\pi \gamma \left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}, \quad \mathcal{IG}(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} \exp\left(-\frac{\beta}{x}\right) \tag{47} \]

Additionally, two trials are conducted to study the effects of noise on the neural network classifier. In the first trial, a BDL model is developed using 900 samples, where each sample is contaminated by Gaussian noise generated from $\epsilon \sim \mathcal{N}(0, 0.1)$. In the second trial, 1200 samples are utilized to build $\mathcal{M}(\cdot)$ as the external noise is amplified to $\epsilon \sim \mathcal{N}(0, 0.3)$. Following the 70/30 rule [14, 15, 32], the whole dataset $D$ is divided into the training set $D_t$ and the validation set $D_v$. A first-order gradient-based optimization method, ADAM [35], is adopted to update the model parameters, where the learning rate is $\eta = 0.001$ and the exponential decay rates for the first and second moment estimates, $\beta_1$ and $\beta_2$, are 0.9 and 0.999, respectively. It should be noted that the reparameterization trick mentioned in Section 3.2 is automatically embedded by taking the derivatives of the objective function with respect to the variational parameters [30]. Here, the proxy distribution takes a Gaussian form, so the variational posterior distribution is parameterized by two parameters, the mean and the standard deviation. To accelerate the training process, mini-batch optimization is used [14, 28] with a batch size of 30. The stopping criterion is an epoch number of 50000.

Fig. 5 provides a graphic illustration of the classification results.
According to these results, the BDL model becomes less confident about its predictions when the validation samples lie nearer the true separation trajectory. This is because even small noise can severely distort the original manifold [6]. However, the proposed hyperpriors are able to provide better predictions, especially in the second trial where the additive noise is stronger. This is credited to the natural mechanism of Bayesian hierarchical modeling, which relaxes the prior constraints by encoding the prior belief using a series of hyperparameter values instead of fixed constants [9]. Lastly, Fig. 6 reveals the variational posterior distribution of the weights and biases in the first hidden layer. Among the priors proposed above, a zero-centered Laplace prior is equivalent to L1 regularization and a zero-centered Gaussian prior is equivalent to L2 regularization [20, 25]. In Fig. 6, the posterior obtained with $\mathcal{L}(0, 1)$ is approximately sparse, whereas the posterior obtained with $\mathcal{N}(0, 1)$ is not concentrated around zero, which aligns with the properties of L1 and L2 regularization, respectively [32].
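The correspondence between the two priors and the two penalties can be made concrete by writing out the negative log-prior of a weight vector. The sketch below assumes independent zero-centered priors with unit scale on each weight; the function names and the example weights are illustrative. Up to an additive constant that does not depend on the weights, the Laplace prior contributes the L1 norm and the Gaussian prior contributes half the squared L2 norm.

```python
import math

def neg_log_laplace_prior(weights, scale=1.0):
    """-log prod_i Laplace(w_i | 0, scale)."""
    return sum(abs(w) / scale + math.log(2.0 * scale) for w in weights)

def neg_log_gaussian_prior(weights, variance=1.0):
    """-log prod_i N(w_i | 0, variance)."""
    const = 0.5 * math.log(2.0 * math.pi * variance)
    return sum(0.5 * w * w / variance + const for w in weights)

def l1_penalty(weights):
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    return sum(w * w for w in weights)

if __name__ == "__main__":
    w_a, w_b = [0.5, -2.0, 0.0], [0.1, 0.3, -0.2]
    # Differences of negative log-priors match differences of the penalties,
    # because the normalizing constants cancel.
    print(neg_log_laplace_prior(w_a) - neg_log_laplace_prior(w_b),
          l1_penalty(w_a) - l1_penalty(w_b))
    print(neg_log_gaussian_prior(w_a) - neg_log_gaussian_prior(w_b),
          0.5 * (l2_penalty(w_a) - l2_penalty(w_b)))
```

Because the L1 term grows linearly near zero it favors exactly sparse solutions, which is consistent with the sparse pattern reported for the Laplace prior in Fig. 6.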
Figure 5 panels, shown for Case A: ε_i ∼ N(0, 0.1) and Case B: ε_i ∼ N(0, 0.3), each with posterior probability predictions and model uncertainties: (a.1) Laplacian prior; (b.1) Laplacian prior and Inverse-gamma hyperprior; (a.2) Gaussian prior; (b.2) Gaussian prior and Inverse-gamma hyperprior; (a.3) Cauchy prior; (b.3) Cauchy prior and Inverse-gamma hyperprior.
Figure 5: Classification results: the predicted results are represented by the mean predictive probability of the lower crescent on the input domain $(-3, 3) \times (-3, 3)$, and the model uncertainty is quantified in terms of the variance associated with each prediction using Eq. (45).

Figure 6 panels: (a.1) Laplace prior: posterior distribution over $\omega_1$; (b.1) Gaussian prior: posterior distribution over $\omega_1$; (a.2) Laplace prior: posterior distribution over $b_1$; (b.2) Gaussian prior: posterior distribution over $b_1$.
Figure 6: Comparison of optimized variational posterior distributions of model parameters using different regularization techniques. $\omega_1$ and $b_1$ denote the weight and bias tensors associated with the first layer, respectively.

4.3. Example 3: structural analysis of a geometrically nonlinear membrane

This example addresses the computational cost of using a finite element (FE) model in structural analysis with uncertain inputs [38]. The target structure is a geometrically nonlinear membrane that is clamped at its four edges [39, 40]. Fig. 7 (a.1) gives a sketch of the objective domain $\Omega = [0, l] \times [0, b] \subset R^2$, where uniformly distributed pressure loads are applied on the upper surface. We are interested in using the BDL-based surrogate $\mathcal{M}(\cdot)$ to approximate the nonlinear mechanism between the uncertain structural parameters $x$ and the random responses $y$.

4.3.1. Membrane example: nonlinear analysis and surrogate modeling

Uncertainty analysis. The vector $x$ covers geometric uncertainties, where $x_1 = l$, $x_2 = b$, and $x_3 = t$ are the length, breadth, and thickness of the target membrane, as well as material uncertainties, with $x_4 = E$ and $x_5 = \nu$ denoting the elastic modulus and Poisson's ratio, respectively. Table 3 summarizes the statistical properties of $x$. The quantities of interest $y$ are the z-direction displacements $w$ at the locations $p_1(1, 0.5)$, $p_2(0.6, 0.2)$, and $p_3(1.6, 0.8)$. The load-displacement relationship is no longer a deterministic curve due to the input randomness (see Fig. 7 (a.2)).

Nonlinear FE model. Because of the geometric nonlinearity, the in-plane strain is partitioned into two parts, $\varepsilon = \varepsilon^{l} + \varepsilon^{non}$, where $\varepsilon^{l}$ describes the linear strain and $\varepsilon^{non}$ represents the nonlinear strain term:

\[ \varepsilon^{non} = \begin{bmatrix} \varepsilon_x^{non} \\ \varepsilon_y^{non} \\ \gamma_{xy}^{non} \end{bmatrix} = \begin{bmatrix} \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial w}{\partial x}\right)^2 \\ \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 + \left(\frac{\partial w}{\partial y}\right)^2 \\ 2 \frac{\partial u}{\partial x} \frac{\partial u}{\partial y} + 2 \frac{\partial v}{\partial x} \frac{\partial v}{\partial y} + 2 \frac{\partial w}{\partial x} \frac{\partial w}{\partial y} \end{bmatrix} \tag{48} \]

Figure 7 panels: (a.1) geometric illustration; (a.2) nonlinear behavior illustration; (b.1) 64 training samples; (b.2) 128 training samples; (b.3) 256 training samples; (b.4) 512 training samples; (c.1) evolution of direct prior; (c.2) evolution of hyper prior.
Figure 7: Membrane example: problem statement and optimization results.

The solution $d = [u, v, w]$ of these nonlinear equilibrium equations is obtained by the Newton-Raphson (NR) method [39]. The iterative process terminates when the unbalanced force residual is smaller than the tolerance of 0.0001 or the NR algorithm reaches the default maximum of 100 iterations. The force and displacement vectors are initialized to zero and the load increment is $p/n$, where $p = 100$ and $n = 400$.

Basic variables   First parameter   Second parameter   Distribution type
l                 μ = 2             σ = 0.05           Normal
b                 μ = 1             σ = 0.05           Normal
t                 min = 0.001       max = 0.002        Uniform
E                 μ = 210           σ = 10             Normal
ν                 μ = log(0.3)      σ = 0.01           Lognormal
Table 3: Statistics of the uncertain input parameters for the flat membrane.

It should be noted that the thickness of the membrane is comparatively small in relation to the other two dimensions. Therefore, the FE model can be built from 200 (20 along the x axis × 10 along the y axis) four-node quadrilateral (Q4) elements, and large deformation theory is adopted [39].

Surrogate model. We use the proposed BDL approach to provide an $R^5 \to R^3$ transformation. The network architecture has three hidden layers, $30 \times 15 \times 10$, besides the input and output layers [33], and the probability distributions that account for model uncertainties are specified layer by layer. To investigate the efficacy of different priors, a direct zero-mean Gaussian prior $\mathcal{N}(0, 1)$ and a hierarchical prior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ have been tested. Furthermore, training datasets of size 64, 128, 256, and 512 have been considered to identify the influence of the amount of data on the accuracy of the model predictions. For the posterior approximation, ADAM is adopted [35], where the initial learning rate $\eta$ is 0.005.
Notably, $\eta$ decays every 100 epochs by a constant multiplicative factor of 0.75, and the number of epochs is 1000. The batch size for the subsampling procedure of all trials is set to 16. In the variational inference stage, 200 numerical samples are employed to estimate the lower bound.

4.3.2. Results

In Fig. 7, the optimization results imply that the predictive distribution computed by Eq. (43) shrinks rapidly as the number of training samples increases. 128 samples can give a narrow-band distribution of $w(p)$, indicating that the trained BDL model becomes sufficiently reliable as most model uncertainties have been explained away by the data. Moreover, we compared the evolution of the predictive distribution $p(w)$ under different priors. In both trials, 128 training samples are used, and the same randomly chosen validation sample is used, where x = [2.0117, 1.0157, 0.0019, 213.1180, 0.3018] and y = [2.3428, 1.1501, 1.0536]. It is found that the predictive distribution obtained with the hyperprior takes more epochs to shrink (see Fig. 7 (c.1) and (c.2)). The intuitive explanation is that $\mathcal{N}(0, \mathcal{IG}(1, 1))$ has a larger initial parameter space than $\mathcal{N}(0, 1)$. Despite the difference, both BDL surrogates provide a reliable input-output mapping, and the coefficient of determination is summarized in Table 4. To check the generalizability of the proposed surrogates, $\mathcal{M}(\cdot)$ is further applied to uncertainty analysis. First, 1000 samples were used to train the network. Next, another 1 × 10 samples are exploited to develop the distribution of displacements at $p_1$, $p_2$, and $p_3$. Fig. 8 presents the UQ results, where the BDL-based surrogates accurately propagate uncertainties to the response distributions.

Prior type     {x_i, y_i}, i = 1..64   {x_i, y_i}, i = 1..128   {x_i, y_i}, i = 1..256   {x_i, y_i}, i = 1..512
Direct prior   0.9720                  0.9982                   0.9990                   0.9993
Hyper prior    0.9935                  0.9989                   0.9983                   0.9994
Table 4: Comparison of the coefficient of determination.
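The learning-rate schedule described in Section 4.3.1 (initial value 0.005, multiplied by 0.75 every 100 epochs, 1000 epochs in total) can be written out explicitly. The sketch below assumes zero-based epoch counting, so that the first reduction takes effect at epoch 100; this convention and the function name are assumptions, since the paper does not specify them.

```python
def step_decay_lr(epoch, initial_lr=0.005, decay_rate=0.75, step=100):
    """Learning rate after `epoch` completed epochs under a step decay."""
    return initial_lr * decay_rate ** (epoch // step)

if __name__ == "__main__":
    # Print the rate at the start of each 100-epoch stage of a 1000-epoch run.
    for epoch in range(0, 1000, 100):
        print(epoch, step_decay_lr(epoch))
```

Over the full run the rate is reduced nine times, so the final stage trains with roughly 7.5% of the initial learning rate, which damps the updates as the variational parameters approach convergence.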
Figure 8 panels: (a) distributions of $w(p_1)$; (b) distributions of $w(p_2)$; (c) distributions of $w(p_3)$.
Figure 8: Distribution estimates for $w(p_1)$, $w(p_2)$, and $w(p_3)$. The dashed black line is the ground truth, which is computed via the high-fidelity FE model using 510 samples. The colored lines denote the surrogate predictions, which are kernel smoothing function estimates using the predictive mean.

4.4. Example 4: prediction of wind pressure

Obtaining detailed data on wind-induced pressure coefficients on building surfaces is of great practical importance in the design of high-rise buildings. However, wind tunnel test results are limited and may be contaminated through different sources. In this example, the proposed BDL model is applied to predict the mean and root-mean-square (RMS) pressure coefficients using limited experimental data.

4.4.1. Wind pressure database and predictive model

Wind tunnel data. The aerodynamic database considered for this example was developed by Tokyo Polytechnic University (TPU) [41]. For the wind tunnel experiment, a 1:400 scale rigid model was built to represent the target tall building of dimensions 200 m × 40 m × 40 m, and a power law exponent of 1/4 was used to describe the mean wind speed. A total of 500 pressure taps were used to collect data at a sampling frequency of 1000 Hz for a sample period of 32.768 s. The hourly average wind speed was 11.1438 m/s and the wind attack angle was 0°, indicating that the wind direction is perpendicular to the front face of the experimental model. The wind pressure of interest is characterized by a dimensionless number known as the pressure coefficient, $C_p(x) = \frac{p_x - p_\infty}{\rho U^2 / 2}$, where $p_x$ is the pressure at location $x$, $p_\infty$ is the static pressure at freestream, $\rho$ is the air density, and $U$ is the mean wind speed at the reference height.
The predictive quantities are:

\[ C_p^{mean}(x) = E\left[C_p(x)\right], \qquad C_p^{rms}(x) = \sqrt{E\left[\left|C_p(x) - C_p^{mean}(x)\right|^2\right]} \tag{49} \]

To demonstrate the efficacy of the BDL surrogate in dealing with small datasets, biharmonic spline interpolation is performed based on the measured pressure data, $N_{old} = 500$. Specifically, there are 250 width interpolation points and 750 height interpolation points on each building face. As a result, the surface pressure fields are described by a total of $N_{new} = 250 \times 750 \times 4 = 750000$ synchronous pressure points. However, at most 1% of the total data is used to train the Bayesian model.

Prediction model. The Cartesian coordinates $x \in R^2$ are selected as the input variables of the BDL model, and the output is a scalar, either $y = C_p^{mean}(x)$ or $y = C_p^{rms}(x)$. After an extensive search over hyperparameters and network architectures, it was found that BDL with network configurations of $2 \times 15 \times 10 \times 1$ and $2 \times 30 \times 15 \times 1$ provides superior performance in predicting $C_p^{mean}$ and $C_p^{rms}$, respectively. The hyperbolic tangent function $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ is adopted as the activation function since it produces steep derivatives, and the ADAM optimizer [35] is implemented with a learning rate initialized to 0.03, which follows a step decay to prevent the parameters from being optimized chaotically. The annealing strategy is adopted to improve the stochastic gradients, where the anneal rate is fixed at 0.75. 1000 samples are used to obtain a reasonable estimate of the test log-likelihood during the training phase. The number of epochs is set to 1000 and the testing frequency is 10. To validate the effectiveness of the HBM-based prior, a direct Gaussian prior $\mathcal{N}(0, 1)$ and a hyperprior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ are examined.

4.4.2. Results

Fig. 9 summarizes the $C_p^{mean}$ predictions and Fig. 10 provides the predictive results for $C_p^{rms}$.
The ground truth is numerically obtained via the interpolation and extrapolation of the wind tunnel data $C_p$; the predictions are defined as the mean value of the predictive distribution $\hat{C}_p$; the relative error measures the predictive difference between $C_p$ and $\hat{C}_p$; and model uncertainties are evaluated using the variance of $\hat{C}_p$. The results reveal that the hyperprior $\mathcal{N}(0, \mathcal{IG}(1, 1))$ not only allows the model $\mathcal{M}(\cdot)$ to be better fitted but also makes it less sensitive to the complex nature of the noise embedded in the experimental data. The colorbar limits show that the magnitude of the relative errors is reduced in the hyperprior trials. From a percent error perspective, most predictions lie within the 97% confidence interval and the maximum errors are all less than 10%; the largest errors are gathered around the boundary, owing to the extrapolation algorithm applied at the data preprocessing stage. In Fig. 9, the uncertainty results confirm the propagation of uncertainty resulting from the extrapolation process and successfully identify the areas that are prone to error. In general, the hyperprior appears to improve model performance more when predicting $C_p^{rms}$ than $C_p^{mean}$, which is reasonable because the second sample moment magnifies the differences between predicted and observed values, so that more degrees of freedom are needed to explain the variation. Table 5 summarizes the model performance using RMSE. To illustrate the robustness of the variational inference method in approximating intractable posterior distributions, Fig. 11 shows the optimization process of computing the variational posterior. It can be observed that VI captures the main characteristics within the first 30 epochs and is applicable to situations where the true posterior distribution takes different forms.
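The quantities reported in this section (predictive mean, predictive variance, relative error, and RMSE) can be illustrated with a small Monte Carlo sketch. A trained BDL model is not reproduced here; instead, each posterior draw is imitated by adding Gaussian noise to a hypothetical ground truth, so `truth`, `noise_std`, and the number of draws are assumptions for the example. The relative error formula is also an assumed, conventional form, since the exact expression is not given in the text.

```python
import math
import random


def predictive_stats(draws):
    """Pointwise mean and variance over draws; draws[s][i] is point i, draw s."""
    n_draws = len(draws)
    n_points = len(draws[0])
    mean = [sum(d[i] for d in draws) / n_draws for i in range(n_points)]
    var = [sum((d[i] - mean[i]) ** 2 for d in draws) / n_draws
           for i in range(n_points)]
    return mean, var


def relative_error(pred, truth):
    """Pointwise |pred - truth| / |truth| (assumed definition)."""
    return [abs(p - t) / abs(t) for p, t in zip(pred, truth)]


def rmse(pred, truth):
    """Root mean square error over all prediction points."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


rng = random.Random(1)
truth = [0.80, -0.50, -0.30]    # hypothetical pressure coefficients
noise_std = [0.01, 0.02, 0.05]  # hypothetical spread, largest at the last point

# Stand-in for forward passes under 1000 sampled weight vectors.
draws = [[t + rng.gauss(0.0, s) for t, s in zip(truth, noise_std)]
         for _ in range(1000)]

mean, var = predictive_stats(draws)
print("predictive mean:", mean)
print("predictive variance:", var)
print("relative error:", relative_error(mean, truth))
print("RMSE:", rmse(mean, truth))
```

In this toy setting the point with the largest injected spread also shows the largest predictive variance, which mirrors how the uncertainty maps are read: high variance flags regions where the prediction should be trusted less.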
[Figure 9 panels, shown for the windward, left side, leeward, and right side scenarios: (A) direct prior results with (a.1) predictions, (a.2) errors, and (a.3) uncertainties; (B) hyperprior results with (b.1) predictions, (b.2) errors, and (b.3) uncertainties; (c) ground truth.]

Figure 9: Prediction results of $C_p^{mean}$ using different priors. The training dataset has 700 samples.

5. Concluding Remarks

This paper presents a probabilistic modeling approach for learning hidden relationships from limited and noisy data using Bayesian deep learning (BDL) with a hierarchical prior. The proposed surrogate rigorously accounts for model uncertainties by imposing prior distributions on the model parameters. Meanwhile, it effectively propagates the preassigned prior belief to the prediction quantities.
[Figure 10 panels, shown for the windward, left side, leeward, and right side scenarios: (A) direct prior results with (a.1) predictions, (a.2) errors, and (a.3) uncertainties; (B) hyperprior results with (b.1) predictions, (b.2) errors, and (b.3) uncertainties; (c) ground truth.]

Figure 10: Prediction results of $C_p^{rms}$ using different priors. The training dataset has 700 samples.

In summary, the following conclusions are drawn:

(1) Bayesian inference has been successfully integrated into the current deterministic deep learning framework. Consequently, the proposed model is able to analyze uncertainties associated with model predictions and help stakeholders make more informed decisions by providing a confidence level for the predictive estimation.

(2) The hypothesis of using hierarchical Bayesian modeling to describe prior distributions of model parameters is tested. In both classification and regression problems, superior performance can be achieved using hyperpriors, especially when the training data is seriously contaminated.
Moreover, a probabilistic surrogate with a hyperprior tends to have an improved ability to learn from a small dataset.

Table 5: Comparison of the root mean square error. Columns give the number of training samples $n$ in $\{x_i, y_i\}_{i=1}^{n}$. † denotes the direct prior $\mathcal{N}(0, 1)$, and ‡ denotes the hyperprior $\mathcal{N}(0, \mathcal{IG}(1, 1))$.

Scenario     n=100  n=200  n=300  n=400  n=500  n=600
Windward†    0.077  0.030  0.027  0.022  0.019  0.017
Windward‡    0.051  0.021  0.021  0.020  0.018  0.015
Leftside†    0.039  0.030  0.027  0.019  0.013  0.012
Leftside‡    0.037  0.030  0.024  0.016  0.012  0.012
Leeward†     0.014  0.013  0.011  0.011  0.010  0.010
Leeward‡     0.014  0.011  0.011  0.010  0.010  0.009
Rightside†   0.037  0.025  0.025  0.021  0.019  0.015
Rightside‡   0.033  0.025  0.023  0.020  0.014  0.011

(3) Intractable posterior distributions that arise from the multidimensional integration step of Bayesian analysis have been addressed by the state-of-the-art variational inference method. Compared to some advanced sampling-based methods, variational inference offers higher scalability by recasting the model learning problem as an equivalent optimization problem whose gradients can be computed efficiently.

(4) The examples provided have demonstrated the applicability of the proposed modeling scheme to both classification and regression tasks involving complex systems. In particular, in the membrane example, BDL is capable of providing an accurate description of the highly nonlinear mapping between different design variables and various structural performance indicators, and it produces virtually identical uncertainty quantification results to the conventional Monte Carlo method.
Furthermore, in the wind field prediction example, the BDL model is trained and tested using very limited wind tunnel data. It is shown that the probabilistic model not only recovers the entire mapping of the mean and root-mean-square pressure fields with high precision using as little as 1% of the data, but also quantifies the uncertainty level at every point in the prediction domain, serving as a reliable surrogate for learning complex field distributions.

To improve the model performance, it is envisaged that combining the BDL model with information from the underlying physics can not only further accelerate the training of neural networks but also holds the promise of interpreting the learning process.

[Figure 11 panels: (a.1) to (a.4) $p(\omega_1, \omega_3)$; (b.1) to (b.4) $p(\omega_3, \omega_{17})$.]

Figure 11: Evolution of the variational posterior distribution. Three model parameters $\omega_1$, $\omega_3$, and $\omega_{17}$ are randomly selected. The first row corresponds to the joint distribution $p(\omega_1, \omega_3)$, and the second row plots the joint distribution $p(\omega_3, \omega_{17})$.

6. Acknowledgments

This work was supported by the National Science Foundation (NSF) under Grant No. 1520817 and No. 1612843. This support is gratefully acknowledged.

7. References

[1] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences (2016) 201517384.
[2] M. Raissi, G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics 357 (2018) 125-141.
[3] C. Soize, Identification of high-dimension polynomial chaos expansions with random coefficients for non-gaussian tensor-valued random fields using partial and limited experimental data, Computer methods in applied mechanics and engineering 199 (33-36) (2010) 2150-2164.
[4] Y. Yang, S. Nagarajaiah, Output-only modal identification with limited sensors using sparse component analysis, Journal of Sound and Vibration 332 (19) (2013) 4741-4765.
[5] J. Javh, J. Slavič, M. Boltežar, High frequency modal identification on noisy high-speed camera data, Mechanical Systems and Signal Processing 98 (2018) 344-351.
[6] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural information processing systems, 2013, pp. 1196-1204.
[7] Z. Ghahramani, Probabilistic machine learning and artificial intelligence, Nature 521 (7553) (2015) 452.
[8] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, in: Advances in neural information processing systems, 2017, pp. 5574-5584.
[9] C. Robert, Machine learning, a probabilistic perspective (2014).
[10] C. E. Rasmussen, Gaussian processes in machine learning, in: Advanced lectures on machine learning, Springer, 2004, pp. 63-71.
[11] R. G. Ghanem, P. D. Spanos, Stochastic finite element method: Response statistics, in: Stochastic Finite Elements: A Spectral Approach, Springer, 1991, pp. 101-119.
[12] D. Xiu, G. E. Karniadakis, Modeling uncertainty in flow simulations via generalized polynomial chaos, Journal of computational physics 187 (1) (2003) 137-167.
[13] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, R. Adams, Scalable bayesian optimization using deep neural networks, in: International Conference on Machine Learning, 2015, pp. 2171-2180.
[14] I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep learning, Vol. 1, MIT press Cambridge, 2016.
[15] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436.
[16] D. J. MacKay, A practical bayesian framework for backpropagation networks, Neural computation 4 (3) (1992) 448-472.
[17] D. J. MacKay, Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks, Network: Computation in Neural Systems 6 (3) (1995) 469-505.
[18] R. M. Neal, Bayesian learning for neural networks, Vol. 118, Springer Science & Business Media, 2012.
[19] H. K. Lee, Bayesian nonparametrics via neural networks, Vol. 13, SIAM.
[20] E. T. Nalisnick, On priors for bayesian neural networks, Ph.D. thesis, UC Irvine (2018).
[21] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. R. Soc. Lond. A 186 (1007) (1946) 453-461.
[22] Y. Gal, Uncertainty in deep learning, University of Cambridge.
[23] Y. Li, Approximate inference: New visions, Ph.D. thesis, University of Cambridge (2018).
[24] J. Zhu, J. Chen, W. Hu, B. Zhang, Big learning with bayesian methods, National Science Review 4 (4) (2017) 627-651.
[25] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424.
[26] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (518) (2017) 859-877.
[27] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, L. K. Saul, An introduction to variational methods for graphical models, Machine learning 37 (2) (1999) 183-233.
[28] M. D. Hoffman, D. M. Blei, C. Wang, J. Paisley, Stochastic variational inference, The Journal of Machine Learning Research 14 (1) (2013) 1303-1347.
[29] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in: Artificial Intelligence and Statistics, 2014, pp. 814-822.
[30] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.
[31] D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082.
[32] N. M. Nasrabadi, Pattern recognition and machine learning, Journal of electronic imaging 16 (4) (2007) 049901.
[33] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large-scale machine learning, in: OSDI, Vol. 16, 2016, pp. 265-283.
[34] H. Robbins, S. Monro, A stochastic approximation method, in: Herbert Robbins Selected Papers, Springer, 1985, pp. 102-109.
[35] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[36] J. T. Springenberg, A. Klein, S. Falkner, F. Hutter, Bayesian optimization with robust bayesian neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 4134-4142.
[37] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines.
[38] X. Luo, A. Kareem, A convnet-based surrogate for uncertainty quantification of structural systems considering the spatial variability of material properties.
[39] N.-H. Kim, Introduction to nonlinear finite element analysis, Springer Science & Business Media, 2014.
[40] R. J. Kuether, M. S. Allen, A numerical approach to directly compute nonlinear normal modes of geometrically nonlinear finite element models, Mechanical Systems and Signal Processing 46 (1) (2014) 1-15.
[41] TPU aerodynamic database: Wind pressure database based on wind tunnel experiment for high-rise building (http://wind.arch.t-kougei.ac.jp/system/eng/contents/code/tpu).

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Jul 8, 2019

References