Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
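As a concrete illustration of the idea in the abstract, below is a minimal NumPy sketch of an Entropy-SGD-style update: an inner Langevin loop approximately samples a local Gibbs measure around the current iterate, and its mean gives an estimate of the gradient of the relaxed loss, on which the outer iteration performs plain gradient descent. The function names, step sizes, noise level, and the toy one-dimensional loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def entropy_sgd_step(x, grad_f, gamma=0.1, eta=0.1, eta_inner=0.01,
                     beta_inv=1e-8, L=20, alpha=0.75, rng=None):
    """One outer update of an Entropy-SGD-style iteration (sketch).

    The gradient of the relaxed loss at x is taken as (x - <y>) / gamma,
    where <y> is the mean of the local Gibbs measure proportional to
    exp(-beta * (f(y) + |y - x|^2 / (2*gamma))), estimated here by a
    running average over a short Langevin loop.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = x.copy()
    mu = x.copy()  # exponential running average of the inner iterates
    for _ in range(L):
        # Langevin step on the local objective f(y) + |y - x|^2 / (2*gamma)
        dy = grad_f(y) + (y - x) / gamma
        noise = np.sqrt(2.0 * eta_inner * beta_inv) * rng.standard_normal(y.shape)
        y = y - eta_inner * dy + noise
        mu = (1.0 - alpha) * mu + alpha * y
    # gradient-descent step on the relaxed loss
    return x - eta * (x - mu) / gamma


# Toy usage on a rugged one-dimensional loss f(x) = x^2 + sin(5x).
if __name__ == "__main__":
    f_grad = lambda x: 2.0 * x + 5.0 * np.cos(5.0 * x)
    x = np.array([3.0])
    for _ in range(200):
        x = entropy_sgd_step(x, f_grad)
    print("final iterate:", x)
```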
Research in the Mathematical Sciences – Springer Journals
Published: Jun 28, 2018