In this paper, we develop an efficient sketch-based empirical natural gradient method (SENG) for large-scale deep learning problems. The empirical Fisher information matrix is usually low-rank, since only a small amount of data can be sampled at each iteration. Although the corresponding natural gradient direction lies in a small subspace, both the computational cost and the memory requirement remain intractable because of the high dimensionality. We design randomized techniques for different neural network structures to resolve these challenges. For layers with a reasonable dimension, sketching is performed on a regularized least-squares subproblem. Otherwise, since the gradient is the vectorization of a product of two matrices, we apply sketching to low-rank approximations of these matrices to compute the most expensive parts. A distributed version of SENG is also developed for extremely large-scale applications. Global convergence to stationary points is established under mild assumptions, and fast linear convergence is analyzed in the neural tangent kernel (NTK) regime. Extensive experiments on convolutional neural networks show the competitiveness of SENG compared with state-of-the-art methods. On ResNet50 with ImageNet-1k, SENG achieves 75.9% Top-1 testing accuracy within 41 epochs. Experiments on distributed large-batch training of ResNet50 with ImageNet-1k show reasonable scaling efficiency.
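The structural point behind the abstract is that the empirical Fisher built from m sampled per-example gradients, F = G G^T / m + lam * I with G of size n-by-m and m much smaller than n, is a low-rank update of the identity, so the natural gradient direction can be obtained from an m-dimensional (or smaller, after sketching) linear system via the Woodbury identity instead of an n-dimensional one. The following is a minimal NumPy sketch of that idea only; the function name, the Gaussian sketching operator, and all dimensions are illustrative assumptions and do not reproduce the paper's SENG implementation or its per-layer treatment.

```python
import numpy as np

def natural_gradient_direction(G, g, lam, sketch_dim=None, rng=None):
    """Approximate (G G^T / m + lam * I)^{-1} g via the Woodbury identity.

    G   : (n, m) matrix of per-sample gradients (m << n), so the empirical
          Fisher F = G G^T / m is low-rank.
    g   : (n,) averaged gradient.
    lam : damping / regularization parameter.
    sketch_dim : if given, a Gaussian sketch is applied to the low-rank
          factor to shrink the small linear system further (an illustrative
          choice of sketch, not the paper's).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m = G.shape
    U = G / np.sqrt(m)                     # F = U U^T + lam * I
    if sketch_dim is not None and sketch_dim < m:
        S = rng.standard_normal((m, sketch_dim)) / np.sqrt(sketch_dim)
        U = U @ S                          # sketched low-rank factor
    k = U.shape[1]
    # Woodbury: (lam I + U U^T)^{-1} g = (g - U (lam I_k + U^T U)^{-1} U^T g) / lam
    small = lam * np.eye(k) + U.T @ U      # k x k system instead of n x n
    return (g - U @ np.linalg.solve(small, U.T @ g)) / lam

# Toy usage: n parameters, m samples in the mini-batch.
n, m = 10_000, 64
rng = np.random.default_rng(0)
G = rng.standard_normal((n, m))
g = G.mean(axis=1)
d = natural_gradient_direction(G, g, lam=1e-2, sketch_dim=32, rng=rng)
print(d.shape)                             # (10000,)
```

The k-by-k solve replaces an n-by-n one, which is where the per-iteration savings come from; SENG's layer-wise sketches and the distributed variant described in the abstract refine this basic structure.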
Journal of Scientific Computing – Springer Journals
Published: Sep 1, 2022
Keywords: Deep learning; Natural gradient methods; Sketch-based methods; Convergence; 90C06; 90C26