Overlay architectures enable fast development and debug on FPGAs at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlays can improve time-to-solution and thus overall developer productivity. This work tunes and specializes FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain, and we also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions along with microarchitecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation, running persistent DL applications (RNN, GRU, and LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications with an average 23.1% degradation on non-PDL applications. We also integrated the PDL-FGPU overlay into the Intel OPAE framework to measure real-world performance and power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
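The persistent-DL workload pattern the abstract targets — recurrent weight matrices that stay resident on-chip and are reused at every timestep, so only small activation vectors move per step — can be illustrated with a vanilla-RNN inference loop. The sketch below is in plain Python for clarity and shows only the computation pattern; the function names, sizes, and layout are illustrative assumptions, not from the paper (the actual PDL-FGPU kernels are written in OpenCL).

```python
import math

def matvec(M, v):
    """Dense matrix-vector product: the operation RNN inference repeats."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_step(W, U, b, x_t, h):
    """One RNN timestep: h' = tanh(W x_t + U h + b).

    W and U are the 'persistent' weights: on an accelerator they are
    loaded on-chip once and reused for every timestep of every request,
    which is what makes the matrix-vector products bandwidth-friendly.
    """
    wx = matvec(W, x_t)
    uh = matvec(U, h)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]

def rnn_inference(W, U, b, x_seq, h0):
    """Run the recurrence over a whole sequence, reusing W and U."""
    h = h0
    for x_t in x_seq:                 # weights reused across all timesteps
        h = rnn_step(W, U, b, x_t, h)
    return h

# Toy example: hidden size 2, input size 3, sequence length 4.
W = [[0.1, -0.2, 0.05], [0.0, 0.1, 0.1]]
U = [[0.3, -0.1], [0.2, 0.4]]
b = [0.0, 0.1]
x_seq = [[1.0, 0.0, -1.0]] * 4
h = rnn_inference(W, U, b, x_seq, [0.0, 0.0])
```

GRU and LSTM cells follow the same structure with more weight matrices per step, so the same on-chip-residency argument applies to them as well.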
ACM Transactions on Reconfigurable Technology and Systems (TRETS) – Association for Computing Machinery
Published: Jul 15, 2021
Keywords: Overlay