Dataflow accelerators feature simplicity, programmability, and energy efficiency, and are regarded as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs have been proposed, discovering the most efficient way to execute an application's perfectly nested loop on the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of execution methods, i.e., the different ways to spatiotemporally execute a perfectly nested loop on a dataflow accelerator. The novelty of the dMazeRunner framework lies in: i) a holistic representation of the loop nest that succinctly captures the various execution methods; ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods; and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay Product (EDP) and 5.83× better in execution time than prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP over the optimal solution.
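To make the search space concrete, the following is a minimal Python sketch (not dMazeRunner's actual code) of the kind of exploration the abstract describes: each loop bound of a nest is factored into a spatial tile (distributed across processing elements) and a temporal tile (iterated in time), and candidates are pruned when they are invalid or duplicate an already-seen cost. All names and numbers (loop_bounds, PE_COUNT, BUFFER_WORDS, the footprint model) are hypothetical placeholders, not the paper's models.

```python
# Hypothetical illustration of exploring "execution methods" for a nested loop.
# Not dMazeRunner's implementation; the cost and capacity models are toy stand-ins.

from functools import reduce
from itertools import product

def factor_pairs(n):
    """All (spatial, temporal) splits of a loop bound n with spatial * temporal == n."""
    return [(f, n // f) for f in range(1, n + 1) if n % f == 0]

# Toy bounds for a convolution-like nest: output channels, input channels, output pixels.
loop_bounds = {"K": 16, "C": 8, "XY": 32}

PE_COUNT = 64        # assumed number of processing elements
BUFFER_WORDS = 512   # assumed on-chip scratchpad capacity (words)

def footprint(temporal):
    # Crude stand-in for a data-buffering model: product of temporal tile sizes.
    return reduce(lambda a, b: a * b, temporal.values(), 1)

valid, seen_costs = [], set()
for splits in product(*(factor_pairs(b) for b in loop_bounds.values())):
    spatial = dict(zip(loop_bounds, (s for s, _ in splits)))
    temporal = dict(zip(loop_bounds, (t for _, t in splits)))
    pes_used = reduce(lambda a, b: a * b, spatial.values(), 1)
    # Prune invalid methods: more parallelism than PEs, or data overflowing the buffer.
    if pes_used > PE_COUNT or footprint(temporal) > BUFFER_WORDS:
        continue
    # Prune methods whose (toy) cost duplicates one already evaluated.
    cost = (pes_used, footprint(temporal))
    if cost in seen_costs:
        continue
    seen_costs.add(cost)
    valid.append((spatial, temporal))

print(f"{len(valid)} candidate execution methods survive pruning")
```

Even in this toy setting, the number of raw candidates is the product of the divisor counts of the loop bounds, which is why the validity and duplicate-cost pruning the paper describes is what makes exhaustive exploration tractable.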
Published in: ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery
Published: Oct 8, 2019
Keywords: Coarse-grained reconfigurable array