Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Alexandros Papakonstantinou; Karthik Gururaj; John A. Stratton; Deming Chen; Jason Cong; Wen-Mei W. Hwu

doi:10.1145/2514641.2514652

Publisher: Association for Computing Machinery
Copyright: © 2013 by ACM Inc.
ISSN: 1539-9087
DOI: 10.1145/2514641.2514652

Abstract

Alexandros Papakonstantinou, University of Illinois at Urbana-Champaign; Karthik Gururaj, University of California, Los Angeles; John A. Stratton and Deming Chen, University of Illinois at Urbana-Champaign; Jason Cong, University of California, Los Angeles; Wen-Mei W. Hwu, University of Illinois at Urbana-Champaign

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which

Journal

ACM Transactions on Embedded Computing Systems (TECS) – Association for Computing Machinery

Published: Sep 1, 2013
