Power-efficient prefetching for embedded processors




Publisher: Association for Computing Machinery
Copyright: © 2007 by ACM Inc.
ISSN: 1539-9087
DOI: 10.1145/1210268.1210271

Abstract

Because of stringent power constraints, aggressive latency-hiding approaches such as prefetching are absent from state-of-the-art embedded processors. Two main factors make prefetching power-inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that it is possible to support power-efficient prefetching through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables so that most consecutive addresses differ by one bit, we can prefetch them with compact prefetch instructions that save I-cache power. The compiler first generates an access graph of consecutive memory references and then attempts a layout of the memory locations in the smallest hypercube, where each dimension of the hypercube represents a one-bit differential in the address. The embedding is carried out in as compact a hypercube as possible to save memory space. Each load/store instruction carries a hint for prefetching the next memory reference, encoded as its differential address with respect to the current one. To reduce D-cache power, we further assign offsets so that most consecutive accesses map to the same cache line. Prefetching is performed through a one-entry line buffer [Wilson et al. 1996]; consequently, many D-cache look-ups reduce to incremental ones, which lowers D-cache activity and saves power. Our prefetcher requires both compiler and hardware support. In this paper, we provide an implementation on a processor model close to the ARM, with small modifications to the ISA. We handle issues such as out-of-order commit, predication, and speculation through simple changes to the processor pipeline on noncritical paths. Our goal is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup with a slight power reduction; the runtime virtual-space overhead for stack and static data is about 11.8%.
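
The abstract only outlines the offset-assignment step. As a rough illustration, here is a minimal Python sketch of one plausible greedy heuristic for it: build the access graph from a trace of stack-variable references, then place variables on hypercube corners so that heavily paired variables end up at Hamming distance one. The function names (`access_graph`, `embed_in_hypercube`) and the greedy placement strategy are hypothetical; the paper's actual compiler pass may differ.

```python
from collections import Counter

def access_graph(trace):
    """Access graph: edge weight = how often two stack variables
    are referenced consecutively in the trace."""
    g = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            g[frozenset((a, b))] += 1
    return g

def hamming(x, y):
    return bin(x ^ y).count("1")

def embed_in_hypercube(variables, graph, dim):
    """Greedy embedding of variables at hypercube corners (offsets
    0 .. 2**dim - 1), heaviest edges first, preferring corners close
    in Hamming distance to already-placed neighbours.
    Assumes 2**dim >= len(variables)."""
    offsets, free = {}, set(range(2 ** dim))

    def place(v, neighbours):
        best = min(free, key=lambda c: sum(hamming(c, offsets[u])
                                           for u in neighbours))
        offsets[v] = best
        free.remove(best)

    for edge, _weight in graph.most_common():
        for v in edge:
            if v not in offsets:
                place(v, [u for u in offsets
                          if frozenset((u, v)) in graph])
    for v in variables:          # variables never seen adjacently
        if v not in offsets:
            place(v, [])
    return offsets

# Example: 'a' is hot, so its neighbours land one bit away from it.
trace = ["a", "b", "a", "c", "a", "d"]
g = access_graph(trace)
print(embed_in_hypercube(set(trace), g, dim=2))
# e.g. {'a': 0, 'b': 1, 'c': 2, 'd': 3}: a-b and a-c differ in 1 bit
```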
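The compactness of the prefetch hint follows from the one-bit property: if consecutive offsets differ in exactly one bit, the hint only needs to name the flipped bit position, which takes log2 of the address width rather than a full offset. The encoding below is a hypothetical sketch of that idea, not the paper's actual ISA extension.

```python
def encode_hint(cur_off, next_off):
    """Hint = index of the single flipped bit between consecutive
    offsets (assumes a 1-bit differential holds)."""
    diff = cur_off ^ next_off
    assert diff and diff & (diff - 1) == 0, "offsets must differ in 1 bit"
    return diff.bit_length() - 1     # e.g. 0b0100 -> bit index 2

def decode_hint(cur_off, hint):
    """Hardware side: reconstruct the address to prefetch next."""
    return cur_off ^ (1 << hint)
```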
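Why same-line offset assignment saves D-cache power can be seen with a toy model of the one-entry line buffer: any access that falls in the currently buffered line is served without a full D-cache lookup, so clustering consecutive accesses onto one line converts most lookups into cheap buffer hits. This is a simplified simulation for intuition, assuming the buffer behaves as in [Wilson et al. 1996]; the hardware details in the paper differ.

```python
def simulate_line_buffer(addresses, line_bytes=32):
    """Count full D-cache lookups vs. accesses filtered by a
    one-entry line buffer (toy model)."""
    buffered_line = None
    cache_lookups = buffer_hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line == buffered_line:
            buffer_hits += 1         # served from the line buffer
        else:
            cache_lookups += 1       # full D-cache access + refill
            buffered_line = line
    return cache_lookups, buffer_hits
```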

Journal

ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery

Published: Feb 1, 2007
