Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

Kelefouras Vasilios; Keramidas Georgios; Voros Nikolaos

doi:10.1145/3202663

Publisher: Association for Computing Machinery
Copyright: Copyright © 2018 ACM
ISSN: 1539-9087
eISSN: 1558-3465
DOI: 10.1145/3202663

Abstract

One of the biggest challenges in multicore platforms is shared cache management, especially for data-dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled, so addressing them separately is not effective. Second, both must be tailored to the details of the target shared cache architecture and to the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first methodology that provides (1) a theoretical foundation for these cache management mechanisms and (2) a unified framework that orchestrates the two mechanisms in tandem rather than separately. Our approach lowers the number of main memory accesses by an order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We motivate this work by showing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity), and the memory reuse patterns of the executing tasks must be addressed together as one problem when a (near-)optimal solution is required. To this end, we present a search space exploration analysis in which our proposal offers a vast reduction in the required search space.
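
To make the coupling between the two mechanisms concrete, the following is a minimal illustrative sketch and not the methodology of the paper. It assumes a task has been given a fixed share of the shared cache (`partition_bytes`) and derives a square tile size from that share before running a tiled matrix multiplication. The names `pick_tile_size`, `matmul_tiled`, and `matmul_naive` are hypothetical, and the sizing rule deliberately ignores associativity, line size, and data layout, which the abstract identifies as factors that matter in practice.

```python
import math


def pick_tile_size(partition_bytes, elem_bytes=8, tiles_in_flight=3):
    """Hypothetical heuristic: largest tile edge T such that
    `tiles_in_flight` tiles of T x T elements fit in the cache partition.
    Ignores associativity and cache-line effects."""
    max_elems_per_tile = partition_bytes // (elem_bytes * tiles_in_flight)
    return max(1, math.isqrt(max_elems_per_tile))


def matmul_tiled(a, b, n, tile):
    """Tiled C = A * B for n x n lists of lists.
    The three outer loops walk over tiles; the three inner loops stay
    inside one tile of A, B and C so that data is reused while resident."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


def matmul_naive(a, b, n):
    """Untiled reference used only to check the tiled version."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Under these assumptions, shrinking a task's partition shrinks its tile, which is one simple way to see why choosing the partition sizes and the tile sizes independently can be ineffective.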

Journal

ACM Transactions on Embedded Computing Systems (TECS) – Association for Computing Machinery

Published: May 22, 2018

Keywords: Cache partitioning
