Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Fail-Slow at Scale

Fail-Slow at Scale Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png ACM Transactions on Storage (TOS) Association for Computing Machinery

Loading next page...
 
/lp/association-for-computing-machinery/fail-slow-at-scale-kCIpvcEoC8

References (51)

Publisher
Association for Computing Machinery
Copyright
Copyright © 2018 ACM
ISSN
1553-3077
eISSN
1553-3093
DOI
10.1145/3242086
Publisher site
See Article on Publisher Site

Abstract

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Journal

ACM Transactions on Storage (TOS)Association for Computing Machinery

Published: Oct 3, 2018

Keywords: Hardware fault

There are no references for this article.