Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, cascading chains of root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
ACM Transactions on Storage (TOS) – Association for Computing Machinery
Published: Oct 3, 2018
Keywords: Hardware fault