Tools for Predicting the Reliability of Large-Scale Storage Systems

Robert J. Hall

doi:10.1145/2911987

Publisher: Association for Computing Machinery
Copyright: Copyright © 2016 by ACM Inc.
ISSN: 1553-3077
DOI: 10.1145/2911987

Abstract

Robert J. Hall, AT&T Labs Research

Data-intensive applications require extreme scaling of their underlying storage systems. Such scaling, together with the fact that storage systems must be implemented in actual data centers, increases the risk of data loss from failures of underlying components. Accurate engineering requires quantitatively predicting reliability, but this remains challenging due to the need to account for extreme scale, redundancy scheme type and strength, distribution architecture, and component dependencies. This article introduces CQSIM-R, a tool suite for predicting the reliability of large-scale storage system designs and deployments. CQSIM-R includes (a) direct calculations based on an only-drives-fail failure model and (b) an event-based simulator for detailed prediction that handles failures of and failure dependencies among arbitrary (drive or nondrive) components. These are based on a common combinatorial framework for modeling placement strategies. The article demonstrates CQSIM-R using models of common storage systems, including replicated and erasure coded designs. New results, such as the poor reliability scaling of spread-placed systems and a quantification of the impact of data center distribution and rack-awareness on reliability, demonstrate the usefulness and generality of the tools. Analysis and empirical studies show the
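
The abstract mentions direct calculations under an only-drives-fail model and reports poor reliability scaling for spread-placed systems. The following is a minimal sketch of that kind of combinatorial reasoning, not CQSIM-R's actual method or interface. It assumes r-way replication, a uniformly random set of f simultaneously failed drives out of n, and two simplified placements: "partitioned" (drives split into disjoint groups of r, every group holding data) and "spread" (each object placed on an independently chosen random set of r drives). All function and parameter names are hypothetical.

```python
from math import comb, expm1, log1p


def loss_prob_partitioned(n: int, r: int, f: int) -> float:
    """P(at least one replica group is wholly failed), n // r disjoint groups.

    Assumption: every group stores at least one object.
    """
    if n % r != 0 or not 0 <= f <= n:
        raise ValueError("need n divisible by r and 0 <= f <= n")
    groups = n // r
    # Inclusion-exclusion over j, the number of groups that lie
    # entirely inside the failed set.
    signed_count = 0
    for j in range(1, min(groups, f // r) + 1):
        sign = 1 if j % 2 == 1 else -1
        signed_count += sign * comb(groups, j) * comb(n - r * j, f - r * j)
    return signed_count / comb(n, f)


def loss_prob_spread(n: int, r: int, f: int, objects: int) -> float:
    """P(at least one object has all r replicas on failed drives).

    Assumption: each object's replica set is a uniformly random r-subset
    of the n drives, chosen independently of the other objects.
    """
    if not 0 <= f <= n:
        raise ValueError("need 0 <= f <= n")
    # Chance that one object's replica set falls inside the failed set.
    p = comb(f, r) / comb(n, r)
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p) ** objects, computed stably when p is tiny.
    return -expm1(objects * log1p(-p))


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the article.
    n, r, f, objects = 1200, 3, 6, 10**6
    print("partitioned:", loss_prob_partitioned(n, r, f))
    print("spread:     ", loss_prob_spread(n, r, f, objects))
```

Under these assumptions, the spread case depends on the number of stored objects while the partitioned case depends only on the number of groups, which is one way to see why placement strategy matters at scale. A time-dependent model with repair, as an event-based simulator would provide, is outside the scope of this sketch.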

Journal

ACM Transactions on Storage (TOS) – Association for Computing Machinery

Published: Aug 16, 2016
