Tools for Predicting the Reliability of Large-Scale Storage Systems

ACM Transactions on Storage (TOS), Volume 12 (4) – Aug 16, 2016

Publisher: Association for Computing Machinery
Copyright: Copyright © 2016 by ACM Inc.
ISSN: 1553-3077
DOI: 10.1145/2911987

Abstract

ROBERT J. HALL, AT&T Labs Research

Data-intensive applications require extreme scaling of their underlying storage systems. Such scaling, together with the fact that storage systems must be implemented in actual data centers, increases the risk of data loss from failures of underlying components. Accurate engineering requires quantitatively predicting reliability, but this remains challenging due to the need to account for extreme scale, redundancy scheme type and strength, distribution architecture, and component dependencies. This article introduces CQSIM-R, a tool suite for predicting the reliability of large-scale storage system designs and deployments. CQSIM-R includes (a) direct calculations based on an only-drives-fail failure model and (b) an event-based simulator for detailed prediction that handles failures of, and failure dependencies among, arbitrary (drive or nondrive) components. Both are based on a common combinatorial framework for modeling placement strategies. The article demonstrates CQSIM-R using models of common storage systems, including replicated and erasure-coded designs. New results, such as the poor reliability scaling of spread-placed systems and a quantification of the impact of data center distribution and rack-awareness on reliability, demonstrate the usefulness and generality of the tools. Analysis and empirical studies show the
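
The remark about spread-placed systems can be illustrated with a small direct calculation in the spirit of an only-drives-fail model. The sketch below is not CQSIM-R and does not reproduce the article's combinatorial framework. The function names, the assumption that a fixed number of drives fail simultaneously and uniformly at random (no repair, no time dimension), and the two simplified placement schemes are all illustrative assumptions. It compares the probability of losing at least one replicated object when replicas are confined to disjoint drive groups versus spread uniformly over all drives.

```python
from math import comb


def loss_prob_partitioned(n, r, f):
    """Probability that at least one replica group is entirely failed.

    Assumed model (hypothetical, for illustration): n drives are split into
    n // r disjoint groups of r drives, every group holds data, and exactly
    f distinct drives fail, chosen uniformly at random.
    """
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    if not 0 <= f <= n:
        raise ValueError("f must be between 0 and n")
    groups = n // r
    hit = 0
    # Inclusion-exclusion over the number j of groups that are fully failed.
    for j in range(1, min(groups, f // r) + 1):
        sign = 1 if j % 2 == 1 else -1
        hit += sign * comb(groups, j) * comb(n - j * r, f - j * r)
    return hit / comb(n, f)


def loss_prob_spread(n, r, f, objects):
    """Probability that at least one object loses all r replicas.

    Assumed model (hypothetical): each object independently places its r
    replicas on a uniformly random set of r distinct drives out of n, and
    exactly f distinct drives fail, chosen uniformly at random.
    """
    if not 0 <= f <= n:
        raise ValueError("f must be between 0 and n")
    if f < r:
        return 0.0
    p_one = comb(f, r) / comb(n, r)  # chance one given object is lost
    return 1.0 - (1.0 - p_one) ** objects


if __name__ == "__main__":
    n, r, f = 1200, 3, 6  # illustrative values, not taken from the article
    print("partitioned:", loss_prob_partitioned(n, r, f))
    for m in (10**3, 10**6, 10**9):
        print("spread, objects =", m, ":", loss_prob_spread(n, r, f, m))
```

Under these assumptions, the spread-placement loss probability grows with the number of stored objects because more distinct replica sets are in use, whereas the partitioned figure depends only on the number of groups. Modeling repair, time, erasure coding, or nondrive components would require a simulator of the kind the abstract describes.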
