Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

RT-ZooKeeper: Taming the Recovery Latency of a Coordination Service

RT-ZooKeeper: Taming the Recovery Latency of a Coordination Service Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency from leader failures, we then design RT-Zookeeper with a set of novel features including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-Zookeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves 91% reduction in maximum recovery latency in comparison to ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can benefit a common messaging service like Kafka in terms of message latency. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png ACM Transactions on Embedded Computing Systems (TECS) Association for Computing Machinery

RT-ZooKeeper: Taming the Recovery Latency of a Coordination Service

Loading next page...
 
/lp/association-for-computing-machinery/rt-zookeeper-taming-the-recovery-latency-of-a-coordination-service-bE3spfGErU
Publisher
Association for Computing Machinery
Copyright
Copyright © 2021 Association for Computing Machinery.
ISSN
1539-9087
eISSN
1558-3465
DOI
10.1145/3477034
Publisher site
See Article on Publisher Site

Abstract

Fault-tolerant coordination services have been widely used in distributed applications in cloud environments. Recent years have witnessed the emergence of time-sensitive applications deployed in edge computing environments, which introduces both challenges and opportunities for coordination services. On one hand, coordination services must recover from failures in a timely manner. On the other hand, edge computing employs local networked platforms that can be exploited to achieve timely recovery. In this work, we first identify the limitations of the leader election and recovery protocols underlying Apache ZooKeeper, the prevailing open-source coordination service. To reduce recovery latency from leader failures, we then design RT-Zookeeper with a set of novel features including a fast-convergence election protocol, a quorum channel notification mechanism, and a distributed epoch persistence protocol. We have implemented RT-Zookeeper based on ZooKeeper version 3.5.8. Empirical evaluation shows that RT-ZooKeeper achieves 91% reduction in maximum recovery latency in comparison to ZooKeeper. Furthermore, a case study demonstrates that fast failure recovery in RT-ZooKeeper can benefit a common messaging service like Kafka in terms of message latency.

Journal

ACM Transactions on Embedded Computing Systems (TECS)Association for Computing Machinery

Published: Sep 22, 2021

Keywords: Real-time fault tolerance

References