Distributed systems are traditionally designed independently from the underlying network, making worst-case assumptions about its behavior. Such an approach is well-suited for the Internet, where one cannot predict what paths messages might take or what might happen to them along the way. However, many distributed applications are today deployed in datacenters, where the network is more reliable, predictable, and extensible. We argue that in these environments, it is possible to co-design distributed systems with their network layer, and doing so can offer substantial benefits.
Our recent work uses this approach to improve the performance of coordination systems. This includes state machine replication protocols, which are the standard mechanism for ensuring availability of critical datacenter services, and distributed transactions, which simplify programming for large-scale distributed storage systems.
Our first project demonstrates the feasibility of optimizing protocols for the datacenter environment. It uses network-level techniques to provide a Mostly-Ordered Multicast (MOM) primitive with a best-effort ordering property for concurrent multicast operations. We use this primitive to build Speculative Paxos, a new replication protocol that relies on the network to order requests in the normal case, while still remaining correct if messages are delivered out of order. By leveraging the properties of the datacenter network, Speculative Paxos can provide substantially higher throughput and lower latency than the standard Paxos protocol. It simultaneously outperforms latency- and throughput-optimized protocols on their respective metrics.
NOPaxos demonstrates that a small amount of in-network computational power can be used to nearly eliminate the performance overhead of replication. It carefully divides replication responsibility between the network and protocol layers. Using a new primitive we call ordered unreliable multicast (OUM), the network orders requests but does not ensure reliable delivery. This primitive can be implemented at near-zero cost in the datacenter using upcoming flexible switch architectures or existing middlebox designs. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. NOPaxos achieves throughput within 2% and latency within 16 μs of an unreplicated system, showing that fault tolerance does not have to come at the cost of performance.
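The division of labor described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that ordering is realized by a single sequencer that stamps each multicast message with a counter, so a receiver can treat any jump in the counter as a dropped message. The names `Sequencer`, `Receiver`, and `gaps` are hypothetical and are not taken from the NOPaxos implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Message = Tuple[int, Any]  # (sequence number, payload)


@dataclass
class Sequencer:
    """Hypothetical in-network element that stamps each multicast message."""
    next_seq: int = 1

    def stamp(self, payload: Any) -> Message:
        msg = (self.next_seq, payload)
        self.next_seq += 1
        return msg


@dataclass
class Receiver:
    """Accepts messages in stamp order and records gaps left by drops."""
    expected: int = 1
    delivered: List[Message] = field(default_factory=list)
    gaps: List[int] = field(default_factory=list)

    def receive(self, msg: Message) -> None:
        seq, _payload = msg
        if seq < self.expected:
            return  # stale or duplicate message; ignore it
        if seq > self.expected:
            # Assumption: the network never reorders, so every skipped
            # sequence number corresponds to a dropped message.
            self.gaps.extend(range(self.expected, seq))
        self.delivered.append(msg)
        self.expected = seq + 1


if __name__ == "__main__":
    sequencer = Sequencer()
    stamped = [sequencer.stamp(p) for p in ("a", "b", "c")]

    r1, r2 = Receiver(), Receiver()
    for m in stamped:
        r1.receive(m)          # r1 sees every message
    for m in (stamped[0], stamped[2]):
        r2.receive(m)          # the network drops "b" on the way to r2

    print(r1.delivered, r1.gaps)
    print(r2.delivered, r2.gaps)
```

In this sketch, replicas that observe no gaps have all seen the same requests in the same order without exchanging any messages. Under the assumptions above, a protocol layered on top would need to coordinate only when a receiver's `gaps` list is non-empty.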
Distributed storage systems now rely on multiple layers of coordination to apply transactional updates atomically. They are partitioned over multiple shards for scalability and replicated for fault tolerance, requiring both atomic commitment and consensus protocols on each operation. We are developing a new system, Eris, which takes a different approach. It bypasses the need for these expensive protocols by exploiting a new network property, network multi-sequencing, which we show can be efficiently provided by datacenter networks. Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas.
The Speculative Paxos source code is available on GitHub.
NOPaxos source code will be available soon.