University of Washington
Computer Systems Lab

Publications


2019
August
Zooming in on Wide-area Latencies to a Global Cloud Provider.
Yuchen Jin, Sundararajan Renganathan, Ganesh Ananthanarayanan, Junchen Jiang, Venkat Padmanabhan, Manuel Schroder, Matt Calder, and Arvind Krishnamurthy.
Proceedings of the ACM SIGCOMM Conference (SIGCOMM).
2019
August
iPipe: A Framework for Building Distributed Applications on Multicore SoC SmartNICs.
Ming Liu, Tianyi Cui, Henry N. Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta.
Proceedings of the ACM SIGCOMM Conference (SIGCOMM).
2019
April
TCP Acceleration as an OS Service.
Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Thomas Anderson, and Arvind Krishnamurthy.
Proceedings of the European Conference on Computer Systems (Eurosys).
pdf
2019
March
Teaching Rigorous Distributed Systems With Efficient Model Checking.
Ellis Michael, Doug Woos, Thomas Anderson, Michael D. Ernst, and Zachary Tatlock.
Proceedings of the European Conference on Computer Systems (Eurosys).
abstract pdf slides
Writing correct distributed systems code is difficult, especially for novice programmers. The inherent asynchrony and need for fault-tolerance make errors almost inevitable. Industrial-strength testing and model checking have been shown to be effective at uncovering bugs, but they come at a cost -- in both time and effort -- that is far beyond what students can afford. To address this, we have developed an efficient model checking framework and visual debugger for distributed systems, with the goal of helping students find and fix bugs in near real-time. We identify two novel techniques for reducing the search state space to more efficiently find bugs in student implementations. We report our experiences using these tools to help over two hundred students build a correct, linearizable, fault-tolerant, dynamically-sharded key-value store.
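As a rough illustration of what such a model checker does, here is a minimal sketch of explicit-state search over message-delivery interleavings. The State interface (pending_messages, deliver) and the depth bound are invented stand-ins, not the paper's framework; the seen set plays the role of the state-space reduction the abstract alludes to.

```python
# Minimal sketch of explicit-state model checking for a distributed system
# (hypothetical State API, not the paper's tool): breadth-first search over
# all message-delivery interleavings, pruning states already explored.
from collections import deque

def model_check(initial_state, invariant, max_depth=10):
    """Return a violating delivery trace if the invariant can fail
    within max_depth deliveries, else None."""
    frontier = deque([(initial_state, [])])
    seen = {initial_state}                    # simple state-space reduction
    while frontier:
        state, trace = frontier.popleft()
        if not invariant(state):
            return trace                      # counterexample for the debugger
        if len(trace) >= max_depth:
            continue
        for msg in state.pending_messages():  # every deliverable message
            nxt = state.deliver(msg)          # handlers must be deterministic
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, trace + [msg]))
    return None                               # no violation within the bound
```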
2019
February
Slim: OS Kernel Support for a Low-Overhead Container Overlay Network.
Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2019
February
Stable and Practical AS Relationship Inference with ProbLink.
Yuchen Jin, Colin Scott, Amogh Dhamdhere, Vasilis Giotsas, Arvind Krishnamurthy, and Scott Shenker.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2018
October
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
pdf
2018
August
Revisiting Network Support for RDMA.
Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker.
Proceedings of the ACM SIGCOMM Conference (SIGCOMM).
pdf
2018
April
Towards Causal Datacenter Networks.
Ellis Michael and Dan R. K. Ports.
Proceedings of the ACM Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC).
abstract pdf
Traditionally, distributed systems conservatively assume an asynchronous network. However, recent work on the co-design of networks and distributed systems has shown that stronger ordering properties are achievable in datacenter networks and yield performance improvements for the distributed systems they support. We build on that trend and ask whether it is possible for the datacenter network to order all messages in a protocol-agnostic way. This approach, which we call omnisequencing, would ensure causal delivery of all messages, making consistency a network-level guarantee.
2018
April
MultiNyx: A Multi-level Abstraction Framework for Systematic Analysis of Hypervisors.
Pedro Fonseca, Xi Wang, and Arvind Krishnamurthy.
Proceedings of the European Conference on Computer Systems (Eurosys).
pdf
2018
April
Approximating Fair Queueing on Reconfigurable Switches.
Naveen Kr. Sharma, Ming Liu, Kishore Atreya, and Arvind Krishnamurthy.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2018
April
Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure.
Qiao Zhang, Guo Yu, Chuanxiong Guo, Yingnong Dang, Nick Swanson, Xinsheng Yang, Randolph Yao, Murali Chintalapati, Arvind Krishnamurthy, and Thomas Anderson.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2017
October
Hyperkernel: Push-Button Verification of an OS Kernel.
Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, Dylan Johnson, James Bornholt, Emina Torlak, and Xi Wang.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
abstract pdf slides video
This paper describes an approach to designing, implementing, and formally verifying the functional correctness of an OS kernel, named Hyperkernel, with a high degree of proof automation and low proof burden. We base the design of Hyperkernel's interface on xv6, a Unix-like teaching operating system. Hyperkernel introduces three key ideas to achieve proof automation: it finitizes the kernel interface to avoid unbounded loops or recursion; it separates kernel and user address spaces to simplify reasoning about virtual memory; and it performs verification at the LLVM intermediate representation level to avoid modeling complicated C semantics. We have verified the implementation of Hyperkernel with the Z3 SMT solver, checking a total of 50 system calls and other trap handlers. Experience shows that Hyperkernel can avoid bugs similar to those found in xv6, and that the verification of Hyperkernel can be achieved with a low proof burden.
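For flavor, the following toy shows the shape of SMT-based checking using the Z3 Python bindings; the round-robin handler and the invariant are invented for illustration, and Hyperkernel itself verifies LLVM IR rather than anything like this sketch.

```python
# Toy SMT check in the spirit of the approach above (invented handler and
# invariant; not Hyperkernel). We encode one finitized handler symbolically
# and ask Z3 whether it can ever break a kernel invariant.
from z3 import BitVec, Solver, And, Or, If

NPROC = 4
cur = BitVec("cur", 8)                   # currently running process id
nxt = If(cur + 1 < NPROC, cur + 1, 0)    # handler: round-robin context switch

s = Solver()
s.add(And(cur >= 0, cur < NPROC))        # assume the invariant held before
s.add(Or(nxt < 0, nxt >= NPROC))         # ask: can it fail to hold after?
print(s.check())                         # unsat: the handler preserves it
```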
2017
October
Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control.
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
abstract pdf slides video
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead -- achieving throughput 3.6--35x higher and latency 72--80% lower than a conventional design on standard benchmarks.
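A toy sketch of the multi-sequencing primitive the abstract describes (invented data structures, not the Eris implementation): the sequencer stamps each transaction with one sequence number per shard it touches, so every shard can independently order its transactions and detect dropped ones from gaps.

```python
# Illustrative multi-sequencer (toy): per-shard counters stamped onto each
# transaction; a gap in any shard's sequence signals a dropped transaction.
from collections import defaultdict

class MultiSequencer:
    def __init__(self):
        self.counters = defaultdict(int)   # shard id -> last sequence number

    def stamp(self, txn_id, shards):
        """Assign the next sequence number for every shard the txn touches."""
        stamp = {shard: self.counters[shard] + 1 for shard in sorted(shards)}
        self.counters.update(stamp)
        return {"txn": txn_id, "seq": stamp}

seq = MultiSequencer()
print(seq.stamp("t1", {"A", "B"}))  # {'txn': 't1', 'seq': {'A': 1, 'B': 1}}
print(seq.stamp("t2", {"B"}))       # shard B expects 2 next; if it sees 3,
                                    # it knows a stamped txn was dropped
```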
2017
October
Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control (Extended Version).
Jialin Li, Ellis Michael, and Dan R. K. Ports.
Technical Report UW-CSE-TR-17-01-01, University of Washington CSE.
abstract pdf
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity. The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead -- achieving throughput 3.6--35x higher and latency 72--80% lower than a conventional design on standard benchmarks.
2017
October
Recovering Shared Objects Without Stable Storage.
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Proceedings of the International Symposium on Distributed Computing (DISC).
abstract pdf
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover. To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations. We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set---a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
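The crash-vector mechanism is easy to sketch (hypothetical encoding, not the paper's code): each node tracks the highest incarnation number it has seen for every node, and a quorum whose merged vectors reveal that some replier was already superseded is not crash-consistent.

```python
# Toy crash vectors: vec[i] is the highest incarnation of node i seen so far.
def merge(vectors):
    return [max(column) for column in zip(*vectors)]

def crash_consistent(replies):
    """replies: list of (node_id, crash_vector) collected for one quorum.
    Crash-consistent iff no replier's own incarnation is stale relative to
    what some other replier in the quorum already observed."""
    merged = merge([vec for _, vec in replies])
    return all(vec[node] == merged[node] for node, vec in replies)

# Node 0 has already seen incarnation 2 of node 2, but node 2 replied as
# incarnation 1: node 2 crashed and recovered during the quorum.
replies = [(0, [3, 1, 2]), (2, [2, 1, 1])]
print(crash_consistent(replies))  # False: not a crash-consistent quorum
```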
2017
August
Recovering Shared Objects Without Stable Storage (Extended Version).
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Technical Report UW-CSE-TR-17-08-01, University of Washington CSE.
abstract pdf
This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover. To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations. We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set---a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.
2017
July
Fast Video Classification via Adaptive Cascading of Deep Models.
Haichen Shen, Seungyeop Han, Matthai Philipose, and Arvind Krishnamurthy.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Spotlight.
abstract pdf
Recent advances have enabled "oracle" classifiers that can classify across many classes and input distributions with high accuracy without retraining. However, these classifiers are relatively heavyweight, so that applying them to classify video is costly. We show that day-to-day video exhibits highly skewed class distributions over the short term, and that these distributions can be classified by much simpler models. We formulate the problem of detecting the short-term skews online and exploiting models based on it as a new sequential decision making problem dubbed the Online Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.4-7.8x/2.6-11.2x (on GPU/CPU) relative to a state-of-the-art convolutional neural network, at competitive accuracy.
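The cascade itself is simple to sketch; the models and confidence threshold below are toy stand-ins, and the paper's contribution is deciding online when the cheap model is trustworthy.

```python
# Toy adaptive cascade: a cheap model specialized to the recent class skew
# answers when confident; the heavyweight oracle handles the residue.
def cascade(frame, cheap_model, oracle_model, threshold=0.9):
    label, confidence = cheap_model(frame)
    if confidence >= threshold:
        return label                   # fast path: short-term skew captured
    return oracle_model(frame)[0]      # slow path: full-coverage oracle

cheap = lambda f: ("alice", 0.95) if f == "a" else ("unknown", 0.2)
oracle = lambda f: ("bob", 0.99)
print(cascade("a", cheap, oracle))     # 'alice', oracle never invoked
print(cascade("x", cheap, oracle))     # 'bob' via the fallback
```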
2017
April
IncBricks: Toward In-Network Computation with an In-Network Cache.
Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, and Kishore Atreya.
Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
pdf
2017
April
An Empirical Study on the Correctness of Formally Verified Distributed Systems.
Pedro Fonseca, Kaiyuan Zhang, Xi Wang, and Arvind Krishnamurthy.
Proceedings of the European Conference on Computer Systems (Eurosys).
pdf
2017
March
RAIL: A Case for Redundant Arrays of Inexpensive Links in Data Center Networks.
Danyang Zhuo, Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Xuan Zou, Hang Guan, Arvind Krishnamurthy, and Thomas Anderson.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2017
March
Evaluating the Power of Flexible Packet Processing for Network Resource Allocation.
Naveen Kr. Sharma, Antoine Kaufmann, Thomas Anderson, Changhoon Kim, Arvind Krishnamurthy, Jacob Nelson, and Simon Peter.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2017
March
SCL: Simplifying Distributed SDN Control Planes.
Aurojit Panda, Wenting Zheng, Xiaohe Hu, Arvind Krishnamurthy, and Scott Shenker.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2017
March
Curator: Self-Managing Storage for Enterprise Clusters.
Ignacio Cano, Srinivas Aiyar, Varun Arora, Manosiz Bhattacharyya, Akhilesh Chaganti, Chern Cheah, Brent Chun, Karan Gupta, Vinayak Khot, and Arvind Krishnamurthy.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
pdf
2016
November
Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering.
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
abstract pdf slides
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 us of an unreplicated system -- providing replication without the performance cost.
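A minimal sketch of the OUM primitive with invented classes: the sequencer stamps each multicast with a counter and receivers detect drops from sequence gaps, while delivery itself is never guaranteed by the network.

```python
# Toy ordered unreliable multicast: ordering from a sequencer stamp, drop
# detection from gaps; reliability is left to the replication protocol.
class Sequencer:
    def __init__(self):
        self.seq = 0
    def stamp(self, msg):
        self.seq += 1
        return (self.seq, msg)

class Receiver:
    def __init__(self):
        self.expected = 1
    def deliver(self, stamped):
        seq, msg = stamped
        if seq != self.expected:
            return ("GAP-DETECTED", self.expected)  # run gap agreement
        self.expected += 1
        return ("DELIVER", msg)

s, r = Sequencer(), Receiver()
m1, m2, m3 = s.stamp("a"), s.stamp("b"), s.stamp("c")
print(r.deliver(m1))   # ('DELIVER', 'a')
print(r.deliver(m3))   # ('GAP-DETECTED', 2): m2 was lost in the network
```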
2016
November
Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering (Extended Version).
Jialin Li, Ellis Michael, Adriana Szekeres, Naveen Kr. Sharma, and Dan R. K. Ports.
Technical Report UW-CSE-2016-09-02, University of Washington CSE.
abstract pdf
Distributed applications use replication, implemented by protocols like Paxos, to ensure data availability and transparently mask server failures. This paper presents a new approach to achieving replication in the data center without the performance cost of traditional methods. Our work carefully divides replication responsibility between the network and protocol layers. The network orders requests but does not ensure reliable delivery -- using a new primitive we call ordered unreliable multicast (OUM). Implementing this primitive can be achieved with near-zero-cost in the data center. Our new replication protocol, Network-Ordered Paxos (NOPaxos), exploits network ordering to provide strongly consistent replication without coordination. The resulting system not only outperforms both latency- and throughput-optimized protocols on their respective metrics, but also yields throughput within 2% and latency within 16 us of an unreplicated system -- providing replication without the performance cost.
2016
November
Push-Button Verification of File Systems via Crash Refinement.
Helgi Sigurbjarnarson, James Bornholt, Emina Torlak, and Xi Wang.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI). Best Paper Award.
abstract pdf
The file system is an essential operating system component for persisting data on storage devices. Writing bug-free file systems is non-trivial, as they must correctly implement and maintain complex on-disk data structures even in the presence of system crashes and reorderings of disk operations. This paper presents Yggdrasil, a toolkit for writing file systems with push-button verification: Yggdrasil requires no manual annotations or proofs about the implementation code, and it produces a counterexample if there is a bug. Yggdrasil achieves this automation through a novel definition of file system correctness called crash refinement, which requires the set of possible disk states produced by an implementation (including states produced by crashes) to be a subset of those allowed by the specification. Crash refinement is amenable to fully automated satisfiability modulo theories (SMT) reasoning, and enables developers to implement file systems in a modular way for verification. With Yggdrasil, we have implemented and verified the Yxv6 journaling file system, the Ycp file copy utility, and the Ylog persistent log. Our experience shows that the ease of proof and counterexample-based debugging support make Yggdrasil practical for building reliable storage applications.
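The crash-refinement definition can be rendered as a toy enumeration (Yggdrasil instead reasons symbolically with SMT): every disk state the implementation can produce, including crash-truncated ones, must be allowed by the specification.

```python
# Toy crash refinement: enumerate crash states of a sequence of unsynced
# writes (modeled as independently lost) and check set inclusion in the spec.
from itertools import combinations

def crash_states(writes):
    states = set()
    for k in range(len(writes) + 1):
        for kept in combinations(writes, k):
            states.add(frozenset(kept))
    return states

def refines(impl_writes, spec_allowed):
    return crash_states(impl_writes) <= spec_allowed

# Spec: the commit record may never be on disk without the data it covers.
spec = {frozenset(), frozenset({"data"}), frozenset({"data", "commit"})}
print(refines(["data", "commit"], spec))  # False: a crash can strand the
                                          # commit record without the data
```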
2016
November
Diamond: Automating Data Management and Storage for Wide-area, Reactive Applications.
Irene Zhang, Niel Lebeck, Pedro Fonseca, Brandon Holt, Raymond Cheng, Ariadna Norberg, Arvind Krishnamurthy, and Henry M. Levy.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
abstract pdf slides
Users of today's popular wide-area apps (e.g., Twitter, Google Docs, and Words with Friends) must no longer save and reload when updating shared data; instead, these applications are reactive, providing the illusion of continuous synchronization across mobile devices and the cloud. Achieving this illusion poses a complex distributed data management problem for programmers. This paper presents the first reactive data management service, called Diamond, which provides persistent cloud storage, reliable synchronization between storage and mobile devices, and automated execution of application code in response to shared data updates. We demonstrate that Diamond greatly simplifies the design of reactive applications, strengthens distributed data sharing guarantees, and supports automated reactivity with low performance overhead.
2016
October
Disciplined Inconsistency with Consistency Types.
Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze.
Proceedings of the ACM Symposium on Cloud Computing (SoCC).
2016
October
Radiatus: Strong User Isolation for Scalable Web Applications.
Raymond Cheng, William Scott, Paul Ellenbogen, Jon Howell, Franziska Roesner, Arvind Krishnamurthy, and Thomas Anderson.
Proceedings of the ACM Symposium on Cloud Computing (SoCC).
2016
October
Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters.
Ignacio Cano, Srinivas Aiyar, and Arvind Krishnamurthy.
Proceedings of the ACM Symposium on Cloud Computing (SoCC).
2016
August
Providing Stable Storage for the Diskless Crash-Recovery Failure Model.
Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres.
Technical Report UW-CSE-TR-16-08-02, University of Washington CSE.
abstract pdf
Many classic protocols in the fault tolerant distributed computing literature assume a Crash-Fail model in which processes either are up, or have crashed and are permanently down. While this model is useful, it does not fully capture the difficulties many real systems must contend with. In particular, real-world systems are long-lived and must have a recovery mechanism so that crashed processes can rejoin the system and restore its fault-tolerance. When processes are assumed to have access to stable storage that is persistent across failures, the Crash-Recovery model is trivial. However, because disk failures are common and because having a disk on a protocol's critical path is often a performance concern, diskless recovery protocols are needed. While such protocols do exist in the state machine replication literature, several well-known protocols have flawed recovery mechanisms. We examine these errors to elucidate the problem of diskless recovery and present our own protocol for providing virtual stable storage, transforming any protocol in the Crash-Recovery with stable storage model into a protocol in the Diskless Crash-Recovery model.
2016
June
MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints.
Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy.
Proceedings of the ACM International Conference on Mobile Systems, Applications, and Services (MobiSys).
abstract pdf
We consider applying computer vision to video on cloud-backed mobile devices using Deep Neural Networks (DNNs). The computational demands of DNNs are high enough that, without careful resource management, such applications strain device battery, wireless data, and cloud cost budgets. We pose the corresponding resource management problem, which we call Approximate Model Scheduling, as one of serving a stream of heterogeneous (i.e., solving multiple classification problems) requests under resource constraints. We present the design and implementation of an optimizing compiler and runtime scheduler to address this problem. Going beyond traditional resource allocators, we allow each request to be served approximately, by systematically trading off DNN classification accuracy for resource use, and remotely, by reasoning about on-device/cloud execution trade-offs. To inform the resource allocator, we characterize how several common DNNs, when subjected to state-of-the art optimizations, trade off accuracy for resource use such as memory, computation, and energy. The heterogeneous streaming setting is a novel one for DNN execution, and we introduce two new and powerful DNN optimizations that exploit it. Using the challenging continuous mobile vision domain as a case study, we show that our techniques yield significant reductions in resource usage and perform effectively over a broad range of operating conditions.
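A greedy sketch of the accuracy-for-resources trade-off (the catalog of model variants and their costs is made up, and MCDNN's scheduler is considerably more sophisticated): serve each request with the most accurate variant the remaining budget affords.

```python
# Toy approximate model scheduling: degrade accuracy gracefully as a shared
# resource budget (memory, energy, or cloud dollars) is consumed.
def schedule(requests, catalog, budget):
    plan = []
    for req in requests:
        affordable = [m for m in catalog if m[1] <= budget]
        if not affordable:
            break                         # budget exhausted: shed load
        accuracy, cost = max(affordable)  # most accurate affordable variant
        budget -= cost
        plan.append((req, accuracy, cost))
    return plan

catalog = [(0.95, 10.0), (0.90, 3.0), (0.80, 1.0)]  # full / mid / tiny DNN
print(schedule(["r1", "r2", "r3"], catalog, budget=14.0))
# r1 gets the full model; r2 and r3 fall back to cheaper approximations.
```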
2016
June
Satellite: Joint Analysis of CDNs and Network-Level Interference.
Will Scott, Thomas Anderson, Tadayoshi Kohno, and Arvind Krishnamurthy.
Proceedings of the USENIX Annual Technical Conference (ATC). Best Student Paper Award.
pdf
2016
April
Specifying and Checking File System Crash-Consistency Models.
James Bornholt, Antoine Kaufmann, Jialin Li, Arvind Krishnamurthy, Emina Torlak, and Xi Wang.
Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
pdf
2016
April
High Performance Packet Processing with FlexNIC.
Antoine Kaufmann, Simon Peter, Thomas Anderson, and Arvind Krishnamurthy.
Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
abstract pdf
The recent surge of network I/O performance has put enormous pressure on memory and software I/O processing subsystems. We argue that the primary reason for high memory and processing overheads is the inefficient use of these resources by current commodity network interface cards (NICs). We propose FlexNIC, a flexible network DMA interface that can be used by operating systems and applications alike to reduce packet processing overheads. FlexNIC allows services to install packet processing rules into the NIC, which then executes simple operations on packets while exchanging them with host memory. Thus, our proposal moves some of the packet processing traditionally done in software to the NIC, where it can be done flexibly and at high speed. We quantify the potential benefits of FlexNIC by emulating the proposed FlexNIC functionality with existing hardware or in software. We show that significant gains in application performance are possible, in terms of both latency and throughput, for several widely used applications, including a key-value store, a stream processing system, and an intrusion detection system.
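A software rendering of the install-a-rule interface the abstract describes (the rule format is invented here, not the FlexNIC API): each rule matches packet header fields and runs a small action inline, for example steering key-value GETs to cores by key hash so cores serve disjoint partitions.

```python
# Toy NIC with installable match+action rules applied as packets arrive.
def make_nic(rules):
    def process(packet):
        for match, action in rules:
            if all(packet.get(k) == v for k, v in match.items()):
                return action(packet)
        return ("to-kernel", packet)       # default slow path
    return process

# Steer memcached-style requests to one of 4 cores by key hash.
rules = [({"proto": "udp", "port": 11211},
          lambda p: ("to-core", hash(p["key"]) % 4, p))]
nic = make_nic(rules)
print(nic({"proto": "udp", "port": 11211, "key": "foo"}))
print(nic({"proto": "tcp", "port": 80}))   # unmatched: kernel handles it
```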
2016
March
Speeding up Web Page Loads with Shandian.
Xiao Sophia Wang, Arvind Krishnamurthy, and David Wetherall.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI).
abstract pdf
Web page loads are slow due to intrinsic inefficiencies in the page load process. Our study shows that the inefficiencies are attributable not only to the contents and structure of the Web pages (e.g., three-fourths of the CSS resources are not used during the initial page load) but also the way that pages are loaded (e.g., 15% of page load times are spent waiting for parsing-blocking resources to be loaded). To address these inefficiencies, this paper presents Shandian (which means lightning in Chinese), which restructures the page load process to speed up page loads. Shandian exercises control over what portions of the page get communicated and in what order so that the initial page load is optimized. Unlike previous techniques, Shandian works on demand without requiring a training period, is compatible with existing latency-reducing techniques (e.g., caching and CDNs), supports security features that enforce same-origin policies, and does not impose additional privacy risks. Our evaluations show that Shandian reduces page load times by more than half for both mobile phones and desktops while incurring modest overheads to data usage.
2016
March
When Is Operation Ordering Required in Replicated Transactional Storage?.
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports.
IEEE Data Engineering Bulletin 39(1).
abstract pdf
Today's replicated transactional storage systems typically have a layered architecture, combining protocols for transaction coordination, consistent replication, and concurrency control. These systems generally require costly strongly-consistent replication protocols like Paxos, which assign a total order to all operations. To avoid this cost, we ask whether all replicated operations in these systems need to be strictly ordered. Recent research has yielded replication protocols that can avoid unnecessary ordering, e.g., by exploiting commutative operations, but it is not clear how to apply these to replicated transaction processing systems. We answer this question by analyzing existing transaction processing designs in terms of which replicated operations require ordering and which simply require fault tolerance. We describe how this analysis leads to our recent work on TAPIR, a transaction protocol that efficiently provides strict serializability by using a new replication protocol that provides fault tolerance but not ordering for most operations.
2015
October
Building Consistent Transactions with Inconsistent Replication.
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
abstract pdf
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.
2015
Building Consistent Transactions with Inconsistent Replication (Extended Version).
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports.
Technical Report UW-CSE-2014-12-01 v2, University of Washington CSE.
abstract pdf
Application programmers increasingly prefer distributed storage systems with strong consistency and distributed transactions (e.g., Google's Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use -- in part because they require costly replication protocols, like Paxos, for fault tolerance. In this paper, we present a new approach that makes transactional storage systems more affordable: we eliminate consistency from the replication protocol while still providing distributed transactions with strong consistency to applications. We present TAPIR -- the Transactional Application Protocol for Inconsistent Replication -- the first transaction protocol to use a novel replication protocol, called inconsistent replication, that provides fault tolerance without consistency. By enforcing strong consistency only in the transaction protocol, TAPIR can commit transactions in a single round-trip and order distributed transactions without centralized coordination. We demonstrate the use of TAPIR in a transactional key-value store, TAPIR-KV. Compared to conventional systems, TAPIR-KV provides better latency and throughput.
2015
September
Enhancing mobile apps to use sensor hubs without programmer effort.
Haichen Shen, Aruna Balasubramanian, Anthony LaMarca, and David Wetherall.
Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp). Best Paper Award.
abstract pdf
Always-on continuous sensing apps drain the battery quickly because they prevent the main processor from sleeping. Instead, sensor hub hardware, available in many smartphones today, can run continuous sensing at lower power while keeping the main processor idle. However, developers have to divide functionality between the main processor and the sensor hub. We implement MobileHub, a system that automatically rewrites applications to leverage the sensor hub without additional programming effort. MobileHub uses a combination of dynamic taint tracking and machine learning to learn when it is safe to leverage the sensor hub without affecting application semantics. We implement MobileHub in Android and prototype a sensor hub on a 8-bit AVR micro-controller. We experiment with 20 applications from Google Play. Our evaluation shows that MobileHub significantly reduces power consumption for continuous sensing apps.
2015
July
MetaSync: File Synchronization Across Multiple Untrusted Storage Services.
Seungyeop Han, Haichen Shen, Taesoo Kim, Arvind Krishnamurthy, Thomas Anderson, and David Wetherall.
Proceedings of the USENIX Annual Technical Conference (ATC).
abstract pdf
Cloud-based file synchronization services, such as Dropbox, are a worldwide resource for many millions of users. However, individual services often have tight resource limits, suffer from temporary outages or even shutdowns, and sometimes silently corrupt or leak user data. We design, implement, and evaluate MetaSync, a secure and reliable file synchronization service that uses multiple cloud synchronization services as untrusted storage providers. To make MetaSync work correctly, we devise a novel variant of Paxos that provides efficient and consistent updates on top of the unmodified APIs exported by existing services. Our system automatically redistributes files upon reconfiguration of providers. Our evaluation shows that MetaSync provides low update latency and high update throughput while being more trustworthy and available. MetaSync outperforms its underlying cloud services by 1.2-10X on three realistic workloads.
2015
May
FlexNIC: Rethinking Network DMA.
Antoine Kaufmann, Simon Peter, Thomas Anderson, and Arvind Krishnamurthy.
Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS).
abstract
We propose FlexNIC, a flexible network DMA interface that can be used by operating systems and applications alike to reduce packet processing overheads. The recent surge of network I/O performance has put enormous pressure on memory and software I/O processing subsystems. Yet even at high speeds, flexibility in packet handling is still important for security, performance isolation, and virtualization. Thus, our proposal moves some of the packet processing traditionally done in software to the NIC DMA controller, where it can be done flexibly and at high speed. We show how FlexNIC can benefit widely used data center server applications, such as key-value stores.
2015
May
Designing Distributed Systems Using Approximate Synchrony in Data Center Networks.
Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy.
Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI). Best Paper Award.
abstract pdf
Distributed systems are traditionally designed independently from the underlying network, making worst-case assumptions (e.g., complete asynchrony) about its behavior. However, many of today's distributed applications are deployed in data centers, where the network is more reliable, predictable, and extensible. In these environments, it is possible to co-design distributed systems with their network layer, and doing so can offer substantial benefits. This paper explores network-level mechanisms for providing Mostly-Ordered Multicast (MOM): a best-effort ordering property for concurrent multicast operations. Using this primitive, we design Speculative Paxos, a state machine replication protocol that relies on the network to order requests in the normal case. This approach leads to substantial performance benefits: under realistic data center conditions, Speculative Paxos can provide 40% lower latency and 2.6x higher throughput than the standard Paxos protocol. It offers lower latency than a latency-optimized protocol (Fast Paxos) with the same throughput as a throughput-optimized protocol (batching).
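One way to picture the speculation check (simplified, with invented details, not the paper's protocol machinery): replicas execute requests in arrival order and periodically compare rolling log hashes; because MOM makes reordering rare, a mismatch, which forces reconciliation, is the uncommon case.

```python
# Toy speculative-execution check: matching log hashes confirm replicas
# processed the same requests in the same order.
import hashlib

def log_hash(log):
    h = hashlib.sha256()
    for op in log:
        h.update(op.encode())
    return h.hexdigest()

replica1 = ["put x=1", "put y=2"]   # arrival order at replica 1
replica2 = ["put y=2", "put x=1"]   # the network reordered at replica 2
print(log_hash(replica1) == log_hash(replica2))  # False: reconcile/rollback
```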
2015
April
Exploring Cyberbullying and Other Toxic Behavior in Team Competition Online Games.
Haewoon Kwak, Jeremy Blackburn, and Seungyeop Han.
Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI).
abstract pdf
In this work we explore cyberbullying and other toxic behavior in team competition online games. Using a dataset of over 10 million player reports on 1.46 million toxic players along with corresponding crowdsourced decisions, we test several hypotheses drawn from theories explaining toxic behavior. Besides providing a large-scale, empirically based understanding of toxic behavior, our work can be used as a basis for building systems to detect, prevent, and counteract toxic behavior.
2015
April
Claret: Using Data Types for Highly Concurrent Distributed Transactions.
Brandon Holt, Irene Zhang, Dan Ports, Mark Oskin, and Luis Ceze.
Proceedings of the ACM Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC).
2014
November
Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency.
Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble.
Proceedings of the ACM Symposium on Cloud Computing (SoCC).
abstract pdf
Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high throughput servers executing on multi-core machines. We model these network services as a queuing system in order to establish the best-achievable latency distribution. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power saving mechanisms, and NUMA effects. We systematically eliminate these factors and show that Memcached can achieve a median latency of 11 us and a 99.9th percentile latency of 32 us at 80% utilization on a four-core system. In comparison, a naive deployment of Memcached at the same utilization on a single-core system has a median latency of 100 us and a 99.9th percentile latency of 5 ms. Finally, we demonstrate that tradeoffs exist between throughput, energy, and tail latency.
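The queueing baseline in the abstract can be approximated with a few lines of simulation (an M/M/1 toy with invented parameters): even at 80% utilization, queueing alone dictates a 99.9th percentile far above the median, and measured gaps beyond that point to OS- and application-level causes.

```python
# Toy M/M/1 simulation: best-achievable latency distribution for a single
# FIFO server at a given utilization, in units of mean service time.
import random

def simulate(n=200_000, utilization=0.8, seed=1):
    random.seed(seed)
    t = depart = 0.0
    latencies = []
    for _ in range(n):
        t += random.expovariate(utilization)   # Poisson arrivals
        start = max(t, depart)                 # wait for the server
        depart = start + random.expovariate(1.0)
        latencies.append(depart - t)
    latencies.sort()
    return latencies[n // 2], latencies[int(n * 0.999)]

median, p999 = simulate()
print(f"median={median:.1f}, p99.9={p999:.1f} mean service times")
```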
2014
October
Jitk: A Trustworthy In-Kernel Interpreter Infrastructure.
Xi Wang, David Lazar, Nickolai Zeldovich, Adam Chlipala, and Zachary Tatlock.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
abstract pdf
Modern operating systems run multiple interpreters in the kernel, which enable user-space applications to add new functionality or specialize system policies. The correctness of such interpreters is critical to the overall system security: bugs in interpreters could allow adversaries to compromise user-space applications and even the kernel. Jitk is a new infrastructure for building in-kernel interpreters that guarantee functional correctness as they compile user-space policies down to native instructions for execution in the kernel. To demonstrate Jitk, we implement two interpreters in the Linux kernel, BPF and INET-DIAG, which are used for network and system call filtering and socket monitoring, respectively. To help application developers write correct filters, we introduce a high-level rule language, along with a proof that Jitk correctly translates high-level rules all the way to native machine code, and demonstrate that this language can be integrated into OpenSSH with tens of lines of code. We built a prototype of Jitk on top of the CompCert verified compiler and integrated it into the Linux kernel. Experimental results show that Jitk is practical, fast, and trustworthy.
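For intuition, here is a toy packet-filter interpreter in the general style of such in-kernel interpreters (an invented three-opcode machine, not Jitk, whose point is a machine-checked correctness proof): load, conditional skip, and a verdict, with a fail-safe default.

```python
# Toy filter machine: LD loads a packet byte, JEQ skips the next instruction
# when the accumulator matches, RET returns the verdict (1 accept, 0 drop).
def run_filter(prog, pkt):
    acc, pc = 0, 0
    while pc < len(prog):
        op, arg = prog[pc]
        if op == "LD":
            acc = pkt[arg]
        elif op == "JEQ":
            if acc == arg:
                pc += 1             # skip the next instruction
        elif op == "RET":
            return arg
        pc += 1
    return 0                        # fell off the end: drop, fail-safe

# Accept only packets whose first byte is 0x45 (IPv4, 20-byte header).
prog = [("LD", 0), ("JEQ", 0x45), ("RET", 0), ("RET", 1)]
print(run_filter(prog, bytes([0x45, 0x00])))  # 1: accepted
print(run_filter(prog, bytes([0x60, 0x00])))  # 0: dropped
```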
2014
October
Customizable and Extensible Deployment for Mobile/Cloud Applications.
Irene Zhang, Adriana Szekeres, Dana Van Aken, Isaac Ackerman, Steven D. Gribble, Arvind Krishnamurthy, and Henry M. Levy.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
abstract pdf
Modern applications face new challenges in managing today's highly distributed and heterogeneous environment. For example, they must stitch together code that crosses smartphones, tablets, personal devices, and cloud services, connected by variable wide-area networks, such as WiFi and 4G. This paper describes Sapphire, a distributed programming platform that simplifies the programming of today's mobile/cloud applications. Sapphire's key design feature is its distributed runtime system, which supports a flexible and extensible deployment layer for solving complex distributed systems tasks, such as fault-tolerance, code-offloading, and caching. Rather than writing distributed systems code, programmers choose deployment managers that extend Sapphire's kernel to meet their applications' deployment requirements. In this way, each application runs on an underlying platform that is customized for its own distribution needs.
2014
October
Arrakis: The Operating System is the Control Plane.
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Douglas Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe.
Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI). Best Paper Award.
abstract pdf
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing 2-5x end-to-end latency and 9x throughput improvements for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
2014
June
Towards High-Performance Application-Level Storage Management.
Simon Peter, Irene Zhang, Dan R. K. Ports, Jialin Li, Doug Woos, Thomas Anderson, Arvind Krishnamurthy, and Mark Zbikowski.
Proceedings of the USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage).
abstract pdf
We propose a radical re-architecture of the traditional operating system storage stack to move the kernel off the data path. Leveraging virtualized I/O hardware for disk and flash storage, most read and write I/O operations go directly to application code. The kernel dynamically allocates extents, manages the virtual to physical binding, and performs name translation. The benefit is to dramatically reduce the CPU overhead of storage operations while improving application flexibility.
2014
June
Machine Fault Tolerance for Reliable Datacenter Systems.
Danyang Zhuo, Qiao Zhang, Dan R. K. Ports, Arvind Krishnamurthy, and Thomas Anderson.
Proceedings of the ACM Asia-Pacific Workshop on Systems (APSys).
abstract pdf
Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at data center scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilience to these types of failures. Since most machine failures manifest in application server and operating system code, we assume a Byzantine model for those parts of the system. However, by assuming that the hypervisor and network are trustworthy, we are able to reduce the overhead of machine-fault masking to be close to that of non-Byzantine Paxos.
2013
November
Exploring Storage Class Memory with Key Value Stores.
Katelin Bailey, Peter Hornyack, Luis Ceze, Steven D. Gribble, and Henry M. Levy.
Proceedings of the Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW).
abstract pdf
In the near future, new storage-class memory (SCM) technologies - such as phase-change memory and memristors - will radically change the nature of long-term storage. These devices will be cheap, non-volatile, byte addressable, and near DRAM density and speed. While SCM offers enormous opportunities, profiting from them will require new storage systems specifically designed for SCM's properties. This paper presents Echo, a persistent key-value storage system designed to leverage the advantages and address the challenges of SCM. The goals of Echo include high performance for both small and large data objects, recoverability after failure, and scalability on multicore systems. Echo achieves its goals through the use of a two-level memory design targeted for memory systems containing both DRAM and SCM, exploitation of SCM's byte addressability for fine-grained transactions in non-volatile memory, and the use of snapshot isolation for concurrency, consistency, and versioning. Our evaluation demonstrates that Echo's SCM-centric design achieves the durability guarantees of the best disk-based stores with the performance characteristics approaching the best in-memory key-value stores.
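The snapshot-isolation versioning Echo relies on is easy to sketch in miniature (an in-memory toy, nothing like Echo's SCM-aware design): writes append versions under a global counter, and a reader observes the newest version no later than its snapshot.

```python
# Toy multi-version store: snapshot reads ignore writes committed after the
# snapshot was taken, giving consistent reads without blocking writers.
class VersionedStore:
    def __init__(self):
        self.data = {}     # key -> list of (version, value), oldest first
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.data.setdefault(key, []).append((self.clock, value))

    def read(self, key, snapshot):
        for version, value in reversed(self.data.get(key, [])):
            if version <= snapshot:
                return value
        return None

store = VersionedStore()
store.write("k", "v1")        # committed at version 1
snap = store.clock            # a transaction takes its snapshot here
store.write("k", "v2")        # version 2: invisible to that snapshot
print(store.read("k", snap))  # 'v1'
```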
2013
September
NLify: Lightweight Spoken Natural Language Interfaces via Exhaustive Paraphrasing.
Seungyeop Han, Matthai Philipose, and Yun-Cheng Ju.
Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp). Honorable Mention.
abstract pdf
This paper presents the design and implementation of a programming system that enables third-party developers to add spoken natural language (SNL) interfaces to standalone mobile applications. The central challenge is to create statistical recognition models that are accurate and resource-efficient in the face of the variety of natural language, while requiring little specialized knowledge from developers. We show that given a few examples from the developer, it is possible to elicit comprehensive sets of paraphrases of the examples using internet crowds. The exhaustive nature of these paraphrases allows us to use relatively simple, automatically derived statistical models for speech and language understanding that perform well without per-application tuning. We have realized our design fully as an extension to the Visual Studio IDE. Based on a new benchmark dataset with 3500 spoken instances of 27 commands from 20 subjects and a small developer study, we establish the promise of our approach and the impact of various design choices.
2013
August
Expressive Privacy Control with Pseudonyms.
Seungyeop Han, Vincent Liu, Qifan Pu, Simon Peter, Thomas Anderson, Arvind Krishnamurthy, and David Wetherall.
Proceedings of the ACM SIGCOMM Conference (SIGCOMM).
abstract pdf
As personal information increases in value, the incentives for remote services to collect as much of it as possible increase as well. In the current Internet, the default assumption is that all behavior can be correlated using a variety of identifying information, not the least of which is a user's IP address. Tools like Tor, Privoxy, and even NATs, are located at the opposite end of the spectrum and prevent any behavior from being linked. Instead, our goal is to provide users with more control over linkability---which activities of the user can be correlated at the remote services---not necessarily more anonymity. We design a cross-layer architecture that provides users with a pseudonym abstraction. To the user, a pseudonym represents a set of activities that the user is fine with linking, and to the outside world, a pseudonym gives the illusion of a single machine. We provide this abstraction by associating each pseudonym with a unique, random address drawn from the IPv6 address space, which is large enough to provide each device with multiple globally-routable addresses. We have implemented and evaluated a prototype that is able to provide unlinkable pseudonyms within the Chrome web browser in order to demonstrate the feasibility, efficacy, and expressiveness of our approach.
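The pseudonym-to-address mapping can be sketched in a few lines (the /64 prefix below is the IPv6 documentation prefix, used as a placeholder): each pseudonym draws a fresh random interface identifier from the host's prefix, so different pseudonyms surface as unrelated globally-routable addresses.

```python
# Toy pseudonym addressing: one random IPv6 address per pseudonym, drawn
# from the (placeholder) site prefix.
import ipaddress
import secrets

def pseudonym_address(prefix="2001:db8:1234:5678::/64"):
    net = ipaddress.IPv6Network(prefix)
    suffix = secrets.randbits(128 - net.prefixlen)  # random interface id
    return ipaddress.IPv6Address(int(net.network_address) + suffix)

work, shopping = pseudonym_address(), pseudonym_address()
print(work)      # activities under "work" share this address...
print(shopping)  # ...and are unlinkable to those under "shopping"
```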
2013
May
The Case for Onloading Continuous High-Datarate Perception to the Phone.
Seungyeop Han and Matthai Philipose.
Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS).
abstract pdf
Much has been said recently on off-loading computations from the phone. In particular, workloads such as speech and visual recognition that involve models based on "big data" are thought to be prime candidates for cloud processing. We posit that the next few years will see the arrival of mobile usages that require continuous processing of audio and video data from wearable devices. We argue that these usages are unlikely to flourish unless substantial computation is moved back on to the phone. We outline possible solutions to the problems inherent in such a move. We advocate a close partnership between perception and systems researchers to realize these usages.
2011
May
Operating Systems Implications of Fast, Cheap, Non-Volatile Memory.
Katelin Bailey, Luis Ceze, Steven D. Gribble, and Henry M. Levy.
Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS).
abstract pdf
The existence of two basic levels of storage (fast/volatile and slow/non-volatile) has been a long-standing premise of most computer systems, influencing the design of OS components, including file systems, virtual memory, scheduling, execution models, and even their APIs. Emerging resistive memory technologies - such as phase-change memory (PCM) and memristors - have the potential to provide large, fast, non-volatile memory systems, changing the assumptions that motivated the design of current operating systems. This paper examines the implications of non-volatile memories on a number of OS mechanisms, functions, and properties.