Abstracting the Geniuses Away from Failure Testing: Ordinary users need tools that automate the selection of custom-tailored faults to inject.

Authors: Peter Alvaro, Severine Tymon

Pages 29-53. Published: 01 October 2017

Abstract

This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.

Published In

Queue, Volume 15, Issue 5, Cryptocurrency, September-October 2017. ISSN: 1542-7730. EISSN: 1542-7749. DOI: 10.1145/3155112. Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States