Q

'Failure by design': Advice for surviving disaster in the public cloud

In this expert answer, contributor Chris Moyer offers advice on taking a 'failure by design' approach to prevent public-cloud catastrophe.

At a recent Gartner conference, a keynoter said, "If you're in the public cloud, you're going to be dealing with failure. Learn to deal with it. Design for failure." Any advice on how to approach failure by design?

Netflix is famous for making "design for failure" a popular catchphrase. Netflix officials have described failure by design as a feature, rather than a bug.

Netflix started off by trying to catch failing servers or broken code, but eventually realized that the solution isn't to prevent every failure; after all, things will always go wrong. What's important instead is to design for failure so that failures have minimal impact -- or no impact at all.

One great way to approach failure by design: Make sure you have no single points of failure. It's important to constantly test your production systems, ensuring that if one server dies, it's not going to create major issues. There are several ways you can do that. The most important: Make sure that you can automatically detect a problem, and then automatically repair it if possible.
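As a rough illustration of "detect automatically, then repair automatically," here is a minimal Python sketch. It simulates everything in memory: the `Server` and `Monitor` classes, the `probe` and `launch` callables, and the server names are all hypothetical stand-ins for a real health check and a real provisioning call, not part of any Netflix or AWS tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Monitor:
    # Hypothetical hooks: a real system would probe over the network
    # and call its cloud provider to launch a replacement.
    probe: Callable[[Server], bool]
    launch: Callable[[str], Server]
    servers: Dict[str, Server] = field(default_factory=dict)

    def sweep(self) -> List[str]:
        """Check every server once and replace any that fail the probe."""
        replaced = []
        for name, server in list(self.servers.items()):
            if not self.probe(server):
                self.servers[name] = self.launch(name)
                replaced.append(name)
        return replaced


def run_demo() -> List[str]:
    monitor = Monitor(
        probe=lambda s: s.healthy,
        launch=lambda name: Server(name),
    )
    for n in ("web-1", "web-2", "web-3"):
        monitor.servers[n] = Server(n)
    monitor.servers["web-2"].healthy = False  # simulate a dead server
    return monitor.sweep()
```

In practice `sweep` would run on a schedule, and no single machine would be the only one running it; otherwise the monitor itself becomes a single point of failure.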

Netflix has open-sourced several pieces of software that help with designing for public-cloud failure. These tools work by deliberately causing failures, so you can confirm that your systems survive them. Netflix started with Chaos Monkey, which randomly kills Amazon Web Services Elastic Compute Cloud (better known as EC2) instances to test whether apps will survive such failures. The idea is that every instance you run in AWS should be in an auto-scaling group, so when one goes down, another automatically replaces it.
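To make the idea concrete, here is a toy Python simulation of that loop: kill a random instance, then let the group restore itself. This is not Chaos Monkey's code and it does not call AWS; the `AutoScalingGroup` class, the instance ID format, and the function names are assumptions made up for this sketch.

```python
import random
from itertools import count


class AutoScalingGroup:
    """Toy model of a group that keeps a fixed number of instances running."""

    def __init__(self, desired: int):
        self.desired = desired
        self._ids = count(1)
        self.instances = set()
        self.reconcile()

    def reconcile(self) -> None:
        # Launch replacements until the group is back at its desired size.
        while len(self.instances) < self.desired:
            self.instances.add(f"i-{next(self._ids):04d}")

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)


def chaos_round(group: AutoScalingGroup, rng: random.Random) -> str:
    """Kill one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


def run_experiment(rounds: int = 5, seed: int = 0) -> bool:
    rng = random.Random(seed)
    group = AutoScalingGroup(desired=3)
    for _ in range(rounds):
        victim = chaos_round(group, rng)
        # Some capacity must remain right after the kill...
        if not group.instances:
            return False
        group.reconcile()
        # ...and the group must return to full size without the victim.
        if len(group.instances) != group.desired or victim in group.instances:
            return False
    return True
```

Calling `run_experiment()` returns `True` only if the group recovers after every kill. With `desired=1`, the check after the kill fails, which is the simulated version of discovering a single point of failure.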

Netflix also offers an open-source management and deployment tool called Asgard, a dashboard app that you can use to get an overview of your system architecture as a whole. It's available free to AWS users.

The biggest thing to keep in mind: You can't really test system architecture on a small scale. To really test whether a system can handle the full load it's going to get, you've got to go full scale. Good luck!

This was first published in October 2013
