Designing for Failure: The Changing Face of Reliability

I’ve written about reliability and resilience before, but the topic is so important it’s worth revisiting, this time with a real-world example I think you’ll appreciate.

Imagine the pressure the architects and engineers were under when they designed and built the Channel Tunnel, which connects England to France by rail. The so-called “Chunnel” would have to transport millions of people a year safely, at speeds over 160 kilometers per hour, across 37.9 undersea kilometers.

With so many lives at stake, the designers had to eliminate all possibility of failure. Wrong. In fact, in building the Channel Tunnel, the designers expected failure of individual components. That’s why they built three interconnected tunnels: two of them to accommodate rail traffic, and one in the middle for maintenance, but also to serve as an emergency escape route, if needed.

The designers built in “cross-over” passages every 250 meters, connecting the transport tunnels to the service tunnel. Since the Channel Tunnel’s opening in 1994, there have been five fires reported on heavy goods vehicle (HGV) shuttles, but no fatalities. There have been several passenger evacuations as well, but because both mechanical and procedural failures were expected, the net result was passenger inconvenience due to delay, not life-threatening situations resulting in loss of life.

Now, what does all this have to do with software engineering? A lot, actually. The same shift in mindset I’m advocating today must have taken place among the Channel Tunnel designers some 30 years ago. Their task: convincing passengers and regulators that isolated failures were inevitable, so the design goal was to anticipate and account for those failures while keeping the trains running as much of the time as possible. This is precisely the shift in thinking that’s needed in cloud-based software design.

Traditional approaches often equate reliability with preventing failure, but cloud-based systems are inherently different, so the design approach has to be different too. In the cloud, you’re dealing with greater complexity, more interdependencies, and unknown demand: millions of people could end up using your app. All of these factors put more pressure on the system and add complexity, and thus create more opportunity for failure.

So while you may not want to scream it from the rooftops, with cloud services, failure is, in fact, inevitable. Because of that, cloud services need to be designed like the Chunnel: to contain failures and recover from them quickly. Once we shift to this mindset, we can include engineering practices like fault modeling and fault injection in a continual release process that helps us build more reliable cloud systems.
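To make the idea concrete, here’s a minimal Python sketch of the two halves of that mindset (all names here are hypothetical illustrations, not any particular tool or Microsoft API): a fault-injection wrapper that makes a dependency fail some fraction of the time, and a retry loop that contains those injected failures so the caller still gets an answer.

```python
import random


class TransientError(Exception):
    """Stands in for a transient dependency failure (network blip, timeout)."""


def inject_faults(fn, failure_rate, rng):
    """Fault injection: wrap fn so a fraction of calls fail artificially."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TransientError("injected fault")
        return fn(*args, **kwargs)
    return wrapped


def call_with_retry(fn, attempts=5):
    """Containment: retry transient failures instead of assuming they can't happen."""
    last = None
    for _ in range(attempts):
        try:
            return fn()
        except TransientError as exc:
            last = exc  # failure was expected by design; try again
    raise last  # failure persisted past our containment budget


# Exercise a (hypothetical) lookup through the fault injector: individual
# calls may fail, but the retry loop keeps the overall service available.
rng = random.Random(42)
flaky_lookup = inject_faults(lambda: "order-details", failure_rate=0.3, rng=rng)
result = call_with_retry(flaky_lookup)
```

The point of running such a wrapper as part of a continual release process is that recovery paths get exercised on every build, not discovered during an outage.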

In this rapidly changing world of ours, this is the new face of reliability. It is no longer about preventing failure. It is about designing resilient services in which inevitable failures have a minimal effect on service availability and functionality.

Read my earlier blog for more details on fundamental design considerations for cloud-based services. Or, to learn more about designing for failure, watch FailSafe Core Concepts, a webinar by Ulrich Homann, Chief Architect, and Marc Mercuri, Senior Director, Microsoft Services, on the key considerations for developing scalable, resilient applications.

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs.