Building reliable cloud services

Reliability continues to be top of mind for everyone involved with online services, from the engineering teams who want to design and build a robust service and minimize live site incidents, to the service providers who want to reduce the impact of incidents on their customers by delivering a predictable and consistent outcome.

Today we are publishing an updated version of our whitepaper “Introduction to Designing Reliable Cloud Services”.

The paper describes fundamental reliability concepts and a reliability design-time process for organizations that create, deploy, and/or consume cloud services. It is designed to help decision makers understand the factors and processes that make cloud services more reliable.

One of the key concepts covered is the importance of building resiliency into the service in an effort to improve reliability. All service providers strive to deliver a reliable service for their customers, but the reality is that things sometimes go wrong. Despite persistent reliability-related threats, a resilient service should remain fully functional and allow customers to perform the tasks necessary to complete their work.

Services should be designed to:
•    Minimize the impact of a failure on any given customer.
•    Minimize the number of customers affected by a failure.
•    Reduce the number of minutes that a customer (or customers) cannot use the service in its entirety.

It’s vital that organizations think about how their service will and should operate when a known failure condition occurs. For example, what should the service do when another cloud service on which it depends is not available? What should the service do when it can’t connect to its primary database? How should the service react when there is a sudden increase in traffic, pushing capacity toward its upper limit?
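To make the database question concrete, here is a minimal sketch of one possible answer: try the primary source first, then fall back to a secondary one rather than failing the request outright. It is an illustration only and does not come from the whitepaper. The names `read_with_fallback`, `primary_db`, `read_replica`, and `AllSourcesFailed` are hypothetical, and the sketch assumes that an unreachable source raises Python's built-in `ConnectionError`.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllSourcesFailed(Exception):
    """Raised when every source in the fallback chain has failed (hypothetical name)."""


def read_with_fallback(
    sources: Sequence[Callable[[], T]],
    attempts_per_source: int = 2,
) -> T:
    """Return the first successful result, trying each source in order.

    Each source is attempted up to `attempts_per_source` times before the
    next one is tried. Only ConnectionError is treated as a known failure
    condition; any other exception propagates to the caller unchanged.
    """
    last_error: Optional[Exception] = None
    for source in sources:
        for _ in range(attempts_per_source):
            try:
                return source()
            except ConnectionError as exc:
                last_error = exc  # remember the failure and keep going
    raise AllSourcesFailed("no data source was reachable") from last_error


# Hypothetical stand-ins for real data sources (assumptions for illustration).
def primary_db() -> dict:
    raise ConnectionError("primary database unreachable")


def read_replica() -> dict:
    # A replica may serve slightly stale data, which is flagged for the caller.
    return {"user": "alice", "stale": True}


result = read_with_fallback([primary_db, read_replica])
print(result)
```

The design choice here is degraded service over no service: the customer still gets an answer, marked as possibly stale, while the primary is down. Whether stale data is acceptable depends on the service, so this is a decision to make at design time rather than during an incident.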

In my experience there are three main causes of failure:
1.    Device and infrastructure failure – these range from the expected/end-of-life device failures, to catastrophic failures often caused by natural disasters or accidents that are out of an organization’s control.
2.    Human error – administrator error or configuration mistakes that are often out of an organization’s control.
3.    Software imperfections – code flaws and software-related issues in the deployed online service. Pre-release testing can control this to some degree.

To help organizations think through the implications of these three causes and the mitigation strategies to deal with them, we recommend resilience modeling. In this paper, we discuss failure mode and effects analysis (FMEA), an industry-recognized method for identifying dependencies and limiting potential points of failure.
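As a rough illustration of how an FMEA exercise can rank failure modes, the sketch below uses the risk priority number (severity × occurrence × detection) from classical FMEA. This is an assumption made for the example: the whitepaper may score risk differently. The `FailureMode` class, the `rank` helper, and every score shown are hypothetical and chosen only to mirror the three causes of failure listed above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet."""

    component: str
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always detected) to 10 (effectively undetectable)

    def __post_init__(self) -> None:
        # Reject scores outside the conventional 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number as used in classical FMEA."""
        return self.severity * self.occurrence * self.detection


def rank(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest risk priority."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative scores only; real values come from the team's own analysis.
modes = [
    FailureMode("storage node", "disk reaches end of life", 6, 4, 2),
    FailureMode("deployment config", "administrator pushes a bad setting", 8, 3, 5),
    FailureMode("request handler", "code flaw missed in pre-release testing", 7, 5, 6),
]

ranked = rank(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.component}: {mode.description}")
```

Sorting by a single number is a simplification, but it gives a team a starting order for deciding which mitigations to design first.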

The figure below shows where FMEA fits in the basic development process.

[Figure: Overview of the Basic Development Process. Source: Microsoft]

Designing a resilient service and implementing robust recovery mechanisms is an iterative process. Design iterations are fluid and take into account both information garnered from pre-release testing and data about how the service performs after it has been deployed.

If you’re designing or building an online service, I encourage you to read this paper to gain insight into how to build a more resilient and reliable service. Similarly, we encourage cloud customers to look at how their provider is building resilience into its services.

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups …