Fundamentals of Cloud Service Reliability

By David Bills

September 12, 2012


As adoption of cloud computing continues to rise and customers demand 24/7 access to their services and data, reliability remains a challenge for cloud service providers everywhere. As I said in the recent Cloud Fundamentals video on reliability, it’s not a matter of if an outage will occur; it’s strictly a matter of when. This makes it critical for organizations to understand how best to design and deliver reliable cloud services. Microsoft manages a cloud-based infrastructure supporting more than 200 services, 1 billion customers, and 20 million businesses in more than 76 markets worldwide, so we understand what it takes to build and deliver highly reliable cloud platforms, solutions, and services that are secure and private.

Today Microsoft has released a new whitepaper titled “An introduction to designing reliable cloud services.” The paper describes how Microsoft is thinking about key reliability concepts and outlines a reliability design and implementation process that might be useful for organizations that create, deploy, or consume cloud services. Designing and delivering reliable services is complex, and this paper aims to be a catalyst for further discussion among service teams, organizations, and the industry itself.

If we accept the fact that failures will occur, then the outcomes organizations may want to consider in relation to their cloud services fall into four main categories:

Maximize service availability to customers. Make sure the service does what the customer wants, when they want it, as much of the time as possible.
Minimize the impact of any failure on customers. Assume something will go wrong and design the service so that the non-critical components fail first while the critical components keep working. Isolate the failure as much as possible so that as few customers as possible are affected. And if the service goes down completely, focus on reducing the amount of time any one customer cannot use the service at all.
Maximize service performance. Reduce the effect on customers at times when performance may degrade, such as during an unexpected spike in traffic.
Maximize business continuity. Focus on how the organization and the service respond when a failure occurs. Automate recovery where possible, and carry out disaster recovery drills to ensure the organization is fully prepared to deal with the inevitable failure.
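
To make the second category more concrete, here is a minimal Python sketch of letting a non-critical component fail while the critical one keeps working. It is only an illustration under stated assumptions, not something taken from the whitepaper: the order page, the recommendation service, and every function name below are hypothetical.

```python
from typing import Callable, Dict, List


def render_order_page(
    load_order: Callable[[str], Dict[str, str]],
    load_recommendations: Callable[[str], List[str]],
    order_id: str,
) -> Dict[str, object]:
    """Build a page from one critical and one non-critical dependency.

    All names here are hypothetical and exist only for illustration.
    """
    # Critical path: if this raises, the request fails, because the page
    # is useless without the order itself.
    order = load_order(order_id)

    # Non-critical path: a failure here is contained, and the page is
    # served without recommendations instead of failing outright.
    try:
        recommendations = load_recommendations(order_id)
        degraded = False
    except Exception:
        recommendations = []
        degraded = True

    return {
        "order": order,
        "recommendations": recommendations,
        "degraded": degraded,
    }


def healthy_orders(order_id: str) -> Dict[str, str]:
    # Stand-in for a critical dependency that is working normally.
    return {"id": order_id, "status": "shipped"}


def failing_recommendations(order_id: str) -> List[str]:
    # Stand-in for a non-critical dependency that is currently down.
    raise TimeoutError("recommendation service unavailable")


page = render_order_page(healthy_orders, failing_recommendations, "A-100")
print(page)
```

In this sketch the customer still sees the order even though the recommendation call failed, and the degraded flag gives operators something to monitor and alert on. A real service would typically add timeouts and narrower exception handling around the non-critical call.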

Please download the whitepaper to read more about the fundamental concepts of reliability and gain insight into how Microsoft is thinking about this topic and how organizations and teams might work together to improve cloud service reliability.