Cloud services: building resiliency and business continuity

The complex nature of cloud computing means that, as cloud service providers, we need to be mindful that things will go wrong – it’s not a case of if, it’s strictly a matter of when. Cloud providers need to design and build their services in such a way as to maximize reliability and minimize the impact on customers when things do go wrong. A key facet of this approach is business continuity: ensuring that critical business functions continue to be available, even in the event of a catastrophe. With that in mind, I was recently interviewed for the winter edition of the Disaster Recovery Journal, a journal that focuses on the business continuity planning profession.

When I talk about reliability, I’m referring to the outcome all service providers aim for: a service that works as it was designed to and responds in a predictable fashion when it is needed. One way to improve reliability is to build a service that is resilient – one that can withstand certain types of failure and yet remain fully functional from the customers’ perspective.

Traditionally, as an industry, we have focused on implementing resiliency at the hardware level (think RAID, redundant pairs of devices and multiple paths). We have also benefitted from significant enhancements made by hardware manufacturers in the long-term reliability of the devices they produce. But I think it’s important not to become overly reliant on resiliency being architected and implemented exclusively at the hardware level. Software plays an important role here too. It’s the combination of intelligent infrastructure design and equally intelligent software design, delivered in tandem, that results in highly reliable cloud-based services.

There are many challenges associated with designing and building highly reliable services. As I mentioned earlier, cloud services are complex. One of the greatest challenges designers face is the potential unpredictability of the production environment in which cloud-based services operate at Internet scale. The extensive use of shared infrastructure and a growing dependency on capabilities supplied by third parties make it difficult for cloud providers to have direct control over all aspects of their cloud service. In response, designers need to think about how the cloud-based service they’re building is going to cope with this level of potential unpredictability.
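To make the idea of coping with an unpredictable dependency a little more concrete, here is a minimal sketch of one common software-level technique: retry a call a bounded number of times with exponential backoff, then fall back to a degraded result instead of failing outright. This is only an illustration, not a description of how any particular cloud service is built. The function name `call_with_fallback` and its parameters are hypothetical, chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` up to `attempts` times, then degrade to `fallback`.

    All names here are hypothetical; `sleep` is injectable so the
    backoff can be skipped or recorded in tests.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Back off exponentially between tries, but do not wait
            # after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: return a degraded but usable result.
    return fallback()
```

A caller might pass a request to a third-party capability as `primary` and a cached or default value as `fallback`, so that the customer still gets a response when the dependency is unavailable. Whether a degraded answer is acceptable is a design decision that depends on the service.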

In addition to considering ways to build resiliency into cloud services at multiple levels, it’s important to plan for those times when things do go wrong. Business continuity and disaster recovery are also critical because reliability is more than just being predictable and dependable when everything is operating under normal circumstances – it’s also about being able to respond in a reasonable manner when something unpredictable happens.

Read the full article on the Disaster Recovery Journal site.

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups…