Whenever I speak to customers and partners about reliability, I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization, or a provider of a service, you want your customers to be able to carry out the tasks they want to, whenever they want to do so.
This article is the first in a four-part series on building a resilient service. In my first two posts, I will discuss the topic as it relates to business strategy, and then we’ll dive deeper into the technical details. The full series of four posts will cover:
1. Reliability vs. resilience – What is the difference between reliability and resilience and why does it matter?
2. Common reliability-related threats – DIAL (Discovery, Incorrectness, Authorization/Authentication, Limits/Latency) is a handy mnemonic that helps teams brainstorm, in a structured way, how the interactions between their service’s components could fail. Brainstorming about failure modes and failure points is a key phase in resilience modeling and analysis (RMA) and can help teams improve the reliability of their service.
3. Reliability-enhancing techniques – Taking the “D” and “A” in DIAL, we’ll look at some reliability-enhancing techniques related to discovery and authentication that you can incorporate into your design.
4. Reliability-enhancing techniques – Taking the “I” and “L” in DIAL, we’ll look at some reliability-enhancing techniques related to incorrectness and limits that you can incorporate into your design.
My intention is to provide insight into how Microsoft thinks about reliability and the processes and techniques we’re employing to improve the reliability of our services for our customers.
So what is reliability? When I ask customers and partners, the most common responses refer to consistency in performance, speed, availability and, perhaps most significantly, resilience. One thing we all agree on is that for a system or service to be reliable, the user has to believe ‘it just works’.
The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society states reliability [engineering] is “a design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle.” For software, it defines reliability as “the probability of failure-free software operation for a specified period of time in a specified environment.”
A reliable cloud service is essentially one that functions as the designer intended it to, when it is expected to, and wherever the customer is connected. That’s not to say every component must operate flawlessly 100 percent of the time. This last point brings us to what I believe is the difference between reliability and resilience.
Reliability is the outcome cloud service providers strive for. Resilience is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customer’s perspective. In other words, reliability is the result, and resilience is the way you achieve it. A service could be characterized as reliable simply because no part of it has ever failed, and yet it couldn’t be regarded as resilient, because its ability to withstand failure may never have been tested.
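To make the distinction concrete, here is a minimal sketch in Python of what "withstanding a failure while remaining functional from the customer's perspective" might look like in code. It is an illustration only, not a description of how any Microsoft service is built: the names resilient_call, flaky_recommendations and cached_recommendations are hypothetical, and the retry-then-fall-back approach is just one of many possible techniques.

```python
from typing import Callable


class ComponentFailure(Exception):
    """Raised by a (hypothetical) dependency that cannot serve the request."""


def resilient_call(primary: Callable[[], str],
                   fallback: Callable[[], str],
                   attempts: int = 2) -> str:
    """Try the primary component a few times, then degrade to the fallback."""
    for _ in range(attempts):
        try:
            return primary()
        except ComponentFailure:
            continue  # treat the failure as transient and try again
    # The primary never succeeded: serve a degraded but working answer.
    return fallback()


def flaky_recommendations() -> str:
    # Hypothetical dependency that is currently down.
    raise ComponentFailure("recommendation service unavailable")


def cached_recommendations() -> str:
    # Hypothetical degraded answer, for example served from a local cache.
    return "popular items (cached)"


result = resilient_call(flaky_recommendations, cached_recommendations)
print(result)
```

In this sketch a component has failed, yet the customer still receives a usable answer. The fallback path only shows that it works when the primary actually fails, which is why a service that has never experienced a failure cannot yet be called resilient.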
The key takeaway here is to focus on resilience, designing and building it into your service at every stage of the software development lifecycle. To find out more about the fundamentals of building a reliable online service, read our whitepaper ‘An introduction to designing reliable cloud services’.