Reliability Series #2: Categorizing reliability threats to your service

Online services face ongoing reliability-related threats represented by device failures, latent flaws in software being triggered by environmental change, and mistakes made by human beings. At Microsoft, one of the ways we’re helping to improve the reliability of our services is by investing in resilience modeling and analysis (RMA) as a way for online service engineering teams to incorporate robust resilience design into the development lifecycle. You can read more about RMA in our whitepaper.

A key phase of RMA is the ‘Discover’ phase, during which teams enumerate and record potential failures for every component interaction shown on the diagram for their service. To ensure an effective brainstorming session, all engineering disciplines should actively participate in that session.

DIAL (Discovery, Incorrectness, Authorization/Authentication, Limits/Latency) is a helpful acronym for four common categories of reliability threat. Working through the categories in order gives service teams a structured way to brainstorm the vast majority of possible failure scenarios for their service.

Here’s an example “DIAL” list:

Discovery
•    Caller cannot locate the resource due to configuration errors.
•    Caller cannot locate the resource due to name resolution errors.

Incorrectness
•    Caller receives an error because the request is syntactically incorrect.
•    Request does not complete successfully due to corrupt or malformed data being returned.
•    Malformed configuration settings render the caller inoperable.
•    Caller receives an error because the state of the resource being requested is incompatible with the request itself.

Authorization/Authentication
•    Caller receives an authentication error.
•    Caller receives an authorization failure.

Limits/Latency
•    Caller receives no response from the resource, resulting in a timeout or in blocking on the called resource.
•    Caller receives an error related to exceeding limits on the called resource.
•    Caller receives a successful response, but at a very slow rate, causing queue lengths to be exceeded.
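
To make the list easier to apply during a 'Discover' session, it can be turned into a checklist that pairs every component interaction with every DIAL prompt. The following is only a minimal sketch, not part of RMA or its templates: the names `Dial`, `PROMPTS`, and `brainstorm_checklist`, and the example interaction, are hypothetical, and the prompts are shortened paraphrases of the example list above.

```python
from enum import Enum


class Dial(Enum):
    """The four DIAL categories, in acronym order."""
    DISCOVERY = "Discovery"
    INCORRECTNESS = "Incorrectness"
    AUTH = "Authorization/Authentication"
    LIMITS_LATENCY = "Limits/Latency"


# Hypothetical prompt table: shortened paraphrases of the example list.
PROMPTS = {
    Dial.DISCOVERY: [
        "cannot locate the resource (configuration error)",
        "cannot locate the resource (name resolution error)",
    ],
    Dial.INCORRECTNESS: [
        "request is syntactically incorrect",
        "corrupt or malformed data is returned",
        "malformed configuration settings make the caller inoperable",
        "resource state is incompatible with the request",
    ],
    Dial.AUTH: [
        "authentication error",
        "authorization failure",
    ],
    Dial.LIMITS_LATENCY: [
        "no response (timeout or blocking)",
        "limits exceeded on the called resource",
        "response too slow, queue lengths exceeded",
    ],
}


def brainstorm_checklist(interactions):
    """Yield (caller, resource, category, prompt) for every interaction."""
    for caller, resource in interactions:
        for category in Dial:  # Enum members iterate in definition order
            for prompt in PROMPTS[category]:
                yield caller, resource, category.value, prompt


if __name__ == "__main__":
    # Hypothetical interaction taken from an imagined service diagram.
    for row in brainstorm_checklist([("Web front end", "Order database")]):
        print(" | ".join(row))
```

With this table, each interaction on the diagram produces 11 rows to discuss, grouped in D, I, A, L order. A team would replace the prompts with its own wording and add rows as new failure types are discovered.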

DIAL is not intended to be an exhaustive catalog of all possible reliability-related threat categories. However, the list gives teams a structured way to think through reliability threat categories, which makes failure identification easier and reduces the chance that entire classes of failures are overlooked. For example, when Component A makes a request to Resource B, issues may be encountered in the following logical sequence:

1.    Resource B may not exist or may not be found. If it is found,
2.    Component A may not successfully authenticate with Resource B. If it does authenticate,
3.    Resource B may be slow or not respond to the request issued by Component A, and so on.
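
The sequence above can be sketched as an ordered series of checks that reports the first category in which a request would fail. This is an illustration under stated assumptions, not a real client: `SimulatedResource`, its fields, and `first_failure` are hypothetical names, and the one-second timeout is an arbitrary default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimulatedResource:
    """Hypothetical stand-in for Resource B."""
    reachable: bool = True                     # can Component A find it?
    accepts_caller: bool = True                # does authentication succeed?
    response_seconds: Optional[float] = 0.05   # None means no response at all


def first_failure(resource: SimulatedResource,
                  timeout_seconds: float = 1.0) -> Optional[str]:
    """Return the first DIAL category that fails, or None if none do."""
    # 1. Resource B may not exist or may not be found.
    if not resource.reachable:
        return "Discovery"
    # 2. Component A may not successfully authenticate with Resource B.
    if not resource.accepts_caller:
        return "Authorization/Authentication"
    # 3. Resource B may be slow or may not respond.
    if (resource.response_seconds is None
            or resource.response_seconds > timeout_seconds):
        return "Limits/Latency"
    return None


if __name__ == "__main__":
    print(first_failure(SimulatedResource()))
    print(first_failure(SimulatedResource(accepts_caller=False)))
    print(first_failure(SimulatedResource(response_seconds=None)))
```

Because the checks run in order, a later failure is only reported once the earlier ones pass, which mirrors the "If it is found," and "If it does authenticate," wording of the list. Incorrectness checks are omitted here because the sequence in the text does not place them explicitly.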

DIAL can be a useful reference for teams when working through the ‘Discover’ phase of RMA. You can read more about resilience modeling and analysis (RMA) and download example templates in our whitepaper, ‘Resilience by design for cloud services’.

Previous post in this series: Reliability Series #1: Reliability vs. resilience

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups…