Reliability Series #3: Reliability-enhancing techniques (Part 1)

In my previous post, I discussed “DIAL”, an approach we use to categorize common service component interaction failures when applying Resilience Modeling & Analysis (RMA) to an online service design. In this post and the next, I’ll discuss some mitigation strategies and design patterns intended to reduce the likelihood of the types of failures described by “DIAL”.

When considering mitigations, first decide where in your design a given failure is best mitigated. Mitigation techniques should only be applied in service components that have sufficient awareness of the failing scenario (think “context”); otherwise, the best recourse is to pass the error back to the requesting component further upstream in the component interaction chain. Why? We want the component that decides how to handle the failure to know enough about the scenario to choose the appropriate course of action.

Mitigations applied too deep in the scenario stack can have undesirable effects: increased queue lengths, delays and code complexity. Imagine a Component Interaction Diagram (CID) that shows a web role requesting an image to use when rendering a page of HTML. If this request fails, the web role has enough context to simply render the page without the image. The underlying component issuing the request has no such context, and if it retries the image retrieval over and over before passing control back to the upstream component, it can cause unnecessarily long page rendering times for the end user. In fact, well-meaning but inappropriately applied mitigation strategies can intensify the severity of some live site incidents.
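The web role example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_image`, `render_page`, `ImageUnavailable` and the injected `get` transport are hypothetical names, not part of any real service. The low-level component reports the failure once and does nothing else, and the web role, which has the scenario context, decides to render without the image.

```python
from typing import Callable


class ImageUnavailable(Exception):
    """Raised by the (hypothetical) image component when a fetch fails."""


def fetch_image(url: str, get: Callable[[str], bytes]) -> bytes:
    # Low-level component: it has no scenario context, so it does not
    # retry or substitute anything. It simply passes the error upstream.
    try:
        return get(url)
    except OSError as exc:
        raise ImageUnavailable(url) from exc


def render_page(title: str, image_url: str, get: Callable[[str], bytes]) -> str:
    # Web role: it knows the image is optional for this page, so it is
    # the right place to decide to render without it.
    try:
        fetch_image(image_url, get)
        image_tag = f'<img src="{image_url}">'
    except ImageUnavailable:
        image_tag = ""
    return f"<html><body><h1>{title}</h1>{image_tag}</body></html>"
```

Keeping the fallback decision in `render_page` means the page is delayed by at most one failed request, rather than by however many retries a lower layer might choose to make.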

One appropriate mitigation for components lacking scenario awareness is the so-called “Circuit Breaker” pattern. This pattern, explained in detail here, protects unhealthy downstream components from unnecessary load resulting from “retry storms” that can delay or block recovery. Once the number of failures exceeds a threshold and the circuit breaker trips, the component with full scenario context quickly receives a “…your request has failed” signal that’s straightforward to interpret and act upon appropriately.
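To make the idea concrete, here is a minimal circuit breaker sketch in Python. It is not a reference implementation of the pattern: the class name, the `failure_threshold` and `reset_timeout` parameters, and the half-open trial behavior are assumptions chosen for illustration, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Fast 'your request has failed' signal returned while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast and spare the unhealthy downstream component.
                raise CircuitOpenError("circuit open; request not attempted")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch, url)`, and treat `CircuitOpenError` as an immediate failure to hand to the upstream component that has the scenario context.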

In this article I’ll discuss the Discovery and Authorization/Authentication categories represented by the “DIAL” acronym, and I’ll also share example mitigations targeting specific failure modes.

Discovery and Authorization/Authentication
Both the Discovery and the Authorization/Authentication categories relate to prerequisite interactions for establishing a successful connection between service components. The types of failures you may see in these categories vary by implementation. For both Discovery and Authorization/Authentication, helper services (such as DNS for name resolution, or Active Directory, Microsoft Service Account, etc. for authentication) may be employed, and unhealthy or misconfigured helper services (incorrect name records, missing accounts, etc.) can prevent one service component from accessing another.

If the helper service is unhealthy and returning errors that can broadly be categorized as ‘Server Error’, the cause of the failure may be transient. In these cases, use the “Retry” pattern, explained in detail here, to re-attempt the original request after an appropriate interval, since the helper service may have recovered by then. Sustained failures in helper services require other mitigations, which fall under the “Limits” category.
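A retry against a helper service might look like the following Python sketch. The exception classes, the function name and its parameters are hypothetical, and exponential backoff with jitter is only one reasonable reading of “an appropriate interval”; the original pattern description may recommend something different. Only errors marked as transient are retried, and anything else propagates immediately.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for 'Server Error' style failures that may clear up."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def call_with_retry(func, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                # Retry budget exhausted: surface the failure to the caller.
                raise
            # Exponential backoff, capped at max_delay, with random jitter so
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Bounding the number of attempts matters here: an unbounded retry loop is exactly the kind of deep mitigation that lengthens queues and contributes to retry storms.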

If a helper service fails due to misconfiguration, such as incorrect server name records or missing accounts and permissions, the fault is not transient and will only be resolved by applying configuration changes to the helper service. If you identify a failure mode involving misconfiguration, focus instead on monitoring and detection to flag configuration-related problems quickly, and take immediate corrective action to roll back to a last known good version.
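The detect-and-roll-back idea can be sketched as follows. Everything here is hypothetical: `ConfigStore`, `apply_with_rollback` and the `probe` health check stand in for whatever configuration system and monitoring a real helper service uses. The sketch keeps a candidate configuration only if the probe passes, and otherwise restores the last known good version.

```python
class ConfigStore:
    """Hypothetical holder for the active and last known good configuration."""

    def __init__(self, initial):
        self.active = dict(initial)
        self.last_known_good = dict(initial)


def apply_with_rollback(store, candidate, probe):
    """Apply `candidate`; return True if it was kept, False if rolled back."""
    store.active = dict(candidate)
    try:
        # The probe plays the role of monitoring: it checks that the service
        # works under the new configuration (for example, a name resolves).
        healthy = bool(probe(store.active))
    except Exception:
        healthy = False
    if healthy:
        store.last_known_good = dict(candidate)
        return True
    # Detection failed the candidate: roll back immediately.
    store.active = dict(store.last_known_good)
    return False
```

In practice the probe would run continuously rather than once at apply time, but the decision it feeds is the same: keep the change or return to the last known good version.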

Discovery and Authorization/Authentication failures may also arise from misconfigured clients. In these cases the retry pattern is not effective, and the interaction will only succeed once the client configuration is corrected. A useful technique to combat this type of failure is the “Runtime Reconfiguration” pattern, explained in detail here, which enables instances to detect configuration errors and bring themselves back into a working state without requiring redeployment or an application restart. One way to apply this pattern is to authenticate using an alternate account or certificate when the original account or certificate no longer functions as expected.
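The alternate-credential example could be sketched like this in Python. The names `CredentialRotator`, `AuthenticationError` and the injected `authenticate` callable are assumptions for illustration only, and a real implementation would load credentials from a secure store rather than hold them in a list. The point is that the running instance switches to a working credential on its own, with no redeployment or restart.

```python
class AuthenticationError(Exception):
    """Hypothetical error raised when a credential is rejected."""


class CredentialRotator:
    def __init__(self, credentials):
        if not credentials:
            raise ValueError("at least one credential is required")
        self._credentials = list(credentials)
        self._index = 0  # position of the credential currently in use

    @property
    def current(self):
        return self._credentials[self._index]

    def connect(self, authenticate):
        # Try the current credential first, then each alternate exactly once.
        count = len(self._credentials)
        for offset in range(count):
            candidate = (self._index + offset) % count
            try:
                session = authenticate(self._credentials[candidate])
            except AuthenticationError:
                continue
            # Remember the working credential so later calls start with it.
            self._index = candidate
            return session
        raise AuthenticationError("no configured credential was accepted")
```

If every credential is rejected, the error still propagates upstream, which is also the moment to raise a monitoring alert so that the underlying configuration gets fixed.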

These common strategies and mitigation patterns can help services engineering teams build resiliency into online services. I’ll be back again soon with the fourth and final post in this series, in which I’ll discuss mitigation patterns for Limits/Latency and Incorrectness.

Previous posts in this series:

Reliability Series #1: Reliability vs. resilience

Reliability Series #2: Categorizing reliability threats to your service

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups …