Reliability Series #4: Reliability-enhancing techniques (Part 2)

In my previous post in this series, I discussed the Discovery and Authorization/Authentication categories of the “DIAL” acronym, along with example mitigations targeting specific failure modes. In this article I’ll discuss the remaining two categories, “Limits/Latency” and “Incorrectness,” and share example mitigations targeting specific failure modes for each.

Limits/Latency
The three most prevalent failure modes in this category are: Connection Request Timeout, Resource Unavailable, and Latency Increases Impacting Performance.

Connection Request Timeout errors are self-explanatory: a request has exceeded the time allocated to obtain a response. There is a critical assumption here, though: requests must specify timeout parameters; otherwise, significant blocking issues can arise within the service and lead to cascading failures.
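To make that concrete, here is a minimal sketch (in Python, using the widely available `requests` library; the URL and the two-second budget are illustrative assumptions) of issuing a call with an explicit timeout rather than relying on defaults:

```python
import requests

# Hypothetical downstream call; the URL and 2-second budgets are illustrative.
# Specifying connect/read timeouts lets the caller fail fast instead of
# blocking indefinitely when the dependency is slow or unreachable.
try:
    response = requests.get(
        "https://inventory.example.com/items/42",
        timeout=(2, 2),  # (connect timeout, read timeout) in seconds
    )
    response.raise_for_status()
except requests.Timeout:
    # Surface the timeout to the caller (or trigger a fallback) rather than hanging.
    raise
```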

The Resource Unavailable failure mode encapsulates any error returned from a request indicating a resource is unavailable to process that request.  This may be due to an otherwise-impaired service or it might simply be an oversubscribed resource that is employing a protective throttling mechanism to keep from being overwhelmed by incoming requests.

The primary distinction between mitigating Resource Unavailable and Timeout-type errors is that a Resource Unavailable-type error may be transient and, therefore, a good candidate for the Retry pattern (explained here), while Timeout-type errors are almost always longer in duration.  Aside from this difference, you can approach mitigating these two failure modes using similar strategies.
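As a rough illustration of the Retry pattern, here is a minimal sketch; the exception type, attempt count, and backoff values are assumptions rather than prescribed settings. It retries only a transient “resource unavailable” style failure, with exponential backoff and a little jitter so clients don’t retry in lockstep:

```python
import random
import time

class ResourceUnavailableError(Exception):
    """Hypothetical transient failure raised by a downstream call."""

def call_with_retries(operation, max_attempts=3, base_delay=0.5):
    """Retry a transient failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ResourceUnavailableError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; let the failure surface
            # Back off before retrying so an oversubscribed resource can recover.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```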

If the scenario can be completed successfully in more than one way, then some form of graceful degradation may be the right mitigation approach to use when that scenario is under some form of duress.  Think back to our earlier example of requesting an image to be included in a page of HTML.  If that page were part of a shopping cart workflow, it might be acceptable to render an “IMAGE NOT FOUND” placeholder instead so the customer can complete the transaction by placing their order without seeing the image.  Another approach to this same problem might involve using an image of the right product, but in a different color or pattern.
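A minimal sketch of this kind of graceful degradation might look like the following; the placeholder path and the `fetch_image` callable are hypothetical stand-ins for whatever image-retrieval mechanism the page actually uses:

```python
PLACEHOLDER_IMAGE_URL = "/static/image-not-found.png"  # illustrative asset path

def resolve_product_image(fetch_image, product_id):
    """Return the product image URL, degrading to a placeholder on failure."""
    try:
        return fetch_image(product_id)
    except Exception:
        # The image is not essential to completing the order, so degrade
        # gracefully instead of failing the whole checkout page.
        return PLACEHOLDER_IMAGE_URL
```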

If the request being made by a component is in the critical path for the scenario to complete successfully, then using an alternate path (a redundancy pattern) to obtain this resource might be the correct mitigation to apply.  Using our previous example where the image retrieval failed, we can redirect the request to a secondary location storing a copy of the requested image.  If data freshness is not critical, the requesting component might first search a local cache of previously requested images, as sketched below. Keep in mind that while having multiple instances of the same resource is generally a good thing from a reliability standpoint, redundancy with diversity is even better, though there is a complexity trade-off to consider.  Ideally, the alternate path should be sufficiently distinct from the original to avoid a class of problems affecting all instances of those redundant resources.  Following the so-called “all or nothing” deployment strategy can lead to a situation where the redundant instances are immediately afflicted with the same failure and rendered unusable, which is the exact opposite of what the designer intended by incorporating redundancy in the first place.
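Continuing the image example, a sketch of such an alternate-path lookup might look like this; the `cache`, `primary_store`, and `secondary_store` objects and their interfaces are assumptions for illustration only:

```python
def fetch_image_with_fallbacks(image_id, cache, primary_store, secondary_store):
    """Try progressively less-preferred sources for the same image."""
    # 1. Local cache of previously requested images (acceptable when data
    #    freshness is not critical for this scenario).
    cached = cache.get(image_id)
    if cached is not None:
        return cached

    # 2. Primary image store, then 3. a secondary location holding a copy.
    for store in (primary_store, secondary_store):
        try:
            image = store.fetch(image_id)
            cache[image_id] = image  # populate the cache for later requests
            return image
        except Exception:
            continue  # fall through to the next redundant source

    raise LookupError(f"image {image_id} unavailable from all sources")
```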

Another mitigation is to queue the request for completion at a later time.  A great example of this is an e-commerce site where a payment method is submitted and a problem in the external payment processing system causes the entire transaction to fail.  The transaction could instead be queued, and the customer notified that the order was placed and that a confirmation will follow once the payment is processed successfully. This illustrates an architectural pattern known as Asynchronous Messaging, described here.
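A minimal sketch of queuing the payment for later processing could look like the following; the in-process `queue.Queue` stands in for a durable message broker, and the message fields are illustrative:

```python
import json
import queue
import uuid

payment_queue = queue.Queue()  # stand-in for a durable message broker

def submit_order(order):
    """Accept the order immediately; process the payment asynchronously."""
    message = {
        "message_id": str(uuid.uuid4()),
        "order_id": order["order_id"],
        "payment": order["payment"],
    }
    payment_queue.put(json.dumps(message))
    # The customer is told the order was placed; a confirmation is sent later,
    # once a worker dequeues the message and the payment succeeds.
    return {"status": "order placed", "confirmation": "pending"}
```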

The third failure mode, Latency Increases Impacting Performance, presents an interesting challenge to the designer.  When interactions are slow, but not slow enough to incur Timeout errors, the requesting components may build up a large backlog of work, resulting in extended periods where resources are fully consumed, the so-called “blocking behavior” caused by system bottlenecks.  One approach to alleviating such issues is to instrument the component interaction calls and use the resulting telemetry to trigger protections such as Circuit Breakers or Throttling. These can reduce inbound requests to manageable levels or limit the consumption of a resource, preventing bottlenecks from emerging and potentially cascading through the system.
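As one possible illustration of the Circuit Breaker protection, here is a minimal sketch; the failure threshold and cooldown period are assumptions, and a production implementation would typically also need thread safety and a proper half-open probing state:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None

    def call(self, operation):
        # While open, reject calls immediately instead of piling up a backlog.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # cooldown elapsed; allow a trial request
            self.failure_count = 0
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0  # success resets the failure count
        return result
```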

Incorrectness

The Incorrectness category covers a broad range of failures often characterized as “code-related” problems, such as mismatched interfaces, incorrectly-specified request or response parameters, and the improper handling of duplicate requests.  Let’s look at two of these failure types.

I discussed earlier how retrying a previously failed request can improve an application’s resilience to transient faults. That retry logic must guarantee that the integrity of the underlying data being manipulated is preserved. Two common mitigation techniques are idempotence and deduplication.

The term idempotence describes an operation that produces the same result whether it is executed once or multiple times. An example would be applying an update to a mailing address: if the update is applied only once to the data source, the result is the same as if it were applied over and over again, and the final result is an up-to-date address.
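For example, a minimal sketch of an idempotent update (the record structure and field names are illustrative) shows why retrying it is safe:

```python
def update_mailing_address(customer_record, new_address):
    """Idempotent update: applying it once or many times yields the same state."""
    # The operation sets the address to a target value rather than, say,
    # appending to a history list, so repeated (retried) executions are harmless.
    customer_record["mailing_address"] = new_address
    return customer_record

record = {"id": 42, "mailing_address": "old street"}
update_mailing_address(record, "1 New Street")
update_mailing_address(record, "1 New Street")  # retry: final state unchanged
```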

The term deduplication is used to describe an operation that will eliminate multiple copies of repeated data, resulting in only a single copy being retained.  An example would be allowing a customer to click the “Submit” button multiple times on a form where they have supplied their contact information, but not creating multiple records with identical information.  The submitted data fields would be scanned and duplicates removed before the contact information is recorded permanently.
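A minimal deduplication sketch might look like the following; the in-memory set stands in for durable storage of already-recorded submissions, and the choice of dedup key (name plus email) is an illustrative assumption:

```python
seen_submissions = set()  # stand-in for durable storage of recorded submissions
contacts = []

def record_contact(contact_info):
    """Record contact details once, even if the form is submitted repeatedly."""
    # Hypothetical dedup key built from the submitted fields; identical
    # resubmissions (e.g. repeated "Submit" clicks) are detected and dropped.
    key = (contact_info["name"], contact_info["email"])
    if key in seen_submissions:
        return "duplicate ignored"
    seen_submissions.add(key)
    contacts.append(contact_info)
    return "contact recorded"
```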

I mention these two terms because they are frequently used as mitigation strategies to cope with the class of failures we refer to as “Incorrectness”.  The idea is to isolate incorrect data elements so they can be triaged using alternative logic, thus delivering a more consistent, predictable and therefore reliable experience to the customer.

I’ve shared some common strategies and mitigation patterns for building resiliency into online services. I encourage you to explore Resilience Modeling and Analysis (RMA) and use it to help you identify failure points in your service and to evaluate when to apply one, or more, of the mitigations briefly described above.

Previous posts in this series:

Reliability Series #1: Reliability vs. resilience

Reliability Series #2: Categorizing reliability threats to your service

Reliability Series #3: Reliability-enhancing techniques (Part 1)

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs.