Improve the reliability of your service with resilience modeling & analysis

I’ve written before on the complexity of the cloud, including the notion that things will go wrong and the importance of planning ahead to minimize the impact on customers when they do. Today Microsoft released a new whitepaper titled “Resilience by design for cloud services”, which provides a methodology for resilience modeling, along with detailed guidance and example templates for cloud services teams to use. The goal is to make implementation easy and consistent.

The paper describes resilience modeling and analysis (RMA), which is based on an industry-standard technique called failure mode and effects analysis (FMEA). We’ve adapted it to more effectively prioritize work in the areas of detection, mitigation and recovery from failures, all of which are significant factors in reducing time to recover (TTR) for cloud services.

There are four key phases of the RMA process:

1. Pre-work. This is one of the most critical phases of the process, and it’s important to remember that the quality of the artifacts produced during this phase will have a significant influence on the final output. This phase covers two separate activities. First, the team develops a complete logical diagram (or schematic) for their service, which visually depicts every component, data source and data flow. Second, the team uses the logical diagram to identify all of the components susceptible to failure (a.k.a. failure points). The team makes sure they understand the interactions, or connections, between these components and how each component in the ecosystem works.

2. Discover. In this phase the team identifies all potential failure modes for each component, including the underlying service infrastructure elements and the various dependencies between those elements. The goal is to capture where the system can fail (points) and how it can fail (modes). We help guide this conversation with a pre-populated failure category checklist.

3. Rate. In this phase, the team analyzes and records the effects that could result from each of the failures identified during the Discover phase. The RMA workbook contains a series of drop-down selections which help determine the effect and likelihood of a particular failure. These columns include the effect of the failure, the portion of users affected by the failure, the time it takes to detect the failure, the time it takes to recover from the failure, and the likelihood of the failure occurring. This phase produces a calculated risk value for every failure type and allows the team to prioritize their engineering investments based on those risk values.

4. Act. The final phase is to take action on the items captured in the RMA worksheet and make the necessary investments to improve the reliability of the service. The ranking of failures captured during the Rate phase allows teams to target their efforts toward the improvements that will have the biggest impact.

If you are designing and deploying cloud services at scale, I encourage you to download this paper to read more about resilience modeling and analysis (RMA) and consider how implementing this process may help to improve the reliability of your online services.
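To make the Rate phase a little more concrete, here is a minimal Python sketch of how a calculated risk value might be derived from the five workbook columns and used to rank failures for the Act phase. Everything specific in it is an assumption for illustration: the ordinal scales, the drop-down labels, the multiplicative formula, and the sample failure modes are hypothetical and are not taken from the RMA workbook, which defines its own values.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales standing in for the workbook's drop-down
# selections. The real labels and weights are not given in this post.
EFFECT = {"minor": 1, "degraded": 2, "outage": 3}
PORTION = {"few": 1, "some": 2, "all": 3}
DURATION = {"under 5 min": 1, "5 to 60 min": 2, "over 60 min": 3}
LIKELIHOOD = {"rare": 1, "occasional": 2, "frequent": 3}


@dataclass(frozen=True)
class FailureMode:
    component: str   # failure point identified during Pre-work
    mode: str        # failure mode identified during Discover
    effect: str
    portion: str
    detect: str
    recover: str
    likelihood: str


def risk(f: FailureMode) -> int:
    """Assumed formula: impact (product of four ratings) times likelihood."""
    impact = (EFFECT[f.effect] * PORTION[f.portion]
              * DURATION[f.detect] * DURATION[f.recover])
    return impact * LIKELIHOOD[f.likelihood]


def rank(failures: list[FailureMode]) -> list[FailureMode]:
    """Order failure modes from highest to lowest risk value."""
    return sorted(failures, key=risk, reverse=True)


# Made-up sample rows, purely for illustration.
failures = [
    FailureMode("cache", "node eviction", "minor", "few",
                "under 5 min", "under 5 min", "frequent"),
    FailureMode("auth service", "certificate expired", "outage", "all",
                "over 60 min", "5 to 60 min", "rare"),
    FailureMode("database", "failover stalls", "degraded", "some",
                "5 to 60 min", "over 60 min", "occasional"),
]

for f in rank(failures):
    print(f"{risk(f):>3}  {f.component}: {f.mode}")
```

With these made-up ratings, a rare failure that is slow to detect and affects all users outranks a frequent but minor one, which is the kind of prioritization the Rate phase is meant to surface.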

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups …