Fault Modeling for Cloud Services

Over the past couple of weeks I have posted about organizational goals for service reliability, as well as the causes of service outages and the associated mitigation strategies. Today I’d like to share some insight into one of the methods Microsoft uses to design and build cloud services so that they can respond gracefully to outages. It’s not a new concept, but one that I believe is useful for providers and customers alike to think about.

Just as threat modeling is an important step in the design process when security-related issues are being evaluated, fault modeling is an important step in the design process for building reliable cloud services. It’s about identifying the interaction points and dependencies of the service so the engineering team can see where investments should be made to ensure the service can be monitored effectively and issues detected quickly. In turn, it guides the engineering team toward effective coping mechanisms so the service is better able to withstand, or mitigate, a fault.

There are a few key steps to create fault models:
1. Create a component inventory. The inventory includes all components the service uses, whether they are user interface components hosted on a web server, a database hosted in a remote data center, or an external service that the service being modeled depends on.
2. Create user scenarios. This includes describing all of the ways users may interact with the service. For example, an online video service might contain scenarios for logging in, browsing the video library, selecting a video and viewing it, then rating the video after viewing.
3. Develop a matrix containing components and scenarios; then map usage of components across scenarios. By mapping the user scenarios against the component inventory, we can identify which components are accessed for each scenario and identify dependencies and possible failure points.
4. Define fault-handling mechanisms. Defining fault-handling mechanisms, or coping strategies, for each dependency helps to ensure that the software will do something reasonable when a failure occurs. What “reasonable” means depends on the functionality of the service and the type of failure the coping strategy addresses. For example, the architects of a car purchasing service may design their application to include ratings for specific makes and models of cars. The purchasing service might have a dependency on another service that provides comparative ratings of the car models. If the rating service fails, or is unavailable, the coping strategy might be for the purchasing service to display a list of models without the associated ratings rather than display no list at all. In other words, when a particular failure happens, the service should still produce a reasonable result from the customer’s perspective.
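
The four steps above can be sketched in a few lines of Python. This is only an illustration built around the car purchasing example, not a description of how Microsoft implements fault models: the component names, the scenario names, the `RatingsUnavailable` exception, and the `fetch_ratings` callable are all assumptions made up for this sketch.

```python
"""Sketch of a fault model for a hypothetical car purchasing service."""

# Step 1: component inventory (all names are illustrative assumptions).
COMPONENTS = ["web_frontend", "inventory_db", "ratings_service", "payment_service"]

# Step 2: user scenarios, each mapped to the components it touches.
SCENARIOS = {
    "browse_models": {"web_frontend", "inventory_db", "ratings_service"},
    "view_model_details": {"web_frontend", "inventory_db"},
    "purchase_car": {"web_frontend", "inventory_db", "payment_service"},
}


def build_matrix(components, scenarios):
    """Step 3: return a {scenario: {component: bool}} usage matrix."""
    return {
        name: {component: component in used for component in components}
        for name, used in scenarios.items()
    }


def scenarios_affected_by(component, scenarios):
    """List the scenarios that depend on a component, sorted by name."""
    return sorted(name for name, used in scenarios.items() if component in used)


class RatingsUnavailable(Exception):
    """Raised by the hypothetical ratings client when the dependency fails."""


def list_models(models, fetch_ratings):
    """Step 4: coping strategy for the ratings dependency.

    If the ratings lookup fails, return the models without ratings
    (rating is None) rather than failing the whole listing.
    """
    try:
        ratings = fetch_ratings(models)
    except RatingsUnavailable:
        ratings = {}
    return [{"model": model, "rating": ratings.get(model)} for model in models]


def failing_ratings(models):
    """Stand-in for a ratings service outage."""
    raise RatingsUnavailable("ratings service did not respond")


if __name__ == "__main__":
    matrix = build_matrix(COMPONENTS, SCENARIOS)
    for scenario, row in matrix.items():
        used = ", ".join(c for c in COMPONENTS if row[c])
        print(f"{scenario}: {used}")
    print(scenarios_affected_by("ratings_service", SCENARIOS))
    print(list_models(["Model A", "Model B"], failing_ratings))
```

Reading the matrix by component shows which scenarios a single failure would touch, which is where a coping strategy such as the one in `list_models` is worth defining. A real service would likely also bound the call with a timeout and record the failure for monitoring; both are omitted here to keep the sketch short.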

For more insights on these reliability topics, I encourage you to download the recent services reliability whitepaper.

About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups ...