Measure Twice, Cut Once, With RMA Methodology

I’ve been beating the drum for a while now about the inevitability of failure in cloud-based systems. Simply put, the complexities and interdependencies of the cloud make it nearly impossible to avoid service failure, so instead we have to go against our instincts and actually design for this eventuality. Once you accept this basic premise, the next question is how exactly we need to change our design processes. The … Read more »

Designing for Failure: The Changing Face of Reliability

I’ve written about reliability and resilience before, but the topic is so important it’s worth revisiting, using an example from the real world I think you’ll appreciate. Imagine the pressure the architects and engineers were under when they designed and built the Channel Tunnel connecting England to France via rail. The so-called “Chunnel” would have to safely transport millions of people a year at speeds over 160 kilometers … Read more »

Antifragility – the goal for high-performance IT organizations

In a recent post, I shared a short list of my favorite books and articles related to reliability. Each one has influenced my thinking about how to create a high-performing IT organization, even though not all of these publications are IT-centric in subject matter. In this post, I’m going to take a closer look at “Antifragile”, the 2012 book written by Nassim Nicholas … Read more »

My “Desert Island Half-Dozen” – recommended reading for resilience

When I speak with customers, they often ask how they can successfully change the culture of their IT organization when deciding to implement a resilience engineering practice. Over the past decade I’ve collected a number of books and articles which I have found to be helpful in this regard, and I often recommend these resources to customers. I’ve included my favorites below, in no particular order, with a short explanation … Read more »

Security Trends in Cloud Computing Part 2: Addressing data challenges in healthcare

In Part 1 of this series, we looked at survey data from financial services organizations and discussed some potential benefits for those that adopt cloud computing. The data are derived from the Cloud Security Readiness Tool (CSRT), which helps organizations evaluate whether cloud adoption will meet their business needs. In today’s post, I want to focus on cloud security trends in the healthcare industry. See more >>

Reliability Series #4: Reliability-enhancing techniques (Part 2)

In my previous post in this series, I discussed the Discovery and Authorization/Authentication categories of the “DIAL” acronym and shared mitigations targeting specific failure modes. In this article I’ll discuss the “Limits/Latency” and “Incorrectness” categories of “DIAL”, and again share example mitigations targeting specific failure modes for each. See more >>

Reliability Series #3: Reliability-enhancing techniques (Part 1)

In my previous post, I discussed “DIAL”, an approach we use to categorize common service component interaction failures when applying Resilience Modeling & Analysis (RMA) to an online service design. In the next two posts, I’ll discuss some mitigation strategies and design patterns intended to reduce the likelihood of the types of failures described by “DIAL”. See more >>

Reliability Series #2: Categorizing reliability threats to your service

Online services face ongoing reliability-related threats such as device failures, latent software flaws triggered by environmental change, and mistakes made by human beings. At Microsoft, one of the ways we’re helping to improve the reliability of our services is by investing in resilience modeling and analysis (RMA) as a way for online service engineering teams to incorporate robust resilience design into the development lifecycle. See more >>

Reliability Series #1: Reliability vs. resilience

Whenever I speak to customers and partners about reliability, I’m reminded that while objectives and priorities differ between organizations and customers, at the end of the day, everyone wants their service to work. As a customer, you want to be able to do things online, at a time convenient to you. As an organization – or a provider of a service – you want your customers to carry out the … Read more »

Building reliable cloud services

Reliability continues to be top of mind for everyone involved with online services. Today we are publishing an updated version of our whitepaper “Introduction to Designing Reliable Cloud Services”. The paper describes fundamental reliability concepts and a reliability design-time process for organizations that create, deploy, and/or consume cloud services. It is designed to help decision makers understand the factors and processes that make cloud services more reliable.  Read more >>