Problem management for reliable online services

I think most cloud providers recognize how important it is to detect, diagnose, and resolve problems that threaten the availability and reliability of the services they offer. However, because a typical cloud service involves so many components and dependencies, detecting, diagnosing, and resolving issues quickly can be hard to achieve. Different root causes may manifest as similar symptoms, which makes it difficult to know for certain whether you have resolved an issue permanently.

For example, slow response times could be traced back to queries that haven’t been optimized, to fully utilized network links slowing the transfer of data, or to machines swapping between memory and disk. How you resolve each of these root causes is radically different, despite the shared symptom: sluggish response!

Many organizations focus their attention on incident management, but in my experience, it’s the organizations that employ a robust problem management process that achieve greater reliability, agility and efficiency when managing their cloud services.

Today Microsoft released a new whitepaper titled “Problem management for reliable online services”. The paper describes problem management and the benefits organizations derive from implementing a robust problem management framework. It compares incident management with problem management, describes the fundamental concepts of effective problem management, and outlines the processes organizations can use to help improve the reliability of their online services. The paper also includes two real-world examples of approaches to problem management used by Bing and Microsoft IT.

In my experience, it is often difficult for organizations to dedicate the resources required to implement a robust problem management methodology, but the return on investment for organizations that do make the commitment can be substantial. Problem management teams investigate why each incident occurred and correlate that information with data gathered from previous incidents to look for similarities. By stepping back and looking at the big picture, they can often identify patterns that would otherwise be overlooked, and those patterns lead to permanent solutions.
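To make the idea of correlating incidents a little more concrete, here is a minimal sketch in Python. It is not taken from the whitepaper; the `Incident` record, its fields, and the `recurring_root_causes` helper are all hypothetical names chosen for illustration, and it assumes each incident has already been investigated and assigned a root cause. It simply groups past incidents by root cause and surfaces the causes that keep coming back.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; real incident data will look different.
    incident_id: str
    symptom: str
    root_cause: str


def recurring_root_causes(incidents, min_count=2):
    """Group incidents by root cause; keep causes seen at least min_count times."""
    by_cause = defaultdict(list)
    for incident in incidents:
        by_cause[incident.root_cause].append(incident.incident_id)
    return {cause: ids for cause, ids in by_cause.items() if len(ids) >= min_count}


# Illustrative data only: four incidents that share one symptom.
incidents = [
    Incident("INC-1", "slow response", "unoptimized query"),
    Incident("INC-2", "slow response", "saturated network link"),
    Incident("INC-3", "slow response", "unoptimized query"),
    Incident("INC-4", "slow response", "memory swapping"),
]

print(recurring_root_causes(incidents))
# {'unoptimized query': ['INC-1', 'INC-3']}
```

All four incidents look identical from the outside, yet only one root cause recurs. That recurring cause is the kind of pattern a problem management team would target for a permanent fix, rather than resolving each incident in isolation.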

If you are deploying cloud services at scale, whether as a cloud provider offering infrastructure, platform, or software solutions, as an independent software vendor, or as a customer managing your own private cloud, I encourage you to download this paper and read more about how to implement a problem management methodology designed to improve the reliability of your online services.


About the Author
David Bills

Chief Reliability Strategist, Trustworthy Computing

David Bills is a general manager of Microsoft’s Trustworthy Computing Group, where he serves as the company’s chief reliability strategist and is responsible for the broad evangelism of the company’s online service reliability programs. His team engages with business groups…