In a recent post on GigaOM, Katie Fehrenbacher summarized Microsoft’s plans for a biogas-fed data center research project in Wyoming. As I reflected on the points in Katie’s article, as well as the detailed description of the project written by Microsoft’s program manager Sean James, I began pondering the reliability implications of reducing the reliance large-scale data centers have on the electrical grid. In view of the challenges many data center operators faced in the aftermath of Hurricane Sandy, I think research and development projects like this one are essential. From a reliability perspective, it makes a lot of sense to couple highly localized, cost-effective, abundant and, most importantly, dependable energy sources closely to energy consumers (like data centers), and to decouple them from monolithic, complex (and arguably unreliable) systems like the grid. The economic and environmental benefits are described in the referenced article, and I encourage the reader to take a look.
It is critically important that data center operators invest heavily in uninterruptible power supplies (UPS) and backup generators (diesel and natural gas being two of the most common) to hedge against the loss of the grid electricity that powers the computing equipment housed in those facilities, even if the likelihood of such a loss is relatively small. And yes, there is also the possibility that the same UPS systems operators rely on to guard against disruptions will not function as expected and will end up causing the very thing they are supposed to prevent: an interruption in the flow of electricity. OK, so what does this have to do with designing, developing and operating reliable cloud-based services? The Data Plant project reduces reliability-related risk by minimizing the number of complex systems needed to produce a dependable supply of electricity, and it localizes the energy source feeding the fuel cells, further compartmentalizing potential risk to well-characterized fault domains (i.e., a fault domain equals the unit of cloud computing capacity served by a particular fuel cell). Designers of cloud services should consider similar reliability-enhancing plans to minimize the risk of service interruptions: systematically seek out, and remove, complexity of all types. Complexity in the software stack, as well as complexity in the physical infrastructure, should be at the top of your list.
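To make the point about complexity concrete, here is a minimal sketch of the standard series-availability calculation: when every subsystem in a chain must work for the service to work, and failures are assumed independent, the chain's availability is the product of the individual availabilities. The subsystem names and the 0.999 figures below are hypothetical, chosen only for illustration; they are not measurements from the Data Plant project or any real data center.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent (a simplifying assumption).
    """
    return prod(components.values())


def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical availabilities, for illustration only.
complex_chain = {
    "subsystem_a": 0.999,
    "subsystem_b": 0.999,
    "subsystem_c": 0.999,
    "subsystem_d": 0.999,
}
# The same chain after two low-value subsystems have been removed.
simple_chain = {
    "subsystem_a": 0.999,
    "subsystem_b": 0.999,
}

for name, chain in (("complex", complex_chain), ("simple", simple_chain)):
    availability = series_availability(chain)
    downtime = expected_downtime_hours(availability)
    print(f"{name}: availability={availability:.6f}, downtime={downtime:.1f} h/year")
```

Under these assumptions, every subsystem removed from the chain raises the overall availability, which is the arithmetic behind seeking out and removing complexity. Real systems also include redundant (parallel) paths and correlated failures, which this sketch deliberately leaves out.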
Ask yourself these questions: Can the software be refactored? Can the physical infrastructure be retooled to eliminate low-value subsystems or components altogether? Can the computing workload be geo-distributed across commoditized computing resources, rather than requiring highly specialized (and therefore limited) resources or exclusively dedicated (and therefore expensive) resources for an extended period of time? Can procedures requiring human intervention be re-evaluated with an eye toward automation at a minimum or, better yet, eliminated altogether by taking a dramatically different approach to achieving the same outcome?
This last point warrants restating: the Data Plant project is an example of researchers at Microsoft applying an innovative approach to achieve the same outcome in a completely different way. In this case, the outcome is a cost-effective source of dependable energy to power Microsoft data centers without relying on the electrical grid. The reliability of the power supplied to the data center is preserved (that’s the desirable outcome) without adding dependencies on complex or fragile systems, which improves the overall reliability of the cloud-based services being delivered.
Along with physical infrastructure considerations, it is important to consider the software design and development implications associated with cloud-based services. For more information on these, please read our recently released whitepaper, “An Introduction to Designing Reliable Cloud Services”.