Reliability Vs. Security

At the International Symposium on Software Reliability Engineering (ISSRE 07, Trollhattan Sweden) one would think that the security versus reliability debate would be very one-sided. After all, reliability is the attendees’ mainstay and if there is one group of folks on the planet who would see security as a subset or subsidiary concern, it might be the industry and academic experts that attend this prestigious IEEE conference.


I gave the ‘industry keynote’ to open the second day of ISSRE 07 this past November, and started this debate by focusing on the topic that consumes my days: security. I painted a picture of the disaster scenarios we spend a heroic amount of effort trying to avoid and talked about the technical and organizational challenges to getting it right. But after the talk, the discussion centered on a broader topic: is security more difficult to achieve than reliability? Afterwards, a gaggle of professors from five continents and practitioners from Saab, Ericsson, Microsoft, Cisco, IBM and Google debated the matter from the halls of the conference to the pubs in the Trollhattan city center.


Here are two points discussed at length during the debate:


1.       Reliability folks are lucky – they have a clear definition of what a bug is: a deviation between the application and the spec. Having a spec means understanding which behaviors are bugs and which are by design; it’s an unerring guide to testing. Security folks have no such oracle since we have no way of specifying all the ways in which an application might be exploited (a threat model might represent our best effort). Without such a spec, topics such as coverage, completeness and so forth have little meaning for security folks and testing is much harder because without a spec we don’t know what we are looking for.



This is a nice state of affairs for reliability until you realize that specs are not what they are cracked up to be. Given the traditional natural language format of most written specs, they are notoriously ambiguous and have an annoying tendency to become out of date as the code evolves and they do not! Sorry, but I refuse to score any advantage to reliability on this point. The state of our collective design documentation and specs won’t allow it.



2.       Security folks are lucky – they only have to deal with a subset of the entire bug space. Their only concern is those components that consume untrusted input and only then the subset of issues that might be exploitable. The rest of the issues can be ignored. Reliability people, on the other hand, must deal with the entirety of the application because reliability bugs can be anywhere. Reliability folks deal with this by weighting their tests according to an operational profile, an unwieldy proposition at best and one that security folks can safely ignore (because hackers don’t follow an operational profile).



As a security guy, this sounds pleasing: I have a smaller problem to deal with! But the solar system is a lot smaller than the galaxy and it isn’t particularly more ‘explorable’ because of its smaller size. It’s only recently, after centuries of study, that we realized there are Pluto-sized rocks out there. Let’s face it, even by reducing the places we have to explore, there are still too many to have any hope of covering them all. The solar system and the galaxy are the same size because they are both too big to be adequately explored with our current methods. Advantage to Security? Nope.


The one thing we both have in common is an unqualified ability to cause pain to our users. Of course there are exceptions, but with security that pain is extreme and happens over the short period of time in which the exploit runs undetected (and the subsequent recovery). With reliability, the pain is often less intense but occurs more frequently and over longer periods of time; it’s those annoying little bugs that waste time and force awkward work-arounds. You can pull the band-aid off all at once or endure it a little at a time. The pain is equally unacceptable.


There is one point I will readily cede to the reliability community: they can teach the security community a thing or two about analyzing data. Metrics are an often-used if still imprecise reliability tool. The use of Bayesian statistics, stochastic processes and reliability modeling is well developed and has been proven time and again on real software development data. Reliability analysis is predictive and can be used to monitor the development process. But in security we rely on simple counting of vulnerabilities and metrics such as ‘days of risk.’ Security measures are more often used to place blame and point fingers than to estimate or predict anything. Security learning tends more toward Pavlov than Markov: when it keeps on hurting, eventually we stop doing it.


But there is also one point the reliability community must cede: security folks are more proactive with corrective action. We spend far more time acting on data than analyzing it. In security, we’ve managed to mitigate and even drive to near-extinction entire classes of vulnerabilities. Despite our inability to measure security, we are very good at driving development and testing process change.  The SDL is a perfect example of this – it’s been proven in practice on some of the most complex software on the planet. Yes, we get it wrong from time-to-time, but we learn from those mistakes.


Security and reliability are different aspects of the general problem of protecting our customers. There is much to learn by our communities working together and sharing solutions that will make our software work better and more securely. ISSRE convinced me that we in the security community are missing out on decades of research in fault and failure analysis that would serve us well. And I think the reverse is true too, that by our example, reliability can be better embedded into the development lifecycle to drive improvements and better protect customers.


I look forward to ISSRE 08, enough so that I’ve helped convince Microsoft to host it. See you next November in Redmond.


Join the conversation

  1. LyalinDotCom

    One common factor that links both of these fields is the usual lack of appreciation for such efforts by most organizations. As we all know features rule while Security and reliability tend to be downgraded unless the specific industry has core business reasons to consider them.

    In the end of the day if you do it right, no one will know you did anything at all.

Comments are closed.