Behind the Charts – Scrubbing the Vulnerability Data

In The Evolution of Malware and the Threat Landscape, the Special Edition Security Intelligence Report that we released at RSA, and in other Security Intelligence Reports (SIR), my primary starting source is the National Vulnerability Database (NVD), which is maintained by the National Institute of Standards and Technology (NIST) under sponsorship from DHS.

I frequently get questions, though, because my charts don’t necessarily match a simple comparison against the raw data from the NVD, so I thought it might be interesting to some if I walked through the process and shared the data at a couple of checkpoints.

So, first I start with the raw XML data, which you can download from the NVD Data Feeds page. It isn’t necessarily easy to process or analyze in the raw XML format, so here is the data I use in comma-separated value (CSV) format, which can be loaded into Excel or another spreadsheet application. I’ve included these columns:

  • CVE identifier (i.e., the vuln name)
  • Publication date
  • Disclosed date (my data)
  • CVSS severity score
  • CVSS access complexity
  • Affected product list (separated by ‘|’ in a single cell for filtering)
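
For readers who would rather script against the file than open it in a spreadsheet, here is a minimal Python sketch of reading rows in this layout and splitting the product cell. The header names (cve_id, published, disclosed, cvss_score, access_complexity, products), the ISO date format, and the two sample rows are assumptions invented for illustration, so check them against the actual header row of the file.

```python
import csv
import io

# Hypothetical sample in the column layout described above.
# The header names and both rows are invented for illustration only.
SAMPLE = """cve_id,published,disclosed,cvss_score,access_complexity,products
CVE-0000-0001,2011-08-02,2011-07-28,9.3,MEDIUM,vendor_a:product_x:1.0|vendor_b:product_y:2.1
CVE-0000-0002,2011-03-15,2011-03-10,4.3,LOW,vendor_a:product_x:1.0
"""


def load_rows(handle):
    """Read CSV rows and split the '|'-separated product cell into a list."""
    rows = []
    for row in csv.DictReader(handle):
        # One cell holds every affected product, separated by '|'.
        row["products"] = [p for p in row["products"].split("|") if p]
        # Treat an empty score cell as "no score" rather than failing.
        score = row["cvss_score"]
        row["cvss_score"] = float(score) if score else None
        rows.append(row)
    return rows


def affecting(rows, product_prefix):
    """Return the rows with at least one product starting with the prefix."""
    return [r for r in rows
            if any(p.startswith(product_prefix) for p in r["products"])]


rows = load_rows(io.StringIO(SAMPLE))
for r in affecting(rows, "vendor_b:"):
    print(r["cve_id"], r["cvss_score"], r["products"])
```

To run this against the real file, pass an open file handle (opened with newline="") to load_rows in place of the in-memory sample.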

Note that the “disclosed” date is not part of the raw NVD data; it is my own extension, based on examining the reference URLs in the NVD entry and finding the one with the earliest mention of the vulnerability. I include it so that people can filter by disclosure date, for example to select the vulns disclosed in 2H2011 or another period.

[link to nvdbase.csv]
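
As a rough illustration of the disclosure-date filter mentioned above, the following Python sketch selects records disclosed in a given half-year. The field names, the ISO date strings, and the three records are hypothetical; they are made up to show how a disclosure date can place a vulnerability in a different period than its publication date does.

```python
from datetime import date


def half_year_bounds(year, half):
    """Return (first_day, last_day) for 1H or 2H of the given year."""
    if half == 1:
        return date(year, 1, 1), date(year, 6, 30)
    if half == 2:
        return date(year, 7, 1), date(year, 12, 31)
    raise ValueError("half must be 1 or 2")


def disclosed_in(records, year, half):
    """Keep records whose 'disclosed' date falls inside the half-year."""
    start, end = half_year_bounds(year, half)
    return [r for r in records
            if start <= date.fromisoformat(r["disclosed"]) <= end]


# Invented records; field names and dates are assumptions for illustration.
RECORDS = [
    {"cve_id": "CVE-0000-0001", "published": "2011-08-02", "disclosed": "2011-07-28"},
    {"cve_id": "CVE-0000-0002", "published": "2011-07-05", "disclosed": "2011-06-29"},
    {"cve_id": "CVE-0000-0003", "published": "2012-01-04", "disclosed": "2011-12-30"},
]

second_half = disclosed_in(RECORDS, 2011, 2)
print([r["cve_id"] for r in second_half])
```

In this made-up sample, the second record was published in 2H2011 but disclosed in 1H2011, and the third was published in 2012 but disclosed in 2H2011, so counting by publication date and counting by disclosure date give different totals for the same period.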

I originally wrote about the need to scrub the data about 5 years ago (Scrubbing the Source Data). Back in 2007, I found that while vulnerabilities affecting Microsoft were typically attributed correctly, the NVD was less accurate at identifying other affected products (e.g., Ubuntu products were listed only about 2.1% of the time when compared with confirmations in Ubuntu advisories).

So, to scrub the data, I use vendor-issued security advisories for confirmation, add in any affected products they list, and also add the advisories as CVE references if they are not already present in the NVD. Here are links to the vendor sources for this exercise/data set:

After applying the vulnerabilities and products listed in these sources, I also do some CPE (Common Platform Enumeration) product name simplification, limiting the names to just the vendor, product name, and major and minor version (CPE names can be quite long and duplicative). Here is the CSV file for the scrubbed data:

[link to nvdscrubbed.csv]
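
The two scrubbing steps described above (adding advisory-confirmed products, then simplifying CPE names) might look something like the following Python sketch. It is not the actual tooling behind the file. It assumes CPE 2.2-style URI names of the form cpe:/part:vendor:product:version:update:edition:language, and the dictionary shapes, function names, and sample entries are all hypothetical.

```python
import re


def simplify_cpe(name):
    """Reduce a CPE 2.2-style URI to 'vendor:product:major.minor'."""
    body = name[len("cpe:/"):] if name.startswith("cpe:/") else name
    # Expected field order: part, vendor, product, version, update, ...
    fields = body.split(":")
    if len(fields) < 3:
        return name  # too short to simplify; leave unchanged
    vendor, product = fields[1], fields[2]
    version = fields[3] if len(fields) > 3 else ""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if not match:
        return f"{vendor}:{product}"  # no numeric version available
    major, minor = match.group(1), match.group(2)
    short = major if minor is None else f"{major}.{minor}"
    return f"{vendor}:{product}:{short}"


def scrub(nvd_products, advisory_products):
    """Add advisory-confirmed products to each NVD entry, then simplify.

    Both arguments map a CVE identifier to a list of CPE names. Only CVEs
    already present in the NVD mapping are kept (an assumption of this sketch).
    """
    scrubbed = {}
    for cve, names in nvd_products.items():
        combined = names + advisory_products.get(cve, [])
        # A set removes the duplicates that simplification creates.
        scrubbed[cve] = sorted({simplify_cpe(n) for n in combined})
    return scrubbed


# Invented sample data for illustration only.
NVD = {"CVE-0000-0001": ["cpe:/a:vendor_a:product_x:1.2.3",
                         "cpe:/a:vendor_a:product_x:1.2.4"]}
ADVISORIES = {"CVE-0000-0001": ["cpe:/o:vendor_b:os_y:10.04:-:lts"]}

result = scrub(NVD, ADVISORIES)
print(result)
```

Collapsing names to major and minor version means that entries differing only in patch level, update, or edition merge into one, which is what keeps the product column short enough to filter on.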

I’m happy to take any questions or feedback on this. For you dataheads out there, enjoy!


About the Author
Jeff Jones

Principal Cybersecurity Strategist

Jeff Jones is a 27-year security industry professional who has spent the last decade at Microsoft working with enterprise CSOs and Microsoft's internal teams to drive practical and measurable security improvements into Microsoft products and services. Additionally, Jeff analyzes vulnerability trends…