Take WASC Data With a Grain of Salt
The Web Application Security Consortium (WASC) just published statistics on the prevalence of various web application vulnerabilities. The list was compiled from 31,373 automated assessments performed during 2006 by four contributing companies, with the methodology around data collection described as follows:
The scans include a combination of raw scan results and results that have been manually validated to remove false positive results. The statistics do not include the results of any purely manual security audits (aka human assessments).
As with any statistical data, the results of this study should be digested with a healthy dose of skepticism and a solid understanding of the sampling bias. Take, for example, a political tracking poll conducted by phone during normal business hours. The results of the poll will only account for the opinions of voters with publicly listed phone numbers who happen to be home during the day (and who don’t screen their calls to weed out tracking polls). The sampling bias of the WASC study is that it only accounts for the findings of automated web application scanners. As a result, it primarily reflects the capabilities and limitations of these scanners rather than the general state of web application security, which is what one might reasonably expect a WASC publication to describe.
Keeping this bias in mind, what does this data really tell us, beyond the fact that automated vulnerability scanners find a lot of XSS? Does it give us true visibility into the actual prevalence and distribution of vulnerabilities in custom web applications? My answer is no.
Let’s look at a sample of the prevalence data:
Those numbers just don’t pass the “giggle test.” The category that stands out the most in that list is Insufficient Authorization, a very common vulnerability in my experience. It’s highly unlikely that only four of the applications contain authorization-related vulnerabilities. All this does is highlight the limitations of automated web app scanners.
What about Cross-Site Request Forgery? That doesn’t show up at all on the list, despite the fact that the vast majority of web applications are vulnerable to it (even Jeremiah agrees on this point). It’s not on the list because it isn’t something the automated scanners can detect with any degree of accuracy. For the same reason, several categories on the OWASP Top Ten aren’t even represented, such as Buffer Overflows and Denial of Service.
Now let’s talk about false positives. The methodology clearly states that the data is a mixture of raw scan output and manually validated results. Since the results are presented in aggregate, it is impossible to derive real meaning from the figures without insight into the following information:
- Which results came from which product
- Which results have been manually validated
- The historical false positive rates, by category, for each product
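To see why the per-category false positive rate matters, consider how a raw count would be adjusted if that rate were known. The sketch below is purely illustrative: the class name, the raw count, and both rates are made-up assumptions, not figures from the WASC report or from any scanner product.

```java
// Hypothetical helper: none of these names or numbers come from the WASC data.
final class FalsePositiveAdjustment {
    // Estimates how many raw findings are real, given an assumed false positive rate.
    static long estimateTruePositives(long rawCount, double falsePositiveRate) {
        if (falsePositiveRate < 0.0 || falsePositiveRate > 1.0) {
            throw new IllegalArgumentException("rate must be between 0 and 1");
        }
        return Math.round(rawCount * (1.0 - falsePositiveRate));
    }

    public static void main(String[] args) {
        long raw = 1000; // made-up raw count for a single category
        System.out.println(estimateTruePositives(raw, 0.10)); // assumed 10% rate
        System.out.println(estimateTruePositives(raw, 0.50)); // assumed 50% rate
    }
}
```

The same raw count could stand for very different numbers of real vulnerabilities depending on the rate, which is why an aggregate figure published without these inputs is hard to interpret.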
There is also a lack of clarity around the definition of “one vulnerability.” Consider this code snippet:
Map<String, String[]> params = request.getParameterMap();
PrintWriter pw = response.getWriter();
// Echoes every parameter name and value back without HTML encoding
for (String key : params.keySet())
    for (String value : params.get(key))
        pw.println(key + "=" + value + " ");
An automated scanner might report that as 100 different XSS vulnerabilities, one for each parameter that it fuzzed. However, there is only one actual flaw in the code. This is a simplistic example, but I suspect the inflated XSS numbers are partly due to this type of accounting.
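One way to picture the difference between the two counts is to group scanner findings by the code location that produced them. The sketch below is hypothetical: the `Finding` class, its fields, and the `EchoServlet.java:14` sink label are assumptions for illustration, not the report format of any real scanner. A black-box scanner would generally not know the sink location at all, so this kind of grouping usually requires source access or manual triage.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical finding: one scanner report for one fuzzed parameter.
final class Finding {
    final String category;   // e.g. "XSS"
    final String parameter;  // the request parameter the scanner fuzzed
    final String sink;       // assumed label for the code location that emits the value

    Finding(String category, String parameter, String sink) {
        this.category = category;
        this.parameter = parameter;
        this.sink = sink;
    }
}

final class FindingCounter {
    // Counts distinct (category, sink) pairs, i.e. distinct flaws in the code,
    // instead of one vulnerability per fuzzed parameter.
    static int countDistinctFlaws(List<Finding> findings) {
        Set<String> flaws = new HashSet<>();
        for (Finding f : findings) {
            flaws.add(f.category + "@" + f.sink);
        }
        return flaws.size();
    }

    // Builds the scenario from the text: the same echo loop reported once per parameter.
    static List<Finding> sampleFindings(int parameterCount) {
        List<Finding> findings = new ArrayList<>();
        for (int i = 0; i < parameterCount; i++) {
            findings.add(new Finding("XSS", "param" + i, "EchoServlet.java:14"));
        }
        return findings;
    }

    public static void main(String[] args) {
        List<Finding> findings = sampleFindings(100);
        System.out.println("Reported findings: " + findings.size());
        System.out.println("Distinct flaws: " + countDistinctFlaws(findings));
    }
}
```

Whether a published statistic counts reports or distinct flaws changes the XSS total by two orders of magnitude in this contrived case, and the aggregate data does not say which convention each contributor used.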
In conclusion, here are the key takeaways from this list, after accounting for all of the weaknesses inherent to the methodology and the data itself:
- Automated web app scanners find a lot of XSS and SQL Injection
- Automated web app scanners are ineffective at finding vulnerabilities that require some understanding of higher-level logic, e.g. Insufficient Authorization or CSRF
- Including raw scan results from a category of products that are notorious for high false positive rates makes the resulting statistics even less meaningful
- The many-to-one mapping of reported vulnerabilities to actual instances of flawed code artificially inflates the prevalence of certain categories
In other words, this study provides minimal value to a veteran pen tester, and is misleading to just about anyone else.