Security Headers on the Top 1,000,000 Websites: March 2014 Report
The March 2014 report is going to be a bit different than those in the past. This is primarily due to architectural changes that were made to get more precise data in less time. Additionally, a lot of work has been done to automate generation of these reports so they can be released more often. Our scan was run on March 5th 2014 using the latest input from the Alexa Top 1 Million.
Before going over what has changed, we must cover what was done in the past. Previously, scans were run using Python + gevent with Kyoto Cabinet as a data store. The architecture was a hack and not much thought was given to it, as it was not much more than a toy project at the time. After a scan was completed, only specific headers were extracted and put into a MySQL database for processing. Initially, these scans were done as one-offs and MySQL was chosen simply because it was already running on the system used. The thought was that the K/V store of Kyoto Cabinet would automatically eliminate duplicate URLs, since keys are unique.
After changing the database to PostgreSQL, a number of discrepancies were noticed. First, care was not taken to lowercase the URLs, which produced duplicates whenever redirects upper-cased part or all of the URI. However, since MySQL treats HTTP://VERACODE.COM and http://veracode.com the same when using the distinct modifier, our stats were for the most part accurate. Other issues appeared, such as how MySQL treats white space in its default collation. A simple query such as select * from headers where header_name='x-xss-protection'; matches not only 'x-xss-protection' but also 'x-xss-protection '. However, a header of: X-XSS-Protection : 1; mode=block; (note the space between the header name and the colon) is technically invalid and, at least in Internet Explorer, only the default XSS protections would be in place, not the defined blocking mode.
Another issue is that grequests (gevent wrapper of the requests library) merges the values of header names that are the same. So if a server responds with duplicate headers such as below, the results would be stored as x-xss-protection: 1, 1; in our database.
X-XSS-Protection: 1;
X-XSS-Protection: 1;
This is unfortunate as it was not possible to tell if the server responded with 1, 1 or the headers were merged by the requests library. Since these reports are being quoted and re-used in various forums it was felt that a rewrite was in order, with proper data integrity and more precise statistics. As such, a number of changes were made to meet this goal. The scanner was completely re-written in Go. Issuing four million requests concurrently is almost the perfect use case for a language like Go.
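The merging problem described above can be illustrated with a minimal sketch, assuming Go's standard net/http client, which stores each header line it receives as a separate slice element. This is not the scanner's actual code: the local test server and the helper names duplicateValues and newDuplicateHeaderServer are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// duplicateValues fetches url and returns every value received for the named
// header, one slice element per header line sent by the server.
func duplicateValues(url, name string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// Header.Values canonicalizes the name and returns the values unmerged.
	return resp.Header.Values(name), nil
}

// newDuplicateHeaderServer starts a hypothetical local server that sends the
// same X-XSS-Protection header twice, like the response shown above.
func newDuplicateHeaderServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Add("X-XSS-Protection", "1;")
		w.Header().Add("X-XSS-Protection", "1;")
		fmt.Fprintln(w, "ok")
	}))
}

func main() {
	srv := newDuplicateHeaderServer()
	defer srv.Close()

	values, err := duplicateValues(srv.URL, "x-xss-protection")
	if err != nil {
		panic(err)
	}
	// Each header line is kept apart, so a duplicate is distinguishable from
	// a single header whose value happens to be "1;, 1;".
	fmt.Printf("%d values: %q\n", len(values), values)
}
```

Storing one database row per slice element would keep the distinction between a duplicated header and a single merged-looking value.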
All header data is now written directly to PostgreSQL using a uniqueness constraint on URLs and user-agents. These constraints stopped over 17,000 duplicate URLs being added from sites which redirect back to a URL that had already been processed. Additionally, all URLs and header names were lower-cased prior to insertion into the database. Another check that was added was if a requested URL redirects back to itself over a different protocol, the redirect would not be followed and instead the 301 response would have its headers inserted into the database. Previously, redirects were followed all the way to the final destination resource, potentially overwriting values if they already existed.
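As a rough sketch of the two checks described above, the following Go code mimics the (URL, user-agent) uniqueness constraint with an in-memory set and stops a redirect when the target is the same resource over a different protocol. All names here (scanKey, seenSet, protocolSwitch, newClient) are hypothetical; the real scanner enforces uniqueness in PostgreSQL, and its redirect logic may well differ. The comparison is deliberately simple and treats explicit ports as part of the host.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// scanKey mirrors a (url, user-agent) uniqueness constraint in memory.
type scanKey struct {
	URL       string
	UserAgent string
}

// seenSet records which (URL, user-agent) pairs were already stored.
type seenSet map[scanKey]bool

// add lower-cases the URL and reports whether the pair is new.
func (s seenSet) add(rawURL, userAgent string) bool {
	k := scanKey{URL: strings.ToLower(rawURL), UserAgent: userAgent}
	if s[k] {
		return false
	}
	s[k] = true
	return true
}

// protocolSwitch reports whether from and to name the same host and request
// URI and differ only in scheme (for example http to https).
func protocolSwitch(from, to *url.URL) bool {
	return !strings.EqualFold(from.Scheme, to.Scheme) &&
		strings.EqualFold(from.Host, to.Host) &&
		strings.EqualFold(from.RequestURI(), to.RequestURI())
}

// protocolSwitchRaw is a convenience wrapper that parses both URLs first.
func protocolSwitchRaw(from, to string) bool {
	f, err1 := url.Parse(from)
	t, err2 := url.Parse(to)
	return err1 == nil && err2 == nil && protocolSwitch(f, t)
}

// newClient returns a client that keeps the redirect response itself when a
// URL redirects back to itself over another protocol.
func newClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if protocolSwitch(via[len(via)-1].URL, req.URL) {
				return http.ErrUseLastResponse // keep the 3xx and its headers
			}
			if len(via) >= 10 {
				return errors.New("stopped after 10 redirects")
			}
			return nil
		},
	}
}

func main() {
	seen := seenSet{}
	fmt.Println(seen.add("HTTP://VERACODE.COM", "chrome")) // new pair
	fmt.Println(seen.add("http://veracode.com", "chrome")) // same pair after lower-casing
	fmt.Println(protocolSwitchRaw("http://example.com/", "https://example.com/"))
	_ = newClient()
}
```

Returning http.ErrUseLastResponse makes the client hand back the redirect response unmodified, so its headers can be stored instead of those of the final destination.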
Overall, the new architecture allows us to issue 4 million requests in under two hours; roughly 740 requests per second all while ensuring data integrity and giving us all header data to use in our analysis. For the curious, this is done using two m3.large AWS instances (with permission from Amazon). One hosting the ‘Golexa’ scanner, and the other our PostgreSQL 9.3 database. CPU consumption was around 60-70% of all three cores with less than a gigabyte of memory used. The scanner averaged around 80 MB/s for the requests. PostgreSQL hovered around 80% CPU utilization with around 200 concurrent connections.
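As a sanity check on the figures above, 4,000,000 requests at roughly 740 requests per second takes about 5,400 seconds, or around 90 minutes. The kind of bounded concurrency that makes this practical in Go can be sketched with a simple worker pool. This is only an illustration, not the Golexa scanner itself: scanAll, result, and the fake fetch function are hypothetical names, and the worker count is an arbitrary choice that should be at least 1.

```go
package main

import (
	"fmt"
	"sync"
)

// result pairs a URL with the error (if any) returned by the fetch function.
type result struct {
	URL string
	Err error
}

// scanAll runs fetch over urls using at most workers goroutines at a time and
// collects one result per URL. Results arrive in completion order.
func scanAll(urls []string, workers int, fetch func(string) error) []result {
	jobs := make(chan string)
	out := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				out <- result{URL: u, Err: fetch(u)}
			}
		}()
	}

	// Feed the workers, then signal that no more jobs are coming.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	urls := []string{"http://example.com", "https://example.com", "http://example.org"}
	// A stand-in fetch that always succeeds; a real scanner would issue the
	// HTTP request and store the response headers here.
	results := scanAll(urls, 2, func(u string) error { return nil })
	fmt.Println(len(results), "results")
}
```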
Unfortunately, the old format is too imprecise to give accurate results. As such, it has been decided not to show the rate of change from previous scans. However, future scans will be compared using the new format. Finally, it was observed that responses to the different user-agents in some cases produced very interesting differences in header values. All charts will now be displayed using the values specific to each browser's response.
Of the four million requests, we received 2,809,213 responses with 1,393,497 URLs matching in both Firefox and Chrome responses. Using Firefox 25's user-agent there were 1,404,180 responses, while Chrome 31 produced 1,405,033. Chrome had a total of 941,568 HTTP and 463,465 HTTPS responses, while Firefox had 940,899 HTTP and 463,281 HTTPS responses. In total there were 23,095,205 headers stored for analysis.
The March 2014 report adds two new headers to the analysis: X-Content-Type-Options and Public-Key-Pins. For more information on these headers please see our previous post on Guidelines for Setting Security Headers.
Invalid Header Names
Thankfully, the number of invalid headers is quite low, with the majority being incorrect CORS headers such as “access-control-allow-origen” [sic] or “access-control-allow-method” (missing s on methods). Overall, around 50 header names were specified incorrectly.
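One way such misspellings could be surfaced is to flag any header name that starts with the CORS prefix but is not a recognized CORS response header. The sketch below assumes a hand-written allow-list of response header names (knownCORS) and a hypothetical helper suspectCORSName; it is not the query used to produce the numbers above.

```go
package main

import (
	"fmt"
	"strings"
)

// knownCORS is an assumed allow-list of CORS response header names,
// lower-cased to match how names are stored.
var knownCORS = map[string]bool{
	"access-control-allow-origin":      true,
	"access-control-allow-credentials": true,
	"access-control-allow-methods":     true,
	"access-control-allow-headers":     true,
	"access-control-expose-headers":    true,
	"access-control-max-age":           true,
}

// suspectCORSName reports whether name looks like a CORS header (by prefix)
// but is not one of the known response header names.
func suspectCORSName(name string) bool {
	n := strings.ToLower(strings.TrimSpace(name))
	return strings.HasPrefix(n, "access-control-") && !knownCORS[n]
}

func main() {
	for _, name := range []string{
		"access-control-allow-origen",
		"access-control-allow-method",
		"access-control-allow-origin",
	} {
		fmt.Printf("%s suspect=%v\n", name, suspectCORSName(name))
	}
}
```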
X-XSS-Protection continues to be the most widely used header as it is simple to add and, provided the site doesn't allow HTML tags to begin with, has little impact on the operation of the site. Astute readers may notice a rather interesting discrepancy between Chrome and Firefox responses. When using the Chrome user-agent, 399 sites respond with 0; mode=block, which is an invalid setting. Numbers don't always tell the full story, so we need to look at the data to determine if there are any hints as to why this is happening. Almost all of the 399 URLs serve the same exact headers:
cache-control: no-store, no-cache, must-revalidate
cache-control: post-check=0, pre-check=0
expires: Wed, 05 Mar 2014 00:00:00 GMT
last-modified: Wed, 05 Mar 2014 09:27:02 GMT
x-xss-protection: 0; mode=block
date: Wed, 05 Mar 2014 09:27:02 GMT
content-type: text/html; charset=windows-1252
p3p: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
The URLs themselves give away an interesting tell: almost all appear to be forums. After visiting a number of them, there were clear signs that they are a part of www.forumotion.com, a free forum hosting service. Visiting their own help forum at http://help.forumotion.com/forum exhibits the same issue. When accessing with Firefox, the 1; mode=block value is returned. When accessing with Chrome, 0; mode=block is returned.
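To make the notion of an invalid X-XSS-Protection value concrete, here is a minimal classification sketch. The bucket names ("disabled", "enabled", "block", "invalid") and the function classifyXSSProtection are assumptions made for this example and are not necessarily the categories used in the report's charts. Combining 0 with mode=block is treated as invalid, matching the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyXSSProtection sorts an X-XSS-Protection value into a rough bucket.
// A value of 0 combined with mode=block is contradictory and reported as
// invalid.
func classifyXSSProtection(v string) string {
	parts := strings.Split(strings.ToLower(v), ";")
	flag := strings.TrimSpace(parts[0])

	block := false
	for _, p := range parts[1:] {
		if strings.TrimSpace(p) == "mode=block" {
			block = true
		}
	}

	switch {
	case flag == "0" && !block:
		return "disabled"
	case flag == "1" && block:
		return "block"
	case flag == "1":
		return "enabled"
	default:
		return "invalid"
	}
}

func main() {
	for _, v := range []string{"1; mode=block", "0; mode=block", "1", "0"} {
		fmt.Printf("%q -> %s\n", v, classifyXSSProtection(v))
	}
}
```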
For X-Content-Type-Options, there were no significant differences between the values for Chrome and Firefox user-agents. The majority of responses came from blogspot with 33,414 and youtube with 12,589. Eight of the invalid settings were due to extra : characters in the header value of a response, such as: X-Content-Type-Options:: nosniff.
The numbers for X-Frame-Options appear very similar between both browsers. Compared to our last run, however, the number of sites using the sameorigin setting nearly doubled. It should be noted that we have modified how we treat invalid settings. On closer inspection, there is a wide range of values people set to allow any site to frame the resource. In previous runs we only split out GOFORIT as it was the most common; now we are seeing other values such as ALLOWALL or simply ALLOW. These have been moved from the invalid category (although they are still technically invalid) into a goforit/allowall category. What is interesting about these 520 sites is why they even bother setting the header. Simply not having the header achieves the same goal: sites will be able to frame the resource. The answer, in some cases, turns out to be an attempt at disabling a server-wide header setting.
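The categorization described above might look something like the following sketch. The function name classifyFrameOptions and the exact matching rules are assumptions; only the goforit/allowall grouping of GOFORIT, ALLOWALL, and ALLOW comes from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyFrameOptions sorts an X-Frame-Options value into a category.
// GOFORIT, ALLOWALL and ALLOW are technically invalid but are grouped
// together because they all express "let anyone frame this".
func classifyFrameOptions(v string) string {
	s := strings.ToLower(strings.TrimSpace(v))
	switch {
	case s == "deny":
		return "deny"
	case s == "sameorigin":
		return "sameorigin"
	case strings.HasPrefix(s, "allow-from "):
		return "allow-from"
	case s == "goforit" || s == "allowall" || s == "allow":
		return "goforit/allowall"
	default:
		return "invalid"
	}
}

func main() {
	for _, v := range []string{"SAMEORIGIN", "GOFORIT", "ALLOWALL", "ALLOW", "yes"} {
		fmt.Printf("%q -> %s\n", v, classifyFrameOptions(v))
	}
}
```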
Access-Control-Allow-Origin with the wildcard value increased by about 2,000 sites compared to last time. The number of invalid values dropped slightly, but it still has the highest number of invalid configurations of all the headers. As in the past, this is primarily due to the continued misunderstanding of the rule that only a single serialized origin is allowed.
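The single-origin rule can be checked with a short sketch like the one below. It assumes a strict reading in which a valid value is the wildcard, the literal null, or exactly one scheme://host[:port] origin with no path; the function classifyAllowOrigin and its bucket names are hypothetical, and this is not necessarily how the report's invalid count was computed.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// classifyAllowOrigin sorts an Access-Control-Allow-Origin value into
// "wildcard", "null", "origin" (one serialized origin) or "invalid".
func classifyAllowOrigin(v string) string {
	s := strings.TrimSpace(v)
	switch {
	case s == "*":
		return "wildcard"
	case s == "null":
		return "null"
	case strings.ContainsAny(s, ", \t"):
		// Lists of origins, comma or space separated, are not allowed.
		return "invalid"
	}

	u, err := url.Parse(s)
	if err != nil || u.Scheme == "" || u.Host == "" ||
		u.Path != "" || u.RawQuery != "" || u.Fragment != "" || u.User != nil {
		return "invalid"
	}
	return "origin"
}

func main() {
	for _, v := range []string{
		"*",
		"http://example.com",
		"http://a.example.com, http://b.example.com",
		"example.com",
	} {
		fmt.Printf("%q -> %s\n", v, classifyAllowOrigin(v))
	}
}
```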
Compared to the last run it may appear that the number of sites using STS has dropped; however, another change was made to how we analyze the data. Now, we only count responses that came from HTTPS URLs towards the total. It should be noted that the includeSubDomains directive is not exclusive and is simply a total count of sites using the directive, whether their max-age values are long, short or zero. As a reminder, long values are sites that set the max-age value to anything over 7 days. The 58 invalid settings were primarily due to sites using incorrect tokens when multiple directives exist. See our Guidelines for Setting Security Headers post for more information.
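The long, short and zero buckets can be sketched as follows. The 7-day threshold comes from the text above; the function classifySTS, the decision to ignore unknown directives, and the treatment of a comma-separated value as invalid are assumptions for illustration rather than the report's exact logic.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sevenDays is the threshold, in seconds, above which max-age counts as long.
const sevenDays = 7 * 24 * 60 * 60

// classifySTS splits a Strict-Transport-Security value on ";" and returns a
// max-age bucket ("zero", "short", "long" or "invalid") plus whether the
// includeSubDomains directive was present.
func classifySTS(v string) (bucket string, includeSub bool) {
	bucket = "invalid" // stays invalid unless a usable max-age is found
	for _, part := range strings.Split(v, ";") {
		d := strings.ToLower(strings.TrimSpace(part))
		switch {
		case d == "includesubdomains":
			includeSub = true
		case strings.HasPrefix(d, "max-age="):
			raw := strings.Trim(strings.TrimPrefix(d, "max-age="), `"`)
			n, err := strconv.Atoi(raw)
			if err != nil || n < 0 {
				// Covers wrong separators such as "max-age=600, includeSubDomains".
				return "invalid", false
			}
			switch {
			case n == 0:
				bucket = "zero"
			case n <= sevenDays:
				bucket = "short"
			default:
				bucket = "long"
			}
		}
		// Empty parts and unknown directives are ignored in this sketch.
	}
	if bucket == "invalid" {
		return "invalid", false
	}
	return bucket, includeSub
}

func main() {
	for _, v := range []string{
		"max-age=31536000; includeSubDomains",
		"max-age=600",
		"max-age=0",
		"max-age=31536000, includeSubDomains",
	} {
		b, sub := classifySTS(v)
		fmt.Printf("%q -> %s (includeSubDomains=%v)\n", v, b, sub)
	}
}
```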
Public-Key-Pins is a brand new header and as such has very little adoption. In fact, only three sites responded with the Public-Key-Pins header, and one with the Public-Key-Pins-Report-Only header. Of the three sites, only one was configured correctly by encapsulating the hash values in quotes. This is concerning, as anyone who is even aware of this header should be quite adept at setting it correctly. The fact that this is not the case does not bode well for site operators who may be implementing this in the future. We look forward to watching the adoption rate of this header and hope that either the specifications are relaxed to allow unquoted hashes or it is made painfully clear that doing so is incorrect.
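The quoting requirement can be checked with a small sketch like this one. It assumes pin directives of the form pin-<algorithm>="<base64 hash>" separated by semicolons; the function pinsQuoted and the sample values in main are made up for illustration and do not come from the scanned sites.

```go
package main

import (
	"fmt"
	"strings"
)

// pinsQuoted reports whether the header value contains at least one pin-*
// directive and every pin-* value is wrapped in double quotes.
func pinsQuoted(v string) bool {
	found := false
	for _, part := range strings.Split(v, ";") {
		d := strings.TrimSpace(part)
		if !strings.HasPrefix(strings.ToLower(d), "pin-") {
			continue // max-age and other directives are not checked here
		}
		found = true
		i := strings.Index(d, "=")
		if i < 0 {
			return false
		}
		val := d[i+1:]
		if len(val) < 2 || !strings.HasPrefix(val, `"`) || !strings.HasSuffix(val, `"`) {
			return false
		}
	}
	return found
}

func main() {
	// Both values below are illustrative placeholders, not real pins.
	quoted := `max-age=600; pin-sha256="AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="`
	unquoted := `max-age=600; pin-sha256=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=`
	fmt.Println(pinsQuoted(quoted), pinsQuoted(unquoted))
}
```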
A slight drop in totals was observed from our last run for X-Content-Security-Policy and X-WebKit-CSP, but not by much. The number of sites using inline or unsafe script continues to be alarmingly high. As a reminder, these headers will soon be deprecated, and developers and frameworks must consider moving towards the Content-Security-Policy header.
While we still haven't seen much traction with Content-Security-Policy, we once again are surprised at how many sites are defined with inline or unsafe script directives. The invalid inline column is calculated as sites using CSP but containing Firefox's old X-Content-Security-Policy 'inline-script' or 'eval-script' directives. What this points to is that these 5 sites most likely kept the old value but simply dropped the 'X-' from the header name. This is invalid and will not protect the site. What is far more fascinating from these numbers is sites that defined Content-Security-Policy and X-WebKit-CSP or X-Content-Security-Policy at the same time. When originally calculating the totals, the hope was that adoption for CSP was going to increase, but that sites still wanted to support older browsers by having the old X header types. It turns out this is not the case, as the number of sites defining both is extremely low. So why are there so many X-CSP/X-WebKit-CSP headers still defined? For the answer we must go back and look at the data.
Once again we look at all of the headers returned for sites that have X-CSP or X-WebKit-CSP values to find similarities. Almost immediately we notice a large number of sites (423 to be exact) which have X-CSP but not the CSP header, have one major thing in common. They all return this header/value:
Set-Cookie: phpMyAdmin=a9dee5eb7a9d5ae4579ad44fb82c6f37b5278351; path=/; secure; HttpOnly
It turns out back in May of 2011, according to this changelog, phpMyAdmin added the X-Content-Security-Policy header. Even in the latest version, it is still defining X-Content-Security-Policy and not Content-Security-Policy.
Analyzing the results from this month has probably been the most interesting for me personally. Having more data in a more structured layout has given me greater insight and context into who is using these headers and how they are being used. The numbers themselves can only tell half the story; only by analyzing the context and surrounding information can we get the full picture. One thing that should be painfully obvious is the importance web frameworks and hosting companies play in the adoption of these security defenses. If you or your company falls into either of these categories, it is strongly recommended that you keep up to date with the latest specifications as they can change quite often. The proper or improper implementation of them can have far more impact on the state of the web than a single web site.
Invalid settings continue to be a concern, to the point that I think either the specifications and implementations are too rigid or the documentation for them is not doing a good job of clearly explaining their constraints. I personally find ABNF rather cryptic and do wonder if a better format could be used. As for the future, I'm hoping that with the new infrastructure we can run these more often. As always, comments and ideas for additional analysis are welcome.
I'd like to finally give thanks to a number of people. As these reports are getting more in depth and more complicated, I always appreciate people who point me in the right direction or answer my questions! So big thanks to Ian Melven from New Relic who is my insight into Mozilla's Firefox, Mike West from Google for pointing me to the right place for Chrome internals, our very own Erik Peterson for setting up the AWS infrastructure and Florent Daignière of Matta Consulting for introducing me to the Public-Key-Pins header.