
2,405,518,376 flies can't be wrong

EDIT: some practical things you can do to mitigate the problems below (and others) are mentioned at, thanks to whoever pointed me to that link.

The internet has many layers. One layer is the world wide web, based on HTTP. When you visit a page in your browser, the browser sends a request to the web server. A simple HTTP request for looks like this:

GET /index.html HTTP/1.1
Accept: */*

and the response from that server looks like this:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Fri, 04 Oct 2013 10:26:16 GMT
Etag: "359670651"
Expires: Fri, 11 Oct 2013 10:26:16 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (fll/073E)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

which is followed by a blank line and then the contents of the page. The Content-Type header shows that this is an HTML page. HTML is text with markup tags that describe the structure of the document. Some of those tags refer to other resources, like images, stylesheets or scripts, and the web browser will automatically fetch these too. When it does, it tells the server which page needed those resources by sending that page's address in the Referer header.
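To make the Referer part concrete, here is a minimal sketch of the request a browser might send for an embedded resource. The hosts a.example and c.example, the path /pixel.gif, and both function names are made up for illustration; a real browser sends more headers than this.

```python
def build_subresource_request(path, host, page_url):
    """Return the text of a GET request for a resource embedded in page_url."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Accept: */*",
        # The address of the page that referred to this resource.
        f"Referer: {page_url}",
    ]
    # Header lines end with CRLF; an empty line ends the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_headers(request_text):
    """Split a request into its request line and a dict of headers."""
    head = request_text.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        # Split at the first colon only, since URLs contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


# Hypothetical example: a page on a.example embeds an image from c.example.
request = build_subresource_request(
    "/pixel.gif", "c.example", "http://a.example/article.html"
)
request_line, headers = parse_headers(request)
print(request_line)
print(headers["referer"])
```

The server at c.example only has to read one header to learn which page on a.example was being viewed.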

Moreover, every server can set cookies in its response headers, and the browser is expected to store them and pass them back to the domain that set them on later requests. The server uses these to recognise you as the same visitor when you visit another page on that site.
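A rough sketch of that round trip, using Python's standard http.cookies module. The cookie name and value are invented, and this ignores the domain and path matching a real browser performs before sending a cookie back.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie response header.
set_cookie_value = "visitor=abc123; Path=/; Max-Age=604800"

# The browser stores the cookie along with its attributes...
jar = SimpleCookie()
jar.load(set_cookie_value)


def cookie_header(jar):
    """Build the value of the Cookie request header from stored cookies."""
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# ...and sends only the name=value pair back on later requests.
print(cookie_header(jar))
```

The attributes (Path, Max-Age) stay in the browser; only the identifier itself travels back to the server.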

But the combination of HTML cross-domain linking, HTTP Referer headers, and cookies leaks a ton of information to third parties who you might not even know are watching you. Suppose you visit a page on site A, and it links to a resource from site C. Then site C knows that someone looked at a page on site A. If you later visit a page on site B, which also happens to link to a resource from site C, then site C knows that the same person made both visits (because C set a cookie on the first visit and got it back on the second). If you later log into site C (perhaps to check your social network updates) then C knows who you are, what you looked at, and when.
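The scenario above can be sketched as a toy model of site C. Everything here (the class, the method names, the example.com-style addresses, the account name) is hypothetical, and timestamps are left out for brevity.

```python
import itertools


class ThirdParty:
    """Toy model of site C: it sees a cookie and a Referer on every request."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.log = {}       # cookie id -> list of referring pages
        self.accounts = {}  # cookie id -> account name, once known

    def handle_request(self, cookie, referer):
        """Record the visit; return the cookie the browser should store."""
        if cookie is None:
            cookie = f"id{next(self._ids)}"  # first contact: set a cookie
        self.log.setdefault(cookie, []).append(referer)
        return cookie

    def login(self, cookie, account):
        """Logging in ties the whole history for this cookie to an account."""
        self.accounts[cookie] = account

    def history(self, account):
        """All pages seen under any cookie linked to this account."""
        return [page
                for cookie, name in self.accounts.items() if name == account
                for page in self.log.get(cookie, [])]


site_c = ThirdParty()
cookie = site_c.handle_request(None, "http://a.example/page")    # visit to A
cookie = site_c.handle_request(cookie, "http://b.example/page")  # visit to B
site_c.login(cookie, "alice")                                    # log into C
print(site_c.history("alice"))
```

Note that the visits to A and B happened before the login, yet they end up attached to the account anyway.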

I did an experiment. I installed VirtualBox on my machine, and installed a minimal Debian 7.1 inside a virtual machine I called "turd". I installed --without-recommends iceweasel xfce4 gdm3 xorg, and shut down the virtual machine. I then enabled network capture on the virtual machine and started it up.

VBoxManage modifyvm "turd" --nictrace1 on --nictracefile1 turd.pcap
VirtualBox -startvm "turd"

I launched Iceweasel, and entered the terms snowden lanchester in the search box on the start page. I clicked the link in the search results, and read the article. I then shut down the virtual machine and opened the network capture in Wireshark, setting the display filter to show only outbound HTTP requests:

http && ip.src==

I edited the preferences to alter the columns: I deleted them all and added two Custom columns


I then printed the results to a plain text file (packet summary line for all displayed packets - I had to delete the columns I didn't want printed, as just hiding them didn't seem to work). The text file was quite long. Here's what I found with some simple shell scripts.
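For anyone wanting to repeat this kind of tally, here is a rough Python equivalent. It is not the scripts actually used. It assumes, hypothetically, that each line of the exported file holds two whitespace-separated fields, the requested host and the Referer value. It also counts distinct hosts, which is not quite the same as distinct domains. The sample data is made up.

```python
def parse_export(text):
    """Parse lines of 'host referer' into pairs; skip lines without both."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows.append((fields[0], fields[1]))
    return rows


def tally(rows, article_url):
    """Count requests whose Referer is article_url, and the distinct hosts."""
    hosts = [host for host, referer in rows if referer == article_url]
    return len(hosts), len(set(hosts))


# Made-up sample export; the last line has no Referer and is skipped.
sample = """\
cdn.c.example http://a.example/article
ads.d.example http://a.example/article
cdn.c.example http://a.example/article
a.example
"""

requests, hosts = tally(parse_export(sample), "http://a.example/article")
print(requests, "requests across", hosts, "hosts")
```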

Browsing the article page:

resulted in 182 additional HTTP requests with the Referer header set to the article page, across 57 domains:

Iceweasel does have a checkbox saying "Tell websites I do not want to be tracked", but I'm very skeptical that it's any more than a placebo.

Oh, and the title of this post refers to these statistics. And in the interest of full disclosure, there are some third-party resources linked on my own site, namely the MathJax that makes equations pretty. Maybe I should host a local copy for local people...