2,405,518,376 flies can't be wrong

EDIT: some practical things you can do to mitigate the problems below (and others) are mentioned at fixtracking.com, thanks to whoever pointed me to that link.

The internet has many layers. One layer is the world wide web, based on HTTP. When you visit a page in your browser, the browser sends a request to the web server. A simple HTTP request for http://example.com/index.html looks like this:

GET /index.html HTTP/1.1
Host: example.com
Accept: */*

and the response from that server looks like this:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Fri, 04 Oct 2013 10:26:16 GMT
Etag: "359670651"
Expires: Fri, 11 Oct 2013 10:26:16 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (fll/073E)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

which is followed by a blank line and then the contents of the page. The Content-Type header shows that this is an HTML page. HTML is text with markup tags that describe the structure of the document. Some of those tags refer to other resources, like images, stylesheets or scripts, and the web browser will automatically fetch these too - and when it does, it tells each server which page needed those resources by sending that page's URL in the Referer header.
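
To make the automatic fetching concrete, here is a rough Python sketch that lists the sub-resources a browser would request for a page, along with the Referer header it would attach to each. The class and function names, the site-a.example and site-c.example hosts, and the short list of tags are my own illustrative assumptions; real browsers follow many more rules (for instance, only some link tags trigger a fetch, and referrer policies can trim or suppress the header).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of resources a browser would fetch without being asked."""

    # (tag, attribute) pairs treated as automatic fetches.
    # Deliberately incomplete and simplified: an assumption, not a browser spec.
    AUTO_FETCH = {("img", "src"), ("script", "src"),
                  ("link", "href"), ("iframe", "src")}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.AUTO_FETCH:
                # Resolve relative references against the page URL.
                self.resources.append(urljoin(self.page_url, value))


def subresource_requests(page_url, html):
    """Return (url, headers) pairs for each automatically fetched resource."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    return [(url, {"Referer": page_url}) for url in collector.resources]


if __name__ == "__main__":
    page = "http://site-a.example/article.html"
    html = ('<html><head><link rel="stylesheet" href="/style.css">'
            '<script src="http://site-c.example/track.js"></script></head>'
            '<body><img src="logo.png"><a href="/other.html">x</a></body></html>')
    for url, headers in subresource_requests(page, html):
        print(url, headers)
```

Note that the ordinary hyperlink (the a tag) is not fetched, but the script hosted on site-c.example is, and site C gets told about the article page.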

Moreover, every server can set cookies in its response headers, and the browser is expected to store them and pass them back to the domain that set them. The server uses them to recognise you as the same visitor when you visit another page on that site.
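
As a small illustration of that round trip, the following Python sketch uses the standard library's http.cookies module to turn the Set-Cookie values from a response into the Cookie header a browser would send back. The function name and the cookie names and values are made up, and the sketch ignores the domain, path and expiry rules that a real browser applies.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Build the Cookie request header from a list of Set-Cookie values.

    Simplification: attributes such as Path, Domain and Max-Age are parsed
    but not enforced, so every stored cookie is sent back.
    """
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


if __name__ == "__main__":
    # Hypothetical response headers from a server.
    received = ["visitor=abc123; Path=/; Max-Age=604800", "theme=dark"]
    print("Cookie:", cookie_header_from_set_cookie(received))
```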

But the combination of HTML cross-domain linking, HTTP Referer headers, and cookies leaks a ton of information to third parties who you might not even know are watching you. Suppose you visit a page on site A that links to a resource from site C: site C then knows that someone looked at a page on site A. If you later visit a page on site B, which also happens to link to a resource from site C, then site C knows that the same person made both visits (because C set a cookie on the first visit and your browser sent it back on the second). If you later log into site C (perhaps to check your social network updates) then C knows who you are, what you looked at, and when.
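
The A, B, C scenario can be modelled in a few lines of Python. This is a toy sketch of what site C could do with nothing more than the Referer and cookie on each request for its resource; the ThirdPartyTracker class, its methods, the hostnames and the account name are all hypothetical, and it leaves out the timestamps a real tracker would also record.

```python
import itertools
from urllib.parse import urlsplit


class ThirdPartyTracker:
    """Toy model of site C: it only ever sees requests for its own resource."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.log = {}       # cookie id -> hosts seen in the Referer header
        self.identity = {}  # cookie id -> account name, once logged in

    def handle_request(self, referer, cookie=None):
        """Record one request; return the cookie the browser now holds."""
        if cookie is None:
            # First contact: hand out a fresh identifier via Set-Cookie.
            cookie = f"visitor-{next(self._ids)}"
        self.log.setdefault(cookie, []).append(urlsplit(referer).hostname)
        return cookie

    def login(self, cookie, account):
        """The visitor logs into site C with the same cookie in their browser."""
        self.identity[cookie] = account

    def profile(self, cookie):
        return self.identity.get(cookie, "unknown"), self.log.get(cookie, [])


if __name__ == "__main__":
    tracker = ThirdPartyTracker()
    cookie = tracker.handle_request("http://site-a.example/page1")
    cookie = tracker.handle_request("http://site-b.example/page2", cookie)
    tracker.login(cookie, "alice")
    print(tracker.profile(cookie))
```

Before the login the profile is attached to an anonymous identifier; after it, the whole browsing history is attached to a name.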

I did an experiment. I installed VirtualBox on my machine, and installed a minimal Debian 7.1 inside a virtual machine I called "turd". Inside it I installed the packages iceweasel xfce4 gdm3 xorg (using --without-recommends), and shut down the virtual machine. I then enabled network capture on the virtual machine and started it up.

VBoxManage modifyvm "turd" --nictrace1 on --nictracefile1 turd.pcap
VirtualBox -startvm "turd"

I launched Iceweasel, and entered the terms snowden lanchester in the search box on the start page. I clicked the link in the search results, and read the article. I then shut down the virtual machine and opened the network capture in Wireshark, setting the display filter to show only outbound HTTP requests:

http && ip.src==10.0.2.15

I edited the preferences to alter the columns: I deleted them all and added two Custom columns:

http.request.full_uri
http.referer

I then printed the results to a plain text file (packet summary line for all displayed packets - I had to delete the columns I didn't want printed, as just hiding them didn't seem to work). The text file was quite long. Here's what I found with some simple shell scripts.

Browsing the article page:

http://www.theguardian.com/world/2013/oct/03/edward-snowden-files-john-lanchester

resulted in 182 additional HTTP requests with the Referer header set to the article page, across 57 domains:

imageceu1.247realmedia.com fw.adsafeprotected.com mp.apmebf.com facebook-web-clients.appspot.com guardian-notifications.appspot.com related-info-hrd.appspot.com static-serve.appspot.com static.chartbeat.com cdnjs.cloudflare.com graph.facebook.com clients1.google.com www.google.com ajax.googleapis.com pagead2.googlesyndication.com discussion.guardianapis.com secure-uk.imrworldwide.com s.c.lnkd.licdn.com platform.linkedin.com www.linkedin.com pixel.mathtag.com adfarm.mediaplex.com img.mediaplex.com cdn.optimizely.com cdn3.optimizely.com 10822091.log.optimizely.com images.outbrain.com odb.outbrain.com widgets.outbrain.com assets.pinterest.com log.pinterest.com passets.pinterest.com widgets.pinterest.com edge.quantserve.com pixel.quantserve.com b.scorecardresearch.com discussion.theguardian.com hits.theguardian.com oas.theguardian.com ophan.theguardian.com www.theguardian.com platform.twitter.com survey.112.2o7.net ak1.abmr.net ping.chartbeat.net cm.g.doubleclick.net googleads.g.doubleclick.net js.revsci.net pix04.revsci.net req.connect.wunderloop.net ophan.guardian.co.uk combo.guim.co.uk gia.guim.co.uk id.guim.co.uk pasteup.guim.co.uk resource.guim.co.uk static.guim.co.uk s.ophan.co.uk
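
For anyone who wants to repeat the counting, here is a Python sketch of the kind of tally behind those numbers. It is not the shell script I used, and it assumes a layout for the exported text file that may not match yours: one request per line, with the full URI first and the Referer second, separated by whitespace. The function name and the sample lines are made up for illustration.

```python
from urllib.parse import urlsplit


def summarize(lines, page_url):
    """Count requests whose Referer is page_url and collect their hosts.

    Assumed line layout (check your own export): "<full_uri> <referer>".
    Lines with no Referer column, or a different Referer, are skipped.
    """
    requests = 0
    hosts = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[1] != page_url:
            continue
        requests += 1
        hosts.add(urlsplit(fields[0]).hostname)
    return requests, sorted(hosts)


if __name__ == "__main__":
    page = "http://site-a.example/article.html"
    # Made-up sample lines in the assumed layout.
    sample = [
        "http://site-a.example/article.html",
        "http://site-a.example/style.css http://site-a.example/article.html",
        "http://site-c.example/track.js http://site-a.example/article.html",
        "http://site-c.example/pixel.gif http://site-a.example/article.html",
        "http://site-d.example/font.woff http://site-a.example/style.css",
    ]
    count, hosts = summarize(sample, page)
    print(count, "requests across", len(hosts), "hosts:", " ".join(hosts))
```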

Iceweasel does have a checkbox saying "Tell websites I do not want to be tracked", but I'm very skeptical that it's any more than a placebo.

Oh, and the title of this post refers to these statistics. And in the interest of full disclosure, there are some third party resources linked on my own site, namely the MathJax that makes equations pretty. Maybe I should host a local copy for local people...