Automating Website Analysis

Analyzing the top 5000 sites on the net with Selenium WebDriver

Posted by Stefán Orri Stefánsson on 22 April 2015

In my post, You’ve been framed!, I surveyed the usage of iframes on the top 5000 websites on the Internet. Gathering the data for this number of websites isn’t a straightforward task. This is how I solved it.

A Dynamic Problem

Modern websites are typically JavaScript-heavy, and simply retrieving them with wget or curl is not sufficient for many purposes. This is especially true for a task like counting iframes, because the majority of iframes are generated at runtime by JavaScript libraries like Google’s AdSense. You need a client capable of evaluating JavaScript and querying the resulting DOM.
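To make the problem concrete, here is a minimal sketch of what a static fetch sees. The page source and the ads.example URL are made up for illustration; the point is only that an iframe created by a script never appears as a tag in the raw markup.

```ruby
# Hypothetical page source, as wget or curl would receive it.
STATIC_HTML = <<~HTML
  <html>
    <body>
      <div id="ad-slot"></div>
      <script>
        // Runs only in a real browser: injects the frame into the DOM.
        var f = document.createElement("iframe");
        f.src = "https://ads.example/frame";
        document.getElementById("ad-slot").appendChild(f);
      </script>
    </body>
  </html>
HTML

# Count literal <iframe> tags in raw markup, without running any scripts.
def count_static_iframes(html)
  html.scan(/<iframe\b/i).length
end

puts count_static_iframes(STATIC_HTML) # prints 0
```

A browser that executes the script would report one iframe for the same page, which is why the count has to come from the live DOM.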

My first attempt was with PhantomJS, which I’ve used in the past for testing and screenshot generation. However, crawling 5000 sites with it proved to be an exercise in frustration. Far too many websites caused Phantom to trip over known bugs or hang mysteriously. I abandoned it and moved on to SlimerJS. Slimer is mostly compatible with Phantom, but based on Firefox (XULRunner, to be exact) instead of WebKit. It was much quicker, but suffered from the same problem of hanging on some sites with no explanation.

Selenium to the Rescue!

After the two scriptable browsers had failed me, I turned to Selenium WebDriver, the browser automation tool generally used for website testing. Selenium is not free from quirks of its own but it’s much easier to handle exceptions and recover. Using the Ruby bindings, I ended up with this script.

require "selenium-webdriver"

# Give up on a browser command after 30 seconds instead of hanging forever
http_client = Selenium::WebDriver::Remote::Http::Default.new
http_client.timeout = 30

# I used Firefox but there are plenty of other options
driver = Selenium::WebDriver.for :firefox, :http_client => http_client

# Read in the file list (comma-separated, domain in the second field) and visit each site
File.readlines('sites.txt').each do |line|
  url = line.split(',')[1].strip
  begin
    driver.get "http://www." + url
    # Count the iframes and see if any contain the 'sandbox' attribute
    frames = driver.find_elements(:tag_name, "iframe")
    sandboxes = 0
    frames.each { |frame| sandboxes += 1 if frame.attribute('sandbox') != nil }
    # Print the result
    puts "#{url} #{frames.length} #{sandboxes}"
  rescue Timeout::Error
    # Selenium has lost touch with the browser. Create a new window and continue
    puts "- #{url} - TIMEOUT"
    driver = Selenium::WebDriver.for :firefox, :http_client => http_client
  rescue
    # Any other error: log the failure and move on to the next site
    puts "- #{url} - FAIL"
  end
end

# Close the browser once every site has been visited
driver.quit
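The recovery logic is the part that made Selenium workable, so it may help to see it in isolation. The following is a sketch, not the script I ran: visit_with_recovery and its new_driver argument are hypothetical names, and the driver is only assumed to respond to get and find_elements, which means the logic can be exercised with a stub instead of a real browser.

```ruby
require "timeout"

# Visit one URL and return [frame_count, sandboxed_count].
# `driver` only needs to respond to `get` and `find_elements`.
def count_frames(driver, url)
  driver.get(url)
  frames = driver.find_elements(:tag_name, "iframe")
  sandboxed = frames.count { |frame| !frame.attribute("sandbox").nil? }
  [frames.length, sandboxed]
end

# Hypothetical helper: returns the result line plus the driver to use next.
# `new_driver` is a callable that builds a fresh driver after a timeout.
def visit_with_recovery(driver, domain, new_driver)
  frames, sandboxed = count_frames(driver, "http://www." + domain)
  ["#{domain} #{frames} #{sandboxed}", driver]
rescue Timeout::Error
  # The browser is unresponsive: replace it so the next site gets a clean start.
  ["- #{domain} - TIMEOUT", new_driver.call]
rescue StandardError
  # Any other error: record it and keep the current browser.
  ["- #{domain} - FAIL", driver]
end
```

In a crawl loop this would be called as line, driver = visit_with_recovery(driver, url, factory), where factory is a lambda wrapping the Selenium::WebDriver.for call. Returning the driver alongside the result keeps the "which browser do I use next" decision in one place.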


Running the Script

Unlike PhantomJS, Selenium is not headless - it needs a real browser to automate. Having a scripted browser cycling through websites on your desktop is fun to watch… for a minute, then it gets boring. It also turns out that some of the top 5000 websites contain a lot of filth, so I recommend keeping the browser window out of sight. This is easy using Xvfb in Linux.

user@host:~/$ xvfb-run ruby iframe_browser.rb > results.txt

This is also the way to go if you’re running the script on a headless VPS, which I did for geographic comparison with my machine at home.

Loading so many sites invariably means some are going to fail. I did two additional runs to get more results and wound up with 65 failures - 1.3 percent. I considered this reasonable and moved on to the analysis.
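For the reruns you need the list of domains that failed. A small sketch of how that list could be pulled from the results, assuming failure lines have the "- domain - TIMEOUT" or "- domain - FAIL" shape printed by the script; failed_domains and failure_rate are illustrative names.

```ruby
# Pull the domains of failed visits out of the result lines.
# Assumes failures look like "- example.com - TIMEOUT" or "- example.com - FAIL".
def failed_domains(lines)
  lines.select { |line| line.start_with?("- ") }
       .map { |line| line.split[1] }
end

# Share of failed sites as a percentage, rounded to one decimal place.
def failure_rate(failed, total)
  (100.0 * failed / total).round(1)
end

puts failure_rate(65, 5000) # prints 1.3
```

Note that the crawl script expects the domain in the second comma-separated field of each input line, so a retry list built this way has to be written back in that shape before it can be fed to the script again.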

Analyzing the Results

The result file consists of one line per website containing the domain name, the number of iframes and the number of iframes with the sandbox attribute, like this (domain names replaced with a placeholder):

<domain> 0 0
<domain> 0 0
<domain> 5 0

Now the only thing left was calculating the average number of iframes and the total number of sites using sandboxed frames. This is easily done with awk.

user@host:~/$ awk '{ sum += $2; n++; sum_s += $3 } END { print sum / n; print sum_s }' results.txt

This means that the top 5000 sites on the Internet contain just over five iframes on average, and only a few sites use the sandbox attribute on iframes. Which sites? Using grep we can find all lines that don’t end in zero while ignoring the error lines, which start with a dash:

user@host:~/$ grep -v ' 0$' results.txt | grep -v '^-' | awk '{ print $1 }'
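The same analysis can be sketched in Ruby, which makes two details explicit. This version skips the failure lines when averaging (the awk one-liner counts every line, failures included), and it compares the sandbox count as a number rather than matching on the last character. The summarize function and the hash keys are illustrative names, not part of the original tooling.

```ruby
# Summarize result lines of the form "domain frames sandboxed".
# Lines starting with "-" are failed visits and are left out entirely.
def summarize(lines)
  rows = lines.reject { |line| line.start_with?("-") }
              .map(&:split)
              .select { |fields| fields.length == 3 }

  total_frames = rows.sum { |row| row[1].to_i }
  sandboxed_rows = rows.select { |row| row[2].to_i > 0 }

  {
    # Average iframes per successfully loaded site
    average_frames: rows.empty? ? 0.0 : total_frames.to_f / rows.length,
    # Total number of sandboxed iframes across all sites
    sandboxed_total: rows.sum { |row| row[2].to_i },
    # Domains with at least one sandboxed iframe
    sandbox_sites: sandboxed_rows.map(&:first)
  }
end
```

Calling summarize(File.readlines("results.txt")) returns the three figures in one hash. The numbers will differ slightly from the awk output because failed sites no longer count toward the average.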

Photo credit: Gullfoss, Iceland. Photo by Frederick W. W. Howell, ca. 1900. From the Cornell University Library, via Flickr.