How many active sites are there?

Motivation

The Netcraft Web Server Survey has run since August 1995, exploring the Internet to find new websites. Every month, an HTTP request is sent to each site, determining the Web server software used to support the site, and, through careful inspection of the TCP/IP characteristics of the response, the operating system.

In the early days of the web, hostnames were a good indication of actively managed content providing information and services to the Internet community. The situation is now considerably more blurred: the web includes a great deal of activity, but also a considerable quantity of sites that are untouched by human hand, produced automatically at the point of customer acquisition by domain registration or hosting companies, advertising providers, speculative domain registrants, or search-engine optimisation companies. The biggest domain registrars are large enough to be significant in the context of the whole survey. For example, Go Daddy (17M hostnames) and 1&1 (10M hostnames) together account for 16% of the 168M hostnames surveyed in May 2008.

Circa 1996-1997, the number of distinct IP addresses would have been a good approximation to the number of real sites, since hosting companies would typically allocate an IP address to each site with distinct content, and multiple domain names could point at the IP address serving the same site content. However, with the adoption of HTTP/1.1 virtual hosting and the availability of load-balancing technology, it is now possible to reliably host a great number of active sites on a single IP address (or relatively few).

It is therefore desirable to count sites rather than IP addresses, while excluding sites generated from a standard or computer-filled template. Netcraft's active sites survey achieves this by retrieving each site's front page and looking for unique content.

Methodology

The front page is retrieved from each hostname on an IP address and then compared with the front pages of the other hostnames on the same IP address. Only sites with distinct content are counted, so that unique content is counted once no matter how many domains and hostnames point at the site.

Where an IP address serves a large number of websites, it is infeasible to sample every site on the address, both logistically and in terms of what site operators will tolerate. We therefore apply a sampling technique and extrapolate the results across all sites hosted on that IP.

Accordingly, Netcraft do not visit all sites on an IP but instead take a random sample of sites. For an IP address with N sites, Netcraft retrieves:

625 * N / (N + 624)

sites.

This means that for IPs with a small number of sites (<100), almost all of the sites are visited; for IPs with huge numbers of sites (>100,000), around 625 are visited. Keeping the number of requests under 700 even in the most pathological case ensures that the survey robots are not banned from sites, and that the results remain representative and accurate.
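
As an illustration, this sampling curve is easy to compute. A minimal sketch in Python follows; the function name is ours, not Netcraft's:

  def sample_size(n_sites):
      """Front pages to retrieve from an IP address hosting n_sites.

      Approaches n_sites for small IPs and saturates at 625 for huge ones.
      """
      return round(625 * n_sites / (n_sites + 624))

  for n in (1, 10, 100, 10000, 1000000):
      print(n, sample_size(n))
  # 1 -> 1, 10 -> 10, 100 -> 86, 10000 -> 588, 1000000 -> 625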

Determining Activeness

The comparison is made on the structure of the page, rather than the text. The typical situation is that a domain registration or domain advertising company will use the same template tag structure for each domain, but vary some text on each site to reflect the domain name, the customer name, the date registered, and so on, which makes a strict comparison of the HTML body unsafe. Free online service providers, such as blogging providers, do something similar, with the variation that they usually give their accounts hostnames under a common domain name instead of separate domains; the domain under which these accounts appear often uses wildcard DNS, and in some cases any hostname under the domain will return valid but computer-generated content, with the hostname taken as an account name and interpolated into the content.

Hence it is important for the active sites methodology to discard the actual words in the page and focus instead on the page structure, as represented by the HTML tag structure.

Some companies like to use framesets for presentation, displaying more than one HTML document on the same page. When the front page contains a frameset tag, we follow the references to request each framed document within the frameset, and compare the tag structure of those documents in the same way as the initial front pages. Following these links is necessary: bare frameset pages share an almost identical tag structure, so without it all front pages consisting of framesets would risk being incorrectly assessed as inactive. A sketch of this step follows.
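
A minimal sketch of the frame-following step, using Python's standard html.parser; the fetch-and-compare machinery around it is assumed:

  from html.parser import HTMLParser

  class FrameExtractor(HTMLParser):
      """Collects the src attributes of frame/iframe tags on a front page."""
      def __init__(self):
          super().__init__()
          self.frame_urls = []

      def handle_starttag(self, tag, attrs):
          if tag in ("frame", "iframe"):
              src = dict(attrs).get("src")
              if src:
                  self.frame_urls.append(src)

  parser = FrameExtractor()
  parser.feed('<frameset cols="20%,80%"><frame src="nav.html">'
              '<frame src="main.html"></frameset>')
  print(parser.frame_urls)   # ['nav.html', 'main.html']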

If the site returns content (i.e., a page or frameset, as opposed to an error response or connection timeout), the HTML tag structure is extracted and an MD5 hash is computed, allowing the content to be quickly compared with other content on the same IP address.
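
Once each sampled front page has been reduced to a hash (or marked as returning no content), the per-IP counts can be combined. A hedged sketch follows; the uniform scaling in the final step is our simplifying assumption, as the exact extrapolation used is not described here:

  def estimate_active_sites(sample_hashes, total_sites_on_ip):
      """Extrapolate distinct front-page structures in a sample to a whole IP.

      sample_hashes: structure hash per sampled site, or None where the site
      returned an error response or timed out.
      """
      responded = [h for h in sample_hashes if h is not None]
      if not responded:
          return 0
      distinct = len(set(responded))
      # Simplifying assumption: duplicates are spread uniformly, so the
      # distinct/sampled ratio is scaled up to the IP's full site count.
      return round(distinct * total_sites_on_ip / len(responded))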

Example

  HTTP/1.1 200 OK
  Date: Mon, 16 Jun 2008 13:57:53 GMT
  Server: Boa/0.92o
  Content-Length: 228
  Content-Type: text/html

  <html>
  <head><title>Test Page for localhost</title></head>

  <body bgcolor="ffffff">
  <h1>Test Page for localhost</h1>
  <p align="left">This is a test page for the machine `localhost'</p>
  </body>

  </html>

would be reduced down to:

  <html><head><title></title></head><body><h1></h1><p></p></body></html>

before hashing, and:

  3095feea34254a8ed7f099406a12a664

afterwards.

The reason for doing this is that most inactive sites are automatically generated from a template, and whilst the page contents may differ between sites, the page structure remains constant.
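
By way of illustration, here is a minimal sketch of the reduction-and-hash step, again using Python's standard library. Netcraft's exact normalisation (of whitespace, unclosed tags, and so on) is not published, so the hash values will not necessarily match the one shown above:

  import hashlib
  from html.parser import HTMLParser

  class TagSkeleton(HTMLParser):
      """Keeps the HTML tag structure, discarding attributes and text."""
      def __init__(self):
          super().__init__()
          self.parts = []

      def handle_starttag(self, tag, attrs):
          self.parts.append("<%s>" % tag)

      def handle_endtag(self, tag):
          self.parts.append("</%s>" % tag)

  def structure_hash(html):
      parser = TagSkeleton()
      parser.feed(html)
      return hashlib.md5("".join(parser.parts).encode()).hexdigest()

  # Two template-generated pages with different text but identical
  # structure hash to the same value:
  a = ('<html><head><title>foo.example</title></head>'
       '<body><h1>foo</h1><p>x</p></body></html>')
  b = ('<html><head><title>bar.example</title></head>'
       '<body><h1>bar</h1><p>y</p></body></html>')
  print(structure_hash(a) == structure_hash(b))   # True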