Open Source Web Crawling is About Ten to Fifteen Years Behind Google
In 1999, it took Google one month to crawl and build an index of about 50 million pages. In 2012, the same task was accomplished in less than one minute. That makes the 2012 capability about 50,000 times faster, which is slightly better than doubling the speed every year over the 13 years in between.
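The speedup arithmetic above can be checked with a short sketch. The 30-day month and the function name `annual_growth` are assumptions made for illustration, and "less than one minute" is treated as exactly one minute, so the result is a lower bound.

```python
import math

def annual_growth(speedup, years):
    """Constant yearly multiplier that compounds to `speedup` over `years`."""
    return speedup ** (1 / years)

# Assumption: a 30-day month, and "less than one minute" taken as one minute.
minutes_1999 = 30 * 24 * 60          # 43,200 minutes in a month
minutes_2012 = 1
speedup = minutes_1999 / minutes_2012

years = 2012 - 1999                  # 13 years between the two data points
yearly = annual_growth(speedup, years)
doublings = math.log2(speedup)       # how many doublings the speedup represents

print(f"speedup: {speedup:,.0f}x")
print(f"yearly multiplier: {yearly:.2f}x")
print(f"doublings: {doublings:.1f} in {years} years")
```

Under these assumptions the yearly multiplier comes out a little above 2, and the speedup amounts to more than 15 doublings in 13 years, which is consistent with "slightly better than doubling every year".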
In 2016, a new open-source web crawler, BUbiNG, was announced that can crawl around 12,000 pages per second on a relatively slow connection. That could be 1 billion pages per day. The pricing is about $40 per day. There is an arXiv article from 2016 (BUbiNG: Massive Crawling for the Masses). This is about the capability that Google had ten to fifteen years ago.
BUbiNG is available on GitHub.
On a 64-core, 64 GB workstation it can download hundreds of millions of pages at more than 10 000 pages per second, respecting politeness both by host and by IP, while analyzing, compressing and storing more than 160 MB/s of data.
A 10 terabyte hard drive costs about $200. At the 160 MB/s storage rate quoted above, it would hold roughly 17 hours of crawl output.
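The throughput and storage figures quoted so far can be tied together with a little arithmetic. This is a minimal sketch: the function names are made up for illustration, and it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a crawler that sustains its rate around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_day(pages_per_second):
    """Pages fetched in 24 hours at a constant rate."""
    return pages_per_second * SECONDS_PER_DAY

def hours_to_fill(disk_bytes, bytes_per_second):
    """Hours until a disk is full at a constant write rate."""
    return disk_bytes / bytes_per_second / 3600

daily_pages = pages_per_day(12_000)      # rate quoted for BUbiNG
fill_hours = hours_to_fill(10e12, 160e6) # 10 TB drive, 160 MB/s stored

print(f"{daily_pages:,} pages per day")
print(f"{fill_hours:.1f} hours to fill a 10 TB drive")
```

At 12,000 pages per second the daily total is just over 1 billion pages, and at 160 MB/s a 10 TB drive fills in a bit more than 17 hours.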
Borislav Agapiev is a web crawling expert who described what was needed to run BUbiNG with roughly Google's 2005 capabilities.
He used Amazon AWS spot instances, in particular hi1.4xlarge, which is a good machine with dual Xeon E5620 CPUs and 60GB RAM, sitting on a 10Gbit network. Borislav verified hundreds of Mbps of incoming crawling bandwidth, which is great.
Spot pricing is $0.16/hr, which is awesome, especially considering the regular price is $3/hr. Of course, spot machines can get taken out from under you at any moment, but it is good practice to be able to handle that anyway.
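As a rough sketch of what these hourly prices mean per day, the snippet below converts them to daily costs and asks how many spot instances fit in the roughly $40 per day mentioned earlier. Treating that $40 as a fleet budget is an assumption made here for illustration, as are the function names.

```python
SPOT_PER_HOUR = 0.16     # spot price quoted above, USD
REGULAR_PER_HOUR = 3.00  # regular price quoted above, USD

def daily_cost(hourly_rate, instances=1):
    """Cost in USD of running `instances` machines for 24 hours."""
    return hourly_rate * 24 * instances

def instances_for_budget(budget_per_day, hourly_rate):
    """Whole number of instances that fit in a daily budget."""
    return int(budget_per_day // daily_cost(hourly_rate))

spot_day = daily_cost(SPOT_PER_HOUR)
regular_day = daily_cost(REGULAR_PER_HOUR)
fleet = instances_for_budget(40, SPOT_PER_HOUR)  # assumed $40/day budget

print(f"spot: ${spot_day:.2f}/day, regular: ${regular_day:.2f}/day")
print(f"spot is {SPOT_PER_HOUR / REGULAR_PER_HOUR:.1%} of the regular price")
print(f"{fleet} spot instances fit in $40/day")
```

One spot instance costs $3.84 per day against $72 at the regular price, so about ten of them fit in a $40 daily budget.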
With a simple setup, I was able to get 25MB/s (200Mbps) sustained, with almost 1200 requests/sec on a single machine, which is awesome performance. That could be close to 100M pages/day for
Web Crawling Versus Search Request Handling
Google had far more massive growth in search request volume. Google's handling of search requests increased 17,000% year over year between 1998 and 1999, 1000% between 1999 and 2000, and 200% between 2000 and 2001. Google search continued to grow at rates of between 40% and 60% between 2001 and 2009. Growth then started to slow down, stabilizing at a 10% to 15% rate in recent years.
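To get a feel for how these percentages compound, here is a small sketch. It assumes each figure is a year-over-year increase (so 200% means the volume tripled) and that the 40% to 60% range held for all eight years from 2001 to 2009. Both are interpretations made here, not claims from the source data.

```python
def multiplier(percent_increase):
    """Turn a percentage increase into a growth factor (200% -> 3x)."""
    return 1 + percent_increase / 100

def compound(percent_increases):
    """Overall growth factor after applying each yearly increase in turn."""
    total = 1.0
    for pct in percent_increases:
        total *= multiplier(pct)
    return total

early = compound([17_000, 1_000, 200])   # 1998 to 2001
low = compound([40] * 8)                 # 2001 to 2009 at 40% per year
high = compound([60] * 8)                # 2001 to 2009 at 60% per year

print(f"1998-2001: about {early:,.0f}x")
print(f"2001-2009: between {low:.0f}x and {high:.0f}x")
```

Read this way, request volume grew several thousand times over in the first three years, and then by another factor of roughly 15 to 43 over the following eight.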