Oct 26
I had a problem with rogue crawlers aggressively crawling thousands of pages on my website. They were driving my CPU load dangerously high. I was also worried they were stealing my content to republish it on spam websites or content farms, in an attempt to capture search engine traffic for pages that actually belong to me. Most of the crawlers came from IP addresses in China, Russia and the Czech Republic. At my most desperate, I almost blocked entire countries.
Here’s how I fixed the problem.
My web logs show both real crawler activity and someone pretending to be Google’s crawler. Here’s the fake’s log entry:
219.64.65.40 (-) - - [22/Aug/2007:11:51:43 -0700] "GET *hidden* HTTP/1.1" 200 630 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
To do a reverse lookup on this IP address, I type the following command in Linux:
dig -x 219.64.65.40
The lookup returns the following:
219.64.65.40.HYD-CDMA.dialup.vsnl.net.in.
That tells me he’s actually someone in India on a dialup connection. Sneaky content thief that must be stopped.
I also have this log entry:
66.249.66.242 (-) - - [22/Aug/2007:11:52:06 -0700] "GET / HTTP/1.1" 200 30293 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The reverse lookup for this IP address gives me:
crawl-66-249-66-242.googlebot.com.
That looks suspiciously like a real Google crawler.
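If you want a script to make the same judgement I made by eye, a rough sketch in Python might look like the following. The function name and the allow-list are my own illustration; only googlebot.com comes from the lookups above, and you would add whichever other crawler domains you trust.

```python
# Hypothetical allow-list. Only googlebot.com appears in the lookups above;
# add the domains of any other crawlers you want to let through.
TRUSTED_SUFFIXES = ("googlebot.com",)


def is_trusted_hostname(hostname, trusted_suffixes=TRUSTED_SUFFIXES):
    """Return True if a reverse-lookup hostname sits inside a trusted domain."""
    # PTR answers usually end with a trailing dot and may vary in case.
    name = hostname.rstrip(".").lower()
    # Require an exact match or a dot boundary, so that a name such as
    # "evilgooglebot.com" does not pass as "googlebot.com".
    return any(name == s or name.endswith("." + s) for s in trusted_suffixes)


if __name__ == "__main__":
    print(is_trusted_hostname("crawl-66-249-66-242.googlebot.com."))          # True
    print(is_trusted_hostname("219.64.65.40.HYD-CDMA.dialup.vsnl.net.in."))  # False
```

Matching on a dot boundary rather than a plain substring matters, because anyone can register a domain that merely contains the word "googlebot".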
The way you protect yourself is as follows: Every time your web application (Ruby, Perl, PHP, or whatever) serves a real content page (not an image, stylesheet or JavaScript file) to a user, you increment the number of requests you’ve received from their IP address in a lightweight database of some sort. I use BerkeleyDB to do this (and I use mod_perl). Each IP address is a key in my BDB database, and every time I get a request I increment the counter for that IP address.
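My own version lives in mod_perl with BerkeleyDB, but the idea is small enough to sketch in Python. This is not my production code: it uses Python's standard dbm module as a stand-in for BerkeleyDB, and the extension list and function names are assumptions made for the example.

```python
import dbm
import os
import tempfile

# Assumed list of static-asset extensions that should not be counted.
STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")


def is_content_page(path):
    """Treat anything that is not an obvious static asset as a content page."""
    # Drop the query string before looking at the extension.
    return not path.lower().split("?", 1)[0].endswith(STATIC_EXTENSIONS)


def record_hit(db, ip, path):
    """Increment the per-IP counter when a content page is served."""
    if not is_content_page(path):
        return
    key = ip.encode("ascii")
    current = int(db[key]) if key in db else 0
    db[key] = str(current + 1).encode("ascii")


def demo():
    """Count four requests from one IP; the stylesheet is skipped."""
    # Throwaway database in a temp directory, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.open(os.path.join(tmp, "hits"), "c")
        try:
            for path in ("/", "/about", "/style.css", "/post?id=7"):
                record_hit(db, "219.64.65.40", path)
            return int(db[b"219.64.65.40"])
        finally:
            db.close()


if __name__ == "__main__":
    print(demo())
```

In a real application you would call something like record_hit from the request handler and keep the database file at a fixed path that the cron script can also read.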
Then you set up a cron job with a script that runs every minute. It checks how many requests each IP address has sent. If the count exceeds a threshold (15 requests per minute, for example), you do a reverse lookup on the suspicious IP address. If it doesn’t map to Google, MSN, Yahoo or someone else you like, you use Linux’s iptables utility to create a firewall rule that drops all traffic from that host. Once you’ve checked all IP addresses, you truncate the database file to reset all counters to 0.
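The decision the cron script makes each minute can be sketched roughly like this in Python. Treat it as an outline under assumptions rather than my actual script: the function names are made up, the counts are passed in as a plain dict, and the trusted-domain list only contains googlebot.com as a placeholder.

```python
import socket

THRESHOLD = 15  # requests per minute, the example figure used above


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def is_trusted(hostname, suffixes=("googlebot.com",)):
    """Placeholder allow-list check; extend suffixes with crawlers you like."""
    name = hostname.rstrip(".").lower()
    return any(name == s or name.endswith("." + s) for s in suffixes)


def find_offenders(counts, threshold=THRESHOLD, lookup=reverse_lookup):
    """Return (ip, hostname) pairs that exceeded the threshold and are not trusted."""
    offenders = []
    for ip, hits in counts.items():
        if hits <= threshold:
            continue  # normal visitor, no lookup needed
        hostname = lookup(ip)
        if hostname is None or not is_trusted(hostname):
            offenders.append((ip, hostname))
    return offenders
```

Only addresses over the threshold trigger a DNS lookup, which keeps the script cheap. The lookup function is a parameter so you can test the logic without touching the network.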
Whenever I block an IP address, I also get the script to email me the reverse lookup details for that address. That way I can check if I’m accidentally blocking someone I care about. I serve a lot of traffic and my script blocks about 5 IP addresses per day.
To block an IP address on your Linux server, assuming you don’t have any other firewall rules this might interfere with, you type the following:
/sbin/iptables -I INPUT -s BAD.IP.ADDRESS.HERE -j DROP
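If a script is going to run that command for you, it is worth validating the address first, since it ultimately comes from a log or counter file. Here is a hedged Python sketch; the function names and the dry_run flag are my own invention, and it handles IPv4 only because that is what this iptables rule covers.

```python
import ipaddress
import subprocess


def build_block_command(ip):
    """Return the iptables argument list that drops all traffic from ip."""
    # IPv4Address raises ValueError for anything that is not a valid IPv4
    # address, so malformed strings never reach the command line.
    addr = ipaddress.IPv4Address(ip)
    return ["/sbin/iptables", "-I", "INPUT", "-s", str(addr), "-j", "DROP"]


def block_ip(ip, dry_run=True):
    """Build the rule and, unless dry_run is set, apply it (needs root)."""
    cmd = build_block_command(ip)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: prints the command without changing the firewall.
    print(" ".join(block_ip("219.64.65.40")))
```

Passing the command as a list rather than a shell string means nothing in the address is ever interpreted by a shell.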
When you activate this rule, it will appear to the sneaky content thief that your server has gone down. They may rub their hands in glee because they think they DoS’d you, but in fact it is you whose kung-fu is stronger.
Fifteen requests per minute for actual content pages may not sound like much, but try doing it on your own website and you’ll see how much it actually is.
Once you implement this, a content thief will get at most one or two minutes of crawl time before you firewall them. If they’re crawling 1 page per second, that’s 120 pages at most.
Content leechers were consuming an amazing amount of CPU on my servers, and implementing this gave me a very nice performance gain: it saved me from buying an extra server.
Update: I found a post by Matt Cutts from Google’s quality team that suggests using this technique. According to someone from their crawl team and a comment poster on this blog, this technique isn’t foolproof, because a serious spammer could fake a reverse record (PTR record). But in the absence of a better technique, and because this has proven extremely effective for me, I’ll keep using it.
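One commonly described way to narrow that gap is forward-confirmed reverse DNS: after the reverse lookup, resolve the returned hostname forward and check that it maps back to the same IP address. A spammer controls the PTR record for their own address, but not the forward records of googlebot.com. The sketch below is an illustration of that idea, not something my script currently does; the function names and the single-entry trusted list are assumptions.

```python
import socket


def resolve_ptr(ip):
    """Reverse lookup: return the PTR hostname for ip, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def resolve_addresses(hostname):
    """Forward lookup: return the list of IPv4 addresses for hostname."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, suffixes=("googlebot.com",),
                        ptr=resolve_ptr, forward=resolve_addresses):
    """True only if the PTR name is trusted AND resolves back to the same ip."""
    hostname = ptr(ip)
    if hostname is None:
        return False
    name = hostname.rstrip(".").lower()
    if not any(name == s or name.endswith("." + s) for s in suffixes):
        return False
    # A faked PTR record fails here, because the real domain's forward
    # records will not list the spammer's address.
    return ip in forward(name)
```

The two resolver functions are parameters so the check can be exercised with canned answers instead of live DNS.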
