Stop Thief! How to stop people from stealing your website content

I had a problem with rogue crawlers aggressively crawling thousands of pages on my website. They were driving my CPU load dangerously high. I was also worried they were stealing my content to republish it on spam websites or content farms, to try to pull in search engine traffic with content that actually belongs to me. Most of the crawlers were from IP addresses in China, Russia and the Czech Republic. At my most desperate stage I almost blocked entire countries.

Here’s how I fixed the problem.

My web logs show someone pretending to be Google’s crawler alongside real crawler activity. Here’s the fake’s log entry:

219.64.65.40 (-) - - [22/Aug/2007:11:51:43 -0700] "GET *hidden* HTTP/1.1" 200 630 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

To do a reverse lookup on this IP address, I type the following command in Linux:

dig -x 219.64.65.40

The lookup returns the following:

219.64.65.40.HYD-CDMA.dialup.vsnl.net.in.

That tells me he’s actually someone in India on a dialup connection. Sneaky content thief that must be stopped.

I also have this log entry:

66.249.66.242 (-) - - [22/Aug/2007:11:52:06 -0700] "GET / HTTP/1.1" 200 30293 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The reverse lookup for this IP address gives me:

crawl-66-249-66-242.googlebot.com.

That looks suspiciously like a real Google crawler.
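The same check can be scripted. What follows is a minimal sketch in Python rather than anything I actually run: the trusted suffix list is an assumption for illustration, and the forward-confirmation step (resolving the returned hostname back to the IP) is an extra safeguard beyond the plain dig lookup above, added because a reverse record alone can be faked.

```python
import socket

# Hostname suffixes treated as trusted crawlers. This list is an assumption
# for illustration; adjust it to the crawlers you actually want to allow.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_trusted_hostname(hostname, suffixes=TRUSTED_SUFFIXES):
    """Return True if hostname ends with one of the trusted suffixes."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(suffix) for suffix in suffixes)


def verify_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the name, then confirm it resolves back."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse (PTR) record at all
    if not is_trusted_hostname(hostname, suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False  # the claimed hostname does not resolve
    # Only trust the name if it maps back to the address we started from.
    return ip in addresses
```

Calling verify_crawler("66.249.66.242") performs live DNS lookups, so the result depends on the records published at the time. The reverse and forward parameters exist only so the lookups can be swapped out when testing.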

The way you protect yourself is as follows: Every time your web application (Ruby, Perl, PHP, or whatever) serves a real content page (not an image, stylesheet or JavaScript file) to a user, you increment the number of requests you’ve received from their IP address in a lightweight database of some sort. I use BerkeleyDB to do this (and I use mod_perl). Each IP address is a key in my BDB database, and every time I get a request I increment the counter for that key.
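To make the counting step concrete, here is a minimal sketch in Python instead of the mod_perl and BerkeleyDB setup described above. The function names and the list of static file extensions are assumptions for illustration, and the store is any dict-like object.

```python
# File extensions treated as static assets and therefore not counted.
# This list is an assumption for illustration.
STATIC_EXTENSIONS = (".png", ".jpg", ".gif", ".css", ".js", ".ico")


def is_content_page(path):
    """Return True for requests that should count against the crawl limit."""
    path = path.split("?", 1)[0].lower()  # ignore any query string
    return not path.endswith(STATIC_EXTENSIONS)


def record_request(counts, ip, path):
    """Increment the counter for ip in a dict-like store; return its count."""
    current = int(counts.get(ip, 0))
    if not is_content_page(path):
        return current  # images, stylesheets and scripts are not counted
    counts[ip] = str(current + 1)  # stored as text so file-backed stores work
    return current + 1
```

A plain dict works for a single process. Python's standard dbm module is one possible stand-in for the BerkeleyDB file: dbm.open(path, "c") returns an object that can be passed as counts. A server with many worker processes may need its own locking around the update.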

Then you set up a cron job with a script that runs every minute. It checks how many requests each IP address has sent. If it’s more than a threshold (15 requests per minute for example), then you do a reverse lookup on the suspicious IP address. If it doesn’t map to Google, MSN, Yahoo or someone else you like then you use Linux’s iptables utility to create a firewall rule that drops all traffic from that host. Once you’ve checked all IP addresses, you truncate the file to reset all counters to 0.
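The once-a-minute pass can be sketched like this. It is an illustration in Python, not my actual script: the function names are hypothetical, is_trusted stands for whatever reverse lookup check you use, and block stands for whatever adds the firewall rule (it defaults to print so nothing is blocked by accident).

```python
def find_offenders(counts, threshold=15, is_trusted=lambda ip: False):
    """Return IPs whose count exceeds threshold and that are not trusted."""
    offenders = []
    for key in list(counts.keys()):
        # File-backed stores may hand keys back as bytes.
        ip = key.decode() if isinstance(key, bytes) else key
        if int(counts[key]) > threshold and not is_trusted(ip):
            offenders.append(ip)
    return offenders


def reset_counters(counts):
    """Clear every counter so the next minute starts from zero."""
    for key in list(counts.keys()):
        del counts[key]


def run_check(counts, threshold=15, is_trusted=lambda ip: False, block=print):
    """One pass of the per-minute job: block offenders, then reset."""
    offenders = find_offenders(counts, threshold, is_trusted)
    for ip in offenders:
        block(ip)
    reset_counters(counts)
    return offenders
```

The trust check only runs for addresses that are already over the threshold, so the DNS lookups stay cheap: most visitors never trigger one.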

Whenever I block an IP address, I also get the script to email me the reverse lookup details for that address. That way I can check if I’m accidentally blocking someone I care about. I serve a lot of traffic and my script blocks about 5 IP addresses per day.

To block an IP address on your Linux server, assuming you don’t have any other rules this might interfere with, you type the following:

/sbin/iptables -I INPUT -s BAD.IP.ADDRESS.HERE -j DROP

When you activate this rule, it will appear to the sneaky content thief that your server has gone down. They may rub their hands in glee because they think they DoS’d you, but in fact it is you whose kung-fu is stronger.
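If a script is going to add that rule for you, it is worth validating the address before it gets anywhere near a command line. Below is a minimal sketch in Python; the function names and the dry_run flag are assumptions for illustration, and it handles IPv4 only.

```python
import ipaddress
import subprocess


def build_block_command(ip):
    """Return the iptables argument list that drops traffic from ip."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    return ["/sbin/iptables", "-I", "INPUT", "-s", str(address), "-j", "DROP"]


def block_ip(ip, dry_run=True):
    """Print the command, or run it when dry_run is False (needs root)."""
    command = build_block_command(ip)
    if dry_run:
        print(" ".join(command))
    else:
        # An argument list (no shell) means the address is never
        # interpreted as shell syntax.
        subprocess.run(command, check=True)
    return command
```

With dry_run left at its default, block_ip("219.64.65.40") only prints the command shown above, which makes it easy to review what the script would do before letting it touch the firewall.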

15 requests per minute for actual content pages may not sound like much, but try doing it on your own website and you’ll see how much it actually is.

Once you implement this, a content thief will get at most one or two minutes of crawl time before you firewall them. If they’re crawling 1 page per second, that’s 120 pages at most.

Content leechers were consuming an amazing amount of CPU on my servers and implementing this gave me a very nice performance gain - it saved me buying an extra server.

Update: I found a post by Matt Cutts from Google’s quality team that suggests using this technique. According to someone from their crawl team and a comment poster on this blog, this technique isn’t foolproof because a serious spammer could fake a reverse record (PTR record). But in the absence of a better technique, and because this has proven extremely effective for me, I’ll keep using it.

Quick SEO crash course

I’ve promised myself I won’t spend more than 10 minutes on this blog entry. A separate entry on time management is forthcoming. It’s 6:15pm. Here goes…

I’ve had incredible success with SEO. Ironically enough, my search engine business that I sold in 2005 benefited hugely from SEO. 10,000 uniques per day from Google alone was not uncommon.

There are two basic approaches to SEO.

Type 1. You can focus on driving traffic to a small set of pages. Optimize for a particular set of keywords. Get lots of backlinks to those pages driving up their pagerank. Hope that you get a high conversion rate on those pages.

Type 2. Provide lots and lots of unique and useful content. Publish that content in a way that Google can index it. Have a few pages that have lots of links to your useful and unique content. Get lots of backlinks from good quality sites to those jump-off pages (or directory pages if you like).

Type 1 will not get you much traffic, but those that do arrive will bounce less and have a much higher conversion rate than the second approach.

Type 2 will get you as much traffic as you have content. If you have 350,000 pages, you can easily exceed 10,000 uniques per day provided you get some very high quality backlinks to your jump-off pages. But these visitors weren’t looking for your site specifically. In fact most of them come, stay for 1 second and go away.

I tend to go with Type 2. Have a TON of content that’s somewhat related to the product or service you’re trying to sell, drive massive traffic and try to convert from there.

When you choose Type 2, you get Type 1 for free. All 350,000 of those pages you’ve published have a backlink to your home page. That means they’re driving a huge amount of internal pagerank to your home page. That combined with a few high quality sites linking to your home page will give you high conversion rate Type 1 traffic.

Remember, you need to publish as much UNIQUE and USEFUL content as you can. If Google doesn’t send you visitors in droves, then they’re not doing their job because they’re in the business of listing unique and useful web pages in their search results.

It’s 6:31 and I’ve overrun my allowance. The best SEO resource on the Intertubes is at http://webmasterworld.com/

Scaling early stage startups

I gave a talk at the Seattle Tech Startups meeting last night on scaling early stage startups. The slides can be found here.

I’ve included some of the data I talked about below…

Here’s a post on Greg Linden’s blog describing the research that Google did and their experience at Amazon with regard to page load times. A 0.5 second increase in page load time can kill user satisfaction and consequently your revenue from that page.

Slow performance may also turn away visitors who may have become customers. If you’re a viral business, this can break your virality. It can take user adoption below that critical viral threshold.

A few key points from the talk:

  • Make sure your servers are configured correctly before you get your first burst in traffic. The biggest mistake I seem to make constantly is underestimating the amount of memory that MySQL or Apache is going to grab when the web server load builds or the db size increases. This results in my machine running out of memory and using disk as memory (swapping), which slows things down very badly.
  • Disable Keepalive until you can set up a separate server for static content like images.
  • Cache as much dynamic data as possible. I use Cache::FileCache. Memcached is an excellent product for caching across server boundaries.
  • When MySQL isn’t fast enough for you, consider using lightweight systems like BerkeleyDB or Perl’s Tie::File. Or just roll your own using read/write file locking (flock).
  • Create a separate static content server with keepalive enabled. Lots of lightweight threads can handle many more connections than your app server. Having keepalive enabled is much friendlier to your visitors’ browsers. You also get the added benefit of higher browser concurrency with multiple hostnames.
  • Block content thieves by monitoring how many actual content pages they fetch per minute. If they exceed a threshold, then do a reverse lookup on their IP address. If they aren’t Google, MSN, etc, then block them using a firewall rule. ‘iptables’ is a great tool for blocking baddies.
  • Most I/O intensive processes like MySQL or your own file access routines benefit hugely from the Linux filesystem cache. Make sure you leave plenty of free memory on your server. Linux will automatically grab that memory and use it to cache disk data and minimize actual disk reads and writes. I/O is a very common bottleneck for servers and this is an easy way to fix this bottleneck.
  • Websitepulse.com is great for seeing your actual HTML page load time including DNS lookups.
  • httperf from HP Labs is an open source linux tool that you can use to torture your web servers and see exactly how badly or well they perform under load. Make sure you run it on a different machine to the one you’re benchmarking.
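As an illustration of the “roll your own using file locking” point above, here is a minimal sketch of a counter kept in a plain file and protected with flock. It is written in Python rather than Perl, the function name is hypothetical, and it relies on the fcntl module, so it assumes a Unix-like system such as Linux.

```python
import fcntl
import os


def increment_counter(path):
    """Atomically increment an integer stored in a file; return the new value."""
    # Open for read/write, creating the file if it does not exist yet.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until we hold the lock
        try:
            text = handle.read().strip()
            value = int(text) + 1 if text else 1
            handle.seek(0)
            handle.truncate()  # discard the old contents before rewriting
            handle.write(str(value))
            handle.flush()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    return value
```

Because every writer takes an exclusive lock before reading, two processes incrementing the same file at once cannot both read the old value and lose an update.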

What about Angels?

I’m opposed to VC funding for the reasons I’ve mentioned in a previous post, but what about Angel funding? In my experience, it’s actually not a bad deal. Generally you can sell common stock to an angel investor with no liquidation preferences and no other strings attached. The angel owns 5% to 15% of your business. Sometimes they have a seat on your board, but not always. You retain full control. More importantly, if there is a liquidity event (you sell or IPO your business) you split the proceeds along the equity lines. The share everyone gets is proportional to how much of the business they own.

If business development is part of your strategy, angels can be an incredible resource to help you build your network.

Be wary of Angel deals that require you to take VC investment down the road.

Finders Keepers, Founders Weepers

I discovered a brilliant blog entry by Paul Allen (the other one) on his experience with liquidation preferences. It uses the excellent case of how the Epinions founders got zero after a $30 million sale of their business and a subsequent valuation of $300 million 18 months later.

Here are a few extracts:

Today, the Epinions portion of Shopping.com is worth hundreds of millions, but at the time of the DealTime merger, Epinions was valued at roughly $30 million, which rendered all common shares worthless.

“The question we’re asking,” said one former Epinions employee who asked not to be identified and who plans to be part of the lawsuit, “is how this company supposedly worth only $30 million was suddenly worth $300 million only 18 months later.”

and…

If preferred shareholders control the board, they therefore have the power to crush the common shareholders by triggering a liquidation event at any time. If the valuation at that point is not higher than the Liquidation Preference, then the common shareholders get nothing.

Paul also talks about the history of MyFamily.com, how they took on a truckload more liquidation preferences via an acquisition, and their experience dodging liquidation preferences.

Start…

Every person involved in consumer tech startups that I’ve spoken to over the last two years has regretted getting venture capital financing. That includes CEOs of companies who have received tens of millions of dollars in venture funding and folks who have been involved in the pitch and been a part of the company post funding. Some of these people I know have started new businesses with the explicit goal of not taking any venture money and doing it themselves. Taking action like this speaks volumes about their previous experience.

I don’t know anyone who is happy about taking VC money.

My personal aversion to VC financing focuses on a few specific things:

  • The liquidation preferences. Even though a VC owns 20% of your stock, there are usually terms in the deal that say that if you sell for, say, $20 million, the VCs take the first X million before you get paid. That could be $19.9 million. Or it could be more, which blocks any potential exit.
  • The “knock it out of the park or go home” problem. When a VC invests, they invest in 20 companies and 19 of those are expected to fail. The 20th is expected to be the next Google or MySpace i.e. to sell for over $500 million or to IPO and generate similar amounts of cash for investors. When you take money from a VC, you are put under enormous pressure to “knock it out of the park” and you may be pushed into taking risks you wouldn’t normally take. These risks will see your business either succeeding bigtime or failing bigtime.
  • The loss of control. No matter how the deal is structured (and this is from a friend who’s taken money), it’s tough to say no to someone who has just written you a check for $6 million.
  • The reporting burden. Running a company is time intensive already. Now you have to provide regular reports that may be largely academic when it comes to your long term growth.
  • The loss of being nimble. When you sell a business plan to an investor, and they invest, you have committed to a specific path. If the competitive landscape changes or a new opportunity presents itself, you may not have the ability to radically change your business quickly to respond to this new threat or opportunity.

Depending on the terms of the deal you cut, you may be very surprised at how little you walk away with after a liquidity event. The MySpace founders made around $5 million each from a $550 million sale of their company.
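The arithmetic behind the liquidation preference point is easy to work through. The sketch below is a deliberately simplified model in Python, not a description of any real term sheet: it assumes the preference is paid off the top and the remainder is then split by ownership, and the function names and the use of whole-dollar integers are choices made for illustration.

```python
def split_pro_rata(sale_price, investor_percent):
    """Split proceeds along equity lines (no liquidation preference)."""
    investor = sale_price * investor_percent // 100
    return investor, sale_price - investor


def split_with_preference(sale_price, preference, investor_percent):
    """Pay the preference first, then split what is left by ownership.

    Simplified model for illustration; real deal terms vary widely.
    """
    off_the_top = min(sale_price, preference)
    remainder = sale_price - off_the_top
    investor = off_the_top + remainder * investor_percent // 100
    return investor, sale_price - investor


# A $20 million sale with an investor holding 20% of the stock.
print(split_pro_rata(20_000_000, 20))
print(split_with_preference(20_000_000, 19_900_000, 20))
print(split_with_preference(20_000_000, 25_000_000, 20))
```

Under these assumptions, a straight pro rata split leaves $16 million for everyone else, a $19.9 million preference leaves $80,000, and a preference larger than the sale price leaves nothing at all.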

I’ve started No VC Required in the hope of starting a conversation about the benefits of not taking money from VCs and how you might not actually need to take VC money.

Mark Maunder