Blacklists consisting of IP:s known to scrape the site is not really a method in itself since you still need to detect a scraper first in order to blacklist him. Even so it is still a blunt weapon since IP:s tend to change over time. In the end you will end up blocking legitimate users with this method. If you still decide to implement black lists you should have a procedure to review them on at least a monthly basis.
Captcha tests are a common way of trying to block scraping at web sites. The idea is to have a picture displaying some text and numbers on that a machine can’t read but humans can (see picture). This method has two obvious drawbacks. Firstly the captcha tests may be annoying for the users if they have to fill out more than one. Secondly, web scrapers can easily manually do the test and then let their script run. Apart from this a couple of big users of captcha tests have had their implementations compromised.
For example U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels,which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff’s possessory interest in the computer system and that the defendant’s unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.
Obfuscating Source Code
Some solutions try to obfuscate the http source code to make it harder for machines to read it. The problem here with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website. If you decide to implement this you should do it with great care.
To rate limit an IP means that you only allow the IP a certain amount of searches in a fixed time-frame before blocking it. This may seem sure way to prevent the worst offenders but in reality it’s not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways which they often share with thousands of other users. If you rate limit a proxy’s IP that limit will easily trigger when different users from the proxy uses your site. Benevolent bots may also run at higher rates than normal, triggering your limits.
One solution is to use white list but the problem with that is that you continually need to manually compile and maintain these lists since IP addresses change over time. Needless to say the data scrapers will only lower their rates or distribute the searches over more IP:s once they realize that you are rate limiting certain addresses.
In order for rate limiting to be effective and not prohibitive for big users of the site we usually recommend to investigate everyone exceeding the rate limit before blocking them.