Scraping Terminology


Blacklists of IP addresses known to scrape the site are not really a method in themselves, since you still need to detect a scraper before you can blacklist it. Even then a blacklist is a blunt weapon, since IP addresses tend to change over time, and sooner or later you will end up blocking legitimate users. If you still decide to implement blacklists, you should have a procedure to review them at least monthly.
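As a rough illustration of the review procedure, the sketch below stores the date each address was blacklisted so that old entries can be listed for review. The class name Blacklist, its methods and the 30-day default are assumptions made for this example, not part of any particular product.

```python
from datetime import datetime, timedelta


class Blacklist:
    """Hypothetical IP blacklist that remembers when each entry was added."""

    def __init__(self, review_after=timedelta(days=30)):
        # Assumed review interval: entries at least this old are due for review.
        self.review_after = review_after
        self._entries = {}  # ip address -> datetime it was blacklisted

    def add(self, ip, now):
        self._entries[ip] = now

    def remove(self, ip):
        self._entries.pop(ip, None)

    def is_blocked(self, ip):
        return ip in self._entries

    def due_for_review(self, now):
        """Return the blacklisted addresses whose entries are old enough to review."""
        return sorted(
            ip for ip, added in self._entries.items()
            if now - added >= self.review_after
        )
```

A monthly job could call due_for_review with the current time and hand the result to whoever decides whether each address should stay blocked.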

Captcha tests

Captcha tests are a common way of trying to block scraping on web sites. The idea is to show a picture containing text and numbers that a human can read but a machine can't (see picture). This method has two obvious drawbacks. Firstly, captcha tests may annoy users if they have to fill out more than one. Secondly, web scrapers can easily complete the test manually and then let their script run. Apart from this, a couple of big users of captcha tests have had their implementations compromised.

Legal Recourse

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, you will first have to catch the scraper and then present the forensic evidence. This is a lengthy and costly process that may not be conclusive.

For example, U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.

Obfuscating source code

Some solutions try to obfuscate the HTML source code to make it harder for machines to read. The problem with this method is that if a web browser can understand the obfuscated code, so can any other program. Obfuscating source code may also interfere with how search engines see and treat your website. If you decide to implement this, you should do it with great care.
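To illustrate why obfuscation offers little protection, the sketch below assumes one simple scheme in which the page text is written as numeric character references. This is only one of many possible obfuscation techniques, and the sample string is made up for the example. A browser has to decode the references to display the text, and a script can do the same with Python's standard library.

```python
import html

# Assumed example: the word "price" hidden as numeric character references.
obfuscated = "&#112;&#114;&#105;&#99;&#101;"

# html.unescape performs the same decoding step a browser does before rendering.
decoded = html.unescape(obfuscated)
print(decoded)
```

More elaborate schemes, such as text assembled by JavaScript, take more work to undo, but the same principle applies: whatever the browser can render, another program can reproduce.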

Rate limiting

To rate limit an IP address means that you only allow it a certain number of searches in a fixed timeframe before blocking it. This may seem a sure way to stop the worst offenders, but in reality it is not. The problem is that a large proportion of your users are likely to come through proxy servers or large corporate gateways, which they often share with thousands of other users. If you rate limit a proxy's IP address, the limit will easily trigger when different users behind the proxy use your site. Benevolent bots may also run at higher rates than normal, triggering your limits.
One solution is to use a whitelist, but the problem is that you need to compile and maintain such lists manually and continually, since IP addresses change over time. Needless to say, the data scrapers will simply lower their rates or distribute their searches over more IP addresses once they realise that you are rate limiting certain addresses.
For rate limiting to be effective without being prohibitive for big users of the site, we usually recommend investigating everyone who exceeds the rate limit before blocking them.
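The approach described above can be sketched as a counter per IP address and fixed time window that flags addresses for investigation instead of blocking them outright. The class name RateMonitor, its parameters and the example limits are assumptions chosen for illustration. A production system would also need persistent storage and some way to expire idle addresses.

```python
class RateMonitor:
    """Hypothetical fixed-window request counter that flags heavy users for review."""

    def __init__(self, limit, window_seconds, whitelist=()):
        self.limit = limit               # requests allowed per window
        self.window = window_seconds     # length of each fixed window in seconds
        self.whitelist = set(whitelist)  # addresses that are never flagged
        self._state = {}                 # ip -> (window index, request count)
        self.flagged = set()             # addresses awaiting manual investigation

    def record(self, ip, timestamp):
        """Count one request and return True if ip is over the limit in this window."""
        if ip in self.whitelist:
            return False
        window_index = int(timestamp // self.window)
        last_index, count = self._state.get(ip, (window_index, 0))
        if last_index != window_index:
            count = 0  # a new window has started, so the count starts over
        count += 1
        self._state[ip] = (window_index, count)
        if count > self.limit:
            # Flag for investigation rather than blocking immediately.
            self.flagged.add(ip)
            return True
        return False
```

With limit=3 and window_seconds=60, for instance, the fourth request from one address within the same minute returns True and adds the address to flagged, while whitelisted addresses are never counted. Whether a flagged address is a scraper or a busy corporate gateway is left to a human to decide.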

Our Service

Customer experience

We offer effective solutions to companies in several sectors. Our clients, many of whom are long-term, are a testament to our commitment.