Generally the hard part of stopping screen scrapers is not blocking them, but finding them in the first place. Once you have identified a scraper, it is essential to place the block as quickly as possible to stop the activity from the current source. When designing blocking, bear in mind that scrapers often distribute their traffic over thousands or even millions of IP addresses to hide themselves, so any solution should be able to handle large lists of individual IP addresses as well as IP ranges. Another key issue is keeping those lists up to date: there is rarely a reason to block or allow an IP address indefinitely, and without proper expiry handling, white lists and black lists tend to grow over time to the point where they become unmanageable.
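The expiry point can be sketched as a small block list whose entries age out automatically. This is a minimal illustration in Python; the class name, TTL values, and addresses are invented, and a real deployment handling millions of entries would want a more efficient lookup structure than a linear scan (e.g. a radix tree keyed on prefixes):

```python
import ipaddress
import time

class ExpiringBlocklist:
    """Block list of IPs and CIDR ranges where every entry expires
    after a TTL, so the list cannot grow without bound over time."""

    def __init__(self, default_ttl=7 * 24 * 3600):
        self.default_ttl = default_ttl
        self._entries = {}  # ip_network -> expiry timestamp

    def block(self, cidr, ttl=None):
        net = ipaddress.ip_network(cidr, strict=False)
        self._entries[net] = time.time() + (ttl or self.default_ttl)

    def is_blocked(self, ip):
        addr = ipaddress.ip_address(ip)
        now = time.time()
        # Prune expired entries as we scan; copy keys because we delete.
        for net, expiry in list(self._entries.items()):
            if expiry < now:
                del self._entries[net]
            elif addr in net:
                return True
        return False

bl = ExpiringBlocklist()
bl.block("203.0.113.0/24")          # a whole range, default TTL
bl.block("198.51.100.7", ttl=3600)  # a single IP, blocked for one hour
print(bl.is_blocked("203.0.113.42"))  # True
print(bl.is_blocked("192.0.2.1"))     # False
```

Because every entry carries an expiry, the list trims itself on lookup instead of requiring periodic manual clean-up.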
Blocking of IP addresses can be done in various parts of the web infrastructure. Depending on how your site is built and which parts you have control over, you may choose to perform blocking through one or more of the following:
Firewalls
- Firewalls are typically built for the purpose of blocking IP addresses and can handle long lists of IPs without any noticeable performance impact.
- There is normally a change process in place to manage firewall changes without interrupting normal site operations.
- Blocking IP addresses in the firewall normally doesn’t require any development work.
- A firewall change may take an unacceptably long time to implement from a procedural standpoint, and once you have identified a scraper it is important to get a block in place quickly to stop the unwanted behavior.
- Firewalls normally operate only at the TCP/IP layer, which means that you cannot target scrapers behind proxy servers or large gateways without impacting all users of that gateway.
Load balancers
- Load balancers often operate on the HTTP layer, which means you can block scrapers by user agent or cookie, giving greater flexibility than blocking on IP alone.
- Load balancers can often handle blocks without noticeable performance impact.
- Depending on the brand and model of load balancer it may require complex changes to the configuration.
- Many companies are reluctant to make changes to their load balancers outside maintenance windows.
.htaccess or similar
A description of the htaccess functionality can be found here: http://en.wikipedia.org/wiki/Htaccess
- A simple way of blocking bots and scrapers at the IP level in the web server.
- Normally only requires a small change to the web server configuration file.
- It is possible to use more advanced functions in the web server to access parts of the HTTP header for blocking.
- It may have a performance impact.
- It may require development work.
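As a sketch of what such a block can look like, the fragment below denies an address range and a user agent while allowing everyone else. The addresses and bot name are placeholders, and the `Require` syntax assumes Apache 2.4 or later (older versions use the `Order`/`Deny` directives instead):

```apacheconf
# Block a scraper's address range and a known bad user agent (Apache 2.4+).
# 203.0.113.0/24 and "BadScraperBot" are placeholder values.
SetEnvIf User-Agent "BadScraperBot" blocked_ua
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not env blocked_ua
</RequireAll>
```

The `SetEnvIf` line is what makes header-based blocking possible here: it lets the block key on the user agent rather than the IP address alone.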
In the application
- Incorporating blocking functionality in the web application is often the most flexible way of blocking.
- Correctly written, blocking in the application will allow you to place blocks immediately.
- Requires development work.
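A minimal sketch of application-level blocking, assuming a Python WSGI stack: the function name and addresses are invented, and the decision to read `X-Forwarded-For` is an assumption you should only make when that header is set by your own proxy, since clients can otherwise spoof it.

```python
import ipaddress

def ip_block_middleware(app, blocked):
    """Wrap a WSGI app so requests from blocked IPs or ranges get a 403.

    Hypothetical sketch: trust X-Forwarded-For only if your own
    proxy sets it, because clients can forge the header themselves.
    """
    nets = [ipaddress.ip_network(n, strict=False) for n in blocked]

    def wrapper(environ, start_response):
        # First hop in X-Forwarded-For if present, else the socket address.
        raw = environ.get("HTTP_X_FORWARDED_FOR") or environ.get("REMOTE_ADDR", "")
        client = raw.split(",")[0].strip()
        try:
            addr = ipaddress.ip_address(client)
        except ValueError:
            addr = None  # unparseable address: let the request through
        if addr is not None and any(addr in net for net in nets):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)

    return wrapper
```

Because the block lives in the application, it can be extended to key on cookies, session state, or request patterns rather than IP alone, which is exactly the flexibility the bullet points above describe.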
Update: @miss_sudo kindly provided this script that can be added to a website to block Tor nodes from accessing the site.