When protecting against screen scraping, it is useful to understand who you are up against. This article goes through the common methods people and companies use to scrape data.
In-house scraping is normally the first method tried before any countermeasures are activated. An internal development team develops a script to run against your site and retrieve certain data. They most commonly start out from their company's real IP address with a basic script, and because they do little to hide themselves, they are usually easy to identify. It is a good idea to block the IP address and respond to these requests with a message explaining that what they are doing is unacceptable, and the potential impact if they choose to continue. Perl is one of the more common languages for scraper scripts, but they could just as well be written in Python or Visual Basic.
Here are some examples of the default user-agent strings of the most common scraping tools: libwww-perl (Perl's LWP library), Python-urllib, curl, Wget, and Java (the JVM's built-in HTTP client).
A simple grep over the webserver logs will usually identify a few scrapers, so you can get started blocking straight away.
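The same log check can be scripted. Below is a minimal sketch that scans Apache combined-format access-log lines for the default user-agent prefixes of common scraping tools; the signature list and sample log lines are illustrative, and real tool versions will vary.

```python
import re

# Default user-agent substrings of common scraping tools
# (versions vary, so match on the stable prefix only).
SCRAPER_SIGNATURES = ("libwww-perl", "Python-urllib", "curl/", "Wget/", "Java/")

def find_scraper_hits(log_lines):
    """Return (ip, user_agent) pairs whose user-agent matches a known default."""
    hits = []
    for line in log_lines:
        # Combined log format ends with the quoted user-agent string.
        m = re.match(r'(\S+).*"([^"]*)"\s*$', line)
        if m and any(sig in m.group(2) for sig in SCRAPER_SIGNATURES):
            hits.append((m.group(1), m.group(2)))
    return hits

sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /prices HTTP/1.1" 200 512 "-" "libwww-perl/6.05"',
    '198.51.100.2 - - [10/Oct/2023:13:55:37 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(find_scraper_hits(sample))  # only the libwww-perl line is flagged
```

Matching on user-agent alone only catches the laziest scrapers, since the header is trivial to change, but it is a cheap first pass.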
Some companies do not have the internal resources to write the scraping programs themselves and instead hire outside experts; a number of freelance sites and specialist firms offer exactly these services. There is no shortage of coders specializing in writing scraping tools.
An interesting exercise is to search for “your-site-name +scraper” and see who is interested in hiring people specifically to scrape your business.
There are plenty of ready-made programs out there to facilitate scraping of websites, everything from simple freeware and open-source tools to more advanced commercial programs. UiPath and Newprosoft are good examples of commercial scraping tools, while mozenda.com would qualify more as software-as-a-service or a cloud solution. The more advanced tools let the scraper use lists of open proxies or similar techniques to hide their activities. Scraping-as-a-service companies can specialize in particular market segments as well as in specific websites.
How scrapers hide their activities
One of the basic problems scrapers face is that they almost always want to extract a lot of data in a relatively short period of time, or continuously over a longer period. This makes them fairly easy to spot by simply looking for IP addresses with abnormally high usage. To make themselves less visible, scrapers try to distribute their traffic over a large number of IP addresses in several different ways:
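The "abnormally high usage" check described above can be sketched in a few lines. This is a toy illustration with made-up traffic numbers and a hypothetical threshold; in practice you would tune the threshold to your site's normal per-visitor volume.

```python
from collections import Counter

def top_talkers(ips, threshold=1000):
    """Return IPs whose request count exceeds the threshold, busiest first."""
    counts = Counter(ips)
    return [(ip, n) for ip, n in counts.most_common() if n > threshold]

# Toy data: one address generating far more traffic than the rest.
ips = ["198.51.100.9"] * 5000 + ["203.0.113.1"] * 40 + ["203.0.113.2"] * 25
print(top_talkers(ips))  # [('198.51.100.9', 5000)]
```

This single-IP view is exactly what the distribution tricks in the following sections are designed to defeat, which is why the later examples look at lists of proxies, cloud ranges, and whole network blocks instead.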
1. Open proxy servers
Open proxy servers are web proxies or webservers that have been wrongly configured. Their owners are most often not aware that the infrastructure is being used to hide other people's original source IP addresses. There are communities and services that continuously trawl the Internet looking for new open proxies and compile them into lists that can be used by scrapers and anyone else wishing to hide their true identity; http://www.xroxy.com/proxylist.htm is an example of such a service. There are also more reliable proxy networks with a higher level of anonymity that can be accessed for a fee.
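These public proxy lists cut both ways: a defender can consume the same lists and flag traffic arriving from them. A minimal sketch, assuming the list is plain text in the common `ip:port` format (the actual format varies by provider, and the addresses below are sample data):

```python
def load_proxy_list(text):
    """Parse a plain-text proxy list of 'ip:port' lines into a set of IPs."""
    proxies = set()
    for line in text.splitlines():
        line = line.strip()
        if line and ":" in line:
            proxies.add(line.split(":")[0])
    return proxies

# Sample list contents; in practice this would be downloaded periodically.
proxy_list = "203.0.113.5:8080\n198.51.100.77:3128\n"
open_proxies = load_proxy_list(proxy_list)

log_ips = {"203.0.113.5", "192.0.2.10"}
print(log_ips & open_proxies)  # {'203.0.113.5'}
```

Since open proxies churn constantly, the list needs regular refreshing to stay useful.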
2. Cloud services
Cloud services are the favorite option for the serious scraper. Services like Amazon’s EC2 let scrapers create servers on demand and use Amazon’s vast IP ranges for scraping. There is a growing number of cloud service providers to choose from, but if you suspect you are being scraped from a cloud service, a good place to start is to check your web logs for access from Amazon’s IP ranges, as Amazon is the largest provider. Amazon publishes its current IP ranges at https://ip-ranges.amazonaws.com/ip-ranges.json.
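Checking a log IP against those published ranges is straightforward with the standard library. The sketch below inlines a trimmed sample with the same structure as Amazon's `ip-ranges.json` (a `prefixes` array with `ip_prefix` and `service` fields); the specific prefixes shown are illustrative, and in practice you would download the file and refresh it regularly.

```python
import ipaddress
import json

# Trimmed sample mirroring the structure of ip-ranges.json.
ip_ranges_json = """
{"prefixes": [
  {"ip_prefix": "52.95.0.0/16", "region": "us-east-1", "service": "EC2"},
  {"ip_prefix": "13.248.0.0/14", "region": "GLOBAL", "service": "AMAZON"}
]}
"""

# Keep only the EC2 prefixes, where scraper instances would run.
ec2_networks = [
    ipaddress.ip_network(p["ip_prefix"])
    for p in json.loads(ip_ranges_json)["prefixes"]
    if p["service"] == "EC2"
]

def is_ec2(ip):
    """True if the address falls inside any loaded EC2 prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ec2_networks)

print(is_ec2("52.95.1.2"))   # True: inside the sample EC2 range
print(is_ec2("192.0.2.1"))   # False
```

Note that blocking cloud ranges wholesale can also block legitimate services (monitoring tools, partner integrations) hosted on the same provider, so flagging for review is usually safer than outright blocking.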
3. Anonymizer services
There are services available on the internet that let you hide your IP address. These services often provide their users with a large pool of IP addresses as well, to make tracing the users harder.
A service commonly used by scrapers is Tor, a free network that depends on volunteers setting up exit nodes for others to use. Unlike most other anonymizer services, Tor publishes a list of the active exit-node IP addresses that you can cross-reference against your logs; the current list is available at https://check.torproject.org/torbulkexitlist.
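The cross-referencing itself is a simple set lookup. The exit list is plain text with one IP per line; the addresses below are stand-in sample data rather than real exit nodes.

```python
# Sample stand-in for the downloaded exit list (one IP per line).
exit_list_text = "203.0.113.44\n198.51.100.99\n"
tor_exits = set(exit_list_text.split())

# Cross-reference request IPs from the access log against the exit set.
log_ips = ["203.0.113.44", "192.0.2.77", "203.0.113.44"]
tor_hits = [ip for ip in log_ips if ip in tor_exits]
print(tor_hits)  # ['203.0.113.44', '203.0.113.44']
```

Because exit nodes come and go, the list should be re-fetched frequently, and a hit means only that the request came via Tor, not that it was necessarily a scraper.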
There are also commercial services, for example www.anonymizer.com, that do not disclose the IP ranges they provide to their customers. To identify those ranges, you need to sign up for the service and track the addresses yourself.
4. Large networks
Some scrapers have access to substantial IP ranges of their own; with a class B network (a /16), for example, they can use over 65,000 addresses. These are normally easier to identify, as the addresses all fall within one contiguous range.
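Traffic spread across one owned range shows up as soon as you aggregate request counts by network block instead of by individual address. A small sketch using generated sample traffic (the `/16` grouping and the addresses are illustrative):

```python
import ipaddress
from collections import Counter

def busiest_slash16(ips):
    """Count requests per /16 block to spot traffic spread over one owned range."""
    blocks = Counter(
        str(ipaddress.ip_network(ip + "/16", strict=False)) for ip in ips
    )
    return blocks.most_common(1)[0]

# 500 requests spread over many hosts of one example class B network,
# plus a little unrelated background traffic.
ips = [f"203.0.{i % 254 + 1}.{i % 250 + 1}" for i in range(500)] + ["192.0.2.5"] * 3
print(busiest_slash16(ips))  # ('203.0.0.0/16', 500)
```

Per-IP counters would see each of those hosts as low-volume, but the /16 aggregate makes the single owner obvious; a WHOIS lookup on the range then identifies who it belongs to.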