Learn how to stop scraping, and whether scraping is really legal.
If you have questions about scraping or how to handle it, feel free to send an email to [email protected] and we will answer. All information will be kept confidential; nothing that can be used to identify you or your company will be published.
Q: What is the scraped data used for?
A: This depends very much on the data, so it is impossible to give one specific answer, but some examples are:
- Launching competing services
- Building telemarketing databases (specific to yellow/white pages)
- Building link farms
- Reselling services
- Content for AdSense pages
As well as other uses not listed here.
Q: Why is data scraping difficult to counter?
A: Parasitic scraping uses tools that behave like the legitimate bots run by spiders, crawlers and search engines such as Google and Bing. It is difficult to differentiate between good and bad scrapers, and it is imperative to avoid blocking the legitimate ones.
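One common way to tell a genuine search engine crawler from a scraper that only borrows its name is a forward-confirmed reverse DNS check: look up the hostname for the visiting IP, check that it belongs to the search engine, then resolve that hostname and confirm it points back to the same IP. The following is a minimal sketch of that idea, not a complete solution. The function name, the injectable lookup parameters and the suffix list are our assumptions for illustration, and the suffixes should be checked against what Google and Bing currently publish. It handles IPv4 only.

```python
import socket

# Hostname suffixes assumed to belong to Google and Bing crawlers.
# Re-check this list against the search engines' own documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes a forward-confirmed reverse DNS check."""
    # The lookups are injectable so the logic can be exercised offline.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP. Otherwise anyone
        # who controls their own reverse DNS could claim a trusted name.
        return ip in forward_lookup(hostname)
    except OSError:
        # Failed lookups (socket.herror, socket.gaierror) count as unverified.
        return False
```

DNS lookups are slow, so in practice the result would normally be cached per IP rather than computed on every request.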
Q: What is the impact of data scraping?
A: Data scraping can impact a site in three main ways:
- The uniqueness of your intellectual property is compromised
- The sheer volume of scrapers may slow down your site or even cause a denial of service
- Scraping by unknown parties may have a legal impact on your partner content
Q: What is the difference between a screen/web scraper and a bot?
A: When we use the word scraper we refer to both the program itself and the person or organization behind it. A screen/web scraper is normally focused on a single site or a single business vertical and is very dedicated to retrieving the information or committing the transactions. The scraper will monitor the performance of his or her program on each target site and will quickly notice when it is getting blocked. A malicious web bot is, as we define it, a program that retrieves information from, or posts information to, a large number of sites more or less autonomously. The web bot programmer is focused more on the overall performance of the web bot and is not likely to notice a block on a specific website immediately.
Q: Why don’t you just use a captcha test? That will block all scripts!?
A: Yes and no. In some environments a captcha test may be very useful, for example for a one-time action such as registration, but in other places it may be more or less useless. Take, for instance, a large database in which users are expected to run several searches: giving each user a captcha test is not an option in most cases. Even if you use it in conjunction with rate limiting to detect site scrapers, you will still have problems with large gateways and spiders.
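To make the rate limiting mentioned above concrete, here is a minimal sketch of a sliding-window limiter keyed on the client IP. The class name, its parameters and the injectable clock are our own choices for illustration, not part of any particular product.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.hits = defaultdict(deque)  # client key -> recent request times

    def allow(self, client):
        now = self.clock()
        recent = self.hits[client]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the limit: challenge or block this request
        recent.append(now)
        return True
```

A request handler might call `limiter.allow(remote_ip)` and show a captcha only when it returns False. The weakness described above is visible in the sketch: everyone behind a large gateway shares one key, so a limit tight enough to catch a scraper can also catch a whole office, and a legitimate spider can trip it just as easily.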
Q: Is screen scraping legal?
A: There are two major problems with using legal action to stop web scraping.
- The first is that, since scraping is performed over the Internet, the scraper may be located anywhere in the world and may not abide by the laws of the country where the site is located.
- The second problem is the sheer scale of scraping and the fact that it is usually not easy to identify the scrapers. If you have a large site with valuable information or business logic that attracts scrapers, there will probably be hundreds of offenders each month, and pursuing legal action against them all will be very costly.
You may think that pursuing one or two would be enough to deter the others, but in our experience most scrapers care little about that risk and hide behind open proxy servers or other anonymizing services that make them close to impossible to identify.
Q: How can I block an IP from accessing my site?
A: The main ways of doing this are:
- In a firewall or other packet filtering device
- In the web server, by using .htaccess or similar
- In the application itself
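As an illustration of the last option, blocking in the application itself, the sketch below checks a visitor's address against a list of blocked networks using Python's standard `ipaddress` module. The blocklist entries are reserved documentation ranges used as placeholders, and the `is_blocked` helper is a hypothetical name. How the application obtains the visitor's address depends on your framework and on any proxies in front of it.

```python
import ipaddress

# Placeholder blocklist built from reserved documentation ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),     # a whole range
    ipaddress.ip_network("203.0.113.42/32"),  # a single address
]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # This sketch rejects unparseable addresses; that is a policy choice.
        return True
    return any(address in network for network in networks)
```

A request handler could call `is_blocked(remote_addr)` early and return an HTTP 403 response when it is True. Blocking at this level is the most flexible of the options but also the most expensive, because the request has already reached your application.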
We have put together a comprehensive guide on this: