Before deciding which technical measures to deploy against scraping, it is useful to estimate what lengths the scrapers will go to in order to avoid detection and blocking. This is normally proportional to the value the scrapers see in the data: the higher that value, the more likely they are to go to great lengths to obtain it.
If your site is being scraped at present, a good place to start is to understand who is doing the scraping and what impact cutting off the flow of data from your site would have on their business.
Typical low-value scraping is the collection of email addresses or the posting of forum spam. These attacks are normally performed by autonomous bots targeting a wide array of sites and can be stopped relatively easily. At the other end of the scale are highly targeted attacks directed at sites with large quantities of unique data. Ticketing is a good example: concerts, sports, flights, trains, buses, and ferries all suffer from scraping. The scrapers hitting these sites will go to great lengths to avoid detection and blocking.
What needs to be protected?
Normally only a small proportion of a website is of any real value to a scraper. On a listings site, for example, only the listings themselves are targets, not every page or picture. Protecting only what actually needs protecting minimizes computing overhead and the potential latency added when delivering content. It is usually evident where the valuable data resides, but if you are in doubt, identify the most active scrapers on the site and look at what they are accessing, for instance with a quick pass over the access logs as sketched below. The scale of the protection can always be pared back if performance suffers.
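As a rough illustration, a short script along the following lines can surface the busiest clients and what they are fetching. It assumes a combined-format access log at a typical nginx path; both the path and the parsing will need adjusting to your own server setup.

    # Minimal sketch: find the most active clients in an access log and the
    # paths they request most. Assumes a combined-format log at the path
    # below; adjust the path and the parsing to match your own server.
    from collections import Counter, defaultdict

    LOG_FILE = "/var/log/nginx/access.log"  # assumed location

    requests_per_ip = Counter()
    paths_per_ip = defaultdict(Counter)

    with open(LOG_FILE) as log:
        for line in log:
            parts = line.split()
            if len(parts) < 7:
                continue
            ip = parts[0]
            path = parts[6]  # the request path in a combined-format line
            requests_per_ip[ip] += 1
            paths_per_ip[ip][path] += 1

    # Show the ten busiest clients and their three most requested paths
    for ip, total in requests_per_ip.most_common(10):
        top_paths = ", ".join(p for p, _ in paths_per_ip[ip].most_common(3))
        print(f"{ip}: {total} requests, mostly {top_paths}")

Clients with an unusually high request count, or ones hammering only the valuable pages, are usually the scrapers worth looking at first.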
Minimum disruption to normal users
After determining what needs to be protected, it is time to decide how to protect it and what level of protection the data needs. In some cases it may be enough to use some of the simpler methods described under the resources section of this site, which you can implement yourself. In other cases you may need to consider a commercial solution to help with the problem. It all depends on how much damage the scrapers are doing to your business and how determined they are to succeed. Whichever way you choose, consider the impact it will have on visitors to the site and make sure it is acceptable. Some solutions, such as CAPTCHA tests, can have a serious impact on traffic if used incorrectly, and other methods may introduce latency. False positives will always occur to some extent, so a structured way of handling users who are blocked by mistake is necessary; one common approach is sketched below.
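One way to keep false positives survivable is to escalate rather than block outright: mildly suspicious clients are served a challenge they can pass, and only persistently suspicious ones are blocked. The sketch below illustrates the idea; the suspicion score and the thresholds are placeholder assumptions, not a prescribed implementation.

    # Minimal sketch of an escalating response policy: suspicious clients
    # are challenged (e.g. with a CAPTCHA) before being blocked outright,
    # so a false positive costs a real user one extra step, not access.
    # The thresholds and the suspicion score itself are assumed values.
    from enum import Enum

    class Action(Enum):
        ALLOW = "allow"
        CHALLENGE = "challenge"   # serve a CAPTCHA or similar test
        BLOCK = "block"

    CHALLENGE_THRESHOLD = 0.5   # assumed tuning values
    BLOCK_THRESHOLD = 0.9

    def decide(suspicion_score: float, passed_challenge: bool) -> Action:
        """Map a 0..1 suspicion score to an action, honouring a passed challenge."""
        if passed_challenge:
            return Action.ALLOW
        if suspicion_score >= BLOCK_THRESHOLD:
            return Action.BLOCK
        if suspicion_score >= CHALLENGE_THRESHOLD:
            return Action.CHALLENGE
        return Action.ALLOW

    # Example: a mildly suspicious visitor is challenged, not blocked
    print(decide(0.6, passed_challenge=False))   # Action.CHALLENGE
    print(decide(0.6, passed_challenge=True))    # Action.ALLOW

The important property is that a legitimate user who trips the detection has a self-service way back in, rather than needing to contact support.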
Where to fit it into the infrastructure
An important decision is how to integrate the anti-scraping solution with your current infrastructure. This will have an impact on everything from troubleshooting to the normal development of your site, so the correct placement should be carefully considered. In general there are four places to detect and block scraping activity: 1) within existing infrastructure such as a load balancer or the web servers, 2) out of band, passively monitoring the traffic, 3) as a standalone device or function in-line with the traffic, or 4) in the application itself. The integration will depend on the infrastructure of your website; a cloud solution usually has completely different needs and possibilities than an in-house hosted one. Different stakeholders in the business, such as the development or infrastructure teams, may have strong opinions on how best to integrate with minimum disruption to their workflows.
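As an illustration of the fourth option, detection inside the application itself, the following is a minimal sketch of a WSGI middleware that counts requests per client IP over a sliding window and rejects clients that exceed a rate limit. The limit, the window, and the use of REMOTE_ADDR rather than a proxy-supplied header are illustrative assumptions and would need tuning for a real deployment.

    # Minimal sketch of in-application detection: a WSGI middleware that
    # tracks request timestamps per client IP and returns 429 when a client
    # exceeds the allowed rate within the sliding window.
    import time
    from collections import defaultdict, deque

    class RateLimitMiddleware:
        def __init__(self, app, limit=100, window_seconds=60):
            self.app = app
            self.limit = limit            # assumed request budget per window
            self.window = window_seconds
            self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

        def __call__(self, environ, start_response):
            ip = environ.get("REMOTE_ADDR", "unknown")
            now = time.monotonic()
            recent = self.hits[ip]
            recent.append(now)
            # Drop timestamps that have fallen outside the window
            while recent and now - recent[0] > self.window:
                recent.popleft()
            if len(recent) > self.limit:
                start_response("429 Too Many Requests",
                               [("Content-Type", "text/plain")])
                return [b"Too many requests"]
            return self.app(environ, start_response)

    # Usage: wrap an existing WSGI application
    # application = RateLimitMiddleware(application, limit=100, window_seconds=60)

The same logic could equally live in a load balancer module or a standalone in-line proxy; putting it in the application simply keeps the check closest to the data being protected, at the cost of tying it to the development team's release cycle.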