Web scraping is the practice of covertly extracting data from websites, and it has become one of the most common threats facing millions of online businesses today. The enormous value of data, and how critical access to it is for so many business purposes, has driven people to adopt all manner of legal and illegal techniques to gather web information for personal gain. Even high-profile websites such as Google, Ryanair, eBay, Facebook and EasyJet, to name just a few, have suffered serious scraping attacks, with professional scraper companies costing them billions of dollars in data theft. In view of this widespread menace, it is essential to understand what web scraping really is, to become familiar with the common tools and techniques scrapers use to break into websites, and to know the possible ways to detect and prevent scraping. This article therefore covers the definition, detection and prevention of web scraping, so that your online presence and intellectual web property need not be left at the mercy of offenders.
What is Web Scraping?
Web scraping refers to the technique of extracting bulk data (both text and graphics) from websites and then compiling the gathered information into physical data storage (hard disks, compact disks etc.) for financial gain or other business purposes. This is usually done with intelligent web scraping software programs that simulate full human-computer interaction, automating the manual process of data extraction (copy-paste) and making it easy to harvest tons of data quickly and efficiently into spreadsheets.
Other terms with a similar meaning include Web Data Extraction, Web Harvesting, Screen Scraping and Web Automation.
Web Scraping Software Vs Web Browser
Web scraping software is functionally similar to a web browser in the sense that both interact with websites in the same way and have built-in capabilities to parse the HTML document object model (DOM). However, while a web browser focuses on rendering the HTML into a full-fledged webpage, a web harvesting program quickly extracts the desired content (only the desired fields, such as name, phone number and address) from the HTML and saves it to a local file on the computer's hard disk or to an external database.
Web Scraping Software Vs Web Crawler
Web scraping software usually simulates the way humans explore the web, just as web crawlers do. But while crawlers merely index data for search engines, scraping software also transforms the unusable, non-readable format of the data (raw HTML) into a usable, readable format (the original content, such as text and images) that can easily be exported into spreadsheets for later analysis. To do this, such programs either implement low-level HTTP directly or embed a whole browser program to peel the information off the webpages.
Common Techniques and Technologies Available
There are numerous techniques and technologies for web scraping, most of which use site scraping tools, custom scripts (generally Perl or Python) or advanced scraping software. These can be divided into two categories: manual techniques and automated techniques.
- - - - -
Manual Scraping Techniques
Manual scraping techniques are those that gather web information and save it to local storage without using any tool or software. Such techniques are cumbersome and time-consuming, and are largely impractical when tons of data must be harvested for bulk marketing or other large-scale purposes. Still, manual scraping sometimes makes sense, since scrapers do not always depend on automated software and tools to collect information from the web. Most high-profile websites are already equipped with defences that detect harvesting software and prevent it from running automation scripts, so when someone urgently needs to extract information for business purposes, it is likely they will resort to the following very common manual technique on your website.
Human Copy and Paste
This technique is the simplest manual way to extract data from a website: scrapers select the sections of interest on the webpage they are browsing, copy them with a keyboard shortcut (Ctrl+C) or the mouse, and paste (Ctrl+V) the required material into external data sheets (spreadsheets etc.). Human labour in low-cost countries is also frequently used to bypass the harder captcha challenges that some sites deploy.
- - - - -
Automated Scraping Techniques
Automated scraping techniques rely on technologies that automate the task of scraping. They include advanced software programs, sophisticated web harvesting tools and certain programming techniques commonly used to scrape bulk content from websites. The following is a list of these techniques and technologies:
1. Text Grepping via Regular Expression Matching
This technique uses the common Unix command grep to extract information from websites, combined with regular expression matching in programming languages such as Perl or Python.
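As a minimal sketch of how text grepping works, the snippet below pulls email addresses out of raw HTML with nothing more than a regular expression; the sample markup and the deliberately simple pattern are illustrative only:

```python
import re

# Text grepping: extract fields from raw HTML with a regular
# expression, without parsing the HTML structure at all.
html = """
<div class="contact">Sales: <a href="mailto:sales@example.com">sales@example.com</a></div>
<div class="contact">Support: support@example.com, phone: 555-0100</div>
"""

# A deliberately simple email pattern; production patterns are more involved.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
print(sorted(set(emails)))  # ['sales@example.com', 'support@example.com']
```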
2. Text Searching without Regular Expression Matching
This technique uses a "hit and trial" method and is mostly preferred by people who are not comfortable with regular expression matching.
3. XPath Matching
XPath is a technique in which scrapers install the free "XPath Checker" add-on for Mozilla Firefox, which lets them view an element's XPath expression and extract the matching text by running a specialised script. The technique offers several advantages, such as faster development and easy replacement of expressions.
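The add-on is only used to discover the expressions; the extraction itself is typically scripted. Below is a small sketch of XPath-based extraction using the lxml library over sample markup (the page structure, class names and fields are illustrative):

```python
from lxml import html  # pip install lxml

# XPath-based extraction: select exactly the fields of interest
# from a sample page using XPath expressions.
page = html.fromstring("""
<html><body>
  <div class="listing"><span class="name">Acme Corp</span>
    <span class="phone">555-0100</span></div>
  <div class="listing"><span class="name">Globex Inc</span>
    <span class="phone">555-0199</span></div>
</body></html>
""")

names = page.xpath('//div[@class="listing"]/span[@class="name"]/text()')
phones = page.xpath('//div[@class="listing"]/span[@class="phone"]/text()')
print(list(zip(names, phones)))  # [('Acme Corp', '555-0100'), ('Globex Inc', '555-0199')]
```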
4. HTTP Requests
Information from both static and dynamic webpages can be retrieved by sending HTTP requests directly to the remote web server where the data resides. This is often done using socket programming.
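To illustrate the socket-programming approach, here is a bare-bones HTTP GET issued over a raw TCP socket, with no HTTP library involved; example.com stands in for the target host:

```python
import socket

# A hand-rolled HTTP/1.1 GET over a raw socket.
HOST = "example.com"

request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Connection: close\r\n"
    "\r\n"
).encode("ascii")

with socket.create_connection((HOST, 80)) as sock:
    sock.sendall(request)
    response = b""
    while chunk := sock.recv(4096):  # read until the server closes
        response += chunk

# Split the status line and headers from the HTML body.
headers, _, body = response.partition(b"\r\n\r\n")
print(body.decode("utf-8", errors="replace")[:200])
```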
5. HTML Parsing
HTML parsing uses HTML parsers to extract bulk data from collections of webpages by means of customised scripts, templates, algorithms and data query languages such as XQuery and HTQL.
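As a sketch of template-style HTML parsing, the snippet below uses the BeautifulSoup parser to walk repeated markup and lift each record into a plain dict, ready for spreadsheet export; the table structure and class names are made up for the example:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Template-style parsing: every <tr> follows the same shape,
# so each one maps cleanly onto a record.
soup = BeautifulSoup("""
<table id="products">
  <tr><td class="item">Widget</td><td class="price">$9.99</td></tr>
  <tr><td class="item">Gadget</td><td class="price">$24.50</td></tr>
</table>
""", "html.parser")

rows = []
for tr in soup.select("#products tr"):
    rows.append({
        "item": tr.find("td", class_="item").get_text(strip=True),
        "price": tr.find("td", class_="price").get_text(strip=True),
    })
print(rows)  # [{'item': 'Widget', 'price': '$9.99'}, {'item': 'Gadget', 'price': '$24.50'}]
```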
6. DOM Parsing
The DOM is the document object model of an HTML document; software programs can access the whole DOM tree through it and retrieve specific parts of a webpage. DOM parsing is usually done by embedding a full web browser, such as Internet Explorer or Mozilla Firefox, so that content generated dynamically by client-side scripts can also be retrieved.
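A common modern way to drive an embedded browser is Selenium; the sketch below assumes Firefox and its geckodriver are installed, and the URL and CSS selector are placeholders:

```python
from selenium import webdriver  # pip install selenium
from selenium.webdriver.common.by import By

# DOM parsing via an embedded browser: Selenium drives a real
# Firefox instance, so client-side scripts run before extraction.
driver = webdriver.Firefox()
try:
    driver.get("https://example.com/listings")
    # The DOM now includes content generated by JavaScript at load
    # time, which a plain HTTP fetch would never see.
    for element in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
        print(element.text)
finally:
    driver.quit()
```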
7. Vertical Aggregation Platforms
These platforms are developed by large companies that use cloud solutions as their strategy for harvesting large amounts of data. Such companies rely on sophisticated systems that create a variety of "bots" for specific purposes and monitor them via highly scalable vertical aggregation platforms.
8. Semantic Annotation Recognizing
Semantic annotation recognizing can be seen as a special case of DOM parsing. Metadata mark-up, or semantic annotation, can be used to retrieve the data schema and locate small data snippets embedded in the head section of most webpages.
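One widespread form of such annotation is schema.org JSON-LD embedded in a script tag in the page head. The sketch below, with an invented sample document, shows how a scraper can read the machine-readable annotation instead of the visible markup:

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Semantic annotation recognition: read the schema.org JSON-LD
# block rather than scraping the rendered page content.
soup = BeautifulSoup("""
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}}
</script>
</head><body>...</body></html>
""", "html.parser")

for tag in soup.find_all("script", type="application/ld+json"):
    data = json.loads(tag.string)
    print(data["name"], data["offers"]["price"])  # Widget 9.99
```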
9. Computer Vision Web-page Analyzers
These analyzers typically use machine learning algorithms or big data solutions to identify the required information visually and then extract it by interpreting the webpage.
10. Site Scraping Tools and Software
Site scraping tools usually provide scrapers with a point-and-click interface that harvests whatever information they desire. Here is a tabular summary of some of the most widely used scraping tools, along with their functionality.
| Tool | Description |
| --- | --- |
| Visual Web Ripper | An automated, powerful Win32-based visual tool, mainly suitable for beginners in this domain. |
| Smart Data Scraper | Extracts data from web pages, external links and custom links, using smart techniques (autosave, autopause etc.) to scavenge web content. It can also schedule scraping tasks at custom frequencies ("hourly", "daily", "weekly" or "monthly"). |
| Scraper (a Chrome plug-in) | An extension for Google Chrome, easily available on its web store, used for traversing and scraping information from websites via a custom sitemap that navigates through the webpages. |
| Helium Scraper | A powerful web page scraper cum data extractor that can easily be set up to extract any kind of information from the web via a point-and-click UI. |
| OutWit Hub | A generic tool with strong capabilities for grabbing web content and presenting it in various specialised formats. It also performs recurrent SEO analysis. |
| Screen Scraper | Lets scrapers peel structured data out of unstructured pages and format it after download; a free version is available. |
| Web Content Extractor | A highly reliable site scraping tool with a user-friendly, wizard-driven interface that requires no technical know-how. |
| WebHarvy Data Extractor | An intelligent data extractor that scrapes and collects web content in multiple formats. |
| Data Extractor by Mozenda | A site scraping tool mostly used for web content mining. |
| iRobotSoft | A Win32 web automation tool that creates smart web robots to simulate various high-end human activities. |
| Import.io | Provides an extremely easy way to import web data by downloading it directly. |
| ScraperWiki | An integrated platform for both web scraping and screen scraping. |
| Easy Web Extract | A web scraping solution that harvests data in a few clicks; no programming knowledge is required. |
| WebSundew | A site scraping tool that extracts data from unstructured formats extremely fast and with comparatively high productivity. |
| JSpider | A Java-based web spider engine, freely available under an open source licence, and thus mostly used by experienced Java programmers. |
| Scrapy | A Python-based web crawling framework mainly used for screen scraping. It can be used for everything from data mining to monitoring and automated testing. |
| DRKSpider | An SEO-oriented Java web crawler that scans your site for broken links, internal and external links, style sheets, images and other files, and reports any anomaly encountered. |
| Apache Nutch | An open source web-search software project mostly used by professionals, with built-in parsing support, a custom crawler and a link-graph database. |
| WebSphinx | A Java library cum IDE for web crawlers, with a solid GUI for configuring and controlling them. |
| Heritrix | An open source, extensible and scalable web crawler project developed by the Internet Archive. |
- - - - -
How to Detect Scrapers?
Stealing data secretly from the web for malicious purposes has become a dangerous threat to well-established online businesses. From online retailers (ecommerce sites) to real estate firms, airlines, tour and travel companies, job boards and auction websites, anyone can be targeted. Scrapers use scraping agents, tools or software to mount a myriad of attacks on a website, and these attacks often go hand in hand with other unusual activities and web application threats: SQL injection, cross-site scripting (XSS), cross-site request forgery (CSRF), directory traversal, site reconnaissance, sensitive data leakage, message-board content spam, malicious source IP addresses, bot user agents, stolen session information and more, all of which can cost you millions of dollars if not detected in time.
However, detecting scrapers, who use many malignant methods to attack your site and steal content, is not easy; in some situations it is extremely difficult to distinguish scrapers from legitimate users. It is therefore advisable to adopt powerful solutions that can detect data theft and mitigate scraping attacks as they happen. Some of these catch content scrapers directly via common methods, while others use highly scalable data security solutions to identify people who are breaking into your server remotely.
1. Copyscape Premium
Copyscape detects duplicated content through a simple search engine: you enter your website URL and quickly find copies of your content elsewhere on the internet. The results show who has copied your content to their own website, and once the offenders are identified you can file a DMCA complaint to stop them immediately. With a premium Copyscape account you can check up to 10,000 webpages of your site and gain access to various advanced tools.
2. Webmaster Tools
Webmaster tools are also a great help in detecting content scrapers. To check who is scraping your site, navigate to your site in the webmaster tools, look under Web > Links, and sort by the Linked Pages column; there you can see the websites that are secretly copying your web content.
3. Google Alerts
Creating Google Alerts with quotation marks ("") around your post's title to find exact matches will notify you by email whenever anyone steals a post and reuses it on their own site. Google Alerts has six basic fields that help you nab the content scrapers of your site:
- "Search query" field – enter the title of your post to check for duplicated content.
- "Result Type" – a dropdown list of specific types such as "News", "Blogs", "Video", "Discussions" and "Books". By default it is set to "Everything", which returns results of every type.
- "Language" field – lets you select from all supported languages.
- "How Often" – sets the frequency of alerts, with options such as "once a day" and "once a week". By default it is set to "As-it-happens".
- "How Many" field – offers the options "Best Results" and "All results".
- "Deliver to" – has two options: deliver to "email" or deliver to "feeds".
If you want to detect content scraping attacks on your website directly, this is one of the best techniques to use, since the alerts give you the URLs of all the websites that have stolen content from your site.
4. Bot Agent Detection
Bot agent detection identifies the most common bot agents that perform site scraping and stops them from causing further harm. Various advanced software products automatically differentiate between robots and actual human users.
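At its simplest, this starts with the User-Agent header. The sketch below checks a request's user agent against a few well-known automation signatures; real products layer behavioural signals on top, and the marker list here is illustrative:

```python
# User-agent based bot detection: a first, coarse filter.
KNOWN_BOT_MARKERS = ("python-requests", "curl", "wget", "scrapy", "httpclient")

def looks_like_bot(user_agent: str) -> bool:
    """Flag requests whose User-Agent matches a known bot signature."""
    if not user_agent:            # many scripts send no User-Agent at all
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

print(looks_like_bot("python-requests/2.31.0"))                     # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```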
5. Real Time Monitoring and Analytics
Some data security solutions act like hidden cameras, alerting you immediately when they sense something unusual happening on your website. These products let you detect suspected scraping attacks and monitor suspicious events via security alerts, which typically include the following capabilities:
- Flagging scraping activity from a competitor's IP address
- Detecting automated web requests
- Identifying the source addresses of malicious website users
- Recording the date and time at which the site was attacked
- Assessing the severity and type of the attack
- Providing full visibility into the malignant activities behind site scraping attacks
6. Cookie Enforcement
Cookie enforcement distinguishes browsers from bots by setting a cookie and verifying that subsequent requests return it. Full browsers handle cookies transparently, whereas many simple scraping scripts do not maintain a cookie jar, so requests that repeatedly arrive without the expected cookie are a strong signal of automation.
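As a minimal sketch of the idea, here is a Flask route in which the first visit sets a cookie and any later request that fails to return it is challenged; the cookie name, value and route are illustrative, not a specific product's behaviour:

```python
from flask import Flask, request, make_response  # pip install flask

app = Flask(__name__)

@app.route("/content")
def content():
    # Cookie enforcement: require the client to echo back a cookie.
    if request.cookies.get("session_check") != "ok":
        # A browser will retry with the cookie after a reload;
        # most naive scrapers never send it back.
        resp = make_response("Please enable cookies and reload.", 403)
        resp.set_cookie("session_check", "ok")
        return resp
    return "Protected content"
```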
- - - - -
How to Prevent Bad Bot Access and Stop Scraping
"Bots" is short for robots, the software programs used by search engines and others to traverse web content (webpages, websites etc.) and perform automated tasks on it. Different bots have different functions, both harmful and harmless: some simply index data, some peep into web forms or harvest email addresses to send you spam, and some investigate your website thoroughly, scouting it for potential security vulnerabilities. It is necessary to stop the malicious ones from harming your website.
Bots are generally classified into two types: "good bots" and "bad bots". Good bots harm no one and serve non-malicious purposes, such as crawling content to index it for the major search engines (e.g. Google, Yahoo, Bing), harvesting data for personal research, or collecting information with the intent of gaining and sharing knowledge rather than misusing it for personal profit. Bad bots, in contrast, perform repeated site scraping attacks on your website to illegally collect valuable information and use it to generate revenue.
- - - - -
Techniques to Block Scraping
To stay safe from data security risks, here are some strategies to follow to block scraping:
1. Applying Anti-Automation Techniques to verify that the user is Human
Anti-automation tactics, such as integrating complex captchas into the website, can effectively stop attackers trying to run automated scripts against your pages.
2. Data Obfuscation
Data obfuscation means placing obstacles in the way of site scrapers to hinder them from accessing and manipulating your data. This can be done with various tools designed to defeat harvesting programs.
3. Deployment of Web Application Firewall
A web application firewall serves a dual function: it detects scrapers proactively and also blocks high-profile scraping attacks by online thieves. This method of stopping site scraping is adopted by the networking departments of most online businesses that aim to prevent web data harvesting attacks on their websites.
4. Block Well Known Malicious Sources
Most of the scraping attacks can be easily prevented by taking precautions and blocking well known malicious sources beforehand. This can be done by:
- Blacklisting and manually blocking the IP addresses of well-known cloud services and bad user agents
- Whitelisting only trusted user agents and allowing them alone to access your website
- Blocking based on access rate, i.e. clients that fire off all their page requests at once
- Blocking visitors that disobey robots.txt, using traps and honeypots (see the sketch after this list)
- Blocking well-known open proxies, and detecting and screening proxied requests
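To make the honeypot idea concrete, here is a small Flask sketch: /trap/ is disallowed in robots.txt and never linked visibly, so only clients that ignore robots.txt (or follow hidden links) ever request it. The paths and in-memory block list are illustrative; a real deployment would persist the list and block at the firewall:

```python
from flask import Flask, request, abort  # pip install flask

app = Flask(__name__)
BLOCKED_IPS = set()

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers will never visit /trap/.
    return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}

@app.route("/trap/<path:anything>")
def trap(anything):
    BLOCKED_IPS.add(request.remote_addr)  # remember the offender
    abort(403)

@app.before_request
def reject_blocked():
    # Every later request from a trapped IP is refused.
    if request.remote_addr in BLOCKED_IPS:
        abort(403)
```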
5. Using Anti-Scraping Plugins on your website
Some websites use anti-scraping plugins to prevent hackers and spammers from stealing and misusing their content. A well-known example is WordPress Data Guards, a solid plug-in for WordPress site owners that blocks right clicks and the common keyboard shortcuts used to copy-paste content manually. The paid version of the plug-in also prevents some automatic scraper attacks.
6. Limit HTTP Requests
Every browser or scraping program has to send HTTP requests to access webpages, but these requests arrive far faster from automated clients than from real users, and automated robots tend to request the same content repeatedly over a sustained period. By rate-limiting HTTP requests on your website, you can therefore stop a large share of scraping attacks coming from single IPs.
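A simple way to implement this is a per-IP sliding window. The sketch below allows a fixed request budget per IP per time window; the thresholds are illustrative, and production systems usually keep this state in a shared store rather than in process memory:

```python
import time
from collections import defaultdict, deque

# Sliding-window rate limiter: at most MAX_REQUESTS per
# WINDOW_SECONDS from each client IP.
MAX_REQUESTS = 60
WINDOW_SECONDS = 60
_hits = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Return False once an IP exceeds its per-window request budget."""
    now = time.monotonic()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop requests outside the window
    if len(window) >= MAX_REQUESTS:
        return False                     # over budget: throttle or block
    window.append(now)
    return True
```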
- - - - -