How Websites Detect Web Scrapers
Web scrapers (bots) and humans can be differentiated based on their characteristics and their activities. Websites, or the anti-scraping services they use, examine the characteristics and behaviour of each visitor to determine what type of user it is.
These tools and products construct basic or detailed digital fingerprints from the characteristics of these visitors and their interactions with the website. All of this data is compiled, each visitor is assigned a likelihood of being a human or a web scraper (bot), and the visitor is then either allowed to access the website or denied access.
This detection is performed either by installed software or by service providers that bundle it into a CDN-style service or a pure cloud-based subscription offering, which intercepts all traffic to a website before allowing anyone access.
Where Websites Can Detect Bots
Detection can be done on the client side, i.e. in your browser, on the web server, or by using both of these mechanisms. A web server can use built-in software to detect a bot, or it can rely on cloud service providers such as AWS or Google Cloud. Because this detection is based on a probability computed from various factors, it can go wrong: sometimes it blocks genuine users and lets bots onto the webpage.
Let's look at both of these techniques in detail:
Server-side Bot Detection
This type of detection occurs at the web server end, using either installed software or a web service provider. All traffic is routed through that software or the service provider's servers, and only genuine users are allowed to actually reach the original web server.
There are several ways to perform such detection:
HTTP Fingerprinting:
HTTP fingerprinting is done by examining basic information sent by a browser, such as the User-Agent and request headers like cookies, the referrer, browser encoding, gzip compression support, etc. The most important and easiest signal to check is the user's IP address.
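As a rough illustration, a server-side check might score incoming headers against what a real browser would send. The header names and rules below are purely illustrative, not any particular vendor's logic:

```python
# Minimal sketch of a header-based heuristic (illustrative rules only).
# A default Requests call sends something like:
#   User-Agent: python-requests/2.31.0, Accept: */*, and no Accept-Language,
# which looks very different from a real browser's request.

SUSPICIOUS_AGENTS = ("python-requests", "curl", "wget", "scrapy")

def looks_like_bot(headers: dict) -> bool:
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(token in ua for token in SUSPICIOUS_AGENTS):
        return True
    # Real browsers almost always send these; their absence raises suspicion.
    for expected in ("Accept-Language", "Accept-Encoding"):
        if expected not in headers:
            return True
    return False

print(looks_like_bot({"User-Agent": "python-requests/2.31.0", "Accept": "*/*"}))  # True
print(looks_like_bot({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}))  # False
```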
TCP/IP Fingerprinting:
Any data that we send to a web server is sent as packets over TCP/IP. These packets contain details such as the initial packet size, TTL, TCP window size, maximum segment size, window scaling value, the "sackOK" flag, the "nop" option, etc. All of these details are combined into a unique signature of a machine, which can help in identifying a bot.
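A hedged sketch of what such a fingerprinter might look at, using the scapy library (assumed to be installed; sniffing requires root privileges) to read these attributes from incoming connection attempts:

```python
# Sketch: reading the TCP/IP attributes mentioned above from incoming SYN
# packets with scapy. Port 80 and the packet count are arbitrary choices.
from scapy.all import IP, TCP, sniff

def print_fingerprint(pkt):
    if pkt.haslayer(TCP) and pkt.haslayer(IP):
        flags = pkt[TCP].flags
        if flags & 0x02 and not flags & 0x10:  # SYN set, ACK not set
            print({
                "src": pkt[IP].src,
                "ttl": pkt[IP].ttl,             # initial TTL hints at the OS
                "window_size": pkt[TCP].window, # TCP window size
                "options": pkt[TCP].options,    # MSS, window scaling, sackOK, nop...
            })

# Capture 10 incoming connection attempts on port 80 and print their signatures.
sniff(filter="tcp port 80", prn=print_fingerprint, count=10)
```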
Web Activity Monitoring and Pattern Detection:
After creating an identity using the methods listed above, bot detectors can monitor user activity on a website, or across a number of websites that use the same bot-detection service, and look for unusual activity, such as a rate of requests far higher than a human could produce. If a user is identified as a bot, the website can ask them to solve a CAPTCHA; if they fail, they can be flagged or blocked.
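A simple version of such pattern detection is a sliding-window rate check per visitor. The 60-second window and 100-request threshold below are arbitrary values chosen just for the example:

```python
# Illustrative sliding-window rate check; thresholds are made up for the sketch.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100

request_log = defaultdict(deque)  # client fingerprint -> recent request timestamps

def is_suspicious(client_id: str) -> bool:
    now = time.time()
    timestamps = request_log[client_id]
    timestamps.append(now)
    # Drop timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS  # far faster than a human would browse

# A client that trips this check would then be served a CAPTCHA or blocked.
```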
Client-Side Bot Detection
Since client-side bot detection is easier, most websites use both techniques. On the client side, any request that does not come through a genuine browser gets blocked instantly. The easiest way to detect whether a request comes from a bot is to check if it can render a block of JavaScript. All browsers have JavaScript enabled, while a request sent by a bot, for example with the Requests module, cannot execute JavaScript.
In such cases a real browser is necessary to access the webpage and scrape it. Libraries like Selenium and Puppeteer can control a real web browser such as Chrome and do the scraping, as in the comparison sketched below.
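The contrast looks roughly like this (the URL is a placeholder, and this assumes Chrome and the Selenium package are installed):

```python
# Sketch: a plain HTTP client vs. a real browser on a JavaScript-rendered page.
import requests
from selenium import webdriver

url = "https://example.com/js-page"  # placeholder URL for a JS-heavy page

# Requests only downloads the raw HTML; content built by JavaScript never appears.
raw_html = requests.get(url).text

# Selenium drives a real Chrome instance, so scripts run and the DOM is rendered.
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

print(len(raw_html), len(rendered_html))  # the rendered page is usually much larger
```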
Client-side detection works by creating a fingerprint from multiple attributes of a real browser (a short sketch after this list shows how some of them can be read), such as:
- User Agent
- Current Language
- Do Not Track Status
- Supported HTML5 Features
- Supported CSS Rules
- Supported JavaScript Features
- Plugins installed in Browser
- Screen Resolution, Color Depth
- Time Zone
- Operating System
- Number of CPU Cores
- GPU Vendor Name & Rendering Engine
- Number of Touch Points
- Types of Storage Supported by the Browser
- HTML5 Canvas Hash
- List of Fonts Installed on the Computer
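Many of these attributes are exposed through the browser's own JavaScript APIs. The sketch below reads a few of them by driving Chrome from Selenium; the URL is a placeholder, and the attribute list is only a small sample of what fingerprinting scripts collect:

```python
# Sketch: reading a few of the fingerprint attributes listed above via the
# browser's JavaScript APIs, driven from Selenium.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

fingerprint = driver.execute_script("""
    return {
        userAgent: navigator.userAgent,
        language: navigator.language,
        platform: navigator.platform,
        hardwareConcurrency: navigator.hardwareConcurrency,  // CPU cores
        screen: [screen.width, screen.height, screen.colorDepth],
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        maxTouchPoints: navigator.maxTouchPoints,
        webdriver: navigator.webdriver  // true for automated browsers, a telltale sign
    };
""")
print(fingerprint)
driver.quit()
```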
Using all of these techniques, a website can detect a bot. But as websites get smarter at bot detection, so do web scrapers. Expert web scraping services can mimic a browser using Selenium, or use proxies, IP rotation, CAPTCHA-solving services, etc. to bypass all of these checkpoints. It is an ongoing fight between websites and scrapers, and both are continuously developing new ways to counter each other. Learn more about How to Use Cookies and Sessions in Python Web Scraping.