Almost every company and individual benefits from web scraping every day. If you don’t believe that, imagine a world without Google.
You discovered this article because Google scraped and indexed our website. Every company out there engages in web scraping at some level or relies on third parties to perform web scraping for them. If they do neither, they definitely consume data gathered by web scraping companies, or at the very least rely on search engines to perform everyday tasks – and all of that is web scraping or web crawling performed by bots.
However, some bots can lead to harmful outcomes, such as bots that manipulate ads or automate hacking at scale, and significant resources are being dedicated on both sides of the equation – by web scraping companies and by anti-bot companies. Web scrapers and bot mitigation companies have been playing a game of cat and mouse ever since web scraping became a popular way to extract data from web pages. Bot mitigation companies and products try to separate non-human (bot) traffic from all the traffic a website receives. The least sophisticated bots are easy to identify; as bots get more sophisticated, it becomes much harder to accurately distinguish a bot from a human.
How do websites detect web scrapers and other bots?
Bots and humans can be distinguished by their characteristics or their behavior. Websites, or the anti-scraping services they employ, analyze the characteristics and behavior of each visitor to determine which type of visitor it is. These tools and products construct basic or detailed digital fingerprints from the characteristics of visitors and their interactions with the website. All of this data is compiled, each visitor is assigned a likelihood of being human or a bot, and the visitor is either allowed to access the website or denied access. This detection is done either by installed software or by service providers that bundle it into a CDN-type service or a purely cloud-based subscription offering that intercepts all traffic to a website before allowing anyone access.
These products obfuscate how they identify traffic as belonging to a bot, which adds an aura of mystery to their products and enhances their commercial value.
Where can websites detect bots?
This detection can happen on the client side (i.e., your browser running on your computer), on the server side (i.e., the web server or inline anti-bot technologies that protect it by intercepting traffic), or a combination of the two. Web servers either use inline products to detect this behavior before it hits the web server, or they use cloud services that either work before the traffic hits the website or are embedded into the web server and rely on processing performed elsewhere to detect and block bot traffic. The problem is that this detection (like everything else) has false positives: it can end up detecting and blocking regular people as bots, or add so much processing overhead that it makes the site slow and unusable. These technologies come with costs (financial and technical), and those trade-offs need to be considered.
Here are some of the areas where the detection can occur:
- Server Side Fingerprinting with behavior analysis
- Client or Browser Side fingerprinting with behavior analysis
- A Combination of both of the above spread across multiple domains and data centers
Server Side Bot Detection
This level of bot detection starts at the server level – on the web server of the website, or on the devices of cloud-based services that sit in front of the website, monitoring traffic and identifying or blocking bots. A few types of fingerprinting methods are usually used in combination to detect bots from the server side.
Fingerprinting has a detrimental impact on global privacy, as it allows seamless tracking of individuals across the Internet – but that is a whole topic in itself.
HTTP fingerprinting is done by analyzing the traffic a visitor of a website sends to the web server. Almost all of this information is accessible to the web server, and some of it can also be seen in the web server logs. It can reveal basic information about a visitor to the site, such as:
- User Agent (what kind of browser – Chrome, Firefox, Edge, Safari, etc. – and its version)
- Request headers such as Referer, Cookie, which encodings the browser accepts, whether it accepts gzip compression, etc. – all additional pieces of information sent by the browser to the server
- The order of the headers above
- The IP address the visitor is making the request from, or the address that finally reaches the web server (in case the visitor is behind an ISP NAT or proxy server)
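To make the idea concrete, here is a toy sketch (not any vendor's real algorithm) of how a server could turn the User-Agent and the order of received headers into a single HTTP fingerprint:

```python
import hashlib

def http_fingerprint(headers):
    """Build a toy HTTP fingerprint from the header names *in the order
    they arrived*, plus the User-Agent value. Real products combine many
    more signals; this only illustrates the idea."""
    header_order = ",".join(name.lower() for name, _ in headers)
    user_agent = dict(headers).get("User-Agent", "")
    return hashlib.md5(f"{user_agent}|{header_order}".encode()).hexdigest()

# A typical browser sends many headers in a stable order...
browser = [("Host", "example.com"), ("User-Agent", "Mozilla/5.0 ..."),
           ("Accept", "text/html"), ("Accept-Encoding", "gzip, deflate"),
           ("Referer", "https://www.google.com/")]
# ...while a naive script sends fewer headers in a different order.
script = [("Host", "example.com"), ("Accept-Encoding", "identity"),
          ("User-Agent", "python-requests/2.31.0")]

print(http_fingerprint(browser) != http_fingerprint(script))  # → True
```

Even with an identical User-Agent string, a mismatch in header order against what that browser normally sends is enough to raise suspicion.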
TCP/IP Stack Fingerprinting
The data a visitor sends to a server reaches it as packets over TCP/IP. The TCP stack fingerprint includes details such as:
- Initial packet size (16 bits)
- Initial TTL (8 bits)
- Window size (16 bits)
- Max segment size (16 bits)
- Window scaling value (8 bits)
- “don’t fragment” flag (1 bit)
- “sackOK” flag (1 bit)
- “nop” flag (1 bit)
Read more about TCP/IP internals in this very easy-to-understand article.
These variables are combined to form a digital signature of the visitor’s machine that has the potential to uniquely identify a visitor – bot or human. Open-source tools such as p0f can tell if a User Agent is being forged. They can even identify whether a visitor to a website is behind a NAT network or has a direct connection to the Internet, along with browser settings such as language preferences.
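As a rough sketch of how such a signature could be formed – the field names and bit widths come from the list above, but the packing scheme and example values are my own illustration, not p0f's actual algorithm:

```python
import hashlib
import struct

def tcp_signature(init_pkt_size, ttl, window, mss, wscale,
                  df_flag, sack_ok, nop):
    # Pack the fields at the bit widths listed above: three 16-bit values,
    # two 8-bit values, and the three 1-bit flags folded into one byte.
    flags = (df_flag << 2) | (sack_ok << 1) | nop
    raw = struct.pack(">HBHHBB", init_pkt_size, ttl, window, mss, wscale, flags)
    return hashlib.md5(raw).hexdigest()

# Example values loosely typical of a Linux host (illustrative only).
sig = tcp_signature(init_pkt_size=60, ttl=64, window=29200,
                    mss=1460, wscale=7, df_flag=1, sack_ok=1, nop=1)
print(sig)
```

Because defaults like the initial TTL differ between operating systems (e.g. Linux and Windows use different values), a signature like this can contradict a forged User-Agent that claims a different OS.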
When a site is accessed securely over the HTTPS protocol, the web browser and the web server generate a TLS fingerprint during the SSL handshake. Most client user-agents – different browsers, applications such as Dropbox, Skype, etc. – initiate an SSL handshake request in a unique way, which allows that access to be fingerprinted.
The open-source TLS fingerprinting library JA3 gathers the decimal values of the bytes for the following fields in the Client Hello packet during an SSL handshake:
- SSL Version
- Accepted Ciphers
- List of Extensions
- Elliptic Curves
- Elliptic Curve Formats
It then combines those values in order, using a “,” to delimit each field and a “-” to delimit each value within a field. The resulting string is then MD5-hashed to produce an easily consumable and shareable 32-character fingerprint: the JA3 SSL client fingerprint. MD5 hashes are also fast to generate and compare, which makes them practical for this purpose.
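That construction can be sketched as follows – the delimiters and MD5 step match the description above, but the field values here are made-up decimals, not parsed from a real Client Hello:

```python
import hashlib

def ja3_fingerprint(version, ciphers, extensions, curves, curve_formats):
    # Join the values within each field with "-" and the fields with ","...
    fields = [str(version),
              "-".join(map(str, ciphers)),
              "-".join(map(str, extensions)),
              "-".join(map(str, curves)),
              "-".join(map(str, curve_formats))]
    ja3_string = ",".join(fields)
    # ...then MD5-hash the string into a 32-character fingerprint.
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative decimal values for the five Client Hello fields.
print(ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

Because a given browser version offers the same ciphers and extensions in the same order on every connection, its JA3 hash is stable and easy to blocklist or allowlist.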
Learn more about JA3 here: Salesforce JA3 Github
Behavior Analysis and Pattern Detection
Once a unique fingerprint is constructed by combining all of the above, bot detection tools can trace a visitor’s behavior within a website – or across many websites, if those sites use the same bot detection provider. They perform behavioral analysis on the browsing activity, which usually includes:
- The pages visited
- The order of pages visited
- Cross-matching the HTTP Referer header with the previous page visited
- The number of requests made to the website
- The frequency of requests to the website
This allows the anti-bot products to decide whether a visitor is a bot or a human based on the data they have seen previously, and in some cases to send a challenge, such as a CAPTCHA, to be solved by the visitor. If the visitor solves the CAPTCHA, they might be recognized as a human; if the CAPTCHA fails (which is the case with most bots, since they do not anticipate it), the visitor gets flagged as a bot and blocked.
From then on, any request carrying these fingerprints – HTTP, TCP, TLS, IP address, etc. – to any of the websites that use the same bot detection service will be challenged to prove it comes from a human. The visitor or their IP address is usually kept on a blacklist for a certain period of time, and removed from it if no further bot activity is seen. Sometimes, persistently abusive IP addresses are permanently added to global IP blocklists and denied entry to the many sites that use such rudimentary blocklists.
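A minimal sketch of this kind of frequency-based behavioral check with a temporary blacklist might look like the following – the thresholds here are made-up, and real products combine many more signals:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10      # look-back window for request frequency
MAX_REQUESTS = 20        # humans rarely exceed this within the window
BLACKLIST_SECONDS = 300  # how long a flagged fingerprint stays blocked

recent = defaultdict(deque)   # fingerprint -> timestamps of recent requests
blacklist = {}                # fingerprint -> time at which the block expires

def is_bot(fingerprint, now=None):
    """Flag a visitor whose request frequency exceeds a human-like rate,
    and keep them blacklisted for a while afterwards."""
    now = time.time() if now is None else now
    if blacklist.get(fingerprint, 0) > now:
        return True
    timestamps = recent[fingerprint]
    timestamps.append(now)
    # Drop requests that fell out of the look-back window.
    while timestamps and timestamps[0] < now - WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) > MAX_REQUESTS:
        blacklist[fingerprint] = now + BLACKLIST_SECONDS
        return True
    return False
```

A scraper that paces its requests like a human browsing the site stays under such thresholds, which is one reason request frequency alone is a weak signal.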
It is relatively easy to bypass server-side bot detection if web scrapers are fine-tuned to work with the websites being scraped.
Pro Tip: The best way to understand every aspect of the data that moves between a client and a server as part of a web request is to use a MITM proxy server in the middle, or to look at the Network tab of a web browser’s developer tools (opened with F12 in most browsers). For deeper analysis beyond HTTP, lower down the TCP/IP stack, you can also use Wireshark to inspect the actual packets, headers, and all the bits that go back and forth between the browser and the website. Any or all of those bits can be used to identify a visitor of the website and consequently help fingerprint them.
You can follow the directions in this post to get past most of the simpler server side bot detection techniques.
Client Side Bot Detection (Browser Side Bot Detection)
Almost all bot detection services use a combination of browser-side detection and server-side detection to accurately block bots.
The first thing that happens when a site enables client-side detection is that every scraper that is not a real browser gets blocked immediately.
Once this happens, a real browser is necessary in most cases to scrape the data. There are libraries to automatically control browsers, such as Selenium, Puppeteer, and Playwright.
As an example, the navigator object of a browser exposes a lot of information about the computer running it. Here is an expanded view of the navigator object in the Safari browser.
Below are some common features used to construct a browser’s fingerprint
- User Agent
- Current Language
- Do Not Track Status
- Supported HTML5 Features
- Supported CSS Rules
- Plugins installed in Browser
- Screen Resolution, Color Depth
- Time Zone
- Operating System
- Number of CPU Cores
- GPU Vendor Name & Rendering Engine
- Number of Touch Points
- Different Types of Storage Support in Browser
- HTML5 Canvas Hash
- The list of fonts installed on the computer
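Conceptually, these features are serialized and hashed into a single value. The Python sketch below stands in for the JavaScript that would actually collect them in the browser, and all of the feature values are illustrative:

```python
import hashlib
import json

def browser_fingerprint(features):
    # Serialize the collected features deterministically, then hash them
    # into one identifier for this browser/machine combination.
    canonical = json.dumps(features, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Values like these would be read from the navigator and screen objects.
features = {
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "language": "en-US",
    "doNotTrack": "1",
    "screen": [2560, 1440, 24],      # resolution and color depth
    "timezone": "America/New_York",
    "hardwareConcurrency": 8,        # number of CPU cores
    "fonts": ["Arial", "Helvetica", "Menlo"],
}
print(browser_fingerprint(features))
```

Changing any single feature – a different language setting, one extra font – yields a completely different hash, which is why these fingerprints are so discriminating.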
Learn More about browser fingerprints from this article on Mozilla – This is Your Digital Fingerprint
Apart from these techniques, bot detection tools also look for any flags that can tell them the browser is being controlled through an automation library:
- Presence of bot-specific signatures
- Support for non-standard browser features
- Presence of common automation tools such as Selenium, Puppeteer, Playwright, etc.
- Absence of human-generated events such as randomized mouse movements, clicks, scrolls, tab changes, etc.
All this information is combined to construct a unique client-side fingerprint that can tag a visitor as a bot or a human.
Bypassing these Bot Detection / Bot Mitigation / Anti Scraping Services
Developers have found many workarounds to fake their fingerprints and conceal that they are bots. For example:
- Puppeteer Extra – Puppeteer Stealth Plugin
- Patching Selenium/PhantomJS – Stack Overflow answer on patching Selenium with ChromeDriver
- Fingerprint Rotation – Microsoft Paper on Fingerprint Rotation
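One of the workarounds above, fingerprint rotation, can be sketched as cycling through a pool of header profiles so that consecutive requests do not all carry an identical fingerprint. The profiles below are made-up examples, and real rotation would cover far more than two headers:

```python
import itertools

# A pool of hypothetical browser profiles to rotate between requests.
PROFILES = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/120.0",
     "Accept-Language": "en-US,en;q=0.9"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ... Safari/605.1.15",
     "Accept-Language": "en-GB,en;q=0.8"},
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) ... Firefox/121.0",
     "Accept-Language": "de-DE,de;q=0.7"},
]

_rotation = itertools.cycle(PROFILES)

def next_headers():
    """Return the next header profile in the rotation; a copy is returned
    so callers can add per-request headers without mutating the pool."""
    return dict(next(_rotation))
```

Note that rotating only HTTP headers is not enough against TCP/IP or TLS fingerprinting, since those layers will still contradict a User-Agent that claims a different browser or OS.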
But as you might have guessed, just like the bots, bot detection companies are getting smarter. They have been improving their AI models to look for variables, actions, events, etc. that can still give away the presence of an automation library.
Most poorly built scrapers will get banned by these advanced (or “military grade,” as the vendors call them) bot detection systems.
Web scraping techniques and detection techniques evolve every day, and this article has hopefully provided you with valuable insight into some of the general concepts.
If you like this article go ahead and share it on your favorite medium, Twitter, Facebook, LinkedIn, Reddit etc and keep coming back for more articles that we publish regularly on our website – scrapehero.com.