How to prevent getting blacklisted while scraping

Web scraping is a task that has to be performed responsibly so that it does not have a detrimental effect on the sites being scraped. Web crawlers can retrieve data much faster and in greater depth than humans, so bad scraping practices can have a real impact on a site's performance.

If a crawler performs multiple requests per second and downloads large files, an under-powered server will have a hard time keeping up with requests from multiple crawlers. Since web crawlers, scrapers, or spiders (terms used interchangeably here) don't drive human website traffic and can affect the performance of a site, some site administrators dislike spiders and try to block their access. Most websites do not deploy aggressive anti-scraping mechanisms, since that would affect the user experience, but some sites do block scraping because they do not believe in open data access.

In this article, we will talk about how websites detect and block spiders and techniques to overcome those barriers.

Basic Rule: “Be Nice”

An overarching rule to keep in mind for any kind of web scraping is
BE GOOD AND FOLLOW A WEBSITE’S CRAWLING POLICIES

Here are some of the best practices you can follow to avoid detection and blocking.

1. Respecting Robots.txt

Web spiders should ideally follow the robots.txt file for a website while scraping. It has specific rules for good behavior, such as how frequently you can scrape, which pages allow scraping, and which ones you can't scrape. Some websites allow Google to scrape them while not allowing any other website to do so. This goes against the open nature of the Internet and may not seem fair, but the owners of the website are within their rights to resort to such behavior.

You can find the robots.txt file in the root directory of a website, for example http://example.com/robots.txt.

If it contains lines like the ones shown below, it means the site does not want to be scraped.

User-agent: *
Disallow: /

However, since most sites want to be on Google, arguably the largest scraper of websites globally, they do allow access to bots and spiders.
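
As a quick starting point, here is a minimal sketch of checking a page against robots.txt before fetching it, using Python's built-in urllib.robotparser. The example.com URLs and the crawler name MyCrawler/1.0 are placeholders, not anything prescribed by the article.

from urllib import robotparser

# Load and parse the site's robots.txt (example.com is a placeholder)
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Check a specific URL against the rules for your crawler's user agent
if rp.can_fetch("MyCrawler/1.0", "http://example.com/some/page.html"):
    print("Allowed to fetch this page")
else:
    print("Disallowed by robots.txt, skipping this page")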

2. Make the crawling slower, do not slam the server, treat websites nicely

Web scraping bots fetch data very fast, but it is easy for a site to detect your scraper because humans cannot browse that fast. The faster you crawl, the worse it is for everyone. If a website gets more requests than it can handle, it might become unresponsive.

Make your spider look real by mimicking human actions. Put some random programmatic sleep calls in between requests, add delays after crawling a small number of pages, and choose the lowest number of concurrent requests possible. Ideally, put a delay of 10-20 seconds between clicks so that you do not put much load on the website and you treat it nicely.

Use auto-throttling mechanisms, which will automatically adjust the crawling speed based on the load on both the spider and the website you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs, and do this periodically because the environment changes over time.
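
As an illustration, here is a rough sketch of polite crawling with the Python requests library, using randomized 10-20 second delays between requests. The URL list and delay range are placeholders to tune per site.

import random
import time

import requests

urls = ["http://example.com/page/{}".format(i) for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause a random 10-20 seconds so requests do not arrive at machine speed
    time.sleep(random.uniform(10, 20))

If you are using Scrapy, its AutoThrottle extension (enabled with AUTOTHROTTLE_ENABLED = True in the project settings) provides this kind of adaptive throttling out of the box.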

3. Do not follow the same crawling pattern

Humans generally do not perform repetitive tasks as they browse through a site; their actions contain a degree of randomness. Web scraping bots, on the other hand, tend to follow the same crawling pattern because they are programmed that way unless told otherwise. Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions.

Incorporate some random clicks on the page, mouse movements, and other random actions that will make the spider look like a human.
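
For example, here is a hedged sketch using Selenium to add some of that randomness: a small random mouse movement, a random scroll, and a random pause. It assumes Chrome and its driver are available, and the URL is a placeholder.

import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("http://example.com")

# Move the mouse by a small random offset and scroll a random distance
ActionChains(driver).move_by_offset(random.randint(5, 50), random.randint(5, 50)).perform()
driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(200, 800))

# Dwell on the page for a random, human-like amount of time
time.sleep(random.uniform(3, 8))
driver.quit()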

4. Disguise your requests by rotating IPs and Proxy Services

When scraping, your IP address is visible to the target site. The site can tell what you are doing and whether you are collecting data, and it can track usage patterns tied to that address, including whether it belongs to a first-time visitor.

Multiple requests coming from the same IP will lead to you getting blocked, which is why we need to use multiple addresses. When we send requests through a proxy machine, the target website does not know where the original request came from, which makes detection harder.

Create a pool of IPs that you can use, and pick a random one for each request. Along with this, spread your requests across multiple IPs rather than concentrating them on a few.

There are several methods that can be used to change your outgoing IP. Services such as VPNs, shared proxies, and TOR can help. In addition, various commercial providers offer automatic IP rotation services.

This technique also distributes the load across various exit points.
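
A minimal sketch of this idea with the Python requests library is shown below. The proxy addresses are placeholders for the ones you get from your provider.

import random

import requests

# Placeholder proxy pool; substitute real proxies from your provider
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    # Route both http and https traffic through the randomly chosen proxy
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch("http://example.com")
print(response.status_code)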

Some websites have completely blocked well-known IP ranges, such as those belonging to AWS, to prevent scraping from Amazon's massive pool of IP addresses. Such protectionist policies are obviously counter to the open nature of the Internet.

5. User Agent Rotation and Spoofing

A user agent is a string that tells the server which web browser (or other client) is making the request. If the user agent is not set, many websites won't let you view their content. You can find your own user agent by typing 'what is my user agent' into Google's search bar.

You can also check your user-agent string here:
http://www.whatsmyuseragent.com/

Every request made from a web browser contains a user-agent header, and using the same user-agent consistently leads to the detection of a bot. The simplest way to make your requests appear more genuine and bypass this detection is to fake (spoof) the user agent.

How to Spoof a User-Agent:

Create a list of user agents and pick a random one for each request to prevent getting blocked. Set your user-agent to a common web browser instead of the default one sent by your tool or library (such as wget/version or urllib/version). You could even pretend to be the Google Bot, Googlebot/2.1 (http://www.google.com/bot.html), if you want to have some fun!

User-agent string listings to get you started can be found here:
http://www.useragentstring.com/pages/useragentstring.php
https://developers.whatismybrowser.com/useragents/explore/
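
Here is a short sketch of rotating the User-Agent header per request with the Python requests library. The strings below are only examples; build your own list from the resources above.

import random

import requests

# Example user-agent strings; replace with an up-to-date list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
response = requests.get("http://example.com", headers=headers)
print(response.status_code)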

6. Check if Website is Changing Layouts

Some websites make things tricky for scrapers by serving slightly different layouts.

For example, pages 1-20 of a website may use one layout while the rest of the pages use another. To handle this, check whether your XPaths or CSS selectors are actually returning data. If they are not, check how the layout differs and add a condition in your code to scrape those pages differently.
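
A hedged sketch of that kind of fallback, using requests and BeautifulSoup with placeholder selectors:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = requests.get("http://example.com/page/25").text
soup = BeautifulSoup(html, "html.parser")

# Primary selector (layout used on pages 1-20), then a fallback for the alternate layout
titles = soup.select("div.product h2.title")
if not titles:
    titles = soup.select("section.listing span.name")

if not titles:
    # Neither selector matched; log it so you notice the layout change
    print("WARNING: no data extracted, layout may have changed")
else:
    print([t.get_text(strip=True) for t in titles])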

7. Use a headless browser

Websites can display content differently depending on the browser being used. Take Google search results: if the browser (identified by the user agent) has advanced capabilities, the website may present "richer" content, something more dynamic and styled with a heavy reliance on JavaScript and CSS.

The problem is that when doing this kind of web scraping, the content you need is rendered by the JavaScript code and is not present in the raw HTML response the server delivers.

One way to avoid being blacklisted here is to use a headless browser. A headless browser works like any other browser, except that it is not displayed on a desktop; there is no graphical interface. Instead of interacting with elements on screen, you automate everything through code or a command-line interface. This can help you stay undetected while web scraping.

Selenium, PhantomJS, and the latest entrant, Google's own headless Chrome, are some options to explore further.

Keep in mind that headless browsers driven by tools like Selenium and Puppeteer use a lot of resources (RAM, CPU, bandwidth, etc.) in comparison to lightweight script-based approaches that work directly with the raw HTTP responses.
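
Here is a minimal sketch of rendering a JavaScript-heavy page with headless Chrome through Selenium. It assumes Chrome and its driver are installed; the URL is a placeholder, and the --headless=new flag applies to recent Chrome versions (older ones use --headless).

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("http://example.com")

# page_source now holds the DOM after JavaScript has executed
print(driver.page_source[:500])
driver.quit()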

8. Beware of Honey Pot Traps

Honeypots are systems set up to lure hackers and detect any attempts to gain unauthorized information. A honeypot is usually an application that imitates the behavior of a real system. Some websites install honeypots in the form of links that are invisible to normal users but can be seen by web scrapers.

When following links, always take care that the link is actually visible and does not carry a nofollow attribute. Some honeypot links set up to detect spiders will have the CSS style display:none or will be disguised in a color that blends in with the page's background.

This detection is obviously not easy and requires a significant amount of programming work to accomplish properly. As a result, the technique is not widely used on either side, whether by the server or by the bot or scraper.
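
If you do want a basic safeguard, here is a rough sketch that skips links marked nofollow or hidden with inline styles. Note that links hidden via external CSS will slip past this simple check, and the URL is a placeholder.

import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com").text
soup = BeautifulSoup(html, "html.parser")

safe_links = []
for a in soup.find_all("a", href=True):
    rel = a.get("rel") or []
    style = (a.get("style") or "").replace(" ", "").lower()
    # Skip links that are likely honeypots or that the site does not want followed
    if "nofollow" in rel or "display:none" in style or "visibility:hidden" in style:
        continue
    safe_links.append(a["href"])

print(safe_links[:10])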

9. Scrape Behind Login

Logging in is basically a permission gate for accessing certain web pages. Some websites, such as Indeed and Facebook, do not allow access without it.

If a page is protected by a login, the scraper has to send some information or cookies along with each request to view the page. This makes it easy for the target website to see that the requests are coming from the same account or address, and it could revoke your credentials or block your account.

It is generally preferable to avoid scraping websites that have a login, as you will get blocked easily. But one thing you can do is imitate a human browser whenever authentication is required, so that you still get the target data you need.
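
A hedged sketch of the usual approach with a persistent requests.Session is shown below. The login URL and form field names are placeholders, so inspect the real login form (including any hidden CSRF tokens) before using something like this.

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# Log in once; the session object keeps the cookies for later requests
login = session.post(
    "http://example.com/login",
    data={"username": "your_username", "password": "your_password"},
)
login.raise_for_status()

# Subsequent requests are sent with the authenticated cookies
page = session.get("http://example.com/members/data")
print(page.status_code)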

How can websites detect web scraping?


Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of these methods are enumerated below:

  1. Unusual traffic or a high download rate, especially from a single client or IP address, within a short time span.
  2. Repetitive tasks performed on the website – based on an assumption that a human user won’t perform the same repetitive tasks all the time.
  3. Detection through honeypots – these honeypots are usually links which aren’t visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarms are tripped.

How to address this detection? 

Spend some time upfront investigating the anti-scraping mechanisms used by a site and build your spider accordingly. It will provide a better outcome in the long run and increase the longevity and robustness of your work.

How to easily know if a site doesn’t support data scraping?

Check the robots.txt file, which is usually in the root directory of a website, e.g. http://example.com/robots.txt.
If it contains lines like the ones shown below, it means the site does not want to be scraped:

User-agent: *
Disallow: /

These lines block well-behaved bots, that is, the ones that respect robots.txt, from accessing any page on the site.

However, since most sites want to be on Google (arguably the largest scraper of websites globally), they do allow access to bots and spiders.

What happens when you get banned?

There are two simple ways to ban a web spider or scraper: block all access from a particular IP address, or block all access that uses a specific identifier (the user agent) to reach the server. Most browsers and web spiders identify themselves with a user-agent string whenever they request a page; the Chrome browser, for example, uses something like Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36.

The banning can be temporary or permanent.

Temporary blocks can last minutes or hours.

Permanent bans go against the open nature of the Internet but some sites resort to this “scorch the internet” measure.

How do you find out if a website has blocked you?


If any of the following signs appear while you are crawling a site, it usually means you have been blocked or banned.

  • CAPTCHA pages
  • Unusual content delivery delays
  • Frequent response with HTTP 404, 301 or 50x errors

Frequent appearance of these HTTP status codes is also an indication of blocking:

  • 301 Moved Permanently
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • 408 Request Timeout
  • 429 Too Many Requests
  • 503 Service Unavailable

A comprehensive list of HTTP return codes (successes and failures) can be found here. It will be worth your time to read through these codes and be familiar with them.
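
As a practical illustration, here is a small sketch that watches for these block signals (a block-like status code or a CAPTCHA page) and backs off before retrying. The status set, thresholds, and CAPTCHA check are illustrative only.

import time

import requests

BLOCK_CODES = {403, 408, 429, 503}

def fetch_with_backoff(url, max_retries=3):
    delay = 60  # start with a one-minute pause after the first block signal
    for attempt in range(max_retries):
        response = requests.get(url)
        blocked = response.status_code in BLOCK_CODES or "captcha" in response.text.lower()
        if not blocked:
            return response
        print("Possible block (HTTP {}), waiting {}s".format(response.status_code, delay))
        time.sleep(delay)
        delay *= 2  # exponential backoff before the next attempt
    return None

result = fetch_with_backoff("http://example.com")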

Summary

All these ideas above provide a starting point for you to build your own solutions or refine your existing solution. If you have any ideas or suggestions, please join the discussion in the comments section.

Thank you for reading.

Or you can ignore everything above, and just get the data delivered to you as a service. Interested?





Responses

Naznin January 10, 2017

Scraped, and now it is showing as forbidden. What to do next?


    Shabbir February 14, 2017

    Wait for a day and check if you are still blocked. If you are using a proxy, change the proxy in the request params and retry. If that doesn't work, you have to switch to a different IP, flush your DNS, and renew your IP on DHCP. If it's static, then sorry :-)


    thedude that doesntabide April 29, 2017

    Don't listen to numb nuts down there, change your user agent. If you are spawning a lot of requests, use a VPN or proxy every other request, but change your user agent often as that will be the first thing marked.


Alex April 24, 2017

Hey,

First off, great article! A lot of good information here. Secondly, I was hoping you might be able to help me out. I've created a spider using Guzzle (PHP) and I am using a spoof header (only a fake user agent), but it only works 60% of the time. The other 40% I get a 503 error. The weird thing is that I noticed when I set User-Agent to null, it passes 100% of the time. I would like to use fake user-agents, because I know I'll eventually get blocked.

Do you have any idea why this might be?

Thanks again


VotersofNY December 20, 2017

New at this. Can I just do a view source and then save the source and use a php script to extract the information I want from it?


    Martín January 29, 2018

    You can, as long as you are not doing it 1000 times per minute with automated software/scripts.


sajad September 6, 2018

Is it possible to scrape from a website that has strict limitations?
I just checked the robots.txt for a web page and it seems it even prevents google_pm from having access, but the data that I want to scrape from it is public.
So when I requests.get(URL) in Python I always get an error like:
ConnectionError: ('Connection aborted.', OSError("(10054, 'WSAECONNRESET')",))

Here are the robots.txt rules for this website:
Disallow: /search/
Disallow: /google_pm/
Disallow: /research/get_news.php
Disallow: /news_comtex_sitemap.xml
Disallow: /news_partner_sitemap.xml
Disallow: /blog/post_archive.html
Disallow: /external/all_commentary/
Disallow: /article/stock/news/
Disallow: /article/stock/commentary/
Disallow: /pr/
Disallow: /comments.php
Disallow: /ZER/zer_get_pdf.php
Disallow: /ZER/free_report.php
Disallow: /etf/etf_get_pdf.php
Disallow: /research/pdf_snapshot.php
Disallow: /stock/quote/pdf_snapshot.php
Disallow: /zer_comp_reports.php
Disallow: /commentary_print.php
Disallow: /research/print.php
Disallow: /forgot.php
Disallow: /research/report.php
Disallow: /tracks/
Disallow: /stock/stockcompare/comparestocks.php
Disallow: /ZER/zer_comp_reports.php
Disallow: /ZER/zer_industry_drilling_detail.php
Disallow: /performance_guarantee.php
Disallow: /stock/quote/report.php
Disallow: /research/report.php
Disallow: /research/reports/index.php
Disallow: /research/reports/
Disallow: /logout.php
Disallow: /performance/
Disallow: /z2_index.php
Disallow: /2802258/
Disallow: /funds/mfrank/showAnalyst_report.php
Disallow: /registration/ultimatetrader/
Disallow: /registration/zic/
Disallow: /registration/confidential/
Disallow: /registration/premium/
Disallow: /registration/blackboxtrader/
Disallow: /registration/etftrader/
Disallow: /registration/ftmtrader/
Disallow: /registration/homerun/
Disallow: /registration/incomeinvestor/
Disallow: /registration/insidertrader/
Disallow: /registration/internationaltrader/
Disallow: /registration/markettimer/
Disallow: /registration/momentumtrader/
Disallow: /registration/optionstrader/
Disallow: /registration/rta/
Disallow: /registration/stocksunder10/
Disallow: /registration/surprisetrader/
Disallow: /registration/top10/
Disallow: /registration/valueinvestor/
Disallow: /registration/order.php


Shaimaa Hafez September 19, 2018

I got blocked from a website I was scraping. Every time I try to open the site through any browser, it says 403 forbidden and the scraping code doesn’t work anymore.
What should I do to be able to access the website again?


    ScrapeHero September 19, 2018

    Changing your IP would be the best bet and our website has other ideas if that doesn’t work.


      Shaimaa Hafez September 19, 2018

      Thank you for replying.
      It's a Windows Server 2012 IP address, so how do I change it?


        Chad January 18, 2019

        it would mean changing your public IP address. A proxy would be one way…


        VBScript VBScript June 6, 2019

        Is scraping with repetitive keystrokes Ctrl+a, Ctrl+c (sendkeys commands in VBScript) detectable? I would easily analyze data from the clipboard!


Umer October 26, 2018

I am trying to scrape some information from the website http://www.similarweb.com through a Python script (tried through both the Python shell and an IDE), but I end up on a captcha page, while the same URL loads completely in Chrome or any other browser. Captcha message:

Pardon Our Interruption…

As you were browsing similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:

You’re a power user moving through this website with super-human speed.
You’ve disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.

After completing the CAPTCHA below, you will immediately regain access to similarweb.com.

Could you please let me know about the fix?


Jerrad March 14, 2019

What is a good speed to start out with when trying a new spider? For example, in clicking links or copying text? I'm not in a hurry, I just want my search to be complete.


    ScrapeHero March 15, 2019

    A delay of 10 – 30 seconds between clicks would not put much load on the website and the scraper would be “nice” to the website.


Prashant October 14, 2019

How to solve the Distil captcha for the purpose of scraping?


