How to prevent getting blacklisted while scraping

Web scraping helps turn the massive amount of largely unstructured text on the web into structured data.

It helps derive exponentially more useful insights from the original data.

Imagine a life without Google, because Google also uses web scraping/crawling to get almost all its data. Without Google and web scraping, we would never find all the wonderful sites and information and the Internet would not be as indispensable as it is today.

However, web scraping has to be performed responsibly so that it does not have a detrimental effect on the sites being scraped.

Web crawlers can retrieve data much faster and in greater depth than humans, so bad scraping practices can impact the performance of a site.

If a crawler performs multiple requests per second and downloads large files, an under-powered server would have a hard time keeping up with requests from multiple crawlers.

Since web crawlers, scrapers or spiders (the terms are used interchangeably) don’t drive human website traffic and can affect the performance of a site, some site administrators do not like spiders and try to block their access.

Most websites do not have anti-scraping mechanisms, since these can also affect the experience of genuine users, but some sites do block scraping because they do not believe in open data access.

In this article, we will go through how websites detect and block spiders and talk about techniques to overcome those barriers.

Detection

How can websites detect web scraping?


Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of these methods are enumerated below:

  1. Unusual traffic or a high download rate, especially from a single client or IP address, within a short time span.
  2. Repetitive tasks performed on the website – based on an assumption that a human user won’t perform the same repetitive tasks all the time.
  3. Detection through honeypots – these honeypots are usually links which aren’t visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarms are tripped.

How to address this detection: Spend some time upfront investigating the anti-scraping mechanisms used by a site and build your spider accordingly. This will provide a better outcome in the long run and increase the longevity and robustness of your work.

The easiest way to find out if a site doesn’t want to be scraped

Check the robots.txt file, which is usually in the root directory of a website, e.g. http://example.com/robots.txt.
If it contains lines like the ones shown below, the site does not want to be scraped.

User-agent: *
Disallow: /

These lines keep out well-behaved bots – the bots that respect robots.txt.

However, since most sites want to be on Google (arguably the largest scraper of websites globally ;-)) they do allow access to bots and spiders.
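If you just want a quick look at what a site’s robots.txt actually says, a minimal sketch like the one below will do. It assumes the Python requests library is installed; example.com is only a placeholder domain.

# Minimal sketch: fetch a site's robots.txt and print it.
# Assumes the requests library; example.com is a placeholder domain.
import requests

response = requests.get("http://example.com/robots.txt", timeout=10)
if response.status_code == 200:
    print(response.text)   # look for Disallow rules that apply to your bot
else:
    print("No robots.txt found (HTTP %s)" % response.status_code)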

What happens when you get banned?

There are two simple ways to ban a web spider/scraper – either by banning all access from a particular IP address, or by banning all access that uses a specific identifier. Most browsers and web spiders identify themselves on every request through a user-agent header; the Chrome browser, for example, sends Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36.

The banning can be temporary or permanent.

Temporary blocks can last minutes or hours.

Permanent bans go against the open nature of the Internet but some sites resort to this “scorch the internet” measure.

How do you find out if a website has blocked you?


If any of the following appear on the site that you are crawling, it is usually a sign that you have been blocked or banned.

  • CAPTCHA pages
  • Unusual content delivery delays
  • Frequent response with HTTP 404, 301 or 50x errors

Frequent appearance of these HTTP status codes is also an indication of blocking:

  • 301 Moved Permanently
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • 408 Request Timeout
  • 429 Too Many Requests
  • 503 Service Unavailable

A comprehensive list of HTTP return codes (successes and failures) can be found here. It will be worth your time to read through these codes and be familiar with them.
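As a rough illustration of acting on these signals, the sketch below (in Python, assuming the requests library; the URL, retry limit and delay values are placeholders you would tune) backs off with increasing delays whenever one of these status codes appears:

# Sketch: treat these status codes as block signals and back off.
# Assumes the requests library; the URL and retry limits are placeholders.
import time
import requests

BLOCK_SIGNALS = {401, 403, 404, 408, 429, 503}

def polite_fetch(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code not in BLOCK_SIGNALS:
            return response
        wait = (2 ** attempt) * 10   # exponential backoff: 10s, 20s, 40s
        print("Got HTTP %s, waiting %ss before retrying" % (response.status_code, wait))
        time.sleep(wait)
    return None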

Web Crawling Best Practices

Basic Rule: “Be Nice”

An overarching rule to keep in mind for any kind of web scraping is

BE GOOD AND FOLLOW A WEBSITE’S CRAWLING POLICIES

Here are some of the best practices you can follow to overcome the detection.

1. Make the crawling slower, do not slam the server, treat websites nicely

Use auto-throttling mechanisms that automatically adjust the crawling speed based on the load on both the spider and the website you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs, and revisit this setting periodically because the environment changes over time.

The faster you crawl, the worse it is for everyone.

Put some random programmatic sleep calls in between requests, add some delays after crawling a small number of pages and choose the lowest number of concurrent requests possible.

These techniques make the spider look like a human and are generally good for everyone.
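A minimal sketch of this idea in Python is shown below. It assumes the requests library, and the URL list, delay ranges and page counts are arbitrary placeholders you would tune for the site:

# Sketch: random pauses between requests and a longer break every few pages.
# Assumes the requests library; URLs and delay values are placeholders.
import random
import time
import requests

urls = ["http://example.com/page/%d" % i for i in range(1, 51)]

for count, url in enumerate(urls, start=1):
    requests.get(url, timeout=10)
    time.sleep(random.uniform(2, 6))        # random delay between requests
    if count % 10 == 0:
        time.sleep(random.uniform(30, 60))  # longer break after every 10 pages

If you use Scrapy, its AutoThrottle extension (AUTOTHROTTLE_ENABLED = True in settings.py) adjusts delays automatically based on server response times.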

2. Disguise your requests by rotating IPs and Proxy Services

A server can easily detect a bot that makes all of its requests from a single IP address, so using different IP addresses for your requests makes detection harder. Create a pool of IPs and pick a random one for each request.
Several methods can be used to change your outgoing IP: services such as VPNs, shared proxies and TOR can help, and various commercial providers also offer automatic IP rotation.

This technique also distributes the load across various exit points.
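Here is a minimal sketch of proxy rotation with the Python requests library; the proxy addresses are placeholders for whatever VPN exits, shared proxies or commercial rotation endpoints you actually have access to:

# Sketch: pick a random proxy from a pool for each request.
# Assumes the requests library; the proxy addresses are placeholders.
import random
import requests

PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch_via_random_proxy(url):
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)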

Some websites block entire well-known IP ranges, such as the massive pool of Amazon AWS addresses, to keep out crawlers hosted there. Such protectionist policies are obviously counter to the open nature of the Internet.

3. User Agent Rotation and Spoofing

Every request made from a web browser contains a user-agent header, and using the same user-agent for every request quickly gives a bot away.

User Agent rotation and spoofing is the best solution for this.

Spoof the User Agent by creating a list of user agents and picking a random one for each request.
Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the default user-agent (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1 if you want to have some fun! (http://www.google.com/bot.html)

You can check your user-agent string here:
http://www.whatsmyuseragent.com/

A user-agent string listing to get you started can be found here:
http://www.useragentstring.com/pages/useragentstring.php
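A minimal sketch of user-agent rotation in Python (assuming the requests library; the strings below are only examples and should be refreshed from a current listing like the one above) could look like this:

# Sketch: rotate the User-Agent header across requests.
# Assumes the requests library; the strings below are example browser strings.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
]

def fetch_with_random_ua(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)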

4. Beware of Honey Pot Traps

Some websites install honeypots to detect web spiders. These honeypots are usually links that a normal user can’t see but a spider can.
When following links, always check that the link has proper visibility and no nofollow tag. Some honeypot links will have the CSS style display:none or will be color disguised to blend in with the page’s background color.

This detection is obviously not easy and requires a significant amount of programming work to accomplish properly; as a result, this technique is not widely used on either side – the server side or the bot/scraper side.
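If you do want to guard against honeypots, one simplistic approach (sketched below with BeautifulSoup; note that links hidden through external stylesheets or scripts will not be caught by inline checks like these) is to filter out obviously hidden or nofollow links before following them:

# Sketch: skip links that look like honeypots (hidden via inline CSS or rel="nofollow").
# Assumes BeautifulSoup (bs4); hidden-by-stylesheet links are not covered here.
from bs4 import BeautifulSoup

def visible_links(html):
    links = []
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        rel = a.get("rel") or []
        if "display:none" in style or "visibility:hidden" in style:
            continue  # hidden link, likely a trap
        if "nofollow" in rel:
            continue  # the site asked crawlers not to follow this
        links.append(a["href"])
    return links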

5. Do not follow the same crawling pattern

Only robots follow exactly the same crawling pattern; unless programmed otherwise, bots follow logic that is very specific and repetitive.

Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions. Humans generally do not perform the same task repetitively.

Incorporate some random clicks on the page, mouse movements and random waits to make the spider look more like a human.
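As a small sketch of what “random actions” can mean in practice, the snippet below uses Selenium with a Chrome driver (an assumption; the URL and the offset/delay ranges are placeholders) to add a random mouse movement, a random scroll and a random pause:

# Sketch: randomized, human-like actions with Selenium.
# Assumes Selenium and a Chrome driver are installed; URL and ranges are placeholders.
import random
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("http://example.com")

# Random mouse movement, scroll and pause between page loads
ActionChains(driver).move_by_offset(random.randint(5, 100), random.randint(5, 100)).perform()
driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(200, 800))
time.sleep(random.uniform(2, 5))

driver.quit()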

6. Always respect the robots.txt file

Web spiders should ideally follow rules in a robots.txt file for the website being scraped. The robots.txt file specifies rules for good behavior, such as how frequently bots are allowed to request pages, what pages are allowed to be scraped and which areas are off limits for scraping.

However, some websites are extremely liberal about letting Google scrape them while not allowing any other bots access. This goes against the open nature of the Internet, but website owners are well within their rights to resort to such behavior.
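Python’s standard library includes a robots.txt parser, so honoring the file takes only a few lines. In this sketch, the site URL and the crawler name are placeholders:

# Sketch: check robots.txt before crawling using Python's built-in parser.
# The site URL and crawler name are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()

print(rp.can_fetch("MyCrawler/1.0", "http://example.com/some/page"))  # True if allowed
print(rp.crawl_delay("MyCrawler/1.0"))  # Crawl-delay, if the site specifies one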

[Update July 11 2017: Added new item below]

7. Use a headless browser

 

Some websites present different content based on the type of browser that is accessing the site. Google search results are a perfect example of such behavior. If the browser (identified by the user agent) has advanced capabilities, the website may present “richer” content – something more dynamic and styled, with a heavy reliance on Javascript and CSS. Some websites also use Javascript-based methods to detect the real capabilities of the browser, regardless of the user agent it advertises. For example, if a bot identifies itself as Chrome 50 (a browser that definitely has Javascript support), the website can run a small script to check whether the client can actually perform the expected client-side computation.

The solution is to use a headless browser (a browser that isn’t really visual on a desktop but is fully functional otherwise). Selenium, PhantomJS, and the latest entrant – Google’s own headless Chrome – are some options to explore further.
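As a small sketch of the headless Chrome option via Selenium (assuming Selenium and chromedriver are installed; the URL is a placeholder), this fetches a page and prints the HTML after JavaScript has run:

# Sketch: fetch a JavaScript-heavy page with headless Chrome via Selenium.
# Assumes Selenium and chromedriver are installed; the URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("http://example.com")
print(driver.page_source[:500])      # rendered HTML, after JavaScript has run
driver.quit()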

Keep in mind that headless browser automation tools like Selenium and Puppeteer use a lot of resources (RAM, CPU, bandwidth, etc.) in comparison to script-based approaches. You can check out these web scraping tutorials on using headless browsers:

Summary

All these ideas above provide a starting point for you to build your own solutions or refine your existing solution. If you have any ideas or suggestions, please join the discussion in the comments section.

Thank you for reading.

Or you can ignore everything above, and just get the data delivered to you as a service. Interested?

Turn websites into meaningful and structured data through our web data extraction service

31 comments on “How to prevent getting blacklisted while scraping”

netsi1964

Could scraping make a website try to blacklist your IP in some “global blacklist” of IP addresses?
– My IP has been reported as a “bad IP” on Facebook… I am not sure why…
I made a simple Node.js app which scraped a site: “https://domaintyper.com”.

/Sten

    scrapehero

    We don’t believe there is a global blacklist like an email RBL.

      alex

      atomicorp run a global rbl for their apache modsecurity rules customers.

        scrapehero

        Thanks Alex,
        That is good to know – we assume it is just a private list maintained by this company, not a global and public list?

Thomas Afrim

Hi!

Do you have any idea how this website works? Is it scraping eBay and Amazon content?
http://shopotam.ru/catalog/Consumer_Electronics

Can we do the same with your tool (million products, refresh every 5 seconds)?

Tommy

    scrapehero

    Hi Tommy – blatantly scraping sites with no value add isn’t a recipe for success.
    We don’t have a tool but rather a service that does thousands of pages per second – we haven’t tried millions yet !

    Pighead Wang

    This site works via an API, not website scraping.

simbian92

This is a very nice guide, big thanks.

Mr. Jiggs

Hi, in case you are scraping a website that requires authentication (login and password), do proxies become useless?

What is the best technique for crawling websites that require authentication without being banned?

Should one use multiple user accounts?

    scrapehero

    Hello Mr Jiggs,
    Let’s try our best to answer your questions in order.
    In case you are scraping a website that requires authentication (login and password), do proxies become useless?
    It depends on what kind of detection mechanism is used by the site. Authentication based sites are easy to block – disable the account and you are done.
    Proxies serve a different purpose not directly related to preventing authentication based blocks.

    What is the best technique for crawling websites that require authentication without being banned?
    Speed is probably your best technique – if you can mimic a real human that would be your best approach

    Should one use multiple user accounts?
    This depends on the site, but banning accounts is fairly easy for sites, so multiple accounts may not be an ultimate solution.

Jinzhen Wang

It looks like I got banned by a website since I tried to crawl it without limit of speed. I have to click the CAPTCHA every time I visit the page. How can I make my crawl work again?

BTW, Google Chrome got banned but Safari still works.

    scrapehero

    You have a few options:
    1. Use a proxy server for this site – free and paid options are available
    2. Renew your dynamic IP if you have one – disconnect your router from the Internet and reconnect after 5 minutes or so.
    3. If you have a static IP, you will need to ask your ISP to get a new IP

    Good luck !

Jinzhen Wang

Thanks! I tried to connect to vpn but it does not seem to work. Will this affect my crawling?

    scrapehero

    If it is just a browser issue, you can also try clearing all cookies and the cache and try.
    Blocking will obviously affect your crawling – unless you mind a CAPTCHA in every page 😉

Narciso

I really like this post! I was looking for a post like this; I mean, I am new to the scraper world and I love it. But I have a question… Is it possible to scrape sites like https://www.oportunidadbancaria.com/ ? Because I am using Hub Spot to scrape, but the URL and the order of the products change when I search or use filters. Is it possible to do something?

    ScrapeHero

    Hi Narcisco,
    Glad you liked the post.
    We are not aware of Hub Spot as a scraper so we are unable to comment on its capabilities.
    However, given time and money most sites are scrapeable.
    Thanks

      Narciso Jimenez

      Thanks for the answer! I only wanted to know if it was possible!

Maria L.

I am not trying to scrape anything and I am not “crawling”. I don’t even know what that means! My problem is this — Suddenly, this morning I cannot connect to Zillow using either Chrome or Internet Explorer. Chrome gives me an error msg. saying “request blocked; crawler detected”. On IE it says the error is (HTTP 403 Forbidden). I have been using Zillow extensively over the past year, b/c I am getting ready to buy a house and I have looked at a lot of places on Zillow, and I have printed a lot of material, filled in some “inter-active” info. regarding mortgage costs, etc. Is this a problem which will go away later today? Help! I have to go now but will check back for an answer.

    ScrapeHero

    Maria – sorry to hear about your story. It just highlights the overzealous tactics used by Zillow etc that end up blocking regular users.
    If you have a dynamic IP address just shut down and restart your router for a few minutes and hopefully that will fix the block.
    And then cancel your broadband and get a dialup connection so you don’t end up searching for a house at broadband speeds – just kidding 😉

      Maria L.

      Thank you so much for your speedy reply, ScrapeHero. My ISP was VerizonFIOS, which was sold to Frontier. I have a Verizon FIOs router. I will try shutting it all down later and I hope this will work. I am a 65 yr. old “senior” lady who is not terribly tech savvy. I have a desk top computer, running windows 10, but I run it as close as I can to “Windows XP” mode. I don’t use a tablet or a smart-phone (Yet!) I will let you know if shutting down the router and rebooting the whole system works. I hope it does as my home search is very impeded by lack of access to zillow! Thanks again!

Keith S

Just a regular guy (not a computer scraping guy). Stumbled on this page from Google. I have the same problem – Zillow just blocked me and shows me some numbers or no pages at times. I am looking for a rental and am shocked they could block me. Who do they not block?
I know the experts can get by their blocks, so all the innocent people like me are caught in their silly blocks

    ScrapeHero

    Keith – sorry to hear that you too are having issues.
    Please check the comment above and turn off the router for a few minutes.
    It should unblock you.
    Thanks

Maria L.

I have some good news to report which may help you, too, Keith S. I was restored to Zillow-access after I completely shut down my computer and the FIO’s router. I turned off the power source and let it sit for 4 or 5 hours, while I took care of other non-computer-related chores. Then, presto, i was able to connect with Zillow, again and have had no problems since. I really don’t know if the problem was fixed by doing this, or if it was fixed by Zillow. I did notice that some of the daily e-mails I now receive from Zillow have a different type or subject line and “format” — so maybe zillow was working on “changes” in its website and fixed some of their “bugs” or “bots” or what-not! I got no results after shutting down my computer and re-booting both my computer and router, several times. The results came after I completely cut the power source for the router and computer, and let it be off for several hours. But, like I said, I’m not really sure if this is what restored my access to Zillow, or if Zillow did something to “fix things” . . . Good Luck! Thank you, again, “Scrape-Hero” for having this website and providing help to the public! I sincerely appreciate that! Ms. ML

    ScrapeHero

    Maria – the shutting off fixed exactly what we believed to be the problem. Your IP was blocked and when you turn the router off for a long time you are almost guaranteed to get a new IP.
    The Zillow changes are just coincidental and most likely had nothing to do with your unblocking.

Naznin

Scraped, and now it is showing as forbidden. What to do next?

    Shabbir

    Wait for a day and check if you are still blocked. If you are using a proxy, change the proxy in the request params and retry. If it doesn’t work, you have to switch to a different IP, flush your DNS and renew your IP via DHCP. If static, then sorry :-)

    thedude that doesntabide

    Don’t listen to numb nuts down there, change your user agent. If you are spawning a lot of requests, use a VPN or proxy every other request… but change the user agent often as that will be the first thing marked.

Alex

Hey,

First off, great article! A lot of good information here. Secondly, I was hoping you might be able to help me out. I’ve created a spider using Guzzle (PHP) and I am using a spoofed header (only a fake user agent), but it only works 60% of the time. The other 40% I get a 503 error. The weird thing is that I noticed when I set the User-Agent to null, it passes 100% of the time. I would like to use fake user-agents, because I know I’ll eventually get blocked.

Do you have any idea why this might be?

Thanks again

VotersofNY

New at this. Can I just do a view source and then save the source and use a php script to extract the information I want from it?

    Martín

    You can, as long as you are not doing it 1,000 times per minute with an automated software/script.
