How to Scrape Websites Without Getting Blocked

Web scraping has to be performed responsibly so that it does not have a detrimental effect on the sites being scraped. Web crawlers can retrieve data much faster and in greater depth than humans, so bad scraping practices can affect a site's performance. While many websites have no anti-scraping mechanisms at all, some sites that do not believe in open data access deploy measures that can get your web scraping blocked.

If a crawler performs multiple requests per second and downloads large files, an under-powered server would have a hard time keeping up with requests from multiple crawlers. Since web crawlers, scrapers, or spiders (the terms are used interchangeably here) do not drive human website traffic and can affect the performance of the site, some site administrators dislike spiders and try to block their access.

In this article, we will talk about the best web scraping practices to follow to scrape websites without getting blocked by the anti-scraping or bot detection tools.


Web Scraping best practices to follow to scrape without getting blocked

  1. Respect Robots.txt
  2. Make the crawling slower, do not slam the server, treat websites nicely
  3. Do not follow the same crawling pattern
  4. Make requests through Proxies and rotate them as needed
  5. Rotate User Agents and corresponding HTTP Request Headers between requests
  6. Use a headless browser like Puppeteer, Selenium or Playwright
  7. Beware of Honey Pot Traps
  8. Check if Website is Changing Layouts
  9. Avoid scraping data behind a login
  10. Use Captcha Solving Services
  11. How can websites detect and block web scraping?
  12. How do you find out if a website has blocked or banned you?

Basic Rule: “Be Nice”

An overarching rule to keep in mind for any kind of web scraping is
BE GOOD AND FOLLOW A WEBSITE’S CRAWLING POLICIES

Here are the web scraping best practices you can follow to avoid getting web scraping blocked:

Respect Robots.txt

Web spiders should ideally follow the robots.txt file of a website while scraping. It contains specific rules for good behavior, such as how frequently you can scrape, which pages allow scraping, and which ones you can't. Some websites allow only Google to scrape them while disallowing every other crawler. This goes against the open nature of the Internet and may not seem fair, but the owners of the website are within their rights to do so.

You can find the robots.txt file in the root directory of a website – for example, http://example.com/robots.txt.

If it contains lines like the ones shown below, the site does not want to be scraped.

User-agent: *
Disallow: /

However, since most sites want to be on Google, arguably the largest scraper of websites globally, they allow access to bots and spiders. 

What if you need some data that is forbidden by robots.txt? You could still scrape it, but be aware that most anti-scraping tools block web scraping when you scrape pages that robots.txt disallows.
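Before you crawl, you can check robots.txt programmatically. Here is a minimal sketch using Python's built-in urllib.robotparser; the example.com URLs are placeholders.

from urllib import robotparser

# Point the parser at the site's robots.txt (example.com is a placeholder)
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Check whether a generic crawler ("*") may fetch a given page
if rp.can_fetch("*", "http://example.com/some-page.html"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")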

What do these tools look for? They try to answer one question: is this client a bot or a real user? And they answer it by looking for a few signals that real users exhibit and bots don't. Humans are random and unpredictable; bots are not.

Here are a few easy giveaways that you are a bot/scraper/crawler –

  • Scraping too fast and too many pages, faster than a human ever can
  • Following the same pattern while crawling. For example, going through all pages of search results and visiting each result only after grabbing links to all of them. No human browses like that.
  • Too many requests from the same IP address in a very short time
  • Not identifying yourself as a popular browser (you do this by sending a ‘User-Agent’ header)
  • Using a user agent string of a very old browser

The points below should get you past most of the basic to intermediate anti-scraping mechanisms used by websites to block web scraping.

Make the crawling slower, do not slam the server, treat websites nicely

Web scraping bots fetch data very fast, but it is easy for a site to detect your scraper, as humans cannot browse that fast. The faster you crawl, the worse it is for everyone. If a website gets more requests than it can handle, it might become unresponsive.

Make your spider look real by mimicking human actions. Put some random programmatic sleep calls between requests, add delays after crawling a small number of pages, and choose the lowest number of concurrent requests possible. Ideally, put a delay of 10-20 seconds between clicks so you do not put much load on the website and treat it nicely.

Use auto throttling mechanisms which will automatically throttle the crawling speed based on the load on both the spider and the website that you are crawling. Adjust the spider to an optimum crawling speed after a few trial runs. Do this periodically because the environment does change over time.
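To make the delay idea concrete, here is a minimal sketch using the requests library with a random 10-20 second pause between pages; the listing URLs on our test site are placeholders.

import random
import time

import requests

# Placeholder listing pages on our test site
urls = [
    "https://scrapeme.live/shop/page/1/",
    "https://scrapeme.live/shop/page/2/",
    "https://scrapeme.live/shop/page/3/",
]

for url in urls:
    response = requests.get(url)
    # ... parse response.text here ...
    # Random 10-20 second pause so we do not slam the server
    time.sleep(random.uniform(10, 20))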

Do not follow the same crawling pattern

Humans browse a site with random actions and generally do not perform repetitive tasks. Web scraping bots, unless programmed otherwise, tend to follow the same crawling pattern. Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions, which can lead to web scraping getting blocked.

Incorporate some random clicks on the page, mouse movements and random actions that will make a spider look like a human.
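At the request level, one easy way to break a fixed pattern is to visit pages in a shuffled order instead of walking them strictly in sequence. A minimal sketch (the URL list is a placeholder):

import random

# Placeholder list of listing pages
page_urls = [f"https://scrapeme.live/shop/page/{n}/" for n in range(1, 21)]

# Visit pages in a random order rather than strictly 1, 2, 3, ...
random.shuffle(page_urls)
for url in page_urls:
    print("would fetch:", url)  # fetch and parse here, with random delays as above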

Make requests through Proxies and rotate them as needed

When scraping, your IP address is visible to the target site, so the site can tell what you are doing and whether you are collecting data. It can also track behavior patterns tied to that IP, even if you are a first-time visitor.

Too many requests coming from the same IP will get you blocked, which is why you need to use multiple addresses. When you send requests through a proxy machine, the target website does not see your original IP, which makes detection harder.

Create a pool of IPs that you can use and pick a random one for each request. Along with this, spread your requests across the IPs so that no single address sends too many.

There are several methods that can change your outgoing IP.

  • TOR
  • VPNs
  • Free Proxies
  • Shared Proxies – the least expensive proxies, shared by many users. The chances of getting blocked are high.
  • Private Proxies – usually used only by you, with a lower chance of getting blocked if you keep the request frequency low.
  • Data Center Proxies – if you need a large number of IP addresses, faster proxies, and larger pools of IPs. They are cheaper than residential proxies but are easier to detect.
  • Residential Proxies – if you are making a huge number of requests to websites that block actively. These are very expensive (and can be slower, as they are real devices). Try everything else before getting a residential proxy.

In addition, various commercial providers offer automatic IP rotation as a service. A lot of companies now provide residential IPs to make scraping even easier – but most are expensive.
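Here is a minimal sketch of rotating proxies with the requests library; the proxy addresses below are placeholders from a documentation IP range, and real proxy pools usually require authentication.

import random

import requests

# Placeholder proxy pool - replace with your own proxy endpoints
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch_via_random_proxy(url):
    proxy = random.choice(PROXY_POOL)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch_via_random_proxy("https://scrapeme.live/shop/")
print(response.status_code)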

 

Rotate User Agents and corresponding HTTP Request Headers between requests

A User-Agent is a string that tells the server which web browser is being used. If the user agent is not set, many websites won't let you view their content. Every request made from a web browser contains a User-Agent header, and using the same user agent for every request leads to the detection of a bot. You can see your own User-Agent by typing ‘what is my user agent’ into Google's search bar. The only way to make your User-Agent appear more real and bypass detection is to fake it. Most web scraping libraries do not send a browser-like User-Agent by default, so you need to add one yourself.

You could even pretend to be the Google Bot: Googlebot/2.1 if you want to have some fun! (http://www.google.com/bot.html)

Now, sending a realistic User-Agent alone will get you past most basic bot detection scripts and tools. If you find your bots getting blocked even after putting in a recent User-Agent string, you should add some more request headers.

Most browsers send more headers to the websites than just the User-Agent. For example, here is a set of headers a browser sent to Scrapeme.live (Our Web Scraping Test Site). It would be ideal to send these common request headers too.

The most basic ones are:

  • User-Agent
  • Accept
  • Accept-Language
  • Referer
  • DNT
  • Upgrade-Insecure-Requests
  • Cache-Control

Do not send cookies unless your scraper depends on Cookies for functionality.

You can find the right values for these by inspecting your web traffic using Chrome Developer Tools, or a tool like MitmProxy or Wireshark. You can also copy the request as a curl command from these tools. For example:

curl 'https://scrapeme.live/shop/Ivysaur/' \
  -H 'authority: scrapeme.live' \
  -H 'dnt: 1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-site: none' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  --compressed

You can get this converted to any language using a tool like https://curl.trillworks.com

Here is how this was converted to Python:

import requests

headers = {
    'authority': 'scrapeme.live',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get('https://scrapeme.live/shop/Ivysaur/', headers=headers)

You can create similar header combinations for multiple browsers and start rotating those headers between each request to reduce the chances of getting your web scraping blocked.
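As a rough sketch of that rotation, keep a few complete header sets and pick one per request. The Chrome headers come from the example above; the Firefox set is an illustrative assumption. Never mix values from different sets in one request, since an inconsistent combination is itself a bot signal.

import random

import requests

HEADER_SETS = [
    {   # Chrome on macOS - taken from the example above
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'upgrade-insecure-requests': '1',
        'dnt': '1',
    },
    {   # Firefox on Windows - illustrative values
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-language': 'en-US,en;q=0.5',
        'upgrade-insecure-requests': '1',
        'dnt': '1',
    },
]

def fetch(url):
    # Use one complete, self-consistent header set per request
    headers = random.choice(HEADER_SETS)
    return requests.get(url, headers=headers)

response = fetch('https://scrapeme.live/shop/Ivysaur/')
print(response.status_code)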

Use a headless browser like Puppeteer, Selenium or Playwright

If none of the methods above works, the website must be checking if you are a REAL browser.

The simplest check is whether the client (web browser) can render a block of JavaScript. If it can't, the detector pretty much flags the visitor as a bot. While it is possible to disable JavaScript in a browser, most of the Internet would be unusable in that scenario, so most real browsers have JavaScript enabled.

Once this happens, a real browser is necessary in most cases to scrape the data. There are libraries to automatically control browsers such as

  1. Selenium
  2. Puppeteer and Pyppeteer
  3. Playwright
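For instance, a minimal Selenium sketch might look like this (assuming Selenium 4+ with a local Chrome install; Selenium Manager fetches the matching driver, and the --headless=new flag needs a recent Chrome):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://scrapeme.live/shop/")
    # The page is rendered by a real browser engine, JavaScript and all
    print(driver.title)
    html = driver.page_source  # hand this off to your usual parser
finally:
    driver.quit()

Keep in mind that a stock headless browser is itself fingerprintable, which is exactly what the detection methods and workarounds discussed below deal with.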

Anti-scraping tools are smart and getting smarter daily, as bots feed a lot of data to the AI models used to detect them. Most advanced bot mitigation services use browser-side fingerprinting (client-side bot detection) with far more advanced methods than just checking whether you can execute JavaScript.

Bot detection tools look for any flags that can tell them that the browser is being controlled through an automation library.

  1. Presence of bot-specific signatures
  2. Support for nonstandard browser features
  3. Presence of common automation tools such as Selenium, Puppeteer, Playwright, etc.
  4. Absence of human-generated events such as randomized mouse movements, clicks, scrolls, tab changes, etc.

All this information is combined to construct a unique client-side fingerprint that can tag one as bot or human.

Here are a few workarounds and tools that could keep your headless browser-based scrapers from getting banned.

  1. Puppeteer Extra – Puppeteer Stealth Plugin
  2. Patching Selenium/PhantomJS – Stack Overflow answer on patching Selenium with ChromeDriver
  3. Fingerprint Rotation – Microsoft paper on fingerprint rotation

But as you might have guessed, just like bots, bot detection companies are getting smarter. They keep improving their AI models and look for variables, actions, events, etc. that can still give away the presence of an automation library and lead to web scraping getting blocked.

Beware of Honey Pot Traps

Honeypots are systems set up to lure attackers and detect attempts to gain information, usually an application that imitates the behavior of a real system. Some websites install honeypot links, which are invisible to normal users but can be seen and followed by web scrapers.

When following links, always check that the link is actually visible and does not carry a nofollow attribute. Some honeypot links used to detect spiders have the CSS style display:none or are colored to blend in with the page's background.

This detection is obviously not easy and requires a significant amount of programming work to accomplish properly; as a result, the technique is not widely used on either side – the server side or the scraper side.
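On the scraper side, a rough precaution is to skip links that carry an inline display:none or visibility:hidden style or a nofollow attribute. Here is a sketch using BeautifulSoup; note that it only catches inline styles, and links hidden through CSS classes would require rendering the page.

from bs4 import BeautifulSoup

def visible_links(html):
    # Return hrefs that are not obviously hidden or marked nofollow
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue  # likely a honeypot link
        if "nofollow" in (a.get("rel") or []):
            continue
        links.append(a["href"])
    return links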

Check if Website is Changing Layouts

Some websites make things tricky for scrapers by serving slightly different layouts to different requests.

For example, pages 1-20 of a site may use one layout while the rest of the pages use another. To handle this, check whether your XPaths or CSS selectors are actually returning data. If they are not, check how the layout differs and add a condition in your code to scrape those pages differently.
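A simple way to handle this is to try several selectors for the same field and flag pages where none of them match, so a layout change surfaces as an error rather than silently missing data. A sketch with BeautifulSoup; the selector names are hypothetical.

from bs4 import BeautifulSoup

# Hypothetical selectors for the same field across known layout variants
PRICE_SELECTORS = ("span.price", "div.product-price", "p.amount")

def extract_price(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    # None of the known layouts matched - log this page for manual inspection
    return None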

Avoid scraping data behind a login

Logging in is basically getting permission to access a set of web pages. Some websites, such as Indeed, do not grant this permission freely.

If a page is protected by a login, the scraper has to send some credentials or cookies along with each request to view the page. This makes it easy for the target website to see that the requests are coming from the same account. They could revoke your credentials or block your account, which in turn gets your web scraping blocked.

It is generally preferable to avoid scraping websites that sit behind a login, as you will get blocked easily. One thing you can do is imitate a human browser whenever authentication is required, so that you can still get the target data you need.

Use Captcha Solving Services

Many websites use anti-scraping measures. If you are scraping a website on a large scale, the website will eventually block you, and you will start seeing captcha pages instead of web pages. There are services to get past these restrictions, such as 2Captcha or Anticaptcha.

If you need to scrape websites that use CAPTCHAs, it is better to resort to these captcha-solving services. They are relatively cheap, which is useful when performing large-scale scrapes.

How can websites detect and block web scraping?


Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of these methods are enumerated below:

  1. Unusual traffic/high download rate especially from a single client/or IP address within a short time span.
  2. Repetitive tasks performed on the website in the same browsing pattern – based on an assumption that a human user won’t perform the same repetitive tasks all the time.
  3. Checking if you are a real browser – a simple check is to try to execute JavaScript. Smarter tools can go a lot further and check your graphics card and CPU 😉 to make sure the request is coming from a real browser.
  4. Detection through honeypots – these honeypots are usually links that are not visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarms are tripped.

Learn more about how websites detect and block web scrapers

How do Websites detect and block bots using Bot Mitigation Tools

How do you address this detection and avoid getting your web scraping blocked?

Spend some time upfront and investigate the anti-scraping mechanisms used by a site and build the spider accordingly. It will provide a better outcome in the long run and increase the longevity and robustness of your work.

How do you find out if a website has blocked or banned you?


If any of the following signs appear on the site that you are crawling, it is usually a sign of being blocked or banned.

  • CAPTCHA pages
  • Unusual content delivery delays
  • Frequent response with HTTP 404, 301 or 50x errors

Frequent appearance of these HTTP status codes is also an indication of blocking:

  • 301 Moved Permanently
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • 408 Request Timeout
  • 429 Too Many Requests
  • 503 Service Unavailable
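In code, a scraper can watch for these status codes and back off instead of hammering the site. A minimal sketch with requests (the set of codes treated as "possibly blocked" is a judgment call based on the list above; requests follows redirects automatically, so 301s rarely surface directly):

import time

import requests

# Codes from the list above that often indicate blocking or rate limiting
BLOCK_CODES = {401, 403, 408, 429, 503}

def fetch_with_backoff(url, headers=None, max_retries=3):
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code not in BLOCK_CODES:
            return response
        # Possibly blocked - wait longer after each failed attempt
        time.sleep((attempt + 1) * 60)
    return response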

Here is what Amazon.com tells you when you are blocked.

To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at <link>  or our Product Advertising API at  <link> for advertising use cases.
Sorry! Something went wrong!
All of this comes with pictures of Amazon's cute dogs.

You may also see responses or messages like these from websites using some popular anti-scraping tools.

We want to make sure it is actually you that we are dealing with and not a robot

Please check the box below to access the site

<reCaptcha>

Why is this verification required? Something about the behavior of the browser has caught our attention.

There are various explanations for this:

  • you are browsing and clicking at a speed much faster than expected of a human being
  • something is preventing Javascript from working on your computer
  • there is a robot on the same network (IP address) as you

Having problems accessing the site? Contact Support

Authenticate your robot

or

Please verify you are a human 

<Captcha> 

Access to this page has been denied because we believe you are using automation tools to browse the website

This may happen as a result of the following: 

  • Javascript is disabled or blocked by an extension (ad blockers for example) 
  • Your browser does not support cookies 

Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking them from loading 

or

Pardon our interruption

As you were browsing <website> something about your browser made us think you were a bot. There are a few reasons this might happen

  • You’re a power user moving through this website with super-human speed
  • You’ve disabled JavaScript in your web browser 
  • A third-party browser plugin, such as Ghostery or NoScript, is preventing Javascript from running. Additional information is available in this support article.

After completing the CAPTCHA below, you will immediately regain access to <website>

or

Error 1005 Ray ID: <hash> • <time>
Access denied

What happened?

The owner of this website (<website>) has banned the autonomous system number (ASN) your IP address is in (<number>) from accessing this website.

A comprehensive list of HTTP return codes (successes and failures) can be found here. It will be worth your time to read through these codes and be familiar with them.

Summary

All these ideas above provide a starting point for you to build your own solutions or refine your existing solution. If you have any ideas or suggestions, please join the discussion in the comments section.

Thank you for reading.

Or you can ignore everything above and just get the data delivered to you as a service. Interested?

Turn the Internet into meaningful, structured and usable data





Responses

netsi1964 June 27, 2015

Could scraping make a website try to blacklist your IP in some “global blacklist” of IP adresses?
– My IP have been reported as a “bad IP” on Facebook… I am not sure why…
I made a simple Node.js app which scraped a site: “https://domaintyper.com”.

/Sten


    scrapehero June 30, 2015

We don’t believe there is a global blacklist like an email RBL.


      alex February 20, 2016

      atomicorp run a global rbl for their apache modsecurity rules customers.


        scrapehero February 23, 2016

        Thanks Alex,
        That is good to know – assume it is just a private list maintained by this company not a global and public list?


Thomas Afrim July 7, 2015

Hi!

Do you have any ideas how this website work? Is this website scraping ebay and amazon content?
http://shopotam.ru/catalog/Consumer_Electronics

Can we do the same with your tool (million products, refresh every 5 seconds)?

Tommy


    scrapehero July 7, 2015

    Hi Tommy – blatantly scraping sites with no value add isn’t a recipe for success.
    We don’t have a tool but rather a service that does thousands of pages per second – we haven’t tried millions yet !


    Pighead Wang November 19, 2016

    This site is worked by API, not website scraping.


simbian92 March 20, 2016

This is a very nice guide, big thanks.


Mr. Jiggs May 25, 2016

Hi, in case you are scraping a website that requires authentication (login and password), do proxies become useless?

What is the best technique for crawling websites that require authentication without being banned?

Should one use multiple user accounts?


    scrapehero May 25, 2016

    Hello Mr Jiggs,
    Let’s try our best to answer your questions in order.
    In case you are scraping a website that requires authentication (login and password), do proxies become useless?
    It depends on what kind of detection mechanism is used by the site. Authentication based sites are easy to block – disable the account and you are done.
    Proxies serve a different purpose not directly related to preventing authentication based blocks.

    What is the best technique for crawling websites that require authentication without being banned?
    Speed is probably your best technique – if you can mimic a real human that would be your best approach

    Should one use multiple user accounts?
    This depends on the site, but banning accounts is fairly easy for sites, so multiple accounts may not be an ultimate solution.


Jinzhen Wang July 9, 2016

It looks like I got banned by a website since I tried to crawl it without limit of speed. I have to click the CAPTCHA every time I visit the page. How can I make my crawl work again?

BTW, goole chrome got banned but safari still works.


    scrapehero July 9, 2016

    You have a few options:
    1. Use a proxy server for this site – free and paid options are available
    2. Renew your dynamic IP if you have one – disconnect your router from the Internet and reconnect after 5 minutes or so.
    3. If you have a static IP, you will need to ask your ISP to get a new IP

    Good luck !


Jinzhen Wang July 9, 2016

Thanks! I tried to connect to vpn but it does not seem to work. Will this affect my crawling?


    scrapehero July 9, 2016

    If it is just a browser issue, you can also try clearing all cookies and the cache and try.
    Blocking will obviously affect your crawling – unless you mind a CAPTCHA in every page 😉


Narciso August 10, 2016

I really like this post! I was looking for post like this, i means, i am new in the scraper world and i love it. But i have a question….Is it possible scrap webs like https://www.oportunidadbancaria.com/ . Because i am using Hub Spot for Scrap, but the URL and the order of the products is changing when i search or i use filters. Is possible do something?


    ScrapeHero August 10, 2016

    Hi Narcisco,
    Glad you liked the post.
    We are not aware of Hub Spot as scraper so are unable to comment on its capabilities.
    However, given time and money most sites are scrapeable.
    Thanks


      Narciso Jimenez August 10, 2016

      Thanks for the answer! I only wanted to know if was posible!


Maria L. October 7, 2016

I am not trying to scrape anything and I am not “crawling”. I don’t even know what that means! My problem is this — Suddenly, this morning I cannot connect to Zillow using either Chrome or Internet Explorer. Chrome gives me error msg. saying “request blocked; crawler detected”. On IE it says the error is (HTTP 403 Forbiddent) I have been using zillow extensively over the past year, b/c I am getting ready to buy a house and I have looked at a lot of places on zillow, and I have printed a lot of material, filled in some “inter-active” info. regarding mortgage costs, etc. Is this a problem which will go away later today! Help! I have to go now but will check back for an answer.


    ScrapeHero October 7, 2016

    Maria – sorry to hear about your story. It just highlights the overzealous tactics used by Zillow etc that end up blocking regular users.
    If you have a dynamic IP address just shut down and restart your router for a few minutes and hopefully that will fix the block.
    And then cancel your broadband and get a dialup connection so you don’t end up searching for a house at broadband speeds – just kidding ?


      Maria L. October 8, 2016

      Thank you so much for your speedy reply, ScrapeHero. My ISP was VerizonFIOS, which was sold to Frontier. I have a Verizon FIOs router. I will try shutting it all down later and I hope this will work. I am a 65 yr. old “senior” lady who is not terribly tech savvy. I have a desk top computer, running windows 10, but I run it as close as I can to “Windows XP” mode. I don’t use a tablet or a smart-phone (Yet!) I will let you know if shutting down the router and rebooting the whole system works. I hope it does as my home search is very impeded by lack of access to zillow! Thanks again!


Keith S October 10, 2016

Just a regular guy (not a computer scrapping guy). Stumbled on this page from Google. I have the same problem – Zillow just blocked me and shows me some numbers or no pages at times. I am looking for a rental and am shocked they could block me. Who do they not block?
I know the experts can get by their blocks, so all the innocent people like me are caught in their silly blocks


    ScrapeHero October 13, 2016

    Keith – sorry to hear that you too are having issues.
    Please check the comment above and turn off the router for a few minutes.
    It should unblock you.
    Thanks


Maria L. October 13, 2016

I have some good news to report which may help you, too, Keith S. I was restored to Zillow-access after I completely shut down my computer and the FIO’s router. I turned off the power source and let it sit for 4 or 5 hours, while I took care of other non-computer-related chores. Then, presto, i was able to connect with Zillow, again and have had no problems since. I really don’t know if the problem was fixed by doing this, or if it was fixed by Zillow. I did notice that some of the daily e-mails I now receive from Zillow have a different type or subject line and “format” — so maybe zillow was working on “changes” in its website and fixed some of their “bugs” or “bots” or what-not! I got no results after shutting down my computer and re-booting both my computer and router, several times. The results came after I completely cut the power source for the router and computer, and let it be off for several hours. But, like I said, I’m not really sure if this is what restored my access to Zillow, or if Zillow did something to “fix things” . . . Good Luck! Thank you, again, “Scrape-Hero” for having this website and providing help to the public! I sincerely appreciate that! Ms. ML


    ScrapeHero October 13, 2016

    Maria – the shutting off fixed exactly what we believed to be the problem. Your IP was blocked and when you turn the router off for a long time you are almost guaranteed to get a new IP.
    The Zillow changes are just coincidental and most likely had nothing to do with your unblocking.


Naznin January 10, 2017

scrapped, and now it is showing as forbidden. what to do next?


    Shabbir February 14, 2017

    wait for a day and check if you are still blocked. If you are using a proxy, change the proxy in request params and retry. If it doesn’t work, you have switch to a different I.P, flush you DNS and renew your IP on DHCP. If static then sorry:-)


    thedude that doesntabide April 29, 2017

    Dont listen to numb nuts down there, change your user agent. If you are spawning alot of requests, use vpn or proxy every other request.. but change user agent often as that will be the first thing marked.


Alex April 24, 2017

Hey,

First off, great article! A lot of good information here. Secondly, I was hoping you might be able to help me out. I’ve created a spider using Guzzle (php) and I am using a spoof header (only a fake user agent), but it only works 60% of the time. The other 40% I get a 503 error. The weird thing is, is that I noticed when I set User-Agent to null, it passes 100% of the time. I would like to use fake user-agents, because I know I’ll eventually get blocked.

Do you have any idea why this might be?

Thanks again


VotersofNY December 20, 2017

New at this. Can I just do a view source and then save the source and use a php script to extract the information I want from it?


    Martín January 29, 2018

    You can while you are not doing it 1000 times per minute with an automated software/script.


sajad September 6, 2018

Is it possible to scrap from a website that has a strict limitation ????
I just check robots.txt for a web page and it seems it even prevent the google_pm to have access but the data that I want to scrap from it is public.
so when I requests.get(URL) in python I always got error like:
ConnectionError: (‘Connection aborted.’, OSError(“(10054, ‘WSAECONNRESET’)”,))

here are the robots.txt rules for this website:
Disallow: /search/
Disallow: /google_pm/
Disallow: /research/get_news.php
Disallow: /news_comtex_sitemap.xml
Disallow: /news_partner_sitemap.xml
Disallow: /blog/post_archive.html
Disallow: /external/all_commentary/
Disallow: /article/stock/news/
Disallow: /article/stock/commentary/
Disallow: /pr/
Disallow: /comments.php
Disallow: /ZER/zer_get_pdf.php
Disallow: /ZER/free_report.php
Disallow: /etf/etf_get_pdf.php
Disallow: /research/pdf_snapshot.php
Disallow: /stock/quote/pdf_snapshot.php
Disallow: /zer_comp_reports.php
Disallow: /commentary_print.php
Disallow: /research/print.php
Disallow: /forgot.php
Disallow: /research/report.php
Disallow: /tracks/
Disallow: /stock/stockcompare/comparestocks.php
Disallow: /ZER/zer_comp_reports.php
Disallow: /ZER/zer_industry_drilling_detail.php
Disallow: /performance_guarantee.php
Disallow: /stock/quote/report.php
Disallow: /research/report.php
Disallow: /research/reports/index.php
Disallow: /research/reports/
Disallow: /logout.php
Disallow: /performance/
Disallow: /z2_index.php
Disallow: /2802258/
Disallow: /funds/mfrank/showAnalyst_report.php
Disallow: /registration/ultimatetrader/
Disallow: /registration/zic/
Disallow: /registration/confidential/
Disallow: /registration/premium/
Disallow: /registration/blackboxtrader/
Disallow: /registration/etftrader/
Disallow: /registration/ftmtrader/
Disallow: /registration/homerun/
Disallow: /registration/incomeinvestor/
Disallow: /registration/insidertrader/
Disallow: /registration/internationaltrader/
Disallow: /registration/markettimer/
Disallow: /registration/momentumtrader/
Disallow: /registration/optionstrader/
Disallow: /registration/rta/
Disallow: /registration/stocksunder10/
Disallow: /registration/surprisetrader/
Disallow: /registration/top10/
Disallow: /registration/valueinvestor/
Disallow: /registration/order.php


Shaimaa Hafez September 19, 2018

I got blocked from a website I was scraping. Every time I try to open the site through any browser, it says 403 forbidden and the scraping code doesn’t work anymore.
What should I do to be able to access the website again?


    ScrapeHero September 19, 2018

    Changing your IP would be the best bet and our website has other ideas if that doesn’t work.


      Shaimaa Hafez September 19, 2018

      Thank you for replying.
      It’s a windows server 2012 IP address so how to change it?


        Chad January 18, 2019

        it would mean changing your public IP address. A proxy would be one way…


        VBScript VBScript June 6, 2019

        Is scraping with repetitive keystrokes Ctrl+a, Ctrl+c (sendkeys commands in VBScript) detectable? I would easily analyze data from the clipboard!


Umer October 26, 2018

I am trying to scrape some information from website http://www.similarweb.com through python script (tried through both python shell and IDE) but ends up into a captcha page but the same url loads completely in chrome or any other browser. Captcha message:

Pardon Our Interruption…

As you were browsing similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:

You’re a power user moving through this website with super-human speed.
You’ve disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.

After completing the CAPTCHA below, you will immediately regain access to similarweb.com.

Could you please let me know about the fix?


Jerrad March 14, 2019

What is a good speed to start out with when trying a new spider? For example in clicking links or copying text.? I’m not in a hurry I just want my search to be complete.


    ScrapeHero March 15, 2019

    A delay of 10 – 30 seconds between clicks would not put much load on the website and the scraper would be “nice” to the website.


Lee April 23, 2019

If I am using a website to scrape emails from a list of domains. Is that using my IPs to do that or the websites? The website in question is https://www.onlineemailextractor.com/.


Prashant October 14, 2019

How to solve distil captcha for the purspose of scraping


Aislinn April 30, 2020

Hi, how would you go around a site using datadome (such as fnac.com)?


    ScrapeHero April 30, 2020

    This article describes some of the basic techniques. This industry changes everyday but some of the basic techniques stay the same.
    Thanks


Abhijit Mandal July 20, 2020

I would like to scrape “www.zoopla.co.uk”. I can do this when I use Azure Notebooks, but the same code does not work with Google Colab – it gives 403 Forbidden error. Can you suggest a way around?


    ScrapeHero July 20, 2020

    Sorry – we cant help with every platform out there, but hopefully someone else in the community can


