How To Rotate Proxies and Change IP Addresses Using Python 3

A common problem faced by web scrapers is getting blocked by websites while scraping them. There are many techniques to prevent getting blocked, such as:

  • Rotating IP addresses
  • Using Proxies
  • Rotating and Spoofing user agents
  • Using headless browsers
  • Reducing the crawling rate

What is a rotating proxy?

A rotating proxy is a proxy server that assigns a new IP address from the proxy pool for every connection. That means you can launch a script to send 1,000 requests to any number of sites and get 1,000 different IP addresses. Using proxies and rotating IP addresses, in combination with rotating user agents, can help you get your scrapers past most anti-scraping measures and prevent them from being detected as scrapers.

The concept of rotating IP addresses while scraping is simple – you make it appear to the website that it is not a single ‘bot’ or person accessing it, but multiple ‘real’ users accessing it from multiple locations. If you do it right, the chances of getting blocked are minimal.

In this blog post, we will show you how to send your requests to a website using a proxy, and then we’ll show you how to send these requests through multiple IP addresses or proxies.

How to send requests through a Proxy in Python 3 using Requests

If you are using Python Requests, you can send requests through a proxy by configuring the proxies argument. For example:

import requests

proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies)

We’ll show how to send a real request through a free proxy.

Let’s find a proxy

There are many websites dedicated to providing free proxies on the internet. One such site is https://free-proxy-list.net/. Let’s go there and pick a proxy that supports HTTPS (as we are going to test this on an HTTPS website).

Here is our proxy –

IP: 209.50.52.162 Port: 9050

Note:

This proxy might not work when you test it. You should pick another proxy from the website if it doesn’t work.

Now let’s make a request to HTTPBin’s IP endpoint and test whether the request went through the proxy:

import requests

url = 'https://httpbin.org/ip'
proxies = {
    "http": 'http://209.50.52.162:9050',
    "https": 'http://209.50.52.162:9050'
}
response = requests.get(url, proxies=proxies)
print(response.json())
{'origin': '209.50.52.162'}

You can see that the request went through the proxy. Let’s get to sending requests through a pool of IP addresses.

Rotating Requests through a pool of Proxies in Python 3

We’ll gather a list of some active proxies from https://free-proxy-list.net/. You can also use private proxies if you have access to them.

You can build this list by manually copying and pasting, or automate it by using a scraper (if you don’t want to go through the hassle of copying and pasting every time the proxies you have get removed). You can write a script to grab all the proxies you need and construct the list dynamically every time you initialize your web scraper. Once you have the list of proxy IPs to rotate, the rest is easy.

We have written some code to pick up proxies automatically by scraping the site. (This code may stop working if the website changes its structure.)

import requests
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        #Keep only proxies that support HTTPS (the "Https" column says "yes")
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            #Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

The get_proxies function returns a set of proxy strings that can be passed to requests as the proxy configuration.

proxies = get_proxies()
print(proxies)
{'121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080'}

Now that we have the list of proxy IP addresses in the variable proxies, we’ll go ahead and rotate through them in a round-robin fashion.

import requests
from itertools import cycle
import traceback

#If you are copy pasting proxy ips, put them in the list below
#proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxies = get_proxies()
proxy_pool = cycle(proxies)

url = 'https://httpbin.org/ip'
for i in range(1, 11):
    #Get a proxy from the pool
    proxy = next(proxy_pool)
    print("Request #%d" % i)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except Exception:
        #Most free proxies will often get connection errors. You will have to retry the entire request using another proxy for it to work.
        #We will just skip retries as it's beyond the scope of this tutorial and we are only downloading a single URL.
        print("Skipping. Connection error")
Request #1
{'origin': '121.129.127.209'}
Request #2
{'origin': '124.41.215.238'}
Request #3
{'origin': '185.93.3.123'}
Request #4
{'origin': '194.182.64.67'}
Request #5
Skipping. Connection error
Request #6
{'origin': '163.172.175.210'}
Request #7
{'origin': '13.92.196.150'}
Request #8
{'origin': '121.129.127.209'}
Request #9
{'origin': '124.41.215.238'}
Request #10
{'origin': '185.93.3.123'}

Okay – it worked. Request #5 had a connection error, probably because the free proxy we grabbed was overloaded with users trying to push their traffic through it. Below is the full code to do this.

Full Code

from lxml.html import fromstring
import requests
from itertools import cycle
import traceback

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies


#If you are copy pasting proxy ips, put them in the list below
#proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxies = get_proxies()
proxy_pool = cycle(proxies)

url = 'https://httpbin.org/ip'
for i in range(1, 11):
    #Get a proxy from the pool
    proxy = next(proxy_pool)
    print("Request #%d" % i)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except Exception:
        #Most free proxies will often get connection errors. You will have to retry the entire request using another proxy for it to work.
        #We will just skip retries as it's beyond the scope of this tutorial and we are only downloading a single URL.
        print("Skipping. Connection error")

Rotating Proxies in Scrapy

Scrapy does not have built-in proxy rotation, but there are many middlewares for rotating proxies or IP addresses in Scrapy. We have found scrapy-rotating-proxies to be the most useful among them.

Install scrapy-rotating-proxies using

pip install scrapy-rotating-proxies

In your Scrapy project’s settings.py, add:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]

As an alternative to ROTATING_PROXY_LIST, you can specify a ROTATING_PROXY_LIST_PATH option with the path to a file that lists proxies, one per line:

ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'

You can read more about this middleware on its GitHub repo.
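Once the middleware is enabled in settings.py, your spiders need no extra proxy handling – every request is routed through a proxy chosen from ROTATING_PROXY_LIST. Here is a minimal sketch of a spider it would apply to (the spider name and URL are placeholders):

import json
import scrapy

class IPSpider(scrapy.Spider):
    # Placeholder spider; the rotating proxy middleware configured in settings.py
    # picks a proxy for each request automatically
    name = 'ip'
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # httpbin echoes back the IP address the request came from
        yield json.loads(response.text)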

5 Things to keep in mind while using proxies and rotating IP addresses

Here are a few tips that you should remember:

Do not rotate IP addresses when scraping websites after logging in or using sessions

We don’t recommend rotating IPs if you are logging into a website. The website already knows who you are when you log in, through the session cookies it sets. To maintain the logged-in state, you need to keep passing the session ID in your cookie headers. The server can easily tell that you are a bot when the same session cookie comes from multiple IP addresses, and it will block you.

A similar logic applies if you are sending back that session cookie to a website. The website already knows this session is using a certain IP and a User-Agent. Rotating these two fields would do you more harm than good in these cases.

In these situations, it’s better just to use a single IP address and maintain the same request headers for each unique login.
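For example, a logged-in scrape could pin a single proxy and a single User-Agent to a requests Session for its entire lifetime. A minimal sketch, where the login URL, form fields, and proxy address are placeholders:

import requests

# Placeholder proxy, URL and credentials, for illustration only
proxy = 'http://209.50.52.162:9050'
session = requests.Session()
session.proxies = {"http": proxy, "https": proxy}
session.headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'

# The session cookie set by the login response is stored and reused automatically,
# and every subsequent request leaves through the same proxy with the same headers
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})
response = session.get('https://example.com/account')
print(response.status_code)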

Avoid Using Proxy IP addresses that are in a sequence

Even the simplest anti-scraping plugins can detect that you are a scraper if your requests come from IP addresses that are consecutive or belong to the same range, like this:

64.233.160.0

64.233.160.1

64.233.160.2

64.233.160.3

Some websites have gone as far as blocking entire providers like AWS, and some have even blocked entire countries.
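One rough way to guard against this is to keep at most one proxy from each /24 block, so the IP addresses you rotate through are never next to each other. A small sketch (spread_out is an illustrative helper, not part of the code above):

def spread_out(proxies):
    # Keep at most one 'ip:port' proxy per /24 block so the IPs used are not sequential
    seen_blocks = set()
    spread = []
    for proxy in proxies:
        ip = proxy.split(':')[0]
        block = '.'.join(ip.split('.')[:3])  # first three octets identify the /24 block
        if block not in seen_blocks:
            seen_blocks.add(block)
            spread.append(proxy)
    return spread

print(spread_out(['64.233.160.1:8080', '64.233.160.2:8080', '185.93.3.123:3128']))
# Only one of the 64.233.160.x proxies survives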

If you are using free proxies – automate

Free proxies tend to die out quickly, often within days or hours, and can expire before your scrape even completes. To prevent that from disrupting your scrapers, write some code that automatically picks up and refreshes the proxy list you use for scraping with working IP addresses. This will save you a lot of time and frustration.
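Here is a rough sketch of that idea, reusing the get_proxies() function from earlier: re-scrape the free proxy list and keep only proxies that answer a quick test request. The minimum pool size and the 5-second timeout are arbitrary choices you would tune.

import requests
from itertools import cycle

def refresh_pool(min_size=5):
    # Re-scrape free-proxy-list.net and keep only proxies that still respond
    working = set()
    for proxy in get_proxies():
        try:
            requests.get('https://httpbin.org/ip',
                         proxies={"http": proxy, "https": proxy}, timeout=5)
            working.add(proxy)
        except requests.exceptions.RequestException:
            continue
        if len(working) >= min_size:
            break
    return cycle(working)

proxy_pool = refresh_pool()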

Use Elite Proxies whenever possible if you are using Free Proxies (or even if you are paying for proxies)

Not all proxies are the same. There are mainly three types of proxies available on the internet.

  1. Transparent Proxy – A transparent proxy is a server that sits between your computer and the internet and redirects your requests and responses without modifying them. It sends your real IP address in the HTTP_X_FORWARDED_FOR header, so a website that checks not only your REMOTE_ADDR but also the common proxy headers will still know your real IP address. The HTTP_VIA header is also sent, revealing that you are using a proxy server.
  2. Anonymous Proxy – An anonymous proxy does not send your real IP address in the HTTP_X_FORWARDED_FOR header; instead, it submits the IP address of the proxy, or leaves the header blank. The HTTP_VIA header is still sent, just as with a transparent proxy, which reveals that you are using a proxy server. An anonymous proxy therefore hides your real IP address from websites, which can be enough to protect your privacy on the internet. The website can still see that you are using a proxy server, but that does not really matter as long as the proxy does not disclose your real IP address. However, if someone really wants to restrict page access, an anonymous proxy server can still be detected and blocked.
  3. Elite Proxy – With an elite proxy, the server only sees the REMOTE_ADDR of the proxy, while the proxy-revealing headers (HTTP_VIA, HTTP_X_FORWARDED_FOR) are empty or absent. It makes you seem like a regular internet user who is not using a proxy at all. An elite proxy server is ideal for getting past restrictions on the internet and protecting your privacy to the fullest extent; you will appear to be a regular user located in the country where the proxy server is running.

Elite proxies are your best option as they are hard to detect. Use anonymous proxies if your goal is just privacy on the internet. Use transparent proxies only as a last resort – the chances of success are very low.
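You can check what a given proxy leaks by requesting httpbin.org/headers through it – that endpoint simply echoes back the request headers it received, so X-Forwarded-For or Via will show up there if the proxy adds them. A small sketch (use the plain-HTTP endpoint so the proxy actually sees and can modify the request rather than just tunnelling it):

import requests

proxy = 'http://209.50.52.162:9050'  # substitute the proxy you want to test
response = requests.get('http://httpbin.org/headers',
                        proxies={"http": proxy, "https": proxy})
# Headers such as X-Forwarded-For or Via appearing here reveal a transparent/anonymous proxy
print(response.json()['headers'])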

Get Premium Proxies if you are Scraping Thousands of Pages

Free proxies available on the internet are constantly abused and end up on blacklists used by anti-scraping tools and web servers. If you are doing serious large-scale data extraction, you should pay for some good proxies. There are many providers who will even rotate the IPs for you.

Use IP Rotation in combination with Rotating User Agents

IP rotation on its own can help you get past some anti-scraping measures. If you find yourself being banned even after using rotating proxies, a good next step is to add User-Agent spoofing and rotation to your requests.
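Here is a minimal sketch of combining the two, cycling a small illustrative list of User-Agent strings alongside the proxy pool built with get_proxies() from earlier; in practice you would use a much larger, up-to-date set of user agents.

import requests
from itertools import cycle

# Short illustrative list of User-Agent strings
user_agents = cycle([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
])
proxy_pool = cycle(get_proxies())

url = 'https://httpbin.org/headers'
for i in range(1, 6):
    proxy = next(proxy_pool)
    headers = {'User-Agent': next(user_agents)}
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy},
                                headers=headers, timeout=10)
        # httpbin echoes back the User-Agent it saw for this request
        print(response.json()['headers'].get('User-Agent'))
    except requests.exceptions.RequestException:
        print("Skipping. Connection error")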

That’s all we’ve got to say. Happy Scraping 🙂


Responses

SkyChaos March 9, 2018

This is perfect, but most users here on your website and on GitHub are asking for help with scraping multiple pages. Further reading didn’t help me with it, as your previous scraping post only returns the first page of reviews, so this post doesn’t do much without that.

It would be perfect if you’d create/edit that post and add multiple-page rotation so we can combine it with this one!

Thanks in advance.


rootVIII August 8, 2018

Hello. May I suggest this requests wrapper class?

ProxyRequests

https://github.com/rootVIII/proxy_requests

It automates the process of scraping proxies and making the request. It’s pretty simple to use and very effective


Nathan Breedlove August 26, 2018

Did you ever cover retries with failed proxies? I’m trying to implement that


Sahil September 14, 2018

Thanks a lot for this article. It really saved my day. My professor asks me to collect data and do analyses and this proxy was always an issue.
I cannot thank you enough.


    ScrapeHero September 15, 2018

    Hi Sahil. We are glad to be of help 😀


vampyraeThomas September 26, 2018

Very useful article! Somehow though, when I use the code my requests always process with the last proxy in my list. Any idea how I could overcome that?


Yunior Aguirre June 19, 2019

If I was to use this code with threading, would the proxies overlap and be used at the same time, or does the proxy_pool variable prevent this?


federico July 12, 2019

Always getting the (except: ) Skipping. Connection Error while testing the code. The list creation is fine, but I’m unable to make the request.


    ScrapeHero July 19, 2019

    Sounds like you are getting blocked


Mranalinee Chouhan December 17, 2019

raise ProxyError(e, request=request)
[Tue Dec 17 11:11:14.869383 2019] [wsgi:error] [pid 30135:tid 139877152048896] [remote 27.56.251.32:16683] requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.realtor.com', port=443): Max retries exceeded with url:
(Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f37a4c18f60>: Failed to establish a new connection: [Errno 111] Connection refused',)))
Please help me out; why am I getting this error?


Xavier Julien May 29, 2020

I really appreciate the effort you have put into educating your readers. I was curious if you could direct me to an article or some other resource to help me understand more about these headers for proxies; I want to be able to see these headers when testing my purchased proxies. In other words, if I buy a premium proxy and send a request out to a URL, I would like to see that request’s headers as they are being sent, along with the rest of the HTTP headers and body. This is the closest and most informative article I have found, but I’m still clueless how to resolve this. If you have the time, can you please point me in the right direction?


    ScrapeHero May 29, 2020

    Those headers can ONLY be provided by your proxy provider or the website that is getting your request.
    Some proxy providers provide some basic data back using their own custom headers but most will not.

    The other way to do this is to set up your own basic website and access that through the proxy. Write a basic PHP or some other script on that server to capture those header variables and print them to a file to analyze later.

    Here is some PHP code that works well with nginx (or Apache) to dump the headers to a JSON payload, which can be printed or written to a file:


    if (!function_exists('getallheaders')) {
        function getallheaders()
        {
            $headers = [];
            foreach ($_SERVER as $name => $value) {
                if (substr($name, 0, 5) == 'HTTP_') {
                    $headers[str_replace(' ', '-', strtolower(str_replace('_', ' ', substr($name, 5))))] = $value;
                }
            }
            return json_encode($headers);
        }
    }


      ScrapeHero May 29, 2020

      To print these headers back to the browser, add this line at the end:
      print_r(getallheaders());


        Xavier Julien May 30, 2020

        You’re awesome


LIVEMTC July 24, 2020

So how would one keep the proxy from disconnecting from the URL it’s sent to?


    ScrapeHero July 26, 2020

    Read up on sessions. It is a complex topic beyond the scope of what we cover.


DragonBall July 28, 2020

Awesome tutorial! May I know why I keep getting connection errors when I change url = 'https://httpbin.org/ip' to some other URLs? How do I resolve this issue?


Chris September 3, 2020

Hey, thanks for this helpful article, I hope this will work for my scraping project :). One question: you are importing “traceback” but I don’t see it being used anywhere. Is it needed?

Thanks
Chris


    ScrapeHero September 4, 2020

    Thanks Chris – glad we could help.
    It is probably a leftover artifact – if the code works without it, go ahead and remove it.


vijai June 1, 2021

Why can I not find a free proxy IP that actually works?


