A common problem faced by web scrapers is getting blocked by websites while scraping them. There are many techniques to prevent getting blocked, such as:
- Rotating IP addresses
- Using Proxies
- Rotating and Spoofing user agents
- Using headless browsers
- Reducing the crawling rate
What is a rotating proxy?
A rotating proxy is a proxy server that assigns a new IP address from the proxy pool for every connection. That means you can launch a script to send 1,000 requests to any number of sites and get 1,000 different IP addresses. Using proxies and rotating IP addresses in combination with rotating user agents can help get your scrapers past most anti-scraping measures and prevent them from being detected as scrapers.
The concept of rotating IP addresses while scraping is simple: you make it appear to the website that it is not a single 'bot' or person accessing it, but multiple 'real' users accessing it from multiple locations. If you do it right, the chances of getting blocked are minimal.
In this blog post, we will show you how to send your requests to a website using a proxy, and then we’ll show you how to send these requests through multiple IP addresses or proxies.
How to send requests through a Proxy in Python 3 using Requests
If you are using Python-Requests, you can send requests through a proxy by configuring the proxies argument. For example:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
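If your proxy requires authentication, Requests also accepts credentials embedded in the proxy URL. Here is a minimal sketch, where 'user', 'password', and the host are placeholders for your own proxy details:

import requests

# Placeholder credentials and host; substitute your own proxy details
proxies = {
    'http': 'http://user:password@10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:3128',
}
requests.get('http://example.org', proxies=proxies)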
We’ll show how to send a real request through a free proxy.
Let’s find a proxy
There are many websites dedicated to providing free proxies on the internet. One such site is https://free-proxy-list.net/. Let’s go there and pick a proxy that supports https (as we are going to test this on an https website).
Here is our proxy –
IP: 209.50.52.162
Port: 9050
Note:
This proxy might not work when you test it. You should pick another proxy from the website if it doesn’t work.
Now let's make a request to HTTPBin's IP endpoint and test if the request went through the proxy:
import requests

url = 'https://httpbin.org/ip'
proxies = {
    'http': 'http://209.50.52.162:9050',
    'https': 'http://209.50.52.162:9050',
}
response = requests.get(url, proxies=proxies)
print(response.json())
{'origin': '209.50.52.162'}
You can see that the request went through the proxy. Let’s get to sending requests through a pool of IP addresses.
Rotating Requests through a pool of Proxies in Python 3
We’ll gather a list of some active proxies from https://free-proxy-list.net/. You can also use private proxies if you have access to them.
You can build this list by manually copying and pasting, or you can automate it with a scraper (if you don't want the hassle of copying and pasting every time the proxies you have get removed). You can write a script to grab all the proxies you need and construct the list dynamically every time you initialize your web scraper. Once you have the list of proxy IPs to rotate, the rest is easy.
We have written some code to pick up IPs automatically by scraping. (This code may stop working when the website updates its structure.)
import requests
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        # Only keep proxies that support HTTPS (the "yes" in column 7)
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies
The function get_proxies will return a set of proxy strings that can be passed to the request object as proxy config.
proxies = get_proxies()
print(proxies)
{'121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080'}
Now that we have the list of proxy IP addresses in a variable proxies, we'll go ahead and rotate through them using a round-robin method.
import requests
from itertools import cycle
import traceback

# If you are copy-pasting proxy IPs, put them in the list below
# proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxies = get_proxies()
proxy_pool = cycle(proxies)

url = 'https://httpbin.org/ip'
for i in range(1, 11):
    # Get a proxy from the pool
    proxy = next(proxy_pool)
    print("Request #%d" % i)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except Exception:
        # Most free proxies will often get connection errors. You would have to retry
        # the entire request using another proxy for it to work. We skip retries here
        # as they are beyond the scope of this tutorial and we only download a single URL.
        print("Skipping. Connection error")
Request #1
{'origin': '121.129.127.209'}
Request #2
{'origin': '124.41.215.238'}
Request #3
{'origin': '185.93.3.123'}
Request #4
{'origin': '194.182.64.67'}
Request #5
Skipping. Connection error
Request #6
{'origin': '163.172.175.210'}
Request #7
{'origin': '13.92.196.150'}
Request #8
{'origin': '121.129.127.209'}
Request #9
{'origin': '124.41.215.238'}
Request #10
{'origin': '185.93.3.123'}
Okay – it worked. Request #5 had a connection error, probably because the free proxy we grabbed was overloaded with users trying to route their traffic through it. Below is the full code to do this.
Full Code
from lxml.html import fromstring
import requests
from itertools import cycle
import traceback

def get_proxies():
    # Scrape the free proxy list and return a set of 'IP:PORT' strings
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

# If you are copy-pasting proxy IPs, put them in the list below
# proxies = ['121.129.127.209:80', '124.41.215.238:45169', '185.93.3.123:8080', '194.182.64.67:3128', '106.0.38.174:8080', '163.172.175.210:3128', '13.92.196.150:8080']
proxies = get_proxies()
proxy_pool = cycle(proxies)

url = 'https://httpbin.org/ip'
for i in range(1, 11):
    # Get a proxy from the pool
    proxy = next(proxy_pool)
    print("Request #%d" % i)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy})
        print(response.json())
    except Exception:
        # Most free proxies will often get connection errors. You would have to retry
        # the entire request using another proxy for it to work. We skip retries here
        # as they are beyond the scope of this tutorial and we only download a single URL.
        print("Skipping. Connection error")
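That said, if you do want basic retries instead of skipping failed requests, here is a minimal sketch. It reuses proxy_pool from the code above; the fetch_with_retries helper and the retry count of 3 are our own illustrative choices, not part of any library:

import requests

def fetch_with_retries(url, proxy_pool, max_retries=3):
    # Try up to max_retries different proxies before giving up
    for _ in range(max_retries):
        proxy = next(proxy_pool)
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.exceptions.RequestException:
            # This proxy failed; move on to the next one in the pool
            continue
    return None

response = fetch_with_retries('https://httpbin.org/ip', proxy_pool)
if response is not None:
    print(response.json())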
Rotating Proxies in Scrapy
Scrapy does not have built-in proxy rotation, but there are many middlewares for rotating proxies or IP addresses. We have found scrapy-rotating-proxies to be the most useful among them.
Install scrapy-rotating-proxies using
pip install scrapy-rotating-proxies
In your Scrapy project's settings.py, add:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
}

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
As an alternative to ROTATING_PROXY_LIST, you can specify a ROTATING_PROXY_LIST_PATH option with a path to a file of proxies, one per line:
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
You can read more about this middleware on its GitHub repo.
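Once the middleware is enabled, the spider itself needs no proxy-handling code, because proxy assignment happens in the downloader. Here is a minimal sketch of a spider you could run against the settings above (the spider name and target URL are our own choices):

import json
import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip'
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # The middleware has already routed this request through a proxy
        yield {'origin': json.loads(response.text)['origin']}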
5 Things to keep in mind while using proxies and rotating IP addresses
Here are a few tips that you should remember:
Do not rotate IP Address when scraping websites after logging in or using Sessions
We don't recommend rotating IPs if you are logging into a website. The website already knows who you are when you log in, through the session cookies it sets. To maintain the logged-in state, you need to keep passing the Session ID in your cookie headers. The server can easily tell that you are a bot when the same session cookie comes from multiple IP addresses, and it will block you.
A similar logic applies if you are sending back that session cookie to a website. The website already knows this session is using a certain IP and a User-Agent. Rotating these two fields would do you more harm than good in these cases.
In these situations, it’s better just to use a single IP address and maintain the same request headers for each unique login.
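For example, with Requests you would pin everything to a single Session object so the cookies, proxy, and headers stay consistent. Here is a minimal sketch with placeholder URLs, credentials, and proxy address:

import requests

session = requests.Session()
# Pin one proxy and one User-Agent for the lifetime of the login
session.proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'}
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

# Placeholder login endpoint and form fields; adapt them to the target site
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# Subsequent requests reuse the same session cookie, IP address, and headers
response = session.get('https://example.com/account')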
Avoid Using Proxy IP addresses that are in a sequence
Even the simplest anti-scraping plugins can detect that you are a scraper if your requests come from IP addresses that are sequential or belong to the same range, like this:
64.233.160.0
64.233.160.1
64.233.160.2
64.233.160.3
Some websites have gone as far as blocking entire cloud providers like AWS, and some have even blocked entire countries.
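One way to guard against using proxies from a single range when you scrape a free proxy list is to keep at most one proxy per /24 block. The spread_across_ranges helper below is our own illustrative sketch (reusing get_proxies from earlier), not part of any library:

def spread_across_ranges(proxies):
    # Keep at most one proxy per /24 block so requests don't
    # originate from a contiguous IP range
    seen_blocks = set()
    spread = []
    for proxy in proxies:
        ip = proxy.split(':')[0]
        block = '.'.join(ip.split('.')[:3])
        if block not in seen_blocks:
            seen_blocks.add(block)
            spread.append(proxy)
    return spread

proxies = spread_across_ranges(get_proxies())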
If you are using free proxies – automate
Free proxies tend to die quickly, often within days or hours, and can expire before your scrape even completes. To prevent that from disrupting your scrapers, write code that automatically picks up and refreshes the proxy list with working IP addresses. This will save you a lot of time and frustration.
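Here is a minimal sketch of that refresh logic, reusing the get_proxies function from earlier; rebuilding the whole pool once every proxy has failed is just one simple strategy among many:

import random

proxies = get_proxies()
failed = set()

def get_working_proxy():
    # Rebuild the pool from the source site once every proxy has failed
    global proxies, failed
    alive = proxies - failed
    if not alive:
        proxies = get_proxies()
        failed = set()
        alive = proxies
    return random.choice(list(alive))

On a connection error, add the offending proxy to failed and call get_working_proxy() again.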
Use Elite Proxies whenever possible if you are using Free Proxies (or even if you are paying for proxies)
Not all proxies are the same. There are mainly three types of proxies available on the internet.
- Transparent Proxy – A transparent proxy is a server that sits between your computer and the internet and relays your requests and responses without modifying them. It sends your real IP address in the HTTP_X_FORWARDED_FOR header, which means a website that checks not just REMOTE_ADDR but also specific proxy headers will still know your real IP address. The HTTP_VIA header is also sent, revealing that you are using a proxy server.
- Anonymous Proxy – An anonymous proxy does not send your real IP address in the HTTP_X_FORWARDED_FOR header; instead, it submits the IP address of the proxy or leaves the header blank. The HTTP_VIA header is still sent, though, revealing that you are using a proxy server. An anonymous proxy therefore hides your real IP address from websites, which is helpful for keeping your privacy on the internet. The website can still see that you are using a proxy server, but in the end that does not really matter as long as the proxy does not disclose your real IP address. If someone really wants to restrict page access, however, an anonymous proxy can still be detected and blocked.
- Elite Proxy – An elite proxy sends only the REMOTE_ADDR of the proxy itself, while the proxy-related headers are left empty, making you seem like a regular internet user who is not using a proxy at all. An elite proxy server is ideal for getting past restrictions on the internet and protecting your privacy to the fullest extent. You will appear to be a regular internet user who lives in the country your proxy server runs in.
Elite proxies are your best option, as they are hard to detect. Use anonymous proxies if it's just to keep your privacy on the internet. Use transparent proxies only as a last resort, since the chances of success are very low.
Get Premium Proxies if you are Scraping Thousands of Pages
Free proxies available on the internet are heavily abused and end up on the blacklists used by anti-scraping tools and web servers. If you are doing serious large-scale data extraction, you should pay for some good proxies. There are many providers who will even rotate the IPs for you.
Use IP Rotation in combination with Rotating User Agents
IP rotation on its own can get you past some anti-scraping measures. If you find yourself being banned even after using rotating proxies, a good solution is to add header spoofing and rotation as well.
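Here is a minimal sketch combining the two, reusing get_proxies from above; the User-Agent strings are shortened placeholders, not a vetted list:

import requests
from itertools import cycle

user_agents = cycle([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
])
proxy_pool = cycle(get_proxies())

url = 'https://httpbin.org/ip'
for _ in range(5):
    # Rotate the proxy and the User-Agent together on every request
    proxy = next(proxy_pool)
    headers = {'User-Agent': next(user_agents)}
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy},
                                headers=headers, timeout=10)
        print(response.json())
    except requests.exceptions.RequestException:
        print('Skipping. Connection error')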
That’s all we’ve got to say. Happy Scraping 🙂
Having problems collecting the data you need? We can help
Are your periodic data extraction jobs interrupted due to website blocking or other IT infrastructural issues? Using ScrapeHero's data extraction service will make it hassle-free for you.
Responses
This is perfect, but most users here on your website and on GitHub are asking for help scraping multiple pages, and further reading didn't help me with it. Your previous scraping post only returns the first page of reviews, so this post doesn't do much without that.
It would be perfect if you'd create/edit that post and add multiple-page rotation so we can combine it with this one!
Thanks in advance.
Hello. May I suggest this requests wrapper class?
ProxyRequests
https://github.com/rootVIII/proxy_requests
It automates the process of scraping proxies and making the request. It’s pretty simple to use and very effective
Did you ever cover retries with failed proxies? I’m trying to implement that
We will add it soon.
Meanwhile, please take a look at the code in our Amazon Scraper – https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python/. We have implemented retries there.
Yeah, the proxy-requests package does this nicely
https://pypi.org/project/proxy-requests/
Thanks a lot for this article. It really saved my day. My professor asks me to collect data and do analyses, and proxies were always an issue.
I cannot thank you enough.
Hi Sahil. We are glad to be of help 😀
Very useful article! Somehow though, when I use the code my requests always process with the last proxy in my list. Any idea how I could overcome that?
If I was to use this code with threading, would the proxies overlap and be used at the same time, or does the proxy_pool variable prevent this?
Always hitting the except branch ("Skipping. Connection error") while testing the code. The list creation is fine, but I'm unable to make the request.
Sounds like you are getting blocked
raise ProxyError(e, request=request)
[Tue Dec 17 11:11:14.869383 2019] [wsgi:error] [pid 30135:tid 139877152048896] [remote 27.56.251.32:16683] requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.realtor.com', port=443): Max retries exceeded with url:
(Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f37a4c18f60>: Failed to establish a new connection: [Errno 111] Connection refused',)))
Please help me out with this: why am I getting this error?
I really appreciate the effort you have put into educating your readers. I was curious if you could direct me to an article or some other resource that would help me understand more about these headers for proxies; I want to be able to see these headers when testing my purchased proxies. In other words, if I buy a premium proxy and send a request out to a URL, I would like to see that request's headers as it is being sent, along with the rest of the HTTP headers and body. This is the closest and most informative article I have found, but I'm still clueless how to resolve this. If you have the time, please point me in the right direction.
Those headers can ONLY be provided by your proxy provider or the website that is getting your request.
Some proxy providers provide some basic data back using their own custom headers but most will not.
The other way to do this is to set up your own basic website and access that through the proxy. Write a basic PHP or some other script on that server to capture those header variables and print them to a file to analyze later.
Here is some PHP code that works well in nginx (or Apache) to dump the headers to a JSON payload, which can be printed or written to a file:
if (!function_exists('getallheaders'))
{
    // Fallback for servers where getallheaders() is unavailable (e.g. nginx with PHP-FPM)
    function getallheaders()
    {
        $headers = [];
        foreach ($_SERVER as $name => $value)
        {
            // Convert HTTP_HEADER_NAME server variables into header-name keys
            if (substr($name, 0, 5) == 'HTTP_')
            {
                $headers[str_replace(' ', '-', strtolower(str_replace('_', ' ', substr($name, 5))))] = $value;
            }
        }
        return json_encode($headers);
    }
}
To print these headers back to the browser, add this line at the end:
print_r(getallheaders());
You’re awesome
So how would one go about keeping the proxy from disconnecting from the URL it's sent to?
Read up on sessions. It is a complex topic beyond the scope of what we cover.
Can I order a Python script for scraping? My email: ridwanratman@gmail.com
It is free to use
How to combine 3 Python scripts from these web tutorials:
1. amazon.py
—-> https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/
2. https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/
3. https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Please help..
Awesome tutorial! May I know why I keep getting connection errors when I change url = 'https://httpbin.org/ip' to some other URL? How can I resolve this issue?
Hey, thanks for this helpful article. I hope this will work for my scraping project :). One question: you are importing traceback but I don't see it being used anywhere. Is it needed?
Thanks
Chris
Thanks Chris – glad we could help.
It is probably a leftover artifact. If the code works without it, go ahead and remove it.
Why could I not find a free proxy IP that actually works?