Scraping Tips


Tips and articles about web scraping: how to use automation to gather data from websites successfully. Data extraction techniques and code are available in our tutorials.

Scalable do-it-yourself scraping – How to build and run scrapers on a large scale

Businesses that don’t rely on data have a meager chance of success in a data-driven world. One of the best sources of data is the information publicly available on websites, and to get it you have to employ a technique called web scraping or data scraping. You can use full-service professionals such […]

Python Frameworks and Libraries for Web Scraping

Comparison and use cases of popular Python frameworks and libraries used for web scraping, such as Scrapy, urllib, Requests, Selenium, BeautifulSoup, and lxml.

How To Make Anonymous Requests using TorRequests and Python

Tor is useful when you need to make requests without revealing your IP address, especially when you are web scraping. This tutorial uses a Python wrapper that helps you do that.

How To Rotate Proxies and IP Addresses using Python 3

When you scrape many pages from a website, using the same IP address for every request can get you blocked. Rotating IP addresses helps keep your scrapers from being disrupted. In this tutorial, we will show you how to rotate IP addresses to avoid getting blocked while scraping.
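
As a rough illustration of the rotation idea (not the tutorial's own code), the sketch below cycles through a small proxy pool round-robin. The proxy addresses are hypothetical placeholders and `proxy_rotation` is a name made up for this example; no network request is made.

```python
from itertools import cycle

# Hypothetical placeholder proxies (assumption); replace with proxies you control.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(proxies):
    """Yield proxy mappings, cycling through the pool round-robin forever."""
    for proxy in cycle(proxies):
        # Same proxy for both schemes; this is the mapping shape Requests accepts.
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXIES)

# Take four mappings: the fourth wraps around to the first proxy again.
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

Each mapping produced this way could be passed as the `proxies=` argument of `requests.get`, so that consecutive requests leave through different addresses.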

How To Install Python Packages for Web Scraping in Windows 10

Web scraping using Python in Windows can be tough. In this tutorial, follow the steps to set up Python 3 and the Python packages you need for web scraping on your Windows 10 computer.

How to fake and rotate User Agents using Python 3

When you scrape many pages from a website, using the same user agent for every request makes the scraper easy to detect. A way to bypass that detection is to fake your user agent and change it with every request you make to a website. In this tutorial, we will show you how to fake user agents and randomize them to prevent getting blocked while scraping websites.
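
A minimal sketch of the randomization step, assuming a small hand-maintained list of User-Agent strings. The strings and the `random_headers` helper are illustrative choices for this example, not taken from the tutorial.

```python
import random

# Illustrative User-Agent strings (assumption); keep your own list up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Return request headers carrying a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


# Build fresh headers before each request so the User-Agent varies.
headers = random_headers()
print(headers["User-Agent"])
```

The resulting dictionary could be passed as the `headers=` argument of `requests.get` on each call.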

How to Solve Simple Captchas using Python Tesseract

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the acronym suggests, it is a test used to determine whether the user is human or not. A typical captcha consists of distorted text, which a computer program cannot interpret but a human can (hopefully) still read. This […]

How to Parse Unstructured Addresses using Python and Google GeoCoding API

If you have come across a large number of freeform addresses, each stored as a single string (for example, “9 Downing St Westminster London SW1A, UK”), you know how hard it is to validate, compare and deduplicate them. To start with, you’ll have to split each address into a more structured form with house […]

The best data and file formats for scraped data

The data we provide comes in various forms from the source and is largely text (barring rich media such as images and videos or proprietary file formats such as PDFs). Our customers need this data in various formats and the key to a successful and scalable solution that works best for our customers and us […]
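
To make the idea of delivering the same text data in several formats concrete, here is a small sketch that serializes the same records as JSON and as CSV using only the Python standard library. The records and field names are made up for illustration, and the choice of these two formats is an assumption rather than the article's recommendation.

```python
import csv
import io
import json

# Made-up scraped records (assumption), all values kept as text.
records = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget, large", "price": "24.50"},
]

# JSON keeps the structure of each record as-is.
json_text = json.dumps(records, indent=2)

# CSV flattens records into rows; the csv module quotes fields containing commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```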

An API for every site using web scraping

There is a lot of content available on the millions of websites on the Internet, and all of it took some amount of programming to get there. However, getting to all this content through a programmatic API isn’t really possible. It feels like somehow the creators of the Internet protocols forgot this essential […]

XPaths and their relevance in Web Scraping

XPath (XML Path Language) is a syntax for addressing parts of an XML document. It is a query language for identifying and selecting nodes or elements in an XML document using a tree-like representation of the document. XPath was defined by the World Wide Web Consortium (W3C). XPaths are one of the few ways […]
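
A short sketch of selecting nodes by path, using Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The sample document is made up for illustration; a library such as lxml would be needed for full XPath 1.0 expressions.

```python
import xml.etree.ElementTree as ET

# Made-up sample document (assumption) to query against.
DOC = """
<catalog>
  <book id="b1"><title>First</title><price>10</price></book>
  <book id="b2"><title>Second</title><price>25</price></book>
</catalog>
"""

root = ET.fromstring(DOC)

# Select every <title> that is a child of a <book> directly under the root.
titles = [node.text for node in root.findall("./book/title")]

# Select the <title> of the <book> whose id attribute equals "b2".
second_title = root.find(".//book[@id='b2']/title").text

print(titles)
print(second_title)
```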

Why *not* scrape yourself

Before you get all kinds of ideas about what the title of this article means, please look at the context: we are talking about web scraping here! This post covers the reasons not to do this yourself and to call in a professional instead (wink wink – use ScrapeHero). You […]

Turn the Internet into meaningful, structured and usable data