Scraping Tips


Tips and articles about web scraping, and how to use automation successfully to gather data from websites. Data extraction techniques and code are available in our tutorials.

How to Solve Simple Captchas using Python Tesseract

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the acronym suggests, it is a test used to determine whether the user is human or not. A typical captcha consists of distorted text, which a computer program cannot interpret but a human can (hopefully) still read. This tutorial will […]

How to Parse Addresses using Python and Google GeoCoding API

Web scraping often leaves you with address data that is unstructured. If you have come across a large number of freeform addresses, each as a single string, for example “9 Downing St Westminster London SW1A, UK”, you know how hard it is to validate, compare and deduplicate these addresses. To start […]

The best data and file formats for scraped data

The data we provide comes in various forms from the source and is largely text (barring rich media such as images and videos or proprietary file formats such as PDFs). Our customers need this data in various formats and the key to a successful and scalable solution that fits the best data formats for web […]

An API for every site using web scraping

There is a lot of content available on the millions of websites on the Internet, and all of it involved some amount of programming to get it there. However, getting to all this content through a programmatic API isn’t really possible. If you need data scraped from a website in a specific format in […]

XPaths and their relevance in Web Scraping

XPath (XML Path Language) is a syntax for defining parts of an XML document. We will explain the relevance of XPath in web scraping. XPath is a query language for identifying and selecting nodes or elements in an XML document using a tree-like representation of the document. XPath was defined by the World Wide […]

Why *not* scrape yourself

Before you get all kinds of ideas about what the topic of this article means, please look at the context: we are talking about web scraping here! This post will talk about the reasons not to do this yourself and why to call in a professional (wink wink, use ScrapeHero). You […]

Webscraping using Python without using large frameworks like Scrapy

Scrapy is a well-established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill, and for extremely large jobs it is very slow. If you would like to roll up your sleeves and perform web scraping in Python, continue reading. If you need publicly available data from scraping […]

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way. Bigger websites have more data, more security and more pages. We’ve learned a lot from our years of crawling such large, complex websites, and these web scraping tips could help solve some of your challenges. Web Scraping Tips: Here are 5 web scraping […]