Web Scraping Tutorials


LEARN HOW TO USE WEB SCRAPING TO ENHANCE PRODUCTIVITY AND AUTOMATION

We provide many step-by-step tutorials with source code for web scraping, web crawling, data extraction, headless browsers, etc.

Our web scraping tutorials are usually written in Python using libraries such as LXML, Beautiful Soup, Selectorlib and occasionally in Node.js.

The full source code is also available to download in most cases or available to be easily cloned using Git.

We also provide various in-depth articles about Web Scraping tips, techniques and the latest technologies which include the latest anti-bot technologies, methods used to safely and responsibly gather publicly available data from the Internet.

The community that has coalesced around these tutorials and their comments help anyone from a beginner hobbyist person to an advanced programmer solve some of the issues they face with web scraping.

These tutorials are frequently linked to as StackOverflow solutions and discussed on Reddit.

Please feel free to read and participate in the discussions with your comments.

All Tutorials

How to Solve Simple Captchas using Python Tesseract

How to Solve Simple Captchas using Python Tesseract

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As the acronym suggests, it is a test used to determine whether the user is human or not. A typical captcha consists of a distorted test, which a computer program cannot interpret but a human can (hopefully) still read. This tutorial will […]

How to Parse Addresses using Python and Google GeoCoding API

How to Parse Addresses using Python and Google GeoCoding API

Web scraping can often lead to you having scraped address data which are unstructured. If you have come across a large number of freeform address as a single string, for example – “9 Downing St Westminster London SW1A, UK”,  you know how hard it would be to validate, compare and deduplicate these addresses. To start […]

How to Scrape Expedia using Python and LXML

How to Scrape Expedia using Python and LXML

Learn how to scrape flight details from Expedia.com, a leading travel and hotel site, using Python 3 and LXML in this web scraping tutorial. You’ll learn how to extract flight details such as flight timings, plane names, flight duration and more for a given source and destination.

How to scrape Yelp Business Details using Python and LXML

How to scrape Yelp Business Details using Python and LXML

This tutorial is a follow-up of How to scrape Yelp.com for Business Listings using Python. In this tutorial, we will help you in scraping Yelp.com data from the detail page of a business. You can use URLs of businesses you are interested in OR the ones you got from part one of this tutorial. Let’s […]

How to scrape Yelp for Business Listings

How to scrape Yelp for Business Listings

Yelp.com is a reliable source for extracting information regarding local businesses. In this tutorial, you will learn how to extract information of business listings such as name, search rank, number of reviews and more from Yelp.com based on a given city/state and type of business using Python 3 and LXML.

The best data and file formats for scraped data

The best data and file formats for scraped data

The data we provide comes in various forms from the source and is largely text (barring rich media such as images and videos or proprietary file formats such as PDFs). Our customers need this data in various formats and the key to a successful and scalable solution that fits the best data formats for web […]

Turn the Internet into meaningful, structured and usable data