Scalable do-it-yourself scraping – How to build and run scrapers on a large scale

In a data-driven world, businesses that don’t rely on data have a very low chance of success.

One of the best sources of data is the data publicly available online on various websites, and to collect it you have to employ a technique called Web Scraping or Data Scraping.

You can use full-service professionals such as ScrapeHero to do all of this for you, or, if you feel brave enough, you can tackle it yourself.

The purpose of this article is to walk you through some of the things you need to do and the issues you need to be cognizant of when you do decide to do it yourself.

When you decide to do this yourself, you will most likely be hiring a few developers who know how to build scrapers, setting up servers and related infrastructure to run those scrapers without interruption, and integrating the extracted data into your business process.

Building and maintaining a large number of web scrapers is a very complex process so proceed with caution.

Here are the high-level steps involved in this process; we will go through each of them in detail in this article.

  1. Build Scrapers
  2. Run the Scrapers
  3. Store the data
  4. IP Rotation, Proxies and Blacklisting
  5. Quality Checks on Data
  6. Maintenance

Build Scrapers and Set up the servers

The first thing to do is build the scrapers.

It may be best to choose an open-source framework for building your scrapers, such as Scrapy or PySpider. Both are excellent Python-based frameworks with large communities of developers. Because Python is popular and the community is very supportive, you won’t run the risk of your developer(s) disappearing overnight with no one left to maintain your scrapers.
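
To give a sense of what this looks like, here is a minimal Scrapy spider sketch; the site URL and CSS selectors are placeholders you would replace with those of your target site.

```python
import scrapy


class ProductSpider(scrapy.Spider):
    """A minimal spider sketch; the URL and selectors below are placeholders."""
    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical listing page

    def parse(self, response):
        # Yield one item per product card (the selectors are assumptions)
        for card in response.css("div.product"):
            yield {
                "name": card.css("h2::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Follow the pagination link, if the page has one
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```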

There is also a massive difference between writing and running one scraper that scrapes 100 pages and building a large-scale distributed scraping infrastructure that can scrape thousands of websites and millions of pages a day.

If you are scraping a large number of big websites, you might need a lot of servers to get the data in a reasonable time frame. We would suggest using Scrapy Redis, or running PySpider in its distributed mode, across multiple servers.
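
As a rough sketch of how scrapy-redis distributes a crawl, the spider pulls its start URLs from a shared Redis list and every worker points at the same Redis instance through the project settings; the Redis address and key names below are placeholders.

```python
# settings.py - scrapy-redis configuration shared by all workers
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the request queue between runs
REDIS_URL = "redis://10.0.0.5:6379"      # placeholder address of the shared Redis

# spiders/products_distributed.py
from scrapy_redis.spiders import RedisSpider


class DistributedProductSpider(RedisSpider):
    """Reads start URLs from a shared Redis list instead of start_urls,
    so any number of identical workers can feed off the same queue."""
    name = "products_distributed"
    redis_key = "products:start_urls"    # LPUSH URLs into this key to feed the cluster

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```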

Once you have chosen a framework, hire some good developers to build these scrapers, and set up the servers required to run them and to store the data.

Run Scrapers

If you need the data to be refreshed periodically, you’ll either have to run the scrapers manually or automate the runs using some tool or process.

If you are using Scrapy, scrapyd + cron can schedule the spiders for you and keep the data updated the way you need it. PySpider also has a UI for scheduling.
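
As one way to automate this, a small script (hypothetical project and spider names) can call scrapyd's schedule.json endpoint, and a cron entry can then invoke the script on whatever refresh schedule you need.

```python
# schedule_spider.py - invoked from cron to kick off a scrapyd job, e.g.
#   0 2 * * * /usr/bin/python3 /opt/scrapers/schedule_spider.py
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # scrapyd's default port


def schedule(project, spider):
    # scrapyd responds with JSON containing a job id on success
    response = requests.post(SCRAPYD_URL, data={"project": project, "spider": spider})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # "mycrawlers" and "products" are placeholder project/spider names
    print(schedule("mycrawlers", "products"))
```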

Store your data

Once you have this massive data trove, you need a place to store it. We would suggest using a NoSQL database like MongoDB, Cassandra or HBase to store this data, depending upon the frequency and speed of scraping.
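
For example, if you go with MongoDB, a simple Scrapy item pipeline along these lines can write every scraped item into a collection; the connection URI and database name are placeholders, and the class would need to be enabled under ITEM_PIPELINES in your Scrapy settings.

```python
# pipelines.py - write scraped items into MongoDB
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="scraped"):
        self.mongo_uri = mongo_uri   # placeholder connection details
        self.db_name = db_name

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider keeps data from different sites separate
        self.db[spider.name].insert_one(dict(item))
        return item
```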

You can then extract this data from the database and integrate it with your business process. But before you do that, you should set up some Quality Assurance tests for your data (more on that later).

IP Rotation, Proxies and Blacklisting

Large-scale scraping comes with a multitude of problems, and one of the biggest is the anti-scraping measures employed by the websites you are trying to scrape.

If any of the target websites uses IP-based blocking, your servers’ IP addresses will be blacklisted in no time and the site will stop responding to requests from them. You’ll be left with very few options once you are blacklisted.

So, how do you get around that? You’ll need to get some proxies or a rotating-IP solution and route the scraper’s requests through them.

Here are a few tips to avoid getting blacklisted: rotate requests through a pool of proxy IPs, add realistic delays between requests, randomize request headers such as the User-Agent, and respect the site’s robots.txt and rate limits where you can.
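
As a sketch of the proxy-rotation idea in Scrapy, a small downloader middleware can assign a random proxy from a pool to each outgoing request; the proxy endpoints are placeholders for whatever your proxy provider supplies, and the class has to be registered under DOWNLOADER_MIDDLEWARES in your settings.

```python
# middlewares.py - rotate outgoing requests through a pool of proxies
import random

# Placeholder endpoints; in practice these come from your proxy provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


class RotatingProxyMiddleware:
    """Assign a random proxy from the pool to every outgoing request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_POOL)
```

Combining something like this with Scrapy’s built-in AUTOTHROTTLE_ENABLED setting and a sensible DOWNLOAD_DELAY also helps keep your request rate polite.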

Quality Assurance

The data you scrape is only as good as its quality. To ensure the data you scraped is accurate and complete, you need to run a variety of QA tests on it right after it is scraped.

Having a set of tests for the integrity of the data is essential. Some of this can be automated using regular expressions to check whether the data follows a predefined pattern; if it doesn’t, generate alerts so the records can be inspected manually.
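
A sketch of that kind of automated check might look like the following; the field names and patterns are assumptions about your own schema and would need to match the data you actually scrape.

```python
# qa_checks.py - flag records whose fields don't match the expected patterns
import re

# Expected patterns are assumptions about your own item schema
FIELD_PATTERNS = {
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),   # e.g. $19.99
    "sku": re.compile(r"^[A-Z0-9-]{6,20}$"),
}


def validate(record):
    """Return the names of fields that look wrong for this record."""
    bad_fields = []
    for field, pattern in FIELD_PATTERNS.items():
        value = record.get(field)
        if value is None or not pattern.match(str(value)):
            bad_fields.append(field)
    return bad_fields


if __name__ == "__main__":
    sample = {"price": "19.99", "sku": "AB-1234"}  # the missing "$" should be flagged
    print(validate(sample))                        # -> ['price']
```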

Maintenance

Scrapers

Every website changes its structure now and then, and so should your scrapers. Scrapers usually need adjustments every few weeks, as a minor change in the target website that affects the fields you scrape might either give you incomplete data or crash the scraper, depending on the scraper’s logic.

You have to detect these changes and fix the scrapers before bad data ruins what you are collecting.
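
One simple way to catch such breakage early is to measure, after each crawl, how many items have their required fields filled and raise an alert when coverage drops; the field names and threshold below are assumptions about your own data.

```python
# field_coverage.py - alert when a crawl suddenly stops filling required fields,
# which usually means the target site's HTML structure changed
REQUIRED_FIELDS = ["name", "price"]   # assumption: fields your items should always have
MIN_COVERAGE = 0.90                   # alert if fewer than 90% of items have a field


def coverage(items, field):
    if not items:
        return 0.0
    return sum(1 for item in items if item.get(field)) / len(items)


def check(items):
    for field in REQUIRED_FIELDS:
        ratio = coverage(items, field)
        if ratio < MIN_COVERAGE:
            # Replace print with your real alerting channel (email, Slack, etc.)
            print(f"ALERT: only {ratio:.0%} of items have '{field}' - a selector may be broken")


if __name__ == "__main__":
    check([{"name": "Widget"}, {"name": "Gadget", "price": "$5.00"}])
```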

Database & Servers

Depending upon the size of the data, you will have to clean outdated data out of your database to save space and money. If you still need the old data, you might instead have to scale up your systems; sharding and replication of databases can help.
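
If your data lives in MongoDB, the periodic cleanup can be as simple as deleting records older than a retention window; this sketch assumes each record carries a scraped_at timestamp and uses placeholder database and collection names.

```python
# cleanup.py - drop records older than a retention window to keep the database lean
from datetime import datetime, timedelta

import pymongo

RETENTION_DAYS = 90   # assumption: how long old snapshots stay useful to you

client = pymongo.MongoClient("mongodb://localhost:27017")   # placeholder URI
collection = client["scraped"]["products"]                  # placeholder db/collection

cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
result = collection.delete_many({"scraped_at": {"$lt": cutoff}})
print(f"Removed {result.deleted_count} records older than {RETENTION_DAYS} days")
```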

Server logs should also be cleaned periodically.

Know when to ask for help

This whole process is expensive and time consuming and you need to be ready to take on this challenge.

You also need to know when to stop and ask for help. ScrapeHero has been doing all this and more for many years now, so let us know if you need any help.

Or you can just worry about what to do with the data we deliver to you. Interested?

Turn the Internet into meaningful, structured and usable data
