Top Web Scraping Cloud Services and Providers in 2018

Cloud-based web scraping platforms provide a relatively speedy entry point into “self-service” scraping. These platforms are a good choice if you want to try out web scraping and have the technical knowledge to build scrapers. Despite their seemingly easy visual interfaces, they do require quite a bit of technical knowledge once you get past the simplest examples.

In this post, we will go through some of the popular cloud-based web scraping providers and give you details about how they work, along with their pros, cons, and pricing, based on information publicly available on their websites.

Cloud-Based Web Scraping Platforms:

  1. Webscraper.io Cloud Scraper
  2. Scrapy Cloud
  3. Octoparse
  4. Parsehub
  5. Dexi.io
  6. Diffbot
  7. Import.io

Here is a comparison chart showing the important features of all the cloud-based web scraping platforms that we will go through in this post:

[Comparison chart: web scraping cloud providers]

Webscraper.io Cloud Scraper


Webscraper.io Cloud Scraper is an online platform where you can deploy scrapers built and tested using the free point-and-click Webscraper.io Chrome extension. Using the extension, you create “sitemaps” that describe how the data should be traversed and extracted. You can write the data directly to CouchDB or download it as a CSV file.
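To give a feel for what a “sitemap” is, here is a rough sketch of its structure expressed as a Python dict. The field names follow the JSON export format of the Chrome extension; the target site and selectors are made-up examples, so treat this as illustrative rather than a canonical sitemap:

```python
import json

# Approximate shape of a Webscraper.io "sitemap" (the extension imports and
# exports these as JSON). The URL and CSS selector below are invented examples.
sitemap = {
    "_id": "example-blog",
    "startUrl": ["https://example.com/blog"],
    "selectors": [
        {
            "id": "post-title",
            "type": "SelectorText",        # extract the text of matched elements
            "parentSelectors": ["_root"],  # attach directly under the start page
            "selector": "h2.post-title",
            "multiple": True,              # one record per matched element
        }
    ],
}

print(json.dumps(sitemap, indent=2))
```

The cloud service runs whatever the sitemap describes, so everything a scraper does has to be expressible in this selector tree.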

Data Export

CSV or CouchDB

Pricing

Based on the number of pages scraped. You purchase blocks of page credits and each page that your sitemap traverses will deduct one credit from your balance.

  • 100,000 page credits – $50
  • 250,000 page credits – $90
  • 500,000 page credits – $125
  • 1,000,000 page credits – $175
  • 2,000,000 page credits – $250

Pros

  • You can get started quickly, as the tool is as simple as it gets and has great tutorial videos
  • Supports JavaScript-heavy websites
  • The extension is open source, so you will not be locked in with the vendor if the service shuts down

Cons

  • Not ideal for large-scale scrapes, as it is based on a Chrome extension. Once the number of pages you need to scrape goes beyond a few thousand, scrapes can stall or fail.
  • No support for external proxies or IP rotation
  • Cannot fill forms or inputs

Scrapy Cloud

Scrapy Cloud is a hosted, cloud-based service by Scrapinghub, where you can deploy scrapers built using the Scrapy framework. Scrapy Cloud removes the need to set up and monitor servers, and provides a nice UI to manage spiders and review scraped items, logs, and stats.

Data Export

  • File Formats – CSV, JSON, XML
  • Scrapy Cloud API
  • Write to any database or location using ItemPipelines
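Because deployed spiders are plain Scrapy projects, writing to a database is just an item pipeline. Here is a minimal sketch using SQLite; the table name and the item fields (`title`, `url`) are illustrative, not part of any real project:

```python
import sqlite3

# Minimal Scrapy item pipeline that writes each scraped item to SQLite.
# The "items" table and the "title"/"url" fields are made-up examples.
class SQLitePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item.
        self.conn.execute(
            "INSERT INTO items (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

You would enable it in the project settings, e.g. `ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}` (the module path here is hypothetical), and the same pipeline runs unchanged on Scrapy Cloud or on your own machine.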

Pricing

Pricing depends mainly on the number of concurrent scrapers you can run and on data retention, and it goes up with add-ons like Crawlera (an intelligent proxy) and Splash (website rendering).

  • Free for 1 concurrent job (maximum runtime of one hour, after which it is auto-terminated), with 7-day data retention.
  • $9 per “capacity unit”/month – 1 additional concurrent job, unlimited job runtime, 120 days of data retention (vs 7 days in the free tier) and personalized support
  • $25/month (or more) for Splash if you need to render websites in a headless browser. This is usually required for Javascript-heavy websites.
  • $25/month (or more) for 150K requests made through Crawlera, an intelligent proxy that claims to be the world's smartest proxy network.

Even though the pricing looks simple, a typical crawl of 150K pages per month on a website that requires a headless browser and sees a moderate level of blocking would cost about $59 per month ($9 for one capacity unit + $25 for Splash + $25 for Crawlera) for a crawl spread out through the month.

Pros

  • The only cloud service that lets you deploy a scraper built using Scrapy – the most popular web scraping framework
  • Highly Customizable as it is Scrapy
  • Unlimited Pages Per Crawl (if you are not using Crawlera)
  • No vendor lock-in, as Scrapy is open source and you can deploy Scrapy spiders to the less featureful open source Scrapyd platform if you feel like switching
  • An array of useful add-ons that can improve the crawl
  • Useful for large scale scraping
  • A decent user interface which lets you see all sorts of logs a developer would need

Cons

  • No point-and-click utility: you still need to code scrapers
  • Large-scale crawls can get expensive as you move up to higher pricing tiers

Octoparse

Octoparse Cloud Service offers a cloud-based platform for users to run their extraction tasks built with the Octoparse Desktop App. 

Data Export

  • File Formats – CSV, HTML, XLS and JSON
  • Databases – MySQL, SQL Server, Oracle
  • Octoparse API

Pricing

Pricing is based on the number of jobs you can run simultaneously.

  • $89/month with 6 concurrent jobs, Automatic IP Rotation up to 100 crawlers.
  • $249/month for 20 concurrent jobs, an “advanced” API, better support and Automatic IP Rotation up to 250 crawlers

Pros

  • Point and Click Tool is simple to set up and use
  • No programming is required
  • Can run up to 10 scrapers on your local computer if you don't need much scalability
  • Supports Javascript-heavy websites
  • Includes Automatic IP Rotation in every plan

Cons

  • Vendor Lock-In – You will be locked into the Octoparse ecosystem, as the tool only lets you run scrapers in Octoparse Cloud or up to 10 crawlers on your local machine. You can't export your scrapers from Octoparse to any other platform or tool.
  • Local jobs are unreliable and very slow, even on a 20 Mbps network
  • API functionality is quite limited, according to Octoparse
  • No Linux/Mac support. Octoparse only works on Windows.

ParseHub


ParseHub lets you build web scrapers that crawl single or multiple websites, with support for JavaScript, AJAX, cookies, sessions, and redirects, using their desktop application, and then deploy them to their cloud service. ParseHub provides a free plan that gives you 200 pages of data in 40 minutes, 5 public projects, and limited support.

Data Export

  • File Formats – CSV, JSON
  • Integrates with Google Sheets and Tableau
  • Parsehub API
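The ParseHub API is a plain REST API keyed by an API key and a per-project token. As a hedged sketch (the endpoint path follows ParseHub's documented v2 API at the time of writing, and the token values are placeholders; verify against their current docs), fetching the data from a project's last completed run looks roughly like this:

```python
import urllib.parse

# Placeholders - substitute your own credentials from the ParseHub dashboard.
API_KEY = "YOUR_API_KEY"
PROJECT_TOKEN = "YOUR_PROJECT_TOKEN"

def last_data_url(project_token, api_key, fmt="json"):
    """Build the URL for the data of a project's last ready run (json or csv)."""
    base = (
        "https://www.parsehub.com/api/v2/projects/"
        f"{project_token}/last_ready_run/data"
    )
    return base + "?" + urllib.parse.urlencode({"api_key": api_key, "format": fmt})

url = last_data_url(PROJECT_TOKEN, API_KEY, fmt="csv")
# An HTTP GET on this URL (e.g. with urllib.request or requests) returns the
# scraped data in the requested format.
```

This is how the Google Sheets and Tableau integrations consume results as well: they poll for the latest ready run and pull its data.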

Pricing

Pricing is a bit confusing. In a nutshell, it is based on the number of pages crawled, the speed limit, the total number of scrapers you have, data retention, and support priority.

  • $149/month for a speed of 20 pages per minute, 10,000 pages per “run” of a scraper, and 14-day data retention with standard support. You can run a scraper multiple times over the day or month.
  • $499/month for a speed of 100 pages per minute, unlimited pages per “run” of a scraper, and 30-day data retention with priority support. You can run a scraper multiple times over the day or month.

Pros

  • Point and Click Tool is simple to set up and use
  • No programming skills are required
  • Supports JavaScript-heavy websites
  • Desktop application works in Windows, Mac, and Linux
  • Includes Automatic IP Rotation

Cons

  • Vendor Lock-In – You will be locked into the ParseHub ecosystem, as the tool only lets you run scrapers in their cloud. You can't export your scrapers to any other platform or tool.
  • Cannot write directly to any database

Dexi.io

Dexi.io is similar to Parsehub and Octoparse, except that it has a web-based point-and-click utility instead of a desktop-based tool. Like the others, it lets you develop, host, and schedule scrapers. Dexi has a concept of extractors and transformers interconnected using pipes, and can be described as an advanced yet “complicated” alternative to Yahoo Pipes.

Data Export

  • File Formats – CSV, JSON, XML
  • Can write to most databases through add-ons
  • Integrates with many cloud services
  • Dexi API

Pricing

Pricing is based on the number of concurrent jobs you can run and on access to external integrations.

  • $119/month for 1 concurrent job. This plan is pretty much unusable for larger jobs or if you need access to external storage or tools.
  • $399/month for 3 concurrent jobs, better support, and access to Pipes and other external services through integrations
  • $699/month for 6 concurrent jobs, priority “tech” support, and access to Pipes and other external services through integrations

Pros

  • Many Integrations including Storage, ETL and Visualisation tools
  • Web-based point and click utility

Cons

  • Vendor Lock-In – You will be locked into the Dexi ecosystem, as the tool only lets you run scrapers in their cloud platform. You cannot export your scrapers to any other platform.
  • Access to integrations comes at a high price
  • Setting up a scraper using the web-based UI is very slow and hard to work with for most websites
  • Higher learning curve

Diffbot


Diffbot lets you configure crawlers that index websites and then process the pages with its automatic APIs, which extract data from various types of web content. You can also write a custom extractor if the automatic extraction APIs don't work for the websites you need.

Data Export

  • File Formats – CSV, JSON, Excel
  • Cannot write directly to databases
  • Integrates with many cloud services through Zapier
  • Diffbot APIs
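For a sense of how the automatic APIs are consumed: each extraction is a single HTTP call against an endpoint like the v3 Article API, passing your token and the page URL. A minimal sketch (the token is a placeholder; each such request counts against the plan's monthly API-call quota):

```python
import urllib.parse

# Placeholder - substitute your real token from the Diffbot dashboard.
DIFFBOT_TOKEN = "YOUR_DIFFBOT_TOKEN"

def article_api_url(page_url, token=DIFFBOT_TOKEN):
    """Build the request URL asking Diffbot to extract an article page."""
    params = urllib.parse.urlencode({"token": token, "url": page_url})
    return "https://api.diffbot.com/v3/article?" + params

url = article_api_url("https://example.com/some-article")
# An HTTP GET on this URL returns JSON with the extracted fields
# (title, text, author, and so on) for the page.
```

The per-second rate limits in the pricing tiers below apply to calls like this one, which is why call volume and call speed are the two axes the plans are priced on.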

Pricing

Pricing is based on the number of API calls made to Diffbot, the speed of those calls, and data retention.

  • $299/month for 250K API calls at 5 calls per second, with 14-day data retention
  • $899/month for 1 Million API calls at 25 calls per second, with 30-day data retention
  • $3999/month for 5 Million API Calls at 50 calls per second, with 30-day data retention

Pros

  • Most websites do not usually need much setup, as the automatic APIs do a lot of the heavy lifting for you
  • The custom API creation tool is easy to use

Cons

  • Vendor Lock-In – You will be locked into the Diffbot ecosystem, as the tool only lets you run scrapers in their platform.
  • No IP rotation in the first two plans
  • Relatively expensive

Import.io


With Import.io you can clean, transform and visualize the data. It sits somewhere between Dexi.io, Octoparse, and Parsehub. You can build a scraper using a web-based point and click interface. Like Diffbot, import.io can handle most of the data extraction automatically. 

Data Export

  • File Formats – CSV, JSON, Google Sheets
  • Integrates with many cloud services
  • Import.io APIs (premium feature)

Pricing

Pricing is based on the number of pages crawled, support and access to more integrations and features.

  • $299/month for the most basic plan – 5,000 pages per month, with no integrations or reports.

The pricing is very confusing to understand; you will, in fact, have to sit down with a sales rep to figure it out.

Pros

  • A whole package – Extraction, transformations, and visualizations.
  • Has a lot of value-added services which some would find useful
  • Has a good point and click interface along with some automatic APIs to make the setup process painless

Cons

  • Vendor Lock-In – You will be locked into the Import.io ecosystem, as the tool only lets you run scrapers in their platform.
  • Most expensive of all the providers here (if we understood the pricing correctly)
  • Confusing pricing model

If you aren’t proficient with programming (visual or standard coding) or your needs are complex and you need large volumes of data to be scraped, there are great web scraping services that will suit your requirements to make the job easier for you.

You can save time and get clean, structured data by trying us – a “full-service” provider that doesn’t require the use of any tools and all you get is clean data without any hassles.

Need some professional help with scraping data? Let us know

Turn the Internet into meaningful, structured and usable data

Note: All features, prices etc are current at the time of writing this article. Please check the individual websites for current features and pricing.
