Web scraping is a critically important method for collecting data across the Internet.
Web scraping powers companies of all sizes every day, from the Fortune 500 to startups and small businesses.
Companies typically take one of the following approaches:
- Web Scraping Software or Data Extraction Software
- Web Scraping Services or Data Extraction Services
- Build your own script or software
Web Scraping Software
This approach has a minimal barrier to entry.
The two most common types of web scraping software are:
- Downloadable software: packages that you download from the Internet, install on your local computer and run locally
- Hosted software or Software as a Service: software that you sign up for online, use online and pay for based on usage
Getting started with either type is simple: go online, get out a credit card and purchase the software.
The catch is what comes after the purchase. You have to fit the paradigm and thought process of the software creator, and get used to arcane terminology describing the various elements, workflows, navigation, logic and processes that go into successfully scraping data from a website.
Let's take the example of a very popular web scraping software package (it will stay unnamed because it is not alone; every product has a similar problem).
It has a whole section devoted to video tutorials.
Who doesn’t like videos? We all watch them. The problem is that these are not entertainment videos. Instead, they talk about things like Agents, Collections, Jobs, XPaths and Lists (which seems like an important concept and no, it is not a grocery list).
Lost already? Well, you are not alone.
Most people give up after playing with the software for a little while, watching a few videos and making a quick scraper from an example.
Even fairly technical people have a tough time navigating the software and fitting its paradigm.
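To give a feel for just one of the terms above: an XPath is an expression that selects elements in a page by their position and attributes. The following is a minimal sketch, not taken from any particular product. The sample markup, the class names and the function `extract_products` are all made up for illustration, and it uses the limited XPath subset in Python's standard library on well-formed markup (real web pages are rarely this tidy).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration only).
PAGE = """
<html>
  <body>
    <ul id="products">
      <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
      <li class="product"><span class="name">Gadget</span><span class="price">19.50</span></li>
    </ul>
  </body>
</html>
"""


def extract_products(markup):
    """Return (name, price) pairs for every <li class="product"> element."""
    root = ET.fromstring(markup.strip())
    products = []
    # ".//li[@class='product']" is an XPath expression: any <li> element,
    # at any depth, whose class attribute equals "product".
    for item in root.findall(".//li[@class='product']"):
        name = item.find("span[@class='name']").text
        price = float(item.find("span[@class='price']").text)
        products.append((name, price))
    return products


if __name__ == "__main__":
    for name, price in extract_products(PAGE):
        print(name, price)
```

Even in this tiny case, the expressions have to be written to match the exact structure of the page, which is part of why the learning curve feels steep.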
Software as a Service (SaaS)
Here is an excerpt from a SaaS tutorial (the service will stay unnamed to avoid picking on it):
The best way to extract data that is spread out across many pages of a site is by building a Crawler. Based on your training, a Crawler travels to every page of that site looking for other pages that match. Crawlers are best used for when you want lots of data, but don’t know all the URLs for that site.
The key phrase to note here is “Based on your training”. What that means is that a lot of work will go into it.
The hours spent learning and training, and the agony and ecstasy when things finally do work (for a limited time), surely hold a thrill for some people.
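To make the Crawler idea from the tutorial excerpt more concrete, here is a rough sketch of what such a tool does under the hood. It is an illustration under stated assumptions, not how any particular product works: the site is a hypothetical in-memory map of page to links (`SITE`) rather than real HTTP requests, and the "training" is reduced to a single URL pattern passed to a made-up function `crawl`.

```python
import re
from collections import deque

# Hypothetical site: each page path maps to the links found on that page.
# A real crawler would download each page and parse its links instead.
SITE = {
    "/": ["/products/1", "/about", "/products/2"],
    "/about": ["/"],
    "/products/1": ["/products/2", "/products/3"],
    "/products/2": ["/"],
    "/products/3": [],
}


def crawl(site, start, pattern):
    """Visit every page reachable from `start`, breadth-first, and
    return the pages whose path fully matches `pattern`."""
    matcher = re.compile(pattern)
    seen = {start}          # pages already queued, so none is visited twice
    queue = deque([start])  # pages waiting to be visited
    matches = []
    while queue:
        url = queue.popleft()
        if matcher.fullmatch(url):
            matches.append(url)
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return matches


if __name__ == "__main__":
    print(crawl(SITE, "/", r"/products/\d+"))
```

Note that `/products/3` is found even though the start page never links to it directly, which is the point of a crawler: you do not need to know all the URLs up front. The hard part in practice is everything this sketch leaves out, such as describing which pages "match" and keeping that description working as the site changes.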
If you have the time, skills and determination, Web Scraping Software can work for you.
But if your business does not involve Web Scraping, it is better to save yourself the trouble and use a Web Scraping Service instead.
Web Scraping Service
We have covered this topic on this page, so please head over there to read through it.
Build your own software
We will cover this topic in more detail and link to it soon.