This is an open thread, and the goal is to solicit comments on what the best web scraping service might look like. Please go ahead and type away, and write down your ideas or requirements…
A web scraping provider is an essential vendor for most organizations, especially highly data-driven ones.
The early wave of outsourcing led to a significant body of knowledge on how to evaluate outsourcing providers or vendors. Google has about 1500 pages related to these two common phrases. However, there is barely any information on how to evaluate web scraping or data scraping vendors or service providers.
Web scraping service providers are similar to outsourcing providers only in a general sense, just as web scraping software is similar to software in general. That comparison, while valid, does not provide a good blueprint or checklist for evaluation.
We will help create an RFP (Request for Proposal) or RFI (Request for Information) template for web scraping services through the rest of this article.
Key criteria for evaluation
Scraping uses technology, hence it is critical to evaluate the technology.
Questions to ask
- Can the vendor even disclose the technology they use?
- Does the vendor copy and paste data?
- Is the vendor using a few scripts running off their laptop or a single server on the cloud?
- Is the solution scalable – can the vendor scrape 10 sites or 10,000 sites or 100,000,000 sites?
- How fast can the vendor scrape? 1 page per second or 5000 pages per second?
- How much does it cost as the scale goes up – does the cost per page go up proportionally or does it go down?
- Can the technology handle blocks such as captcha?
- Does the vendor use a framework like Scrapy, and are they completely dependent on a larger scraping service provider such as Scrapy Cloud?
- How does the vendor mitigate the risk associated with using a single scraping provider’s infrastructure?
- Does the vendor use a public, hybrid, or private cloud, or do they not use cloud computing at all?
- Is there a single point of failure in the vendor’s or their third party provider’s infrastructure? If so, how many and how critical are these points of failure?
- What kind of monitoring do they use for their infrastructure?
- How good is their Information Security process and controls?
- Can the vendor create a “private instance” for you in case you have stringent security or confidentiality needs for the service?
- What kind of technology does the vendor use to manage your projects? Is it just emails or a much more structured approach?
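To put the speed and cost questions above in perspective, here is a minimal Python sketch. The function names, the 18 million page workload, and the fixed-plus-variable cost model are all illustrative assumptions, not figures from any vendor.

```python
def crawl_duration_hours(total_pages: int, pages_per_second: float) -> float:
    """Hours needed to fetch total_pages at a sustained rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_second / 3600


def cost_per_page(total_pages: int, fixed_cost: float, variable_cost: float) -> float:
    """Cost per page under an assumed model: a fixed setup cost spread over
    all pages, plus a constant variable cost for each page."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return fixed_cost / total_pages + variable_cost


if __name__ == "__main__":
    pages = 18_000_000  # hypothetical workload
    for rate in (1, 5000):
        hours = crawl_duration_hours(pages, rate)
        print(f"{rate} pages/second: {hours:,.1f} hours")
    # Hypothetical prices: 500 setup, 0.001 per page.
    for volume in (1_000, 1_000_000):
        print(f"{volume:,} pages: {cost_per_page(volume, 500.0, 0.001):.4f} per page")
```

Under this assumed model the cost per page falls as volume grows, which is the behavior you would hope to hear about when asking the vendor how cost scales.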
Data Quality and QA process
Data quality is critical to any kind of data or web scraping service. If the quality is low and the data unreliable, it is worthless.
Questions to ask
- What is the vendor’s overall approach to data quality?
- Do they have automated data quality checks?
- How many checks do they perform on average on any given data set?
- How often do they create new checks?
- How often do they revisit and update these checks to ensure they are performing and adapting to changes?
- Do they also provide some level of human quality check (granted that terabytes or even megabytes of data cannot be realistically checked by humans) and if so what kind of checks are performed and when?
- Do they have a dedicated QA team?
- How big is the QA team?
- Is this team independent of the development team?
- What kind of technology, techniques, and algorithms are used in the QA process?
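As a rough idea of what "automated data quality checks" can mean in practice, here is a minimal Python sketch. The record layout (`url`, `title`, `price`), the sample data, and the three checks are assumptions made for illustration; a real vendor's checks will differ and will likely be far more extensive.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def find_missing_fields(records: List[Record], required: List[str]) -> List[Tuple[int, str]]:
    """Return (row index, field name) for every empty or absent required field."""
    problems = []
    for index, record in enumerate(records):
        for field in required:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                problems.append((index, field))
    return problems


def find_duplicates(records: List[Record], key: str) -> List[int]:
    """Return indexes of rows whose key value was already seen in an earlier row."""
    seen = set()
    duplicates = []
    for index, record in enumerate(records):
        value = record.get(key)
        if value is None:
            continue  # missing keys are reported by find_missing_fields
        if value in seen:
            duplicates.append(index)
        else:
            seen.add(value)
    return duplicates


def find_invalid_prices(records: List[Record], field: str = "price") -> List[int]:
    """Return indexes of rows whose price is not a positive number."""
    invalid = []
    for index, record in enumerate(records):
        try:
            if not float(record.get(field)) > 0:
                invalid.append(index)
        except (TypeError, ValueError):
            invalid.append(index)
    return invalid


# Hypothetical scraped rows used only to exercise the checks.
SAMPLE = [
    {"url": "https://example.com/a", "title": "Widget", "price": "9.99"},
    {"url": "https://example.com/b", "title": "", "price": "12.50"},
    {"url": "https://example.com/a", "title": "Widget", "price": "9.99"},
    {"url": "https://example.com/c", "title": "Gadget", "price": "N/A"},
]

if __name__ == "__main__":
    print(find_missing_fields(SAMPLE, ["url", "title", "price"]))
    print(find_duplicates(SAMPLE, "url"))
    print(find_invalid_prices(SAMPLE))
```

Checks like these are cheap to run on every delivery, which is why it is worth asking how many the vendor runs and how often they are updated.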
Finally, the vendor’s organization, people, and location are critical components that tie together the rest of the evaluation.
Questions to ask
- Where is the vendor located?
- How highly does the vendor value your privacy?
- Do they disclose customer names publicly or on phone calls and emails?
- Do you have legal recourse against the vendor for non-performance based on the location?
- Where are their locations and their operational timezones globally?
- Is the team comprised of full-time employees or subcontractors?
- If they use subcontractors, are they hired over the Internet or do they physically operate from the vendor’s location?
- What kind of background checks do they conduct on their team?
- What are the educational qualifications of the team?
- Are the teams organized into project-based teams with adequate managerial oversight?
- How good are their communication skills – verbal and written?
- Do individual teams believe in a high level of quality, and how important is that to each team member?
- Do they provide dedicated or shared account management who will interface with you?
- How good and quick is their responsiveness to your questions, emails or calls?
- Do they provide a phone-based interface to the team or is it just email or Skype?
- How much communication is “lost in translation”?
- How good is their understanding of the “context” of the data that they gather?
- Do they have subject matter experts in your industry, that know something about the data being gathered?
The questions above should provide a good starting point to help you evaluate the vendor. It is important not to just copy and paste these questions, but instead to “make them your own” based on your needs and criteria.
For example, Information Security may not be applicable to most organizations because the data gathered is public data.
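One way to "make them your own" is to weight each evaluation area by how much it matters to you and score each vendor against it. The sketch below is only an illustration: the criterion names, the weights, the 0 to 5 scale, and the sample scores are all assumptions.

```python
from typing import Dict


def weighted_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Weighted average of vendor scores; a weight of 0 drops a criterion."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score given for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical weights: information security is set to 0 because,
# in this example, only public data is being gathered.
WEIGHTS = {
    "technology": 3,
    "data_quality": 4,
    "organization": 2,
    "information_security": 0,
}

# Hypothetical scores for one vendor on a 0 to 5 scale.
VENDOR_A = {
    "technology": 4,
    "data_quality": 5,
    "organization": 3,
    "information_security": 1,
}

if __name__ == "__main__":
    print(f"Vendor A: {weighted_score(WEIGHTS, VENDOR_A):.2f} out of 5")
```

Changing the weights is the quickest way to adapt the same question list to a different organization's priorities.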