This is an open thread, and the goal is to solicit comments on what the best web scraping service may look like. Please go ahead and type away, writing down your ideas or requirements…
A web scraping provider is an essential partner to most organizations, especially highly data-driven ones. Web scraping service providers are broadly similar to other outsourcing providers, just as web scraping software is similar to other software. However, many considerations are specific to web scraping service providers, and this article outlines them.
The rest of this article will help you create an RFP (Request for Proposal) or RFI (Request for Information) template for web scraping services. You can also use this list to evaluate RFP responses.
Scraping relies on technology, so it is critical to evaluate the technology a vendor uses.
Questions to ask
- Can the vendor even disclose the technology they use?
- Does the vendor copy and paste data?
- Is the vendor using a few scripts running off their laptop or a single server on the cloud?
- Is the solution scalable – can the vendor scrape 10 sites or 10,000 sites or 100,000,000 sites?
- How fast can the vendor scrape? 1 page per second or 5000 pages per second?
- How much does it cost as the scale goes up – does the cost per page go up proportionally or does it go down?
- Can the technology handle blocks such as captcha?
- Do they use headless browsers or scripts?
- Does the vendor use a framework like Scrapy, and are they completely dependent on a larger scraping service provider such as Scrapy Cloud?
- Do they use certain 3rd party frameworks or packages and are those maintained or obsolete?
- What kind of proxy providers do they use or do they use their own networking?
- Do they have access to residential IP addresses?
- Can they gather data from various global locations for geo location specific data gathering?
- How does the vendor mitigate the risk associated with using a single scraping provider’s infrastructure?
- Does the vendor use the public/hybrid or private cloud or do they not use cloud computing at all?
- Is there a single point of failure in the vendor’s or their third party provider’s infrastructure? If so, how many are there and how critical are they?
- What kind of monitoring do they use for their infrastructure?
- How good is their Information Security process and controls?
- How do they handle Compliance?
- Are their data centers certified to a standard – ISO 2700x, SOC/SSAE 16 etc?
- Can the vendor create a “private instance” for you in case you have stringent security or confidentiality needs for the service?
- What kind of technology does the vendor use to manage your projects? Is it just emails or a much more structured approach?
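The speed and cost questions above are easier to compare across vendors once they are turned into numbers. The following is a minimal sketch, not any vendor's actual pricing model: the function names and the fixed-plus-variable price structure are assumptions made for illustration. It estimates how long a crawl takes at a quoted sustained rate and how the effective cost per page changes with volume.

```python
def crawl_hours(total_pages: int, pages_per_second: float) -> float:
    """Hours needed to fetch total_pages at a sustained rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_second / 3600


def cost_per_page(total_pages: int, fixed_fee: float, fee_per_page: float) -> float:
    """Effective per-page cost under a hypothetical fixed-plus-variable price."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    # The fixed fee is spread over more pages as volume grows.
    return fixed_fee / total_pages + fee_per_page


if __name__ == "__main__":
    # Compare the two quoted speeds from the list above for 18 million pages.
    for rate in (1, 5000):
        print(f"{rate} pages/s: {crawl_hours(18_000_000, rate):.1f} hours")
    # Effective cost per page at two volumes (made-up fees).
    for pages in (10_000, 10_000_000):
        print(f"{pages} pages: {cost_per_page(pages, 500.0, 0.001):.4f} per page")
```

Under this assumed price structure the cost per page falls as volume rises. A vendor whose per-page cost stays flat or rises with scale is worth questioning further.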
Data Quality and QA process
Data quality is critical to any kind of data or web scraping service. If the quality is low and the data unreliable, it is worthless.
Questions to ask
- What is the vendor’s overall approach to data quality?
- Do they have automated data quality checks?
- How many checks do they perform on average on any given data set?
- How often do they create new checks?
- How often do they revisit and update these checks to ensure they are performing and adapting to changes?
- Do they also provide some level of human quality check (granted that terabytes or even megabytes of data cannot be realistically checked by humans) and if so what kind of checks are performed and when?
- Do they have a dedicated QA team?
- How big is the QA team?
- Is this team independent of the development team?
- What kind of technology, techniques, and algorithms are used in the QA process?
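To make "automated data quality checks" concrete when discussing them with a vendor, here is a minimal sketch of what such checks might look like. The record schema (`url`, `title`, `price`) and the function name `check_records` are assumptions made for illustration, and a real QA process would involve many more checks than these.

```python
from collections import Counter

REQUIRED_FIELDS = ("url", "title", "price")  # hypothetical schema


def _is_blank(value) -> bool:
    """True if a field is missing or contains only whitespace."""
    return value is None or str(value).strip() == ""


def check_records(records):
    """Run simple automated checks and return a dict of issue counts."""
    issues = Counter()
    seen_urls = set()
    for record in records:
        # Completeness: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if _is_blank(record.get(field)):
                issues[f"missing_{field}"] += 1
        # Uniqueness: the same URL should not appear twice.
        url = record.get("url")
        if not _is_blank(url):
            if url in seen_urls:
                issues["duplicate_url"] += 1
            seen_urls.add(url)
        # Validity: a price must parse as a positive number.
        price = record.get("price")
        if not _is_blank(price):
            try:
                if float(price) <= 0:
                    issues["non_positive_price"] += 1
            except (TypeError, ValueError):
                issues["unparseable_price"] += 1
    return dict(issues)
```

Tracking these counts for every delivery gives you a simple baseline, and a sudden jump in any of them often means the source site has changed its layout.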
Finally, the organization, the people, the processes, and the location of the vendor are critical components that tie together the rest of the evaluation.
Questions to ask
- Where is the vendor located?
- How highly does the vendor value your privacy?
- Do they disclose customer names publicly or on phone calls and emails?
- Do you have legal recourse against the vendor for non-performance based on the location?
- Where are their locations and their operational timezones globally?
- Is the team comprised of full-time employees or subcontractors?
- If they use subcontractors, are they hired over the Internet or do they physically operate from the vendor’s location?
- What kind of background checks do they conduct on their team?
- What are the educational qualifications of the team?
- Are the teams organized into project-based teams with adequate managerial oversight?
- How good are their communication skills – verbal and written?
- Do individual teams believe in a high level of quality, and how important is that to each team member?
- Do they provide a dedicated or shared account manager who will interface with you?
- How good and quick is their responsiveness to your questions, emails or calls?
- Do they provide a phone-based interface to the team or is it just email or Skype?
- How much communication is “lost in translation”?
- How good is their understanding of the “context” of the data that they gather?
- Do they have subject matter experts in your industry who know something about the data being gathered?
- Do they have sufficient legal knowledge and checks in place to perform the data gathering without running into legal issues?
The questions above should provide a good starting point to help you evaluate the vendor. It is important not to just copy and paste these questions, but instead to “make them your own” based on your needs and criteria.
For example, information security, compliance, and certifications may not be applicable to most organizations because the data gathered is public data.
This article was initially published in 2017 and updated in 2022 to include recent developments in this space.