The best data and file formats for scraped data

The data we provide arrives from the source in various forms and is largely text (barring rich media such as images and videos, or non-text file formats such as PDFs). Our customers need this data in various formats. The key to a successful, scalable solution that works for both web scraping and our customers is to define the output format up front and to use standard data-sharing formats.

Common Data Formats for Web Scraping

CSV: Comma-Separated Values (CSV) is the most common format. Most people know how it works, and it can be viewed in many products, most notably Microsoft Excel.

JSON: JavaScript Object Notation is a lightweight data-interchange format. According to json.org, it is easy for humans to read and write, and easy for machines to parse and generate.

XML: Extensible Markup Language is another flexible format that can be used to define and transfer data between computers.

SQL: Structured Query Language isn’t really a data format. SQL output is specific to a particular database and its schema or structure.

What is a good format?

The most universal and flexible format for our business as a Data as a Service provider is JSON, even though CSV may be more widely accepted.

Why not CSV?

CSV works well for data that is structured in 2 dimensions (rows and columns), but a lot of the data we encounter has multiple dimensions and doesn’t lend itself well to a 2-dimensional spreadsheet format. If the data is 2-dimensional, we encourage the CSV format because most databases can easily import it. However, when the data is multi-dimensional or semi-structured (i.e. some items have some fields and others have different fields), CSV becomes a poor fit.

Let’s say each merchant’s data includes the products they sell, and one merchant has 1 product while another has 10. It is hard to fit this data into a CSV format, especially if you don’t know how many products the largest merchant could have.

Do you create a column for each product? How many columns do you create: 10, 100, 100,000? That is the problem with using the CSV format for such data.

Another example is a data record for a person who has multiple emails or phone numbers: some may have one, while others may have 5 or more of each.

CSV is not flexible enough to cater to variations in the number of columns from row to row.
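The merchant example above can be sketched in a few lines of Python. This is only an illustration, using the standard json and csv modules; the merchant names, product names, and the product_1, product_2, ... column naming are made up for the example.

```python
import csv
import io
import json

# Hypothetical merchant records: one merchant has 1 product, another has 3.
merchants = [
    {"merchant": "Acme", "products": ["Anvil"]},
    {"merchant": "Globex", "products": ["Widget", "Gadget", "Gizmo"]},
]

# JSON keeps each merchant's product list at its natural length.
as_json = json.dumps(merchants, indent=2)

# CSV needs a fixed set of columns, so we must find the widest record first
# and pad every shorter row with empty cells.
max_products = max(len(m["products"]) for m in merchants)
header = ["merchant"] + [f"product_{i + 1}" for i in range(max_products)]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
for m in merchants:
    padding = [""] * (max_products - len(m["products"]))
    writer.writerow([m["merchant"]] + m["products"] + padding)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Note that the CSV header can only be written after every record has been inspected, and it has to change as soon as a merchant with more products shows up. The JSON output needs no such adjustment.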

Why not SQL?

SQL isn’t really a data format. It is a language (Structured Query Language) to work with databases.

While SQL can be used to import data into relational databases, the statements are completely dependent on the schema (database and table structure) used by the database. The name of the table, the names of the fields, and the data types of the fields are all specific to a particular instance of the database. Hence there is no universal SQL format that fits all databases in the way JSON does.

We can provide SQL based on a particular schema for an additional cost, but it also requires constant maintenance in case the schema changes.

As a result, we discourage the use of SQL as a data format.
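To make the schema dependency concrete, here is a small sketch using Python's built-in sqlite3 module. The two table layouts (merchants and stores) and the record are hypothetical; they stand in for two customers who store the same information under different names.

```python
import sqlite3

# Hypothetical record as it might come out of a scrape.
record = {"name": "Acme", "city": "New York"}

conn = sqlite3.connect(":memory:")

# Two hypothetical customer schemas holding the same information.
conn.execute("CREATE TABLE merchants (name TEXT, city TEXT)")
conn.execute("CREATE TABLE stores (store_name TEXT, location TEXT)")

# Each schema needs its own statement: table and column names differ.
conn.execute(
    "INSERT INTO merchants (name, city) VALUES (?, ?)",
    (record["name"], record["city"]),
)
conn.execute(
    "INSERT INTO stores (store_name, location) VALUES (?, ?)",
    (record["name"], record["city"]),
)

# Reusing the first schema's column names against the second table fails.
schema_error = None
try:
    conn.execute(
        "INSERT INTO stores (name, city) VALUES (?, ?)",
        (record["name"], record["city"]),
    )
except sqlite3.OperationalError as exc:
    schema_error = str(exc)

rows_a = conn.execute("SELECT name, city FROM merchants").fetchall()
rows_b = conn.execute("SELECT store_name, location FROM stores").fetchall()
print(rows_a, rows_b, schema_error)
```

The same record ends up in both tables, but only through statements written for each schema, which is why SQL output has to be maintained whenever a schema changes.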

How do I work with JSON?

JSON is a very flexible format that doesn’t add to the size of the data as much as XML. It is easy to read and use. It includes both the field names and the values (data) that go into the field.

It can handle multi-dimensional and semi-structured data with ease and you can add or remove any fields with ease.

JSON is also the de-facto format for handling data in APIs. Inputs to APIs are best provided in JSON and the data returned is also handled well in the JSON format.

Most databases and languages have support for or have readily available libraries for importing and exporting JSON. A quick Google search of JSON + <your favorite database name> will ease the fear of people who are used to CSV format.

Default data formats provided by ScrapeHero

We provide CSV and JSON as the default data formats for web scraping. They are included in our pricing because they can be used by anyone. Other formats require a lot of iterations and have dependencies, so we usually charge extra for them.

We can also provide XML data on request and for an extra charge.

JSON Sample

Here is what the JSON format looks like. It is the best format for scraped data that has multiple dimensions.

 

{
     "firstName": "John",
     "lastName": "Smith",
     "age": 34,
     "address":
     {
         "streetAddress": "45 5th Avenue",
         "city": "New York",
         "state": "NY",
         "postalCode": "10021"
     },
     "phoneNumber":
     [
         {
           "type": "home",
           "number": "212 555-1212"
         },
         {
           "type": "fax",
           "number": "646 555-4567"
         }
     ]
 }
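As a rough illustration of working with a record like this, the sketch below loads the sample with Python's standard json module, reads a nested field, and walks the variable-length phone number list. The email field added at the end is hypothetical and is not part of the sample.

```python
import json

# The sample record from above, as a JSON string.
sample = """
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 34,
  "address": {
    "streetAddress": "45 5th Avenue",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1212"},
    {"type": "fax", "number": "646 555-4567"}
  ]
}
"""

person = json.loads(sample)

# Nested objects become dictionaries; arrays become lists of any length.
city = person["address"]["city"]
numbers = {entry["type"]: entry["number"] for entry in person["phoneNumber"]}

# Adding a field (hypothetical here) does not disturb the existing ones.
person["email"] = ["john.smith@example.com"]
round_trip = json.loads(json.dumps(person))

print(city)
print(numbers)
```

The phone number list can hold one entry or twenty without any change to the code that reads it.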

 

What about Excel – XLS or XLSX files?

Excel files are not only data files but also contain a lot of extra information such as formatting (highlights, colors etc), graphs, charts, formulae, pivot tables, embedded pictures, references to other sheets etc.

XLS is a binary format specific to Microsoft, and the newer XLSX is a zipped, XML-based format that is still built around Excel.

The CSV files we provide can be opened directly in Microsoft Excel, so there is no compelling reason for us to provide Excel files. You can open the CSV file in Excel by double-clicking it and then save it as an Excel file with all the formatting you desire.
