The best data and file formats for scraped data

The data we provide comes in various forms from the source and is largely text (barring rich media such as images and videos or proprietary file formats such as PDFs).

Our customers need this data in various formats. The key to a scalable solution that works for both our customers and us is to agree on the format up front and to use standard data-sharing formats.

Common Data Formats

CSV: The most common format is Comma-Separated Values (CSV). Most people know how it works, and it is easy to view in many products, especially Microsoft Excel.

JSON: JavaScript Object Notation is a lightweight data-interchange format. According to json.org, it is easy for humans to read and write and easy for machines to parse and generate.

XML: Extensible Markup Language is another flexible format that can be used to define and transfer data between computers.

SQL: Structured Query Language isn’t really a data format; SQL output is specific to a particular database and its schema or structure.

What is a good format?

For our business as a Data as a Service provider, the most universal and flexible format is JSON, even though CSV is more widely accepted.

Why not CSV?

CSV works well for data that is structured in 2 dimensions (rows and columns), but a lot of the data we encounter has multiple dimensions and doesn’t lend itself well to a 2-dimensional spreadsheet format. If the data is 2-dimensional, we encourage the CSV format because most databases can easily import it. However, when the data is multi-dimensional or semi-structured (i.e. some items have some fields and others have different fields), CSV is a poor fit.

Let’s say each merchant’s data includes the products they sell, and one merchant has 1 product while another has 10. It is hard to fit this data into a CSV format, especially if you don’t know how many products the largest merchant could have.

Do you create a column for each product? How many columns do you create: 10, 100, 100,000? That is the problem with using the CSV format for such data.
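To make the problem concrete, here is a minimal Python sketch. The merchant names and product lists are invented for illustration and are not real customer data. It shows the same records as nested JSON and as two possible CSV layouts: one column per product, where the header width depends on the largest merchant, and one row per merchant and product pair, where the merchant is repeated on every row.

```python
import csv
import io
import json

# Hypothetical merchants; names and products are invented for illustration.
merchants = [
    {"merchant": "Acme", "products": ["anvil"]},
    {"merchant": "Globex", "products": ["widget", "gadget", "gizmo"]},
]

# JSON keeps each merchant's product list intact, whatever its length.
as_json = json.dumps(merchants, indent=2)


def wide_csv(records):
    """One column per product: the header width depends on the largest merchant."""
    width = max(len(r["products"]) for r in records)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["merchant"] + [f"product_{i + 1}" for i in range(width)])
    for r in records:
        padding = [""] * (width - len(r["products"]))
        writer.writerow([r["merchant"]] + r["products"] + padding)
    return out.getvalue()


def long_csv(records):
    """One row per (merchant, product) pair: fixed columns, repeated merchant."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["merchant", "product"])
    for r in records:
        for product in r["products"]:
            writer.writerow([r["merchant"], product])
    return out.getvalue()


print(as_json)
print(wide_csv(merchants))
print(long_csv(merchants))
```

In the wide layout, a new merchant with more products than any before it forces a header change for everyone. The long layout avoids that but duplicates merchant data on every row, and it gets harder to manage once a merchant has more than one nested list.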

Why not SQL?

SQL isn’t really a data format; it is a language for working with databases. While SQL can be used to import data into relational databases, the statements depend completely on the schema used by the destination. The name of the table, the names of the fields and the data types of the fields are all specific to a particular database instance. Hence SQL offers no universal, one-size-fits-all format the way JSON does.

We can provide SQL based on a particular schema for an additional cost, but it also requires maintenance in case the schema changes. As a result, we discourage the use of SQL as a data format.
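As a rough illustration of that schema dependence, the Python sketch below loads one JSON record into an in-memory SQLite database. The table name `people` and its column names are assumptions made up for this example; a different destination database would need different statements for the same data.

```python
import json
import sqlite3

record = json.loads('{"firstName": "John", "lastName": "Smith", "age": 34}')

# Hypothetical destination schema: the table and column names below are
# assumptions; another customer's database would need different SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, age INTEGER)")

# The mapping from JSON keys to column names has to be maintained by hand
# and revisited whenever the destination schema changes.
conn.execute(
    "INSERT INTO people (first_name, last_name, age) VALUES (?, ?, ?)",
    (record["firstName"], record["lastName"], record["age"]),
)
conn.commit()

rows = conn.execute("SELECT first_name, last_name, age FROM people").fetchall()
print(rows)
```

The JSON record is the same for every customer; only the hand-written mapping in the INSERT statement ties it to one particular schema.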

How do I work with JSON?

JSON is a very flexible format which doesn’t add to the size of the data as much as XML. It is easy to read and use.

It can handle multi-dimensional and semi-structured data with ease.

JSON is also the de facto format for handling data in APIs. Inputs to APIs are commonly provided as JSON, and the data returned is handled well in the JSON format too.

Most databases and languages either support JSON directly or have readily available libraries for importing and exporting it. A quick Google search for JSON + <your favorite database name> should ease the fears of people who are used to the CSV format.
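As one example of that library support, Python ships a `json` module in its standard library. The sketch below uses two made-up scraped items, the second of which is missing a field, to show parsing, reading semi-structured data safely, and converting back to JSON text.

```python
import json

# Two hypothetical scraped items; the second one lacks a "price" field.
raw = '[{"name": "Widget", "price": 9.99}, {"name": "Gadget"}]'

items = json.loads(raw)  # JSON text -> Python lists and dicts

# dict.get returns None for missing keys, which suits semi-structured data.
prices = [item.get("price") for item in items]

# Python objects -> JSON text, e.g. before writing to a file or an API.
round_trip = json.dumps(items)

print(prices)
print(round_trip)
```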

Default data formats provided by ScrapeHero

We provide CSV and JSON as default data formats that are included in our pricing because they can be used by anyone. Other formats require a lot of iterations and have dependencies, so we usually charge extra for them.

We can also provide XML data on request and for an extra charge.

JSON Sample

Here is what the JSON format looks like. It is the best format for scraped data that has multiple dimensions.

 

{
     "firstName": "John",
     "lastName": "Smith",
     "age": 34,
     "address":
     {
         "streetAddress": "45 5th Avenue",
         "city": "New York",
         "state": "NY",
         "postalCode": "10021"
     },
     "phoneNumber":
     [
         {
           "type": "home",
           "number": "212 555-1212"
         },
         {
           "type": "fax",
           "number": "646 555-4567"
         }
     ]
 }
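A short Python sketch of reading this sample follows, using only the standard `json` module. The variable names are ours; the data is the sample above, embedded as a string so the example is self-contained.

```python
import json

sample = """
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 34,
    "address": {
        "streetAddress": "45 5th Avenue",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {"type": "home", "number": "212 555-1212"},
        {"type": "fax", "number": "646 555-4567"}
    ]
}
"""

person = json.loads(sample)

# Nested objects become dictionaries, so fields are reached key by key.
city = person["address"]["city"]

# Arrays become lists; here the phone numbers are indexed by their type.
numbers_by_type = {p["type"]: p["number"] for p in person["phoneNumber"]}

print(city)
print(numbers_by_type["fax"])
```

The address is a second dimension and the list of phone numbers is a third, yet both are read without deciding in advance how many phone numbers a person may have.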

 

What about Excel (XLS or XLSX files)?

Excel files are not only data files but also contain a lot of extra information such as formatting (highlights, colors etc), graphs, charts, formulae, pivot tables, embedded pictures, references to other sheets etc.

XLS is a binary format specific to Microsoft. XLSX is a zipped package of XML files that still needs Excel-aware software to read.

The CSV files we provide can be opened directly in Microsoft Excel, so there is no compelling reason for us to provide Excel files. You can open the CSV file in Excel by double-clicking it and then save it as an Excel file with all the formatting you desire.
