How to scrape Yelp.com Business Details using Python and LXML

This tutorial is a follow-up to How to scrape Yelp.com for Business Listings using Python. In this tutorial, we will show you how to extract data from the detail page of a business on Yelp.com. You can use the URLs of businesses you are interested in or the ones you got from part one of this tutorial. Let's create a Python script that downloads a restaurant page from Yelp.com and extracts details from it.

Here is the data that we are going to extract from the restaurant page:

  1. Website URL
  2. Ranking
  3. Working hours
  4. Category
  5. Phone Number
  6. Address
  7. Price Range
  8. Health Rating
  9. Claimed Status
  10. Ratings
  11. Additional Info

Below is a screenshot of the data that we will be extracting

You can scrape a lot more information from the business detail page if you wish, but we’ll stick to these for now.

Scraping Logic

  1. Download the HTML of the business detail page using Python Requests – Quite easy, once you have the URL. We use Python Requests to download the entire HTML of this page.
  2. Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  3. Save the data as JSON to a file. You might wonder why we’re using JSON here when we used CSV in the previous post. The data we scraped in part one of this tutorial has only rows and columns and fits well in the CSV format. This one has many more details, some of them nested, and is quite hard to fit into a CSV (unless you want to look at a CSV which has more than 20 columns). You can read more about choosing a data format for your project, if you are new to this.
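
The three steps above can be sketched as follows. This is a minimal illustration written for Python 3, not the script from the Gist: the sample HTML, its class names, the XPaths and the function names are all assumptions made up for this example, and Yelp's real markup is different and changes over time.

```python
import json

import lxml.html

# Hypothetical markup, for illustration only. It is NOT Yelp's real page structure.
SAMPLE_HTML = """
<html><body>
  <h1 class="biz-name">The Bird</h1>
  <span class="biz-phone"> (202) 518-3609 </span>
  <span class="category"><a>American (New)</a><a>Breakfast &amp; Brunch</a></span>
</body></html>
"""


def download(url):
    """Step 1: fetch the raw HTML of a page (needs the requests package)."""
    import requests  # imported here so the parsing steps can be tried offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def first_text(tree, xpath):
    """Return the first XPath text match, stripped, or "" if nothing matched."""
    matches = tree.xpath(xpath)
    return matches[0].strip() if matches else ""


def parse(html):
    """Step 2: build an element tree and pull fields out with XPaths."""
    tree = lxml.html.fromstring(html)
    return {
        "name": first_text(tree, '//h1[@class="biz-name"]/text()'),
        "phone": first_text(tree, '//span[@class="biz-phone"]/text()'),
        "category": ",".join(tree.xpath('//span[@class="category"]/a/text()')),
    }


def save(data, path):
    """Step 3: write the record to a JSON file."""
    with open(path, "w") as handle:
        json.dump(data, handle, indent=4)


record = parse(SAMPLE_HTML)
print(json.dumps(record, indent=4))
```

Returning an empty string when an XPath matches nothing keeps the scraper from crashing on pages that lack a field, which is one plausible reason a key such as health_rating can come back empty.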

You could connect this scraper to the previous scraper built in How to scrape business listings from Yelp.com using Python, and have an automated workflow that sends you emails or writes data to a database instead of a JSON file. We are not going to do that here, as it's beyond the scope of this simple tutorial.

Requirements

The requirements are pretty much the same as before, as we won’t be using any other complex tools here.

  • Python 2.7 ( https://www.python.org/downloads/ )
  • PIP to install the following packages in Python ( https://pip.pypa.io/en/stable/installing/ )
  • Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/ )
  • Python LXML, for parsing the HTML tree structure using XPaths ( Learn how to install that here – http://lxml.de/installation.html )

The Code

 

If you can’t see the embed above or if you would like to download the code, here is the link to it on GitHub Gist: https://gist.github.com/scrapehero/d8cf3d8b7039b8ba3dcde9b607cdc7de

Running the Scraper

Assuming you named your scraper yelp_business_details.py, run it from a command prompt or terminal with the -h flag to see how it is used:

python yelp_business_details.py -h
usage: yelp_business_details.py [-h] url

positional arguments:
url         yelp_business_details.py url

optional arguments:
-h, --help show this help message and exit
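
A usage message like the one above is what Python's argparse module prints for a script with a single positional argument. The following is a minimal sketch of such a parser; the function name build_parser and the help text are assumptions for illustration, not necessarily what the script in the Gist uses.

```python
import argparse


def build_parser():
    """Build a parser that accepts exactly one positional argument, the page URL."""
    parser = argparse.ArgumentParser(prog="yelp_business_details.py")
    parser.add_argument("url", help="URL of the Yelp business page to scrape")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("Scraping", args.url)
```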

For example, to scrape the restaurant The Bird in Washington, DC, whose URL is https://www.yelp.com/biz/the-bird-washington?osq=Restaurants, run:

python yelp_business_details.py "https://www.yelp.com/biz/the-bird-washington?osq=Restaurants"

The script will automatically create a file called scraped_data-the-bird-washington?osq=Restaurants.json containing the data scraped from Yelp.com.
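
The file name above carries the URL's query string, and characters such as ? are awkward in shells and invalid on some file systems. If you prefer a cleaner name, one option is to build it from the URL path only. The sketch below assumes Python 3 (urllib.parse); the helper output_filename is hypothetical and not part of the original script.

```python
from urllib.parse import urlsplit


def output_filename(url):
    """Derive an output file name from the last path segment of a business URL."""
    path = urlsplit(url).path  # the path only, without the query string
    slug = path.rstrip("/").split("/")[-1]
    return "scraped_data-%s.json" % slug


print(output_filename("https://www.yelp.com/biz/the-bird-washington?osq=Restaurants"))
```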

The output file would look similar to this:

 
{
    "info": [
        {
            "Takes Reservations": "Yes"
        }, 
        {
            "Delivery": "No"
        }, 
        {
            "Take-out": "Yes"
        }, 
        {
            "Accepts Credit Cards": "Yes"
        }, 
        {
            "Accepts Android Pay": "No"
        }, 
        {
            "Good For": "Dinner"
        }, 
        {
            "Parking": "Street"
        }, 
        {
            "Bike Parking": "Yes"
        }, 
        {
            "Wheelchair Accessible": "Yes"
        }, 
        {
            "Good for Kids": "No"
        }, 
        {
            "Good for Groups": "Yes"
        }, 
        {
            "Attire": "Casual"
        }, 
        {
            "Ambience": "Trendy"
        }, 
        {
            "Noise Level": "Average"
        }, 
        {
            "Alcohol": "Full Bar"
        }, 
        {
            "Outdoor Seating": "Yes"
        }, 
        {
            "Wi-Fi": "Free"
        }, 
        {
            "Has TV": "Yes"
        }, 
        {
            "Waiter Service": "Yes"
        }, 
        {
            "Caters": "Yes"
        }
    ], 
    "ratings": "4.5", 
    "website": "http://www.thebirddc.com", 
    "working_hours": [
        {
            "Mon": "4:00 pm - 10:30 pm\n        \n                Closed now"
        }, 
        {
            "Tue": "4:00 pm - 10:30 pm"
        }, 
        {
            "Wed": "4:00 pm - 10:30 pm"
        }, 
        {
            "Thu": "4:00 pm - 10:30 pm"
        }, 
        {
            "Fri": "4:00 pm - 11:30 pm"
        }, 
        {
            "Sat": "10:00 am - 11:30 pm"
        }, 
        {
            "Sun": "10:00 am - 10:30 pm"
        }
    ], 
    "name": "The Bird", 
    "claimed_status": "Claimed", 
    "url": "https://www.yelp.com/biz/the-bird-washington?osq=Restaurants", 
    "longitude": "-77.026685", 
    "reviews": "84 reviews", 
    "phone": "(202) 518-3609", 
    "address": "1337 11th St NW Washington, DC 20001 b/t N O St & N N St Shaw", 
    "latitude": "38.908420", 
    "ratings_histogram": [
        {
            "5 stars": "54"
        }, 
        {
            "4 stars": "22"
        }, 
        {
            "3 stars": "3"
        }, 
        {
            "2 stars": "3"
        }, 
        {
            "1 star": "2"
        }
    ], 
    "price_range": "$11-30", 
    "health_rating": "", 
    "category": "American (New),Breakfast & Brunch"
}
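
The output keeps "info", "working_hours" and "ratings_histogram" as lists of single-key objects, and the Monday entry carries leftover whitespace from the page. If a flatter structure is easier to work with, a small post-processing step along these lines could help. The helper names are hypothetical and not part of the original script.

```python
def clean_text(value):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(value.split())


def merge_pairs(pairs):
    """Turn a list of single-key dicts into one dict with cleaned values."""
    merged = {}
    for pair in pairs:
        for key, value in pair.items():
            merged[key] = clean_text(value)
    return merged


working_hours = [
    {"Mon": "4:00 pm - 10:30 pm\n        \n                Closed now"},
    {"Tue": "4:00 pm - 10:30 pm"},
]
print(merge_pairs(working_hours))
```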

You can extend this further by storing the scraped data in a database like MongoDB or MySQL instead of a JSON file.
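
As a rough illustration of that idea, the sketch below stores a scraped record in SQLite, which is swapped in here only because it ships with Python and needs no server; the same pattern applies to MongoDB or MySQL with their own drivers. The table layout and the function name save_business are assumptions for this example.

```python
import json
import sqlite3


def save_business(conn, record):
    """Insert or update one scraped record, keyed by its URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS businesses ("
        "url TEXT PRIMARY KEY, name TEXT, ratings REAL, details TEXT)"
    )
    ratings = float(record["ratings"]) if record.get("ratings") else None
    conn.execute(
        "INSERT OR REPLACE INTO businesses VALUES (?, ?, ?, ?)",
        # The full record is kept as a JSON string so nested fields are not lost.
        (record["url"], record.get("name"), ratings, json.dumps(record)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
save_business(conn, {
    "url": "https://www.yelp.com/biz/the-bird-washington?osq=Restaurants",
    "name": "The Bird",
    "ratings": "4.5",
})
print(conn.execute("SELECT name, ratings FROM businesses").fetchall())
```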

Known Limitations

This code should work for grabbing basic details from most business URLs. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
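
One common precaution when fetching many pages is to pause between requests and retry failures with a growing delay. The sketch below shows that idea only; it is an assumption on our part rather than a summary of the articles mentioned above, and the function name, parameters and default values are illustrative.

```python
import time


def fetch_politely(urls, fetch, delay=2.0, retries=3, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests and retrying failures.

    `fetch` is any callable that takes a URL and returns the page body,
    raising an exception on failure.
    """
    results = {}
    for url in urls:
        for attempt in range(1, retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == retries:
                    results[url] = None  # give up on this URL and keep going
                else:
                    sleep(delay * attempt)  # wait longer after each failure
        sleep(delay)  # fixed pause before moving on to the next page
    return results
```

Passing the download function in as `fetch` keeps the throttling logic independent of how pages are actually requested, so it can be tried with a stub before pointing it at a real site.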

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
