
How to scrape Tripadvisor.com Hotel Details using Python and LXML

This tutorial is a follow-up to How to scrape TripAdvisor.com for Hotels in a City using Python. Here we’ll look at scraping hotel details from a hotel URL. Let’s build a simple Python script that downloads a hotel detail page from Tripadvisor.com and extracts details from it.

We’ll extract these data from a hotel’s page:

  1. Name
  2. Address
  3. Rank
  4. Description
  5. Rating
  6. Rating Summary
  7. Total Number of Reviews
  8. Highlights
  9. Amenities
  10. Additional Info


There is a lot more information that would be interesting to extract from the hotel detail page of TripAdvisor. For the sake of simplicity, we’ll just stick to the ones above.

Scraping Logic

  1. Download the HTML of the hotel detail page using Python Requests – quite easy once you have the URL.
  2. Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  3. Save the data as JSON to a file.
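
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the code from the gist: the function names, the User-Agent value, and above all the XPath expressions are assumptions made for this example. They match only the small sample markup included in the block, and TripAdvisor's real markup will differ and change over time.

```python
import json

import requests
from lxml import html


def download_page(url):
    """Step 1: download the raw HTML of a hotel detail page."""
    # The User-Agent value is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def parse_hotel(page_source):
    """Step 2: pull fields out of the HTML tree with XPath."""
    tree = html.fromstring(page_source)
    # Hypothetical XPaths, written for the sample markup below only.
    raw_name = tree.xpath('//h1[@id="HEADING"]//text()')
    raw_rating = tree.xpath('//span[@class="rating"]/@content')
    return {
        "name": "".join(raw_name).strip() or None,
        "rating": raw_rating[0] if raw_rating else None,
    }


def save_json(data, path):
    """Step 3: write the extracted data to a JSON file."""
    with open(path, "w") as f:
        json.dump(data, f, indent=4)


# A tiny stand-in page so the parsing step can be tried offline.
SAMPLE_HTML = """
<html><body>
  <h1 id="HEADING"> Langham Place, New York, Fifth Avenue </h1>
  <span class="rating" content="4.5"></span>
</body></html>
"""

if __name__ == "__main__":
    print(parse_hotel(SAMPLE_HTML))
```

Returning None when an XPath matches nothing keeps the script from crashing on a page whose layout has changed, which is the most common way scrapers like this break.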

You could connect this scraper to the previous scraper built in How to scrape TripAdvisor.com for Hotels in a City using Python. We’ll leave that to you to figure out.

Requirements

The requirements are pretty much the same as before, as we won’t be using any other complex tools here.

  • Python 2.7 ( https://www.python.org/downloads/ )
  • PIP to install the following packages in Python ( https://pip.pypa.io/en/stable/installing/ )
  • Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/ )
  • Python LXML, for parsing the HTML tree structure using XPaths ( learn how to install it here – http://lxml.de/installation.html )

The Code

The code is self-explanatory as before.

If you can’t see the embed above, or if you want to download the code, here is the link to it on Gist: https://gist.github.com/scrapehero/ff1ddffb48c3bee89f5c7da4cf0c8786

Running the Scraper

Assuming you named your scraper tripadvisor_scraper_hotel.py, run it with the -h flag in a command prompt or terminal to see the usage help:

python tripadvisor_scraper_hotel.py -h
usage: tripadvisor_scraper_hotel.py [-h] url

positional arguments:
  url         Tripadvisor hotel url

optional arguments:
  -h, --help  show this help message and exit
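
The help text above is the kind of output Python's argparse module generates. A minimal sketch of a command-line interface that would produce it is shown here. The build_parser function name is an assumption for this example, and the gist's actual code may be organized differently. On Python 3.10 and later, the heading reads "options:" instead of "optional arguments:".

```python
import argparse


def build_parser():
    """Build a parser that takes one positional argument: the hotel URL."""
    parser = argparse.ArgumentParser(prog="tripadvisor_scraper_hotel.py")
    parser.add_argument("url", help="Tripadvisor hotel url")
    return parser


# Parsing an explicit list instead of sys.argv keeps the demo self-contained.
demo_url = (
    "https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-"
    "Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html"
)
args = build_parser().parse_args([demo_url])
print(args.url)
```

In a real script you would call parse_args() with no arguments so that it reads the URL from the command line.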

For example, take the hotel Langham Place, New York, Fifth Avenue, whose URL is https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html

python tripadvisor_scraper_hotel.py https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html

The script will automatically create a file called tripadvisor_hotel_scraped_data.json with the scraped data from TripAdvisor.

The output will look similar to this:

{
    "ratings": {
        "Excellent": "1,337", 
        "Very good": "191", 
        "Average": "66", 
        "Poor": "32", 
        "Terrible": "18"
    }, 
    "highlights": "Free High Speed Internet ( WiFi ),Restaurant,Fitness Center with Gym / Workout Room,Room Service,Spa", 
    "amenities": {
        "About the property": "Wheelchair access ,Non-Smoking Hotel ,Pets Allowed ( Dog / Pet Friendly ) ", 
        "Things to do": "Restaurant ,Fitness Center with Gym / Workout Room ,Spa ", 
        "Room types": "Suites ,Non-Smoking Rooms ,Family Rooms ", 
        "In your room": "Air Conditioning ,Minibar ", 
        "Internet": "Free Internet ,Free High Speed Internet ( WiFi ) ,Paid Wifi ", 
        "Services": "Room Service ,Children Activities (Kid / Family Friendly) ,Airport Transportation ,Dry Cleaning ,Meeting Rooms ,Business Center with Internet Access ,Laundry Service ,Concierge ,Multilingual Staff ,Conference Facilities ,Breakfast Available ,Banquet Room ,Babysitting "
    }, 
    "name": "Langham Place, New York, Fifth Avenue", 
    "address": {
        "country": "United States", 
        "zipcode": "10018-2753", 
        "street_address": "400 Fifth Avenue", 
        "locality": "New York City"
    }, 
    "official_description": "Gracefully complementing Manhattan's luxurious Fifth Avenue and boldly accenting New York City's famous skyline, Langham Place, New York, Fifth Avenue is a renewed celebration of sophistication and style. Soaring more than sixty stories above the city, Langham Place, Fifth Avenue represents the new heart of the world's most inspired metropolis. Accommodations exude the very essence of luxury and elegance. Duxiana beds, Pratesi linens, and floor-to-ceiling rain showers leave guests breathing easy, if not breathless.", 
    "review_count": "1954", 
    "rating": "4.5", 
    "additional_info": "Address: 400 Fifth Avenue, New York City, NY 10018-2753 (Formerly The Setai Fifth Avenue) Location: United States > New York > New York City > Midtown , Garment District , Manhattan Price Range: $485 - $1,212 (Based on Average Rates for a Standard Room) Hotel Class:5 star \u2014 Langham Place, New York, Fifth Avenue 5* Number of rooms: 214 Reservation Options: TripAdvisor is proud to partner with Hotels.com, Booking.com, DerbySoft Ltd Shanghai HQ Supplier Direct, Prestigia, Expedia and Agoda so you can book your Langham Place, New York, Fifth Avenue reservations with confidence. We help millions of travelers each month to find the perfect hotel for both vacation and business trips, always with the best discounts and special offers. Hotel Style: #3 Spa Hotel in New York City #8 Family Hotel in New York City #9 Business Hotel in New York City #14 Romantic Hotel in New York City #16 Luxury Hotel in New York City", 
    "rank": "#19 of 464 Hotels in New York City"
}
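
Note that the counts in this output are strings (some with thousands separators) and that lists such as the highlights are stored as a single comma-joined string. If you need typed values, a small post-processing step can help. The sketch below is an illustration only; the helper names to_int and split_list are assumptions and not part of the scraper.

```python
import json


def to_int(text):
    """Convert a count such as "1,337" to the integer 1337."""
    return int(text.replace(",", ""))


def split_list(text):
    """Split a comma-joined string into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


# A trimmed copy of the sample output shown above.
sample = json.loads("""
{
    "ratings": {"Excellent": "1,337", "Very good": "191"},
    "highlights": "Restaurant,Room Service,Spa",
    "review_count": "1954",
    "rating": "4.5"
}
""")

cleaned = {
    "ratings": {label: to_int(count) for label, count in sample["ratings"].items()},
    "highlights": split_list(sample["highlights"]),
    "review_count": to_int(sample["review_count"]),
    "rating": float(sample["rating"]),
}
print(cleaned)
```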

That’s it.

You can extend this further by saving the data to a database like MongoDB or MySQL (the JSON might need some flattening first).
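
One way to do the flattening mentioned above is to join nested keys with a separator so that each hotel becomes a single flat record. This is a minimal sketch; the flatten function and the underscore separator are choices made for this example.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts: {"address": {"country": "X"}} -> {"address_country": "X"}."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


hotel = {
    "name": "Langham Place, New York, Fifth Avenue",
    "rating": "4.5",
    "address": {"country": "United States", "locality": "New York City"},
}
print(flatten(hotel))
```

The flat keys can then serve as column names in a MySQL table. A document store like MongoDB can usually take the nested form as-is.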

Known Limitations

This code should work for grabbing basic details from most hotel URLs. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.

If you need some professional help with scraping complex websites like Tripadvisor, let us know by filling out the form below.



Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code. However, if you add your questions in the comments section, we may periodically address them.
