How to scrape Tripadvisor Hotel Details using Python and LXML

This tutorial is a follow-up to How to scrape TripAdvisor.com for Hotels in a City using Python. Here we will scrape hotel data from a single hotel URL. Let’s build a simple Python script that downloads a hotel detail page from Tripadvisor.com and extracts details from it.

We’ll extract these data from a hotel’s page:

  1. Name
  2. Address
  3. Rank
  4. Description
  5. Rating
  6. Rating Summary
  7. Total Number of Reviews
  8. Highlights
  9. Amenities
  10. Additional Info

 


 

There is a lot more information that would be interesting to extract from the hotel detail page of TripAdvisor. For the sake of simplicity, we’ll just stick to the ones above.

Scraping Logic

  1. Download the HTML of the hotel detail page using Python Requests. This is easy once you have the URL: Requests downloads the entire HTML of the page.
  2. Parse the page using LXML. LXML lets you navigate the HTML tree structure using XPaths. The XPaths for the details we need are predefined in the code.
  3. Save the data as JSON to a file.
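
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's script: the XPaths and the sample HTML are hypothetical and were written for this example only, so they will not match TripAdvisor's real markup. The download step needs the requests package and network access, so it is defined but not called here.

```python
import json

from lxml import html

# Hypothetical XPaths written for the sample markup below. They are NOT
# the selectors used in the tutorial's script or on TripAdvisor's pages.
XPATHS = {
    "name": "//h1[@id='HEADING']/text()",
    "street_address": "//span[@class='street-address']/text()",
    "review_count": "//span[@class='reviewCount']/text()",
}

# Made-up stand-in for a downloaded hotel page.
SAMPLE_HTML = """
<html><body>
  <h1 id="HEADING"> The Langham New York Fifth Avenue </h1>
  <span class="street-address">400 Fifth Avenue</span>
  <span class="reviewCount">2,574 reviews</span>
</body></html>
"""


def download(url):
    """Step 1: fetch the raw HTML of a hotel page."""
    import requests  # imported here so steps 2 and 3 can be tried offline

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.text


def first_text(tree, xpath):
    """Return the first stripped text match for an XPath, or None."""
    matches = tree.xpath(xpath)
    return matches[0].strip() if matches else None


def parse(page_source):
    """Step 2: build an element tree and extract each field by XPath."""
    tree = html.fromstring(page_source)
    data = {field: first_text(tree, xpath) for field, xpath in XPATHS.items()}
    if data["review_count"]:
        # Keep only the digits, e.g. "2,574 reviews" becomes 2574.
        digits = "".join(ch for ch in data["review_count"] if ch.isdigit())
        data["review_count"] = int(digits) if digits else None
    return data


def save(data, path):
    """Step 3: write the extracted fields to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)


if __name__ == "__main__":
    print(json.dumps(parse(SAMPLE_HTML), indent=4))
```

Returning None for a missing field, rather than raising, keeps one changed selector from discarding the rest of the record.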

You could connect this scraper to the previous scraper built in How to scrape TripAdvisor.com for Hotels in a City using Python. We’ll leave that to you to figure out.

Requirements

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.

Install Python 3 and Pip

Linux users can follow this guide to install Python 3: http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac users can follow this guide: http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows users can follow this guide: https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Install Packages

We need two packages: Python Requests, to download the HTML of the page, and LXML, to parse the HTML tree structure using XPaths. Both can be installed with pip:

pip3 install requests lxml

The Code

The code is self-explanatory as before.

https://gist.github.com/scrapehero/1b26ad7fd8db1023defa1f4afd49bdbb

If you can’t see the embed above, or if you want to download the code, here is the link to it: https://gist.github.com/scrapehero/1b26ad7fd8db1023defa1f4afd49bdbb

If you would like the code in Python 2.7, check this link: https://gist.github.com/scrapehero/f8f7241e32ac0f21f97db3c2f8ecf576

Running the Scraper

Assume you named your scraper tripadvisor_scraper_hotel.py. Typing python, a space, and the script name followed by -h in a command prompt or terminal prints the usage:

python tripadvisor_scraper_hotel.py -h
usage: tripadvisor_scraper_hotel.py [-h] url

positional arguments:
  url         Tripadvisor hotel url

optional arguments:
  -h, --help  show this help message and exit
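
A command-line interface like the one in this help text can be produced with the standard argparse module. The sketch below assumes that is roughly how the script defines it; build_parser and main are illustrative names, not taken from the gist.

```python
import argparse


def build_parser():
    """Create a parser with a single required positional argument."""
    parser = argparse.ArgumentParser(prog="tripadvisor_scraper_hotel.py")
    parser.add_argument("url", help="Tripadvisor hotel url")
    return parser


def main(argv=None):
    """Parse the arguments and return the hotel URL.

    With argv=None, argparse reads sys.argv, which is what a real
    script would do. Passing a list makes the function easy to try out.
    """
    args = build_parser().parse_args(argv)
    return args.url
```

For example, main(["https://example.com/hotel.html"]) returns the URL string, while running without a URL makes argparse print the usage line and exit.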

For example, take the hotel Langham Place, New York, Fifth Avenue, whose URL is https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html. To scrape it, run:

python tripadvisor_scraper_hotel.py https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html

The script will automatically create a file called tripadvisor_hotel_scraped_data.json with the scraped data from TripAdvisor.

The output will look similar to this:

{
    "hotel_url": "https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-The_Langham_New_York_Fifth_Avenue-New_York_City_New_York.html",
    "highlights": "Free Wifi,Parking,Breakfast Buffet,Air Conditioning,Non-Smoking Hotel,Restaurant,Airport Transportation",
    "additional_info": {
        "Awards & Recognition": "Certificate of ExcellenceWhat is Certificate of Excellence? TripAdvisor gives a Certificate of Excellence to accommodations, attractions and restaurants that consistently earn great reviews from travellers. Certificate of ExcellenceWhat is Certificate of Excellence? TripAdvisor gives a Certificate of Excellence to accommodations, attractions and restaurants that consistently earn great reviews from travellers.",
        "Hotel Style": "Spa, Family, Business, Romantic, Luxury",
        "Room types": "Suites, Non-Smoking Rooms",
        "Number of rooms": "234",
        "Price range": "C$571 - C$1,555 (Based on Average Rates for a Standard Room)",
        "Formerly known as": "The Setai Fifth Avenue",
        "Location": "United States > New York > New York City > Midtown / Garment District / Midtown West / Midtown / Tenderloin / Manhattan"
    },
    "name": "The Langham New York Fifth Avenue",
    "rating": 4.5,
    "review_count": 2574,
    "ratings": {
        "Poor": 45,
        "Terrible": 30,
        "Excellent": 2053,
        "Good": 341,
        "Average": 105
    },
    "rank": null,
    "address": {
        "region": "New York",
        "country": "United States",
        "zipcode": "10018-2753",
        "street_address": "400 Fifth Avenue",
        "locality": "New York City"
    },
    "amenities": {
        "Hotel Amenities": "Restaurant,Fitness Centre with Gym / Workout Room,Room Service,Free High Speed Internet (WiFi),Spa,Air Conditioning,Airport Transportation,Babysitting,Banquet Room,Breakfast Available,Business Centre with Internet Access,Children Activities (Kid / Family Friendly),Concierge,Conference Facilities,Dry Cleaning,Laundry Service,Meeting Rooms,Minibar,Multilingual Staff,Non-Smoking Hotel,Pets Allowed ( Dog / Pet Friendly ),Refrigerator in room,Wheelchair Access"
    },
    "official_description": "Gracefully complementing Manhattan's luxurious Fifth Avenue and boldly accenting New York City's famous skyline, The Langham, New York, Fifth Avenue is a renewed celebration of sophistication and style. Soaring more than sixty stories above the city, The Langham, Fifth Avenue represents the new heart of the world's most inspired metropolis. Accommodations exude the very essence of luxury and elegance. Duxiana beds, Pratesi linens, and floor-to-ceiling rain showers leave guests breathing easy, if not breathless."
}
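
Once the file exists, it can be read back with the standard json module. The sketch below uses a trimmed copy of the sample output above and checks that the per-rating counts add up to review_count; summarize is an illustrative helper, not part of the tutorial's script.

```python
import json

# Trimmed copy of the sample output above. In practice you would call
# json.load() on tripadvisor_hotel_scraped_data.json instead.
SAMPLE = """
{
    "name": "The Langham New York Fifth Avenue",
    "rating": 4.5,
    "review_count": 2574,
    "ratings": {
        "Poor": 45,
        "Terrible": 30,
        "Excellent": 2053,
        "Good": 341,
        "Average": 105
    },
    "rank": null
}
"""


def summarize(record):
    """Cross-check the rating breakdown against the total review count."""
    breakdown = record["ratings"]
    total = sum(breakdown.values())
    return {
        "name": record["name"],
        "breakdown_total": total,
        "matches_review_count": total == record["review_count"],
        "share_excellent": round(breakdown["Excellent"] / total, 3) if total else None,
    }


if __name__ == "__main__":
    record = json.loads(SAMPLE)
    print(summarize(record))
    print("rank is missing:", record["rank"] is None)  # JSON null becomes None
```

A check like this is a cheap way to notice when a selector has silently stopped matching.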

That’s it.

You can extend this further by saving the data to a database like MongoDB or MySQL (this might need some flattening of the JSON).
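
As a rough idea of what that flattening could look like, the sketch below turns the nested record into a single-level dictionary whose keys could serve as column names. The naming scheme (lowercase, underscores, parent key as prefix) is an assumption made for this example, not something the tutorial prescribes.

```python
import re


def column_name(key):
    """Lowercase a key and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", key.lower()).strip("_")


def flatten(record, prefix=""):
    """Recursively merge nested dictionaries into one flat dictionary."""
    flat = {}
    for key, value in record.items():
        name = column_name(key)
        column = f"{prefix}_{name}" if prefix else name
        if isinstance(value, dict):
            # Nested section such as "address": recurse with a prefix.
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


if __name__ == "__main__":
    sample = {
        "name": "The Langham New York Fifth Avenue",
        "address": {"region": "New York", "zipcode": "10018-2753"},
        "amenities": {"Hotel Amenities": "Restaurant,Spa"},
    }
    for column, value in flatten(sample).items():
        print(column, "=", value)
```

Each flattened record then maps onto one row in a relational table. A document store such as MongoDB can also take the nested JSON as it is.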

Known Limitations

This code should work for grabbing basic details from hotel URLs. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.

If you need some professional help with scraping complex websites like Tripadvisor, let us know by filling out the form below.

Tell us about your complex web scraping projects

Turn the Internet into meaningful, structured and usable data



Please DO NOT contact us for any help with our Tutorials and Code using this form or by calling us. Instead, please add a comment at the bottom of the tutorial page for help.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

Walker Aguilar April 5, 2018

Hi, thanks for the great post. Is it possible to get the hotel image, please?


    ScrapeHero April 5, 2018

    Sure Walker – if you follow the tutorial and identify the path for the image, you can definitely modify the code easily.


John Wu September 18, 2018

Is it possible to put in a list of URLs of restaurants/hotels so that it scrapes data for multiple hotels?


    ScrapeHero September 18, 2018

    Sure John – it is code so anything is possible. You will need to learn some python to make that happen.
    You can add the list to a file – read the file in a loop and call the scraping code to get the data.
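
The approach suggested in the reply above (keep the URLs in a file, read it in a loop, and call the scraping code for each one) might look like the sketch below. Here scrape_one is a hypothetical stand-in for whatever function in your script downloads and parses a single hotel page; the file format (one URL per line, with # for comments) is also an assumption.

```python
def read_urls(path):
    """Return the URLs in a text file, skipping blanks and # comments."""
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for every URL and collect the results.

    A failure on one page is recorded and does not stop the loop.
    """
    results = []
    for url in urls:
        try:
            results.append(scrape_one(url))
        except Exception as error:
            results.append({"hotel_url": url, "error": str(error)})
    return results
```

When scraping real pages, it is usually worth pausing between requests (for example with time.sleep) so the site is not hit too quickly.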


Houassa November 2, 2018

Hello, thank you for your presentation. How can we transform this data into csv? thank you!


