How to scrape Tripadvisor.com Hotel Details using Python and LXML

This tutorial is a follow-up to How to scrape TripAdvisor.com for Hotels in a City using Python. Here, we’ll look at scraping hotel details from a hotel URL. Let’s build a simple Python script that downloads a hotel detail page from Tripadvisor.com and extracts details from it.

We’ll extract these data from a hotel’s page:

  1. Name
  2. Address
  3. Rank
  4. Description
  5. Rating
  6. Rating Summary
  7. Total Number of Reviews
  8. Highlights
  9. Amenities
  10. Additional Info

 

There is a lot more information that would be interesting to extract from the hotel detail page of TripAdvisor. For the sake of simplicity, we’ll just stick to the ones above.

Scraping Logic

  1. Download the HTML of the hotel detail page using Python Requests – this is quite easy once you have the URL. We use Python Requests to download the entire HTML of the page.
  2. Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  3. Save the data as JSON to a file.
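The first two steps can be sketched roughly as below. This is only an illustration and not the tutorial's actual script: the XPaths, the field names, the User-Agent string and the sample HTML are all made up for the example, and TripAdvisor's real markup will need its own expressions.

```python
import requests
from lxml import html

# Hypothetical XPaths for illustration only. TripAdvisor's real markup
# differs and changes often, so inspect the live page to find working ones.
XPATHS = {
    "name": '//h1[@id="HEADING"]//text()',
    "address": '//span[@class="street-address"]//text()',
    "rating": '//span[@class="overall-rating"]//text()',
}


def download_page(url):
    """Step 1: download the raw HTML of a hotel detail page."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-sketch)"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def parse_hotel(page_source):
    """Step 2: pull fields out of the HTML tree with XPath."""
    tree = html.fromstring(page_source)
    details = {}
    for field, xpath in XPATHS.items():
        # xpath() returns a list of text nodes; join them and
        # collapse any runs of whitespace.
        text = " ".join(" ".join(tree.xpath(xpath)).split())
        details[field] = text or None
    return details


# A tiny stand-in page so the parsing step can be tried without a network call.
SAMPLE_HTML = """
<html><body>
  <h1 id="HEADING">Example Hotel</h1>
  <span class="street-address">1 Example Street</span>
  <span class="overall-rating"> 4.5 </span>
</body></html>
"""

sample_details = parse_hotel(SAMPLE_HTML)
print(sample_details)
```

For a live page you would pass the result of download_page(url) to parse_hotel instead of the sample string.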

You could connect this scraper to the previous scraper built in How to scrape TripAdvisor.com for Hotels in a City using Python. We’ll leave that to you to figure out.

Requirements

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.

Install Python 3 and Pip

Here is a guide to install Python 3 on Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows users can go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

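Install Packages

The script needs two packages: Python Requests, to download the page, and LXML, to parse it. Both can be installed with pip:

    pip install requests lxml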

The Code

The code is self-explanatory as before.

If you can’t see the embed above, or if you want to download the code, here is the link to it – https://gist.github.com/scrapehero/1b26ad7fd8db1023defa1f4afd49bdbb

If you would like the code in Python 2.7, check this link – https://gist.github.com/scrapehero/f8f7241e32ac0f21f97db3c2f8ecf576

Running the Scraper

Assume you named your scraper tripadvisor_scraper_hotel.py. If you type python, a space, and the script name followed by -h in a command prompt or terminal, the script shows its usage help.

For example, take the hotel Langham Place, New York, Fifth Avenue, whose URL is https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html
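The script's own argument handling is not reproduced here, so the sketch below is only an assumption about how such a command-line interface could be set up with Python's argparse module, which adds the -h help option automatically. The argument name url and the description text are hypothetical.

```python
import argparse


def build_parser():
    # argparse adds the -h/--help option automatically.
    parser = argparse.ArgumentParser(
        description="Scrape details from a TripAdvisor hotel page (sketch)"
    )
    # "url" is a hypothetical argument name chosen for this sketch.
    parser.add_argument("url", help="TripAdvisor hotel detail page URL")
    return parser


# For demonstration, parse an explicit argument list instead of sys.argv.
example_url = (
    "https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-"
    "Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html"
)
args = build_parser().parse_args([example_url])
print(args.url)
```

In a real script you would call parse_args() with no arguments so that it reads the command line, and it is usually safest to put the URL in quotes when typing it in a terminal.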

The script will automatically create a file called tripadvisor_hotel_scraped_data.json containing the scraped data from TripAdvisor.
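As a rough illustration of the saving step, the sketch below writes a dictionary to a JSON file and reads it back. The keys and values are placeholders and not real scraped data, and the file goes to the system temp directory here only so that the demo does not clutter your working folder.

```python
import json
import os
import tempfile


def save_as_json(details, path):
    """Write the scraped details to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # indent makes the file readable; ensure_ascii=False keeps
        # accented characters in hotel names as they are.
        json.dump(details, f, indent=4, ensure_ascii=False)


# Placeholder values, not real scraped data.
details = {
    "name": "Example Hotel",
    "rating": "4.5",
    "amenities": ["Pool", "Free Wifi"],
}

output_path = os.path.join(
    tempfile.gettempdir(), "tripadvisor_hotel_scraped_data.json"
)
save_as_json(details, output_path)

# Read the file back to confirm what was written.
with open(output_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded)
```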

That’s it.

You can extend this further by saving the data to a database like MongoDB or MySQL (this might need some flattening of the JSON).
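Flattening could look something like the sketch below, which turns nested dictionaries into underscore-joined column names and joins lists into a single string. The sample record and its keys are invented for the example, and joining lists with commas is just one possible choice for a relational table.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict of column -> value."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, prefixing keys with the parent name.
            flat.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # Lists are stored as one comma-separated string.
            flat[new_key] = ", ".join(str(item) for item in value)
        else:
            flat[new_key] = value
    return flat


# Invented sample record, not real scraped data.
nested = {
    "name": "Example Hotel",
    "ratings": {"excellent": 120, "poor": 3},
    "amenities": ["Pool", "Free Wifi"],
}
flat_row = flatten(nested)
print(flat_row)
```

A document store like MongoDB can usually take the nested structure as it is, so flattening mostly matters for table-based databases such as MySQL.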

Known Limitations

This code should work for grabbing basic details from most hotel URLs. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
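The linked articles cover the details. As one small illustration, a common precaution is to pause between pages and retry failed downloads. The sketch below is an assumption about how that could be wired up, not something taken from the tutorial's code: the function name fetch_all, the delay values and the fake downloader are all hypothetical.

```python
import random
import time


def fetch_all(urls, fetch, min_delay=2.0, max_delay=5.0, retries=2,
              sleep=time.sleep):
    """Fetch each URL with a random pause between URLs and a few retries."""
    pages = {}
    for url in urls:
        pages[url] = None  # stays None if every attempt fails
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                break
            except Exception:
                # Broad catch keeps the sketch short; narrow it in real code.
                continue
        # Pause before moving on so requests are not sent back to back.
        sleep(random.uniform(min_delay, max_delay))
    return pages


# Fake downloader that fails on its first call, to exercise the retry path.
calls = []


def fake_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated failure")
    return f"<html>{url}</html>"


# sleep is stubbed out so the demo finishes instantly.
pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    sleep=lambda seconds: None,
)
print(pages)
```

In real use, fetch would be a function that downloads a page, and sleep would be left at its default so the pauses actually happen.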


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code; however, if you add your questions in the comments section, we may periodically address them.

2 comments on “How to scrape Tripadvisor.com Hotel Details using Python and LXML”

Walker Aguilar

Hi, thanks for the great post. Is it possible to get the hotel image, please?

    ScrapeHero

    Sure Walker – if you follow the tutorial and identify the path for the image, you can definitely modify the code easily.
