Step-by-step tutorial to scrape Tripadvisor reviews and hotel data - Name, Price Per Night, Deals, Reviews, and Ratings using Python and LXML.
This tutorial is a follow-up to How to scrape TripAdvisor.com for Hotels in a City using Python. Here, we will scrape hotel data from a hotel URL. Let’s build a simple Python script to download a hotel detail page from Tripadvisor.com and extract details from it.
We’ll extract these data from a hotel’s page:
- Name
- Address
- Rank
- Description
- Rating
- Rating Summary
- Total Number of Reviews
- Highlights
- Amenities
- Additional Info
There is a lot more information that would be interesting to extract from the hotel detail page of TripAdvisor. For the sake of simplicity, we’ll just stick to the ones above.
Scraping Logic
- Download the HTML of the hotel detail page using Python Requests – This is quite easy once you have the URL. Requests downloads the entire HTML of the page.
- Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
- Save the data as JSON to a file.
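The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the gist: the function names, the User-Agent header, the sample HTML, and the two XPaths are all assumptions made for the example, and the real Tripadvisor markup will need different XPaths.

```python
import json

from lxml import html


def download_html(url):
    """Step 1: fetch the raw HTML of a hotel page (needs the requests package)."""
    import requests  # imported here so the parsing step can be tried offline

    # Assumption: a browser-like User-Agent; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text


def parse_hotel(page_source):
    """Step 2: pull fields out of the HTML tree with XPath.

    The XPaths below are hypothetical and match only the SAMPLE markup.
    """
    tree = html.fromstring(page_source)
    name = tree.xpath('//h1[@id="HEADING"]/text()')
    rating = tree.xpath('//span[@class="rating"]/@content')
    return {
        "name": name[0].strip() if name else None,
        "rating": float(rating[0]) if rating else None,
    }


def save_json(data, path):
    """Step 3: write the extracted data to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


# Made-up HTML standing in for a downloaded page.
SAMPLE = """
<html><body>
  <h1 id="HEADING"> Example Hotel </h1>
  <span class="rating" content="4.5"></span>
</body></html>
"""

print(parse_hotel(SAMPLE))
```

Each XPath lookup returns a list, so the sketch falls back to None when an element is missing rather than raising an error. That fallback helps when a page layout changes.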
You could connect this scraper to the previous scraper built in How to scrape TripAdvisor.com for Hotels in a City using Python. We’ll leave that to you to figure out.
Requirements
For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.
Install Python 3 and Pip
Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/
Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/
Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/
Install Packages
- PIP, to install the following packages in Python (https://pip.pypa.io/en/stable/installing/)
- Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)
- Python LXML, for parsing the HTML tree structure using XPaths (learn how to install it here – http://lxml.de/installation.html)
The Code
The code is self-explanatory as before.
You can view or download the full code here – https://gist.github.com/scrapehero/1b26ad7fd8db1023defa1f4afd49bdbb
If you would like the code in Python 2.7, check this link https://gist.github.com/scrapehero/f8f7241e32ac0f21f97db3c2f8ecf576
Running the Scraper
Assume you named your scraper tripadvisor_scraper_hotel.py. If you run the script in a command prompt or terminal with the -h flag, it prints the usage help:
python tripadvisor_scraper_hotel.py -h

usage: tripadvisor_scraper_hotel.py [-h] url

positional arguments:
  url         Tripadvisor hotel url

optional arguments:
  -h, --help  show this help message and exit
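The help text above has the shape that Python's argparse module produces, so a command-line interface like this one can be sketched as below. This is an assumption about how such an interface could be wired up, not a copy of the gist; build_parser and main are hypothetical names.

```python
import argparse


def build_parser():
    # One required positional argument: the hotel page to scrape.
    parser = argparse.ArgumentParser(prog="tripadvisor_scraper_hotel.py")
    parser.add_argument("url", help="Tripadvisor hotel url")
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv; passing a list eases testing.
    args = build_parser().parse_args(argv)
    return args.url


# Example call with an explicit argument list instead of the real command line.
print(main(["https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html"]))
```

On Python 3.10 and later, argparse prints the heading "options:" instead of "optional arguments:", so the help output may differ slightly from the one shown above.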
For example, to scrape the hotel Langham Place, New York, Fifth Avenue, whose URL is https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html, run:
python tripadvisor_scraper_hotel.py https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-Langham_Place_New_York_Fifth_Avenue-New_York_City_New_York.html
The script will automatically create a file called tripadvisor_hotel_scraped_data.json with the scraped data from TripAdvisor, which will look similar to this:
{
  "hotel_url": "https://www.tripadvisor.com/Hotel_Review-g60763-d1776857-Reviews-The_Langham_New_York_Fifth_Avenue-New_York_City_New_York.html",
  "highlights": "Free Wifi,Parking,Breakfast Buffet,Air Conditioning,Non-Smoking Hotel,Restaurant,Airport Transportation",
  "additional_info": {
    "Awards & Recognition": "Certificate of ExcellenceWhat is Certificate of Excellence? TripAdvisor gives a Certificate of Excellence to accommodations, attractions and restaurants that consistently earn great reviews from travellers. Certificate of ExcellenceWhat is Certificate of Excellence? TripAdvisor gives a Certificate of Excellence to accommodations, attractions and restaurants that consistently earn great reviews from travellers.",
    "Hotel Style": "Spa, Family, Business, Romantic, Luxury",
    "Room types": "Suites, Non-Smoking Rooms",
    "Number of rooms": "234",
    "Price range": "C$571 - C$1,555 (Based on Average Rates for a Standard Room)",
    "Formerly known as": "The Setai Fifth Avenue",
    "Location": "United States > New York > New York City > Midtown / Garment District / Midtown West / Midtown / Tenderloin / Manhattan"
  },
  "name": "The Langham New York Fifth Avenue",
  "rating": 4.5,
  "review_count": 2574,
  "ratings": {
    "Poor": 45,
    "Terrible": 30,
    "Excellent": 2053,
    "Good": 341,
    "Average": 105
  },
  "rank": null,
  "address": {
    "region": "New York",
    "country": "United States",
    "zipcode": "10018-2753",
    "street_address": "400 Fifth Avenue",
    "locality": "New York City"
  },
  "amenities": {
    "Hotel Amenities": "Restaurant,Fitness Centre with Gym / Workout Room,Room Service,Free High Speed Internet (WiFi),Spa,Air Conditioning,Airport Transportation,Babysitting,Banquet Room,Breakfast Available,Business Centre with Internet Access,Children Activities (Kid / Family Friendly),Concierge,Conference Facilities,Dry Cleaning,Laundry Service,Meeting Rooms,Minibar,Multilingual Staff,Non-Smoking Hotel,Pets Allowed ( Dog / Pet Friendly ),Refrigerator in room,Wheelchair Access"
  },
  "official_description": "Gracefully complementing Manhattan's luxurious Fifth Avenue and boldly accenting New York City's famous skyline, The Langham, New York, Fifth Avenue is a renewed celebration of sophistication and style. Soaring more than sixty stories above the city, The Langham, Fifth Avenue represents the new heart of the world's most inspired metropolis. Accommodations exude the very essence of luxury and elegance. Duxiana beds, Pratesi linens, and floor-to-ceiling rain showers leave guests breathing easy, if not breathless."
}
That’s it.
You can extend this further by saving the data to a database such as MongoDB or MySQL (this might need some flattening of the JSON).
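One possible way to flatten the nested JSON before loading it into a relational table or a CSV file is sketched below. The helper names flatten and to_csv, the dotted column naming, and the trimmed sample record are assumptions for illustration only.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_csv(records):
    """Render a list of scraped records as CSV text, one row per hotel."""
    rows = [flatten(record) for record in records]
    # Collect the union of all columns in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A trimmed-down record in the shape of the scraper output shown above.
hotel = {
    "name": "The Langham New York Fifth Avenue",
    "rating": 4.5,
    "address": {"locality": "New York City", "zipcode": "10018-2753"},
}
print(to_csv([hotel]))
```

Nested keys such as the address become columns like address.locality, which map directly onto table columns in MySQL. MongoDB can store the nested document as it is, so flattening is mainly needed for tabular targets.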
Known Limitations
This code should work for scraping basic details from hotel URLs. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
Responses
Hi, thanks for the post. Is it possible to get the hotel image, please?
Sure Walker – if you follow the tutorial and identify the path for the image, you can definitely modify the code easily.
Is it possible to put in a list of URLs of restaurants/hotels so that it scrapes data for multiple hotels?
Sure John – it is code, so anything is possible. You will need to learn some Python to make that happen.
You can add the list to a file – read the file in a loop and call the scraping code to get the data.
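A rough sketch of that idea follows. The names read_urls and scrape_all are hypothetical, and scrape_one stands in for whatever function in your script scrapes a single hotel URL and returns its data.

```python
import time


def read_urls(path):
    """Read one URL per line, skipping blank lines and # comments."""
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


def scrape_all(urls, scrape_one, delay_seconds=0):
    """Call scrape_one(url) for every URL and collect the results."""
    results = []
    for url in urls:
        try:
            results.append(scrape_one(url))
        except Exception as error:
            # Keep going if a single page fails.
            print(f"Failed to scrape {url}: {error}")
        if delay_seconds:
            time.sleep(delay_seconds)  # pause between requests
    return results
```

For example, scrape_all(read_urls("urls.txt"), scrape_one, delay_seconds=2) would process every URL listed in a file named urls.txt, pausing two seconds between pages.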
Hello, thank you for your presentation. How can we transform this data into csv? thank you!