Tripadvisor.com has a wealth of information about hotels all over the world. It can be used for monitoring hotel prices in a locality, competitive pricing, analyzing how prices change with each season, understanding the ratings of hotels in a city, and a lot more. We are dividing this Tripadvisor scraping tutorial into two parts:
- Scrape Hotel List for City
- Scrape Hotels Details from a Hotel URL
In this part of the tutorial, we’ll search Tripadvisor.com for hotels in a city for specific check-in and check-out dates, and extract the following data from the first page of search results:
- Hotel Name
- Hotel Detail Page URL
- Number of Reviews
- Tripadvisor Rating
- Hotel Features
- Booking Provider
- Price per Night
- Number of Deals available
The annotated screenshot below shows the data extracted by the Tripadvisor scraper.
- Construct the search results page URL: Tripadvisor uses a complex URL for the search results page of each locality. For example, here is the one for Boston: https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html. We have to construct this URL ourselves before we can scrape the results page, and we do that by looking it up through the Tripadvisor autocomplete API.
- Download the HTML of the search results page using Python Requests: this is quite easy once you have the URL. We use Python Requests to download the entire HTML of the page.
- Parse the page using LXML: LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
- Save the data to a CSV file
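The download and parse steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the gist: the sample markup, the class names and the XPaths are all hypothetical placeholders, because the real Tripadvisor page uses different markup that changes over time. The `download` helper is shown only for completeness and is not called here.

```python
from lxml import html

# Hypothetical sample page. The class names below are assumptions made for
# this sketch and do not match the real Tripadvisor markup.
SAMPLE_HTML = """
<html><body>
  <div class="listing">
    <a class="name" href="/Hotel_Review-example-1.html">Example Hotel One</a>
    <span class="reviews">1,234 reviews</span>
    <span class="price">$199</span>
  </div>
  <div class="listing">
    <a class="name" href="/Hotel_Review-example-2.html">Example Hotel Two</a>
    <span class="reviews">56 reviews</span>
    <span class="price">$89</span>
  </div>
</body></html>
"""


def download(url):
    """Download the HTML of a page with Python Requests (not called below)."""
    import requests  # imported here so the parsing sketch runs without it

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.text


def first(values):
    """Return the first XPath match with whitespace stripped, or None."""
    return values[0].strip() if values else None


def parse_hotels(page_source):
    """Extract one dictionary per hotel listing using predefined XPaths."""
    tree = html.fromstring(page_source)
    hotels = []
    for listing in tree.xpath('//div[@class="listing"]'):
        hotels.append({
            "hotel_name": first(listing.xpath('.//a[@class="name"]/text()')),
            "url": first(listing.xpath('.//a[@class="name"]/@href')),
            "reviews": first(listing.xpath('.//span[@class="reviews"]/text()')),
            "price_per_night": first(listing.xpath('.//span[@class="price"]/text()')),
        })
    return hotels


hotels = parse_hotels(SAMPLE_HTML)
for hotel in hotels:
    print(hotel)
```

Keeping the XPaths relative to each listing (the leading `.//`) means a field missing from one hotel yields `None` for that hotel instead of shifting values between rows.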
Since this is a web scraping tutorial using Python, you’ll need Python. We also use some Python packages for downloading and parsing the HTML. Below are the requirements:
- Python 2.7 ( https://www.python.org/downloads/ )
- PIP to install the following packages in Python ( https://pip.pypa.io/en/stable/installing/)
- Python Requests, to make requests and download the HTML content of the pages ( http://docs.python-requests.org/en/master/user/install/).
- Python LXML, for parsing the HTML tree structure using XPaths (installation instructions: http://lxml.de/installation.html)
The code is self-explanatory. The command-line script takes positional arguments for the check-in date, check-out date, sort order and locality.
You can download the Tripadvisor code from here https://gist.github.com/scrapehero/1c425fdf290144cd4c7c635587feb459 if the embed above doesn’t work.
Running the Scraper
Assuming you named the script tripadvisor_scraper.py, typing the script name in a command prompt or terminal with the -h option prints the usage help:
python tripadvisor_scraper.py -h

usage: tripadvisor_scraper.py [-h] checkin_date checkout_date sort locality

positional arguments:
  checkin_date   Hotel Check In Date (Format: YYYY/MM/DD
  checkout_date  Hotel Chek Out Date (Format: YYYY/MM/DD)
  sort           available sort orders are : priceLow - hotels with lowest
                 price, distLow : Hotels located near to the search center,
                 recommended: highest rated hotels based on traveler reviews,
                 popularity :Most popular hotels as chosen by Tipadvisor users
  locality       Search Locality

optional arguments:
  -h, --help     show this help message and exit
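For readers curious how a command line like this is usually declared, here is a minimal argparse sketch. The argument names and the four sort orders come from the help text above. The date validation through `type=` and the restriction through `choices=` are assumptions added for illustration, so the actual script in the gist may handle them differently and print slightly different help.

```python
import argparse
from datetime import datetime

# Sort orders listed in the help output above.
SORT_ORDERS = ["priceLow", "distLow", "recommended", "popularity"]


def parse_date(value):
    """Convert a YYYY/MM/DD string to a datetime, or report a usage error."""
    try:
        return datetime.strptime(value, "%Y/%m/%d")
    except ValueError:
        raise argparse.ArgumentTypeError("expected YYYY/MM/DD, got %r" % value)


def build_parser():
    parser = argparse.ArgumentParser(prog="tripadvisor_scraper.py")
    parser.add_argument("checkin_date", type=parse_date,
                        help="Hotel check-in date (format: YYYY/MM/DD)")
    parser.add_argument("checkout_date", type=parse_date,
                        help="Hotel check-out date (format: YYYY/MM/DD)")
    parser.add_argument("sort", choices=SORT_ORDERS, metavar="sort",
                        help="Sort order: " + ", ".join(SORT_ORDERS))
    parser.add_argument("locality", help="Search locality")
    return parser


# An explicit argument list is passed here so the sketch runs anywhere.
# A real script would call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(
    ["2017/01/01", "2017/01/02", "popularity", "boston"]
)
print(args.checkin_date.date(), args.checkout_date.date(), args.sort, args.locality)
```

Validating the dates while parsing means a malformed date is rejected with a usage message before any request is sent.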
Run it using python with arguments for the locality, sort order, check-in and check-out dates in YYYY/MM/DD format.
For example, to find hotels in Boston from January 1, 2017 to January 2, 2017, sorted by popularity:
python tripadvisor_scraper.py "2017/01/01" "2017/01/02" "popularity" "boston"
This will create a CSV file called tripadvisor_data.csv in the same folder as the script.
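The CSV step could look something like the sketch below, which uses the standard library `csv` module. The column names are hypothetical labels derived from the list of extracted fields earlier in this tutorial, and the sample row is made up. Note that this sketch assumes Python 3, whereas the tutorial lists Python 2.7.

```python
import csv
import io

# Hypothetical column names based on the fields this tutorial extracts.
FIELDNAMES = [
    "hotel_name", "url", "reviews", "tripadvisor_rating",
    "hotel_features", "booking_provider", "price_per_night", "no_of_deals",
]


def write_hotels(hotels, output):
    """Write a list of hotel dictionaries to an open text file object."""
    # Keys missing from a row are written as empty cells (the default restval).
    writer = csv.DictWriter(output, fieldnames=FIELDNAMES)
    writer.writeheader()
    for hotel in hotels:
        writer.writerow(hotel)


# Made-up sample row, written to an in-memory buffer so nothing touches disk.
sample = [{"hotel_name": "Example Hotel", "reviews": "1,234", "price_per_night": "$199"}]
buffer = io.StringIO()
write_hotels(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("tripadvisor_data.csv", "w", newline="")`. The `csv` module quotes values containing commas, such as the review count above, so they stay in one column.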
Here is some sample data extracted from TripAdvisor for the command above.
You can download the code at https://gist.github.com/scrapehero/1c425fdf290144cd4c7c635587feb459
Let us know in comments how this scraper worked for you.
Part 2 can be found here: How to scrape Tripadvisor.com Hotel Details using Python and LXML
This code should work for grabbing the first page of results for most cities. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
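The linked articles cover this in depth. As one small illustration of the kind of care involved, here is a sketch of retrying a failed download with an increasing pause between attempts. It is not part of the original scraper: the function name, the parameters and the doubling delay are all assumptions chosen for this example, and the `fetch` argument stands for whatever download function you use.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with a delay that doubles each time.

    The sleep function is a parameter so the behaviour can be checked
    without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a fake fetch function that fails twice, then succeeds.
calls = {"count": 0}
pauses = []


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.com/", sleep=pauses.append)
print(result, pauses)
```

In this demonstration the recorded pauses are 1.0 and 2.0 seconds, because the fake function fails on the first two attempts.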
If you need professional help with scraping complex websites like Tripadvisor, let us know by filling out the form below.
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.