How to scrape TripAdvisor for Hotel Data, Pricing and Reviews using Python

Tripadvisor.com has a wealth of information about hotels all over the world. It can be used to monitor hotel prices in a locality, price competitively, analyze how prices change with each season, understand the ratings of hotels in a city, and a lot more. We are dividing this tutorial on scraping Tripadvisor into two parts:

  1. Scrape Hotel List for City
  2. Scrape Hotels Details from a Hotel URL

In this part of the tutorial, we’ll search Tripadvisor.com for hotels in a city for specific check-in and check-out dates, and extract the following data from the first page of search results:

  1. Hotel Name
  2. Hotel Detail Page URL
  3. Number of Reviews
  4. Tripadvisor Rating
  5. Hotel Features
  6. Booking Provider
  7. Price per Night
  8. Number of Deals available

The annotated screenshot below shows the data extracted by the Tripadvisor scraper.

tripadvisor-scraper

Scraping Logic

  1. Get the search results page URL from TripAdvisor – Tripadvisor uses a complex URL for the search results page of each locality. For example, here is the one for Boston: https://www.tripadvisor.com/Hotels-g60745-Boston_Massachusetts-Hotels.html. Because the URL embeds a locality-specific identifier (g60745 here), it cannot easily be built by hand, so we get it from the Tripadvisor autocomplete API.
  2. Download the HTML of the search results page using Python Requests – This is easy once you have the URL. We use Python Requests to download the entire HTML of the page.
  3. Parse the page using LXML – LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  4. Save the data to a CSV file
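
Steps 3 and 4 can be sketched in a few lines. The example below is only an illustration, not the code from the gist: it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath and needs well-formed markup) as a stand-in for LXML, and the sample markup, class names and column names other than price_per_night are made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a search results page.
# The real Tripadvisor markup and class names are different.
SAMPLE_PAGE = """
<html><body>
  <div class="listing">
    <a class="name" href="/Hotel_Review-example-1.html">Example Harbor Hotel</a>
    <span class="reviews">1,204 reviews</span>
    <span class="price">$189</span>
  </div>
  <div class="listing">
    <a class="name" href="/Hotel_Review-example-2.html">Example Back Bay Inn</a>
    <span class="reviews">532 reviews</span>
    <span class="price">$142</span>
  </div>
</body></html>
"""

# Assumed column names; only price_per_night is taken from the tutorial's output.
FIELDS = ["hotel_name", "url", "reviews", "price_per_night"]


def parse_listings(page_source):
    """Walk the tree with XPath-style expressions and return one dict per hotel."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for listing in root.findall(".//div[@class='listing']"):
        name_link = listing.find("./a[@class='name']")
        rows.append({
            "hotel_name": name_link.text.strip(),
            "url": "https://www.tripadvisor.com" + name_link.get("href"),
            "reviews": listing.findtext("./span[@class='reviews']", default="").strip(),
            "price_per_night": listing.findtext("./span[@class='price']", default="").strip(),
        })
    return rows


def write_csv(rows, handle):
    """Write the parsed rows to any file-like object, header first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


hotels = parse_listings(SAMPLE_PAGE)
buffer = io.StringIO()  # in-memory file so the sketch leaves nothing on disk
write_csv(hotels, buffer)
csv_text = buffer.getvalue()
```

With LXML the structure would be similar: build a tree from the downloaded HTML, apply one XPath per field, and pass the resulting rows to a CSV writer.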

Requirements

Since this is a web scraping tutorial using Python, you’ll need Python, along with a few Python packages for downloading and parsing the HTML. The requirements are:

  • Python 2.7 (https://www.python.org/downloads/)
  • PIP, to install the following packages in Python (https://pip.pypa.io/en/stable/installing/)
  • Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)
  • Python LXML, for parsing the HTML tree structure using XPaths (installation instructions: http://lxml.de/installation.html)

The Code

The code is largely self-explanatory. The command-line script takes positional arguments for the check-in date, check-out date, sort order and locality.

https://gist.github.com/scrapehero/1c425fdf290144cd4c7c635587feb459

If the embed above doesn’t work, you can download the Tripadvisor code from https://gist.github.com/scrapehero/1c425fdf290144cd4c7c635587feb459.

Running the Scraper

Assuming you named the script tripadvisor_scraper.py, running it in a command prompt or terminal with the -h flag prints its usage:

python tripadvisor_scraper.py -h

usage: tripadvisor_scraper.py [-h] checkin_date checkout_date sort locality

positional arguments:
  checkin_date   Hotel Check In Date (Format: YYYY/MM/DD
  checkout_date  Hotel Chek Out Date (Format: YYYY/MM/DD)
  sort           available sort orders are : priceLow - hotels with lowest
                 price, distLow : Hotels located near to the search center,
                 recommended: highest rated hotels based on traveler reviews,
                 popularity :Most popular hotels as chosen by Tipadvisor users
  locality       Search Locality

optional arguments:
  -h, --help     show this help message and exit
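
A parser that produces help output of this shape could be declared with argparse roughly as follows. This is a sketch rather than the code from the gist: the argument names come from the usage text above, while the date-validating type function, the choices list and the check that check-out follows check-in are assumptions added for illustration.

```python
import argparse
from datetime import datetime

# Sort orders listed in the help text above.
SORT_ORDERS = ["priceLow", "distLow", "recommended", "popularity"]


def slash_date(text):
    """Accept only dates written as YYYY/MM/DD and return a datetime.date."""
    try:
        return datetime.strptime(text, "%Y/%m/%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError("expected YYYY/MM/DD, got %r" % text)


def build_parser():
    parser = argparse.ArgumentParser(prog="tripadvisor_scraper.py")
    parser.add_argument("checkin_date", type=slash_date,
                        help="Hotel check-in date (format: YYYY/MM/DD)")
    parser.add_argument("checkout_date", type=slash_date,
                        help="Hotel check-out date (format: YYYY/MM/DD)")
    # metavar keeps the usage line short instead of listing every choice.
    parser.add_argument("sort", choices=SORT_ORDERS, metavar="sort",
                        help="Sort order: priceLow, distLow, recommended or popularity")
    parser.add_argument("locality", help="Search locality")
    return parser


parser = build_parser()
# Passing a list stands in for the real command line (sys.argv[1:]).
args = parser.parse_args(["2017/01/01", "2017/01/02", "popularity", "boston"])
if args.checkout_date <= args.checkin_date:
    parser.error("checkout_date must be after checkin_date")
```

Validating the dates while parsing means a malformed date is rejected with a usage message before any request is sent.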

Run it with Python, passing the check-in date, check-out date, sort order and locality, in that order. Dates must be in YYYY/MM/DD format.

For example, to find hotels in Boston for January 1 to January 2, 2017, sorted by popularity:

 python tripadvisor_scraper.py "2017/01/01" "2017/01/02" "popularity" "boston"

This will create a CSV file called tripadvisor_data.csv in the same folder as the script.
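
To work with the output, the price column usually has to be turned into numbers first, since values may carry a currency symbol or a trailing asterisk (for example "€161*"). The sketch below shows one way to do that. It reads from an in-memory sample rather than the real file. The hotel_name column and the sample rows are made up, and treating a comma as a thousands separator is an assumption that will not hold for every locale.

```python
import csv
import io
import re

# Stand-in for the contents of tripadvisor_data.csv. Only the
# price_per_night column name is taken from the scraper's output.
SAMPLE_CSV = (
    "hotel_name,price_per_night\n"
    "Example Harbor Hotel,$189\n"
    "Example Canal Hotel,€161*\n"
    "Example Sold Out Inn,\n"
)


def price_to_int(raw):
    """Return the first run of digits in a price string as an int, or None."""
    match = re.search(r"\d[\d,]*", raw or "")
    if not match:
        return None
    # Assumes commas are thousands separators, as in "$1,250".
    return int(match.group().replace(",", ""))


def load_prices(handle):
    """Map each hotel name to its numeric nightly price."""
    reader = csv.DictReader(handle)
    return {row["hotel_name"]: price_to_int(row["price_per_night"]) for row in reader}


prices = load_prices(io.StringIO(SAMPLE_CSV))
```

For the real file, the same function can be given a handle opened with open("tripadvisor_data.csv", newline="", encoding="utf-8"), adjusting the column names to match the actual header row.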

Here is some sample data extracted from TripAdvisor for the command above.

tripadvisor-scraping

Let us know in the comments how this scraper worked for you.

Part 2 can be found here: How to scrape Tripadvisor.com Hotel Details using Python and LXML

Known Limitations

This code should work for grabbing the first page of results for most cities. However, if you want to scrape thousands of pages, there are some important things you should be aware of. You can read about them in “Scalable do-it-yourself scraping – How to build and run scrapers on a large scale” and “How to prevent getting blacklisted while scraping”.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

stedentriplonden January 4, 2017

Thanks for helping me get used to Python, this script is working well! The only problem I get is that prices are printed incorrectly, and I can’t seem to solve it:

Example:

price_per_night

€161*

€318*

€139*

I’m on Tripadvisor Europe, so I’m getting euro prices instead of dollars.

Thanks in advance!


Elis Regina January 10, 2017

Hi guys. This is really helpful. Do you by any chance have a scraper for the actual text of the reviews? I’m working on a text mining project for a specific hotel.
Thanks a lot


Prakash Anand August 31, 2017

How can I get all the hotels? This is giving me only the first page of results. Please help.


Kentaro September 2, 2017

Hi. This is really helpful. Many thanks.
Would it be possible to explain the form_data? I need to change it to fit my needs.


Guy Malou January 2, 2018

Hello,
I am a web developer and I have scraped the web using other APIs. I would like to learn how to do it with Python 2.7, which I have installed along with pip. However, I am having trouble installing LXML: the download of lxml-4.1.1-cp27-cp27m-win32.whl (3.2 MB) did not succeed, and I get this message:
Exception:
Traceback (most recent call last):
I am just starting out with Python and would really appreciate some help. Many thanks.


A. January 15, 2018

Hello, I’m getting “error: unrecognized arguments: Boston”
from the updated code, what could it mean?


    A. January 15, 2018

    Nevermind, works like a charm, thanks for sharing


Walker Aguilar April 5, 2018

Is it possible to get the image of the hotel?


Alex Galey April 10, 2018

Thank you for this article and code. I am not able to scrape data: Installation OK – URL found OK – Parsing data OK – tripadvisor_data.csv empty (only the first line).
I tried with different cities for different dates.

I am also getting warning :
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)


nadia April 10, 2018

I think my code has been blocked by Tripadvisor, or I have been blacklisted. Is there any alternative?


Bitact April 25, 2018

Mine too


Muthu Vigneshwaran July 22, 2018

Hi, how would we iterate through all the pages and get the list of all hotels?


    ScrapeHero July 22, 2018

    The tutorials are intended to provide a starting point for learning and making advanced enhancements. The next step would be pagination, which is a great exercise for learning Python.


      Muthu Vigneshwaran July 23, 2018

      Hi,

      I am trying to scrape the following website for the list of all the restaurants, and my code is as follows:

      # -*- coding: utf-8 -*-

      from time import sleep
      import scrapy
      from selenium import webdriver
      from scrapy.selector import Selector
      from scrapy.http import Request
      from selenium.common.exceptions import NoSuchElementException

      class DineoutRestaurantSpider(scrapy.Spider):
          name = 'dineout_restaurant'
          allowed_domains = ['dineout.co.in/bangalore-restaurants?search_str=']
          start_urls = ['http://dineout.co.in/bangalore-restaurants?search_str=']

          def start_requests(self):
              self.driver = webdriver.Chrome('/Users/macbookpro/Downloads/chromedriver')
              self.driver.get('https://www.dineout.co.in/bangalore-restaurants?search_str=')

              url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
              yield Request(url, callback=self.parse)
              self.logger.info('Empty message')

              for i in range(1, 4):
                  try:
                      next_page = self.driver.find_element_by_xpath('//a[text()="Next "]')
                      sleep(11)
                      self.logger.info('Sleeping for 11 seconds.')
                      next_page.click()
                      url = 'https://www.dineout.co.in/bangalore-restaurants?search_str='
                      yield Request(url, callback=self.parse)

                  except NoSuchElementException:
                      self.logger.info('No more pages to load.')
                      self.driver.quit()
                      break

          def parse(self, response):
              self.logger.info('Entered parse method')
              restaurants = response.xpath('//*[@class="cardBg"]')
              for restaurant in restaurants:
                  name = restaurant.xpath('.//*[@class="titleDiv"]/h4/a/text()').extract_first()
                  location = restaurant.xpath('.//*[@class="location"]/a/text()').extract()
                  rating = restaurant.xpath('.//*[@class="rating rating-5"]/a/span/text()').extract_first()
                  yield {
                      'Name': name,
                      'Location': location,
                      'Rating': rating,
                  }


      In the above code, when I yield the request, callback=self.parse is not called at that point; it only gets called at the end of the program.


Ventura August 5, 2018

Hi,

I tried out your code, but the prices are not showing up. I tried debugging the issue and found that prices don’t seem to be populated in the response text from the website. When I check the source of the actual web page in Chrome they show up, but not in the response text. Please advise.


    ScrapeHero August 7, 2018

    Hi Ventura, thanks for pointing this out and taking the time to debug. It seems Tripadvisor has changed its website so that the price shows up only when a real browser (one that executes JavaScript) is used. We will rewrite this tutorial or modify our code to fix this.

    Meanwhile, you could use selenium or puppeteer for this. Here is something that could help – https://www.scrapehero.com/tutorial-web-scraping-hotel-prices-using-selenium-and-python/


XIAO QIANYU September 6, 2018

Thank you so much. This is very helpful.
The original code didn’t work on my system, so I made the changes below according to the error messages, and it finally works.
Here is my summary, in case anyone encounters the same errors.

1. SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Scraper Inititated for Locality:%s"%locality)?

Solution:
Add () to every print statement, for example
print "Writing to output file tripadvisor_data.csv" ->
print("Writing to output file tripadvisor_data.csv")

2. price_per_night = ''.join(raw_hotel_price_per_night).encode('utf-8').replace('\n','') if raw_hotel_price_per_night else None
TypeError: a bytes-like object is required, not 'str'

Solution:
price_per_night = ''.join(raw_hotel_price_per_night).encode('utf-8').replace('\n','') ->
price_per_night = ''.join(raw_hotel_price_per_night).encode('utf-8').decode().replace('\n','')

3. TypeError: write() argument must be str, not bytes

Solution:
with open('tripadvisor_data.csv','b') ->
with open(r'C:\Users\Python_workspace\tripadvisor_data.csv','wb')

My system:
Python 3.7.0b4 (v3.7.0b4:eb96c37699, May 2 2018, 19:02:22) [MSC v.1913 64 bit (AMD64)] on win32


    Sam Giarratana November 27, 2018

    Thanks for this. I figured out the first one on my own, but I’m really thankful to have your help on the other two.


Seb H October 21, 2018

Thanks for posting this! However, I can’t get the code to work:

Scraper Inititated for Locality:%s
Traceback (most recent call last):
  File "tripadvisor_scraper.py", line 138, in <module>
    data = parse(locality, checkin_date, checkout_date, sort)
  File "tripadvisor_scraper.py", line 14, in parse
    print("Scraper Inititated for Locality:%s") % locality
TypeError: unsupported operand type(s) for %: 'NoneType' and 'str'

What should I do?

Cheers


    Nithu October 23, 2018

    Could you please share the command you used to run the script? Here is a sample command to run the script –

    python tripadvisor_scraper.py 2018/11/2 2018/11/3 priceLow boston


Zach July 9, 2019

What would the URL need to be changed to in order to get info for “Restaurants” and “Things to do”? I’ve been playing around with it for a while but can’t seem to get a correct ‘ur’ key back. Thanks!


    zach July 9, 2019

    *geo-url


Nadia October 2, 2019

Why is your “üguests” not working? I am changing it to 3, but it is showing results for only 2.


Comments are closed.
