How to Scrape Tripadvisor Data [Python + Selenium]

Set Up the Environment for Tripadvisor Data Scraping
Data Extracted from Tripadvisor
Tripadvisor Data Scraping: The code
Limitations of the Code
Why Use a Web Scraping Service
Frequently Asked Questions

Tripadvisor is a popular site for comparing prices and other details of hotels; it hosts details of hotels. Therefore, Tripadvisor data scraping can give you insights into your competitors. You can use the information to improve the amenities and features of your hotels or set competitive prices.

Tripadvisor.com, however, is a dynamic website, so the requests library is not ideal for the purpose.

Solution: Use automated browsers like Selenium.

This tutorial shows how to scrape Tripadvisor data using Selenium and lxml.

Set Up the Environment for Tripadvisor Data Scraping

Install the necessary packages on your system to use the code.

The code needs Python Selenium to get the source code from the hotel page and lxml to parse and extract the necessary data.

The Python package manager, pip, can install both Selenium and lxml.

pip install selenium lxml

Want to learn more about scraping using Selenium? Read this article on Selenium web scraping.

Data Extracted from Tripadvisor

This tutorial will teach you how to scrape Tripadvisor and get these details from a hotel page:

Name
Address
Ratings
Review count
Overall Rating
Amenities
Room features
Room Types
Rank
Hotel URL

The above data points will be inside specific HTML tags; you need to analyze the Tripadvisor hotel page to understand them to create XPaths. These XPaths help HTML parsers (lxml in this case) locate the tags.

Want to learn more about XPaths? Check out this Xpath cheat sheet.

Just right click on each data point and click ‘Inspect’ to analyze the code.

The Carriage House hotel's page on Tripadvisor showing the browser's inspect panel.

Tripadvisor Data Scraping: The code

Here’s the complete code for Tripadvisor scraping that you can copy and paste.

from lxml import html
import json
import argparse
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep

def process_request(url):
    driver = webdriver.Chrome()
    driver.get(url)

    sleep(4)    
    htmls = driver.find_element(By.TAG_NAME,"html")
    sleep(3)
    htmls.send_keys(Keys.END)
    sleep(10)
    
    response = driver.page_source
    parser = html.fromstring(response, url)
    return process_page(parser, url)

def clean(text):
    if isinstance(text, list):
        text = ' '.join(text)
    return ' '.join(text.strip().split())

def process_page(parser, url):

    XPATH_NAME = '//h1[@id="HEADING"]//text()'
    XPATH_RANK = "//div[@id='ABOUT_TAB']//div[contains(text(),'#')]/text()"
    XPATH_AMENITIES = '//div[contains(text(),"Property amenities")]/following::div[1]//text()'
    XPATH_ROOM_FEATURES = '//div[@data-test-target="hr-about-group-room_amenities"]/following::div[1]//text()'
    XPATH_OFFICIAL_DESCRIPTION = '//div[@data-automation="aboutTabDescription"]//text()'
    XPATH_ROOM_TYPES = '//div[contains(text(),"Room types")]/following::div[1]//text()'
    XPATH_FULL_ADDRESS_JSON = "//div[@id='LOCATION']//span/text()"
    XPATH_RATING ="//svg[@data-automation='bubbleRatingImage']/title/text()"
    XPATH_USER_REVIEWS = "//div[@id='REVIEWS']//div[text()='Excellent']/ancestor::div[4]//text()"
    XPATH_REVIEW_COUNT = "//div[@id='ABOUT_TAB']//div[@data-automation='bubbleReviewCount']//text()"

    raw_name = parser.xpath(XPATH_NAME)
    raw_rank = parser.xpath(XPATH_RANK)
    raw_amenities = parser.xpath(XPATH_AMENITIES)
    raw_room_features = parser.xpath(XPATH_ROOM_FEATURES)
    raw_official_description = parser.xpath(XPATH_OFFICIAL_DESCRIPTION)
    raw_room_types = parser.xpath(XPATH_ROOM_TYPES)
    raw_address = parser.xpath(XPATH_FULL_ADDRESS_JSON)
    raw_rating = parser.xpath(XPATH_RATING)
    raw_review_count = parser.xpath(XPATH_REVIEW_COUNT)
    raw_userRating = parser.xpath(XPATH_USER_REVIEWS)

    room_types = clean(raw_room_types)
    name = clean(raw_name).split('Someone')[0]
    rank = clean(raw_rank).split()[0]
    amenities = clean(raw_amenities)
    official_description = clean(raw_official_description)
    address = clean(raw_address[0])
    newRoomFeatures = clean(raw_room_features)
    rating = raw_rating[0].split()[0]
    userRating = clean(raw_userRating).split()
    review_count = clean(raw_review_count).replace('(', '').replace(')', '').replace('reviews','').strip()
    ratings = {
                                'Excellent': userRating[1],
                                'Good': userRating[3],
                                'Average': userRating[5],
                                'Poor': userRating[7],
                                'Terrible': userRating[9]
                    }

    amenity_dict = {'Hotel Amenities': amenities}
    room_types = {'Room Types':room_types}

    data = {
            'address': address,
            'ratings': ratings,
            'amenities': amenity_dict,
            'official_description': official_description,
            'room_types': room_types,
            'rating': rating,
            'review_count': review_count,
            'name': name,
            'rank': rank,
            'highlights': newRoomFeatures,
            'hotel_url': url
    }
    return data

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('url', help='Tripadvisor hotel url')
    args = parser.parse_args()
    url = args.url
    scraped_data = process_request(url)
    if scraped_data:
        print("Writing scraped data")
        with open('tripadvisor_hotel_scraped_data.json', 'w') as f:
            json.dump(scraped_data, f, indent=4, ensure_ascii=False)

The first step is to write import statements. These will allow you to use methods from various libraries, including lxml and Selenium.

Besides selenium and lxml, this tutorial uses three modules: json, argparse, and sleep.

Here are all the libraries and modules imported into this code:

lxml for parsing the source code of a website
json for handling JSON files
argparse to push arguments from the command line
sleep to instruct the program to wait a certain amount of time before moving on to the next step
By, webdriver, and Keys from Selenium to interact with the page

from lxml import html
import json
import argparse
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep

This code has three defined functions:

clean() removes unnecessary spaces.
process_request() takes the URL as the argument, visits the website, and returns the page source.
process_page() parses the source code, extracts the details, and returns the data.

Request Processing and Page Navigation

process_request() takes a URL and visits the webpage using Selenium WebDriver.

def process_request(url):
    driver = webdriver.Chrome()
    driver.get(url)
    sleep(4)    
    htmls = driver.find_element(By.TAG_NAME,"html")
    sleep(3)
    htmls.send_keys(Keys.END)
    sleep(10)

This code uses webdriver.Chrome() to launch the Selenium browser and the get() method to visit the page.

Since the page uses lazy loading, you need to scroll to load details such as reviews and ratings. To do so, the above code simulates pressing the ‘END’ key.

You can then use Selenium’s page_source attribute to get the source code. The process_request() also parses the code using lxml and returns the function process_page().

 response = driver.page_source
   parser = html.fromstring(response, url)
   return process_page(parser, url)

Data Cleaning

The clean() function removes extra whitespace from extracted text elements.

def clean(text):
    if isinstance(text, list):
        text = ' '.join(text)
    return ' '.join(text.strip().split())

XPath Definitions and Data Extraction

process_page() takes the url and the parsed data as arguments and uses XPaths for data extraction.

Finding the XPaths can be challenging; you must analyze the webpage’s HTML code and figure it out based on the location and attributes of your element.

Ensure that when you build an XPath, you use a static class. Tripadvisor uses JavaScript to display certain elements, and these have dynamic classes. These classes differ for each hotel. Therefore, building the XPath from the element’s position is better.

For example, say your required data is a text inside a span, and the span is inside a div that is inside another. Then, you can use “//div/div/span/text()” to extract the data.

After determining the XPaths, store them in a variable, making it easy to update whenever they change.

def process_page(parser, url):

    XPATH_NAME = '//h1[@id="HEADING"]//text()'
    XPATH_RANK = "//div[@id='ABOUT_TAB']//div[contains(text(),'#')]/text()"
    XPATH_AMENITIES = '//div[contains(text(),"Property amenities")]/following::div[1]//text()'
    XPATH_ROOM_FEATURES = '//div[@data-test-target="hr-about-group-room_amenities"]/following::div[1]//text()'
    XPATH_OFFICIAL_DESCRIPTION = '//div[@data-automation="aboutTabDescription"]//text()'
    XPATH_ROOM_TYPES = '//div[contains(text(),"Room types")]/following::div[1]//text()'
    XPATH_FULL_ADDRESS_JSON = "//div[@id='LOCATION']//span/text()"
    XPATH_RATING ="//svg[@data-automation='bubbleRatingImage']/title/text()"
    XPATH_USER_REVIEWS = "//div[@id='REVIEWS']//div[text()='Good']/ancestor::div[4]//text()"
    XPATH_REVIEW_COUNT = "//div[@id='ABOUT_TAB']//div[@data-automation='bubbleReviewCount']//text()"

Here, the code gets:

Name from an h1 tag
Rank from a div tag that has the text ‘#’
Amenities from div tag that is the sibling of another div tag with the text ‘Property amenities’
Room features from a div tag that is the sibling of another div tag with the attribute data-test-target=”hr-about-group-room_amenities”
Description from a div element with the attribute data-automation=”aboutTabDescription”
Room types from a div element that is a sibling of another div element with the text ‘Room types’
Address from a span element inside a div element with ID ‘LOCATION’
Rating from an SVG element with the attribute data-automation=’bubbleRatingImage’
Review distribution from the fourth ancestor of a div element with the text ‘Good’
Review count from a div element with the attribute data-automation=’bubbleReviewCount’

Now, you can use these variables in the xpath() method to extract the corresponding data point.

 raw_name = parser.xpath(XPATH_NAME)
   raw_rank = parser.xpath(XPATH_RANK)
   raw_amenities = parser.xpath(XPATH_AMENITIES)
   raw_room_features = parser.xpath(XPATH_ROOM_FEATURES)
   raw_official_description = parser.xpath(XPATH_OFFICIAL_DESCRIPTION)
   raw_room_types = parser.xpath(XPATH_ROOM_TYPES)
   raw_address = parser.xpath(XPATH_FULL_ADDRESS_JSON)
   raw_rating = parser.xpath(XPATH_RATING)
   raw_review_count = parser.xpath(XPATH_REVIEW_COUNT)
   raw_userRating = parser.xpath(XPATH_USER_REVIEWS)

Data Processing and Storage

After extracting each data point, process_page() cleans the data using the clean() function

room_types = clean(raw_room_types)
   name = clean(raw_name).split('Someone')[0]
   rank = clean(raw_rank).split()[0]
   amenities = clean(raw_amenities)
   official_description = clean(raw_official_description)
   address = clean(raw_address[0])
   newRoomFeatures = clean(raw_room_features)
   rating = raw_rating[0].split()[0]
   userRating = clean(raw_userRating).split()
   review_count = clean(raw_review_count).replace('(', '').replace(')', '').replace('reviews','').strip()

The USER_RATING XPath gets the text content of the element holding the rating count distributed into Excellent, Good, Average, Poor, and Terrible.

Then, the code splits the text, and uses appropriate indices to scrape Tripadvisor review count distribution and store them in a dictionary.

ratings = {
                                'Excellent': userRating[1],
                                'Good': userRating[3],
                                'Average': userRating[5],
                                'Poor': userRating[7],
                                'Terrible': userRating[9]
                    }

After that, process_page stores the cleaned data as an object and returns it.

amenity_dict = {'Hotel Amenities': amenities}
    room_types = {'Room Types':room_types}

    data = {
            'address': address,
            'ratings': ratings,
            'amenities': amenity_dict,
            'official_description': official_description,
            'room_types': room_types,
            'rating': rating,
            'review_count': review_count,
            'name': name,
            'rank': rank,
            'highlights': newRoomFeatures,
            'hotel_url': url
    }
       
    return data

Main Execution

Finally, use json.dump to write the extracted data as a JSON file.

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('url', help='Tripadvisor hotel url')
    args = parser.parse_args()
    url = args.url
    scraped_data = process_request(url)
    if scraped_data:
        print("Writing scraped data")
        with open('tripadvisor_hotel_scraped_data.json', 'w') as f:
            json.dump(scraped_data, f, indent=4, ensure_ascii=False)

Here’s a flowchart showing the high-level execution of the code:

Flowchart showing the high-level execution of the code for scraping Tripadvisor

Here is the JSON data extracted using the code.

{
    "address": "3700 West Flamingo Road, Las Vegas, NV 89103",
    "ratings": {
        "Excellent": "8,298",
        "Good": "7,414",
        "Average": "5,091",
        "Poor": "2,866",
        "Terrible": "3,348"
    },
    "amenities": {
        "Hotel Amenities": "Free parking Free High Speed Internet (WiFi) Fitness Centre with Gym / Workout Room Pool Bar / lounge Casino and Gambling Game room Children Activities (Kid / Family Friendly)"
    },
    "official_description": "Rio Hotel & Casino is a vibrant oasis just one block off the famed Las Vegas Strip on Flamingo Road, with convenient access to Harry Reid International Airport and I-15. With an array of entertainment options, spacious 580-square-foot rooms, and dynamic dining experiences, this iconic resort delivers the complete Las Vegas experience to its guests. Newly renovated rooms are available for booking now. Read more",
    "room_types": {
        "Room Types": "Mountain view City view Landmark view Pool view Bridal suite Non-smoking rooms Suites"
    },
    "rating": "3.5",
    "review_count": "27,017",
    "name": "Rio Hotel & Casino",
    "rank": "#15",
    "highlights": "Blackout curtains Air conditioning Desk Dining area Refrigerator Cable / satellite TV Sofa bed Walk-in shower",
    "hotel_url": "https://www.tripadvisor.in/Hotel_Review-g45963-d91673-Reviews-Rio_Hotel_Casino-Las_Vegas_Nevada.html"
}

Limitations of the Code

This code can successfully extract the hotel details. However, Tripadvisor will force you to solve CAPTCHAs if you use the code too frequently. You need CAPTCHA solvers to address that, which this tutorial does not discuss.

Therefore, this code for Tripadvisor data scraping is unsuitable for large-scale projects.

Moreover, you must update the XPaths whenever Tripadvisor changes its website structure. You need to analyze the webpage again to figure out the new XPaths.

Why Use a Web Scraping Service

You can start learning how to scrape Tripadvisor using this tutorial.

However, remember that you must include additional code to bypass anti-scraping measures. You also need to watch Tripadvisor’s website for any changes in the structure and update the XPaths, or your code won’t succeed.

This is where a web scraping service shines.

A web scraping service like ScrapeHero can take care of all the problems mentioned above. Leave all the coding to us; we can build enterprise-grade web scrapers to gather the data you need, including travel, airline, and hotel data.

Frequently Asked Questions

Is Tripadvisor data scraping legal?

Legality depends on the legal jurisdiction, i.e., laws specific to the country and the locality. Gathering or scraping publicly available information is not illegal. Generally, web scraping Tripadvisor is legal if you are scraping publicly available data.

Can you scrape reviews from Tripadvisor?

Reviews are also public on Tripadvisor; hence, you can scrape them using Python. However, you must figure out the XPaths of the reviews.

Published on: March 12, 2024

Services

How to Scrape Tripadvisor Using Python lxml

Table of contents

Set Up the Environment for Tripadvisor Data Scraping

Data Extracted from Tripadvisor

Tripadvisor Data Scraping: The code

Request Processing and Page Navigation

Data Cleaning

XPath Definitions and Data Extraction

Data Processing and Storage

Main Execution

Limitations of the Code

Why Use a Web Scraping Service

Frequently Asked Questions

Table of contents

Scrape any website, any format, no sweat.

Ready to turn the internet into meaningful and usable data?

Continue Reading

Gain the Competitive Edge: Detect Stock Outs on Competitor Listings

Best for Pricing Intelligence: Scraping vs. Native APIs

Amazon Buy Box Monitoring: How to Stop Sales Drops