How to Scrape Expedia using Python and LXML

Gathering flight data manually is a mammoth task. There are hundreds of thousands of combinations of airports, routes and timings, along with ever-changing prices. Ticket prices tend to vary daily (or even hourly), and a large number of flights are available each day. Web scraping is one way to keep track of this data. In this tutorial, we will scrape Expedia, a leading travel booking website, to extract details on flights. Our scraper will extract the flight schedules and prices for a source and destination pair.

Here is a list of fields that we will be extracting:

  1. Arrival Airport
  2. Arrival Time
  3. Departure Airport
  4. Departure Time
  5. Plane Name
  6. Airline
  7. Flight Duration
  8. Plane Code
  9. Ticket Price
  10. Number of Stops

 

[Screenshot: some of the flight details we will be extracting]

Scraping Logic

  1. Construct the URL of the search results from Expedia. Here is one for the available flights listed from New York to Miami: https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:New%20York,%20NY%20(NYC-All%20Airports),to:Miami,%20Florida,departure:04/01/2017TANYT&passengers=children:0,adults:1,seniors:0,infantinlap:Y&mode=search
  2. Download HTML of the search result page using Python Requests.
  3. Parse the page using LXML. LXML lets you navigate the HTML tree structure using XPaths. We have predefined the XPaths for the details we need in the code.
  4. Save the data to a JSON file. You can later modify this to write to a database.
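
Step 1 can be sketched as a small helper. This is only an illustration: the function name build_search_url is ours, and the query parameters are copied from the example URL above, so they may no longer match what Expedia expects.

```python
from urllib.parse import quote


def build_search_url(source, destination, date):
    """Build a one-way, one-adult search URL in the format of the example above."""
    # Characters the example URL leaves unescaped; spaces still become %20.
    safe = ":,/()"
    leg = "from:{},to:{},departure:{}TANYT".format(source, destination, date)
    passengers = "children:0,adults:1,seniors:0,infantinlap:Y"
    return (
        "https://www.expedia.com/Flights-Search?trip=oneway"
        "&leg1=" + quote(leg, safe=safe)
        + "&passengers=" + quote(passengers, safe=safe)
        + "&mode=search"
    )


print(build_search_url("nyc", "mia", "04/01/2017"))
```

Passing "New York, NY (NYC-All Airports)" and "Miami, Florida" as the source and destination reproduces the example URL from step 1.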

Requirements

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements.

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

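Install Packages

The scraper needs two packages: Python Requests, to download the page, and LXML, to parse it. Both can be installed with pip, for example with pip3 install requests lxml.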
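
Before running the scraper, it can help to confirm that the two packages this tutorial relies on, Python Requests and LXML, are importable. The helper below is a small sketch of our own (missing_packages is not part of the tutorial's code) and uses only the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "requests" and "lxml" are the import names of the two packages used here.
missing = missing_packages(["requests", "lxml"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```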

The Code

The code is self-explanatory.

https://gist.github.com/scrapehero/bc34513e2ea72dc0890ad47fbd8a1a4f

If the embed above doesn’t work, you can download the code from the gist link above.

If you would like the code in Python 2, you can check out the link here.
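
Since the gist is not reproduced on this page, here is a minimal sketch of the parsing step rather than the tutorial's actual code. It assumes the search page embeds its flight data as JSON inside a script element whose id is cachedResultsJson, and that each entry under legs has a departureLocation with an airportCity key. Those names appear in the reader comments further down; the sample page, the arrivalLocation key and the function name extract_legs are invented for illustration, and Expedia's real markup may differ.

```python
import json

from lxml import html

# A made-up page fragment standing in for a downloaded search result.
SAMPLE_PAGE = """
<html><body>
  <script id="cachedResultsJson" type="application/json">
    {"legs": {"leg-1": {"departureLocation": {"airportCity": "New York"},
                        "arrivalLocation": {"airportCity": "Miami"}}}}
  </script>
</body></html>
"""


def extract_legs(page_source):
    """Return a list of departure/arrival cities found in the embedded JSON."""
    parser = html.fromstring(page_source)
    # XPath: the text inside the script element with the assumed id.
    raw = parser.xpath("//script[@id='cachedResultsJson']//text()")
    if not raw:
        return []
    data = json.loads(raw[0])
    return [
        {
            "departure": leg["departureLocation"]["airportCity"],
            "arrival": leg["arrivalLocation"]["airportCity"],
        }
        for leg in data["legs"].values()
    ]


print(extract_legs(SAMPLE_PAGE))
```

Returning an empty list when the script element is missing keeps the caller in control of what to do when the page layout changes.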

Running The Expedia Scraper

Assume the script is named expedia.py. If you type the script name in a command prompt or terminal along with -h, it prints the usage message below:

usage: expedia.py [-h] source destination date

positional arguments:
  source       Source airport code
  destination  Destination airport code
  date         MM/DD/YYYY

optional arguments:
  -h, --help   show this help message and exit

 

The arguments source and destination are the airport codes for the source and destination airports. The date argument should be in the format MM/DD/YYYY.
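
A command-line interface like the one above can be sketched with argparse. The argument names and help strings come from the usage message; the date check (valid_date) is our own addition, since the help text does not say whether the original script validates the date.

```python
import argparse
from datetime import datetime


def valid_date(text):
    """Reject anything strptime cannot read as month/day/year; return the text unchanged."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        raise argparse.ArgumentTypeError("date must be MM/DD/YYYY, got %r" % text)
    return text


def build_parser():
    parser = argparse.ArgumentParser(prog="expedia.py")
    parser.add_argument("source", help="Source airport code")
    parser.add_argument("destination", help="Destination airport code")
    parser.add_argument("date", type=valid_date, help="MM/DD/YYYY")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["nyc", "mia", "04/01/2017"])
print(args.source, args.destination, args.date)  # prints: nyc mia 04/01/2017
```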

As an example, to find the flights listed from New York to Miami, we would pass the arguments like this:

python3 expedia.py nyc mia 04/01/2017

This will create a JSON output file called nyc-mia-flight-results.json in the same folder as the script.

The output file will look similar to this:

[
  {
    "arrival": "Miami Intl., Miami",
    "timings": [
      {
        "arrival_airport": "Miami, FL (MIA-Miami Intl.)",
        "arrival_time": "12:19a",
        "departure_airport": "New York, NY (LGA-LaGuardia)",
        "departure_time": "9:00p"
      }
    ],
    "airline": "American Airlines",
    "flight duration": "1 days 3 hours 19 minutes",
    "plane code": "738",
    "plane": "Boeing 737-800",
    "departure": "LaGuardia, New York",
    "stops": "Nonstop",
    "ticket price": "1144.21"
  },
  {
    "arrival": "Miami Intl., Miami",
    "timings": [
      {
        "arrival_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
        "arrival_time": "11:15a",
        "departure_airport": "New York, NY (LGA-LaGuardia)",
        "departure_time": "9:11a"
      },
      {
        "arrival_airport": "Miami, FL (MIA-Miami Intl.)",
        "arrival_time": "8:44p",
        "departure_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
        "departure_time": "4:54p"
      }
    ],
    "airline": "Republic Airlines As American Eagle",
    "flight duration": "0 days 11 hours 33 minutes",
    "plane code": "E75",
    "plane": "Embraer 175",
    "departure": "LaGuardia, New York",
    "stops": "1 Stop",
    "ticket price": "2028.40"
  }
]
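
Once the file exists, it can be loaded back for further processing. The sketch below assumes the output is a JSON list of flight objects with the keys shown above, and that a failed run instead writes a single object with an "error" key, as some readers report in the comments. The function name cheapest_first is ours.

```python
import json


def cheapest_first(path):
    """Load the scraper's JSON output and return the flights sorted by price."""
    with open(path) as handle:
        flights = json.load(handle)
    # A failed run is reported as {"error": "..."} rather than a list of flights.
    if isinstance(flights, dict):
        raise ValueError(flights.get("error", "unexpected output format"))
    # Prices are stored as strings such as "1144.21", so convert before comparing.
    return sorted(flights, key=lambda flight: float(flight["ticket price"]))
```

Calling cheapest_first("nyc-mia-flight-results.json") after a successful run should return the same flights with the lowest fare first.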

 

You can download the code at https://gist.github.com/scrapehero/bc34513e2ea72dc0890ad47fbd8a1a4f 

Let us know in the comments how this scraper worked for you.

Known Limitations

This scraper should work for extracting most flight details available on Expedia unless the website structure changes drastically. If you would like to scrape the details of thousands of pages at very short intervals, this scraper is probably not going to work for you. For that, you should read “Scalable do-it-yourself scraping – How to build and run scrapers on a large scale” and “How to prevent getting blacklisted while scraping”.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code; however, if you add your questions in the comments section, we may periodically address them.

Responses

Kevin Smith October 13, 2017

This was an exciting tutorial as I am new to python, however I received the error below after running the code:

Traceback (most recent call last):
  File "expedia.py", line 100, in <module>
    scraped_data = parse(source,destination,date)
  File "expedia.py", line 24, in parse
    departure_location_airport = flight_data['legs'][i]['departureLocation']['airportLongName']
KeyError: 'airportLongName'

I’m researching what this error message means, but if you all happen to know what the issue is, I would love to know myself! Cheers!

    Imanol October 17, 2017

    Hi, I’m having exactly the same problem. Did you solve it?
    Thanks

    Kaitlyn October 31, 2017

    Replace 'airportLongName' with 'airportCity'.
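
A slightly more defensive variant of this fix is to try both keys, so the lookup keeps working whichever one the site returns. This is only a sketch: the two key names come from the traceback and the suggestion in this thread, and the helper name airport_name is hypothetical.

```python
def airport_name(location):
    """Return the first airport label present in a location dictionary, or None."""
    # Key names taken from the traceback and the suggested fix in this thread.
    for key in ("airportLongName", "airportCity"):
        if location.get(key):
            return location[key]
    return None
```

In the line from the traceback, the lookup would then become airport_name(flight_data['legs'][i]['departureLocation']).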

Marcin January 15, 2018

I received the following errors:
  File "C:\Users\user1\PycharmProjects\Scraping_Flight1\search.py", line 95, in <module>
    scraped_data = parse(source, destination, date)
  File "C:\Users\user1\PycharmProjects\Scraping_Flight1\search.py", line 16, in parse
    parser = html.fromstring(response.text, headers=headers, verify=False)
  File "C:\Users\user1\ScrapeFlight_1\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Users\user1\ScrapeFlight_1\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\etree.pyx", line 3215, in lxml.etree.fromstring (src\lxml\etree.c:80983)
TypeError: fromstring() got an unexpected keyword argument 'headers'

Can you help me with that?

    Shweta November 22, 2018

    {
      "error": "failed to process the page"
    }

    I keep getting this in my json file. Please help as to what might be the problem

      rijesh November 23, 2018

      It looks like the site stopped working for older browsers, so we’ve updated the code to use the latest user agents. Please try now.

        Chito Leung November 24, 2018

        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

Paul Soniat November 30, 2018

Really great, worked like a charm for me. Thank you for this

Dafni December 1, 2018

Really great! One thing: what currency is the pricing in?

    ScrapeHero December 2, 2018

    Thank you Dafni. The currency should be what you see when you try the website in your browser, if you are running this script from your computer.

John December 9, 2018

Thank you for putting this together. I’ve been looking forward to giving it a go, but I keep running into a “Certificate Verification” error.

“InsecureRequestWarning)”

I have added certificate verification code, but still nothing.
Here is what I tried

import certifi
import urllib3
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())
http.request('GET', 'https://www.expedia.com')

Any help would be greatly appreciated.
Regards.

    rijesh December 10, 2018

    Please remove verify=False from requests.get(url, headers=headers, verify=False) and try again.

      John December 12, 2018

      Thank you, issue resolved.
      Now, the next step to take this to the next level is to add the necessary code to scrape for round trips not just one way.
      I took a shot at it, but last time I coded was over 20 years ago (not in Python) so there is a steep learning curve.
      Thanks again for your help.

Geo Netz December 11, 2018

I keep on receiving this error, anyone know why?

usage: ipykernel_launcher.py [-h] source destination date
ipykernel_launcher.py: error: the following arguments are required: destination, date
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2969: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

    rijesh December 12, 2018

    Can you please share the full traceback?

      rijesh December 12, 2018

      To run the code, you have to provide Source Airport Code, Destination Airport Code and Date Of Journey.

      python3 expedia.py source destination date

      For example:

      python3 expedia.py nyc mia 04/01/2019

      where nyc is the source airport and mia is the destination airport.
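
The ipykernel_launcher.py in the usage line suggests the script was run inside a Jupyter notebook, where argparse reads the kernel's own command line instead of yours. One possible workaround, sketched here on the assumption that the script builds its parser roughly as its help text implies, is to hand the arguments to parse_args explicitly.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("source", help="Source airport code")
parser.add_argument("destination", help="Destination airport code")
parser.add_argument("date", help="MM/DD/YYYY")

# Inside a notebook, sys.argv holds the kernel's own arguments, so pass ours explicitly.
args = parser.parse_args(["nyc", "mia", "04/01/2019"])
print(args.source, args.destination, args.date)
```

Running the script from a terminal, as in the example above, avoids the problem without any code change.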

J.J January 2, 2019

I was able to make the script run correctly in the past, but today when running it I receive this in the results file:
{
  "error": "failed to process the page"
}

I appear to have the updated headers etc., so I’m not sure if there is a bigger issue that I can’t see due to my limited skills in this area.

    rijesh January 2, 2019

    Please share the input arguments. We will look into it.

      J.J January 3, 2019

      Thanks. I am unsure of what I changed, or if it was an issue on the Expedia side of things, but I have just run the script again and it is working fine for me now. I am just using these arguments: “expedia nyc mia 05/05/2019”

Zachary Davidson February 4, 2019

I’m getting the following error. Any idea why? Thanks!

File “C:/Users/Zachary Davidson/PycharmProjects/scraping1/expedia.py”, line 13
headers = {‘User – Agent’: ‘Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 70.0.3538.77 Safari / 537.36’}
^
SyntaxError: invalid character in identifier

Process finished with exit code 1

    ScrapeHero February 4, 2019

    Seems like a copy-paste issue.

    Can you remove the spaces before and after the dash in the header name, so that the line reads as below?

        headers = {'User-Agent': 'Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 70.0.3538.77 Safari / 537.36'}
    
Jim March 18, 2019

python3 expedia.py nyc mia 05/25/2019
works.
python3 expedia.py nyc yvr 05/25/2019
does not. I get “Rerying…” and output includes “failed to process page”
YVR is Vancouver, Canada.
Further
python3 expedia.py nyc mia 09/25/2019
Also fails.
I can do these searches on the website.
Any ideas? Thanks

    rijesh March 20, 2019

    Can you please check the status code you get while running the code? To get the status code, please put the line “print(response.status_code)” just after “response = requests.get(url, headers=headers, verify=False)”. The code works for all the inputs mentioned.

Mahmoud Sabri May 27, 2019

Thanks a lot! this is very useful.

zach June 27, 2019

Why is it when I enter more adults or more children it is unable to run the script? I imagine it has something to do with going out of bounds, but I don’t see anywhere else where passengers/adults/children is used. Thanks!

Nora August 28, 2019

Thanks a lot. It is so useful, but I faced one issue: it scraped just 35 trips, even though there are more than 50 trips. Has anyone else faced this issue?

    Zach October 23, 2019

    When you run the same code above, is it still working for you? My cachedResultsJson script is returning nothing, therefore there’s no data I can grab. Thanks!

      drenaskillshop December 6, 2019

      The URL needs to be updated! The website has changed!

Michelle February 29, 2020

Hi. I’ll be trying this code. I am looking to extract the flights between two cities and save them in a local file. Thanks

Théo July 4, 2020

Hi, first, many thanks for sharing your work!
With a URL that works fine in a browser, I get “error”: “failed to process the page” in the JSON file.
Do you know why?
Thanks a lot.
