Scrape Glassdoor Job Data Using Python and LXML

Aggregating job postings from the web is difficult because manually extracting data from websites is time-consuming. Web scraping is a practical way to build job data feeds if you are looking for jobs in a city or within a specific salary range.

This tutorial is about extracting details of job listings based on a particular job name and location. You can scrape the estimated salary and job ratings, or go a bit further and scrape jobs based on the number of miles from a particular city. By scraping Glassdoor jobs, you can find job listings over a certain time period and identify when job postings are listed and removed, in order to analyze which jobs are trending.

In this tutorial, we will scrape Glassdoor.com, one of the fastest growing job recruiting sites. The scraper will extract the data fields for a particular job name in a given location.

Here is the list of fields that we will be extracting:

  1. Job Name
  2. Company
  3. State
  4. City
  5. Salary
  6. URL

Here is a screenshot of the data we will be extracting:

[Screenshot: scraping-glassdoor-jobs]

Scraping Logic

  1. First, construct the URL for the search results from Glassdoor. Since we will be extracting listings by job name and location, here is the search URL for Android developer jobs in Boston, Massachusetts: https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=Android+Developer&sc.keyword=Android+Developer&locT=C&locId=1154532&jobType=
  2. Download HTML of the search result page using Python Requests.
  3. Parse the page using LXML, which lets you navigate the HTML tree structure using XPaths. The XPaths for the details we need are predefined in the code.
  4. Save the data to a CSV file. In this article, we are only scraping the job name, company, location and estimated salary from the first page of results, so a CSV file is enough to hold all the data. If you would like to extract details in bulk, a JSON file is preferable. You can read about choosing your data format to be sure.
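
Steps 1 and 3 could look roughly like the sketch below. It is not the tutorial's actual script: build_search_url, parse_listings, the sample markup and the XPaths are hypothetical names and values chosen for illustration, and Glassdoor's real markup is different and changes over time. The query parameters are copied from the example URL in step 1. The download in step 2 is left out so the sketch runs offline; in practice the page source would come from a Requests call such as requests.get(url).

```python
from urllib.parse import urlencode

from lxml import html

# Search endpoint taken from the example URL in step 1.
SEARCH_URL = "https://www.glassdoor.com/Job/jobs.htm"

# Hypothetical markup for illustration only; the real page looks different.
SAMPLE_HTML = """
<ul>
  <li class="job">
    <a class="title" href="/job-listing/1">Android Developer</a>
    <span class="company">Example Corp</span>
    <span class="location">Boston, MA</span>
    <span class="salary">$80k-$110k</span>
  </li>
</ul>
"""


def build_search_url(keyword, location_id):
    """Build a search URL from a job keyword and a Glassdoor location id."""
    params = {
        "suggestCount": 0,
        "suggestChosen": "false",
        "clickSource": "searchBtn",
        "typedKeyword": keyword,
        "sc.keyword": keyword,
        "locT": "C",
        "locId": location_id,
        "jobType": "",
    }
    # urlencode turns spaces into "+", as in the example URL.
    return SEARCH_URL + "?" + urlencode(params)


def parse_listings(page_source):
    """Extract one dict per listing using (assumed) XPaths."""
    tree = html.fromstring(page_source)
    jobs = []
    for node in tree.xpath('//li[@class="job"]'):
        location = node.xpath('string(.//span[@class="location"])').strip()
        # Split "City, ST" into its two parts.
        city, _, state = location.partition(", ")
        jobs.append({
            "Job Name": node.xpath('string(.//a[@class="title"])').strip(),
            "Company": node.xpath('string(.//span[@class="company"])').strip(),
            "State": state,
            "City": city,
            "Salary": node.xpath('string(.//span[@class="salary"])').strip(),
            "URL": node.xpath('string(.//a[@class="title"]/@href)'),
        })
    return jobs


url = build_search_url("Android Developer", 1154532)
jobs = parse_listings(SAMPLE_HTML)
```

Using the XPath string() function returns an empty string instead of raising an error when a field is missing, which keeps the loop simple when some listings have no salary.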

Requirements

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Packages

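For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements:

  1. Python Requests, to download the HTML of the search result page
  2. Python LXML, to parse the HTML tree structure using XPaths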

The Code

https://gist.github.com/scrapehero/352286d0f9dee87990cd45c3f979e7cb

If the embed above does not work, you can download the code from https://gist.github.com/scrapehero/352286d0f9dee87990cd45c3f979e7cb

Running the Scraper

The name of the script is glassdoor.py. If you type the script name in a terminal or command prompt with the -h flag, you will see the following usage message:

usage: glassdoor.py [-h] keyword place

positional arguments:
  keyword   job name
  place     job location

optional arguments:
  -h, --help  show this help message and exit
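
A command-line interface with this usage message can be defined with Python's argparse module. The sketch below is an assumption about how that might be done, not a copy of the script; build_parser is a hypothetical helper name, and the arguments are passed in as a list so the example runs without a real command line.

```python
import argparse


def build_parser():
    """Define two positional arguments matching the usage message above."""
    parser = argparse.ArgumentParser(prog="glassdoor.py")
    parser.add_argument("keyword", help="job name")
    parser.add_argument("place", help="job location")
    return parser


# In a real script, parse_args() with no arguments reads sys.argv instead.
args = build_parser().parse_args(["Android developer", "Boston"])
print(args.keyword, args.place)
```

argparse adds the -h/--help option automatically, so only the two positional arguments need to be declared.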

The argument “keyword” is a keyword related to the job you are searching for, and the argument “place” is the location in which to search. For example, to find Android developer jobs in Boston, we would run the script like this:

python3 glassdoor.py "Android developer" "Boston"

This will create a CSV file named Android developer-Boston-job-results.csv in the same folder as the script. Here is some of the data extracted from Glassdoor by the command above:

[Screenshot: scraping-glassdoor]
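
Writing the results to a file with that name could be sketched as follows. The column names are taken from the list of fields at the top of this tutorial; results_filename and write_results are hypothetical helper names, and the actual script may use different headers.

```python
import csv

# Column headers assumed from the tutorial's field list.
FIELDNAMES = ["Job Name", "Company", "State", "City", "Salary", "URL"]


def results_filename(keyword, place):
    """Build the output file name from the two command-line arguments."""
    return "%s-%s-job-results.csv" % (keyword, place)


def write_results(path, rows):
    """Write a header row followed by one CSV row per job dict."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_results(results_filename("Android developer", "Boston"), jobs) with a list of dicts keyed by these headers would then produce the file described above.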

You can download the code at https://gist.github.com/scrapehero/352286d0f9dee87990cd45c3f979e7cb

Let us know in the comments how this scraper worked for you.

Known Limitations

This scraper should work for extracting most job listings on Glassdoor unless the website structure changes drastically. If you would like to scrape the details of thousands of pages at very short intervals, this scraper is probably not going to work for you. In that case, you should read "Scalable do-it-yourself scraping – How to build and run scrapers on a large scale" and "How to prevent getting blacklisted while scraping".

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code; however, if you add your questions in the comments section, we may periodically address them.

Responses

Piotr Zieliński March 2, 2018

Thanks for a great article! I am currently working on a project where I also need the date a job offer was posted. Could you advise me how I can scrape it? Regards


Josh Dombal August 14, 2018

When I run this it only grabs 31 lines of data??


    Yinka November 8, 2019

    Hi Josh, were you able to solve this?


      Bilal May 6, 2020

      hello yinka, did you solve the issue


Rhaos September 26, 2018

What does “location_url= ” do?
Please describe.


    Bradley Aaron Kohler June 8, 2019

    location_url is a url at which we can extract the place_id. If you look at a standard glassdoor url you will see that the location is not shown in the url. This is because locations are stored as a place_id. To perform an html post request the place_id must first be extracted by some means.


piyush June 8, 2019

what is the process to run this code

python -h
after i’m getting and
but not able to run this code.

