How to Scrape LinkedIn using Python

LinkedIn is one of the largest professional social networks in the world and a rich source of company and job data. Using web scraping, you can gather these fields in a structured form for analysis. We are glad you came here to learn how to scrape LinkedIn, and we won’t disappoint you. In this tutorial we will show you how to scrape the data on a LinkedIn company page.

Here are the steps to scrape LinkedIn:

  1. Download and install the latest version of Python
  2. Copy and run the code provided

Scrape job listings from LinkedIn

ScrapeHero Cloud has a pre-built crawler that lets you scrape LinkedIn jobs for as low as $5.

No coding and no setup required – just provide URLs to start scraping!

Get started with scraping LinkedIn for job listings

Here are the fields we will be scraping from LinkedIn:

  1. Company Name
  2. Website
  3. Description
  4. Date founded
  5. Address – Street, city, zip, country
  6. Specialties
  7. Number of followers

Why Scrape LinkedIn?

  1. Job search automation – you want to work for a company that meets some specific criteria, and they are not the usual suspects. You do have a shortlist, but it isn’t really short – it is more like a long list. You wish there were a tool like Google Finance that could help you filter companies based on the criteria they have published on LinkedIn. You can take your “long list”, scrape this information into a structured format, and then, like every programmer before you, build an amazing analysis tool.
     Heck, you could probably even build an app for that and not need that job after all!
  2. Curiosity – not the kind that killed the cat, but you are curious about companies on LinkedIn and want to gather a good, clean set of data to satiate your curiosity.
  3. Tinkering – you just like to tinker, found out you would love to learn Python, and needed something useful to get started.

Well, whatever your reason, you have come to the right place.

In this tutorial we will show you the basic steps for scraping publicly available LinkedIn company pages, such as LinkedIn’s own page or the ScrapeHero page.

Prerequisites for LinkedIn Scraping

For this tutorial, and just like we did for the Amazon scraper, we will stick to basic Python and a couple of Python packages – requests and lxml. We will not use more complicated frameworks such as Scrapy in this tutorial.

You will need to install the following:

  • Python 2.7, available here – https://www.python.org/downloads/ (the code in this tutorial targets Python 2.7; it needs minor changes to run on Python 3)
  • Python Requests, available here – http://docs.python-requests.org/en/master/user/install/ (you might need Python pip to install it, available here – https://pip.pypa.io/en/stable/installing/)
  • Python lxml – learn how to install it here: http://lxml.de/installation.html

Python LinkedIn Scraper

Below is the code to create your own Python LinkedIn scraper. If you are unable to view the code below, it can be downloaded from the gist here

All you need to do is change the URL in this line:

companyurls = ['https://www.linkedin.com/company/scrapehero']

or add more URLs separated by commas to this list
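For example, to scrape both of the company pages used in this tutorial in one run:

```python
# A list of LinkedIn company page URLs to scrape in one run
companyurls = [
    'https://www.linkedin.com/company/scrapehero',
    'https://www.linkedin.com/company/cisco',
]
```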

You can save the file and run it using Python – python filename.py
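If the embedded gist does not load for you, the script follows this general shape. This is a simplified sketch, not the full gist: the `<code>` element id and the field names inside the embedded JSON are assumptions based on the page layout discussed in the comments, and LinkedIn changes them from time to time.

```python
import json

import requests
from lxml import html


def parse_company(page_source, url):
    # LinkedIn embeds the company details as JSON inside a <code> element.
    # The element id below is an assumption that changes as LinkedIn updates
    # its frontend; inspect the page source and adjust it if needed.
    doc = html.fromstring(page_source)
    raw = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
    if not raw:
        return None  # blocked, redirected to login, or the layout changed
    data = json.loads(raw[0])
    return {
        'url': url,
        'company_name': data.get('companyName'),
        'website': data.get('websiteUrl'),
        'follower_count': data.get('followerCount'),
    }


def scrape(companyurls):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}
    results = []
    for url in companyurls:
        response = requests.get(url, headers=headers, verify=False)
        company = parse_company(response.text, url)
        if company:
            results.append(company)
    # Write everything collected to data.json in the current directory
    with open('data.json', 'w') as f:
        json.dump(results, f, indent=4)
```

Call `scrape(companyurls)` at the bottom of the file and then inspect data.json.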

The output will be in a file called data.json in the same directory and will look something like this:

{
        "website": "http://www.scrapehero.com", 
        "description": "ScrapeHero is one of the top web scraping companies in the world for a reason.\r\nWe don't leave you with a \"self service\" screen to build your own scrapers.\r\nWe have real humans that will talk to you within hours of your request and help you with your need.\r\nEven though we are premier provider in this space, our investments in automation have allowed us to provide a completely \"full service\" to you at an affordable cost.\r\nGet in touch with us at https://scrapehero.com and experience our awesome customer service first hand", 
        "founded": 2014, 
        "street": null, 
        "specialities": [
            "Web Scraping Service", 
            "Website Scraping", 
            "Screen scraping", 
            "Data scraping", 
            "Web crawling", 
            "Data as a Service", 
            "Data extraction API", 
            "Scrapy", 
            "DaaS"
        ], 
        "size": "11-50 employees", 
        "city": null, 
        "zip": null, 
        "url": "https://www.linkedin.com/company/scrapehero", 
        "country": null, 
        "industry": "Computer Software", 
        "state": null, 
        "company_name": "ScrapeHero", 
        "follower_count": 41, 
        "type": "Privately Held"
    }


Or, if you run it for Cisco:

companyurls = ['https://www.linkedin.com/company/cisco']

The output will look like this:

{
        "website": "http://www.cisco.com", 
        "description": "Cisco (NASDAQ: CSCO) enables people to make powerful connections--whether in business, education, philanthropy, or creativity. Cisco hardware, software, and service offerings are used to create the Internet solutions that make networks possible--providing easy access to information anywhere, at any time. \r\n\r\nCisco was founded in 1984 by a small group of computer scientists from Stanford University. Since the company's inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies. Today, with more than 71,000 employees worldwide, this tradition of innovation continues with industry-leading products and solutions in the company's core development areas of routing and switching, as well as in advanced technologies such as home networking, IP telephony, optical networking, security, storage area networking, and wireless technology. In addition to its products, Cisco provides a broad range of service offerings, including technical support and advanced services. \r\n\r\nCisco sells its products and services, both directly through its own sales force as well as through its channel partners, to large enterprises, commercial businesses, service providers, and consumers.", 
        "founded": 1984, 
        "street": "Tasman Way, ", 
        "specialities": [
            "Networking", 
            "Wireless", 
            "Security", 
            "Unified Communication", 
            "Telepresence", 
            "Collaboration", 
            "Data Center", 
            "Virtualization", 
            "Unified Computing Systems"
        ], 
        "size": "10,001+ employees", 
        "city": "San Jose", 
        "zip": "95134", 
        "url": "https://www.linkedin.com/company/cisco", 
        "country": "United States", 
        "industry": "Computer Networking", 
        "state": "CA", 
        "company_name": "Cisco", 
        "follower_count": 1201541, 
        "type": "Public Company"
    }

Things to keep in mind before scraping LinkedIn using Python

  1. Since LinkedIn requires you to log in every time you open their website, this code may not work for you.
  2. Use request headers, proxies, and IP rotation to avoid captchas – see How to prevent getting blacklisted while scraping. You can also use Python to solve some basic captchas with an OCR engine called Tesseract.
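A minimal sketch of the rotation idea above – the user-agent strings are abbreviated browser formats, and the proxy addresses are placeholders (RFC 5737 documentation IPs) that you must replace with working proxies of your own:

```python
import itertools
import random

import requests

# Pools to rotate through. The proxy addresses below are placeholders
# (RFC 5737 documentation IPs) - substitute your own working proxies.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
PROXIES = ['http://203.0.113.10:8080', 'http://203.0.113.11:8080']

# Cycle through the proxies so consecutive requests leave from different IPs
proxy_pool = itertools.cycle(PROXIES)


def fetch(url):
    # A fresh user agent per request plus a rotating proxy makes the
    # traffic look less like a single automated client.
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    }
    proxy = next(proxy_pool)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=30)
```

This is no guarantee against blocking – it only spreads requests across identities; aggressive request rates will still get flagged.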


Feel free to change the URLs or the fields you want to scrape. Happy scraping!

Need some help with scraping data?

Turn the Internet into meaningful, structured and usable data



Please DO NOT contact us for help with our tutorials and code using this form or by calling us; instead, please add a comment at the bottom of the tutorial page for help.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

Arjun June 29, 2016

I am not able to scrape data for company page scrapehero


    scrapehero June 29, 2016

    Could you please copy and paste the error messages you are getting?


      autumn August 19, 2016

      Hey there, thanks for the code! I don’t get any error output, only “retrying : https://www.linkedin.com/company/scrapehero” several times.
      It seems the requests.get returns a 999 response since LinkedIn denies access to the page…
      SSLError: hostname ‘fr.linkedin.com’ doesn’t match either of ‘spdy.linkedin.com’, ‘polls.linkedin.com’, ‘cms.linkedin.com’, ‘slideshare.www.linkedin.com’


        Mark January 29, 2017

        I’m getting the same issue. Is there any fix available?


          ScrapeHero January 29, 2017

          You are most likely being blocked by LI – they do block most automated access


          Mark January 30, 2017

          So is there any way around this available?


          ScrapeHero January 30, 2017

          Hi Mark,
          This is the reason professional services exist for this need. You are at a point where it is more than a DIY project.
          Thanks


          Mark January 31, 2017

          Fair enough, thanks 🙂


scrapehero July 27, 2016

If you are getting an error about lxml “ImportError: No module named lxml”, you will need to install lxml using the command pip install lxml on the command prompt


    Doug September 28, 2016

    I ran “pip install lxml” but am still getting “ImportError: No module named lxml”. Any idea why?


      Doug September 28, 2016

      Figured it out, was using python 3


        ScrapeHero September 28, 2016

        Glad it was resolved Doug.


Helen August 11, 2016

It works great for company profiles! Thank you! But it seems that similar approach does not work for personal profiles. Any idea why?


    ScrapeHero August 11, 2016

    Hi Helen,
    This tutorial is designed for company profiles only.
    It will need modifications to fit the structure of a personal profile because the pages are structured very differently.
    We will keep that on our list for a followup tutorial.
    Thanks


      Soup_Sandwich May 15, 2017

      Hi,

      Has there been a tutorial for this yet?

      Thanks.


    ScrapeHero September 16, 2016

    It seems like your IP address may be blocked by LI


Niraj gupta September 25, 2016

How do you write an XPath for any website? Would you please tell me?


Nitin October 3, 2016

You are amazing. It works like a charm. Can you also help us learn how to automate and rotate IPs so as not to get blocked by LI?


C Park October 5, 2016

keeps telling me the following error?

File “scraper.py”, line 36
response = requests.get(url, headers=headers)
^
IndentationError: unindent does not match any outer indentation level


    ScrapeHero October 5, 2016

    Hi there,
    Python needs lines to be “indented” uniformly – seems like the copy paste or edits changed the indentation so you will need to go to line 36 and check the spaces or tabs before the word response as described in the error message.
    You will need some basic python syntax knowledge if you run into such errors – nothing a google search cannot handle.
    Thanks


    C Park October 5, 2016

    NVM, it works now, but now I get this warning!

    InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.

    for context, I had a csv with rows of company names, so I have a few lines prior to your code that appends each company name in that list into the companyurl format that linkedin requires for the type of info i’m looking for (size):

    urlstring_1 = ‘https://www.linkedin.com/vsearch/f?type=all&keywords=’
    urlstring_2 = ‘&orig=GLHD&rsid=&pageKey=oz-winner&trkInfo=tarId%3A1475627803218&trk=global_header&search=Search’

    companyurl_list = [urlstring_1 + x + urlstring_2 for x in company_list]


      ScrapeHero October 5, 2016

      Looks like a SSL warning and as long as you are getting the data this warning shouldn’t matter – you are not sending your credit card over this connection – cheers


        C Park October 5, 2016

        you’re right – it works fine!

        That being said, it worked for the first time i did it. Then I tried it again a few min later and started getting null values…wondering if they know it’s not a browser action/”real” user doing the search?


          ScrapeHero October 5, 2016

          Most likely your IP has been flagged. You can wait a while to get unblocked or read through other posts on our site on how to overcome blocking.
          Good luck


    Pandora December 31, 2016

    Aren’t they just selectively removing competition of the remotest form ? Is that even legal ? Man! I’m pissed off reading what happened.!


ScrapeHero December 31, 2016

LinkedIn has removed the ScrapeHero page (you are welcome LI/M$FT!) so please use a different URL as an example for the code.


Pandora January 3, 2017

Hey,

I wanted to take this one step further and scrap the names of the people in the following page:

https://www.linkedin.com/vsearch/p?f_CC=2738480&trk=rr_connectedness

these are all the employees working in a company called Scripbox. Getting the names from this search result can give us a lot more data to work with. I tried to crawl through this page using your same headers but it seems to fail and I’m being redirected to a login page. So I’m guessing it requires the user to be logged in. My question is how do we define the right session in the request headers. I’m unable to find any cookie values after searching source but again I’m new and only learning. Please help.


    Pandora January 7, 2017

    Any help guys ?


      ScrapeHero January 7, 2017

      Hi there – that kind of data is private data and we don’t scrape private data.
      Sorry you will need to figure that out yourself.
      Thanks


        Pandora January 7, 2017

        Oh okay. Sorry about that. Thanks.


Mark January 29, 2017

Hi! Is there a way to scrape based on an input of company name rather than URL (i.e. I would like to be able to feed the scraper a list such as Cisco, IBM, Apple, etc)?


    ScrapeHero January 29, 2017

    Hi Mark,
    Sure it is possible, but it will require writing some code. Please google how to read files in python and use the output to pass to this scraper.


      Mark January 30, 2017

      Yeah I understand, but what I’m after is to get the LinkedIn URL for each company in a list. I can then use these URLs with this amazing scraper. Do you know how to achieve that first step?


Ming February 20, 2017

As of Feb 20, 2017, the script doesn’t seem to work. LinkedIn recently updated their site and I don’t think the script parses the page contents correctly.


    ScrapeHero February 20, 2017

    Thanks for letting us know. We just updated the code to fix this. It should work now.


Andy March 9, 2017

Thanks for the script! Looks like LinkedIn may have updated the frontend again – the script retries 5 times and then stores a null value in data.json, regardless of which company you try.


    ScrapeHero March 9, 2017

    Hi Andy, I am sure they have changed it up. We will check this script for errors periodically and fix it.
    Thanks


ScrapeHero March 13, 2017

We just tested the code and it works. However, the reason for the lack of data is that LI has blocked your IP (maybe even temporarily) and added it to the list where you need to be logged in to get this data.


    ScrapeHero March 14, 2017

    Hi Hoenie,
    Glad you were able to port the code to Python 3 and it worked for you. We did notice that the blocks are still a problem for many other users even though your issue was related to v2 vs v3


Art April 1, 2017

Hi there, Thanks for the post. What is the purpose of retrying if redirect or login? I mean what makes the same URL give different output if tried again from the same IP? In my experience it does not change behavior. Also, how did you get Json by removing comments on the page? Nice trick) Please tell more about this and what section did this id led to ? = id=”stream-promo-top-bar-embed-id-content” How do i find new id for this section in new layout?


Roy nijland May 11, 2017

Hi there Scrapehero!

Script works perfect with the old URL / page structure. I was trying to modify your script to work with the new structure https://www.linkedin.com/company-beta/3975633/, tried to set the xcode path too datafrom_xpath = doc.xpath(‘//code[@id=”datalet-bpr-guid-1379382″]//text()’) as their new api method requests the following:

{“request”:”/voyager/api/organization/companies/3975633?decoration\u003D%28name%2CcompanyPageUrl%2Cpermissions%2Ctype%2CcoverPhoto%2Cdescription%2CstaffCountRange%2CentityUrn%2CfollowingInfo%2CfoundedOn%2Cheadquarter%2Cindustries%2CcompanyIndustries*%2CjobSearchPageUrl%2Clogo%2CpaidCompany%2Cspecialities%2CaffiliatedCompanies*~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2CpaidCompany%29%2Cgroups*~%28entityUrn%2ClargeLogo%2CgroupName%2CmemberCount%2CwebsiteUrl%2Curl%29%2CcompanyEmployeesSearchPageUrl%2CstaffCount%2Curl%2CshowcasePages*~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2Cdescription%29%2CbackgroundCoverImage%2CoverviewPhoto%2CdataVersion%2CadsRule%2Cschool%2Cshowcase%2CsalesNavigatorCompanyUrl%2CacquirerCompany~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2CpaidCompany%29%2Cclaimable%2CclaimableByViewer%2CautoGenerated%2CuniversalName%2CstaffingCompany%2CviewerEmployee%2CviewerConnectedToAdministrator%2CviewerPendingAdministrator%29″,”status”:200,”body”:”bpr-guid-1379382″}

However i`m not succeeding at extracting any values yet ?

Any idea on how to solve this?

Thanks in advance,

Roy


    ScrapeHero May 11, 2017

    We have identified some changes to the site structure and will be updating the code soon.
    Thanks


      Roy nijland May 12, 2017

      Awesome, the effort is appreciated!


Soup_Sandwich May 12, 2017

Hi there, I’m very new to screen scraping and was pretty excited to stumble upon this. I was wondering if there is a way to scrape user profiles instead of companies. I’m trying to create a small sample database that contains certain characteristics of user profiles.

Thanks for the help.


    bitoo June 7, 2017

    Hi, are you able to scrape user profiles, I’m also new here and trying to do the same thing.


      Anand June 9, 2017

      It seems like LinkedIn has implemented new security, so now it’s not possible to scrape public company and public user profile data?


Ara G Kazaryan June 12, 2017

Can you give some insight on how you come up with your x-paths? Please and thank you.


Mohith Murlidhar July 3, 2017

Hi,

I get the following response along with a null data.json file:
(‘Fetching :’, ‘http://www.linkedin.com/company/tata-consultancy-services’)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)


    Mohith Murlidhar July 3, 2017

    I sorted that out. But I am still not getting the output. I checked, and I see that datafrom_xpath is empty each time. Why is this happening?


      Mohith Murlidhar July 3, 2017

      I fixed it.


        josterricardo February 28, 2018

        how did you solve it? It gives me the same error


Mohith Murlidhar July 5, 2017

Something weird is happening. So, even though this company data is public, I am not able to get the information. When I go to a company link using incognito mode on my browser, I get redirected to the linkedin sign up page. But on another system, when I do the same, I am able to access that link. Why is this so?


Jubin Sanghvi July 17, 2017

Hey, I tried using the scraper and it works brilliantly. Few questions though, does LinkedIn block IPs if I try to scrape a lot of pages. Is there a limit to it? Is there any workaround?


John Winger August 15, 2017

An important development on LinkedIn scraping – a federal judge orders LinkedIn to unblock access for scraping of public data.

A judge has ruled that Microsoft’s LinkedIn network must allow a third-party company to scrape data publicly posted by LinkedIn users.
A US District Judge has granted hiQ Labs with a preliminary injunction that provides access to LinkedIn data. LinkedIn tried to argue that hiQ Labs violated the 1986 Computer Fraud and Abuse Act by scraping data. The judge raised concerns around LinkedIn “unfairly leveraging its power in the professional networking market for an anticompetitive purpose,” and compared LinkedIn’s argument to allowing website owners to “block access by individuals or groups on the basis of race or gender discrimination.”

https://www.theverge.com/2017/8/15/16148250/microsoft-linkedin-third-party-data-access-judge-ruling


    Dwight July 12, 2018

    Is still the current ruling, “must allow a third-party company to scrape data publicly posted by LinkedIn users”? And does this include individuals?


    Gabriel September 7, 2017

    Hi, I´m getting this error too, I´m hopping you can help me please. Thank you


My3 January 16, 2018

InsecureRequestWarning
getting this warning and not able to scrape anything


Le Contemplateur May 23, 2018

After adapting the syntax to 3.x and solving the certification warning, I get a null json file. What am I doing wrong? I’ll test it with 2.7 just to be sure it’s not a compatibility problem


    João Silva May 30, 2018

    Hi! I’m getting exactly the same problem. Getting a null json file. Does anyone already have a solution for that? Thanks.


Hugo Bernardes June 3, 2018

Hello, I am getting the following error:

Traceback (most recent call last):
File “scraper.py”, line 4, in
from exceptions import ValueError
ModuleNotFoundError: No module named ‘exceptions’

Can you help? Many thanks!


Karen Phillips October 4, 2018

I have the same error ‘ No module named ‘exceptions’.


    ScrapeHero October 5, 2018

    No module named … errors are resolved by installing the module using pip


    Umesh S G December 18, 2018

    The "exceptions" module is no longer supported in Python 3.x; alternatively, install the "builtins" module (pip install builtins) and then use the line "from builtins import ValueError". It will work.


D January 6, 2019

Adding a cookie to the code in order to access as a web browser solves the problem


NN August 19, 2019

Does this still work in 2019?


    ScrapeHero August 19, 2019

    The code is workable but LI blocks almost everything


M October 18, 2019

Hi
I am getting an invalid syntax error for def in def readurl ()
any help would be great! thanks!

