Tutorial: How to Scrape LinkedIn for Public Company Data

We are glad that you came here to learn how to scrape LinkedIn and we won’t disappoint you.

In this tutorial, we will show you how to scrape the data on a LinkedIn company page.

For those who stumbled onto this page without a clear idea of why they would want to scrape LinkedIn data, here are a few reasons why:

  1. Job search automation – you want to work for a company that meets some specific criteria, and they are not the usual suspects. You do have a shortlist, but it isn’t really short – it is more like a long list. You wish there were a tool like Google Finance that could help you filter companies based on the criteria they have published on LinkedIn. You can take your “long list”, scrape this information into a structured format and then, like every programmer before you, build an amazing analysis tool.
    Heck, you could probably even build an app for that and not need that job after all!
  2. Curiosity – not the kind that killed the cat, but you are curious about companies on LinkedIn and want to gather a good clean set of data to satiate that curiosity.
  3. Tinkerer – you just like to tinker, would love to learn Python, and needed something useful to get started with.

Well, whatever your reason, you have come to the right place.

In this tutorial we will show you the basic steps to scrape publicly available LinkedIn company pages, such as LinkedIn’s own page or the ScrapeHero page.

Prerequisites:

For this tutorial, and just like we did for the Amazon Scraper, we will stick to basic Python and a couple of Python packages – requests and lxml. We will not use a more complicated framework like Scrapy for something this simple.

You will need to install the following:

  • Python 2.7, available here (https://www.python.org/downloads/)
  • Python Requests – installation instructions here (http://docs.python-requests.org/en/master/user/install/). You might need Python pip to install it, available here (https://pip.pypa.io/en/stable/installing/)
  • Python lxml – learn how to install it here (http://lxml.de/installation.html)

The code for this scraper is embedded below, and if you are unable to see it in your browser, it can be downloaded from the GIST here.
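In case the embed does not load, here is a minimal sketch of the general approach – it is not the original gist: fetch the company page with requests using browser-like headers, pull out the JSON that the old page layout embedded as an HTML comment inside a <code> element, and write the parsed result to data.json. The element id and the comment wrapper are based on the discussion in the comments below and may well have changed in newer layouts.

import json

import requests
from lxml import html


def parse_company(url):
    # Browser-like headers; the default python-requests User-Agent is usually blocked.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    for _ in range(5):
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            # LinkedIn answers with a 999 or a redirect when it blocks automated access.
            print('retrying :', url)
            continue
        doc = html.fromstring(response.content)
        # The old layout hid the company JSON inside an HTML comment in this element.
        blocks = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]/comment()')
        if blocks:
            return json.loads(blocks[0].text)
    return None


if __name__ == '__main__':
    companyurls = ['https://www.linkedin.com/company/scrapehero']
    data = [parse_company(url) for url in companyurls]
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile, indent=4)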

All you need to do is change the URL in the line that defines the list of company page URLs, or add more URLs to that list, separated by commas.

You can save the file and run it using Python – python filename.py

The output will be in a file called data.json in the same directory and will look something like this:
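The exact fields depend on the version of the script and of LinkedIn’s page; the values below are placeholders rather than real scraped data, but the overall shape is:

[
    {
        "company_name": "ScrapeHero",
        "industry": "Computer Software",
        "company_size": "11-50 employees",
        "website": "https://www.scrapehero.com",
        "description": "...",
        "followers": "..."
    }
]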

If you run it for another company, for example Cisco, the output will have the same structure with that company’s details filled in.

Feel free to change the URLs or the fields you want to scrape – and happy scraping!

Need some help with scraping data?

Turn websites into meaningful and structured data through our web data extraction service

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

62 comments on “Tutorial: How to Scrape LinkedIn for Public Company Data”

Arjun

I am not able to scrape data for company page scrapehero

    scrapehero

    Could you please copy and paste the error messages you are getting?

      autumn

      Hey there, thanks for the code! I don’t get any error output, only “retrying : https://www.linkedin.com/company/scrapehero” several times.
      It seems requests.get returns a 999 response since LinkedIn denies access to the page…
      SSLError: hostname 'fr.linkedin.com' doesn't match either of 'spdy.linkedin.com', 'polls.linkedin.com', 'cms.linkedin.com', 'slideshare.www.linkedin.com'

        Mark

        I’m getting the same issue. Is there any fix available?

          ScrapeHero

          You are most likely being blocked by LI – they do block most automated access

          Mark

          So is there any way around this available?

          ScrapeHero

          Hi Mark,
          This is the reason professional services exist for this need. You are at a point where it is more than a DIY project.
          Thanks

          Mark

          Fair enough, thanks 🙂

scrapehero

If you are getting an error about lxml “ImportError: No module named lxml”, you will need to install lxml using the command pip install lxml on the command prompt

    Doug

    I ran “pip install lxml” but am still getting “ImportError: No module named lxml”. Any idea why?

      Doug

      Figured it out, was using python 3

Helen

It works great for company profiles! Thank you! But it seems that a similar approach does not work for personal profiles. Any idea why?

    ScrapeHero

    Hi Helen,
    This tutorial is designed for company profiles only.
    It will need modifications to fit the structure of a personal profile because the pages are structured very differently.
    We will keep that on our list for a followup tutorial.
    Thanks

      Soup_Sandwich

      Hi,

      Has there been a tutorial for this yet?

      Thanks.

Corentin

Hello, thanks for the code! Like autumn, I don’t have any error output, only “retrying : https://www.linkedin.com/company/scrapehero”.
It seems the request is denied or something like that.

    ScrapeHero

    It seems like your IP address may be blocked by LI

Niraj gupta

How do I write an XPath for any website? Would you please tell me?

Nitin

You are amazing. It works like a charm. Can you also help us understand how to automate and rotate IPs in order to not get blocked by LI?

C Park

It keeps giving me the following error:

File “scraper.py”, line 36
response = requests.get(url, headers=headers)
^
IndentationError: unindent does not match any outer indentation level

    ScrapeHero

    Hi there,
    Python needs lines to be “indented” uniformly – it seems like the copy-paste or edits changed the indentation, so you will need to go to line 36 and check the spaces or tabs before the word response, as described in the error message.
    You will need some basic Python syntax knowledge if you run into such errors – nothing a Google search cannot handle.
    Thanks
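    For example, every statement inside the same block has to start with the same amount of whitespace – a quick, self-contained illustration:

    import requests

    companyurls = ['https://www.linkedin.com/company/scrapehero']
    headers = {'User-Agent': 'Mozilla/5.0'}
    for url in companyurls:
        # every line in this block starts with exactly four spaces
        response = requests.get(url, headers=headers)
        print(response.status_code)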

    C Park

    NVM, it works now, but now I get this warning!

    InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.

    For context, I had a CSV with rows of company names, so I have a few lines prior to your code that append each company name in that list into the company URL format that LinkedIn requires for the type of info I’m looking for (size):

    urlstring_1 = 'https://www.linkedin.com/vsearch/f?type=all&keywords='
    urlstring_2 = '&orig=GLHD&rsid=&pageKey=oz-winner&trkInfo=tarId%3A1475627803218&trk=global_header&search=Search'

    companyurl_list = [urlstring_1 + x + urlstring_2 for x in company_list]

      ScrapeHero

      Looks like an SSL warning, and as long as you are getting the data the warning shouldn’t matter – you are not sending your credit card over this connection – cheers
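      If the warning clutters your output, it can be silenced with urllib3’s own helper – a minimal sketch, to be added near the top of the script (upgrading Python/requests is the proper long-term fix):

      import requests

      # Hide urllib3's SSL-related warnings; the requests themselves are unchanged.
      requests.packages.urllib3.disable_warnings()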

        C Park

        you’re right – it works fine!

        That being said, it worked for the first time i did it. Then I tried it again a few min later and started getting null values…wondering if they know it’s not a browser action/”real” user doing the search?

          ScrapeHero

          Most likely your IP has been flagged. You can wait a while to get unblocked or read through other posts on our site on how to overcome blocking.
          Good luck

MichaelGeorgeKeating (@MichaelGKeating)

This is a great post. Be careful scraping LinkedIn though. I was recently banned for life because of my software http://octatools.com. If you care to learn more, check out my blog post on that topic: http://michaelgkeating.com/cant-find-me-on-linkedin-heres-why-i-got-kicked-off/

    Pandora

    Aren’t they just selectively removing competition of the remotest form? Is that even legal? Man! I’m pissed off reading what happened!

ScrapeHero

LinkedIn has removed the ScrapeHero page (you are welcome LI/M$FT!) so please use a different URL as an example for the code.

Pandora

Hey,

I wanted to take this one step further and scrape the names of the people on the following page:

https://www.linkedin.com/vsearch/p?f_CC=2738480&trk=rr_connectedness

these are all the employees working in a company called Scripbox. Getting the names from this search result can give us a lot more data to work with. I tried to crawl through this page using your same headers but it seems to fail and I’m being redirected to a login page. So I’m guessing it requires the user to be logged in. My question is how do we define the right session in the request headers. I’m unable to find any cookie values after searching source but again I’m new and only learning. Please help.

    Pandora

    Any help, guys?

      ScrapeHero

      Hi there – that kind of data is private data and we don’t scrape private data.
      Sorry you will need to figure that out yourself.
      Thanks

        Pandora

        Oh okay. Sorry about that. Thanks.

Mark

Hi! Is there a way to scrape based on an input of company name rather than URL (i.e. I would like to be able to feed the scraper a list such as Cisco, IBM, Apple, etc)?

    ScrapeHero

    Hi Mark,
    Sure, it is possible, but it will require writing some code. Please google how to read files in Python and pass the output to this scraper.
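    A minimal sketch of that, assuming a plain text file with one company page URL per line – the file name companies.txt and the function parse_company are placeholders for whatever the scraper defines:

    import json

    # companies.txt: a hypothetical file with one LinkedIn company page URL per line.
    with open('companies.txt') as infile:
        companyurls = [line.strip() for line in infile if line.strip()]

    # parse_company stands in for the scraper's own fetch-and-parse function.
    data = [parse_company(url) for url in companyurls]

    with open('data.json', 'w') as outfile:
        json.dump(data, outfile, indent=4)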

      Mark

      Yeah I understand, but what I’m after is to get the LinkedIn URL for each company in a list. I can then use these URLs with this amazing scraper. Do you know how to achieve that first step?

Ming

As of Feb 20, 2017, the script doesn’t seem to work. LinkedIn recently updated their site and I don’t think the script parses the page contents correctly.

    ScrapeHero

    Thanks for letting us know. We just updated the code to fix this. It should work now.

Andy

Thanks for the script! Looks like LinkedIn may have updated the frontend again – the script retries 5 times and then stores a null value in data.json, regardless of which company you try.

    ScrapeHero

    Hi Andy, I am sure they have changed it up. We will check this script for errors periodically and fix it.
    Thanks

ScrapeHero

We just tested the code and it works. However, the reason for the lack of data is that LI has blocked your IP (maybe even temporarily) and added it to the list where you need to be logged in to get this data.

    ScrapeHero

    Hi Hoenie,
    Glad you were able to port the code to Python 3 and that it worked for you. We did notice that the blocks are still a problem for many other users, even though your issue was related to Python 2 vs 3.

Art

Hi there, thanks for the post. What is the purpose of retrying on a redirect or login? I mean, what makes the same URL give different output if tried again from the same IP? In my experience it does not change the behavior. Also, how did you get the JSON by removing comments on the page? Nice trick) Please tell us more about this – which section did the id "stream-promo-top-bar-embed-id-content" lead to? How do I find the new id for this section in the new layout?

Roy nijland

Hi there Scrapehero!

Script works perfectly with the old URL / page structure. I was trying to modify your script to work with the new structure https://www.linkedin.com/company-beta/3975633/, and tried to set the xpath to datafrom_xpath = doc.xpath('//code[@id="datalet-bpr-guid-1379382"]//text()') as their new API method requests the following:

{“request”:”/voyager/api/organization/companies/3975633?decoration\u003D%28name%2CcompanyPageUrl%2Cpermissions%2Ctype%2CcoverPhoto%2Cdescription%2CstaffCountRange%2CentityUrn%2CfollowingInfo%2CfoundedOn%2Cheadquarter%2Cindustries%2CcompanyIndustries*%2CjobSearchPageUrl%2Clogo%2CpaidCompany%2Cspecialities%2CaffiliatedCompanies*~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2CpaidCompany%29%2Cgroups*~%28entityUrn%2ClargeLogo%2CgroupName%2CmemberCount%2CwebsiteUrl%2Curl%29%2CcompanyEmployeesSearchPageUrl%2CstaffCount%2Curl%2CshowcasePages*~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2Cdescription%29%2CbackgroundCoverImage%2CoverviewPhoto%2CdataVersion%2CadsRule%2Cschool%2Cshowcase%2CsalesNavigatorCompanyUrl%2CacquirerCompany~%28entityUrn%2Clogo%2Cname%2Cindustries%2CfollowingInfo%2Curl%2CpaidCompany%29%2Cclaimable%2CclaimableByViewer%2CautoGenerated%2CuniversalName%2CstaffingCompany%2CviewerEmployee%2CviewerConnectedToAdministrator%2CviewerPendingAdministrator%29″,”status”:200,”body”:”bpr-guid-1379382″}

However, I’m not succeeding at extracting any values yet 🙁

Any idea on how to solve this?

Thanks in advance,

Roy

Soup_Sandwich

Hi there, I’m very new to screen scraping and was pretty excited to stumble upon this. I was wondering if there is a way to scrape user profiles instead of companies. I’m trying to create a small sample database that contains certain characteristics of user profiles.

Thanks for the help.

    bitoo

    Hi, are you able to scrape user profiles? I’m also new here and trying to do the same thing.

      Anand

      It seems like LinkedIn has implemented new security, so now it’s not possible to scrape public company and public user profile data 😁

Ara G Kazaryan

Can you give some insight into how you come up with your XPaths? Please and thank you.

Mohith Murlidhar

Hi,

I get the following response along with a null data.json file:
('Fetching :', 'http://www.linkedin.com/company/tata-consultancy-services')
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)

    Mohith Murlidhar

    I sorted that out. But I am still not getting the output. I checked, and I see that datafrom_xpath is empty each time. Why is this happening?

Mohith Murlidhar

Something weird is happening. So, even though this company data is public, I am not able to get the information. When I go to a company link using incognito mode on my browser, I get redirected to the linkedin sign up page. But on another system, when I do the same, I am able to access that link. Why is this so?

Jubin Sanghvi

Hey, I tried using the scraper and it works brilliantly. A few questions though: does LinkedIn block IPs if I try to scrape a lot of pages? Is there a limit to it? Is there any workaround?

John Winger

An important development on LinkedIn scraping – a federal judge has ordered LinkedIn to unblock access for scraping of public data.

A judge has ruled that Microsoft’s LinkedIn network must allow a third-party company to scrape data publicly posted by LinkedIn users.
A US District Judge has granted hiQ Labs with a preliminary injunction that provides access to LinkedIn data. LinkedIn tried to argue that hiQ Labs violated the 1986 Computer Fraud and Abuse Act by scraping data. The judge raised concerns around LinkedIn “unfairly leveraging its power in the professional networking market for an anticompetitive purpose,” and compared LinkedIn’s argument to allowing website owners to “block access by individuals or groups on the basis of race or gender discrimination.”

https://www.theverge.com/2017/8/15/16148250/microsoft-linkedin-third-party-data-access-judge-ruling

    Gabriel

    Hi, I’m getting this error too. I’m hoping you can help me, please. Thank you

My3

I am getting an InsecureRequestWarning and am not able to scrape anything.
