Tutorial: How to Scrape LinkedIn for public company data

We are glad that you came here to learn how to scrape LinkedIn and we won’t disappoint you.

In this tutorial we will show you how to scrape the data in a LinkedIn company page.

For those who stumbled onto this page without a clear idea of why they would want to scrape LinkedIn data, here are a few reasons:

  1. Job search automation – you want to work for a company that meets some specific criteria, and the usual suspects do not. You do have a shortlist, but it isn’t really short – it is more like a long list. You wish there was a tool like Google Finance that could help you filter companies based on criteria they have published on LinkedIn. You can take your “long list”, scrape this information into a structured format and then, like every programmer before you, build an amazing analysis tool.
    Heck, you could probably even build an app for that and not need that job after all!
  2. Curiosity – not the kind that killed the cat, but you are curious about companies on LinkedIn and want to gather a good, clean set of data to satisfy your curiosity.
  3. Tinkerer – you just like to tinker, you would love to learn Python, and you need something useful to get started.

Well, whatever your reason, you have come to the right place.

In this tutorial we will show you the basic steps for scraping publicly available LinkedIn company pages, such as LinkedIn’s own page or the ScrapeHero page.

Prerequisites:

For this tutorial, just as we did for the Amazon Scraper, we will stick to basic Python and a couple of Python packages – requests and lxml. We will not use more complicated packages like Scrapy for something this simple.

You will need to install the following:

  • Python 2.7 available here ( https://www.python.org/downloads/ )
  • Python Requests available here ( http://docs.python-requests.org/en/master/user/install/ ). You might need Python pip to install it, available here ( https://pip.pypa.io/en/stable/installing/ )
  • Python LXML ( Learn how to install that here – http://lxml.de/installation.html )

The code for this scraper is embedded below and if you are unable to see it in your browser, it can be downloaded from the GIST here

All you need to do is change the URL in this line

companyurls = ['https://www.linkedin.com/company/scrapehero']

or add more URLs separated by commas to this list
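If your “long list” is a set of company names rather than full URLs, the list can also be built programmatically. The sketch below is only an illustration: build_company_urls is a hypothetical helper that is not part of the scraper, and it assumes you already know each company’s slug (the last part of its company page URL).

```python
# Hypothetical helper (not part of the scraper): turn company slugs into
# the company page URLs that the companyurls list expects.
BASE_URL = 'https://www.linkedin.com/company/'


def build_company_urls(slugs):
    """Return one company page URL per non-empty slug, preserving order."""
    urls = []
    for slug in slugs:
        slug = slug.strip().strip('/')
        if slug:  # skip blank entries
            urls.append(BASE_URL + slug)
    return urls


companyurls = build_company_urls(['scrapehero', 'cisco'])
print(companyurls)
```

The resulting companyurls list has the same shape as the hand-written one, so it can be dropped into the scraper in its place.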

You can save the file and run it using Python – python filename.py

The output will be in a file called data.json in the same directory and will look something like this

{
        "website": "http://www.scrapehero.com", 
        "description": "ScrapeHero is one of the top web scraping companies in the world for a reason.\r\nWe don't leave you with a \"self service\" screen to build your own scrapers.\r\nWe have real humans that will talk to you within hours of your request and help you with your need.\r\nEven though we are premier provider in this space, our investments in automation have allowed us to provide a completely \"full service\" to you at an affordable cost.\r\nGet in touch with us at https://scrapehero.com and experience our awesome customer service first hand", 
        "founded": 2014, 
        "street": null, 
        "specialities": [
            "Web Scraping Service", 
            "Website Scraping", 
            "Screen scraping", 
            "Data scraping", 
            "Web crawling", 
            "Data as a Service", 
            "Data extraction API", 
            "Scrapy", 
            "DaaS"
        ], 
        "size": "11-50 employees", 
        "city": null, 
        "zip": null, 
        "url": "https://www.linkedin.com/company/scrapehero", 
        "country": null, 
        "industry": "Computer Software", 
        "state": null, 
        "company_name": "ScrapeHero", 
        "follower_count": 41, 
        "type": "Privately Held"
    }

Or if you run it for Cisco

companyurls = ['https://www.linkedin.com/company/cisco']

The output will look like this

{
        "website": "http://www.cisco.com", 
        "description": "Cisco (NASDAQ: CSCO) enables people to make powerful connections--whether in business, education, philanthropy, or creativity. Cisco hardware, software, and service offerings are used to create the Internet solutions that make networks possible--providing easy access to information anywhere, at any time. \r\n\r\nCisco was founded in 1984 by a small group of computer scientists from Stanford University. Since the company's inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies. Today, with more than 71,000 employees worldwide, this tradition of innovation continues with industry-leading products and solutions in the company's core development areas of routing and switching, as well as in advanced technologies such as home networking, IP telephony, optical networking, security, storage area networking, and wireless technology. In addition to its products, Cisco provides a broad range of service offerings, including technical support and advanced services. \r\n\r\nCisco sells its products and services, both directly through its own sales force as well as through its channel partners, to large enterprises, commercial businesses, service providers, and consumers.", 
        "founded": 1984, 
        "street": "Tasman Way, ", 
        "specialities": [
            "Networking", 
            "Wireless", 
            "Security", 
            "Unified Communication", 
            "Telepresence", 
            "Collaboration", 
            "Data Center", 
            "Virtualization", 
            "Unified Computing Systems"
        ], 
        "size": "10,001+ employees", 
        "city": "San Jose", 
        "zip": "95134", 
        "url": "https://www.linkedin.com/company/cisco", 
        "country": "United States", 
        "industry": "Computer Networking", 
        "state": "CA", 
        "company_name": "Cisco", 
        "follower_count": 1201541, 
        "type": "Public Company"
    }
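Once data.json has been written, the records can be screened in the Google Finance style mentioned earlier. The following is a minimal sketch, assuming data.json holds a JSON list of records shaped like the two samples above (a single object is also handled). The helper names load_records, min_employees and filter_companies are hypothetical, and the parsing of the size field assumes it starts with a number, as in “11-50 employees” or “10,001+ employees”.

```python
import json
import re


def load_records(path='data.json'):
    """Load scraped records; wrap a single object in a list (assumed layout)."""
    with open(path) as handle:
        data = json.load(handle)
    return data if isinstance(data, list) else [data]


def min_employees(size):
    """Return the lower bound of a size string, or None if it has no leading number."""
    match = re.match(r'\s*(\d[\d,]*)', size or '')
    if not match:
        return None
    return int(match.group(1).replace(',', ''))


def filter_companies(records, industry=None, min_size=0, min_followers=0):
    """Keep only the records that satisfy every criterion that was given."""
    selected = []
    for record in records:
        if industry and record.get('industry') != industry:
            continue
        size = min_employees(record.get('size'))
        if min_size and (size is None or size < min_size):
            continue
        if (record.get('follower_count') or 0) < min_followers:
            continue
        selected.append(record)
    return selected


# Trimmed-down versions of the two sample records shown above.
sample = [
    {'company_name': 'ScrapeHero', 'industry': 'Computer Software',
     'size': '11-50 employees', 'follower_count': 41},
    {'company_name': 'Cisco', 'industry': 'Computer Networking',
     'size': '10,001+ employees', 'follower_count': 1201541},
]
large = filter_companies(sample, min_size=1000)
print([record['company_name'] for record in large])
```

To run the same filter on real output, replace sample with load_records().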

Feel free to change the URLs or the fields you want to scrape, and happy scraping!


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

49 thoughts on “Tutorial: How to Scrape LinkedIn for public company data”

      1. Hey there, thanks for the code! I don’t get any error output, only “retrying : https://www.linkedin.com/company/scrapehero” several times.
        It seems the requests.get returns a 999 response since LinkedIn denies access to the page…
        SSLError: hostname ‘fr.linkedin.com’ doesn’t match either of ‘spdy.linkedin.com’, ‘polls.linkedin.com’, ‘cms.linkedin.com’, ‘slideshare.www.linkedin.com’

  1. If you are getting an error about lxml “ImportError: No module named lxml”, you will need to install lxml using the command pip install lxml on the command prompt

    1. Hi Helen,
      This tutorial is designed for company profiles only.
      It will need modifications to fit the structure of a personal profile because the pages are structured very differently.
      We will keep that on our list for a followup tutorial.
      Thanks

  2. You are amazing. It works like a charm. Can you also help us learn how to automate and rotate IPs so that we do not get blocked by LI?

    1. Hi Nitin,
      You are welcome !
      Rotating IPs etc is a whole industry and a specialization in itself.
      Please google for some providers that provide this service.
      The usage should be pretty transparent and is very easy to configure.

      Thanks

  3. It keeps giving me the following error:

    File "scraper.py", line 36
    response = requests.get(url, headers=headers)
    ^
    IndentationError: unindent does not match any outer indentation level

    1. Hi there,
      Python needs lines to be “indented” uniformly – seems like the copy paste or edits changed the indentation so you will need to go to line 36 and check the spaces or tabs before the word response as described in the error message.
      You will need some basic python syntax knowledge if you run into such errors – nothing a google search cannot handle.
      Thanks

    2. NVM, it works now, but now I get this warning!

      InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.

      for context, I had a csv with rows of company names, so I have a few lines prior to your code that appends each company name in that list into the companyurl format that linkedin requires for the type of info i’m looking for (size):

      urlstring_1 = 'https://www.linkedin.com/vsearch/f?type=all&keywords='
      urlstring_2 = '&orig=GLHD&rsid=&pageKey=oz-winner&trkInfo=tarId%3A1475627803218&trk=global_header&search=Search'

      companyurl_list = [urlstring_1 + x + urlstring_2 for x in company_list]

      1. Looks like an SSL warning, and as long as you are getting the data this warning shouldn’t matter – you are not sending your credit card over this connection – cheers

        1. you’re right – it works fine!

          That being said, it worked the first time I did it. Then I tried it again a few minutes later and started getting null values… wondering if they know it’s not a browser action/“real” user doing the search?

          1. Most likely your IP has been flagged. You can wait a while to get unblocked or read through other posts on our site on how to overcome blocking.
            Good luck
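For readers running into the same blocking, one way to “wait a while” in code is to retry with growing pauses. This is only a sketch under stated assumptions: fetch_with_backoff and its parameters are hypothetical, fetch stands for any function returning a (status_code, body) pair (for example a thin wrapper around requests.get), and treating 999 as the “blocked” status follows the comment further up rather than any LinkedIn documentation.

```python
import time

BLOCKED_STATUS = 999  # status reported by a commenter above; an assumption


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url) until it succeeds, doubling the pause after each block.

    Returns the body on a 200 response, or None if every attempt was blocked
    or a different error status came back.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status != BLOCKED_STATUS:
            return None  # not a blocking problem, so retrying will not help
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return None


# Demonstration with a stub instead of a real network call.
stub_responses = [(999, ''), (200, '<html>company page</html>')]


def stub_fetch(url):
    return stub_responses.pop(0)


page = fetch_with_backoff(stub_fetch,
                          'https://www.linkedin.com/company/scrapehero',
                          sleep=lambda seconds: None)  # skip real waiting
print(page)
```

With requests, fetch could be a small function that calls requests.get(url, headers=headers) and returns (response.status_code, response.text).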

    1. Aren’t they just selectively removing competition of the remotest form? Is that even legal? Man! I’m pissed off reading what happened!

  4. Hey,

    I wanted to take this one step further and scrape the names of the people in the following page:

    https://www.linkedin.com/vsearch/p?f_CC=2738480&trk=rr_connectedness

    these are all the employees working in a company called Scripbox. Getting the names from this search result can give us a lot more data to work with. I tried to crawl through this page using your same headers but it seems to fail and I’m being redirected to a login page. So I’m guessing it requires the user to be logged in. My question is how do we define the right session in the request headers. I’m unable to find any cookie values after searching source but again I’m new and only learning. Please help.
