Tutorial: How to Scrape LinkedIn for public company data

We are glad that you came here to learn how to scrape LinkedIn and we won’t disappoint you.

In this tutorial we will show you how to scrape the data in a LinkedIn company page.

For those who stumbled onto this page without a clear idea of why they would want to scrape LinkedIn data, here are a few reasons:

  1. Job search automation – you want to work for a company that meets some specific criteria, and the candidates are not the usual suspects. You do have a shortlist, but it isn’t really short; it is more like a long list. You wish there was a tool like Google Finance that could help you filter companies based on criteria they have published on LinkedIn. You can take your “long list”, scrape this information into a structured format and then, like every programmer before you, build an amazing analysis tool.
    Heck, you could probably even build an app for that and not need that job after all!
  2. Curiosity – not the kind that killed the cat, but you are curious about companies on LinkedIn and want to gather a good, clean set of data to satisfy your curiosity.
  3. Tinkerer – you just like to tinker, you would love to learn Python, and you need something useful to get started.

Well, whatever your reason, you have come to the right place.

In this tutorial we will show you the basic steps for scraping publicly available LinkedIn company pages, such as LinkedIn's own page or the ScrapeHero page.

Prerequisites:

For this tutorial, just as we did for the Amazon Scraper, we will stick to basic Python and a couple of Python packages – requests and lxml. We will not use a more complicated package like Scrapy for something this simple.

You will need to install the following:

  • Python 2.7, available here (https://www.python.org/downloads/)
  • Python Requests, available here (http://docs.python-requests.org/en/master/user/install/). You might need Python pip to install it, available here (https://pip.pypa.io/en/stable/installing/)
  • Python LXML (learn how to install it here: http://lxml.de/installation.html)

The code for this scraper is embedded below. If you are unable to see it in your browser, it can be downloaded from the Gist here.

All you need to do is change the URL in this line:

companyurls = ['https://www.linkedin.com/company/scrapehero']

or add more URLs to this list, separated by commas.
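For example, a list covering both company pages used later in this tutorial might look like the sketch below. The `company_url` helper and the `BASE_URL` constant are hypothetical conveniences, not part of the original script. They assume you already know each company's LinkedIn slug (the last part of the URL), which is not always the same as the company name.

```python
BASE_URL = 'https://www.linkedin.com/company/'


def company_url(slug):
    """Build a company page URL from its slug (hypothetical helper)."""
    return BASE_URL + slug.strip().strip('/')


# Written out by hand, with the URLs separated by commas:
companyurls = [
    'https://www.linkedin.com/company/scrapehero',
    'https://www.linkedin.com/company/cisco',
]

# Or built from slugs you already know; this yields the same list:
slug_urls = [company_url(slug) for slug in ['scrapehero', 'cisco']]
```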

Then save the file and run it with Python: python filename.py
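If the embedded code does not load, the general shape of a script like this can be sketched as follows. This is a minimal sketch and not the original script: the names `fetch_title`, `scrape_all` and `save` are hypothetical, the five-attempt limit is taken from a reader's report in the comments, and the only field extracted is the page title, because LinkedIn's markup changes often and the page may require a login. It is written for Python 3 and assumes requests and lxml are installed.

```python
import json


def fetch_title(url):
    """Illustrative fetcher: download a page and return its <title> text."""
    # Third-party packages, imported here so the rest of the sketch
    # can be used with a different fetch function.
    import requests
    from lxml import html

    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    response.raise_for_status()
    tree = html.fromstring(response.content)
    titles = tree.xpath('//title/text()')
    return {'url': url, 'page_title': titles[0].strip() if titles else None}


def scrape_all(urls, fetch=fetch_title, retries=5):
    """Try each URL up to `retries` times; store None if every attempt fails."""
    results = []
    for url in urls:
        record = None
        for _ in range(retries):
            try:
                record = fetch(url)
                break
            except Exception:
                continue  # network error or unexpected markup: try again
        results.append(record)
    return results


def save(results, path='data.json'):
    """Write the list of records (or nulls) to a JSON file."""
    with open(path, 'w') as handle:
        json.dump(results, handle, indent=4)
```

Calling `save(scrape_all(companyurls))` would then write one record per URL to data.json, with null for any page that could not be fetched or parsed.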

The output will be in a file called data.json in the same directory and will look something like this:

{
        "website": "http://www.scrapehero.com", 
        "description": "ScrapeHero is one of the top web scraping companies in the world for a reason.\r\nWe don't leave you with a \"self service\" screen to build your own scrapers.\r\nWe have real humans that will talk to you within hours of your request and help you with your need.\r\nEven though we are premier provider in this space, our investments in automation have allowed us to provide a completely \"full service\" to you at an affordable cost.\r\nGet in touch with us at https://scrapehero.com and experience our awesome customer service first hand", 
        "founded": 2014, 
        "street": null, 
        "specialities": [
            "Web Scraping Service", 
            "Website Scraping", 
            "Screen scraping", 
            "Data scraping", 
            "Web crawling", 
            "Data as a Service", 
            "Data extraction API", 
            "Scrapy", 
            "DaaS"
        ], 
        "size": "11-50 employees", 
        "city": null, 
        "zip": null, 
        "url": "https://www.linkedin.com/company/scrapehero", 
        "country": null, 
        "industry": "Computer Software", 
        "state": null, 
        "company_name": "ScrapeHero", 
        "follower_count": 41, 
        "type": "Privately Held"
    }

Or, if you run it for Cisco:

companyurls = ['https://www.linkedin.com/company/cisco']

The output will look like this:

{
        "website": "http://www.cisco.com", 
        "description": "Cisco (NASDAQ: CSCO) enables people to make powerful connections--whether in business, education, philanthropy, or creativity. Cisco hardware, software, and service offerings are used to create the Internet solutions that make networks possible--providing easy access to information anywhere, at any time. \r\n\r\nCisco was founded in 1984 by a small group of computer scientists from Stanford University. Since the company's inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies. Today, with more than 71,000 employees worldwide, this tradition of innovation continues with industry-leading products and solutions in the company's core development areas of routing and switching, as well as in advanced technologies such as home networking, IP telephony, optical networking, security, storage area networking, and wireless technology. In addition to its products, Cisco provides a broad range of service offerings, including technical support and advanced services. \r\n\r\nCisco sells its products and services, both directly through its own sales force as well as through its channel partners, to large enterprises, commercial businesses, service providers, and consumers.", 
        "founded": 1984, 
        "street": "Tasman Way, ", 
        "specialities": [
            "Networking", 
            "Wireless", 
            "Security", 
            "Unified Communication", 
            "Telepresence", 
            "Collaboration", 
            "Data Center", 
            "Virtualization", 
            "Unified Computing Systems"
        ], 
        "size": "10,001+ employees", 
        "city": "San Jose", 
        "zip": "95134", 
        "url": "https://www.linkedin.com/company/cisco", 
        "country": "United States", 
        "industry": "Computer Networking", 
        "state": "CA", 
        "company_name": "Cisco", 
        "follower_count": 1201541, 
        "type": "Public Company"
    }
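Once you have records like these, filtering your "long list" takes only a few lines. The sketch below assumes data.json holds a JSON list of records with the field names shown above (a single object is also accepted) and that failed scrapes appear as null. The function names and the filter criteria are hypothetical examples, not part of the original script.

```python
import json


def load_companies(path='data.json'):
    """Load scraped records, skipping null entries from failed scrapes."""
    with open(path) as handle:
        data = json.load(handle)
    records = data if isinstance(data, list) else [data]
    return [record for record in records if record]


def filter_companies(companies, industry=None, min_followers=0, speciality=None):
    """Return the records that match all of the given criteria."""
    matches = []
    for company in companies:
        if industry and company.get('industry') != industry:
            continue
        if (company.get('follower_count') or 0) < min_followers:
            continue
        if speciality and speciality not in (company.get('specialities') or []):
            continue
        matches.append(company)
    return matches
```

For instance, `filter_companies(load_companies(), industry='Computer Networking', min_followers=1000)` would keep only networking companies with at least 1,000 followers.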

Feel free to change the URLs or the fields you want to scrape. Happy scraping!


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

42 thoughts on “Tutorial: How to Scrape LinkedIn for public company data”

  1. Hi! Is there a way to scrape based on an input of company name rather than URL (i.e. I would like to be able to feed the scraper a list such as Cisco, IBM, Apple, etc)?

      1. Yeah I understand, but what I’m after is to get the LinkedIn URL for each company in a list. I can then use these URLs with this amazing scraper. Do you know how to achieve that first step?

  2. As of Feb 20, 2017, the script doesn’t seem to work. LinkedIn recently updated their site and I don’t think the script parses the page contents correctly.

  3. Thanks for the script! Looks like LinkedIn may have updated the frontend again – the script retries 5 times and then stores a null value in data.json, regardless of which company you try.
