How to scrape YellowPages.com using Python and LXML

In this tutorial, we will show you how to write a web scraper that extracts business details from Yellow Pages for a given city and category.

We’ll search Yellowpages.com for restaurants in a city and extract the details from the first page of results.

What data are we extracting?

Here is the list of fields we will be extracting:

  1. Rank
  2. Business Name
  3. Phone Number
  4. Business Page
  5. Category
  6. Website
  7. Rating
  8. Street Name
  9. Locality
  10. Region
  11. Zip code
  12. URL
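
Each listing is saved as one row of a CSV file. To make the output concrete, a single scraped record has the shape below (all values are illustrative placeholders, not real scraped data):

{
    'rank': '1',
    'business_name': 'Sample Restaurant',
    'telephone': '(617) 555-0100',
    'business_page': 'https://www.yellowpages.com/boston-ma/mip/sample-restaurant',
    'category': 'Restaurants,American Restaurants',
    'website': 'http://www.samplerestaurant.com',
    'rating': '4.5',
    'street': '123 Main St',
    'locality': 'Boston',
    'region': 'MA',
    'zipcode': '02101',
    'listing_url': 'https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston'
}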

Below is a screenshot of the data we will be extracting.

[Image: yellow-pages-extract-details]

Finding the Data

Before we start building the scraper, we need to find where the data is located in the web page’s HTML. You’ll need a basic understanding of HTML tags to do so.

If you already understand HTML and Python, this will be simple for you. You don’t need advanced programming skills for most of this tutorial.

If you don’t know much about HTML or Python, spend some time reading Getting started with HTML – Mozilla Developer Network and the Python tutorial at https://www.programiz.com/python-programming

Let’s inspect the HTML of the web page and find out where the data is located. Here is what we’re going to do:

  1. Find the HTML tag that encloses the list of results we need data from
  2. Get each listing from it and extract the data fields

Inspecting the HTML

Why should we inspect the element? To locate any element on the web page using an XPath expression.

Open any browser (we are using Chrome here) and go to https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston

Right-click on any link on the page and choose Inspect Element. The browser will open a toolbar showing the HTML content of the web page in a well-structured format.

[Image: yellowpages-inspecting-the-element]

The GIF above shows the data we need to extract enclosed in a DIV tag. If you look closely, that DIV has a class attribute whose value is ‘result’. This DIV contains the data fields we need to extract.

[Image: yellow-pages-fields-to-extract]

Now let’s find the HTML tag(s) that contain the links we need to extract. Right-click the link title in the browser and choose Inspect Element again. It will open the HTML content like before and highlight the tag that holds the data you right-clicked. In the GIF below you can see the data fields structured in order.

[Image: yellow-pages-inside-div-class]
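
Before writing the full scraper, you can sanity-check these XPaths from a Python shell. Here is a minimal sketch, assuming the requests and lxml packages are installed (installation is covered below); the listing XPath is the same one the full script uses:

import requests
from lxml import html

url = ("https://www.yellowpages.com/search"
       "?search_terms=restaurant&geo_location_terms=Boston")
# A browser-like User-Agent makes the request look less like a bot
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
parser = html.fromstring(response.text)

# Each search result sits in a div with class 'v-card'
listings = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(listings), "listings found")

# Pull the business name out of the first listing
if listings:
    name = ''.join(listings[0].xpath(".//a[@class='business-name']//text()"))
    print(name)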

How to set up your computer for web scraper development

We will use Python 3 for this tutorial; the code will not run if you are using Python 2.7. To get started, your system needs Python 3 and pip installed.

Most UNIX-like operating systems, such as Linux and Mac OS, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.

To check your Python version, open a terminal (on Linux and Mac OS) or Command Prompt (on Windows) and type:

python --version

and press Enter. If the output looks something like Python 3.x.x, you have Python 3 installed. If it says Python 2.x.x, you have Python 2. If it prints an error, you probably don’t have Python installed. On some systems Python 3 is installed as python3, so try python3 --version as well.

Install Python 3 and Pip

Here is a guide to install Python 3 on Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide here – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Install Packages
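
The scraper depends on three third-party packages: requests to download the search results page, lxml to parse the HTML using XPath, and unicodecsv to write the output CSV. Install them using pip:

pip3 install requests lxml unicodecsv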

The Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import argparse

import requests
import unicodecsv as csv
from lxml import html


def parse_listing(keyword, place):
    """
    Scrape a Yellow Pages search results page.

    :param keyword: search query (business category)
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}

    # Retry the request up to 10 times before giving up
    for retry in range(10):
        try:
            # verify=False skips SSL certificate verification
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # Make all links in the page absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)

                XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []

                # XPaths for the individual fields, relative to each listing
                XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
                XPATH_TELEPHONE = ".//div[@itemprop='telephone']//text()"
                XPATH_STREET = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='streetAddress']//text()"
                XPATH_LOCALITY = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressLocality']//text()"
                XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
                XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
                XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

                for results in listings:
                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)

                    # Join the raw text nodes and strip whitespace; any field may be absent
                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ','.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
                    region = ''.join(raw_region).strip() if raw_region else None
                    zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None

                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url
                    }
                    scraped_results.append(business_details)

                return scraped_results

            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # No need to retry for a non-existent page
                break
            else:
                print("Failed to process page, retrying")
        except Exception:
            print("Failed to process page, retrying")

    return []


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')

    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place
    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        # unicodecsv needs the file opened in binary mode
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website',
                          'rating', 'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)

Execute the script by typing its name followed by -h in a command prompt or terminal to see its usage:

usage: yellow_pages.py [-h] keyword place

positional arguments:
  keyword     Search Keyword
  place       Place Name

optional arguments:
  -h, --help  show this help message and exit

The positional argument keyword represents a business category, and place is the location to search in. As an example, let’s find the business details for restaurants in Boston, MA. The script would be executed as:

python3 yellow_pages.py restaurants Boston,MA

You should see a file called restaurants-Boston,MA-yellowpages-scraped-data.csv in the same folder as the script, containing the extracted data. Here is some sample data extracted from YellowPages.com for the command above.

[Image: yellow-pages-extracted-results]

The data will be saved as a CSV file. You can download the code at https://github.com/scrapehero/yellowpages-scraper

Let us know in the comments how this scraper worked for you.

Known Limitations

This code should be capable of scraping the business details of most locations. If you want to scrape the details of thousands of locations, you should read Scalable do-it-yourself scraping – How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
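
Several readers have asked in the comments how to go beyond the first page of results. The script above fetches only page one. Below is a minimal sketch of one way to extend it, assuming Yellow Pages accepts a page query parameter on the search URL (this parameter is an assumption based on the site’s pagination links; verify it in your browser before relying on it):

import requests
from lxml import html

def scrape_result_pages(keyword, place, max_pages=5):
    """Hypothetical helper: fetch several result pages in sequence.

    The 'page' query parameter is assumed from the site's pagination
    links and may change - check it in your browser first.
    """
    all_listings = []
    for page in range(1, max_pages + 1):
        url = ("https://www.yellowpages.com/search?search_terms={0}"
               "&geo_location_terms={1}&page={2}").format(keyword, place, page)
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        if response.status_code != 200:
            break
        parser = html.fromstring(response.text)
        parser.make_links_absolute("https://www.yellowpages.com")
        listings = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
        if not listings:
            break  # an empty page means we have run out of results
        all_listings.extend(listings)
    return all_listings

You could then run each listing element returned here through the same field-extraction logic used in parse_listing().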

If you need professional help with scraping websites, contact us through the ScrapeHero website. For help with the tutorial or code itself, please add a comment at the bottom of this page instead.


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

Responses

Dayna October 19, 2018

How would I use this code to scrape ALL of the pages of results? I tried adding a for loop but it didn’t seem to work for me.


Andrew October 29, 2018

I’m looking for the same as well. How would you scrape the search where there are multiple pages of results? Thanks.


Chris January 9, 2019

Script worked beautifully. Just needs paging, which should be the easiest part. Keeping you guys in mind for projects as well. Thanks!


lucas June 5, 2019

Has anyone figured this question out? How do you loop over all pages of results?
Thanks!


divinebitshop July 25, 2019

But will the code above also work for other business directory websites, especially “https://www.businesslist.com.ng”? And how can it be used for searches with multiple pages?


    ScrapeHero July 25, 2019

    No – each site is different

