How to scrape using Python and LXML

In this web scraping tutorial, we will show you how to scrape Yellow Pages and extract business details based on a city and category.

For this scraper, we will search for restaurants in a city and then scrape the business details from the first page of results.

What data are we extracting?

Here is the list of data fields we will be extracting:

  1. Rank
  2. Business Name
  3. Phone Number
  4. Business Page
  5. Category
  6. Website
  7. Rating
  8. Street Name
  9. Locality
  10. Region
  11. Zip code
  12. URL

Below is a screenshot of the data we will be extracting from YellowPages.


Finding the Data

First, we need to find where the data is located in the web page’s HTML before we start building the Yellow Pages scraper. To do so, you’ll need to understand the HTML tags in the page’s content.

If you already understand HTML and Python, this will be simple for you. You don’t need advanced programming skills for most of this tutorial.

If you don’t know much about HTML and Python, spend some time reading Getting started with HTML – Mozilla Developer Network and an introductory Python tutorial.

Let’s inspect the HTML of the web page and find out where the data is located. Here is what we’re going to do:

  1. Find the HTML tag that encloses the list of results we need data from
  2. Get the links from it and extract the data

Inspecting the HTML

Why should we inspect the element? – To locate any element on the web page using an XPath expression.

Open any browser (we are using Chrome here) and go to the Yellow Pages search results page for restaurants in your chosen city.

Right-click on any link on the page and choose – Inspect Element. The browser will open a toolbar and show the HTML Content of the Web Page in a well-structured format.


The GIF above shows the data we need to extract inside a DIV tag. If you look closely, it has a class attribute with the value ‘result’. This DIV contains the data fields we need to extract.
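To see how this kind of class-based lookup works with lxml, here is a small self-contained sketch. The HTML snippet is a stand-in that mimics the structure of the result cards (the real page markup is much richer, and the class names here follow the scraper code later in this tutorial):

```python
from lxml import html

# A stand-in fragment mimicking the result cards on the listing page
snippet = """
<div class="search-results organic">
  <div class="v-card"><a class="business-name">Joe's Diner</a></div>
  <div class="v-card"><a class="business-name">Thai Garden</a></div>
</div>
"""

tree = html.fromstring(snippet)
# Select every result card by its class attribute
cards = tree.xpath("//div[@class='v-card']")
names = [c.xpath(".//a[@class='business-name']/text()")[0] for c in cards]
print(len(cards), names)
```

Running this prints the two card elements found and their business names, which is exactly the pattern the scraper applies to the live page.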


Now let’s find the HTML tag(s) which has the links we need to extract. You can right-click on the link title in the browser and do Inspect Element again. It will open the HTML Content like before, and highlight the tag which holds the data you right-clicked on. In the GIF below you can see the data fields structured in order.
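Within each highlighted card, the individual fields can then be pulled out with XPath expressions that are relative to that card rather than the whole page. Here is a sketch using simplified stand-in markup (the class names match those used in the scraper code below):

```python
from lxml import html

card = html.fromstring(
    '<div class="v-card">'
    '<a class="business-name">Joe\'s Diner</a>'
    '<div class="phones phone primary">(617) 555-0123</div>'
    '</div>'
)
# The leading './/' makes each expression relative to this card,
# so the same expressions can be reused for every card on the page
name = ''.join(card.xpath(".//a[@class='business-name']//text()")).strip()
phone = ''.join(card.xpath(".//div[@class='phones phone primary']//text()")).strip()
print(name, phone)
```

Joining the list of text nodes and stripping whitespace, as above, is the same cleanup the full scraper performs for every field.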


How to set up your computer to scrape Yellow Pages

We will use Python 3 for this Yellow Pages scraping tutorial. The code will not run if you are using Python 2.7. To start, your system needs Python 3 and PIP installed.

Most UNIX-like operating systems, such as Linux and macOS, come with Python pre-installed, but not all Linux distributions ship with Python 3 by default.

To check your Python version, open a terminal (on Linux and macOS) or Command Prompt (on Windows) and type:

python --version

Then press the Enter key. If the output looks something like Python 3.x.x, you have Python 3 installed. Likewise, if it says Python 2.x.x, you have Python 2. If it prints an error, you probably don’t have Python installed.
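Since the code will not run on Python 2.7, you can also add a small guard at the top of the script to fail fast on the wrong interpreter (an optional addition, not part of the original scraper):

```python
import sys

# Abort early with a clear message instead of failing later on Python 3 syntax
if sys.version_info < (3, 0):
    raise SystemExit("This script requires Python 3; found %d.%d" % sys.version_info[:2])
print("Python %d.%d detected" % sys.version_info[:2])
```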

Install Python 3 and Pip

Guides for installing Python 3 on Linux, macOS, and Windows are available in the official Python documentation at python.org.

Install Packages

Install the third-party Python packages the scraper uses with PIP:

pip3 install requests lxml unicodecsv

The Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import html
import unicodecsv as csv
import argparse


def parse_listing(keyword, place):
    """
    Function to process a Yellow Pages listing page
    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
    # Adding retries
    for retry in range(10):
        response = requests.get(url, verify=False, headers=headers)
        print("parsing page")
        if response.status_code == 200:
            parser = html.fromstring(response.text)
            # making links absolute
            base_url = "https://www.yellowpages.com"
            parser.make_links_absolute(base_url)

            XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
            listings = parser.xpath(XPATH_LISTINGS)
            scraped_results = []

            for results in listings:
                XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
                XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
                XPATH_STREET = ".//div[@class='street-address']//text()"
                XPATH_LOCALITY = ".//div[@class='locality']//text()"
                XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
                XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
                XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

                raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                raw_categories = results.xpath(XPATH_CATEGORIES)
                raw_website = results.xpath(XPATH_WEBSITE)
                raw_rating = results.xpath(XPATH_RATING)
                raw_street = results.xpath(XPATH_STREET)
                raw_locality = results.xpath(XPATH_LOCALITY)
                raw_region = results.xpath(XPATH_REGION)
                raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                raw_rank = results.xpath(XPATH_RANK)

                business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                category = ','.join(raw_categories).strip() if raw_categories else None
                website = ''.join(raw_website).strip() if raw_website else None
                rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                street = ''.join(raw_street).strip() if raw_street else None
                locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None
                region = ''.join(raw_region).strip() if raw_region else None
                zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None
                # Fall back to splitting the locality string (e.g. "Boston, MA 02118")
                # when the addressRegion/postalCode spans are missing
                if locality and ',' in locality and not (region and zipcode):
                    locality, _, rest = locality.partition(',')
                    parts = rest.split()
                    if len(parts) >= 2:
                        region, zipcode = parts[0], parts[1]

                business_details = {
                    'business_name': business_name,
                    'telephone': telephone,
                    'business_page': business_page,
                    'rank': rank,
                    'category': category,
                    'website': website,
                    'rating': rating,
                    'street': street,
                    'locality': locality,
                    'region': region,
                    'zipcode': zipcode,
                    'listing_url': response.url
                }
                scraped_results.append(business_details)
            return scraped_results
        elif response.status_code == 404:
            print("Could not find a location matching", place)
            # no need to retry for a non-existing page
            return []
    print("Failed to process page")
    return []


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')
    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)
    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)

Execute the code by typing the script name followed by -h in a command prompt or terminal:

usage: [-h] keyword place
positional arguments:
keyword     Search Keyword
place       Place Name
optional arguments:
-h, --help  show this help message and exit

The positional argument keyword represents a category, and place is the location in which to search for a business. As an example, let’s find the business details for restaurants in Boston, MA. The script would be executed as:

python3 restaurants Boston,MA

You should see a file called restaurants-Boston,MA-yellowpages-scraped-data.csv (the name is built from the keyword and place arguments) in the same folder as the script, containing the extracted data. Here is some sample data of the business details extracted for the command above.


The data will be saved as a CSV file. The complete code is also available for download.

Let us know in the comments how this code to scrape Yellowpages worked for you.

Known Limitations

This code should be capable of scraping business details for most locations. But if you want to scrape Yellow Pages on a large scale, you should read How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
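Another limitation, raised repeatedly in the comments below, is that the scraper only reads the first page of results. A common approach to pagination is to add a page number to the search URL and feed each page through the same parsing logic. The sketch below assumes Yellow Pages accepts a page query parameter; verify that against the live site before relying on it:

```python
def listing_urls(base_url, max_pages):
    """Yield one search-results URL per page, assuming a '&page=N' parameter."""
    for page in range(1, max_pages + 1):
        yield "{0}&page={1}".format(base_url, page)

urls = list(listing_urls(
    "https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston,MA", 3))
print(urls)
```

Each generated URL could then be fetched and parsed just like the first page, stopping when a page returns no result cards.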

If you need some professional help with scraping websites contact us by filling up the form below.

We can help with your data or automation needs

Turn the Internet into meaningful, structured and usable data

Please DO NOT contact us for any help with our Tutorials and Code using this form or by calling us; instead, please add a comment at the bottom of the tutorial page for help.

Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.


Dayna October 19, 2018

How would I use this code to scrape ALL of the pages of results? I tried adding a for loop but it didn’t seem to work for me.


Andrew October 29, 2018

I’m looking for the same as well. How would you scrape the search where there are multiple pages of results? Thanks.


Chris January 9, 2019

Script worked beautifully. Just needs paging which should be the easiest part. Keeping you guys in mind with projects as well. Thanks!


lucas June 5, 2019

Has anyone figured this question out? How do you loop over all pages of results?


divinebitshop July 25, 2019

But will the codes above also work for other business directory websites, especially “”. And how can it be used for search with multiple pages.


    ScrapeHero July 25, 2019

    No – each site is different


Jamie October 25, 2020

What everyone is looking for, multiple page wise, I am working on right now. It’s called pagination and it’s pretty easy.

I will upload github source if ScrapeHero allows me.


    ScrapeHero October 27, 2020

    Hi Jamie,
    You can link to the source or submit it through github


    David March 3, 2022

    Would be super interested in the code if you had luck with it.


Mrs Mitch March 17, 2022

I not the greatest with coding and keep getting an error message (usage: [-h] keyword place error: the following arguments are required: place
An exception has occurred, use %tb to see the full traceback) when I try to run this script. (python3 restaurants Boston, MA) Maybe I’m not typing it in correctly. Can you help with this please and thanks!


