In this web scraping tutorial, we will show you how to scrape Yellow Pages and extract business details based on a city and category.
For this scraper, we will search Yellowpages.com for restaurants in a city, then scrape business details from the first page of results.
What data are we extracting?
Here is the list of data fields we will be extracting:
- Rank
- Business Name
- Phone Number
- Business Page
- Category
- Website
- Rating
- Street Name
- Locality
- Region
- Zip code
- URL
Below is a screenshot of the data we will be extracting from YellowPages.
Finding the Data
Before we start building the Yellowpages scraper, we need to find where the data sits in the web page’s HTML tags. To do that, you need a basic understanding of the HTML tags in the page’s content.
If you already understand HTML and Python, this will be simple for you. You don’t need advanced programming skills for most of this tutorial.
If you don’t know much about HTML and Python, spend some time reading Getting started with HTML – Mozilla Developer Network and https://www.programiz.com/python-programming
Let’s inspect the HTML of the web page and find out where the data is located. Here is what we’re going to do:
- Find the HTML tag that encloses the list of links we need the data from
- Get the links from it and extract the data
Inspecting the HTML
Why should we inspect the element? Because once we know which tags and attributes surround a piece of data, we can locate it on the page with an XPath (XML Path) expression.
Open any browser (we are using Chrome here) and go to https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston
Right-click on any link on the page and choose Inspect Element (labelled Inspect in recent versions of Chrome). The browser will open its developer tools panel and show the HTML content of the web page in a well-structured format.
The GIF above shows that the data we need to extract sits inside a DIV tag. If you look closely, that DIV has a ‘class’ attribute with the value ‘result’. Each such DIV contains the data fields we need to extract for one business.
Now let’s find the HTML tag(s) that hold the links we need to extract. You can right-click on the link title in the browser and do Inspect Element again. It will open the HTML content like before, and highlight the tag which holds the data you right-clicked on. In the GIF below you can see the data fields structured in order.
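To make the idea concrete before installing anything, here is a minimal sketch that runs the same kind of path expressions against a tiny hand-written snippet. The markup, business names, and phone numbers are made up for illustration. It uses Python's built-in xml.etree.ElementTree, which supports only a small XPath subset and needs well-formed markup, as a stand-in; the scraper later in this tutorial uses lxml, which copes with real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a search results page (not real data).
SAMPLE = """
<div class="search-results">
  <div class="result">
    <a class="business-name" href="/boston-ma/mip/example-diner-1">Example Diner</a>
    <div class="phones">(555) 010-0000</div>
  </div>
  <div class="result">
    <a class="business-name" href="/boston-ma/mip/sample-cafe-2">Sample Cafe</a>
    <div class="phones">(555) 010-0001</div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

listings = []
# Step 1: find every DIV whose class attribute is exactly 'result'.
for result in root.findall(".//div[@class='result']"):
    # Step 2: inside each result, locate the link and the phone element.
    link = result.find(".//a[@class='business-name']")
    phone = result.find(".//div[@class='phones']")
    listings.append({
        "business_name": link.text.strip(),
        "business_page": link.get("href"),
        "telephone": phone.text.strip(),
    })

for row in listings:
    print(row)
```

Each path starts with a dot so that it searches only inside the element it is called on, which is how one result DIV is kept separate from the next.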
How to set up your computer to scrape Yellow Pages
We will use Python 3 for this Yellow Pages scraping tutorial. The code will not run if you are using Python 2.7. To start, your system needs Python 3 and PIP installed in it.
Most UNIX-like operating systems, such as Linux and macOS, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.
To check your Python version, open a terminal (on Linux and macOS) or Command Prompt (on Windows) and type:
python --version
Then press the Enter key. If the output looks something like Python 3.x.x, you have Python 3 installed. If it shows Python 2.x.x, you have Python 2. If it prints an error, you probably don’t have Python installed. On some systems the Python 3 interpreter is invoked as python3, so try python3 --version as well.
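The same check can be done from inside Python, which is handy if you are unsure which interpreter an editor or notebook is using. This is a small sketch; the function name describe_python is made up for this example.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short message saying whether this interpreter can run the scraper."""
    major, minor = version_info[0], version_info[1]
    if major >= 3:
        return "Python {}.{} detected: OK for this tutorial".format(major, minor)
    return "Python {}.{} detected: please install Python 3".format(major, minor)


# Report on the interpreter that is running this snippet.
print(describe_python())
```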
Install Python 3 and Pip
Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/
Mac Users can follow this guide here – http://docs.python-guide.org/en/latest/starting/install3/osx/
Likewise, Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/
Install Packages
- Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/).
- Python LXML, for parsing the HTML tree structure using XPaths (learn how to install it here – http://lxml.de/installation.html)
- unicodecsv, which the script imports to write the CSV file (install it with pip install unicodecsv)
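Before running the scraper, you may want to confirm that the packages it imports are available. The sketch below is one way to do that with the standard library; the list reflects the imports at the top of the script (requests, lxml, and unicodecsv), and the helper name missing_packages is made up for this example.

```python
import importlib.util

# Import names used at the top of the scraper script.
REQUIRED = ["requests", "lxml", "unicodecsv"]


def missing_packages(names):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```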
The Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from lxml import html
import unicodecsv as csv
import argparse


def parse_listing(keyword, place):
    """
    Function to process yellowpage listing page
    : param keyword: search query
    : param place : place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
               }

    # Adding retries: failed attempts fall through to the next iteration
    for retry in range(10):
        try:
            # note: verify=False disables TLS certificate verification
            response = requests.get(url, verify=False, headers=headers)
            print("parsing page")
            if response.status_code == 200:
                parser = html.fromstring(response.text)
                # making links absolute
                base_url = "https://www.yellowpages.com"
                parser.make_links_absolute(base_url)

                XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
                listings = parser.xpath(XPATH_LISTINGS)
                scraped_results = []

                for results in listings:
                    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
                    XPATH_BUSSINESS_PAGE = ".//a[@class='business-name']//@href"
                    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
                    XPATH_ADDRESS = ".//div[@class='info']//div//p[@itemprop='address']"
                    XPATH_STREET = ".//div[@class='street-address']//text()"
                    XPATH_LOCALITY = ".//div[@class='locality']//text()"
                    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
                    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
                    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
                    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
                    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
                    XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

                    raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                    raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                    raw_business_page = results.xpath(XPATH_BUSSINESS_PAGE)
                    raw_categories = results.xpath(XPATH_CATEGORIES)
                    raw_website = results.xpath(XPATH_WEBSITE)
                    raw_rating = results.xpath(XPATH_RATING)
                    # address = results.xpath(XPATH_ADDRESS)
                    raw_street = results.xpath(XPATH_STREET)
                    raw_locality = results.xpath(XPATH_LOCALITY)
                    raw_region = results.xpath(XPATH_REGION)
                    raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                    raw_rank = results.xpath(XPATH_RANK)

                    # Join the text nodes of each field; missing fields become None
                    business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                    telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                    business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                    rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                    category = ','.join(raw_categories).strip() if raw_categories else None
                    website = ''.join(raw_website).strip() if raw_website else None
                    rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                    street = ''.join(raw_street).strip() if raw_street else None
                    locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None

                    # Split "City, ST 12345" into its parts; skip listings without that shape
                    region = zipcode = None
                    if locality and ',' in locality:
                        locality, locality_parts = locality.split(',', 1)
                        parts = locality_parts.split()
                        region = parts[0] if len(parts) > 0 else None
                        zipcode = parts[1] if len(parts) > 1 else None

                    business_details = {
                        'business_name': business_name,
                        'telephone': telephone,
                        'business_page': business_page,
                        'rank': rank,
                        'category': category,
                        'website': website,
                        'rating': rating,
                        'street': street,
                        'locality': locality,
                        'region': region,
                        'zipcode': zipcode,
                        'listing_url': response.url
                    }
                    scraped_results.append(business_details)

                return scraped_results

            elif response.status_code == 404:
                print("Could not find a location matching", place)
                # no need to retry for non existing page
                break
            else:
                print("Failed to process page, retrying")

        except Exception as error:
            print("Failed to process page, retrying:", error)

    # Reached after a 404 or when every retry failed
    return []


if __name__ == "__main__":

    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')

    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        # unicodecsv writes bytes, so the file is opened in binary mode
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category', 'website', 'rating',
                          'street', 'locality', 'region', 'zipcode', 'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)
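The script wraps its request in a retry loop. As a separate, minimal sketch of that idea, the helper below retries any function that may fail and pauses between attempts. The names fetch_with_retries and flaky are made up for this example, and flaky only simulates a request that fails twice before succeeding; it does not contact any website.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that returns a result or raises.
    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:
            print("Attempt {} failed: {}".format(attempt, error))
            if attempt < attempts:
                time.sleep(delay)  # pause before the next try
    return None


# Simulated request: fails on the first two calls, succeeds on the third.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "page content"


result = fetch_with_retries(flaky, attempts=5, delay=0)
print(result, "after", calls["count"], "calls")
```

Keeping the retry logic apart from the parsing makes it easier to see which failures are worth retrying (a dropped connection) and which are not (a 404 for a place that does not exist).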
To see how to run the script, type the script name followed by -h in a command prompt or terminal (for example, python3 yellow_pages.py -h):
usage: yellow_pages.py [-h] keyword place

positional arguments:
  keyword     Search Keyword
  place       Place Name

optional arguments:
  -h, --help  show this help message and exit
The positional argument keyword is the business category to search for, and place is the location to search in. As an example, let’s find the business details for restaurants in Boston, MA. The script would be executed as:
python3 yellow_pages.py restaurants Boston,MA
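Note that the place is written without a space (Boston,MA). The sketch below rebuilds the script's two-argument parser to show why: the shell splits the command line on spaces, so a place containing a space has to be quoted to arrive as a single argument. The build_parser function is introduced here only for illustration.

```python
import argparse


def build_parser():
    """Recreate the scraper's command-line interface: two positional arguments."""
    parser = argparse.ArgumentParser(prog="yellow_pages.py")
    parser.add_argument("keyword", help="Search Keyword")
    parser.add_argument("place", help="Place Name")
    return parser


parser = build_parser()

# Equivalent to: python3 yellow_pages.py restaurants Boston,MA
args = parser.parse_args(["restaurants", "Boston,MA"])
print(args.keyword, "|", args.place)

# Equivalent to: python3 yellow_pages.py restaurants "Boston, MA"
# The quotes make the shell pass "Boston, MA" as one argument.
args = parser.parse_args(["restaurants", "Boston, MA"])
print(args.keyword, "|", args.place)
```

Without the quotes, Boston, and MA arrive as two separate arguments and argparse exits with an error about unrecognized arguments. Running the script from a notebook cell instead of a terminal also fails, because no command-line arguments are supplied there.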
You should see a file called restaurants-Boston,MA-yellowpages-scraped-data.csv in the same folder as the script, with the extracted data. The file name is built from the keyword and place exactly as you typed them. Here is some sample data of the business details extracted from YellowPages.com for the command above.
The data will be saved as a CSV file. You can download the code at https://github.com/scrapehero/yellowpages-scraper
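The script writes its output with the third-party unicodecsv package. As an alternative sketch, Python 3's built-in csv module can write and read the same columns. The row below is a placeholder invented for this example, not scraped data, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Same column order the scraper uses for its CSV header.
FIELDNAMES = ['rank', 'business_name', 'telephone', 'business_page', 'category',
              'website', 'rating', 'street', 'locality', 'region', 'zipcode',
              'listing_url']


def write_rows(handle, rows):
    """Write dictionaries to a text-mode file handle as fully quoted CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder row for illustration only; fields left out are written as empty.
sample = [{
    'rank': '1',
    'business_name': 'Example Diner',
    'telephone': '(555) 010-0000',
    'website': None,
    'locality': 'Boston',
    'region': 'MA',
}]

buffer = io.StringIO()
write_rows(buffer, sample)

# Read the data back, one dictionary per row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]['business_name'], rows[0]['locality'], rows[0]['region'])
```

To write to a real file with the built-in module, open it in text mode, for example open(path, 'w', newline='', encoding='utf-8'), rather than the binary 'wb' mode that unicodecsv expects.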
Let us know in the comments how this code to scrape Yellowpages worked for you.
Known Limitations
This code should be able to scrape business details for most locations. But if you want to scrape Yellowpages on a large scale, you should read How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping.
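Another limitation is that the script only reads the first page of results. A possible starting point for going further is to generate one search URL per page, as sketched below. This assumes the site accepts a page query parameter; that name is an assumption, so check it by clicking through to the second page of results in your browser and reading the address bar. The function names are made up for this example, and the code only builds URLs without sending any requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yellowpages.com/search"


def build_search_url(keyword, place, page=1):
    """Build a search URL, URL-encoding the keyword and place."""
    params = {"search_terms": keyword, "geo_location_terms": place}
    if page > 1:
        params["page"] = page  # assumed parameter name, verify in the browser
    return BASE_URL + "?" + urlencode(params)


def search_urls(keyword, place, max_pages):
    """Return the URLs for pages 1..max_pages of a search."""
    return [build_search_url(keyword, place, page)
            for page in range(1, max_pages + 1)]


for url in search_urls("restaurants", "Boston, MA", 3):
    print(url)
```

Each URL could then be fetched and parsed the same way the script handles the first page, stopping when a page returns no listings. Encoding the parameters with urlencode also means a place such as "Boston, MA" is passed safely in the URL.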
Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.
Responses
How would I use this code to scrape ALL of the pages of results? I tried adding a for loop but it didn’t seem to work for me.
I’m looking for the same as well. How would you scrape the search where there are multiple pages of results? Thanks.
Script worked beautifully. Just needs paging which should be the easiest part. Keeping you guys in mind with projects as well. Thanks!
Has anyone figured this question out? How do you loop over all pages of results?
Thanks!
But will the codes above also work for other business directory websites, especially “https://www.businesslist.com.ng”. And how can it be used for search with multiple pages.
No – each site is different
What everyone is looking for, multiple page wise, I am working on right now. It’s called pagination and it’s pretty easy.
I will upload github source if ScrapeHero allows me.
Hi Jamie,
You can link to the source or submit it through github
Would be super interested in the code if you had luck with it.
I not the greatest with coding and keep getting an error message (usage: ipykernel_launcher.py [-h] keyword place
ipykernel_launcher.py: error: the following arguments are required: place
An exception has occurred, use %tb to see the full traceback) when I try to run this script. (python3 yellow_pages.py restaurants Boston, MA) Maybe I’m not typing it in correctly. Can you help with this please and thanks!