A guide to web scraping using BeautifulSoup

Web scraping is a technique for extracting data from websites automatically. The data it gathers feeds machine learning, data analysis, and artificial intelligence use cases. Raw HTML is rarely usable as it is, so it must first be converted into a structured format. BeautifulSoup is a Python library that parses HTML and XML documents and makes it easy to extract information from them, which has made it a popular choice in the Python community. In this article, we’ll dive deeper into using BeautifulSoup for web scraping.

BeautifulSoup installation

pip install beautifulsoup4

How to parse and extract data from an HTML page using BeautifulSoup?

We will use the following HTML throughout this article. It’s a simple listing page with Pokémon as products.

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Demo Page</title>
</head>
<body>
    <div class="product-container">
    <h3>Products</h3>
    <div class="product" style="border: solid; margin:1%">
        <p class="product-name"> <b> Name: </b> <span> Abra </span></p>
        <p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
        <a href="/pokemon/abra">Buy</a>
    </div>
    <div class="product" style="border: solid; margin:1%">
        <p class="product-name"> <b> Name: </b> <span> Absol </span></p>
        <p class="price"> <b> Price: </b> <span> 80.00 </span> </p>
        <a href="/pokemon/absol">Buy</a>
    </div>
    <div class="product" style="border: solid; margin:1%">
        <p class="product-name"> <b> Name: </b> <span> Altaria </span></p>
        <p class="price"> <b> Price: </b> <span> 120.00 </span> </p>
        <a href="/pokemon/altaria">Buy</a>
    </div>
    <div class="product" style="border: solid; margin:1%">
        <p class="product-name"> <b> Name: </b> <span> Arctozolt </span></p>
        <p class="price"> <b> Price: </b> <span> 110.00 </span> </p>
        <a href="/pokemon/arctozolt">Buy</a>
    </div>
    <div class="product" style="border: solid; margin:1%">
        <p class="product-name"> <b> Name: </b> <span> Barbaracle </span></p>
        <p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
        <a href="/pokemon/barbaracle">Buy</a>
    </div>
</div>
</body>
</html>

Parsing HTML content

Let’s import the BeautifulSoup library.

from bs4 import BeautifulSoup

Now, let’s parse the HTML shown above using the code below. BeautifulSoup does not parse HTML by itself; it delegates that work to a parser such as Python’s built-in html.parser or the third-party lxml and html5lib libraries.

sample_html = '<HTML Text Here>'
soup = BeautifulSoup(sample_html, 'html.parser')

The first parameter used while calling the BeautifulSoup constructor is the sample HTML, and the second one is the features. The “features” argument in BeautifulSoup lets you choose the type of parser to use (e.g. “lxml”, “html.parser”) or the type of document (e.g. “html”, “xml”). It is best to specify a particular parser for consistent results across platforms and virtual environments.
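As a minimal sketch of the constructor call described above, the snippet below parses a shortened copy of the listing page. The shortened markup is an assumption made to keep the example self-contained; html.parser is used because it ships with Python.

```python
from bs4 import BeautifulSoup

# A shortened, illustrative version of the listing page used in this article.
sample_html = """
<div class="product-container">
    <h3>Products</h3>
    <div class="product">
        <p class="product-name"> <b> Name: </b> <span> Abra </span></p>
        <p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
        <a href="/pokemon/abra">Buy</a>
    </div>
</div>
"""

# Name the parser explicitly so the same tree is built on every machine.
soup = BeautifulSoup(sample_html, "html.parser")

page_heading = soup.h3.string
print(page_heading)
```

If the second argument is left out, BeautifulSoup picks the best parser it finds installed, so the resulting tree can differ between environments.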

Accessing tags from HTML

Now that we have got the HTML parsed, let’s dive into accessing the tags and data within them.

You can access individual tags as attributes of the soup object. For instance, check out this code to get the h3 tag:

soup.h3
Output: <h3>Products</h3>

Try this code line to extract the text within a tag:

soup.h3.string
Output: 'Products'

To access child tags, use the dot operator on the parent tag. For instance, let’s access the h3 tag inside the first div tag (the product container). Here’s an example:

soup.div.h3
Output: <h3>Products</h3>
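One detail worth knowing about attribute-style access is that it returns only the first matching tag, and None when nothing matches. The sketch below illustrates this on a small made-up fragment (the markup and variable names are illustrative, not part of the listing page).

```python
from bs4 import BeautifulSoup

html = "<div><h3>Products</h3><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_paragraph = soup.div.p      # only the first <p> is returned
missing_table = soup.div.table    # there is no <table>, so this is None

first_text = first_paragraph.string
# Guard before chaining further; calling .string on None raises AttributeError.
table_text = missing_table.string if missing_table is not None else None

print(first_text, table_text)
```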

Accessing the nth tag from HTML

Attribute-style access returns only the first matching tag, which makes it awkward to reach tags that have many siblings. For example, to get all the <div> tags nested inside the first <div> of the above HTML, you can use the find_all method of BeautifulSoup.

div_tags = soup.div.find_all('div')
Output: [<div class="product" style="border: solid; margin:1%">
 <p class="product-name"> <b> Name: </b> <span> Abra </span></p>
 <p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
 <a href="/pokemon/abra">Buy</a>
 </div>,
...]

The find_all method accepts a tag name and returns all matching tags in a list-like ResultSet. We can access an individual tag by specifying its index.

div_tags[0]
Output: <div class="product" style="border: solid; margin:1%">
<p class="product-name"> <b> Name: </b> <span> Abra </span></p>
<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
<a href="/pokemon/abra">Buy</a>
</div>

Each item in the list is itself a Tag object, so it supports the same navigation and search methods as the soup object. To access the name of the first product, you just need to specify which child tag has the data. You can find the code snippet below:

div_tags[0].p.span.string
Output: ' Abra '

Now, let’s see how to print the name of all products. You can do this using a for loop. Take a look at the example below:

for div_tag in div_tags:
    print(div_tag.p.span.string)
Output: Abra
Absol
Altaria
Arctozolt
Barbaracle
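The strings in the sample HTML carry leading and trailing spaces (' Abra '), which you will usually want to remove before storing the values. A minimal sketch, assuming a two-product excerpt of the listing page, uses get_text(strip=True) to collect clean names into a list:

```python
from bs4 import BeautifulSoup

# Two-product excerpt of the listing page, kept short for illustration.
html = """<div class="product-container">
<div class="product"><p class="product-name"> <b> Name: </b> <span> Abra </span></p></div>
<div class="product"><p class="product-name"> <b> Name: </b> <span> Absol </span></p></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div_tags = soup.div.find_all("div")

# strip=True removes the whitespace surrounding each piece of text.
names = [div_tag.p.span.get_text(strip=True) for div_tag in div_tags]
print(names)
```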

Filter tags by attributes

HTML tags often have attributes such as class, id, name, etc. which can be utilized to filter tags. As an example, consider the following code that filters the <p> tag with a class attribute value ‘price’:

  • Filter tags by class attribute
    soup.find_all('p', attrs={'class': 'price'})

    Or it can be simplified like this:

    soup.find_all('p', class_='price')
  • Filter tags by id attribute
    soup.find_all('p', attrs={'id': 'id of the tag'})

    Or

    soup.find_all('p', id='id of the tag')
  • Filter elements by attribute without tag name
    soup.find_all(id='id of the tag')
  • Filter elements by non-standard attributes
    soup.find_all('div', attrs={'data-class': 'value to search'})
  • Filter elements by multiple attributes
    soup.find_all('div', attrs={'class': 'class to search', 'id': 'id to search'})
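The filters listed above can be tried on a small fragment. In this sketch the id value "featured" and the data-type attribute are made up for illustration; they do not appear in the listing page.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the id and data-type values are assumptions.
html = """
<div id="featured" class="product" data-type="psychic">
    <p class="price">100.00</p>
</div>
<div class="product" data-type="dark">
    <p class="price">80.00</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("p", class_="price")   # class_ avoids the Python keyword
by_id = soup.find_all(id="featured")            # no tag name given
# data-type is not a valid Python argument name, so it must go through attrs.
by_data = soup.find_all("div", attrs={"data-type": "dark"})
by_both = soup.find_all("div", attrs={"class": "product", "id": "featured"})

print(len(by_class), len(by_id), by_data[0].p.string, len(by_both))
```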

How to extract the text inside tags

BeautifulSoup offers multiple methods to access text inside a tag. Let’s check out a few examples to print texts using different approaches.

First, filter the price tags from the HTML

price_tags = soup.find_all('p', class_='price')

Using text attribute

for price_tag in price_tags:
    print(price_tag.span.text)
Output: 100.00
80.00
120.00
110.00
100.00

Using the get_text() method

for price_tag in price_tags:
    print(price_tag.span.get_text())
Output: 100.00
80.00
120.00
110.00
100.00

Using the string attribute

You can also use the string attribute to extract text when the node has exactly one child, either a text node or a single tag that itself resolves to one string. If the node has more than one child, the string attribute returns None.

for price_tag in price_tags:
    print(price_tag.span.string)
Output: 100.00
80.00
120.00
110.00
100.00
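The difference between these approaches shows up on a tag that has several children. The sketch below uses one price paragraph from the listing page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<p class="price"> <b> Price: </b> <span> 100.00 </span> </p>'
soup = BeautifulSoup(html, "html.parser")
price_tag = soup.p

# <p> holds several children (whitespace, <b>, <span>), so .string gives up.
whole_string = price_tag.string

# <span> holds a single text node, so .string returns it unchanged.
span_string = price_tag.span.string

# get_text() joins every piece of text below the tag; the separator and
# strip=True arguments tidy up the whitespace.
clean_text = price_tag.get_text(" ", strip=True)

print(whole_string, repr(span_string), clean_text)
```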

Extract URLs from anchor tags

First, let’s find one anchor tag.

anchor_tag = soup.find('a')

Note: The find method is very similar to the find_all method. However, find returns only the first match (or None if nothing matches), whereas find_all returns all the matches.

We can extract the URL using the code line mentioned below.

anchor_tag['href']
Output: '/pokemon/abra'

Any other attribute can be read the same way. To get all of a tag’s attributes as a dict, use the code below:

anchor_tag.attrs
Output: {'href': '/pokemon/abra'}
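The links in the sample page are relative, so a scraper usually has to combine them with the site address. The sketch below collects every URL and makes it absolute with the standard library's urljoin; the base URL https://example.com is a placeholder assumption, as is the third anchor without an href.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # placeholder, not the real site address

html = """
<a href="/pokemon/abra">Buy</a>
<a href="/pokemon/absol">Buy</a>
<a>No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the anchors that actually carry an href attribute.
urls = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

# tag["href"] raises KeyError when the attribute is absent; .get() returns None.
missing = soup.find_all("a")[2].get("href")

print(urls, missing)
```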

Getting HTML of tags

In certain cases, we may need the raw HTML of a particular tag. For that, we can use the Python str() function. See the example below:

anchor_tag = soup.find('a')
str(anchor_tag)
Output: '<a href="/pokemon/abra">Buy</a>'

BeautifulSoup offers a method called prettify() which will return the formatted HTML. See the example code line below:

print(anchor_tag.prettify())
Output: <a href="/pokemon/abra">
 Buy
</a>

Using CSS Selector with BeautifulSoup

BeautifulSoup supports CSS selectors through the select method, which is available on the soup object and on individual tags.

For example, the <span> tags holding the product names can be selected as follows:

soup.select('div p.product-name span')
Output: [<span> Abra </span>,
 <span> Absol </span>,
 <span> Altaria </span>,
 <span> Arctozolt </span>,
 <span> Barbaracle </span>]

The select() method returns a list of matching elements, which we can iterate over to extract the strings. If you need only the first matching result, you can use the select_one method.

soup.select_one('div p.product-name span')

Output: <span> Abra </span>
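Putting select and select_one together, one way to turn the listing page into structured records is sketched below. The two-product markup is an excerpt, and the dictionary keys (name, price, url) are illustrative choices.

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
    <p class="product-name"> <b> Name: </b> <span> Abra </span></p>
    <p class="price"> <b> Price: </b> <span> 100.00 </span> </p>
    <a href="/pokemon/abra">Buy</a>
</div>
<div class="product">
    <p class="product-name"> <b> Name: </b> <span> Absol </span></p>
    <p class="price"> <b> Price: </b> <span> 80.00 </span> </p>
    <a href="/pokemon/absol">Buy</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for product in soup.select("div.product"):
    # select_one searches only inside the current product <div>.
    products.append({
        "name": product.select_one("p.product-name span").get_text(strip=True),
        "price": float(product.select_one("p.price span").get_text(strip=True)),
        "url": product.select_one("a[href]")["href"],
    })

print(products)
```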

XPath is another common way to filter tags, but unfortunately BeautifulSoup does not support it, so you have to rely on CSS selectors. If you need XPath support, you can look into the Python lxml library. For installation and getting started with lxml, see its official documentation.

How to choose the best parser for BeautifulSoup

As mentioned earlier, BeautifulSoup supports various parsing libraries. So how should we select one among them?

The following table summarizes the advantages and disadvantages of each parser.

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python’s html.parser | BeautifulSoup(markup, "html.parser") | Standard library; decent speed; lenient | Not as fast as lxml; less lenient than html5lib |
| lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |

If you want faster parsing, go with lxml’s HTML parser. This requires the lxml library to be installed. You can install lxml by running the following command.

pip install lxml
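If your code has to run in environments where lxml may not be installed, one option is to fall back to the built-in parser. The helper below is a hypothetical sketch (make_soup is not part of BeautifulSoup); it relies on the FeatureNotFound exception that BeautifulSoup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml"), "lxml"
    except FeatureNotFound:
        # lxml is missing, so use the parser from the standard library.
        return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, parser_used = make_soup("<h3>Products</h3>")
print(parser_used, soup.h3.string)
```

Keep in mind that different parsers can build slightly different trees from invalid markup, so a fallback like this trades consistency for convenience.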

Best practices while parsing in BeautifulSoup

Increase the parsing speed

Note that BeautifulSoup will never be as fast as the parsers it is built on. If compute time is at a premium in your environment, you should consider working with lxml directly. That being said, the following are the ways you can improve the performance of BeautifulSoup.

  1. If you are not using lxml as the parser, start using it. With lxml, BeautifulSoup parses HTML significantly faster than with html.parser or html5lib.
  2. Use the cchardet library for faster encoding detection.

Reduce memory usage

To reduce memory consumption while parsing with BeautifulSoup, you can parse only a part of the document by passing a SoupStrainer through the parse_only argument. This also makes searching faster. You can find more information in the official BeautifulSoup documentation.
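As a minimal sketch of partial parsing, the snippet below keeps only the anchor tags of a two-product excerpt while the tree is being built. It uses html.parser, since the html5lib parser does not support parse_only.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<div class="product">
    <p class="product-name"><span>Abra</span></p>
    <a href="/pokemon/abra">Buy</a>
</div>
<div class="product">
    <p class="product-name"><span>Absol</span></p>
    <a href="/pokemon/absol">Buy</a>
</div>
"""

# Only <a> tags (and their contents) are added to the tree.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

links = [a["href"] for a in soup.find_all("a")]
paragraphs = soup.find_all("p")  # empty: <p> tags were never built

print(links, paragraphs)
```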
