Web Scraping Using RegEx

ScrapeHero
Published: September 25, 2024

How RegEx works
Data Scraped Using RegEx
Web Scraping using RegEx: The Environment
Web Scraping using RegEx: The Code
Code Limitations
Why Code Yourself? Use ScrapeHero’s Web Scraping Service

Are you wondering whether you can perform web scraping using RegEx? Yes, you can. However, RegEx-based scraping is more error-prone than using a dedicated parsing library, such as BeautifulSoup, so parsers are usually preferred.

But there is no harm in learning how to use RegEx for web scraping. It can solidify your RegEx and web scraping knowledge.

This article shows you how to use RegEx for web scraping without using any parser.

How RegEx works

Before using RegEx for web scraping, let’s be clear on the fundamentals.

RegEx, or regular expressions, work by searching for a pattern in a string. For example, suppose you want to find email addresses in a string. The pattern could then be

\S+@\S+\.\S+

Here,

\S is a non-whitespace character
+ means that the previous token should repeat one or more times
@ matches the character itself
\. matches the period. The period is a special character, requiring a backslash to escape it.

In short, the above pattern searches for a string that has non-whitespace characters before and after the character ‘@’, followed by a period and another set of non-whitespace characters.
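As a quick check of this pattern, here is a minimal sketch using Python's re module. The sample text and the variable names are made up for illustration.

```python
import re

# Pattern from the text: non-whitespace, '@', non-whitespace, '.', non-whitespace
EMAIL_PATTERN = r'\S+@\S+\.\S+'

# Hypothetical sample text, not taken from any real page
sample_text = "Contact alice@example.com or bob@test.org today. Or mail carol@example.net."

# findall() returns every non-overlapping match as a list of strings
emails = re.findall(EMAIL_PATTERN, sample_text)
print(emails)
```

Note that the last match keeps the sentence's trailing period, because \S accepts any non-whitespace character. The pattern is deliberately loose; a stricter one would be needed for real email validation.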

Here is a RegEx cheat sheet you can use while using RegEx for web scraping.

Data Scraped Using RegEx

This tutorial shows web scraping using regular expressions with Python. The Python code uses RegEx to scrape three data points for each product on an eBay search results page:

Name
Price
URL

Use the browser’s inspect tool to find the HTML source code of these data points. Right-click on a data point and click ‘Inspect’.

Web Scraping using RegEx: The Environment

The code in this tutorial uses three Python packages.

The re module: This module enables you to use RegEx
The json module: This module allows you to write the extracted data to a JSON file
Python requests: This library has methods to manage HTTP requests

The re and json modules come with the Python standard library. So you don’t need to install them.

The requests library, however, is not part of the standard library; you can install it using pip.

pip install requests

Web Scraping using RegEx: The Code

Import the packages mentioned above; you can do that with a single code line.

import re, requests, json

Make an HTTP request using the Python requests package to the eBay search results page; the request’s response will contain the HTML source code. You can use the get() method of Python requests to make the HTTP request.

response = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p4432023.m570.l1313&_nkw=smartphones&_sacat=0")
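The query string in that URL is easy to mistype, so it can help to build it from a dict instead. The sketch below uses only the standard library; the names BASE_URL, params, and search_url are illustrative, and the reading of _nkw as the search keyword is an assumption based on its value.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebay.com/sch/i.html"

# Query parameters copied from the URL in the text
params = {
    "_from": "R40",
    "_trksid": "p4432023.m570.l1313",
    "_nkw": "smartphones",  # assumed to be the search keyword
    "_sacat": "0",
}

# urlencode() joins the pairs with '&' and escapes any unsafe characters
search_url = f"{BASE_URL}?{urlencode(params)}"
print(search_url)
```

The resulting string can be passed to requests.get() as before. Alternatively, requests.get() accepts the same dict through its params argument.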

Extract all the div elements containing the product details from the response text. From these div elements, you can then extract the name, URL, and price. The findall() method of the re module can help you find the div elements.

The findall() method takes two arguments, a pattern, and a string. It checks for the pattern in the string and returns the matched values. Here, the pattern matches a string that

Starts with ‘<div class="s-item__wrapper’
Contains ‘<span class=s-item__price’
Ends with ‘</div>’

products = re.findall(r'<div class="s-item__wrapper.+?>.+?<span class=s-item__price.+?<\/div>',response.text)
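One thing worth checking with this pattern is line breaks: by default, the dot does not match a newline, so the pattern only works if each product's markup sits on a single line. The sketch below uses a made-up, simplified HTML snippet (not real eBay output) to show the difference the re.DOTALL flag makes.

```python
import re

product_pattern = r'<div class="s-item__wrapper.+?>.+?<span class=s-item__price.+?<\/div>'

# Hypothetical, simplified markup with line breaks inside each product div
sample_html = '''<div class="s-item__wrapper clearfix">
  <span class=s-item__price>$79.99</span>
</div>
<div class="s-item__wrapper clearfix">
  <span class=s-item__price>$120.00</span>
</div>'''

# Without flags, '.' stops at every newline, so nothing matches
without_flag = re.findall(product_pattern, sample_html)

# With re.DOTALL, '.' also matches newlines, so both divs are found
with_flag = re.findall(product_pattern, sample_html, flags=re.DOTALL)

print(len(without_flag), len(with_flag))
```

Whether the flag is needed depends on how the server formats its response, so it is worth inspecting response.text before deciding.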

The extracted div elements will be within a list; you can iterate through this list and extract the required data points. A different RegEx pattern is required for each data point.

Extracting Name

The name will be inside a span tag with the role ‘heading’

<span role=heading aria-level=3>
             <!--F#f_0-->HP Chromebook 11 G6 11.6" Intel 2.40
             GHz 4GB RAM 16GB eMMC Bluetooth Webcam<!--F/-->
</span>

Therefore, the RegEx pattern to extract the name should

Start with `<span role=heading`
End with `<\/span>`.

name_pattern = r'<span role=heading.+?><.+?>(.+?)<.+?><\/span>'
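To see what the name pattern captures, here is a small sketch run against a one-line version of the span shown above. It assumes the raw response has no line breaks or indentation between the tags (the excerpt above looks formatted for readability), and the product name is shortened for the example.

```python
import re

name_pattern = r'<span role=heading.+?><.+?>(.+?)<.+?><\/span>'

# One-line stand-in for the span in the text; the name is shortened
name_html = '<span role=heading aria-level=3><!--F#f_0-->HP Chromebook 11 G6<!--F/--></span>'

# group(1) is the text between the opening and closing HTML comments
name = re.search(name_pattern, name_html).group(1)
print(name)
```

The two `<.+?>` parts consume the HTML comments around the name, so only the text between them lands in the capture group.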

Extracting Price

The price will be inside a span element with the class ‘s-item__price’

<span
   class=s-item__price>
  <!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/-->
</span>

Therefore, the RegEx pattern to extract the price should

Start with `<span class=s-item__price>`
End with `<\/span>`

price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'
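The same kind of check works for the price. The sketch below again assumes a one-line version of the span, and the conversion to a number is an extra, hypothetical step that only makes sense for a single dollar amount, not for price ranges.

```python
import re

price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'

# One-line stand-in for the price span shown in the text
price_html = '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>'

# The capture group sits between the two opening and two closing comments
price = re.search(price_pattern, price_html).group(1)

# Optional, hypothetical step: strip the currency symbol and separators
amount = float(price.lstrip('$').replace(',', ''))

print(price, amount)
```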

Extracting URL

The URL will be inside an anchor tag as the href attribute.

<a data-interactions='[{"actionKind":"NAVSRC","interaction":"wwFVrK2vRE0lhQQ0MDFKNDlCQ0VOSzQzR001MTNZUEFBWUVYOVY0MDFKNDlCQ0VKWVBDVlhCNDBIUVJGNjFRU0MAAAg3NDAwDE5BVlNSQwA="}]'
         target=_blank
         data-s-03wf764='{"eventFamily":"LST","eventAction":"ACTN","actionKind":"NAVSRC","actionKinds":["NAVSRC"],"operationId":"2351460","flushImmediately":false,"eventProperty":{"$l":"62869668414928"}}'
         _sp=p2351460.m1686.l7400 class=s-item__link
         href=https://www.ebay.com/itm/115603286837?hash=item1aea7e2b35:g:nV0AAOSw~wdjfE2o&itmprp=enc%3AAQAJAAAA4AsnK4hoYAelsNVt8vNwmOQEIEKRSBTpVOI4Fzbr7cY0wK4o3g%2BrtWEJhLd2tsPKiwUrnIGyzEdYgBJOcmzcLc1%2FC4tkCFZqpT5nMaUS3UHfDgnh%2FeiHBaoBh%2BjUmuHYeZzx45Agc8Zvj897LpZpEWGXKSH%2BHigaqb%2BZETNr3mFR9d7CbpBPZ%2BxtTxJxa6HdFw%2BaFHzqxi3xQ3hBXOP9NuoOZo631pXyyFqCMy4eTMG1UcJrvFr5eAspGuquv8tgpImBsQ2ndtFEiB6zKuMfsFvQQI3hdTAvWd926KbKYgqF%7Ctkp%3ABFBM1Omxq6Jk>

Therefore, the RegEx pattern to extract the URL should

Start with `href=`
End with a space.

url_pattern = r'href=(https:.+?) .+?>'
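Because this pattern expects a space after the URL, its result depends on where href sits in the tag. The sketch below compares two hypothetical anchor tags with shortened URLs, and also tries a more tolerant pattern. That second pattern is a suggestion made here, not part of the original code.

```python
import re

url_pattern = r'href=(https:.+?) .+?>'

# Hypothetical alternative: stop at the first whitespace or '>' after the URL
tolerant_url_pattern = r'href=(https:[^\s>]+)'

# Hypothetical anchor tags; the URLs are shortened for readability
href_in_middle = '<a class=s-item__link href=https://www.ebay.com/itm/115603286837 target=_blank>Phone</a>'
href_last = '<a class=s-item__link href=https://www.ebay.com/itm/115603286837>Phone for sale</a>'

# The original pattern works when another attribute follows href...
middle_url = re.search(url_pattern, href_in_middle).group(1)
# ...but runs past the tag when href is the last attribute
last_url = re.search(url_pattern, href_last).group(1)

# The tolerant pattern returns the bare URL in both cases
tolerant_middle = re.search(tolerant_url_pattern, href_in_middle).group(1)
tolerant_last = re.search(tolerant_url_pattern, href_last).group(1)

print(middle_url)
print(last_url)
print(tolerant_last)
```

If the extracted URLs contain stray markup, the position of href in the real response is the first thing to check.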

Note: The above patterns are specific to the eBay search results page. Analyze the HTML source code to determine the appropriate RegEx patterns in each project.

You can use the above patterns to extract data from each div element. Iterate through the extracted div elements, and in each iteration

1. Extract name, price, and URL

name = re.search(name_pattern,product).group(1)
price = re.search(price_pattern,product).group(1)
url = re.search(url_pattern,product).group(1)
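One caveat with these three lines: re.search() returns None when a pattern does not match, and calling .group(1) on None raises an AttributeError. A small helper can guard against that. The function name first_group and the sample strings below are hypothetical.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'

# Hypothetical snippets: one with a price span, one without
with_price = '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>'
without_price = '<span class=s-item__title>No price here</span>'

found = first_group(price_pattern, with_price)
missing = first_group(price_pattern, without_price)

print(found, missing)
```

With such a helper, a product div that lacks one of the data points can be skipped instead of stopping the whole run.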

2. Store them in a dict and append it to a list. The patterns also match some strings that are not required, so append conditionally: skip the entry if the name is ‘Shop on eBay’ or contains the character ‘<’

if name != 'Shop on eBay' and '<' not in name:
    nameAndUrl.append(
        {
            "Name": name,
            "Price": price,
            "URL": url
        }
    )
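Putting the steps together, the whole loop might look like the sketch below. It runs on a hypothetical one-line HTML string instead of response.text so that it works offline, and it skips any product where one of the patterns fails to match. The sample markup is invented to fit the patterns and is much simpler than a real eBay page.

```python
import re

product_pattern = r'<div class="s-item__wrapper.+?>.+?<span class=s-item__price.+?<\/div>'
name_pattern = r'<span role=heading.+?><.+?>(.+?)<.+?><\/span>'
price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'
url_pattern = r'href=(https:.+?) .+?>'

# Hypothetical stand-in for response.text: a placeholder item and a real one
sample_html = (
    '<div class="s-item__wrapper clearfix">'
    '<a class=s-item__link href=https://www.ebay.com/itm/123456 target=_blank>'
    '<span role=heading aria-level=3><!--F#f_0-->Shop on eBay<!--F/--></span></a>'
    '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$20.00<!--F/--><!--F/--></span>'
    '</div>'
    '<div class="s-item__wrapper clearfix">'
    '<a class=s-item__link href=https://www.ebay.com/itm/115603286837 target=_blank>'
    '<span role=heading aria-level=3><!--F#f_0-->HP Chromebook 11 G6<!--F/--></span></a>'
    '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>'
    '</div>'
)

nameAndUrl = []  # the list must exist before the loop appends to it

for product in re.findall(product_pattern, sample_html):
    name_match = re.search(name_pattern, product)
    price_match = re.search(price_pattern, product)
    url_match = re.search(url_pattern, product)

    # Skip products where any of the three data points is missing
    if not (name_match and price_match and url_match):
        continue

    name = name_match.group(1)
    # Drop the placeholder entry and names that still contain markup
    if name != 'Shop on eBay' and '<' not in name:
        nameAndUrl.append(
            {"Name": name, "Price": price_match.group(1), "URL": url_match.group(1)}
        )

print(nameAndUrl)
```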

Finally, you can save the list as a JSON file using the json module. To do so, use json.dump().

with open("regEx.json","w",encoding="utf-8") as f:
    json.dump(nameAndUrl,f,indent=4,ensure_ascii=False)
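The ensure_ascii=False argument matters when product names contain non-ASCII characters. The sketch below shows the difference on a made-up product name and then writes and re-reads the file in a temporary directory, so it leaves nothing behind; the sample record is hypothetical.

```python
import json
import os
import tempfile

# Hypothetical record with a non-ASCII character in the name
nameAndUrl = [
    {"Name": "Café Phone 5G", "Price": "$79.99", "URL": "https://www.ebay.com/itm/1"}
]

# Default behaviour escapes non-ASCII characters; ensure_ascii=False keeps them
escaped = json.dumps(nameAndUrl[0]["Name"])
readable = json.dumps(nameAndUrl[0]["Name"], ensure_ascii=False)
print(escaped)
print(readable)

# Write the list to a temporary file and read it back to confirm the round trip
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "regEx.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nameAndUrl, f, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == nameAndUrl)
```

Either setting loads back to the same data; ensure_ascii=False only makes the file easier to read by eye.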

Code Limitations

The code shown in this tutorial is only efficient if the HTML is well-structured. For complex, highly nested HTML source code, web scraping using RegEx can become slow.

Moreover, a slight change in the HTML code can break the code. For example, a change in spacing or the order of attributes may render the code unusable even if the attributes and the tag names of the data points remain unchanged.
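This fragility is easy to reproduce. In the sketch below, the price pattern from this tutorial stops matching as soon as the class attribute gains quotes, even though the tag and class name are unchanged. The tolerant pattern is a hypothetical variant written for this example; it handles this one variation and is not a general fix.

```python
import re

price_pattern = r'<span class=s-item__price><.+?><.+?>(.+?)<.+?><.+?><\/span>'

# Hypothetical variant: optional quotes, any number of leading HTML comments
tolerant_price_pattern = r'<span\s+class="?s-item__price"?\s*>(?:<!--.*?-->)*([^<]+)'

# Same element written with and without quotes around the class value
unquoted = '<span class=s-item__price><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>'
quoted = '<span class="s-item__price"><!--F#f_0--><!--F#f_0-->$79.99<!--F/--><!--F/--></span>'

# The original pattern matches only the unquoted form
strict_unquoted = re.search(price_pattern, unquoted)
strict_quoted = re.search(price_pattern, quoted)

# The tolerant variant matches both forms
tolerant_unquoted = re.search(tolerant_price_pattern, unquoted).group(1)
tolerant_quoted = re.search(tolerant_price_pattern, quoted).group(1)

print(strict_unquoted is not None, strict_quoted is not None)
print(tolerant_unquoted, tolerant_quoted)
```

Each extra variation a pattern has to tolerate makes it longer and harder to read, which is part of why dedicated parsers are usually preferred.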

The code does not bypass anti-scraping measures. Hence, it is not appropriate for large-scale web scraping, as the massive number of requests makes your scraper more susceptible to these measures.

Why Code Yourself? Use ScrapeHero’s Web Scraping Service

The code can scrape three data points from an eBay search results page, showing web scraping with RegEx in Python.

However, maintaining RegEx code can be challenging, as slight changes in the HTML can break it. Moreover, scraping additional data points requires more complex RegEx that can slow down the process.

Therefore, it is better to use a professional web scraping service, like ScrapeHero, for large-scale projects where scalability is important.

ScrapeHero’s web scraping service can build enterprise-grade web scrapers and crawlers according to your specifications. This way, you can focus on using the data to derive insights rather than gathering it. Contact ScrapeHero now for high-quality data.
