How to Scrape a Dynamic Website

Have you noticed that sometimes Python’s requests library gets you different source code from what you see in the browser? That is because some websites generate their HTML code via JavaScript at runtime. These are dynamic websites. In this tutorial, you will learn about web scraping dynamic web pages with Python.

What are Dynamic Websites?

Dynamic websites are those that generate HTML code during run-time, giving these sites a lot of flexibility. They can vary content based on a visitor’s location, time, actions, and so on.

E-commerce websites are examples of dynamic websites. They display information on their homepage based on various metrics, including supply and demand.

Advantages of dynamic websites include:

  • Personalized content display: The site can display information based on who is logged in. E-commerce websites do this; they recommend products based on the user’s previous purchases.
  • Quick site updates: As the site uses code to generate HTML, you only have to change the code to make site-wide changes.

Disadvantages of dynamic websites:

  • Complex: You need to have advanced technical knowledge to build a dynamic website. For example, Angular, a popular framework for building dynamic websites, has a steep learning curve.
  • Slow: Dynamic sites may be slower than static sites as they rely on processing information before displaying the HTML code.

Types of Dynamic Websites

Dynamic websites may use client-side and server-side scripting.

  • Client-side scripts run JavaScript in the visitor’s browser to generate the HTML code. Your actions trigger the JavaScript, which updates the HTML.
    Infographic representing user-browser interaction on client-side rendered dynamic websites
  • Server-side scripts process the information on the server and send the finished HTML page back to the client. Besides JavaScript, the server may use languages such as PHP or Python.
    Infographic representing client-server interaction on dynamic web pages using server-side scripts

Problems When Using Python Requests

This tutorial uses “https://angular.dev/docs” as the target for web scraping with Python. You can see a lot of content when you visit the website in the browser. However, if you use requests, you will receive a response with less content.

You can see why this happens if you go to the target page without enabling JavaScript.

Target webpage with JavaScript disabled

Here is the same web page with JavaScript enabled:

Target webpage with JavaScript enabled.

Thus, when you use requests to get the target webpage, it fetches the source code from the first scenario.
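
To see the difference yourself, you can compare the length of the raw response with what a browser renders; here is a minimal sketch using requests (the printed length will vary):

import requests

# Fetch the page without running JavaScript; the response contains
# only the initial HTML, not the content rendered at runtime.
response = requests.get("https://angular.dev/docs")
print(len(response.text))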

Install the Packages Required for Scraping Dynamic Websites

This tutorial uses three external packages for web scraping: BeautifulSoup and Selenium for extracting data, and Pandas for saving it.

  • BeautifulSoup
  • Selenium
  • Pandas

You can install them using Python pip. Note that BeautifulSoup is published on PyPI as beautifulsoup4.

pip install selenium beautifulsoup4 pandas

Scraping Dynamic Web Pages with Python

You will use Selenium in this tutorial to render the dynamic web page. Selenium is a browser automation package available in JavaScript (Node.js), Python, Ruby, Java, Kotlin, and C#. However, this tutorial will show Python code to extract data using Selenium.

You then have two methods to get data from a dynamic website: extracting elements directly with Selenium or parsing the rendered page source with BeautifulSoup.

Here is a flowchart showing how to scrape a dynamic website with Python.

Basic flow of scraping dynamic web pages with Python

Extracting data with BeautifulSoup is more intuitive than doing so directly with Selenium; however, this tutorial shows both methods.

The first step is to import the necessary modules for scraping dynamic web pages with Python.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

Next, start a browser instance with Selenium; this will open up a Chrome browser.

browser = webdriver.Chrome()
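
By default, this opens a visible browser window. If you prefer to run without one, you can pass Chrome options to enable headless mode; here is an optional sketch (the exact flag may vary with your Chrome version):

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
browser = webdriver.Chrome(options=options)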

After that, go to the target site, “https://angular.dev/docs,” using the get() method.

browser.get("https://angular.dev/docs")

You can then find any element using the find_elements() method. Here, the code finds all the elements with the class “docs-card,” which are anchor tags.

Screenshot showing the “docs-card”

browser.implicitly_wait(10)
docs = browser.find_elements(By.CLASS_NAME, "docs-card")

The implicitly_wait() method in the above code tells Selenium to keep polling for up to the specified number of seconds (here, 10) when locating elements, giving the page time to render before find_elements() gives up.
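
If you need finer control than a blanket timeout, Selenium also supports explicit waits that block until a specific condition is met. Here is a minimal alternative sketch using WebDriverWait:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the docs-card elements to be present
docs = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "docs-card"))
)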

Each anchor tag with the class “docs-card” has several child elements. You can get these child elements by using the find_element() method on the parent element.

To get the href attribute, you can use the get_attribute() method.

file = []
for d in docs:
    # Store each card's heading text and link
    file.append([d.find_element(By.TAG_NAME, "h3").text, d.get_attribute("href")])

The above code

  • declares an empty list
  • extracts the h3 tag and the link from each element in docs
  • appends them to the list

Finally, you can save the file variable to a CSV file using Pandas.

df = pd.DataFrame(file, columns=["Docs", "Links"])
df.to_csv("angular.csv", index=False)

Screenshot showing the scraped data

Element Extraction With BeautifulSoup

You can also use BeautifulSoup for scraping dynamic web pages with Python. However, you must get the page source using the page_source attribute.

source = browser.page_source
soup = BeautifulSoup(source, "html.parser")
docs = soup.select("a.docs-card")
newFile = []
for d in docs:
    if "https" in d["href"]:
        newFile.append([d.h3.string, d["href"]])
    else:
        # Relative URL: prefix the base URL
        newFile.append([d.h3.string, "https://angular.dev/" + d["href"]])

The above code snippet

  • Gets the page source
  • Passes the source to BeautifulSoup()
  • Selects the anchor tag having the “docs-card” class
  • Declares an empty list, newFile
  • Appends newFile with the h3 heading and the corresponding link

Note: Some href values only have relative URLs. Therefore, you must check for “https”; if it is absent, add the base URL as a prefix.
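
Instead of checking for “https” manually, you could use urljoin from Python’s standard library, which resolves both absolute and relative URLs against a base; a minimal alternative sketch for the same loop:

from urllib.parse import urljoin

newFile = []
for d in docs:
    # urljoin leaves absolute hrefs unchanged and resolves
    # relative ones against the base URL
    newFile.append([d.h3.string, urljoin("https://angular.dev/", d["href"])])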

You can then write newFile to a CSV file as before.
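
That step mirrors the earlier Pandas code; the output file name below is just an example:

df = pd.DataFrame(newFile, columns=["Docs", "Links"])
df.to_csv("angular_bs4.csv", index=False)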

How ScrapeHero Can Help

ScrapeHero is an enterprise-grade web scraping service provider. We have an excellent web crawling infrastructure capable of scraping millions of pages daily. And we can handle scraping dynamic websites thanks to our massive browser farms.

How browser farms help dynamic web scraping

Dynamic web scraping requires full-fledged browsers. You need them to visit websites, render JavaScript, and extract data. However, performing the operation with a single browser constrains your productivity. Here, browser farms, an infrastructure capable of operating many browsers simultaneously, can be of immense help.

With browser farms, you can run several simultaneous instances of Selenium and scrape dynamic websites faster.
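
To give a rough idea of the approach, here is a minimal sketch that runs multiple Selenium instances in parallel with Python’s concurrent.futures; the URL list is hypothetical, and a real browser farm involves far more infrastructure than this:

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

def scrape(url):
    # Each worker runs its own browser instance
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()

urls = ["https://angular.dev/docs", "https://angular.dev/overview"]  # hypothetical list
with ThreadPoolExecutor(max_workers=2) as pool:
    pages = list(pool.map(scrape, urls))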

However, enterprise-grade browser farms demand enormous initial investment. You don’t need to worry about all that or learn how to scrape a dynamic website if you choose ScrapeHero. We will take care of everything. Our infrastructure can help you gather high-quality data for purposes such as product monitoring, brand monitoring, etc.

Wrapping Up

You can use Python for scraping dynamic web pages. Automated browsers like Selenium can help you do that. You can either use Selenium directly for extracting data or use BeautifulSoup to parse the source code obtained using Selenium.

For large-scale web scraping, you can use browser farms, but these require you to bear high initial costs. ScrapeHero services can help you with that. Our web scraping services include custom, large-scale web scraping and crawling solutions. You only need to pay according to your requirements.
