Python’s requests library is enough for scraping static sites. But what if the website is dynamic? You then need to execute JavaScript while scraping, which is where Selenium web scraping shines.
This article shows you how to get started with web scraping using Selenium.
Selenium Web Scraping: The Environment
You can install Selenium with a single pip command.
pip install selenium
This tutorial solely uses Selenium’s data extraction methods. However, you can also get the dynamically generated HTML source code and then extract data using BeautifulSoup.
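For example, here is a minimal sketch of that hand-off, assuming the beautifulsoup4 package is installed:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://coursera.org")

# Pass the JavaScript-rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(browser.page_source, "html.parser")
print(soup.title.text)

browser.quit()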
Selenium Web Scraping: The Code
The code in this tutorial illustrates Selenium web scraping by extracting details of professional certifications from Coursera.org.
It will
- Search for ‘Data Analyst’
- Select the filters ‘Data Science’ and ‘Professional Certificates’
- Extract the details of the loaded certifications
The code starts by importing the required packages. Import three modules from the Selenium library:
- webdriver: For controlling the browser
- By: For specifying the locator method while finding elements
- Keys: For sending keyboard inputs
Other packages to import are:
- sleep: For pausing the script execution to ensure all the HTML elements are loaded.
- json: For writing extracted data to a JSON file.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import json
After importing the packages, the next step is to define functions to perform four operations:
- run_selenium(): Navigates Coursera.org to reach the desired page.
- extract_details(): Extracts the certification details.
- clean_details(): Cleans the details.
- save_details(): Saves the extracted details.
Let’s learn about them in detail.
run_selenium()
The function is responsible for controlling the browser and begins by launching the Selenium browser.
Launch a Chrome browser with the Chrome() method of the webdriver module. Although this tutorial uses a Chrome webdriver, you can use others, like Firefox, Chromium, etc.
browser = webdriver.Chrome()
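For instance, here is a sketch of launching Firefox instead, plus a headless Chrome session; the --headless=new flag assumes a reasonably recent Chrome version, and recent Selenium releases download the required drivers automatically:

from selenium import webdriver

# Firefox works the same way as Chrome
browser = webdriver.Firefox()

# You can also run Chrome without a visible window (headless mode)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
headless_browser = webdriver.Chrome(options=options)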
To visit a webpage, use the get() method. This method takes a URL as an argument.
browser.get("https://coursera.org")
Coursera’s website structure changes with the window size. Therefore, to keep a standard, maximize the window of the Selenium web browser.
browser.maximize_window()
Now, you can enter the search query into the input bar. First, find the input bar using Selenium’s find_element() method. This method can locate an element using several locators; here, locate the element by its tag name, ‘input.’
inputbar = browser.find_element(By.TAG_NAME,'input')
Other locators include XPath, class name, and ID.
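For illustration, here is how the same call looks with a few other locators; the selector values below are placeholders, not Coursera’s actual markup:

# By ID
element = browser.find_element(By.ID, "search-input")

# By class name
element = browser.find_element(By.CLASS_NAME, "search-box")

# By XPath
element = browser.find_element(By.XPATH, "//input[@placeholder='Search']")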
After locating the input bar, the code is now ready to interact with it. Use the send_keys() method to send keystrokes to the browser.
First, send a string ‘Data Analyst’ to the input bar.
inputbar.send_keys('Data Analyst')
Next, simulate the ‘ENTER’ key using the Keys class, which will take the Selenium browser to the page containing the details of available courses.
inputbar.send_keys(Keys.ENTER)
The current page will have details of all types of products, but you only need information about professional certifications. That means adding some filters, which requires more browser interactions.
However, Coursera.org has two kinds of search results pages, each with a different method for adding filters. Thankfully, you can differentiate between them by checking whether the page has an element with the class ‘filterContent.’
The code relies on Selenium’s behavior of raising an error when find_element() can’t locate an element.
Therefore, use a try-except block. In the try block, look for the element with the class filterContent and add filters using the first method.
If the try block fails, the except block uses the other method to add filters.
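In skeleton form, the branching looks like this; find_element() raises a NoSuchElementException when nothing matches (the complete script later in this article uses a bare except for the same purpose):

from selenium.common.exceptions import NoSuchElementException

try:
    # Raises NoSuchElementException if the element is missing
    browser.find_element(By.CLASS_NAME, 'filterContent')
    # ...add filters using the first method...
except NoSuchElementException:
    # ...add filters using the second method...
    pass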
Try Block
Use Selenium’s find_element() method to find the element with the class filterContent. If the page does not have this element, this code will raise an exception and start executing the except block.
browser.find_element(By.CLASS_NAME,'filterContent')
If the above code snippet did not raise an exception, the code will continue executing the try block and adding filters.
Search for all the filters and click on the ones you need. Here, you need to click on the ‘Professional Certificates’ and ‘Data Science’ filters.
filters = browser.find_elements(By.CLASS_NAME,"css-133xgeo")
To find the required filters, loop through all the extracted filters and use an if statement. Since there is no need to keep looping once both filters are clicked, define a variable loop that tracks how many filters you have clicked.
loop = 0
Then, start a loop (the combined snippet appears after these steps):
1. Check whether the substring ‘Professional Certificates’ or ‘Data Science’ exists in the filter text.
if "Professional Certificates" in filter.text or "Data Science" in filter.text:
2. If the above condition is true, locate the input tag inside the filter element and click it. Selenium has a click() method for this.
filter.find_element(By.TAG_NAME,'input').click()
3. Increment the loop variable
loop = loop+1
4. Break the loop when the loop variable’s value becomes 2.
if loop == 2:
    break
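Putting these steps together, the try block’s filter loop looks like this (the same code appears in the complete script later in this article):

filters = browser.find_elements(By.CLASS_NAME,"css-133xgeo")
loop = 0
for filter in filters:
    if "Professional Certificates" in filter.text or "Data Science" in filter.text:
        filter.find_element(By.TAG_NAME,'input').click()
        sleep(3)
        loop = loop+1
    if loop == 2:
        break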
Except Block
If the code cannot find an element with the class filterContent, it executes this block. The except block adds filters by clicking a button, opening the drop-down menu, and choosing the filters.
Here are the steps to add a filter on this page:
1. Extract all the button elements with the class ‘css-1vwfrco’ and loop through them to find and click the one with the text ‘All Filters.’
buttons = browser.find_elements(By.XPATH,"//button[@class='css-1vwfrco']")
for button in buttons:
    if 'All Filters' in button.text:
        button.click()
2. Extract all the filters
filters = browser.find_elements(By.CLASS_NAME,"css-1g4pf2i")
3. Start a sub-loop to iterate through the filters and click on the ones with the text ‘Data Science’ and ‘Professional Certificates.’ This step is similar to the code in the try block.
loop = 0
for filter in filters:
    if "Professional Certificates" in filter.text or "Data Science" in filter.text:
        filter.find_element(By.TAG_NAME,'input').click()
        sleep(3)
        loop = loop+1
    if loop == 2:
        break
4. Search and click on the button containing the text ‘Show results.’
browser.find_element(By.XPATH,"//button/span[contains(text(),'Show results')]").click()
You can also scroll using Selenium’s execute_script() method. Here, the code scrolls 10 times to load all the available certifications, using a loop over range(10):
for i in range(10):
    browser.execute_script("window.scrollBy(0, 500);")
    sleep(3)
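If the number of scrolls needed isn’t known in advance, an alternative sketch is to keep scrolling until the page height stops growing:

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content loaded, so stop scrolling
    last_height = new_height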
Finally, you can extract all the elements holding the certificate details using the find_elements() method; the function returns this list of elements.
certificates = browser.find_elements(By.CLASS_NAME,"css-16m4c33")
return certificates
extract_details()
The function accepts a list of HTML elements containing certificate details and loops through them, extracting the details and storing them as an array of dicts. You need to define this array before starting the loop.
proCerts = []
Now, loop through the list of elements holding certification details:
1. Get all the text from the current element and build an array using it. To do so, split the text at every new line.
details = certificate.text.split('\n')
2. Get the URL from the anchor tag inside the element. You can use Selenium’s get_attribute() method to extract the ‘href’ attribute and get the URL.
url = certificate.find_element(By.TAG_NAME,'a').get_attribute('href')
3. Create an empty dict. This dict will store the details of the current certification.
detailDict = {}
4. Call clean_details() mentioned above to clean the details. It accepts the created empty dict and the extracted text as arguments.
clean_details(detailDict,details)
5. Add the URL to the dict.
detailDict['URL']=url
6. Append the extracted dict to the empty array defined earlier.
proCerts.append(detailDict)
Finally, extract_details() will return the array.
clean_details()
The text extracted from the element holding the details of the professional certifications may contain unnecessary strings or characters. This function removes them and gives the extracted text a consistent structure.
One of the arguments of clean_details() is an array containing extracted text. Each item of this array corresponds to one line:
['IBM', 'IBM Data Analyst', "Skills you'll gain: Python Programming, Microsoft Excel, Data Visualization, Spreadsheet Software, Data Analysis, Plot (Graphics), Exploratory Data Analysis, Business Analysis, Communication, Statistical Visualization, Business Communication, Data Management, Data Structures, Databases, Human Resources, Planning, SQL, Big Data, Data Mining, Data Science, General Statistics, NoSQL, Cloud Computing, Computer Programming, Data Visualization Software, Interactive Data Visualization, Machine Learning, Machine Learning Algorithms, Probability & Statistics, Regression", 'Make progress toward a degree', '4.6', '4.6 stars', '(82K reviews)', 'Beginner · Professional Certificate · 3 - 6 Months']
The code loops through the lines, extracts and processes the required ones, and stores them in a dict whose keys describe the information in each line. For instance, the certificate name is stored under the key ‘Certificate.’
Here is the code that cleans the details:
fluff = ['Status','New','stars','Make progress','Months','Skills']
keys = ['Company','Certificate','Skills','Rating','Review Count','Other Details']
i = 0
for detail in details:
    if not any(s in detail for s in fluff):
        try:
            detailDict[keys[i]] = detail.replace('(','').replace(')','')
            i=i+1
        except:
            continue
    if 'Months' in detail:
        metas = ['Level','Type','Duration']
        meta = detail.split('·')
        meta_dict = {}
        n = 0
        for m in meta:
            meta_dict[metas[n]] = m
            n=n+1
        try:
            detailDict[keys[i]] = meta_dict
        except:
            break
        i=i+1
    if 'Skills' in detail:
        keyValue = detail.split(":")
        detailDict[keyValue[0]] = keyValue[1]
        i=i+1
save_details()
This function takes the extracted and cleaned details and saves them into a JSON file.
def save_details(details):
    with open('selenium.json','w',encoding='utf-8') as f:
        json.dump(details,f,indent=4,ensure_ascii = False)
Finally, you can call the functions in order to execute the script.
if __name__ == "__main__":
    certificates = run_selenium()
    certificate_details = extract_details(certificates)
    save_details(certificate_details)
The extracted details would look like this.
{
    "Company": "IBM",
    "Certificate": "IBM Data Analyst",
    "Skills you'll gain": " Python Programming, Microsoft Excel, Data Visualization, Spreadsheet Software, Data Analysis, Plot (Graphics), Exploratory Data Analysis, Business Analysis, Communication, Statistical Visualization, Business Communication, Data Management, Data Structures, Databases, Human Resources, Planning, SQL, Big Data, Data Mining, Data Science, General Statistics, NoSQL, Cloud Computing, Computer Programming, Data Visualization Software, Interactive Data Visualization, Machine Learning, Machine Learning Algorithms, Probability & Statistics, Regression",
    "Rating": "4.6",
    "Review Count": "82K reviews",
    "Other Details": {
        "Level": "Beginner ",
        "Type": " Professional Certificate ",
        "Duration": " 3 - 6 Months"
    },
    "URL": "https://www.coursera.org/professional-certificates/ibm-data-analyst"
}
Here is the complete code for Selenium web scraping.
#import packages
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import json

# function to get the HTML elements holding the certificate details
def run_selenium():
    browser = webdriver.Chrome()
    browser.get("https://coursera.org")
    browser.maximize_window()
    inputbar = browser.find_element(By.TAG_NAME,'input')
    inputbar.send_keys('Data Analyst')
    inputbar.send_keys(Keys.ENTER)
    try:
        browser.find_element(By.CLASS_NAME,'filterContent')
        filters = browser.find_elements(By.CLASS_NAME,"css-133xgeo")
        loop = 0
        for filter in filters:
            if "Professional Certificates" in filter.text or "Data Science" in filter.text:
                filter.find_element(By.TAG_NAME,'input').click()
                sleep(3)
                loop = loop+1
            if loop == 2:
                break
    except:
        buttons = browser.find_elements(By.XPATH,"//button[@class='css-1vwfrco']")
        for button in buttons:
            if 'All Filters' in button.text:
                button.click()
        filters = browser.find_elements(By.CLASS_NAME,"css-1g4pf2i")
        loop = 0
        for filter in filters:
            if "Professional Certificates" in filter.text or "Data Science" in filter.text:
                filter.find_element(By.TAG_NAME,'input').click()
                sleep(3)
                loop = loop+1
            if loop == 2:
                break
        browser.find_element(By.XPATH,"//button/span[contains(text(),'Show results')]").click()
        sleep(3)
    for i in range(10):
        browser.execute_script("window.scrollBy(0, 500);")
        sleep(3)
    certificates = browser.find_elements(By.CLASS_NAME,"css-16m4c33")
    return certificates

# function to clean the extracted details
def clean_details(detailDict,details):
    fluff = ['Status','New','stars','Make progress','Months','Skills']
    keys = ['Company','Certificate','Skills','Rating','Review Count','Other Details']
    i = 0
    for detail in details:
        if not any(s in detail for s in fluff):
            try:
                detailDict[keys[i]] = detail.replace('(','').replace(')','')
                i=i+1
            except:
                continue
        if 'Months' in detail:
            metas = ['Level','Type','Duration']
            meta = detail.split('·')
            meta_dict = {}
            n = 0
            for m in meta:
                meta_dict[metas[n]] = m
                n=n+1
            try:
                detailDict[keys[i]] = meta_dict
            except:
                break
            i=i+1
        if 'Skills' in detail:
            keyValue = detail.split(":")
            detailDict[keyValue[0]] = keyValue[1]
            i=i+1

# function to extract the certificate details from the extracted certificate elements
def extract_details(certificates):
    proCerts = []
    for certificate in certificates:
        details = certificate.text.split('\n')
        url = certificate.find_element(By.TAG_NAME,'a').get_attribute('href')
        detailDict = {}
        clean_details(detailDict,details)
        detailDict['URL']=url
        proCerts.append(detailDict)
    return proCerts

# function to save the extracted certificate details
def save_details(details):
    with open('selenium.json','w',encoding='utf-8') as f:
        json.dump(details,f,indent=4,ensure_ascii = False)

if __name__ == "__main__":
    certificates = run_selenium()
    certificate_details = extract_details(certificates)
    save_details(certificate_details)
Code Limitations
Although the code works, there are some limitations:
- You need to monitor Coursera.org constantly for any changes and reflect them in the code.
- The code is not suitable for large-scale scraping as it doesn’t bypass anti-scraping measures.
Best Practices for Web Scraping with Selenium
Web scraping requires a mindful approach, especially when using powerful tools like Selenium. Here are some best practices:
1. Be Mindful of Server Load
Sending too many requests in a short time can overwhelm a website’s servers. To avoid this, use proper delays between requests, or consider adding randomness to your delays to mimic human browsing behavior.
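For example, here is a minimal sketch using Python’s random module to space out page visits; urls is a hypothetical list of pages, and browser is the webdriver instance created earlier:

import random
from time import sleep

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs
for url in urls:
    browser.get(url)
    # Pause for a random 2-6 seconds to mimic human pacing
    sleep(random.uniform(2, 6))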
2. Use Proxies to Avoid IP Blocking
Many websites will block an IP address after multiple scraping attempts. Using a proxy server can help you distribute your requests, reducing the chances of being blocked. There are free proxy services available, but investing in a paid service often provides more reliability and security.
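One way to route Chrome’s traffic through a proxy is the --proxy-server argument; here is a sketch with a placeholder proxy address:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Replace with your proxy's actual host and port
options.add_argument("--proxy-server=http://203.0.113.10:8080")

browser = webdriver.Chrome(options=options)
browser.get("https://coursera.org")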
3. Handle Captchas
Websites increasingly use captchas to prevent scraping. While Selenium can click through simple checkbox challenges with its click() method, it can’t solve captchas on its own, so you can integrate third-party services that specialize in solving them.
4. Monitor and Maintain Your Scripts
Web scraping is a constant game of cat and mouse. Websites frequently update their structures, which can break your scraper. Monitor your scripts regularly to ensure they are working as expected.
Advantages and Disadvantages of Selenium for Scraping
Selenium has its strengths, but it also has limitations. Here’s a quick breakdown:
Advantages:
- Handles dynamic content and JavaScript with ease
- Can simulate user actions like clicks and form submissions
- Works across different browsers, offering flexibility
Disadvantages:
- Slower compared to other scraping tools like BeautifulSoup or Scrapy since it loads the entire page, including images and scripts
- Requires more setup and maintenance for complex tasks
- Needs advanced infrastructure for large-scale scraping
FAQs about Web Scraping with Selenium Python
Can you scrape any website with Selenium?
Technically, you can scrape most websites, but it may not be efficient for all of them. Websites that don’t use JavaScript to display content don’t require Selenium, and you can get away with using Python requests.
Is Selenium or BeautifulSoup better for web scraping?
Selenium is ideal for scraping a dynamic website or interacting with a website. BeautifulSoup, on the other hand, is faster and more efficient for static pages. Many developers use them together for maximum effectiveness.
How do you avoid getting blocked while scraping with Selenium?
You can avoid being blocked by using proxies, rotating user agents, and adding delays between requests. Scraping responsibly is critical to staying under the radar.
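For instance, you can set a different user agent per session through ChromeOptions; the strings below are abbreviated placeholders:

import random
from selenium import webdriver

# Placeholder user-agent strings; use a realistic, up-to-date list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={random.choice(user_agents)}")
browser = webdriver.Chrome(options=options)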
Do you need programming knowledge to use Selenium for web scraping?
Yes, you need basic programming knowledge, especially Python, to effectively use Selenium for web scraping. The good news is that Selenium’s syntax is simple and easy to learn.
Can Selenium handle websites that load content with AJAX?
Yes, Selenium can handle websites that use AJAX to load content. You just need to wait for the page to fully load before extracting data.
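Instead of fixed sleep() calls, Selenium’s explicit waits can pause until an element actually appears; here is a sketch using the certificate-card class from this tutorial, where browser is a webdriver instance:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the certificate cards to load
wait = WebDriverWait(browser, 10)
cards = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "css-16m4c33")))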
How a Web Scraping Service Can Help You
Selenium web scraping can extract data from websites with dynamic content. While it may not be the fastest scraping tool, it excels when interacting with the page is necessary, such as filling out forms or clicking buttons.
However, you need to consider the legalities and implement techniques to bypass anti-scraping measures, concerns you can hand off entirely by using a web scraping service like ScrapeHero.
ScrapeHero is a fully managed web scraping service capable of building large-scale scrapers and crawlers. We will take care of the best practices and legalities of web scraping. Moreover, our advanced browser farms can handle large-scale scraping using simultaneous Selenium instances.