Struggling to Scrape Shein Data? Start Using Selenium


Scraping data from Shein can be challenging due to its dynamic content and anti-scraping measures. However, a browser automation library, like Selenium, can help you scrape dynamic web pages. Here’s a step-by-step guide on how to scrape Shein data using Selenium.

Data Scraped from Shein

This tutorial scrapes Shein product data from its super-deals page.

  1. Product name
  2. Product price
  3. Product URL

You can use the browser’s inspect panel to determine which HTML elements on Shein’s home page contain the details:

  1. Right-click on the product data you want, like the price
  2. Click ‘Inspect’
Developer tools showing product name and product price on Shein’s super-deals page

 

Scrape Shein Data: The Environment

The code uses Selenium Python to fetch the HTML source code of Shein’s super-deals page. Selenium’s ability to interact with browsers makes it excellent for scraping e-commerce websites like Shein.

For parsing the source code, the code uses BeautifulSoup.

Both Selenium and BeautifulSoup are external libraries, so you need to install them, which you can do with Python pip. The code also uses the lxml parser with BeautifulSoup, so install that, too.

pip install bs4 selenium lxml

The code also uses three packages from the Python standard library:

  1. json to save the extracted data to a JSON file
  2. urllib.parse to make relative links absolute
  3. time to delay the script execution
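For instance, urljoin converts the relative links found in Shein’s href attributes into absolute URLs. Here’s a quick illustration with a made-up product path:

```python
from urllib.parse import urljoin

# A hypothetical relative product link, as it might appear in an href attribute
relative_url = "/some-product-p-12345.html"

# urljoin combines it with the site's base URL into an absolute link
absolute_url = urljoin("https://us.shein.com", relative_url)
print(absolute_url)  # https://us.shein.com/some-product-p-12345.html
```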

Scrape Shein Data: The Code

Begin your code to scrape Shein data by importing the packages mentioned above.

import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import urljoin
from time import sleep

The code only imports two Selenium modules: webdriver and By. The webdriver module interacts with the browser (navigating to a URL, setting browser options, etc.), and the By module lets you specify how to locate an HTML element (by XPath, class name, etc.).

This tutorial scrapes a specific number of Shein’s super-deals pages. Therefore, the scraper requires two inputs: the URL of the page and the number of pages to scrape. It’s better to store these in separate variables so you can change them without touching your scraper’s main code.

source = "https://us.shein.com/super-deals"
max_pages = 5

The Selenium-controlled browser is faster in headless mode. You can start the browser in headless mode by adding the argument “--headless=new” to the browser’s options and launching the browser with those options.

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
browser = webdriver.Chrome(options=options)

Note: ChromeOptions() applies when you run Selenium with the Chrome browser. Use the corresponding options class for other browsers, such as FirefoxOptions() for Firefox.

After launching the browser, use the get() method with the source variable as the argument to visit the super-deals page.

browser.get(source)

You can now extract the product details, but first, declare an empty array to store the details.

products = []

Use a loop to extract product details from multiple pages; the number of iterations depends on max_pages. A simple approach is a counter over the page numbers:

for i in range(1, max_pages + 1):

In the loop:

1. Pause the script execution for 5 seconds to allow all the products on the page to load.

sleep(5)

2. Get the HTML source code using Selenium’s page_source attribute.

response = browser.page_source

3. Pass the source code to BeautifulSoup for parsing.

soup = BeautifulSoup(response,'lxml')

4. Find all the section elements holding the product details.

product_list = soup.find('div',{'class':'thrifty-find-products'}).find_all('section')

5. Iterate through the sections, and for each one:

  1. Extract the name, URL, and price
  2. Append the details to the array defined before the loop

Then, after the section loop, navigate to the next page if the current page number is less than max_pages.
    for section in product_list:
        # extract details
        try:
            name = section['aria-label']
            url = section.a['href']
            price = section.find('span', {'class': 'product-item__camecase-price'}).text
        except (AttributeError, KeyError, TypeError):
            continue

        # append the details to the array
        products.append(
            {
                "Name": name,
                "Price": price,
                "URL": urljoin("https://us.shein.com", url)
            }
        )

    # navigate to the next page
    if i < max_pages:
        try:
            browser.find_element(
                By.XPATH,
                f"//span[@class='sui-pagination__inner sui-pagination__hover' and contains(text(),'{i + 1}')]"
            ).click()
        except Exception:
            print("No more pages")
            break
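Because the extraction logic depends only on BeautifulSoup, you can sanity-check it offline against a small, hand-written HTML snippet that mimics the structure described above (the product name, URL, and price here are made up for illustration):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# A minimal, made-up snippet mirroring the structure the scraper expects
html = """
<div class="thrifty-find-products">
  <section aria-label="Example Dress">
    <a href="/example-dress-p-1.html"></a>
    <span class="product-item__camecase-price">$9.99</span>
  </section>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
product_list = soup.find('div', {'class': 'thrifty-find-products'}).find_all('section')

products = []
for section in product_list:
    try:
        name = section['aria-label']
        url = section.a['href']
        price = section.find('span', {'class': 'product-item__camecase-price'}).text
    except (AttributeError, KeyError, TypeError):
        continue
    products.append({"Name": name, "Price": price,
                     "URL": urljoin("https://us.shein.com", url)})

print(products)
```

If the real page’s class names change, this kind of offline test is the quickest way to confirm whether the parsing logic or the page structure is at fault.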

Close the browser after the loop completes. 

browser.quit()

Finally, save the extracted Shein product data to a JSON file. 

with open("shein.json",'w') as f:
    json.dump(products,f,indent=4,ensure_ascii=False)
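You can confirm the file was written correctly by loading it back; json.load returns the same list of dictionaries (the sample record below is hypothetical):

```python
import json

# Hypothetical sample of scraped data
products = [{"Name": "Example Dress", "Price": "$9.99",
             "URL": "https://us.shein.com/example-dress-p-1.html"}]

# Write the data, then read it back to confirm the round trip
with open("shein.json", 'w') as f:
    json.dump(products, f, indent=4, ensure_ascii=False)

with open("shein.json") as f:
    loaded = json.load(f)

print(loaded == products)  # True
```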

Here is a flowchart showing the entire process.

Code logic for web scraping Shein data

 

Code Limitations

Shein has strong anti-scraping mechanisms, such as CAPTCHA challenges and IP rate limiting. To overcome these, you might need to rotate proxies and solve CAPTCHAs. This code doesn’t do that, which also makes it unsuitable for large-scale web scraping.
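As a rough sketch of proxy rotation, Chrome accepts a --proxy-server argument through its options; you could pick a proxy at random from a pool before each browser launch. The proxy addresses below are placeholders, not real endpoints:

```python
import random

# Placeholder proxy pool; replace with real proxy endpoints
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def proxy_argument(proxies):
    """Build a Chrome --proxy-server argument from a randomly chosen proxy."""
    return f"--proxy-server={random.choice(proxies)}"

arg = proxy_argument(PROXIES)
print(arg)

# You would then pass it to Selenium before launching the browser, e.g.:
# options = webdriver.ChromeOptions()
# options.add_argument(arg)
```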

Moreover, you must monitor Shein’s website for any changes in its HTML structure, because this code relies on that structure to extract the product details.

Why Not Use a Web Scraping Service?

You can use Selenium WebDriver and BeautifulSoup to scrape Shein data. This tutorial showed how to scrape the super-deals page; you can scrape other pages by altering the code similarly.

You also need to alter the code for large-scale web scraping or whenever Shein changes its HTML structure. However, you can avoid all that by choosing ScrapeHero’s Web Scraping Service.

ScrapeHero is a fully managed web scraping service provider capable of building large-scale web scraping and crawling solutions.

FAQ

1. Is scraping Shein legal?

Although scraping publicly available data is generally considered legal, scraping Shein or any website without permission may violate its terms of service. It’s important to consult a legal expert to ensure compliance. Check out this page on the legality of web scraping to learn more.

2. Why do I need proxies to scrape Shein?

Shein has rate limits and IP bans that prevent scraping. Rotating proxies allows you to send requests from different IPs, reducing the chance of getting blocked.

3. Can I scrape Shein without using Selenium?

Scraping Shein without browser automation is difficult due to the site’s use of JavaScript to load content dynamically. However, if you don’t want to use Selenium, you can use other browser automation libraries like Playwright or Puppeteer. 

4. How do I avoid getting blocked while scraping Shein?

Here are some ways to scrape Shein without getting blocked:

  1. Use rotating proxies
  2. Mimic human behavior (like random delays between requests)
  3. Respect the site’s robots.txt file
  4. Avoid sending too many requests
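One simple way to mimic human behavior is to randomize the pause between page loads instead of using a fixed sleep(5). For example:

```python
import random
import time

def polite_sleep(low=3.0, high=8.0):
    """Sleep for a random duration between low and high seconds."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Short bounds here purely for demonstration; use several seconds in practice
delay = polite_sleep(0.1, 0.2)
print(f"Slept for {delay:.2f} seconds")
```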
