Are you interested in fashion trends, price monitoring, or market research? If so, scrape product data from H&M’s website and analyze it.
Although H&M uses lazy loading to display results, meaning you need to scroll to load all items, you can still gather the data with plain HTTP requests. How? Simply extract it from the JSON string found within one of the page’s script tags.
This article will guide you through building an H&M Product data scraper using MechanicalSoup, a request-based web scraping library for Python.
Data Scraped From H&M
H&M stores the product data available on a search results page as a JSON string inside a script tag with the id __NEXT_DATA__.
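Before building the full scraper, you can confirm this with a quick check. The sketch below is illustrative, not part of the final code: it uses the requests and BeautifulSoup packages directly, sends only a minimal user-agent header, and assumes the tag id remains __NEXT_DATA__.
import json
import requests
from bs4 import BeautifulSoup

# Fetch the search results page with a minimal browser-like header.
response = requests.get(
    'https://www2.hm.com/en_us/search-results.html?q=suits',
    headers={'user-agent': 'Mozilla/5.0'}
)
soup = BeautifulSoup(response.text, 'lxml')

# Look for the script tag that holds the embedded JSON.
script_tag = soup.find('script', {'id': '__NEXT_DATA__'})
if script_tag:
    parsed = json.loads(script_tag.text)
    print('Found embedded JSON; top-level keys:', list(parsed.keys()))
else:
    print('No __NEXT_DATA__ tag found; the page structure may have changed.')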
First, the code retrieves the product URLs (along with prices) from the search results page. Then, it collects these four details from each product page:
- Name
- Price
- Description
- URL
All of this information is also available as a JSON string within a script tag on the product page.
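For orientation, here’s a simplified, illustrative view of how the search results payload nests the product list; the real payload contains many more keys, and the structure may change:
{
    "props": {
        "pageProps": {
            "srpProps": {
                "hits": [
                    {
                        "pdpUrl": "https://www2.hm.com/en_us/productpage.1091186001.html",
                        "regularPrice": "$ 19.99"
                    }
                ]
            }
        }
    }
}
The product page payload nests its details similarly, under props > pageProps > productPageProps > aemData > productArticleDetails, as you’ll see in the code below.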
Scrape Product Data from H&M: The Environment
This code for H&M data extraction requires just two packages:
1. MechanicalSoup: A web scraping library built on top of Python requests and BeautifulSoup, which you must install using pip.
pip install mechanicalsoup
2. json: A module for handling JSON, including saving a dict object to a JSON file, which is available in the Python standard library.
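This tutorial relies on two of its functions: json.loads(), which turns a JSON string into a Python dict, and json.dump(), which writes Python data to a file. For example:
import json

parsed = json.loads('{"Name": "High-waist Dress Pants", "Price": "$ 19.99"}')
print(parsed['Name'])  # High-waist Dress Pants

with open('sample.json', 'w', encoding='utf-8') as f:
    json.dump(parsed, f, indent=4)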
Scrape Product Data from H&M: The Code
Start your code to scrape H&M by importing the necessary packages. For MechanicalSoup, you only need to import the StatefulBrowser class.
from mechanicalsoup.stateful_browser import StatefulBrowser
import json
This code will search for a term on H&M’s website and extract details from the results, so you need a search term:
search_keyword = "suits"
Create an object of the StatefulBrowser class; this object will have methods for fetching and parsing. You can also specify which parser to use via the soup_config argument.
browser = StatefulBrowser(
    soup_config={'features': 'lxml'}
)
To reduce the chance of your scraper getting blocked, send browser-like headers. Store the headers in a dict and update the headers of your MechanicalSoup object using session.headers.update().
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
              '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
    'dpr': '1',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
browser.session.headers.update(headers)
Now, fetch H&M’s search results page using the open() method with the URL as an argument, which you can get by visiting H&M and searching for a product.
Use an f-string to replace the search term in the URL with the variable defined earlier.
browser.open(f'https://www2.hm.com/en_us/search-results.html?q={search_keyword}')
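As a side note, open() returns the underlying requests response, so if you want to verify that the request succeeded before parsing, you can capture and check it; a minimal sketch:
response = browser.open(f'https://www2.hm.com/en_us/search-results.html?q={search_keyword}')
if not response.ok:
    raise RuntimeError(f'Request failed with status code {response.status_code}')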
If the request is successful, MechanicalSoup will fetch and parse the HTML code of the search results page. You can access this parsed page through the page attribute.
soup = browser.page
This parsed page is a BeautifulSoup object, allowing you to use BeautifulSoup’s methods; this means you can find the script tag containing the product data as JSON using the find() method.
product_script = soup.find('script', {'id': '__NEXT_DATA__'}).text
You can now parse the JSON string using the json module, allowing you to navigate keys and values.
product_json = json.loads(product_script)
Navigate through this parsed JSON data to get URLs and prices of products listed on the search results page.
products = product_json['props']['pageProps']['srpProps']['hits']
product_details = [[product['pdpUrl'], product['regularPrice']] for product in products]
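At this point, product_details holds URL and price pairs, something like this (illustrative values):
[
    ['https://www2.hm.com/en_us/productpage.1091186001.html', '$ 19.99'],
    ...
]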
Define a list to store data extracted from H&M:
all_product_data = []
Now, you can loop through this list; the complete code at the end limits the loop to the first ten products with product_details[:10] to keep the example run short. In each iteration:
1. Make an HTTP request to each product’s URL
browser.open(product[0])
2. Get the parsed data through the page attribute.
souplet = browser.page
3. Extract and parse the JSON string.
script = souplet.find('script', {'id': '__NEXT_DATA__'}).text
json_data = json.loads(script)
4. Extract the required details and save them in a dict.
product_info = json_data['props']['pageProps']['productPageProps']['aemData']['productArticleDetails']
first_key = next(iter(product_info['variations']))
data = {
    'Name': product_info['productName'],
    'Price': product[1],
    'Description': product_info['variations'][first_key]['description'],
    'URL': product_info['productUrl']
}
5. Append this data to your previously defined list.
all_product_data.append(data)
Finally, save the extracted H&M product data to a JSON file.
with open('hNm.json', 'w', encoding='utf-8') as f:
    json.dump(all_product_data, f, indent=4, ensure_ascii=False)
The extracted data will look like this:
{
    "Name": "High-waist Dress Pants",
    "Price": "$ 19.99",
    "Description": "Relaxed-fit, dressy pants in jersey with a high waist. Elasticized waistband, diagonal side pockets, and wide legs with pleats at top and creases at front.",
    "URL": "https://www2.hm.com/en_us/productpage.1091186001.html"
}
And here’s the complete code to scrape H&M data.
from mechanicalsoup.stateful_browser import StatefulBrowser
import json

search_keyword = "suits"

# Create a browser object that parses pages with lxml.
browser = StatefulBrowser(
    soup_config={'features': 'lxml'}
)

# Browser-like headers to reduce the chance of getting blocked.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
              '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
    'dpr': '1',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
browser.session.headers.update(headers)

# Fetch and parse the search results page.
browser.open(f'https://www2.hm.com/en_us/search-results.html?q={search_keyword}')
soup = browser.page

# Extract the embedded JSON and collect each product's URL and price.
product_script = soup.find('script', {'id': '__NEXT_DATA__'}).text
product_json = json.loads(product_script)
products = product_json['props']['pageProps']['srpProps']['hits']
product_details = [[product['pdpUrl'], product['regularPrice']] for product in products]

all_product_data = []

# Visit each product page (limited to the first ten here) and extract the details.
for product in product_details[:10]:
    browser.open(product[0])
    souplet = browser.page
    script = souplet.find('script', {'id': '__NEXT_DATA__'}).text
    json_data = json.loads(script)
    product_info = json_data['props']['pageProps']['productPageProps']['aemData']['productArticleDetails']
    first_key = next(iter(product_info['variations']))
    data = {
        'Name': product_info['productName'],
        'Price': product[1],
        'Description': product_info['variations'][first_key]['description'],
        'URL': product_info['productUrl']
    }
    all_product_data.append(data)

# Save the extracted data to a JSON file.
with open('hNm.json', 'w', encoding='utf-8') as f:
    json.dump(all_product_data, f, indent=4, ensure_ascii=False)
Code Limitations
This code can scrape product data from H&M; however, you’ll need to
- Modify it if you want additional data points beyond those covered in this tutorial.
- Monitor H&M’s website for any changes in HTML structure; otherwise, the code may fail to retrieve data.
- Implement techniques to bypass anti-scraping measures, such as request delays and proxy rotation, for large-scale web scraping; otherwise, you may get blocked (see the sketch after this list).
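Here’s a minimal sketch of the kind of hardening the last two points call for: a random delay between product page requests and a guard against a missing __NEXT_DATA__ tag. The delay range is an illustrative choice rather than a tuned value, and proxy rotation would need additional infrastructure not shown here.
import random
import time

for product in product_details[:10]:
    # Pause between requests so the traffic pattern looks less automated.
    time.sleep(random.uniform(2, 5))  # illustrative delay range

    browser.open(product[0])
    script_tag = browser.page.find('script', {'id': '__NEXT_DATA__'})
    if script_tag is None:
        # The page structure changed or the request was blocked; skip this product.
        print(f'Skipping {product[0]}: no __NEXT_DATA__ tag found')
        continue
    json_data = json.loads(script_tag.text)
    # ...extract the details as shown earlier...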
Why Use a Web Scraping Service?
You can use the H&M data scraping code shown here to fetch data on a few hundred products, but for large-scale data extraction, it is better to use a web scraping service.
A web scraping service, like ScrapeHero, can take care of
- Bypassing anti-scraping measures
- Executing JavaScript
- Monitoring site changes
This allows you to focus on utilizing the data instead of managing technical challenges.
ScrapeHero is a fully managed web scraping service provider capable of building enterprise-grade web scrapers and crawlers. Our services also include custom robotic process automation and developing tailored AI models.