How to Scrape Amazon Product Data: Using Code and No Code Approaches

This article outlines a few methods for scraping Amazon product data. The scraped data can then be exported to Excel or other formats for easier access and use.

There are three methods for scraping Amazon product data:

  1. Scraping Amazon in Python or JavaScript
  2. Using the Amazon Product Details and Pricing Scraper by ScrapeHero Cloud, a no-code tool
  3. Using the Amazon Product Details and Pricing API by ScrapeHero Cloud

Building an Amazon Scraper in Python or JavaScript

In this section, we will guide you on how to scrape Amazon product data using either Python or JavaScript. We will utilize the browser automation framework called Playwright to emulate browser behavior in our code.

You could also use Python Requests, BeautifulSoup, or LXML to build an Amazon scraper without using a browser or a browser automation library. But bypassing the anti-scraping mechanisms put in place can be challenging and is beyond the scope of this article.
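To illustrate only the parsing half of a browser-free scraper, here is a minimal sketch that swaps in Python's built-in html.parser module for BeautifulSoup or LXML, so it runs without third-party packages. The SAMPLE_HTML snippet, the TitleParser class, and the extract_title helper are hypothetical. The sketch assumes the title sits in an h1 element with the id "title", and it does not fetch pages or deal with anti-scraping measures.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Hypothetical stand-in for HTML that has already been downloaded.
SAMPLE_HTML = """
<html><body>
  <h1 id="title"><span id="productTitle">  Imploding Kittens  </span></h1>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text inside the first <h1 id="title"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False  # True while the parser is inside the target <h1>
        self._done = False    # True once the first matching <h1> has closed
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done and dict(attrs).get("id") == "title":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace, as clean_data does in the Playwright script
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the product title found in `html`, or None if it is absent."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


print(extract_title(SAMPLE_HTML))
```

With BeautifulSoup or LXML the same lookup would be a single selector call; the standard-library version is used here only to keep the sketch dependency-free.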

Here are the steps for scraping Amazon product data using Playwright:

Step 1: Choose either Python or JavaScript as your programming language.

Step 2: Install Playwright for your preferred language.

For Python:

pip install playwright
# to download the necessary browsers
playwright install

For JavaScript:

npm install playwright@latest
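If you chose Python, a quick way to confirm that the package is importable before moving on is sketched below. The is_installed helper is a hypothetical name; it only checks that the Python package can be found, not that the browsers were downloaded.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True when `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


if is_installed("playwright"):
    print("Playwright is installed.")
else:
    print("Playwright is missing; run `pip install playwright` first.")
```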

Step 3: Write your code to emulate browser behavior and extract the desired data from Amazon using the Playwright API. You can use the following code, shown first in Python and then in JavaScript:

import asyncio
import json

from playwright.async_api import async_playwright

url = "https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/?th=1"


async def extract_data(page) -> None:
    """
    Parsing details from the product page and saving them to Data.json

    Args:
        page: Playwright page with the product page loaded
    """

    # Initializing selectors and xpaths
    # (the title uses a CSS selector; the rest are XPath expressions)
    title_xpath = "h1[id='title']"
    asin_selector = "//td/div[@id='averageCustomerReviews']"
    rating_xpath = "//div[@id='prodDetails']//i[contains(@class,'review-stars')]/span"
    ratings_count_xpath = "//div[@id='prodDetails']//span[@id='acrCustomerReviewText']"
    selling_price_xpath = "//input[@id='priceValue']"
    listing_price_xpath = "//div[@id='apex_desktop_qualifiedBuybox']//span[@class='a-price a-text-price']/span[@class='a-offscreen']"
    img_link_xpath = "//div[contains(@class,'imgTagWrapper')]//img"
    brand_xpath = (
        "//tr[contains(@class,'po-brand')]//span[@class='a-size-base po-break-word']"
    )
    status_xpath = "//div[@id='availabilityInsideBuyBox_feature_div']//div[@id='availability']/span"
    description_ul_xpath = (
        "//ul[@class='a-unordered-list a-vertical a-spacing-mini']/li"
    )
    product_description_xpath = "//div[@id='productDescription']//span"

    # Waiting for the page to finish loading
    await page.wait_for_selector(title_xpath)

    # Extracting the elements
    product_title = (
        await page.locator(title_xpath).inner_text()
        if await page.locator(title_xpath).count()
        else None
    )
    asin = (
        await page.locator(asin_selector).get_attribute("data-asin")
        if await page.locator(asin_selector).count()
        else None
    )
    rating = (
        await page.locator(rating_xpath).inner_text()
        if await page.locator(rating_xpath).count()
        else None
    )
    rating_count = (
        await page.locator(ratings_count_xpath).inner_text()
        if await page.locator(ratings_count_xpath).count()
        else None
    )
    selling_price = (
        await page.locator(selling_price_xpath).get_attribute("value")
        if await page.locator(selling_price_xpath).count()
        else None
    )
    listing_price = (
        await page.locator(listing_price_xpath).inner_text()
        if await page.locator(listing_price_xpath).count()
        else None
    )
    brand = (
        await page.locator(brand_xpath).inner_text()
        if await page.locator(brand_xpath).count()
        else None
    )
    product_description = (
        await page.locator(product_description_xpath).inner_text()
        if await page.locator(product_description_xpath).count()
        else None
    )
    image_link = (
        await page.locator(img_link_xpath).get_attribute("src")
        if await page.locator(img_link_xpath).count()
        else None
    )
    status = (
        await page.locator(status_xpath).inner_text()
        if await page.locator(status_xpath).count()
        else None
    )

    # full_description is found as list, so iterating the list elements to get the descriptions
    full_description_list = []
    desc_lists = page.locator(description_ul_xpath)
    desc_count = await desc_lists.count()
    for index in range(desc_count):
        li_element = desc_lists.nth(index=index)
        desc = (
            await li_element.locator("//span").inner_text()
            if await li_element.locator("//span").count()
            else None
        )
        # Skip items without text; str.join() raises TypeError on None
        if desc:
            full_description_list.append(desc)
    full_description = " | ".join(full_description_list)

    # cleaning data
    product_title = clean_data(product_title)
    asin = clean_data(asin)
    rating = clean_data(rating)
    rating_count = clean_data(rating_count)
    selling_price = clean_data(selling_price)
    listing_price = clean_data(listing_price)
    brand = clean_data(brand)
    image_link = clean_data(image_link)
    status = clean_data(status)
    product_description = clean_data(product_description)
    full_description = clean_data(full_description)

    data_to_save = {
        "product_title": product_title,
        "asin": asin,
        "rating": rating,
        "rating_count": rating_count,
        "selling_price": selling_price,
        "listing_price": listing_price,
        "brand": brand,
        "image_links": image_link,
        "status": status,
        "product_description": product_description,
        "full_description": full_description,
    }

    save_data(data_to_save, "Data.json")


async def run(playwright) -> None:
    # Initializing the browser and creating a new page.
    browser = await playwright.chromium.launch(headless=False)
    context = await browser.new_context()
    page = await context.new_page()

    await page.set_viewport_size({"width": 1920, "height": 1080})
    page.set_default_timeout(300000)

    # Navigating to the homepage
    await page.goto(url, wait_until="domcontentloaded")
    await extract_data(page)

    await context.close()
    await browser.close()


def clean_data(data: str) -> str:
    """
    Cleaning data by removing extra white spaces and non-ASCII characters

    Args:
        data (str): data to be cleaned

    Returns:
        str: cleaned string, or None if the input is empty
    """
    if not data:
        return None
    cleaned_data = " ".join(data.split()).strip()
    cleaned_data = cleaned_data.encode("ascii", "ignore").decode("ascii")
    return cleaned_data


def save_data(product_page_data: dict, filename: str):
    """Saving the product details as a JSON file

    Args:
        product_page_data (dict): details of the product
        filename (str): name of the JSON file
    """
    with open(filename, "w") as outfile:
        json.dump(product_page_data, outfile, indent=4)


async def main() -> None:
    async with async_playwright() as playwright:
        await run(playwright)


if __name__ == "__main__":
    asyncio.run(main())


// JavaScript version of the scraper
const { chromium } = require('playwright');
const fs = require('fs');

const url = "https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/?th=1";

/**
 * Save the product data as a JSON file
 * @param {object} data
 */
function saveData(data) {
    let dataStr = JSON.stringify(data, null, 2)
    fs.writeFile("data.json", dataStr, 'utf8', function (err) {
        if (err) {
            console.log("An error occurred while writing JSON Object to File.");
            return console.log(err);
        }
        console.log("JSON file has been saved.");
    });
}

function cleanData(data) {
    if (!data) {
        // return null so the key is still written to the JSON file
        return null;
    }
    // removing extra spaces and non-ASCII characters
    let cleanedData = data.split(/\s+/).join(" ").trim();
    cleanedData = cleanedData.replace(/[^\x00-\x7F]/g, "");
    return cleanedData;
}


// The data extraction function used to extract
// necessary data from the element.

async function extractData(data, type) {
    let count = await data.count();
    if (count) {
        if (type === 'innerText') {
            return await data.innerText();
        } else {
            return await data.getAttribute(type);
        }
    }
    return null;
}

async function parsePage(page) {
    // initializing xpaths
    let titleXPath = "h1[id='title']";
    let asinSelector = "//td/div[@id='averageCustomerReviews']";
    let ratingXPath = "//div[@id='prodDetails']//i[contains(@class,'review-stars')]/span";
    let ratingsCountXPath = "//div[@id='prodDetails']//span[@id='acrCustomerReviewText']";
    let sellingPriceXPath = "//input[@id='priceValue']";
    let listingPriceXPath = "//div[@id='apex_desktop_qualifiedBuybox']//span[@class='a-price a-text-price']/span[@class='a-offscreen']";
    let imgLinkXPath = "//div[contains(@class,'imgTagWrapper')]//img";
    let brandXPath = "//tr[contains(@class,'po-brand')]//span[@class='a-size-base po-break-word']";
    let statusXPath = "//div[@id='availabilityInsideBuyBox_feature_div']//div[@id='availability']/span";
    let descriptionULXPath = "//ul[@class='a-unordered-list a-vertical a-spacing-mini']/li";
    let productDescriptionXPath = "//div[@id='productDescription']//span";

    // wait until page loads
    await page.waitForSelector(titleXPath);

    // extract data using xpath
    let productTitle = page.locator(titleXPath);
    productTitle = await extractData(productTitle, 'innerText');

    let asin = page.locator(asinSelector);
    asin = await extractData(asin, 'data-asin');

    let rating = page.locator(ratingXPath);
    rating = await extractData(rating, 'innerText');

    let ratingCount = page.locator(ratingsCountXPath);
    ratingCount = await extractData(ratingCount, 'innerText');

    let sellingPrice = page.locator(sellingPriceXPath);
    sellingPrice = await extractData(sellingPrice, 'value');

    let listingPrice = page.locator(listingPriceXPath);
    listingPrice = await extractData(listingPrice, 'innerText');

    let brand = page.locator(brandXPath);
    brand = await extractData(brand, 'innerText');

    let productDescription = page.locator(productDescriptionXPath);
    productDescription = await extractData(productDescription, 'innerText');

    let imageLink = page.locator(imgLinkXPath);
    imageLink = await extractData(imageLink, 'src');

    let status = page.locator(statusXPath);
    status = await extractData(status, 'innerText');

    // since fullDescription is in <li> elements, iteration is needed
    let fullDescriptionList = [];
    let descLists = page.locator(descriptionULXPath);
    let descCount = await descLists.count();
    for (let index = 0; index < descCount; index++) {
        let liElement = descLists.nth(index);
        let desc = liElement.locator('//span');
        desc = await extractData(desc, 'innerText');
        fullDescriptionList.push(desc);
    }
    let fullDescription = fullDescriptionList.join(" | ") || null;

    // cleaning data
    productTitle = cleanData(productTitle);
    asin = cleanData(asin);
    rating = cleanData(rating);
    ratingCount = cleanData(ratingCount);
    sellingPrice = cleanData(sellingPrice);
    listingPrice = cleanData(listingPrice);
    brand = cleanData(brand);
    imageLink = cleanData(imageLink);
    status = cleanData(status);
    productDescription = cleanData(productDescription);
    fullDescription = cleanData(fullDescription);

    let dataToSave = {
        productTitle: productTitle,
        asin: asin,
        rating: rating,
        ratingCount: ratingCount,
        sellingPrice: sellingPrice,
        listingPrice: listingPrice,
        brand: brand,
        imageLinks: imageLink,
        status: status,
        productDescription: productDescription,
        fullDescription: fullDescription,
    };

    saveData(dataToSave);
}

/**
 * The main function initiates a browser object and handles the navigation.
 */
async function run() {
    // initializing browser and creating new page
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.setViewportSize({ width: 1920, height: 1080 });
    page.setDefaultTimeout(30000);

    // Navigating to the product page
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await parsePage(page);

    await context.close();
    await browser.close();
}

run();

This code shows how to scrape Amazon product data using the Playwright library in Python and JavaScript.

The Python script has two main functions (the JavaScript script mirrors them as run and parsePage):

  1. run function: This function takes a Playwright instance as input and drives the scraping process. It launches a Chromium browser instance, opens a new page, navigates to the Amazon product URL, and waits for the page's DOM content to load.
    The extract_data function is then called to extract the product details and store the data in a JSON file.
  2. extract_data function: This function takes a Playwright page object as input, extracts the product details, and saves them as a dictionary in a JSON file. The details include title, ASIN, brand, rating, selling price, listing price, etc.

Finally, the main function uses the async_playwright context manager to execute the run function. Running the script creates a JSON file (Data.json for the Python script, data.json for the JavaScript script) containing the scraped product data.
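The Python script repeats the same "read the value if the element exists, else None" expression for every field. One way to factor that out is sketched below. The helper names text_or_none and attribute_or_none are hypothetical; they rely only on the count(), first, inner_text(), and get_attribute() members of Playwright's Locator. Note that using first is a deliberate change from the script: it takes the first match instead of raising an error when a selector matches several elements.

```python
from typing import Any, Optional


async def text_or_none(locator: Any) -> Optional[str]:
    """Return the inner text of the first matching element, or None."""
    if await locator.count():
        return await locator.first.inner_text()
    return None


async def attribute_or_none(locator: Any, name: str) -> Optional[str]:
    """Return attribute `name` of the first matching element, or None."""
    if await locator.count():
        return await locator.first.get_attribute(name)
    return None


# Possible use inside extract_data (names taken from the script above):
#   product_title = await text_or_none(page.locator(title_xpath))
#   image_link = await attribute_or_none(page.locator(img_link_xpath), "src")
```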

Step 4: Run your code to scrape Amazon product data.
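Once the script has produced its JSON file, the data can be converted to a CSV file that Excel opens directly. The following is a minimal sketch using only the standard library; the json_to_csv function and the Data.csv file name are hypothetical, and the input is assumed to be either a single product dictionary (as the script above writes) or a list of such dictionaries.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path: str, csv_path: str) -> int:
    """Write the product record(s) in `json_path` to `csv_path`.

    Returns the number of rows written.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    rows = data if isinstance(data, list) else [data]

    # Use the union of all keys, in first-seen order, as the column headers
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))

    with open(csv_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells
    return len(rows)


# Example call, assuming Data.json exists in the working directory:
#   json_to_csv("Data.json", "Data.csv")
```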

Using No-Code Amazon Product Details and Pricing Scraper by ScrapeHero Cloud

The Amazon Product Details and Pricing Scraper by ScrapeHero Cloud is a convenient method for scraping product details from Amazon. It provides an easy, no-code method for scraping data, making it accessible for individuals with limited technical skills.

This section will guide you through the steps to set up and use the Amazon Product Details and Pricing scraper.

1. Sign up or log in to your ScrapeHero Cloud account.

2. Go to Amazon Product Details and Pricing Scraper by ScrapeHero Cloud in the marketplace.

Choosing ScrapeHero Amazon product details and pricing scraper from ScrapeHero Cloud Crawlers page

3. Add the scraper for scraping Amazon product data to your account. (Don’t forget to verify your email if you haven’t already.)

Adding ScrapeHero Amazon product details and pricing scraper from ScrapeHero Cloud to the user's account

4. You need to add the product URL or ASIN to start the scraper. If it’s just a single query, enter it in the field provided.

  1. You can get the product URL from the Amazon search results page.
    Finding the URL of the Amazon product page
  2. You can get the product’s ASIN from the product information section of a product listing page.
    Finding the ASIN from Amazon product details section

5. To scrape results for multiple queries, add multiple product URLs or ASINs to the SearchQuery field and save the settings.

Entering multiple ASINs of products to scrape

6. To start the scraper, click on the Gather Data button.

Clicking the Gather Data button to start the scraper.

7. The scraper will start fetching data for your queries, and you can track its progress under the Jobs tab.

8. Once the job finishes, you can view or download the data.

Downloading the results

9. You can also export the Amazon data to Excel from here. Click the Download Data button, select “Excel”, and open the downloaded file in Microsoft Excel.
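Steps 4 and 5 accept either product URLs or ASINs. If you collect a mixed list of both, a small script can reduce it to unique ASINs before you paste it into the SearchQuery field. This is a sketch under stated assumptions: an ASIN is taken to be ten uppercase letters or digits, product URLs are assumed to contain /dp/ or /gp/product/ followed by the ASIN, and the names to_asin and unique_asins are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Assumption: an ASIN is exactly ten uppercase letters or digits
ASIN_PATTERN = re.compile(r"^[A-Z0-9]{10}$")
# Assumption: product URLs carry the ASIN after /dp/ or /gp/product/
ASIN_IN_URL = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")


def to_asin(query: str) -> Optional[str]:
    """Return the ASIN for a product URL or bare ASIN, or None if not found."""
    query = query.strip()
    if ASIN_PATTERN.match(query):
        return query
    match = ASIN_IN_URL.search(query)
    return match.group(1) if match else None


def unique_asins(queries: Iterable[str]) -> List[str]:
    """Convert queries to ASINs, dropping duplicates and unrecognized entries."""
    asins = (to_asin(query) for query in queries)
    return list(dict.fromkeys(asin for asin in asins if asin))


url = "https://www.amazon.com/Imploding-Kittens-First-Expansion-Exploding/dp/B01HSIIFQ2/?th=1"
print(unique_asins([url, "B01HSIIFQ2"]))  # ['B01HSIIFQ2']
```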

Using Amazon Product Details and Pricing API by ScrapeHero Cloud

The ScrapeHero Cloud Amazon Product Details and Pricing API is an alternative tool for extracting product details from Amazon. This user-friendly API enables those with minimal technical expertise to obtain product data from Amazon.

This section will walk you through the steps to configure and utilize the Amazon Product Details and Pricing API provided by ScrapeHero Cloud.

  1. Sign up or log in to your ScrapeHero Cloud account.
  2. Go to the Amazon Product Details and Pricing API by ScrapeHero Cloud in the marketplace.
  3. Click on the subscribe button.
    Note: As this is a paid API, you must subscribe to one of the available plans to use the API.
  4. After subscribing to a plan, head over to the Documentation tab to get the necessary steps to integrate the API into your application.

Use Cases of Amazon Product Data

If you’re unsure as to why you should scrape Amazon product data, here are a few use cases where this data would be helpful:

  • Market Research

    By scraping Amazon product data, businesses can analyze market trends, understand consumer preferences, and monitor competitor activities.
  • Price Optimization

    Retailers and sellers can scrape Amazon prices and use the data to optimize their pricing strategies by analyzing the pricing patterns of similar products.
  • Product Development and Innovation

    Manufacturers and brands can scrape Amazon product data to identify gaps in the market, understand consumer pain points, and gather ideas for product improvements or new product features.
  • Reputation and Brand Management

    The data obtained by the Amazon data scraper can be used to monitor product reviews and ratings on Amazon. It also helps businesses manage their online reputation and respond to customer feedback effectively.
  • Inventory and Supply Chain Management

    With Amazon scraping, businesses can better forecast demand, optimize stock levels, and reduce inventory holding costs by analyzing sales velocity, seasonal trends, and consumer demand patterns on Amazon.
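As a small illustration of the price optimization use case, the sketch below computes the discount between the selling_price and listing_price fields that the Python script in this article saves. The helper names parse_price and discount_percent are hypothetical, and the code assumes US-style price strings such as "$1,299.00".

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Extract a number from a price string like '$1,299.00'; None if absent."""
    if not text:
        return None
    match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    if not match:
        return None
    return float(match.group(0).replace(",", ""))


def discount_percent(selling: Optional[str], listing: Optional[str]) -> Optional[float]:
    """Return the percentage discount of `selling` relative to `listing`."""
    selling_value = parse_price(selling)
    listing_value = parse_price(listing)
    # A missing or zero list price makes the discount undefined
    if selling_value is None or not listing_value:
        return None
    return round((listing_value - selling_value) / listing_value * 100, 2)


print(discount_percent("15.00", "$20.00"))  # 25.0
```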

Frequently Asked Questions

1. Can you scrape data from Amazon?

Yes. You can scrape Amazon product data by using a Python or JavaScript scraper. If you do not want to code, use the ScrapeHero Amazon Product Details and Pricing Scraper.

You can also choose the Amazon Product Details and Pricing API by ScrapeHero Cloud to integrate with any application to stream product data.

To scrape Amazon product information using BeautifulSoup, send GET requests to the product’s page using the Requests library, then parse the HTML response using BeautifulSoup to extract essential information like name, price, and description.

To scrape Amazon using Selenium (Python), set up Selenium WebDriver to automate a web browser and navigate to the Amazon product page. Then use locators such as By.XPATH to find and interact with the elements you need.

Amazon does not directly support or encourage web scraping. But it is not illegal to scrape publicly available data.

You can scrape Amazon product reviews using Python or JavaScript. ScrapeHero provides an Amazon Product Reviews and Ratings Scraper, which is a no-code tool for this purpose. You can also use the Amazon Reviews API by ScrapeHero Cloud for integrating with applications.

To learn about the pricing, visit the ScrapeHero pricing page.

The legality of web scraping depends on the jurisdiction, but it is generally considered legal if you are scraping publicly available data. Please refer to Legal Information to learn more about the legality of web scraping.
