Scrapy is the most popular open-source web scraping framework. Written in Python, it has most of the modules you need to efficiently extract, process, and store data from websites in pretty much any structured data format. Scrapy is best suited for web crawlers that scrape data from multiple types of pages.
In this tutorial, we will show you how to scrape product data from Alibaba.com – the world’s leading marketplace.
Prerequisites
Install Python 3 and Pip
We will use Python 3 for this tutorial. To start, you need a computer with Python 3 and pip installed.
Follow the guides below to install Python 3 and pip:
- Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/
- Mac – http://docs.python-guide.org/en/latest/starting/install3/osx/
- Windows – https://www.scrapehero.com/how-to-install-python3-in-windows-10/
Install Packages
pip3 install scrapy selectorlib
You can find more details on installation here – https://doc.scrapy.org/en/latest/intro/
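To quickly verify that both packages installed correctly, you can run a one-line import check; this just prints Scrapy's version if both imports succeed:
python3 -c "import scrapy, selectorlib; print(scrapy.__version__)"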
Create a Scrapy Project
Let’s create a scrapy project using the following command.
scrapy startproject scrapy_alibaba
This command creates a Scrapy project with the project name (scrapy_alibaba) as the folder name. It will contain all the necessary files with the proper structure and basic docstrings for each file, with a structure similar to:
scrapy_alibaba/            # Project root directory
    scrapy.cfg             # Contains the configuration information to deploy the spider
    scrapy_alibaba/        # Project's Python module
        __init__.py
        items.py           # Describes the definition of each item that we're scraping
        middlewares.py     # Project middlewares
        pipelines.py       # Project pipelines file
        settings.py        # Project settings file
        spiders/           # All the spider code goes into this directory
            __init__.py
Create a Spider
Scrapy has a built-in command called genspider to generate a basic spider template.
scrapy genspider <spidername> <website>
Let’s generate our spider:
scrapy genspider alibaba_crawler alibaba.com
This will create a spiders/alibaba_crawler.py file for you with the initial template to crawl alibaba.com.
The code should look like this
# -*- coding: utf-8 -*-
import scrapy


class AlibabaCrawlerSpider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']

    def parse(self, response):
        pass
The class AlibabaCrawlerSpider inherits from the base class scrapy.Spider. The Spider class knows how to follow links and extract data from web pages, but it doesn’t know where to look or what data to extract. We will add this information later.
Variables and functions
- name is the name of the spider, which we gave in the genspider command above. We will use this name to start the spider from the command line.
- allowed_domains is a list of the domains that the spider is allowed to crawl.
- start_urls is the list of URLs that the spider will start crawling from when it is invoked.
- parse() is Scrapy’s default callback method, called for requests without an explicitly assigned callback. It gets invoked after each URL in start_urls is crawled. You can use this function to parse the response, extract the scraped data, and find new URLs to follow by creating new requests (Request) from them.
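To see how these pieces fit together, here is a minimal, self-contained spider. It is not part of the Alibaba project; the URL and field name are purely illustrative. It starts at one URL and extracts the page title in its parse() callback:

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"                      # run it with: scrapy crawl minimal
    allowed_domains = ["example.com"]     # requests outside these domains are filtered out
    start_urls = ["http://example.com/"]  # crawling starts here

    def parse(self, response):
        # Default callback: receives the downloaded response for each start URL
        yield {"title": response.css("title::text").get()}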
If you run the generated alibaba_crawler spider as-is (scrapy crawl alibaba_crawler), Scrapy logs comprehensive information about the crawl, and as you go through the logs you can see what’s happening in the spider:
- The spider is initialized with the bot name “scrapy_alibaba” and prints all the packages used in the project with their version numbers.
- Scrapy looks for spider modules in the /spiders directory and sets default values for variables such as CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS, SPIDER_MODULES, and DOWNLOAD_TIMEOUT.
- It loads all the components like middlewares, extensions, and pipelines that are needed to handle the requests.
- It uses the URLs provided in start_urls and retrieves the HTML content of each page. Since we didn’t specify any callbacks for the start_urls, the responses are received in the parse() function. We also did not write any lines to handle the responses received, so the spider finishes with stats like the pages scraped in the crawl, bandwidth used in bytes, the number of items scraped, status code counts, etc.
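If those defaults don’t suit your crawl, you can override them in settings.py. A small sketch; the values below are illustrative, not recommendations:

# scrapy_alibaba/settings.py (excerpt)
CONCURRENT_REQUESTS = 8              # total concurrent requests (Scrapy default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # per-domain limit (default: 8)
DOWNLOAD_TIMEOUT = 30                # seconds before a download times out (default: 180)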
Extract Data from Alibaba.com
For this tutorial we will only extract the following fields from any search result page of Alibaba.com:
- Product Name
- Price Range
- Product Image
- Link to Product
- Minimum Order
- Seller Name
- Seller Response Rate
- Number of years as a seller on Alibaba
You could go further and scrape the product and pricing details based on filters and orders. But for now, we’ll keep it simple and stick to these fields.
When you search for any keyword, say “earphones”, on Alibaba, you will see that the result page has a URL similar to https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=earphones&viewtype=G, where the parameter SearchText contains the keyword you searched for.
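That means a spider can build a search URL for any keyword by filling in SearchText. Keywords with spaces or special characters need URL-encoding first; a small sketch using the standard library:

from urllib.parse import quote_plus

def build_search_url(keyword):
    # "wireless earphones" becomes SearchText=wireless+earphones
    return ("https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en"
            "&CatId=&SearchText={}&viewtype=G".format(quote_plus(keyword)))

print(build_search_url("wireless earphones"))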
Creating a Selectorlib Template for Alibaba Search Results Page
We removed a lot of code from the previous version of this scraper by using Selectorlib. You can see those changes in the project repository – https://github.com/scrapehero/alibaba-scraper
You will notice in the full spider code (linked below) that we use a file called search_results.yml. This file is what makes this tutorial so easy to create and follow. The magic behind this file is a tool called Selectorlib.
Selectorlib is a tool that makes selecting, marking up, and extracting data from web pages visual and very easy. The Selectorlib Chrome Extension lets you mark the data that you need to extract, creates the CSS selectors or XPaths needed to extract that data, and then previews how the extracted data will look. You can learn more about Selectorlib and how to use it here
If you just need the data we have shown above, you do not need to use Selectorlib because we have done that for you already and generated a simple “template” that you can just use. However, if you want to add a new field, you can use Selectorlib to add that field to the template.
Here is how we marked up the fields in the code for all the data we need from Alibaba Search Results page using Selectorlib Chrome Extension
Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ and download the YAML file. Save the file as search_results.yml in the /resources folder.
Here is our template https://github.com/scrapehero/alibaba-scraper/blob/master/scrapy_alibaba/resources/search_results.yml
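Using the template from Python takes two calls: load the YAML file into an Extractor, then feed it the page HTML. A minimal sketch, assuming the template file sits in resources/ and that the returned fields match those defined in the YAML:

from selectorlib import Extractor

# Load the template exported from the Chrome extension
extractor = Extractor.from_yaml_file("resources/search_results.yml")

def extract_products(html, url):
    # Returns a dict keyed by the field names defined in the template;
    # base_url resolves relative product links into absolute URLs
    return extractor.extract(html, base_url=url)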
Reading search keywords from a file
Let’s modify the spider to read keywords from a file in a folder called /resources in the project directory, and fetch products for all of those keywords. Create the folder, and inside it a CSV file called keywords.csv. Here is how the file looks if we need to search separately for headphones and then earplugs:
keyword
headphones
earplugs
Let’s use Python’s standard CSV module to read the keywords file.
# At the top of the spider file, add: import csv, os
def parse(self, response):
    """Read keywords from the keywords file and start a search request for each."""
    keywords = csv.DictReader(
        open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")))
    for keyword in keywords:
        search_text = keyword["keyword"]
        url = ("https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en"
               "&CatId=&SearchText={0}&viewtype=G").format(search_text)
        yield scrapy.Request(url, callback=self.parse_listing,
                             meta={"search_text": search_text})
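The parse_listing callback that each request points to is where the Selectorlib template does its work. Here is a hedged sketch of what it might look like; the "products" key is an assumption about how the fields are grouped in your search_results.yml, so adjust it to match your template:

# Assumes "from selectorlib import Extractor" at the top of the spider file
def parse_listing(self, response):
    """Extract product fields from a search results page via the template."""
    extractor = Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), "../resources/search_results.yml"))
    data = extractor.extract(response.text, base_url=response.url)
    for product in data.get("products") or []:
        product["search_text"] = response.meta["search_text"]  # keep the query with each item
        yield product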
The Full Scrapy Spider Code
Please see the full code here – https://github.com/scrapehero/alibaba-scraper
The spider alibaba_crawler would look like this: https://github.com/scrapehero/alibaba-scraper/blob/master/scrapy_alibaba/spiders/alibaba_crawler.py
Let’s try to run our scraper:
scrapy crawl alibaba_crawler
Instead of data, you will see this in the logs:
DEBUG: Forbidden by robots.txt: <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=headphones&viewtype=G&page=1>
This is because Alibaba has disallowed the crawling of all URLs matching the pattern /trade. You can verify this by visiting the robots.txt file at https://www.alibaba.com/robots.txt
All spiders created with Scrapy 1.1+ respect robots.txt by default. You can disable this by setting ROBOTSTXT_OBEY = False in the project’s settings.py. Scrapy then skips the robots.txt check and starts crawling the URLs in the start_urls list.
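The setting lives in settings.py:

# scrapy_alibaba/settings.py
ROBOTSTXT_OBEY = False  # skip robots.txt checks; review the site's terms before disabling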
Export Product Data into JSON or CSV using Scrapy
Scrapy provides built-in CSV and JSON export formats:
scrapy crawl <spidername> -o output_filename.csv -t csv
scrapy crawl <spidername> -o output_filename.json -t json
To store the output as a CSV file:
scrapy crawl alibaba_crawler -o alibaba.csv -t csv
For a JSON file:
scrapy crawl alibaba_crawler -o alibaba.json -t json
This will create an output file in the directory you run the command from.
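On Scrapy 2.1 and newer, you can also declare exports once in settings.py with the FEEDS setting instead of passing flags on every run; the -t flag is likewise optional when the -o filename already has a .csv or .json extension. A sketch:

# scrapy_alibaba/settings.py (Scrapy 2.1+)
FEEDS = {
    "alibaba.csv": {"format": "csv"},
    "alibaba.json": {"format": "json"},
}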
Here is some sample data extracted from Alibaba.com as CSV
Known Limitations
This code should be capable of scraping the details of most Alibaba product listing pages, as long as the structure remains the same or similar. If you see errors related to lxml while scraping, it could be because:
- Alibaba.com’s anti-scraping measures have flagged the crawler as a bot.
- The structure of the website has changed, invalidating all of our selectors.
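One cheap way to notice both failure modes early is to check whether the template matched anything before yielding, and log loudly when it didn’t. A sketch, again assuming the hypothetical "products" key from the template:

def parse_listing(self, response):
    # extractor: a module-level selectorlib Extractor, loaded as shown earlier
    data = extractor.extract(response.text, base_url=response.url)
    products = data.get("products")
    if not products:
        # Either the page structure changed (stale selectors) or we were
        # served a block/captcha page instead of real search results
        self.logger.warning("No products extracted from %s", response.url)
        return
    for product in products:
        yield product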
Want to scrape and extract product data from thousands of pages yourself?
Read more:
Scalable do-it-yourself scraping – How to build and run scrapers on a large scale