How to scrape Alibaba.com product data using Scrapy

Scrapy is the most popular open-source web scraping framework. Written in Python, it provides most of the modules you need to efficiently extract, process, and store data from websites in almost any structured data format. Scrapy is best suited for web crawlers that scrape data from multiple types of pages.

In this tutorial, we will show you how to scrape product data from Alibaba.com – the world’s leading marketplace.

Prerequisites

Install Python 3 and Pip

We will use Python 3 for this tutorial. To start, you need a computer with Python 3 and pip installed.

Follow the guides below to install Python 3 and pip:

Install Packages

pip3 install scrapy selectorlib

You can find more details on installation here – https://doc.scrapy.org/en/latest/intro/

Create a Scrapy Project 

Let’s create a Scrapy project using the following command.

scrapy startproject scrapy_alibaba

This command creates a Scrapy project with the project name (scrapy_alibaba) as the folder name. It will contain all the necessary files with a proper structure and basic docstrings for each file, similar to:

scrapy_alibaba/          # Project root directory
    scrapy.cfg           # Contains the configuration information to deploy the spider
    scrapy_alibaba/      # Project's Python module
        __init__.py
        items.py         # Describes the definition of each item that we're scraping
        middlewares.py   # Project middlewares
        pipelines.py     # Project pipelines file
        settings.py      # Project settings file
        spiders/         # All the spider code goes into this directory
            __init__.py

Create a Spider 

Scrapy has a built-in command called genspider to generate the basic spider template.

scrapy genspider <spidername> <website>

Let’s generate our spider

scrapy genspider alibaba_crawler alibaba.com

and this will create a spiders/alibaba_crawler.py file for you with the initial template to crawl alibaba.com.

The code should look like this

# -*- coding: utf-8 -*-
import scrapy


class AlibabaCrawlerSpider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']

    def parse(self, response):
        pass

The class AlibabaCrawlerSpider inherits from the base class scrapy.Spider. The Spider class knows how to follow links and extract data from web pages, but it doesn’t know where to look or what data to extract. We will add this information later.

Variable and functions

  • name is the name of the spider, as given in the genspider command.
    We will use this name to start the spider from the command line.
  • allowed_domains is a list of the domains the spider is allowed to crawl.
  • start_urls is the list of URLs the spider will start crawling from when it is invoked.
  • parse() is Scrapy’s default callback method, called for requests without an explicitly assigned callback. parse() is invoked after each URL in start_urls is crawled. You can use this function to parse the response, extract the scraped data, and find new URLs to follow by creating new requests (Request) from them.

Scrapy logs comprehensive information about the crawl; as you go through the logs, you can understand what’s happening in the spider:

  • The spider is initialized with the bot name “scrapy_alibaba” and prints all the packages used in the project along with their version numbers.
  • Scrapy looks for spider modules in the /spiders directory and sets default values for variables such as CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS, SPIDER_MODULES, and DOWNLOAD_TIMEOUT.
  • It loads all the components, such as middlewares, extensions, and pipelines, needed to handle the requests.
  • It requests the URLs provided in start_urls and retrieves the HTML content of each page. Since we didn’t specify any callbacks for the start_urls, the responses are received by the parse() function. We also did not write any lines to handle the responses, so the spider finishes with stats such as the pages scraped in the crawl, bandwidth used in bytes, the number of items scraped, and status code counts.

Extract Data from Alibaba.com

For this tutorial we will only extract the following fields from any search result page of Alibaba.com:

  1. Product Name
  2. Price Range
  3. Product Image
  4. Link to Product
  5. Minimum Order
  6. Seller Name
  7. Seller Response Rate
  8. Number of years as a seller on Alibaba

You could go further and scrape the product and pricing details based on filters and orders. But for now, we’ll keep it simple and stick to these fields.

When you search for a keyword, say “earphones”, on Alibaba you will see that the result page has a URL similar to https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=earphones&viewtype=G, where the parameter SearchText holds the keyword you searched for.
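As a sanity check, that search URL can be rebuilt from its query parameters with Python’s standard library. This is only a sketch – build_search_url is a helper name of our own, not part of the project code; the parameter names are taken from the URL above:

```python
from urllib.parse import urlencode

def build_search_url(keyword):
    """Build an Alibaba search-results URL for the given keyword."""
    params = {
        "fsb": "y",
        "IndexArea": "product_en",
        "CatId": "",
        "SearchText": keyword,
        "viewtype": "G",
    }
    # urlencode percent-encodes spaces and special characters for us
    return "https://www.alibaba.com/trade/search?" + urlencode(params)

print(build_search_url("wireless earphones"))
```

Using urlencode instead of string formatting means keywords with spaces or special characters are encoded safely.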

Creating a Selectorlib Template for Alibaba Search Results Page

We removed a lot of code from the previous version of this scraper by using Selectorlib. You can see those changes here

You will notice in the code above that we used a file called selectors.yml. This file is what makes this tutorial so easy to create and follow. The magic behind this file is a tool called Selectorlib.

Selectorlib is a tool that makes selecting, marking up, and extracting data from web pages visual and very easy. The Selectorlib Chrome Extension lets you mark the data that you need to extract, creates the CSS selectors or XPaths needed to extract that data, and then previews how the extracted data will look. You can learn more about Selectorlib and how to use it here

If you just need the data we have shown above, you do not need to use Selectorlib because we have done that for you already and generated a simple “template” that you can just use. However, if you want to add a new field, you can use Selectorlib to add that field to the template.

Here is how we marked up the fields for all the data we need from the Alibaba search results page using the Selectorlib Chrome Extension

Selectorlib Template for Alibaba Search Result Page


Once you have created the template, click on ‘Highlight’ to highlight and preview all of your selectors. Finally, click on ‘Export’ and download the YAML file. Save the file as search_results.yml in the /resources folder.

Here is our template https://github.com/scrapehero/alibaba-scraper/blob/master/scrapy_alibaba/resources/search_results.yml
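For reference, a Selectorlib template is just a YAML file that maps field names to CSS selectors or XPaths. The fragment below only illustrates the format – the field names and selectors here are hypothetical, not the ones in the actual search_results.yml:

```yaml
products:
    css: 'div.organic-offer-wrapper'   # hypothetical selector for one result card
    multiple: true
    type: Text
    children:
        title:
            css: 'h2.title a'
            type: Text
        link:
            css: 'h2.title a'
            type: Link
        price:
            css: 'div.price'
            type: Text
```

Each child field is extracted relative to its parent selector, so every product card yields one record with those fields.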

Reading search keywords from a file

Let’s modify the spider to read keywords from a file in a folder called /resources in the project directory and fetch the products for all of those keywords. Let’s create the folder, and inside it a CSV file called keywords.csv. Here is how the file looks if we need to search separately for headphones and then earplugs.

keyword
headphones
earplugs

Let’s use Python’s standard csv module to read the keywords file. This code needs import csv and import os at the top of the spider file.

    def parse(self, response):
        """Read keywords from the keywords file and yield a search request for each"""
        keywords = csv.DictReader(
            open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv"))
        )
        for keyword in keywords:
            search_text = keyword["keyword"]
            url = "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={0}&viewtype=G".format(search_text)
            yield scrapy.Request(url, callback=self.parse_listing, meta={"search_text": search_text})
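The keyword-to-URL logic in parse() can be exercised on its own, outside Scrapy. This stdlib-only sketch (the function and constant names are ours) feeds the CSV content from a string instead of the keywords.csv file:

```python
import csv
import io

KEYWORDS_CSV = "keyword\nheadphones\nearplugs\n"

SEARCH_URL = ("https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en"
              "&CatId=&SearchText={0}&viewtype=G")

def search_urls(csv_text):
    """Yield one Alibaba search URL per keyword row in the CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield SEARCH_URL.format(row["keyword"])

urls = list(search_urls(KEYWORDS_CSV))
print(urls)
```

In the spider, each of these URLs becomes a scrapy.Request with parse_listing as its callback.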

The Full Scrapy Spider Code

Please see the full code here – https://github.com/scrapehero/alibaba-scraper

The spider alibaba_crawler would look like https://github.com/scrapehero/alibaba-scraper/blob/master/scrapy_alibaba/spiders/alibaba_crawler.py

Let’s try to run our scraper using

scrapy crawl alibaba_crawler
DEBUG: Forbidden by robots.txt: <GET https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=headphones&viewtype=G&page=1>

This is because Alibaba’s robots.txt disallows crawling all URLs matching the pattern /trade. You can verify this by visiting the robots.txt page at https://www.alibaba.com/robots.txt

All spiders created using Scrapy 1.1+ respect robots.txt by default. You can disable this by setting ROBOTSTXT_OBEY = False in the project’s settings. Scrapy will then skip the robots.txt check and start crawling the URLs specified in the start_urls list.
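In settings.py this is a one-line change. A minimal fragment of the settings file might look like this – only the ROBOTSTXT_OBEY line is the change; the other values shown are assumptions about what scrapy startproject generates for this project name:

```python
# settings.py (fragment)

BOT_NAME = "scrapy_alibaba"

SPIDER_MODULES = ["scrapy_alibaba.spiders"]
NEWSPIDER_MODULE = "scrapy_alibaba.spiders"

# Do not check robots.txt before crawling (the default is True in Scrapy 1.1+)
ROBOTSTXT_OBEY = False
```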

Export Product Data into JSON or CSV using Scrapy

Scrapy has built-in support for exporting scraped data in CSV and JSON formats.

scrapy crawl <spidername> -o output_filename.csv -t csv
scrapy crawl <spidername> -o output_filename.json -t json

To store the output as a CSV file:

scrapy crawl alibaba_crawler -o alibaba.csv -t csv

For a JSON file:

scrapy crawl alibaba_crawler -o alibaba.json -t json

This will create the output file in the directory from which you run the command.

Here is some sample data extracted from Alibaba.com as CSV

Known Limitations

This code should be capable of scraping the details of most Alibaba product listing pages, as long as the page structure remains the same or similar. If you see errors related to lxml while scraping, it could be because:

  • Alibaba.com’s anti-scraping measures have flagged the crawler as a bot.
  • The structure of the website has changed, invalidating all the selectors we have.

Want to scrape and extract product data from thousands of pages yourself?

Read more:

Scalable do-it-yourself scraping – How to build and run scrapers on a large scale

How to prevent getting blacklisted while scraping

If you need some professional help extracting eCommerce product data, let us know through the form below.


Posted in:   eCommerce Data Gathering Tutorials, Scrapy, Web Scraping Tutorials

Responses

arfanshabbirahmed July 13, 2019

can we scrape alibaba product keywords?


frank August 23, 2019

Could you explain how to scrape more than 1 page or multiple pages at a time from alibaba website.

