Web scraping using Python without using large frameworks like Scrapy

ScrapeHero
Last Updated: December 9, 2024

Scrapy is a well-established framework for scraping, but it is also a very heavy framework. For smaller jobs it may be overkill, and for extremely large jobs it can be very slow. If you would like to roll up your sleeves and perform web scraping in Python without it, continue reading.

If you need publicly available data from the Internet, check whether it is already available from public data sources or APIs before you build a web scraper. Look at the site’s FAQ section, or search Google for its API endpoints and public data. Even when API endpoints are available, you still have to write a parser to fetch the data and structure it according to your needs.
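When a site does expose an API, the parser can be a few lines that pick out the fields you care about. Here is a minimal sketch, assuming a JSON payload shaped like the SAMPLE_RESPONSE string below. The payload, its field names and the parse_products function are invented for illustration and do not come from any real API.

```python
import json

# Hypothetical API payload; the structure and field names are assumptions.
SAMPLE_RESPONSE = """
{"results": [
  {"id": 1, "name": " Widget ", "price": "9.99", "internal_flag": true},
  {"id": 2, "name": "Gadget", "price": "24.50", "internal_flag": false}
]}
"""


def parse_products(raw_json):
    """Keep only the fields we need and normalise their types."""
    payload = json.loads(raw_json)
    products = []
    for item in payload.get("results", []):
        products.append({
            "id": item["id"],
            "name": item["name"].strip(),    # trim stray whitespace
            "price": float(item["price"]),   # the sample sends prices as strings
        })
    return products


if __name__ == "__main__":
    for product in parse_products(SAMPLE_RESPONSE):
        print(product)
```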

Here are some basic steps performed by most web spiders:

1. Start with a URL and use an HTTP GET or POST request to access it.
2. Fetch all of its contents and parse the data.
3. Store the data in a database or put it into a data warehouse.
4. Enqueue all the URLs found in the page.
5. Take the next URL from the queue and repeat from step 1.
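The steps above can be sketched as a small breadth-first loop. This is only an illustration under a few assumptions: the fetch and store arguments are hypothetical hooks supplied by the caller (not a library API), links are pulled out with a deliberately simple regex, and the example.test pages are an invented in-memory site so the sketch runs without network access.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplified link pattern; real-world HTML may need something more robust.
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl(start_url, fetch, store, max_pages=10):
    """Breadth-first crawl.

    fetch(url) -> HTML string and store(url, html) -> None are
    caller-supplied hooks (hypothetical names for this sketch).
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()                 # take the next URL from the queue
        html = fetch(url)                     # request the page and read it
        store(url, html)                      # persist the data
        visited.append(url)
        for href in LINK_RE.findall(html):    # enqueue every new link
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # A tiny in-memory "site" stands in for real HTTP responses.
    pages = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": '<a href="/b">B</a>',
        "http://example.test/b": '<a href="/">home</a>',
    }
    saved = {}
    print(crawl("http://example.test/", pages.get, saved.__setitem__))
```

The seen set keeps the spider from requesting the same URL twice, and max_pages is a safety cap so the loop always terminates.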

Here are the 3 major modules in every web crawler:

Request/Response handler.
Data parsing/data cleansing/data munging process.
Data serialization/data pipelines.

Let’s look at each of these modules and see what they do and how to use them.

Request/Response Handler

Request/response handlers are the components that make HTTP requests to a URL or a group of URLs, fetch the response objects as HTML content, and pass this data to the next module. In Python, the following libraries are most commonly used for this:

urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – a basic Python library that provides a high-level interface for fetching data across the World Wide Web.
urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – an extensible companion to urllib that handles basic HTTP requests, digest authentication, redirections, cookies and more. In Python 3 its functionality lives in urllib.request and urllib.error.
requests (Requests: HTTP for Humans) – a more advanced request library that is built on top of lower-level request handling libraries (urllib3).
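As an illustration, a request/response handler can be as small as one function. The sketch below uses Python 3's standard urllib.request rather than the Python 2 modules linked above. The USER_AGENT value and the fetch name are assumptions made for this example, and the demo uses a data: URL so that it runs without touching the network.

```python
import urllib.request

USER_AGENT = "example-spider/0.1"  # assumed name; identify your own bot here


def fetch(url, timeout=10):
    """Return the body of `url` decoded to text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared in the headers.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in an http(s) URL for real use.
    print(fetch("data:text/plain,hello%20world"))
```

With the third-party requests library installed, the body of fetch could be reduced to requests.get(url, headers=..., timeout=timeout).text, which is the main reason many people prefer it.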

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured data during this step. Usually a set of regular expressions (regexes) that perform pattern matching and text processing on the HTML is used for this.

In addition to regexes, basic string manipulation and search methods are also used for cleaning and transformation. You need a thorough knowledge of regular expressions to design the regex patterns.
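Here is a small sketch of that idea, combining a few regexes with plain string methods. The SAMPLE page, the class="price" markup and the function names are all invented for this example; the patterns would need adjusting for any real site.

```python
import html
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Assumes prices look like <span class="price">$1,299.00</span>.
PRICE_RE = re.compile(r'class="price"[^>]*>\s*\$?([\d,]+\.\d{2})')
TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical page used only to demonstrate the patterns.
SAMPLE = """<html><head><title>
  Tools &amp; Parts </title></head>
<body><span class="price">$1,299.00</span><span class="price"> $5.50</span></body></html>"""


def clean_text(fragment):
    """Strip tags, decode HTML entities and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


def parse_page(page_html):
    """Turn raw HTML into a structured record."""
    title_match = TITLE_RE.search(page_html)
    prices = [float(p.replace(",", "")) for p in PRICE_RE.findall(page_html)]
    return {
        "title": clean_text(title_match.group(1)) if title_match else None,
        "prices": prices,
    }


if __name__ == "__main__":
    print(parse_page(SAMPLE))
```

Regexes like these are tied to the exact markup of the page, so expect to revisit them whenever the site changes its layout.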

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module serializes it according to the data models you require. This is the final module, and it outputs data in a standard format that can be stored in databases, written to JSON/CSV files or passed to a data warehouse. In Python, these tasks are usually performed by the libraries listed below:

pickle (pickle – Python object serialization) – implements a fundamental but powerful algorithm for serializing and de-serializing a Python object structure.
json (JSON encoder and decoder)
csv (https://docs.python.org/2/library/csv.html)
Basic database interface libraries such as pymongo (Tutorial – PyMongo), mysqldb ( on python.org) and sqlite3 (sqlite3 – DB-API interface for SQLite databases)

Many more such libraries exist, depending on the output format and the database or data storage you use.
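To make this concrete, the sketch below writes the same records with three of the standard-library modules mentioned above: json, csv and sqlite3. The records, the items table and the function names are assumptions made for the example; an in-memory SQLite database is used so nothing is written to disk.

```python
import csv
import io
import json
import sqlite3

# Hypothetical cleaned records coming out of the parsing module.
records = [
    {"title": "Tools & Parts", "price": 1299.0},
    {"title": "Gadget", "price": 5.5},
]


def to_json(rows):
    """Serialize the rows as a JSON array."""
    return json.dumps(rows, indent=2)


def to_csv(rows, fieldnames):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows, connection):
    """Insert the rows into an `items` table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS items (title TEXT, price REAL)"
    )
    connection.executemany(
        "INSERT INTO items (title, price) VALUES (:title, :price)", rows
    )
    connection.commit()


if __name__ == "__main__":
    print(to_json(records))
    print(to_csv(records, ["title", "price"]))
    conn = sqlite3.connect(":memory:")
    to_sqlite(records, conn)
    print(conn.execute("SELECT title, price FROM items").fetchall())
```

Keeping each output format behind its own small function makes it easy to add another target, such as a different database, without touching the parsing code.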

Basic spider rules

The main rule to follow while building a spider is to be nice to the sites you are scraping and to follow the spider policies outlined in each site’s robots.txt.

Limit the number of requests per second and build enough delays into your spiders so that you don’t adversely affect the site.
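Both rules can be sketched with the standard library: urllib.robotparser reads robots.txt rules, and a small helper spaces out requests. The ROBOTS_TXT content, the example.test URLs, the example-spider user agent and the Throttle class are assumptions made for this illustration. A real spider would download the site's own file, for example with the parser's set_url() and read() methods.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content used only for this demo.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_text):
    """Build a robots.txt parser from text that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive requests."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._last = None

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        slept = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            slept = max(0.0, self.min_delay - elapsed)
            if slept:
                time.sleep(slept)
        self._last = time.monotonic()
        return slept


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    throttle = Throttle(rules.crawl_delay("example-spider") or 1)
    for url in ("http://example.test/public/page",
                "http://example.test/private/page"):
        if rules.can_fetch("example-spider", url):
            throttle.wait()
            print("allowed:", url)
        else:
            print("skipped:", url)
```

Calling throttle.wait() before every request means the delay is respected no matter how quickly the rest of the spider runs.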

It just makes sense to be nice.

To learn more about web scraping in Python, check out our web scraping tutorials page.
