Beginner's Guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup

This article is the continuation of our Beginner's Guide to Web Scraping series.

Part 1 – Beginner’s guide to Web Scraping – Part 1 – The Basics

Let’s build a very basic web scraper using Python and BeautifulSoup and scrape the top links from Reddit.com.

We are going to use Python to build the scraper for the reasons explained in Part 1. Here is what our simple web scraper is going to do:

  1. Send a request to https://old.reddit.com/top/ and download the HTML Content of the page. This scraper does not need a web crawling component as we are only extracting data from a single link. In later posts of this series, we show you how to build more complex scrapers that need web crawlers. We will use Python’s built-in URL handling library – urllib to download the HTML of the web page.
  2. Parse the downloaded data using an HTML parser to extract some data (the scraper's parser module). For parsing the HTML, we will use BeautifulSoup 4, a library for pulling data out of HTML and XML files. It works with HTML parsers to provide idiomatic ways of navigating, searching, and modifying the parse tree. For this tutorial, we will use BeautifulSoup along with Python's own HTML parser library.
  3. Transform the data into a usable format – The data transformation and cleaning module of this scraper. We don’t need any special packages to transform data in this scraper.
  4. Print the extracted data to the terminal (or console) and also save it to a JSON file (our data serialisation and storage module). We will use Python's built-in json library to serialize the data into JSON and write it to a file.

Reddit (old.reddit.com) is used here because it is a popular website. If you are looking for Reddit data, use the Reddit API instead.

How to set up your computer for web scraper development

We will use Python 3 for this tutorial. The code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and pip installed.

Most UNIX-like operating systems, such as Linux and Mac OS, come with Python pre-installed. However, not all Linux distributions ship with Python 3 by default.

Let's check your Python version. Open a terminal (on Linux and Mac OS) or Command Prompt (on Windows) and type

python --version

and press Enter. If the output looks something like Python 3.x.x, you have Python 3 installed. If it says Python 2.x.x, you have Python 2. If it prints an error, you probably don't have Python installed.
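You can also check the version from inside Python itself. Here is a quick sketch using the standard library:

```python
import sys

print(sys.version)             # the full version string of the running interpreter
print(sys.version_info.major)  # 3 on Python 3, 2 on Python 2
```

This is handy when you are not sure which interpreter the `python` command on your PATH actually points to.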

If you don’t have Python 3, install it first.

Install Python 3 and Pip

Here is a guide to install Python 3 on Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/

Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Install Packages

Two of our requirements – the json and urllib libraries – come bundled with Python 3. We just have to install BeautifulSoup. On Linux and Mac OS, open a terminal and run

sudo pip install beautifulsoup4

On Windows, open Command Prompt or PowerShell and run

pip install beautifulsoup4

You now have all the libraries required to build and run the scraper.

What data are we extracting?


We will extract the following data fields from this page – https://old.reddit.com/top/ :

  1. Title of the link
  2. The URL where it points to

Finding the Data

Before we start building the scraper, we need to find where the data is present in the web page’s HTML Tags. You need to understand the HTML tags inside the page’s content to do so.

We assume you already understand HTML and know how to code in Python. You don't need advanced programming skills for most of this tutorial.

If you don’t know much about HTML and Python, spend some time reading Getting started with HTML – Mozilla Developer Network and https://www.programiz.com/python-programming

Let's inspect the HTML of the web page and find out where the data is. Here is our logic:

  1. Find the tag that encloses the list of links
  2. Get links from it and extract data

Inspecting the HTML

Open a browser (we are using Chrome here) and go to https://old.reddit.com/top/

Right-click on any link on the page and choose – Inspect Element. The browser will open a toolbar and show the HTML Content of the Web Page, formatted nicely.

If you look closely at the inspector panel, you will see a DIV tag with an 'id' attribute of 'siteTable'. This DIV encloses the data we need to extract.

Now let's find the HTML tag(s) which hold the links we need to extract. You can right-click on a link title in the browser and choose Inspect Element again. It will open the HTML content like before and highlight the tag which holds the data you right-clicked on.

For each link displayed on Reddit, the data is present in an <a> tag. If you inspect the other links, you will see that they sit in similar <a> tags that follow the same template. BeautifulSoup has a function that finds all tags with certain attributes. In this case, we are looking for all <a> tags inside the div with the id 'siteTable' whose class attribute contains 'title'.
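To see this pattern in isolation, here is a minimal offline sketch. The HTML fragment below is made up for illustration – real Reddit markup has many more tags and attributes – but the find/find_all calls are the same ones we will use on the live page:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the structure described above
html = """
<div id="siteTable">
  <a class="title may-blank" href="/r/pics/comments/abc123/example/">An example link</a>
  <a class="thumbnail" href="/r/pics/comments/abc123/example/"></a>
</div>
<div id="footer"><a class="title" href="/ignored/">Outside siteTable</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Restrict the search to the div with id "siteTable"
main_table = soup.find("div", attrs={"id": "siteTable"})
# class_="title" matches any <a> whose class attribute contains "title"
links = main_table.find_all("a", class_="title")
print([(a.text, a["href"]) for a in links])
# [('An example link', '/r/pics/comments/abc123/example/')]
```

Note that the link outside siteTable and the thumbnail link are both skipped, which is exactly the filtering we want.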

Now that we know where the data is present in the HTML, let’s get into extracting it.

The Code

Open up your favorite text editor or a Jupyter Notebook, and get ready to start coding. Create an empty file called reddit_scraper.py and save it. After each block of code below, you can save the file and run the script using

python reddit_scraper.py

If your computer has multiple versions of Python, you can run python3 reddit_scraper.py to make sure the script runs under Python 3 and not Python 2.

Let's first import the libraries we need – urllib, BeautifulSoup, and json.

import urllib.request
from bs4 import BeautifulSoup
import json

Now, the code to send a request and download the HTML content of the URL:

url = "https://old.reddit.com/top/"
#download the URL and extract the content to the variable html 
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()

The HTML of the URL is now in the variable html. Let's pass it to BeautifulSoup, which will construct a parse tree for us. BeautifulSoup can work with different parsers, such as lxml and html.parser. For now, we will use Python's built-in parser – 'html.parser'.

First, we will isolate the <div> with the id siteTable into a variable – main_table – as we are not interested in extracting the other links on Reddit. Next, we will find all the <a> tags inside it that contain the CSS class 'title'.

#pass the HTML to Beautifulsoup.
soup = BeautifulSoup(html,'html.parser')
#get the HTML of the table called site Table where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
#Now we go into main_table and get every a element in it which has a class "title" 
links = main_table.find_all("a",class_="title")

We’ll go through each link we found and extract the text and the URL from them.

#from each link extract the text of link and the link itself
#List to store a dict of the data we extracted 
extracted_records = []
for link in links: 
    title = link.text
    url = link['href']
    record = {
        'title':title,
        'url':url
        }
    extracted_records.append(record)
print(extracted_records)

Here is how the extracted records look now.

[{'title': 'Dad prevents crash.', 'url': 'https://i.imgur.com/UDLTfSl.gifv'},
 {'title': 'When your cat crashes his bicycle in his dream...',
  'url': '/r/funny/comments/7sf1ra/when_your_cat_crashes_his_bicycle_in_his_dream/'},
 {'title': 'My friend playing Mario Odyssey during class',
  'url': '/r/gaming/comments/7sfd5v/my_friend_playing_mario_odyssey_during_class/'},
 {'title': 'So it snowed on the cabbage field...',
  'url': '/r/pics/comments/7sfhqk/so_it_snowed_on_the_cabbage_field/'},
 {'title': 'Pinterest is like a virus that infected the google image search.',
  'url': '/r/Showerthoughts/comments/7sd54f/pinterest_is_like_a_virus_that_infected_the/'},
 {'title': 'Synced videos of the Eagles fan running into the pillar',
  'url': '/r/sports/comments/7sgrbd/synced_videos_of_the_eagles_fan_running_into_the/'},
 {'title': 'My wife just shot this pic of a sleepy albino squirrel',
  'url': 'https://i.imgur.com/HPJ7eq7.jpg'},
 {'title': 'An old abandoned road slowly healing over and being reclaimed by nature.',
  'url': '/r/pics/comments/7sdwnt/an_old_abandoned_road_slowly_healing_over_and/'},
 {'title': 'I have a condition called dermatographism, where I can ‘write’ on my skin and it appears as a rash',
  'url': '/r/mildlyinteresting/comments/7scrlj/i_have_a_condition_called_dermatographism_where_i/'},
 {'title': 'Trying to pacifist run SuperHot is insane.',
  'url': 'https://gfycat.com/OrdinaryHalfArachnid'},
 {'title': 'His name is Buddha!',
  'url': '/r/aww/comments/7sdp8r/his_name_is_buddha/'},
 {'title': 'TIL almost 1 in 4 people with tattoos regret it, meaning about 7.5 million Americans',
  'url': 'http://www.medicaldaily.com/tattoos-affect-your-health-long-term-side-effects-ink-has-your-immune-system-404404'},
 {'title': 'Orangutan saves friend from drowning.',
  'url': 'https://i.imgur.com/QSYWdRh.gifv'},
 {'title': 'Shocking prison secrets that no-one tells you.',
  'url': '/r/funny/comments/7scz5e/shocking_prison_secrets_that_noone_tells_you/'},
 {'title': 'Dad reflexes prevent crash.',
  'url': 'https://i.imgur.com/UDLTfSl.gifv'},
 {'title': 'Heatmap of numbers found at the end of Reddit usernames [OC]',
  'url': '/r/dataisbeautiful/comments/7sewjx/heatmap_of_numbers_found_at_the_end_of_reddit/'},
 {'title': 'IBM Ball Head typewriter',
  'url': 'https://i.imgur.com/zCg1LX1.gifv'},
 {'title': 'M 8.0 earthquake in Alaska',
  'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/at00p3054t#executive'},
 {'title': 'These traffic lights in the Ukraine.',
  'url': 'http://i.imgur.com/OINyphR.jpg'},
 {'title': 'In 1974, 22-year-old Daniel Sorine trained his camera on two mime artists performing in New York’s Central Park. In 2013, Daniel was looking through his negatives and photographs when he realised one of the mimes was Oscar winning actor Robin Williams',
  'url': '/r/OldSchoolCool/comments/7sde78/in_1974_22yearold_daniel_sorine_trained_his/'},
 {'title': 'A tower of giraffes out for a run',
  'url': 'https://i.imgur.com/IVEl1WI.gifv'},
 {'title': 'Diver suspended in current.',
  'url': 'https://i.imgur.com/uPUoYjy.gifv'},
 {'title': '77% drop', 'url': '/r/funny/comments/7sgljt/77_drop/'},
 {'title': 'The wife and I went to the Grand Canyon this weekend. Top was Saturday, Bottom was Sunday.',
  'url': '/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/'},
 {'title': "LPT: College isn't the only way to start a good career. Apprenticeships, Trade Schools, and Military Training can be great alternatives in today's world.",
  'url': '/r/LifeProTips/comments/7sgpyf/lpt_college_isnt_the_only_way_to_start_a_good/'}]

If you look at some of the URLs above, they start with /r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/. These are relative URLs, which are invalid if you paste them into a browser. We have to do some "cleaning" here: prepend https://reddit.com to each relative URL. But there are also absolute URLs among them, like https://i.imgur.com/HPJ7eq7.jpg, so we only need to prepend when the link is relative.

Let's go through each URL and check whether it is absolute. For now, we'll just check if it starts with http. There are better ways to check whether a URL is absolute in Python, but for the sake of simplicity we'll stick to the string method .startswith.
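One of those better ways is urllib.parse.urljoin from the standard library: it resolves a relative path against a base URL and leaves absolute URLs untouched, so you don't need the startswith check at all. A quick sketch:

```python
from urllib.parse import urljoin

base = "https://old.reddit.com/top/"

# A relative URL is resolved against the base
print(urljoin(base, "/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/"))
# https://old.reddit.com/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/

# An absolute URL passes through unchanged
print(urljoin(base, "https://i.imgur.com/HPJ7eq7.jpg"))
# https://i.imgur.com/HPJ7eq7.jpg
```

We stick with .startswith in the tutorial code to keep the logic explicit, but urljoin is the more robust choice for real scrapers.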

Let’s modify the code block above to add the cleaning process.

#from each link extract the text of link and the link itself
#List to store a dict of the data we extracted 
extracted_records = []
for link in links: 
    title = link.text
    url = link['href']
    #There are better ways to check if a URL is absolute in Python. For the sake of simplicity we'll just stick to the .startswith method of a string
    # https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python 
    if not url.startswith('http'):
        url = "https://reddit.com"+url 
    # You can join urls better using urlparse library of python. 
    # https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin 
    record = {
        'title':title,
        'url':url
        }
    extracted_records.append(record)
print(extracted_records)

The records should now look like this:

[{'title': 'Dad prevents crash.', 'url': 'https://i.imgur.com/UDLTfSl.gifv'},
 {'title': 'When your cat crashes his bicycle in his dream...',
  'url': 'https://reddit.com/r/funny/comments/7sf1ra/when_your_cat_crashes_his_bicycle_in_his_dream/'},
 {'title': 'My friend playing Mario Odyssey during class',
  'url': 'https://reddit.com/r/gaming/comments/7sfd5v/my_friend_playing_mario_odyssey_during_class/'},
 {'title': 'So it snowed on the cabbage field...',
  'url': 'https://reddit.com/r/pics/comments/7sfhqk/so_it_snowed_on_the_cabbage_field/'},
 {'title': 'Pinterest is like a virus that infected the google image search.',
  'url': 'https://reddit.com/r/Showerthoughts/comments/7sd54f/pinterest_is_like_a_virus_that_infected_the/'},
 {'title': 'Synced videos of the Eagles fan running into the pillar',
  'url': 'https://reddit.com/r/sports/comments/7sgrbd/synced_videos_of_the_eagles_fan_running_into_the/'},
 {'title': 'My wife just shot this pic of a sleepy albino squirrel',
  'url': 'https://i.imgur.com/HPJ7eq7.jpg'},
 {'title': 'An old abandoned road slowly healing over and being reclaimed by nature.',
  'url': 'https://reddit.com/r/pics/comments/7sdwnt/an_old_abandoned_road_slowly_healing_over_and/'},
 {'title': 'I have a condition called dermatographism, where I can ‘write’ on my skin and it appears as a rash',
  'url': 'https://reddit.com/r/mildlyinteresting/comments/7scrlj/i_have_a_condition_called_dermatographism_where_i/'},
 {'title': 'Trying to pacifist run SuperHot is insane.',
  'url': 'https://gfycat.com/OrdinaryHalfArachnid'},
 {'title': 'His name is Buddha!',
  'url': 'https://reddit.com/r/aww/comments/7sdp8r/his_name_is_buddha/'},
 {'title': 'TIL almost 1 in 4 people with tattoos regret it, meaning about 7.5 million Americans',
  'url': 'http://www.medicaldaily.com/tattoos-affect-your-health-long-term-side-effects-ink-has-your-immune-system-404404'},
 {'title': 'Orangutan saves friend from drowning.',
  'url': 'https://i.imgur.com/QSYWdRh.gifv'},
 {'title': 'Shocking prison secrets that no-one tells you.',
  'url': 'https://reddit.com/r/funny/comments/7scz5e/shocking_prison_secrets_that_noone_tells_you/'},
 {'title': 'Dad reflexes prevent crash.',
  'url': 'https://i.imgur.com/UDLTfSl.gifv'},
 {'title': 'Heatmap of numbers found at the end of Reddit usernames [OC]',
  'url': 'https://reddit.com/r/dataisbeautiful/comments/7sewjx/heatmap_of_numbers_found_at_the_end_of_reddit/'},
 {'title': 'IBM Ball Head typewriter',
  'url': 'https://i.imgur.com/zCg1LX1.gifv'},
 {'title': 'M 8.0 earthquake in Alaska',
  'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/at00p3054t#executive'},
 {'title': 'These traffic lights in the Ukraine.',
  'url': 'http://i.imgur.com/OINyphR.jpg'},
 {'title': 'In 1974, 22-year-old Daniel Sorine trained his camera on two mime artists performing in New York’s Central Park. In 2013, Daniel was looking through his negatives and photographs when he realised one of the mimes was Oscar winning actor Robin Williams',
  'url': 'https://reddit.com/r/OldSchoolCool/comments/7sde78/in_1974_22yearold_daniel_sorine_trained_his/'},
 {'title': 'A tower of giraffes out for a run',
  'url': 'https://i.imgur.com/IVEl1WI.gifv'},
 {'title': 'Diver suspended in current.',
  'url': 'https://i.imgur.com/uPUoYjy.gifv'},
 {'title': '77% drop',
  'url': 'https://reddit.com/r/funny/comments/7sgljt/77_drop/'},
 {'title': 'The wife and I went to the Grand Canyon this weekend. Top was Saturday, Bottom was Sunday.',
  'url': 'https://reddit.com/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/'},
 {'title': "LPT: College isn't the only way to start a good career. Apprenticeships, Trade Schools, and Military Training can be great alternatives in today's world.",
  'url': 'https://reddit.com/r/LifeProTips/comments/7sgpyf/lpt_college_isnt_the_only_way_to_start_a_good/'}]

They are clean, and the URLs are valid now. Let’s use a JSON serializer and save this data into a JSON file. The code below creates and opens a file called data.json and writes the data into it.

with open('data.json', 'w') as outfile:
    json.dump(extracted_records, outfile)
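As a quick sanity check, you can read the file back with json.load and confirm the round trip preserves the records. The sample record below is made up for illustration:

```python
import json

# A made-up record standing in for the scraped data
records = [{"title": "An example link", "url": "https://old.reddit.com/r/pics/comments/abc123/"}]

# Serialize to disk, then read it straight back
with open("data.json", "w") as outfile:
    json.dump(records, outfile, indent=4)

with open("data.json") as infile:
    loaded = json.load(infile)

print(loaded == records)  # True
```

The indent=4 argument is optional; it just makes the file human-readable.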

That's it. Here is the full code of the scraper. We have made some minor changes to the code – see if you can find them; they aren't hard to spot.

import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://old.reddit.com/top/"
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
request = urllib.request.Request(url,headers=headers)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First lets get the HTML of the table called site Table where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
#Now we go into main_table and get every a element in it which has a class "title" 
links = main_table.find_all("a",class_="title")
#List to store a dict of the data we extracted 
extracted_records = []
for link in links: 
    title = link.text
    url = link['href']
    #There are better ways to check if a URL is absolute in Python. For the sake of simplicity we'll just stick to the .startswith method of a string
    # https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python 
    if not url.startswith('http'):
        url = "https://reddit.com"+url 
    # You can join urls better using urlparse library of python. 
    # https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin 
    #Lets just print it 
    print("%s - %s"%(title,url))
    record = {
        'title':title,
        'url':url
        }
    extracted_records.append(record)
#Lets write these to a JSON file for now. 
with open('data.json', 'w') as outfile:
    json.dump(extracted_records, outfile, indent=4)

Execute the full code:

python reddit_scraper.py

If everything went right, you should see a file called data.json in the same folder as the code, containing the extracted data. That's it – you have built a web scraper to get data from Reddit. Yes, it's a very simple scraper, but it's good enough to demonstrate the basics of data scraping.

What’s next?

In the next post of this series, we will continue working on this scraper by going one page deeper into the comments page to extract more details, like the number of upvotes, the number of comments, and some top-level comments. You can view the third part of this Web Scraping Beginners Guide in the post – Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data.

8 comments on "Beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup"

Mark Hafemann

Great tutorial, thank you!

Joshua T Rees

Really appreciate the tutorial, but ran into several issues. The "<div> called siteTable" and so forth no longer exists on reddit but does on old.reddit. Besides that, consistently getting "urllib.error.HTTPError: HTTP Error 429: Too Many Requests", which after some googling I understand is an issue with not having a custom user-agent, not addressed here.

    ScrapeHero

    Great find. We just wanted our readers to figure this out on their own, before moving to the next part. If you look at Part 3 of this tutorial, we have added User Agents there. 🙂

Erica Anderson

Hi! Thank you for this tutorial. I have run into a small issue here:

File "reddit_scraper.py", line 24
    'url':url
        ^
SyntaxError: invalid syntax

I merely copy/pasted your code above, before you added the cleaning. Any idea what is going wrong here? I am using Python 3.6.5 on Ubuntu 18.04.1.

    Nick

    Hey Erica,

    You did not put the comma before your 'url'.

    Missing comma:

    record = {
        'title':title
        'url':url
    }

    Correct:

    record = {
        'title':title,
        'url':url
    }

Erica Anderson

Hi, thank you for this tutorial. I previously tried to post a comment with a question, doesn’t look like it went through but it turned out to be a missing comma problem..

I added User Agents and also changed the URL to old.reddit.com, but I am now getting the following error:

File "reddit_scraper.py", line 18, in <module>
    url = link['href']
File "/usr/lib/python3/dist-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'

Any idea why this is happening and/or how to fix it?

Thanks again!

redzch

how do you scrape multiple pages into a large pandas dataframe?
