This article is a continuation of our Beginner's Guide to Web Scraping series. In this tutorial, we will learn how to create a web scraper using Python and BeautifulSoup and use it to scrape Reddit.
Let’s build a very basic web scraper using Python and BeautifulSoup and scrape the top links from Reddit.com.
Steps for web scraping Reddit
- Send a request to https://old.reddit.com/top/ and download the HTML Content of the page. This scraper does not need a web crawling component as we are only extracting data from a single link. In later posts of this series, we show you how to build more complex scrapers that need web crawlers. We will use Python’s built-in URL handling library – urllib to download the HTML of the web page.
- Parse the downloaded data using an HTML parser to extract some data (the scraper's parser module). For parsing the HTML, we will use BeautifulSoup 4, a library for pulling data out of HTML and XML files. It works with HTML parsers to provide idiomatic ways of navigating, searching, and modifying the parse tree. For this tutorial, we will use BeautifulSoup along with Python's own HTML parser library.
- Transform the data into a usable format – The data transformation and cleaning module of this scraper. We don’t need any special packages to transform data in this scraper.
- Print the extracted data to the terminal (or console) and also save it to a JSON file (our data serialization and storage module). We will use Python's built-in json library to serialize the data into JSON and write it to a file.
We use Reddit (old.reddit.com) here because it is a popular website. If you are looking for Reddit data, use the Reddit API instead.
How to set up your computer for web scraping Reddit
We will use Python 3 for this tutorial. The code will not run if you are using Python 2.7. To start, you need a computer with Python 3 and pip installed.
Most UNIX-like operating systems, such as Linux and macOS, come with Python pre-installed. However, not all of them ship with Python 3 by default.
Let's check your Python version. Open a terminal (on Linux and macOS) or Command Prompt (on Windows) and type
python --version
and press Enter. If the output looks something like Python 3.x.x, you have Python 3 installed. If it says Python 2.x.x, you have Python 2. If it prints an error, you probably don't have Python installed.
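On systems where the python command still points at Python 2, there is usually a separate python3 command. Checking both quickly tells you what you have:

```shell
# "python" may point at Python 2 on older systems;
# "python3" (where available) always invokes Python 3.
python --version
python3 --version
```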
If you don’t have Python 3, install it first.
Install Python 3 and Pip
Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/
Mac Users can follow this guide – http://docs.python-guide.org/en/latest/starting/install3/osx/
Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/
Install Packages
Two of our requirements – the json and urllib libraries – come bundled with Python 3. We just have to install BeautifulSoup. On Linux and macOS, open a terminal and run
sudo pip install beautifulsoup4
On Windows, open Command Prompt or PowerShell and run
pip install beautifulsoup4
You now have all the libraries required to build and run the scraper.
What data are we scraping from Reddit?
We will extract the following data fields from this page – https://old.reddit.com/top/ :
- Title of the link
- The URL it points to
Finding the Data
Before we start building the scraper, we need to find where the data sits in the web page's HTML. To do that, you need to understand the HTML tags inside the page's content.
We assume you already understand HTML and know how to code in Python. You don't need advanced programming skills for most of this tutorial.
If you don’t know much about HTML and Python, spend some time reading Getting started with HTML – Mozilla Developer Network and https://www.programiz.com/python-programming
Let's inspect the HTML of the web page and find out where the data is. Here is our logic:
- Find the tag that encloses the list of links
- Get links from it and extract data
Inspecting the HTML
Open a browser ( we are using Chrome here ) and go to https://old.reddit.com/top/
Right-click on any link on the page and choose – Inspect Element. The browser will open a toolbar and show the HTML Content of the Web Page, formatted nicely.
If you look closely at the HTML, there is a DIV tag with an 'id' attribute of 'siteTable'. This DIV encloses the data we need to extract.
Now let's find the HTML tag(s) that hold the links we need to extract. Right-click on a link title in the browser and choose Inspect Element again. It will open the HTML content like before and highlight the tag that holds the data you right-clicked on.
For each link displayed on Reddit, the data is present in an <a> tag. If you inspect the other links, you will see that they sit in similar <a> tags that follow the same template. BeautifulSoup has a function that finds all tags with certain attributes. In this case, we are looking for all <a> tags inside the div with id 'siteTable' whose class attribute contains 'title'.
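To see how this matching works before touching Reddit itself, here is a small sketch using a made-up HTML snippet that mimics the structure described above (the snippet and its contents are ours, not Reddit's):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the structure: a div with id "siteTable"
# containing <a> tags whose class attribute includes "title".
html = """
<div id="siteTable">
  <a class="title" href="/r/pics/1">First post</a>
  <a class="title may-blank" href="https://example.com/x">Second post</a>
  <a class="thumbnail" href="/r/pics/1">(thumb)</a>
</div>
<div id="other"><a class="title" href="/ignored">Outside</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
main_table = soup.find("div", attrs={"id": "siteTable"})
# class_ matches when "title" is one of the tag's classes,
# even if the tag carries additional classes.
links = main_table.find_all("a", class_="title")
print([a["href"] for a in links])  # ['/r/pics/1', 'https://example.com/x']
```

Note that the <a> tag outside siteTable is not included, because we searched inside main_table rather than the whole soup.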
Now that we know where the data is present in the HTML, let’s get into extracting it.
The Code
Open up your favorite text editor or a Jupyter Notebook, and get ready to start coding. Create an empty file called reddit_scraper.py and save it. After each block of code below, you can save the file and run the script using
python reddit_scraper.py
If your computer has multiple versions of Python, you can run python3 reddit_scraper.py to make sure it uses Python 3 and not Python 2.
Let's first import the libraries we need – urllib, BeautifulSoup, and json.
import urllib.request
from bs4 import BeautifulSoup
import json
Now, the code to send the request and download the HTML content of the URL:
url = "https://old.reddit.com/top/"

#download the URL and extract the content to the variable html
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
The HTML of the URL is now in the variable called html. Let's pass it to BeautifulSoup, which will construct a parse tree for us. BeautifulSoup can work with different parsers, such as lxml and html.parser. For now, we will use Python's built-in HTML parser – 'html.parser'.
First, we will isolate the <div> with id siteTable into a variable – main_table – as we are not interested in extracting the other links from Reddit. Next, we will find all the <a> tags inside it whose class attribute contains 'title'.
#pass the HTML to BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

#get the HTML of the div called siteTable where all the links are displayed
main_table = soup.find("div", attrs={'id': 'siteTable'})

#now go into main_table and get every <a> element in it that has the class "title"
links = main_table.find_all("a", class_="title")
We’ll go through each link we found and extract the text and the URL from them.
#from each link, extract the title text and the URL itself
#list to store a dict of the data we extracted
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)
print(extracted_records)
Here is how the extracted records look now.
[{'title': 'Dad prevents crash.', 'url': 'https://i.imgur.com/UDLTfSl.gifv'}, {'title': 'When your cat crashes his bicycle in his dream...', 'url': '/r/funny/comments/7sf1ra/when_your_cat_crashes_his_bicycle_in_his_dream/'}, {'title': 'My friend playing Mario Odyssey during class', 'url': '/r/gaming/comments/7sfd5v/my_friend_playing_mario_odyssey_during_class/'}, {'title': 'So it snowed on the cabbage field...', 'url': '/r/pics/comments/7sfhqk/so_it_snowed_on_the_cabbage_field/'}, {'title': 'Pinterest is like a virus that infected the google image search.', 'url': '/r/Showerthoughts/comments/7sd54f/pinterest_is_like_a_virus_that_infected_the/'}, {'title': 'Synced videos of the Eagles fan running into the pillar', 'url': '/r/sports/comments/7sgrbd/synced_videos_of_the_eagles_fan_running_into_the/'}, {'title': 'My wife just shot this pic of a sleepy albino squirrel', 'url': 'https://i.imgur.com/HPJ7eq7.jpg'}, {'title': 'An old abandoned road slowly healing over and being reclaimed by nature.', 'url': '/r/pics/comments/7sdwnt/an_old_abandoned_road_slowly_healing_over_and/'}, {'title': 'I have a condition called dermatographism, where I can ‘write’ on my skin and it appears as a rash', 'url': '/r/mildlyinteresting/comments/7scrlj/i_have_a_condition_called_dermatographism_where_i/'}, {'title': 'Trying to pacifist run SuperHot is insane.', 'url': 'https://gfycat.com/OrdinaryHalfArachnid'}, {'title': 'His name is Buddha!', 'url': '/r/aww/comments/7sdp8r/his_name_is_buddha/'}, {'title': 'TIL almost 1 in 4 people with tattoos regret it, meaning about 7.5 million Americans', 'url': 'http://www.medicaldaily.com/tattoos-affect-your-health-long-term-side-effects-ink-has-your-immune-system-404404'}, {'title': 'Orangutan saves friend from drowning.', 'url': 'https://i.imgur.com/QSYWdRh.gifv'}, {'title': 'Shocking prison secrets that no-one tells you.', 'url': '/r/funny/comments/7scz5e/shocking_prison_secrets_that_noone_tells_you/'}, {'title': 'Dad reflexes prevent crash.', 'url': 
'https://i.imgur.com/UDLTfSl.gifv'}, {'title': 'Heatmap of numbers found at the end of Reddit usernames [OC]', 'url': '/r/dataisbeautiful/comments/7sewjx/heatmap_of_numbers_found_at_the_end_of_reddit/'}, {'title': 'IBM Ball Head typewriter', 'url': 'https://i.imgur.com/zCg1LX1.gifv'}, {'title': 'M 8.0 earthquake in Alaska', 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/at00p3054t#executive'}, {'title': 'These traffic lights in the Ukraine.', 'url': 'http://i.imgur.com/OINyphR.jpg'}, {'title': 'In 1974, 22-year-old Daniel Sorine trained his camera on two mime artists performing in New York’s Central Park. In 2013, Daniel was looking through his negatives and photographs when he realised one of the mimes was Oscar winning actor Robin Williams', 'url': '/r/OldSchoolCool/comments/7sde78/in_1974_22yearold_daniel_sorine_trained_his/'}, {'title': 'A tower of giraffes out for a run', 'url': 'https://i.imgur.com/IVEl1WI.gifv'}, {'title': 'Diver suspended in current.', 'url': 'https://i.imgur.com/uPUoYjy.gifv'}, {'title': '77% drop', 'url': '/r/funny/comments/7sgljt/77_drop/'}, {'title': 'The wife and I went to the Grand Canyon this weekend. Top was Saturday, Bottom was Sunday.', 'url': '/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/'}, {'title': "LPT: College isn't the only way to start a good career. Apprenticeships, Trade Schools, and Military Training can be great alternatives in today's world.", 'url': '/r/LifeProTips/comments/7sgpyf/lpt_college_isnt_the_only_way_to_start_a_good/'}]
If you look at some of the URLs above, they start with /r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/. They are relative URLs, which are invalid if you paste them into a browser. We have to do some "cleaning" here: prepend https://reddit.com to each relative URL. But there are also absolute URLs among them, like https://i.imgur.com/HPJ7eq7.jpg, so we only prepend when the link is relative.
Let's go through each URL and check whether it is absolute. For now, we'll just check if it starts with http. There are better ways to check whether a URL is absolute in Python, but for simplicity we'll stick to the .startswith method of a string.
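As an aside, Python's standard library already handles this distinction for you: urllib.parse.urljoin resolves a relative URL against a base and leaves absolute URLs untouched. A quick sketch:

```python
from urllib.parse import urljoin

base = "https://old.reddit.com/top/"

# A relative URL gets resolved against the base...
print(urljoin(base, "/r/pics/comments/7shgzt/"))
# -> https://old.reddit.com/r/pics/comments/7shgzt/

# ...while an absolute URL passes through unchanged.
print(urljoin(base, "https://i.imgur.com/HPJ7eq7.jpg"))
# -> https://i.imgur.com/HPJ7eq7.jpg
```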
Let's modify the extraction loop above to add this cleaning step.
#from each link, extract the title text and the URL itself
#list to store a dict of the data we extracted
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    #There are better ways to check if a URL is absolute in Python.
    #For simplicity, we'll just stick to the .startswith method of a string.
    # https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python
    if not url.startswith('http'):
        url = "https://reddit.com" + url
        # You can join URLs more robustly using urllib.parse
        # https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)
print(extracted_records)
The records should now look like this:
[{'title': 'Dad prevents crash.', 'url': 'https://i.imgur.com/UDLTfSl.gifv'}, {'title': 'When your cat crashes his bicycle in his dream...', 'url': 'https://reddit.com/r/funny/comments/7sf1ra/when_your_cat_crashes_his_bicycle_in_his_dream/'}, {'title': 'My friend playing Mario Odyssey during class', 'url': 'https://reddit.com/r/gaming/comments/7sfd5v/my_friend_playing_mario_odyssey_during_class/'}, {'title': 'So it snowed on the cabbage field...', 'url': 'https://reddit.com/r/pics/comments/7sfhqk/so_it_snowed_on_the_cabbage_field/'}, {'title': 'Pinterest is like a virus that infected the google image search.', 'url': 'https://reddit.com/r/Showerthoughts/comments/7sd54f/pinterest_is_like_a_virus_that_infected_the/'}, {'title': 'Synced videos of the Eagles fan running into the pillar', 'url': 'https://reddit.com/r/sports/comments/7sgrbd/synced_videos_of_the_eagles_fan_running_into_the/'}, {'title': 'My wife just shot this pic of a sleepy albino squirrel', 'url': 'https://i.imgur.com/HPJ7eq7.jpg'}, {'title': 'An old abandoned road slowly healing over and being reclaimed by nature.', 'url': 'https://reddit.com/r/pics/comments/7sdwnt/an_old_abandoned_road_slowly_healing_over_and/'}, {'title': 'I have a condition called dermatographism, where I can ‘write’ on my skin and it appears as a rash', 'url': 'https://reddit.com/r/mildlyinteresting/comments/7scrlj/i_have_a_condition_called_dermatographism_where_i/'}, {'title': 'Trying to pacifist run SuperHot is insane.', 'url': 'https://gfycat.com/OrdinaryHalfArachnid'}, {'title': 'His name is Buddha!', 'url': 'https://reddit.com/r/aww/comments/7sdp8r/his_name_is_buddha/'}, {'title': 'TIL almost 1 in 4 people with tattoos regret it, meaning about 7.5 million Americans', 'url': 'http://www.medicaldaily.com/tattoos-affect-your-health-long-term-side-effects-ink-has-your-immune-system-404404'}, {'title': 'Orangutan saves friend from drowning.', 'url': 'https://i.imgur.com/QSYWdRh.gifv'}, {'title': 'Shocking prison secrets that 
no-one tells you.', 'url': 'https://reddit.com/r/funny/comments/7scz5e/shocking_prison_secrets_that_noone_tells_you/'}, {'title': 'Dad reflexes prevent crash.', 'url': 'https://i.imgur.com/UDLTfSl.gifv'}, {'title': 'Heatmap of numbers found at the end of Reddit usernames [OC]', 'url': 'https://reddit.com/r/dataisbeautiful/comments/7sewjx/heatmap_of_numbers_found_at_the_end_of_reddit/'}, {'title': 'IBM Ball Head typewriter', 'url': 'https://i.imgur.com/zCg1LX1.gifv'}, {'title': 'M 8.0 earthquake in Alaska', 'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/at00p3054t#executive'}, {'title': 'These traffic lights in the Ukraine.', 'url': 'http://i.imgur.com/OINyphR.jpg'}, {'title': 'In 1974, 22-year-old Daniel Sorine trained his camera on two mime artists performing in New York’s Central Park. In 2013, Daniel was looking through his negatives and photographs when he realised one of the mimes was Oscar winning actor Robin Williams', 'url': 'https://reddit.com/r/OldSchoolCool/comments/7sde78/in_1974_22yearold_daniel_sorine_trained_his/'}, {'title': 'A tower of giraffes out for a run', 'url': 'https://i.imgur.com/IVEl1WI.gifv'}, {'title': 'Diver suspended in current.', 'url': 'https://i.imgur.com/uPUoYjy.gifv'}, {'title': '77% drop', 'url': 'https://reddit.com/r/funny/comments/7sgljt/77_drop/'}, {'title': 'The wife and I went to the Grand Canyon this weekend. Top was Saturday, Bottom was Sunday.', 'url': 'https://reddit.com/r/pics/comments/7shgzt/the_wife_and_i_went_to_the_grand_canyon_this/'}, {'title': "LPT: College isn't the only way to start a good career. Apprenticeships, Trade Schools, and Military Training can be great alternatives in today's world.", 'url': 'https://reddit.com/r/LifeProTips/comments/7sgpyf/lpt_college_isnt_the_only_way_to_start_a_good/'}]
They are clean, and the URLs are now valid. Let's use a JSON serializer to save this data to a JSON file. The code below creates a file called data.json and writes the data into it.
with open('data.json', 'w') as outfile:
    json.dump(extracted_records, outfile)
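If you want to verify that the file was written correctly, you can read it back with json.load. Here is a quick round-trip sketch using a couple of made-up records standing in for the scraper's extracted_records:

```python
import json

# Made-up records, standing in for the scraper's extracted_records
records = [
    {'title': 'Example post', 'url': 'https://example.com/1'},
    {'title': 'Another post', 'url': 'https://example.com/2'},
]

with open('data.json', 'w') as outfile:
    json.dump(records, outfile, indent=4)

# Reading the file back confirms that the data round-trips intact
with open('data.json') as infile:
    print(json.load(infile) == records)  # True
```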
That's it. Here is the full code of the scraper. We have made some minor changes along the way – most notably, we now send a browser-like User-Agent header with the request (without it, Reddit may respond with HTTP 429 Too Many Requests), print each record as it is extracted, and pretty-print the JSON output with indent=4.
import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://old.reddit.com/top/"
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
request = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

#first, get the HTML of the div called siteTable where all the links are displayed
main_table = soup.find("div", attrs={'id': 'siteTable'})
#now go into main_table and get every <a> element in it that has the class "title"
links = main_table.find_all("a", class_="title")

#list to store a dict of the data we extracted
extracted_records = []
for link in links:
    title = link.text
    url = link['href']
    #There are better ways to check if a URL is absolute in Python.
    #For simplicity, we'll just stick to the .startswith method of a string.
    # https://stackoverflow.com/questions/8357098/how-can-i-check-if-a-url-is-absolute-using-python
    if not url.startswith('http'):
        url = "https://reddit.com" + url
        # You can join URLs more robustly using urllib.parse
        # https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
    #print each record as we go
    print("%s - %s" % (title, url))
    record = {
        'title': title,
        'url': url
    }
    extracted_records.append(record)

#write the records to a JSON file
with open('data.json', 'w') as outfile:
    json.dump(extracted_records, outfile, indent=4)
Execute the full code
python reddit_scraper.py
If everything went right, you should see a file called data.json in the same folder as the code, containing the extracted data. That's it – you have built a web scraper for Reddit. Yes, it's a very simple scraper, but it is good enough to demonstrate the basics of data scraping.
What’s next?
In the next post of this series, we will continue working on this scraper by going one page deeper into the comments page and extracting more details, such as the number of upvotes, the number of comments, and some top-level comments. You can view the third part of this Web Scraping Beginner's Guide in the post – Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data.
Responses
Great tutorial, thank you!
Mark, Thank you for the feedback.
Really appreciate the tutorial, but ran into several issues. The "<div> called siteTable" and so forth no longer exists on reddit.com, but it does on old.reddit.com. Besides that, I'm consistently getting "urllib.error.HTTPError: HTTP Error 429: Too Many Requests", which after some googling I understand is an issue with not having a custom user-agent – not addressed here.
Great find. We just wanted our readers to figure this out on their own, before moving to the next part. If you look at Part 3 of this tutorial, we have added User Agents there. 🙂
Hi! Thank you for this tutorial. I have run into a small issue here:
File "reddit_scraper.py", line 24
    'url':url
        ^
SyntaxError: invalid syntax
I merely copy/pasted your code above, before you added the cleaning. Any idea what is going wrong here? I am using Python 3.6.5 on Ubuntu 18.04.1.
Hey Erica,
You did not put the comma before your 'url'.
Missing comma:
record = {
    'title':title
    'url':url
}

Corrected:

record = {
    'title':title,
    'url':url
}
Hi, thank you for this tutorial. I previously tried to post a comment with a question, doesn’t look like it went through but it turned out to be a missing comma problem..
I added User Agents and also changed the URL to old.reddit.com, but I am now getting the following error:
File "reddit_scraper.py", line 18, in <module>
    url = link['href']
File "/usr/lib/python3/dist-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'
Any idea why this is happening and/or how to fix it?
Thanks again!
how do you scrape multiple pages into a large pandas dataframe?
print(extracted_records)
[{'title': None, 'url': 'https://gfycat.com/cleanspiffyleafwing'}]
It’s not giving expected output.
In their extracted_records code, they have the following: title = link.textT
It should be: title = link.text
Thank you for this! FYI there is a typo in this line : title = link.textT should be title = link.text
Thanks Joel.
Good catch!
We will fix it.
The typo has been fixed.
Thanks for pointing it out.
Hello, I’m having some trouble scraping anything other than links. From my understanding the “a” tag is for links and so for links = main_table.find_all(“a”,class_=”title”) I would just need to replace “a” with the correct tag and “title” with the name. The problem I have is I can’t seem to figure out what the tag is for the thing I’m trying to scrape.
https://imgur.com/a/oEA8TRH
I want to scrape the last price and when I inspect it I only see a span class, and when I replace “a” with “span” it gives me an error. Here is the code I am using:
import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://www.tradingview.com/"
request = urllib.request.Request(url)
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
main_table = soup.find("div", attrs={'id': 'last-block'})
price = main_table.find_all("span", class_="last")
print(price)
Brother, don't just copy-paste and edit the code.
First try to understand what they did. Read the whole tutorial again with each example code, and print everything separately. Then you will understand what you are doing.
Great tutorial. Insightful and easy to follow through. Thanks
how to save title and url in csv or excel file?