Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data

In this part of our Web Scraping – Beginners Guide series, we’ll show you how to navigate web pages and parse and extract data from them. Let’s continue from where we left off in the previous post – Beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup. If you are new to this series, we recommend that you start from What is web scraping: Part 1 – Beginner’s guide.

Reddit (old.reddit.com) is used here because it is a popular website. If you are looking for Reddit data, use the Reddit API instead.

Navigating Pages using a simple Crawler

Now that we have extracted the titles and URLs of the top links, let’s go further. Instead of grabbing just the title and URL of each link, we’ll go to the comments page for each link. Like before, right click on any comments link and choose Inspect Element. The Chrome developer toolbar should pop up, highlighting the `a` tag that contains the URL to the comments page.
[Image: Inspecting a comments link on the Reddit top page]

You can see that it has the class attribute `class="bylink comments may-blank"`. But before we decide, let’s make sure the `a` tags linking to the comments of the other posts also have the same class.

a href="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-inbound-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/?
utm_content=comments&utm_medium=browse&utm_source=reddit&utm_name=frontpage" data-href-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-event-action="comments"
class="bylink comments may-blank" rel="nofollow">2876 comments/a
If the comment links didn’t share at least one common class, we would need to find another attribute to select these tags by. Fortunately, in this case, all the comments links share three classes – bylink, comments, and may-blank.
Let’s modify the code from the previous scraper, keeping everything up to the point where we isolate the `div` with the id `siteTable` and removing the rest.
import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://old.reddit.com/top/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
# First, let's get the HTML of the div with id 'siteTable', where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
Let’s select all the `a` tags whose class attribute equals `bylink comments may-blank`, using `find_all` on the HTML we isolated into main_table.
comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
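
As an aside, the `data-event-action="comments"` attribute visible in the snippet above would also work as a selector if the classes were not consistent. A minimal sketch of that alternative:

# Alternative: select the same comment links by their data-event-action attribute
comment_a_tags = main_table.find_all('a', attrs={'data-event-action': 'comments'})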
Now that we have all the `a` tags with comment links, let’s go ahead and extract the `href` attribute from each of them. Some of the values are absolute links and some are relative. Let’s clean them up by prepending https://www.reddit.com to the relative URLs to make them valid.
# Empty list to store the URLs as they are extracted
urls = []
for a_tag in comment_a_tags:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://www.reddit.com" + url
    urls.append(url)
urls
['https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/',
     'https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/',
     'https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/',
     'https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/',
     'https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/',
     'https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/',
     'https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/',
     'https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/',
     'https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/',
     'https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/',
     'https://www.reddit.com/r/pics/comments/8784r6/tunnel_vision/',
     'https://www.reddit.com/r/worldnews/comments/8787jz/prime_minister_shinzo_abe_wants_to_repeal_a/',
     'https://www.reddit.com/r/wholesomememes/comments/878zkf/ryan_being_an_awesome_person/',
     'https://www.reddit.com/r/todayilearned/comments/87brza/til_theres_a_type_of_honey_called_mad_honey_which/',
     'https://www.reddit.com/r/Tinder/comments/879g0h/her_bio_said_if_you_peaked_in_high_school_stay/',
     'https://www.reddit.com/r/news/comments/8791fr/spy_poisoning_us_to_expel_60_russian_diplomats/',
     'https://www.reddit.com/r/news/comments/87adao/facebook_confirms_it_records_call_history_stoking/',
     'https://www.reddit.com/r/OldSchoolCool/comments/878prh/johnny_cash_playing_at_folsom_prison_1968/',
     'https://www.reddit.com/r/MadeMeSmile/comments/877qz8/dads_wise_words/',
     'https://www.reddit.com/r/gaming/comments/87bk5l/this_is_where_the_real_fun_begins/',
     'https://www.reddit.com/r/gaming/comments/877jq6/unreal_engine_4_showcase_with_andy_serkis/',
     'https://www.reddit.com/r/worldnews/comments/87a4la/canada_is_all_set_to_legalize_marijuana_by_the/',
     'https://www.reddit.com/r/iamverysmart/comments/878mki/tech_wannabe_shutdown_by_the_codes_author/',
     'https://www.reddit.com/r/CrappyDesign/comments/8789y7/apparently_incest_is_perfectly_fine/',
     'https://www.reddit.com/r/funny/comments/87c7p3/fantastic_mr_fox_snatches_wallet/']

We’ve extracted and cleaned the URLs to the comments page.
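
For reference, once the per-page extraction function is ready, the crawl over these URLs might look like the minimal sketch below. The one-second delay is our own assumption, added so we don’t hammer the server:

import time

for url in urls:
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    html = urllib.request.urlopen(request).read()
    page_soup = BeautifulSoup(html, 'html.parser')
    # ... extract the fields from page_soup here, using the selectors defined below ...
    time.sleep(1)  # assumed polite delay between requests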

Before we go into downloading all the URLs, let’s make a function to extract data from each page. We’ll first download one comment page https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/, extract the data from it and then turn it into a function.

Let’s take a look at a comments page and find out where each piece of data we are looking for sits in the HTML.

What data are we extracting?

[Image: The details to extract from a Reddit comments page]

We’ll extract these fields from the comments page (a sketch of the resulting record follows the list):

  • Title
  • Permalink
  • Original Poster
  • Number of Upvotes
  • Number of Comments
  • Comments (all visible ones)
    • commenter
    • comment text
    • permalink to comment
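
To make the target concrete, here is a sketch of the record we’ll build for each page. The field names are our own choice, not anything Reddit defines:

post_data = {
    'title': None,
    'permalink': None,
    'original_poster': None,
    'upvotes': None,
    'comment_count': None,
    'comments': [],  # each entry: {'commenter': ..., 'comment_text': ..., 'permalink': ...}
}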

Finding the data in the HTML

To find where each field we need sits in the HTML, let’s do what we always do – right click the detail and inspect the element. Let’s download and get the HTML body for one URL first. We will later add this to the for loop above.

Download the Page Content

# Reuse the User-Agent header from before – requests without one are often rejected by Reddit
request = urllib.request.Request('https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/', headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

Defining selection criteria

Selecting the Title

Right click on the post title and inspect the element. The `a` tag enclosing the title looks like this:
a class="title may-blank outbound" data-event-action="title" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" 
tabindex="1" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" 
data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&
amp;token=AQAAZyBzWjNbF5yCNEDFBuOEFiTElADct-PZ77Hq713JVTrIro5h&app_name=reddit.com" data-outbound-expiration="1517494375000" rel="">Cancer ‘vaccine’ eliminates 
tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.
It looks like we can use the `class` attribute to select this like before. You could also use the `data-event-action` attribute to get the title.
Let’s test it out
soup.find('a',attrs={'class':'title may-blank outbound'}).text
soup.find('a',attrs={'data-event-action':'title'}).text
Let’s check if there are multiple elements that match the selector we have used. Ideally, we need the selector to match only one element – the one we want to extract. The `find` method of BeautifulSoup returns the first element that matches the selector, so the selection criteria we choose must be unique to the element (preferred) or should at least match it first.
The `find_all` method of BeautifulSoup selects all the elements that match the selection criteria. We had used `find_all` to get the `a` elements that had links to the comments page above.
Let’s take a look at each of these methods and what elements they would return if we only used the class `title` for getting the title.
soup.find('a',attrs={'class':'title'})
Okay, `find` method picked the right element. Let’s just check if there are multiple matches by using `find_all`
soup.find_all('a',attrs={'class':'title'})
Looks like there aren’t any other `a` tags with `title` in their class list. (Note that BeautifulSoup treats `class` as a multi-valued attribute, so `{'class':'title'}` matches any tag whose class list contains `title` – which is how it matched the tag with `class="title may-blank outbound"`.) Let’s check if there are any tags other than `a` tags with the same class. We will leave the `name` argument – the first parameter – blank, so that it matches all tags, not just `a` tags.
soup.find_all(attrs={'class':'title'})
[<span class="selected title">my subreddits</span>,
 <div class="title"><h1>MODERATORS</h1></div>,
 <p class="title"><a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a> <span class="domain">(<a href="/domain/med.stanford.edu/">med.stanford.edu</a>)</span></p>,
 <a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a>,
 <span class="title">top 200 comments</span>,
 <li class="flat-vert title">about</li>,
 <li class="flat-vert title">help</li>,
 <li class="flat-vert title">apps &amp; tools</li>,
 <li class="flat-vert title">&lt;3</li>]
There we go – 9 elements have `title` as a class. We are lucky that only one of them is an `a` tag. The reason we showed this is to reinforce that you should always try to define a unique selection criterion that will not match anything other than what you are looking for.
To get the title, we can use:
soup.find('a',attrs={'class':'title'}).text

or

soup.find('a',attrs={'class':'title may-blank outbound'}).text

or

soup.find('a',attrs={'data-event-action':'title'}).text
The lesson here is that there are always multiple ways to select the same element, and there is no single correct way. Just pick one that works across multiple pages of the same type, by testing the same criteria on different URLs of the same kind of page – in this case, a few other comment pages.
Now that we have shown in detail how to extract the title, we’ll get a few other fields without explaining as much. They are easy to figure out.
title = soup.find('a',attrs={'class':'title'}).text
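
The original poster and the post permalink can be grabbed the same way. The selectors below are assumptions based on old Reddit’s markup – the `author` class on the username link, and the same `data-event-action="comments"` link we used on the top page – so inspect the page to confirm them:

# Assumed selectors – verify these against the page you are scraping
original_poster = soup.find('a', attrs={'class': 'author'}).text
permalink = soup.find('a', attrs={'data-event-action': 'comments'})['href']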

Selecting Number of UpVotes

Inspect the UpVote count. You should see something similar to the snippet below
<div class="midcol unvoted">
    <div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0"></div>
    <div class="score dislikes" title="88324">88.3k</div>
    <!-- THE DIV BELOW WOULD BE HIGHLIGHTED --> 
    <div class="score unvoted" title="88325">88.3k</div>
    <div class="score likes" title="88326">88.3k</div>
    <div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0"></div>
</div>
Reddit seems to display 3 divs: one for a person who has downvoted – `<div class="score dislikes" title="88324">88.3k</div>`, one for a person who hasn’t voted – `<div class="score unvoted" title="88325">88.3k</div>`, and one for the upvoter – `<div class="score likes" title="88326">88.3k</div>`. Since our scraper doesn’t log in or vote, we fall into the second category – unvoted. So let’s pick that as our selector: we can look for a `div` with `class` set to `score unvoted` and get the text from it. First, let’s verify whether there are multiple matches for our criteria.
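
A minimal sketch of that check and of the extraction itself, following the same pattern we used for the title:

# Check how many elements match – ideally only the one we want
soup.find_all('div', attrs={'class': 'score unvoted'})

# Extract the abbreviated count, e.g. '88.3k'
upvotes = soup.find('div', attrs={'class': 'score unvoted'}).text

# The title attribute holds the exact count, e.g. '88325'
exact_upvotes = soup.find('div', attrs={'class': 'score unvoted'})['title']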
