Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data

In this part of our Web Scraping – Beginners Guide tutorial series, we’ll show you how to scrape Reddit comments, navigate to the comment pages, and parse and extract data from them. Let’s continue from where we left off in the previous post – Web Scraping Guide: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup.

Reddit (old.reddit.com) is used here as it is a popular website. If you are looking for Reddit data use the Reddit API instead.

Navigating Pages using a simple Reddit Crawler

Now that we have extracted the titles and URLs of the top links, let’s go further. Instead of grabbing just the title and URL of each link, we’ll go to the comments page for each link. Like before, right-click on any comments link and choose ‘Inspect Element’. The Chrome developer toolbar should pop up, highlighting the a tag that contains the URL of the comments page.


You can see that it has a class attribute – class='bylink comments may-blank'. But before we decide, let’s make sure the a tags linking to the comments of the other posts also have the same classes.

<a href="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/"
data-inbound-url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/?utm_content=comments&utm_medium=browse&utm_source=reddit&utm_name=frontpage"
data-href-url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/"
data-event-action="comments" class="bylink comments may-blank" rel="nofollow">2876 comments</a>

If all the comment links don’t share at least one common class, we would need to find another attribute to use for selecting these tags. Fortunately, in this case, all the comment links have three common classes – bylink, comments, and may-blank.

Let’s modify the code from the previous web scraper. We’ll keep everything up to the point where we isolate the div with the id siteTable and remove the rest.

import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://old.reddit.com/top/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First, let's get the HTML of the div with id siteTable, where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})

Let’s select all the a tags with the class attribute equal to bylink comments may-blank, using find_all on the HTML we isolated into main_table.

comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
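One caveat: passing a class string like 'bylink comments may-blank' matches the class attribute’s exact string value, so it depends on the classes appearing in that order in the markup. BeautifulSoup’s CSS selector interface, select, matches the three classes in any order. A minimal sketch on a made-up fragment (the tags below are illustrative, not real Reddit markup):

```python
from bs4 import BeautifulSoup

# Illustrative fragment mimicking one row of the siteTable div
html = '''
<div id="siteTable">
  <a class="bylink comments may-blank" href="/r/pics/comments/abc123/example/">100 comments</a>
  <a class="title may-blank" href="https://example.com/">Example title</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
main_table = soup.find('div', attrs={'id': 'siteTable'})

# CSS selector: requires all three classes, in any order
comment_a_tags = main_table.select('a.bylink.comments.may-blank')
print(comment_a_tags[0]['href'])  # /r/pics/comments/abc123/example/
```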

Now that we have all the a tags with comment links, let’s go ahead and extract the href attribute from them. The values are a mix of absolute and relative links, so let’s clean them by prepending https://www.reddit.com to the relative URLs to make them valid.

#Empty list to store the URLs as they are extracted
urls = []
for a_tag in comment_a_tags:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://www.reddit.com" + url
    urls.append(url)
urls
['https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/',
     'https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/',
     'https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/',
     'https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/',
     'https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/',
     'https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/',
     'https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/',
     'https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/',
     'https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/',
     'https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/',
     'https://www.reddit.com/r/pics/comments/8784r6/tunnel_vision/',
     'https://www.reddit.com/r/worldnews/comments/8787jz/prime_minister_shinzo_abe_wants_to_repeal_a/',
     'https://www.reddit.com/r/wholesomememes/comments/878zkf/ryan_being_an_awesome_person/',
     'https://www.reddit.com/r/todayilearned/comments/87brza/til_theres_a_type_of_honey_called_mad_honey_which/',
     'https://www.reddit.com/r/Tinder/comments/879g0h/her_bio_said_if_you_peaked_in_high_school_stay/',
     'https://www.reddit.com/r/news/comments/8791fr/spy_poisoning_us_to_expel_60_russian_diplomats/',
     'https://www.reddit.com/r/news/comments/87adao/facebook_confirms_it_records_call_history_stoking/',
     'https://www.reddit.com/r/OldSchoolCool/comments/878prh/johnny_cash_playing_at_folsom_prison_1968/',
     'https://www.reddit.com/r/MadeMeSmile/comments/877qz8/dads_wise_words/',
     'https://www.reddit.com/r/gaming/comments/87bk5l/this_is_where_the_real_fun_begins/',
     'https://www.reddit.com/r/gaming/comments/877jq6/unreal_engine_4_showcase_with_andy_serkis/',
     'https://www.reddit.com/r/worldnews/comments/87a4la/canada_is_all_set_to_legalize_marijuana_by_the/',
     'https://www.reddit.com/r/iamverysmart/comments/878mki/tech_wannabe_shutdown_by_the_codes_author/',
     'https://www.reddit.com/r/CrappyDesign/comments/8789y7/apparently_incest_is_perfectly_fine/',
     'https://www.reddit.com/r/funny/comments/87c7p3/fantastic_mr_fox_snatches_wallet/']

We’ve extracted and cleaned the URLs to the comments page.
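As an aside, the standard library’s urllib.parse.urljoin can do this cleanup for us – it resolves relative URLs against a base and leaves absolute URLs untouched, so the startswith check becomes unnecessary. A small sketch with illustrative paths:

```python
from urllib.parse import urljoin

base = "https://www.reddit.com/"

# Relative URL: resolved against the base
print(urljoin(base, "/r/pics/comments/abc123/example/"))
# https://www.reddit.com/r/pics/comments/abc123/example/

# Absolute URL: returned unchanged
print(urljoin(base, "https://www.reddit.com/r/gifs/comments/def456/"))
# https://www.reddit.com/r/gifs/comments/def456/
```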

Before we go into downloading all the URLs, let’s make a function to extract data from each web page. We’ll first download one comment page – https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/, extract the data from it and then turn it into a function.

Let’s take a look at a comments page and find out where the data we are looking for is present in the HTML. Here are the fields

What data are we Scraping from Reddit?


We’ll extract these fields from the comments page

  • Title
  • Permalink
  • Original Poster
  • Number of Upvotes
  • Number of Comments
  • Comments (all visible ones)
    • Commenter
    • Comment text
    • Permalink to the comment

Finding the data in the HTML

To find where each field lives in the HTML, let’s do what we always do – right-click the detail and inspect the element. Let’s download and get the HTML body for one URL first. We will later add this to the for loop above.

Download the Page Content

url = 'https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/'
request = urllib.request.Request(url, headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')

Defining selection criteria

Here we will select the elements we want to extract for this web scraper.

Selecting the Title

Right-click on the post title and inspect the element. The a tag enclosing the title looks like this

<a class="title may-blank outbound" data-event-action="title" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html"
tabindex="1" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html"
data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAZyBzWjNbF5yCNEDFBuOEFiTElADct-PZ77Hq713JVTrIro5h&amp;app_name=reddit.com" data-outbound-expiration="1517494375000" rel="">Cancer ‘vaccine’ eliminates
tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a>

It looks like we can use the class attribute to select this like before. You could also use the data-event-action attribute to get the title.

Let’s test it out

soup.find('a',attrs={'class':'title may-blank outbound'}).text
soup.find('a',attrs={'data-event-action':'title'}).text

Let’s check if there are multiple elements that match the selector we have used. Ideally, we need the selector to match only one element – the one we need to extract. The find method of BeautifulSoup takes the first element that matches the selector. So the selection criteria we choose must be unique to the element (preferred) or should be the first match.

The find_all method of BeautifulSoup selects all the elements that match the selection criteria. We had used find_all to get the a elements that had links to the comments page above.
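A toy fragment makes the difference concrete – find returns only the first matching element, while find_all returns a list of every match:

```python
from bs4 import BeautifulSoup

# Two tags share the same class in this illustrative fragment
html = '<div><a class="x">one</a><a class="x">two</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a', attrs={'class': 'x'}).text)       # one
print(len(soup.find_all('a', attrs={'class': 'x'})))   # 2
```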

Let’s take a look at what each of these methods would return if we used only the class title to get the title.

soup.find('a',attrs={'class':'title'})

Okay, the find method picked the right element. So let’s check whether there are multiple matches by using find_all.

soup.find_all('a',attrs={'class':'title'})

Looks like there aren’t any other a tags with the class title. Let’s also check whether there are any tags other than a tags with the same class. We’ll leave the name argument (the first parameter) blank so that it matches all tags – not just a tags.

soup.find_all(attrs={'class':'title'})
[<span class="selected title">my subreddits</span>,
 <div class="title"><h1>MODERATORS</h1></div>,
 <p class="title"><a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a> <span class="domain">(<a href="/domain/med.stanford.edu/">med.stanford.edu</a>)</span></p>,
 <a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a>,
 <span class="title">top 200 comments</span>,
 <li class="flat-vert title">about</li>,
 <li class="flat-vert title">help</li>,
 <li class="flat-vert title">apps &amp; tools</li>,
 <li class="flat-vert title">&lt;3</li>]

There we go – 9 elements have `title` as a class attribute. We are lucky that only one of them is an `a` tag. The reason we showed this is to reinforce that you should always try to define unique selection criteria that won’t match anything other than what you are looking for.

To get the title, we can use:

soup.find('a',attrs={'class':'title'}).text

or

soup.find('a',attrs={'class':'title may-blank outbound'}).text

or

soup.find('a',attrs={'data-event-action':'title'}).text

The lesson here is that there are always multiple ways to select the same element, and there is no single correct way. Just pick one that works across multiple pages of the same type by testing the criteria on a few different URLs of the same kind of page – in this case, a few other comment pages.
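One way to sketch that kind of check without fetching live pages is to wrap the selector in a small function and run it over sample fragments. The fragments below are simplified stand-ins for real comment pages:

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the post title text, or None if the selector finds nothing."""
    tag = BeautifulSoup(html, 'html.parser').find('a', attrs={'data-event-action': 'title'})
    return tag.text if tag else None

# Simplified stand-ins for two different comment pages
pages = [
    '<p class="title"><a class="title may-blank" data-event-action="title" href="#">First post</a></p>',
    '<p class="title"><a class="title may-blank outbound" data-event-action="title" href="#">Second post</a></p>',
]
for page in pages:
    print(extract_title(page))  # First post, then Second post
```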
Now that we have shown in detail how to extract the title, we’ll get the remaining fields without explaining each step. They are easy to figure out.

title = soup.find('a',attrs={'class':'title'}).text

Selecting Number of UpVotes

Inspect the UpVote count. You should see something similar to the markup below.

<div class="midcol unvoted">
    <div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0"></div>
    <div class="score dislikes" title="88324">88.3k</div>
    <!-- THE DIV BELOW WOULD BE HIGHLIGHTED --> 
    <div class="score unvoted" title="88325">88.3k</div>
    <div class="score likes" title="88326">88.3k</div>
    <div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0"></div>
</div>
Reddit displays three score divs: one for a user who has downvoted – `<div class="score dislikes" title="88324">88.3k</div>`, one for a user who hasn’t voted – `<div class="score unvoted" title="88325">88.3k</div>`, and one for a user who has upvoted – `<div class="score likes" title="88326">88.3k</div>`. Since our scraper doesn’t log in or vote, we fall into the second category – unvoted. So let’s pick that as our selector: a `div` with `class` equal to `score unvoted`, and take the text from it. First, let’s verify whether there are multiple matches for this selector.
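A quick sketch of that verification on a static copy of the markup above (so it runs without fetching the page). Note that the title attribute carries the exact vote count, while the tag text is only the rounded display value:

```python
from bs4 import BeautifulSoup

# Static copy of the vote markup shown above
html = '''
<div class="midcol unvoted">
    <div class="score dislikes" title="88324">88.3k</div>
    <div class="score unvoted" title="88325">88.3k</div>
    <div class="score likes" title="88326">88.3k</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only the middle div matches 'score unvoted' exactly
matches = soup.find_all('div', attrs={'class': 'score unvoted'})
print(len(matches))            # 1
upvotes = matches[0]['title']  # exact count from the title attribute
print(upvotes)                 # 88325
```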
