Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data

In this part of our Web Scraping – Beginners Guide series we’ll show you how to navigate web pages and parse and extract data from them. Let’s continue from where we left off in the previous post – Beginners guide to Web Scraping: Part 2  – Build a web scraper for Reddit using Python and BeautifulSoup. If you are new to this series, we recommend that you start from What is web scraping: Part 1 – Beginner’s guide.

Reddit (old.reddit.com) is used here because it is a popular website. If you actually need Reddit data, use the Reddit API instead.

Navigating Pages using a simple Crawler

Now that we have extracted the titles and URLs of the top links, let's go further. Instead of grabbing just the title and URL of each link, we'll go to the comments page for each link. Like before, right click on any comments link and choose Inspect Element. The Chrome developer toolbar should pop up, highlighting the <a> tag that contains the URL to the comments page.

[Screenshot: the comments link of a post on the reddit top page, highlighted in the developer toolbar]

You can see that it has a class attribute of bylink comments may-blank. But before we decide on that, let's make sure the a tags to the comments of the other links also have the same classes.

a href="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-inbound-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/?
utm_content=comments&amp;utm_medium=browse&amp;utm_source=reddit&amp;utm_name=frontpage" data-href-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-event-action="comments"
class="bylink comments may-blank" rel="nofollow">2876 comments/a

If all the comment links don’t have at least a common class, we need to think of another attribute to use for selecting these tags. Fortunately, in this case, all the comments links have three common classes – bylink, comments, and may-blank.

Let's modify the code from the previous scraper, keeping everything up to the point where we isolate the div with the ID siteTable and removing the rest.

import urllib.request
from bs4 import BeautifulSoup
import json

url = "https://old.reddit.com/top/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First lets get the HTML of the div with the ID siteTable, where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})

Let's select all the a tags with the class attribute equal to bylink comments may-blank, using find_all on the HTML we isolated into main_table.

comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
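If those classes ever change, the data-event-action="comments" attribute visible in the snippet above could serve as an alternative handle. A quick sketch of that alternative (we'll stick with the classes for the rest of this post):

#Alternative selector (an assumption based on the snippet above, not what we use below)
comment_a_tags_alt = main_table.find_all('a', attrs={'data-event-action': 'comments'})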

Now that we have all the a tags with comment links, let's go ahead and extract the href attribute from them. Some of the values are absolute links and some are relative links, so let's clean them up by prepending https://reddit.com to the relative ones to make them valid.

#Blank list to store the urls as they are extracted
urls = [] 
for a_tag in comment_a_tags:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://reddit.com"+url
    urls.append(url)
urls
['https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/',
     'https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/',
     'https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/',
     'https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/',
     'https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/',
     'https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/',
     'https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/',
     'https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/',
     'https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/',
     'https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/',
     'https://www.reddit.com/r/pics/comments/8784r6/tunnel_vision/',
     'https://www.reddit.com/r/worldnews/comments/8787jz/prime_minister_shinzo_abe_wants_to_repeal_a/',
     'https://www.reddit.com/r/wholesomememes/comments/878zkf/ryan_being_an_awesome_person/',
     'https://www.reddit.com/r/todayilearned/comments/87brza/til_theres_a_type_of_honey_called_mad_honey_which/',
     'https://www.reddit.com/r/Tinder/comments/879g0h/her_bio_said_if_you_peaked_in_high_school_stay/',
     'https://www.reddit.com/r/news/comments/8791fr/spy_poisoning_us_to_expel_60_russian_diplomats/',
     'https://www.reddit.com/r/news/comments/87adao/facebook_confirms_it_records_call_history_stoking/',
     'https://www.reddit.com/r/OldSchoolCool/comments/878prh/johnny_cash_playing_at_folsom_prison_1968/',
     'https://www.reddit.com/r/MadeMeSmile/comments/877qz8/dads_wise_words/',
     'https://www.reddit.com/r/gaming/comments/87bk5l/this_is_where_the_real_fun_begins/',
     'https://www.reddit.com/r/gaming/comments/877jq6/unreal_engine_4_showcase_with_andy_serkis/',
     'https://www.reddit.com/r/worldnews/comments/87a4la/canada_is_all_set_to_legalize_marijuana_by_the/',
     'https://www.reddit.com/r/iamverysmart/comments/878mki/tech_wannabe_shutdown_by_the_codes_author/',
     'https://www.reddit.com/r/CrappyDesign/comments/8789y7/apparently_incest_is_perfectly_fine/',
     'https://www.reddit.com/r/funny/comments/87c7p3/fantastic_mr_fox_snatches_wallet/']

We’ve extracted and cleaned the URLs to the comments page.
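A side note: instead of checking startswith('http') and gluing strings together ourselves, urljoin from the standard library resolves relative links against a base URL and leaves absolute ones untouched. A minimal sketch of that alternative, assuming old.reddit.com as the base:

from urllib.parse import urljoin

#Alternative to the manual prepending above: urljoin resolves relative links against the base URL
#and leaves absolute URLs untouched
urls = [urljoin("https://old.reddit.com/", a_tag['href']) for a_tag in comment_a_tags]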

Before we go into downloading all the URLs, let’s make a function to extract data from each page. We’ll first download one comment page https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/, extract the data from it and then turn it into a function.

Let's take a look at a comments page and find out where each piece of data we are looking for is present in the HTML.

What data are we extracting?

[Screenshot: the details we'll extract, annotated on a reddit comments page]

We’ll extract these fields from the comments page

  • Title
  • Permalink
  • Original Poster
  • Number of Upvotes
  • Number of Comments
  • Comments ( all visible ones )
    • commenter
    • comment text
    • permalink to comment

Finding the data in the HTML

To find where each field we need is located in the HTML, let's do what we always do – right click the detail and inspect the element. Let's download and get the HTML body for one URL first; we will later add this into the for loop above.

Download the Page Content

#Adding a User-Agent string here too, so that this request is less likely to get blocked
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request('https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/',headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')

Defining selection criteria

Selecting the Title

Right click on the post title and inspect the element. The a tag enclosing the title looks like this

a class="title may-blank outbound" data-event-action="title" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" 
tabindex="1" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" 
data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&
amp;token=AQAAZyBzWjNbF5yCNEDFBuOEFiTElADct-PZ77Hq713JVTrIro5h&amp;app_name=reddit.com" data-outbound-expiration="1517494375000" rel="">Cancer ‘vaccine’ eliminates 
tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.

It looks like we can use the class attribute to select this like before. You could also use the data-event-action attribute to get the title.

Let’s test it out

soup.find('a',attrs={'class':'title may-blank outbound'}).text
soup.find('a',attrs={'data-event-action':'title'}).text

Let's check if there are multiple elements that match the selector we have used. Ideally, we want the selector to match only one element – the one we need to extract. The find method of BeautifulSoup returns the first element that matches the selector, so the selection criteria we choose must either be unique to the element (preferred) or the element we want must be the first match.

The find_all method of BeautifulSoup selects all the elements that match the selection criteria. We had used find_all to get the a elements that had links to the comments page above.

Let's take a look at what each of these methods would return if we used only the class title to get the title.

soup.find('a',attrs={'class':'title'})

Okay, the find method picked the right element. Let's just check whether there are multiple matches by using find_all.

soup.find_all('a',attrs={'class':'title'})

Looks like there aren’t any other a tags which have class as title. Let’s check if there are any tags other than an a tag with the same class. We will leave the name argument or the first parameter blank, so that it will match all tags – not just a tags.

soup.find_all(attrs={'class':'title'})

[<span class="selected title">my subreddits</span>,
 <div class="title"><h1>MODERATORS</h1></div>,
 <p class="title"><a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a> <span class="domain">(<a href="/domain/med.stanford.edu/">med.stanford.edu</a>)</span></p>,
 <a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&amp;token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&amp;app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a>,
 <span class="title">top 200 comments</span>,
 <li class="flat-vert title">about</li>,
 <li class="flat-vert title">help</li>,
 <li class="flat-vert title">apps &amp; tools</li>,
 <li class="flat-vert title">&lt;3</li>]

There we go – 9 elements have title as a class attribute, and we are lucky that only one of them is an a tag. The reason we showed this is to reinforce that you should always try to define selection criteria unique enough that they won't match anything other than what you are looking for.

To get the title, we can use:

soup.find('a',attrs={'class':'title'}).text

or

soup.find('a',attrs={'class':'title may-blank outbound'}).text

or

soup.find('a',attrs={'data-event-action':'title'}).text

The lesson here is that there are always multiple ways to select the same element, and there is no single correct way. Just pick one that works across multiple pages of the same type, by testing the same criteria on different URLs of the same kind of page – in this case, a few other comment pages.
Now that we have shown in detail how to extract the title, we'll get the remaining fields without explaining each one at length – they follow the same pattern.
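To make that kind of check quick, here is a small helper of our own (the name check_selector is hypothetical, not something from the original scraper) that reports how many matches a selector gets on each page:

import urllib.request
from bs4 import BeautifulSoup

def check_selector(page_urls, tag_name, attrs):
    #Hypothetical helper: print how many elements match the selector on each page
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    for page_url in page_urls:
        req = urllib.request.Request(page_url, headers={'User-Agent': user_agent})
        page_soup = BeautifulSoup(urllib.request.urlopen(req).read(), 'html.parser')
        print(page_url, len(page_soup.find_all(tag_name, attrs=attrs)))

#Ideally every comment page reports exactly 1 match for the selector we picked
check_selector(urls[:3], 'a', {'data-event-action': 'title'})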

title = soup.find('a',attrs={'class':'title'}).text

Selecting Number of UpVotes

Inspect the upvote count. You should see something similar to the markup below.

<div class="midcol unvoted">
    <div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0"></div>
    <div class="score dislikes" title="88324">88.3k</div>
    <!-- THE DIV BELOW WOULD BE HIGHLIGHTED --> 
    <div class="score unvoted" title="88325">88.3k</div>
    <div class="score likes" title="88326">88.3k</div>
    <div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0"></div>
</div>
Reddit seems to display 3 divs for the score: one shown to a person who has downvoted – `<div class="score dislikes" title="88324">88.3k</div>`, one for a person who hasn't voted – `<div class="score unvoted" title="88325">88.3k</div>`, and one for an upvoter – `<div class="score likes" title="88326">88.3k</div>`. Since our scraper doesn't log in or upvote, we fall into the second category – the unvoted. So let's pick that as our selector: we can look for a `div` with `class` as `score unvoted` and get the text from it. First, let's verify whether there are multiple matches for our criteria.


soup.find_all('div',attrs={'class':'score unvoted'})
[<div class="score unvoted" title="128398">128k</div>]

Our selector seems to be right. Try removing the div criteria, and you'll see there are almost 200 elements that match the class names – we'll leave that to you. Let's go ahead and save this into a variable.


upvotes = soup.find('div',attrs={'class':'score unvoted'}).text
upvotes
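A hedged aside: the visible text is abbreviated (88.3k, 128k and so on). Judging from the markup above, the title attribute of the same div seems to carry the exact count, so a sketch like this could recover the full number:

#Assumption based on the snippets above: the title attribute holds the exact vote count as a plain integer
exact_upvotes = int(soup.find('div',attrs={'class':'score unvoted'})['title'])
exact_upvotes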

Selecting Original Poster

The original poster – the username of the person who submitted the link – is inside an <a> tag that looks like this:

<a href="https://www.reddit.com/user/SmartassRemarks" class="author may-blank id-t2_as4uv">SmartassRemarks</a>

Now you might be tempted to use the classes as is for the selection criteria. id-t2_as4uv looks like a randomly generated value. Let’s go to another comments page ( https://www.reddit.com/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/ ) and see if the classes remain the same.

<a href="https://www.reddit.com/user/Kobobzane" class="author may-blank id-t2_1389xh">Kobobzane</a>

Nope, it doesn't – the id-t2_ class has a different random value here. Let's try selecting <a> tags with author as one of their classes. First, let's check for multiple matches.

len(soup.find_all('a',attrs={'class':'author'}))

len() tells you how many items are present in the list. There are 209 – not a unique selection criterion. Our current criteria select every username present on the page, but we only need the one belonging to the person who submitted the link.

This calls for more isolation. Let’s go a few levels above the a tag and find a way to isolate this a from the rest.
[Screenshot: the main post's details enclosed in the div with the ID siteTable]

All of the details for the link are enclosed in a div with the ID siteTable. Let's go ahead and grab it, then extract the rest from the isolated HTML.

main_post = soup.find('div',attrs={'id':'siteTable'})

Let’s check if there are multiple matches here as well. Note that we are using main_post instead of soup here. We isolated the HTML into main_post and we only need to look inside it.

len(main_post.find_all('a',attrs={'class':'author'}))

Great. Just one. We might as well modify the other fields above to just look inside the main_post.

title = main_post.find('a',attrs={'class':'title'}).text
upvotes = main_post.find('div',attrs={'class':'score unvoted'}).text
original_poster = main_post.find('a',attrs={'class':'author'}).text
print(title)
print(upvotes)
print(original_poster)

Selecting Number of Comments

Let's stick to extracting from main_post. The number of comments is in an a tag like this:

<a href="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/" 
data-inbound-url="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/?utm_content=comments&amp;utm_medium=front&amp;utm_source=reddit&amp;
utm_name=news" data-href-url="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/" data-event-action="comments" class="bylink comments 
may-blank" rel="nofollow">1825 comments</a>

We can select this by using the class or the data-event-action attribute. We won't run into multiple matches, as we are working within the isolated part of the HTML.

comments_count = main_post.find('a',attrs={'class':'bylink comments may-blank'}).text
print(comments_count)
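If you would rather store the count as a number than the string 1825 comments, a small cleanup like this works on the pattern shown above (assuming the text always starts with the number):

#Assumption: the text always looks like "<number> comments", possibly with thousands separators
no_of_comments = int(comments_count.split()[0].replace(',', ''))
print(no_of_comments)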

Selecting comments

There are a lot of comments, and several details we need to extract from each of them. Let's first isolate the comments section to prevent matches with the main post. It lives in a div with the class commentarea.

comment_area = soup.find('div',attrs={'class':'commentarea'})

Let's find all the comments. If you inspect a single comment and move a few tags upwards, you can see that each comment is in a div with the classes entry unvoted. We can use those to grab and isolate each comment.

comments = comment_area.find_all('div', attrs={'class':'entry unvoted'})
len(comments)

Alright, we have grabbed 379 comments. You can print the variable to see each of them. Let's loop through each comment and extract the commenter, the comment text, and the permalink to the comment. We'll skip the inspection and selector hunting, as we have been over that process enough by now.

extracted_comments = []
for comment in comments: 
    commenter = comment.find('a',attrs={'class':'author'}).text
    comment_text = comment.find('div',attrs={'class':'md'}).text
    permalink = comment.find('a',attrs={'class':'bylink'})['href']
    extracted_comments.append({'commenter':commenter,'commenter_profile':commenter_profile,'comment_text':comment_text,'permalink':permalink})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-fc1ffc67af14> in <module>()
      4     comment_text = comment.find('div',attrs={'class':'md'}).text
      5     permalink = comment.find('a',attrs={'class':'bylink'})['href']
----> 6     extracted_comments.append({'commenter':commenter,'commenter_profile':commenter_profile,'comment_text':comment_text,'permalink':permalink})

NameError: name 'commenter_profile' is not defined

Now that's bad. There are actually two problems here. The first is a slip in our own code – we referenced commenter_profile, a variable we never defined, in the dict we append, so let's drop it. The second shows up once that is fixed: some of the elements we grabbed aren't comments at all. Print the comment variable and look through the HTML of the entries we grabbed.

comment

The culprit is the load more comments box that appears after many comments, which also gets picked up by our entry unvoted selector. We need to filter those out when we look for comments. If you inspect the HTML of any real comment, you can see it has a form element in it, while the load more box doesn't. So let's keep only the entries that have a form element, and rewrite the code block above to look like this:

extracted_comments = []
for comment in comments: 
    if comment.find('form'):
        commenter = comment.find('a',attrs={'class':'author'}).text
        comment_text = comment.find('div',attrs={'class':'md'}).text
        permalink = comment.find('a',attrs={'class':'bylink'})['href']
        extracted_comments.append({'commenter':commenter,'comment_text':comment_text,'permalink':permalink})

extracted_comments should now contain a list of dicts like

{'comment_text': 'TL;DR Two immune stimulating agents injected into the tumors of mice eliminated all traces of cancer as well as distant, untreated metastases. 87 of 90 mice were cured of the cancer.\nAlthough the cancer returned in three mice, they again regressed after a second treatment.\nThe researchers saw similar results in mice bearing breast, colon, and melanoma tumors. Treating the first tumor that arose often prevented the occurrence of future tumors.\nOne agent is already approved for human use and the other has been tested in other unrelated clinical trials. A clinical trial was launched in January to test the effect of the treatment in patients with lymphoma.\n',
  'commenter': 'Robobvious',
  'permalink': 'https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/dtk7tie/'}

Permalink of Post

The only field left is the permalink of the post, which we can get directly from the URL stored on the request object.

permalink = request.full_url

Putting it all together into a function

We'll make a function that receives a comment page URL as input, extracts the data from the page, and finally returns a dict with the data.

def parse_comment_page(page_url):
    #Adding a User-Agent String in the request to prevent getting blocked while scraping 
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    request = urllib.request.Request(page_url,headers={'User-Agent': user_agent})
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    main_post = soup.find('div',attrs={'id':'siteTable'})
    title = main_post.find('a',attrs={'class':'title'}).text
    upvotes = main_post.find('div',attrs={'class':'score unvoted'}).text
    original_poster = main_post.find('a',attrs={'class':'author'}).text
    comments_count = main_post.find('a',attrs={'class':'bylink comments may-blank'}).text
    comment_area = soup.find('div',attrs={'class':'commentarea'})
    comments = comment_area.find_all('div', attrs={'class':'entry unvoted'})
    extracted_comments = []
    for comment in comments: 
        if comment.find('form'):
            #We are now looking for any element with a class of author in the comment, instead of just looking for a tags. 
            #We noticed some comments whose authors have deleted their account shows up with a span tag instead of an a 
            commenter = comment.find(attrs={'class':'author'}).text
            comment_text = comment.find('div',attrs={'class':'md'}).text.strip()
            permalink = comment.find('a',attrs={'class':'bylink'})['href']
            extracted_comments.append({'commenter':commenter,'comment_text':comment_text,'permalink':permalink})
    #Lets put the data in a dict, including the permalink of the post (the page URL itself)
    post_data = {
        'title':title,
        'permalink':page_url,
        'no_of_upvotes':upvotes,
        'poster':original_poster,
        'no_of_comments':comments_count,
        'comments':extracted_comments
    }
    return post_data

Let’s test this function out with another URL

parse_comment_page('https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/')

So our parser for the comments page is ready. Let’s get back to navigating.

Navigating Pages

Let’s call this function in the for loop we used to pull the comment page URL. For now, we’ll stick to extracting data from 10 links.

import urllib.request
from bs4 import BeautifulSoup
import json
from time import sleep 
url = "https://www.reddit.com/top/"
#Adding a User-Agent String in the request to prevent getting blocked while scraping 
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First lets get the HTML of the table called site Table where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
extracted_data = []
#Remove[:10] to scrape all links 
for a_tag in comment_a_tags[:10]:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://reddit.com"+url
    print('Extracting data from %s'%url)
    #Lets wait for 5 seconds before each request, so that we don't get blocked while scraping.
    #If you see errors that say HTTPError: HTTP Error 429: Too Many Requests, increase this value by 1 second at a time, till you get the data.
    sleep(5)
    extracted_data.append(parse_comment_page(url))
Extracting data from https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/
Extracting data from https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/
Extracting data from https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/
Extracting data from https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/
Extracting data from https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/
Extracting data from https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/
Extracting data from https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/
Extracting data from https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/
Extracting data from https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/
Extracting data from https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/

The data is quite large, so let's just look at one record we extracted.

print(extracted_data[0])
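If the plain print is hard to read, json.dumps (we imported json at the top) can pretty-print the same record:

print(json.dumps(extracted_data[0], indent=2))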

You can view the output in the gist link here – https://gist.github.com/scrapehero/96840596781006fdb9ba21e4429fc6df

Page Blocked?

If you see errors that say HTTPError: HTTP Error 429: Too Many Requests, it is because the website has detected that you are making too many requests or has identified the script as a bot. You can read more on how to prevent getting blocked while scraping here.
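One simple mitigation is to catch the error and retry after a longer pause. Here is a minimal sketch of our own (not part of the scraper above) that wraps parse_comment_page:

import urllib.error
from time import sleep

def parse_with_retries(page_url, retries=3, wait=10):
    #Hypothetical wrapper: back off and retry when the site answers with HTTP 429
    for attempt in range(retries):
        try:
            return parse_comment_page(page_url)
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < retries - 1:
                #wait longer after every failed attempt before trying again
                sleep(wait * (attempt + 1))
            else:
                raise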

What’s next

This scraper is still incomplete. It can't save data to a file yet – we will leave that as an exercise to the reader (a minimal starting point is sketched below). You can read more about writing files to JSON here: https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/. In the next part of this tutorial, we will show you how to use a web scraping framework to build a complete scraper that can scale up locally, save data to files, and start, stop, and resume execution.
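As that starting point, here is a minimal sketch that writes extracted_data out with json.dump (the filename is our own choice):

#A minimal sketch for the exercise: dump everything we scraped into a JSON file
with open('reddit_top_posts.json', 'w') as f:
    json.dump(extracted_data, f, indent=2)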

Next:
How to scrape Alibaba.com product data using Scrapy Web Scraping Framework


Responses

Joshua T Rees June 14, 2018

Pretty same comment as Part 2, appreciate the effort and examples. But due to reddit’s popularity these guides are deprecated and require some updates either using PRAW wrapper, old.reddit rather than reddit.com, and including information about user agents to avoid error 429 (requesting overloading).


    ScrapeHero June 14, 2018

    Thanks Joshua.
    We will get these updated as soon as possible.


jay November 9, 2018

I’m not clear what’s being typed into the command line and what’s being typed into the file.

