In this part of our Web Scraping – Beginners Guide tutorial series, we’ll show you how to scrape Reddit comments, navigate to comment pages, and parse and extract data from them. Let’s continue from where we left off in the previous post – Web Scraping Guide: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup.
Reddit (old.reddit.com) is used here as it is a popular website. If you are looking for Reddit data, use the Reddit API instead.
Navigating Pages using a simple Reddit Crawler
Now that we have extracted the titles and URLs of the top links, let’s go further. Instead of grabbing just the title and URL of each link, we’ll go to the comments page for each link. Like before, right-click on any comments link and choose ‘Inspect Element’. The Chrome developer toolbar should pop up, highlighting the `a` tag that contains the URL to the comments page.

You can see that it has a class attribute: `class="bylink comments may-blank"`. But before we decide, let’s make sure the `a` tags for the comments links of the other posts also have the same class.
a href="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-inbound-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/?
utm_content=comments&utm_medium=browse&utm_source=reddit&utm_name=frontpage" data-href-
url="/r/mildlyinteresting/comments/7ub9ax/local_mexican_restaurant_used_to_be_a_chinese/" data-event-action="comments"
class="bylink comments may-blank" rel="nofollow">2876 comments/a
If the comment links didn’t share at least one common class, we would need to find another attribute to use for selecting these tags. Fortunately, in this case, all the comments links share three common classes – bylink, comments, and may-blank.
Let’s modify the code from the previous web scraper. We are removing everything after isolating the `div` with the ID `siteTable`.
import urllib.request
from bs4 import BeautifulSoup
import json
url = "https://old.reddit.com/top/"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First let's get the HTML of the div with the ID siteTable, where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
Let’s select all the `a` tags with the class attribute equal to `bylink comments may-blank` using `find_all` on the HTML we isolated into `main_table`.
comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
Now that we have all the `a` tags with comment links, let’s extract the `href` attribute from them. Some of the values are absolute links and some are relative. Let’s clean them by prepending https://www.reddit.com to the relative URLs to make them valid.
#Blank list to store the urls as they are extracted
urls = []
for a_tag in comment_a_tags:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://www.reddit.com" + url
    urls.append(url)
urls
['https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/',
'https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/',
'https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/',
'https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/',
'https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/',
'https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/',
'https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/',
'https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/',
'https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/',
'https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/',
'https://www.reddit.com/r/pics/comments/8784r6/tunnel_vision/',
'https://www.reddit.com/r/worldnews/comments/8787jz/prime_minister_shinzo_abe_wants_to_repeal_a/',
'https://www.reddit.com/r/wholesomememes/comments/878zkf/ryan_being_an_awesome_person/',
'https://www.reddit.com/r/todayilearned/comments/87brza/til_theres_a_type_of_honey_called_mad_honey_which/',
'https://www.reddit.com/r/Tinder/comments/879g0h/her_bio_said_if_you_peaked_in_high_school_stay/',
'https://www.reddit.com/r/news/comments/8791fr/spy_poisoning_us_to_expel_60_russian_diplomats/',
'https://www.reddit.com/r/news/comments/87adao/facebook_confirms_it_records_call_history_stoking/',
'https://www.reddit.com/r/OldSchoolCool/comments/878prh/johnny_cash_playing_at_folsom_prison_1968/',
'https://www.reddit.com/r/MadeMeSmile/comments/877qz8/dads_wise_words/',
'https://www.reddit.com/r/gaming/comments/87bk5l/this_is_where_the_real_fun_begins/',
'https://www.reddit.com/r/gaming/comments/877jq6/unreal_engine_4_showcase_with_andy_serkis/',
'https://www.reddit.com/r/worldnews/comments/87a4la/canada_is_all_set_to_legalize_marijuana_by_the/',
'https://www.reddit.com/r/iamverysmart/comments/878mki/tech_wannabe_shutdown_by_the_codes_author/',
'https://www.reddit.com/r/CrappyDesign/comments/8789y7/apparently_incest_is_perfectly_fine/',
'https://www.reddit.com/r/funny/comments/87c7p3/fantastic_mr_fox_snatches_wallet/']
We’ve extracted and cleaned the URLs to the comments page.
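As an aside, if you’d rather not check for the prefix yourself, the standard library’s `urllib.parse.urljoin` handles both cases – it resolves relative URLs against a base and leaves absolute URLs untouched. A minimal, equivalent sketch:

from urllib.parse import urljoin

#urljoin leaves absolute URLs untouched and resolves relative ones against the base
urls = [urljoin("https://www.reddit.com/", a_tag['href']) for a_tag in comment_a_tags]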
Before we go into downloading all the URLs, let’s make a function to extract data from each web page. We’ll first download one comment page – https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/, extract the data from it and then turn it into a function.
Let’s take a look at a comments page and find out where the data we are looking for is present in the HTML. Here are the fields we’ll extract from the comments page:
- Title
- Permalink
- Original Poster
- Number of Upvotes
- Number of Comments
- Comments (all visible ones)
  - commenter
  - comment text
  - permalink to comment
Finding the data in the HTML
To find where each field we need is located in the HTML, let’s do what we always do – right-click the detail and inspect the element. Let’s download and get the HTML body for one URL first. We will later add this into the for loop above.
Download the Page Content
#Reusing the user_agent string we defined earlier to avoid getting blocked
request = urllib.request.Request('https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/',headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
Defining selection criteria
Here we will select the elements we want to extract for this web scraper.
Selecting the Title
Right-click on the post title and inspect the element. The `a` tag enclosing the title looks like this:
a class="title may-blank outbound" data-event-action="title" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html"
tabindex="1" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html"
data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&
amp;token=AQAAZyBzWjNbF5yCNEDFBuOEFiTElADct-PZ77Hq713JVTrIro5h&app_name=reddit.com" data-outbound-expiration="1517494375000" rel="">Cancer ‘vaccine’ eliminates
tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.
It looks like we can use the `class` attribute to select this like before. You could also use the `data-event-action` attribute to get the title.
Let’s test it out
soup.find('a',attrs={'class':'title may-blank outbound'}).text
soup.find('a',attrs={'data-event-action':'title'}).text
Let’s check if there are multiple elements that match the selector we have used. Ideally, we need the selector to match only one element – the one we need to extract. The `find` method of BeautifulSoup returns the first element that matches the selector, so the selection criteria we choose must either be unique to the element (preferred) or match it first.
The `find_all` method of BeautifulSoup selects all the elements that match the selection criteria. We used `find_all` above to get the `a` elements that had links to the comments pages.
Let’s take a look at each of these methods and the elements they would return if we only used the class `title` to get the title.
soup.find('a',attrs={'class':'title'})
Okay, the `find` method picked the right element. So let’s just check if there are multiple matches by using `find_all`:
soup.find_all('a',attrs={'class':'title'})
Looks like there aren’t any other `a` tags which have `title` as a class. Let’s check if there are any tags other than `a` tags with the same class. We will leave the `name` argument (the first parameter) blank, so that it will match all tags – not just `a` tags.
soup.find_all(attrs={'class':'title'})
[<span class="selected title">my subreddits</span>,
<div class="title"><h1>MODERATORS</h1></div>,
<p class="title"><a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a> <span class="domain">(<a href="/domain/med.stanford.edu/">med.stanford.edu</a>)</span></p>,
<a class="title may-blank outbound" data-event-action="title" data-href-url="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" data-outbound-expiration="1522243977000" data-outbound-url="https://out.reddit.com/t3_7ug4pb?url=https%3A%2F%2Fmed.stanford.edu%2Fnews%2Fall-news%2F2018%2F01%2Fcancer-vaccine-eliminates-tumors-in-mice.html&token=AQAAiZm7WrdT9vPQgZ6vBmiIlKk93kvhkKS2ZV3jkg2ytvYJ3LUm&app_name=reddit.com" href="https://med.stanford.edu/news/all-news/2018/01/cancer-vaccine-eliminates-tumors-in-mice.html" rel="" tabindex="1">Cancer ‘vaccine’ eliminates tumors in mice - 90 of 90 mice cured of cancers with lymphoma, similar results observed in colon, breast, and melanoma cancers.</a>,
<span class="title">top 200 comments</span>,
<li class="flat-vert title">about</li>,
<li class="flat-vert title">help</li>,
<li class="flat-vert title">apps & tools</li>,
<li class="flat-vert title"><3</li>]
There we go – 9 elements have `title` as a class attribute. We are lucky that there is only one `a` tag among them. The reason we showed this is to reinforce that you should always try to define selection criteria that are unique to the element and will not match anything other than what we are looking for.
To get the title, we can use:
soup.find('a',attrs={'class':'title'}).text
or
soup.find('a',attrs={'class':'title may-blank outbound'}).text
or
soup.find('a',attrs={'data-event-action':'title'}).text
The lesson here is that there are always multiple ways to select the same element, and there is no single correct way. Just pick one that works across multiple pages of the same type by testing the same criteria on different URLs of the same kind of page – in this case, a few other comment pages.
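For example, BeautifulSoup also accepts CSS selectors via `select_one`, which would pick the same title element here (a sketch based on the HTML shown above):

#CSS selector equivalent of the find() calls above
#p.title > a.title matches an a tag with class 'title' directly inside a p tag with class 'title'
soup.select_one('p.title > a.title').text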
Now that we have shown in detail how to extract the title, we’ll get a few other fields without explaining as much. They’re easy to figure out.
title = soup.find('a',attrs={'class':'title'}).text
Selecting Number of UpVotes
Inspect the upvote count. You should see something similar to the HTML below:
<div class="midcol unvoted">
<div class="arrow up login-required access-required" data-event-action="upvote" role="button" aria-label="upvote" tabindex="0"></div>
<div class="score dislikes" title="88324">88.3k</div>
<!-- THE DIV BELOW WOULD BE HIGHLIGHTED -->
<div class="score unvoted" title="88325">88.3k</div>
<div class="score likes" title="88326">88.3k</div>
<div class="arrow down login-required access-required" data-event-action="downvote" role="button" aria-label="downvote" tabindex="0"></div>
</div>
Reddit seems to display 3 divs: one for a person who has downvoted – `<div class="score dislikes" title="88324">88.3k</div>`, one for a person who hasn’t voted – `<div class="score unvoted" title="88325">88.3k</div>`, and one for an upvoter – `<div class="score likes" title="88326">88.3k</div>`. Since our scraper doesn’t log in or upvote, we fall into the second category – the unvoted. So let’s pick that as our selector. We can look for a `div` with `class` as `score unvoted` and get the text from it. First, let’s verify whether there are multiple matches for our criteria.
soup.find_all('div',attrs={'class':'score unvoted'})
[<div class="score unvoted" title="128398">128k</div>]
Our selector seems to be right. Try removing the `div` criteria, and you’ll see there are almost 200 elements that match the class names. We’ll leave that to you. Let’s go ahead and save this into a variable
upvotes = soup.find('div',attrs={'class':'score unvoted'}).text
upvotes
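Note that `.text` gives you the abbreviated count (`88.3k` in the HTML above). If you want the exact integer, the `title` attribute of the same `div` holds it, so you could do this instead:

#The title attribute holds the exact count (e.g. '88325') while the text is rounded
exact_upvotes = int(soup.find('div',attrs={'class':'score unvoted'})['title'])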
Selecting Original Poster
The original poster – the username of the person who submitted the link – is inside an `<a>` tag that looks like this:
<a href="https://www.reddit.com/user/SmartassRemarks" class="author may-blank id-t2_as4uv">SmartassRemarks</a>
Now you might be tempted to use the classes as is for the selection criteria. `id-t2_as4uv` looks like a randomly generated value. Let’s go to another comments page ( https://www.reddit.com/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/ ) and see if the classes remain the same.
<a href="https://www.reddit.com/user/Kobobzane" class="author may-blank id-t2_1389xh">Kobobzane</a>
Nope, it doesn’t. The `id-t2_` class has a different random value here. Let’s try selecting `<a>` tags with `author` as a class. First, let’s check for multiple matches.
len(soup.find_all('a',attrs={'class':'author'}))
`len()` tells you how many items are present in the list. There are 209 – not a unique selection criterion. Our current criteria select every username present on the page; we just need the username of the person who submitted the link.
This calls for more isolation. Let’s go a few levels above the `a` tag and find a way to isolate it from the rest.

All of the details for the link are enclosed in a `div` with the ID `siteTable`. Let’s go ahead and grab it, then extract the rest from the isolated HTML.
main_post = soup.find('div',attrs={'id':'siteTable'})
Let’s check if there are multiple matches here as well. Note that we are using `main_post` instead of `soup` here. We isolated the HTML into `main_post` and we only need to look inside it.
len(main_post.find_all('a',attrs={'class':'author'}))
Great. Just one. We might as well modify the other fields above to just look inside the main_post.
title = main_post.find('a',attrs={'class':'title'}).text
upvotes = main_post.find('div',attrs={'class':'score unvoted'}).text
original_poster = main_post.find('a',attrs={'class':'author'}).text
print(title)
print(upvotes)
print(original_poster)
Let’s stick to extracting from `main_post`. The number of comments is in an `a` tag that looks like this:
<a href="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/"
data-inbound-url="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/?utm_content=comments&utm_medium=front&utm_source=reddit&
utm_name=news" data-href-url="/r/news/comments/7ud052/san_francisco_plans_to_wipe_out_thousands_of_pot/" data-event-action="comments" class="bylink comments
may-blank" rel="nofollow">1825 comments</a>
We can select this by using the `class` or the `data-event-action` attribute. We won’t run into multiple matches as we are working within the isolated part of the HTML.
comments_count = main_post.find('a',attrs={'class':'bylink comments may-blank'}).text
print(comments_count)
There are many comments, and there are details we need to extract from each one. So let’s first isolate the comments section to prevent matches with the main post. The comments are in a `div` with the class `commentarea`.
comment_area = soup.find('div',attrs={'class':'commentarea'})
Let’s find all the comments. If you inspect a single comment and move a few tags upwards, you can see that each comment is in a `div` with the classes `entry unvoted`. We can use that to grab and isolate each comment.
comments = comment_area.find_all('div', attrs={'class':'entry unvoted'})
len(comments)
Alright, we have grabbed 379 comments. You can print the variable to see each of them. Let’s loop through each comment and extract the commenter, the comment text, and the permalink to the comment. We’ll skip the inspection and selection steps, as we have been over them for a while now.
extracted_comments = []
for comment in comments:
    commenter = comment.find('a',attrs={'class':'author'}).text
    comment_text = comment.find('div',attrs={'class':'md'}).text
    permalink = comment.find('a',attrs={'class':'bylink'})['href']
    extracted_comments.append({'commenter':commenter,'comment_text':comment_text,'permalink':permalink})
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-fc1ffc67af14> in <module>()
      1 extracted_comments = []
      2 for comment in comments:
----> 3     commenter = comment.find('a',attrs={'class':'author'}).text
      4     comment_text = comment.find('div',attrs={'class':'md'}).text
      5     permalink = comment.find('a',attrs={'class':'bylink'})['href']

AttributeError: 'NoneType' object has no attribute 'text'
Now that’s bad. Let’s find out why we got this error. Print the comment variable and look at the HTML.
comment
We are getting this error because, after many comments, there is a “load more comments” box. It doesn’t contain an author or a comment body, so `find` returns `None` for it, and calling `.text` on `None` raises the error. We need to filter those boxes out when we look for comments. If you inspect the HTML of any real comment, you can see it has a `form` element in it, while the “load more comments” boxes don’t. So let’s filter out anything that doesn’t have a `form` element in it. Let’s rewrite the code block above to look like this:
extracted_comments = []
for comment in comments:
    if comment.find('form'):
        commenter = comment.find('a',attrs={'class':'author'}).text
        comment_text = comment.find('div',attrs={'class':'md'}).text
        permalink = comment.find('a',attrs={'class':'bylink'})['href']
        extracted_comments.append({'commenter':commenter,'comment_text':comment_text,'permalink':permalink})
`extracted_comments` should now contain a list of dicts like
{'comment_text': 'TL;DR Two immune stimulating agents injected into the tumors of mice eliminated all traces of cancer as well as distant, untreated metastases. 87 of 90 mice were cured of the cancer.\nAlthough the cancer returned in three mice, they again regressed after a second treatment.\nThe researchers saw similar results in mice bearing breast, colon, and melanoma tumors. Treating the first tumor that arose often prevented the occurrence of future tumors.\nOne agent is already approved for human use and the other has been tested in other unrelated clinical trials. A clinical trial was launched in January to test the effect of the treatment in patients with lymphoma.\n',
'commenter': 'Robobvious',
'permalink': 'https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/dtk7tie/'}
Permalink of Post
The only field left is the permalink of the post, which we can get directly from the URL in the request object.
permalink = request.full_url
Putting it all together into a function
We’ll make a function that receives a comment page URL as an input and then parse and extract the data from it. This will return a dict with the data.
def parse_comment_page(page_url):
    #Adding a User-Agent String in the request to prevent getting blocked while scraping
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    request = urllib.request.Request(page_url,headers={'User-Agent': user_agent})
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html,'html.parser')
    main_post = soup.find('div',attrs={'id':'siteTable'})
    title = main_post.find('a',attrs={'class':'title'}).text
    upvotes = main_post.find('div',attrs={'class':'score unvoted'}).text
    original_poster = main_post.find('a',attrs={'class':'author'}).text
    comments_count = main_post.find('a',attrs={'class':'bylink comments may-blank'}).text
    comment_area = soup.find('div',attrs={'class':'commentarea'})
    comments = comment_area.find_all('div', attrs={'class':'entry unvoted'})
    extracted_comments = []
    for comment in comments:
        if comment.find('form'):
            #We are now looking for any element with a class of author in the comment, instead of just looking for a tags.
            #We noticed that comments whose authors have deleted their accounts show up with a span tag instead of an a tag
            commenter = comment.find(attrs={'class':'author'}).text
            comment_text = comment.find('div',attrs={'class':'md'}).text.strip()
            permalink = comment.find('a',attrs={'class':'bylink'})['href']
            extracted_comments.append({'commenter':commenter,'comment_text':comment_text,'permalink':permalink})
    #Let's put the data in a dict
    post_data = {
        'title':title,
        #Permalink of the post, taken from the request object as shown above
        'permalink':request.full_url,
        'no_of_upvotes':upvotes,
        'poster':original_poster,
        'no_of_comments':comments_count,
        'comments':extracted_comments
    }
    return post_data
Let’s test this function out with another URL
parse_comment_page('https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/')
So our parser for the comments page is ready. Let’s get back to navigating.
Navigating Pages
Let’s call this function from the for loop we used earlier to pull the comment page URLs. For now, we’ll stick to extracting data from 10 links.
import urllib.request
from bs4 import BeautifulSoup
import json
from time import sleep
url = "https://www.reddit.com/top/"
#Adding a User-Agent String in the request to prevent getting blocked while scraping
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = urllib.request.Request(url,headers={'User-Agent': user_agent})
html = urllib.request.urlopen(request).read()
soup = BeautifulSoup(html,'html.parser')
#First let's get the HTML of the div with the ID siteTable, where all the links are displayed
main_table = soup.find("div",attrs={'id':'siteTable'})
comment_a_tags = main_table.find_all('a',attrs={'class':'bylink comments may-blank'})
extracted_data = []
#Remove [:10] to scrape all links
for a_tag in comment_a_tags[:10]:
    url = a_tag['href']
    if not url.startswith('http'):
        url = "https://www.reddit.com" + url
    print('Extracting data from %s'%url)
    extracted_data.append(parse_comment_page(url))
    #Let's wait for 5 seconds between requests so that we don't get blocked while scraping.
    #If you see errors that say HTTPError: HTTP Error 429: Too Many Requests, increase this value by 1 second till you get the data.
    sleep(5)
Extracting data from https://www.reddit.com/r/pics/comments/87bb1m/an_iranian_teacher_visits_his_cancerstricken/
Extracting data from https://www.reddit.com/r/gifs/comments/878eka/a_textbook_example_of_how_to_cross_a_road_safely/
Extracting data from https://www.reddit.com/r/funny/comments/87agu2/special_needs_teachers_put_this_up_today/
Extracting data from https://www.reddit.com/r/gifs/comments/87a9dz/dog_thrown_from_plane_into_the_abyss_below/
Extracting data from https://www.reddit.com/r/aww/comments/879b4q/this_is_the_life_we_should_all_aspire_to/
Extracting data from https://www.reddit.com/r/worldnews/comments/87bbio/facebook_has_lost_100_billion_in_10_days_and_now/
Extracting data from https://www.reddit.com/r/todayilearned/comments/878ei8/til_in_1994_the_kkk_applied_to_sponsor_a_section/
Extracting data from https://www.reddit.com/r/aww/comments/87bj4l/dad_get_the_ball_plz/
Extracting data from https://www.reddit.com/r/gifs/comments/87cw3o/new_silicon_valley_intro_throws_shade_at_facebook/
Extracting data from https://www.reddit.com/r/mildlyinteresting/comments/879wh7/the_wear_indicators_on_my_tires_show_a_percentage/
The data is quite large, so let’s just look at one of the records we extracted.
print(extracted_data[0])
You can view the output in the gist link here – https://gist.github.com/scrapehero/96840596781006fdb9ba21e4429fc6df
Page Blocked?
If you see errors that say `HTTPError: HTTP Error 429: Too Many Requests`, it is because the website has detected that you are making too many requests or has identified the script as a bot. You can read more on how to prevent getting blacklisted while scraping here.
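A common mitigation is to back off and retry when a 429 shows up. Below is a minimal sketch using only the standard library – the retry count and delays are arbitrary values we picked for illustration, not something Reddit documents:

import time
import urllib.request
from urllib.error import HTTPError

def fetch_with_backoff(url, user_agent, retries=3, delay=5):
    #Retry the request with an increasing wait whenever we get HTTP 429
    for attempt in range(retries):
        request = urllib.request.Request(url, headers={'User-Agent': user_agent})
        try:
            return urllib.request.urlopen(request).read()
        except HTTPError as e:
            if e.code == 429 and attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  #wait 5s, then 10s, then give up
            else:
                raise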
What’s next
This web scraper is still incomplete – it doesn’t save the data to a file yet. We will leave that as an exercise for the reader. You can read more about writing files to JSON here: https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/.
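If you want a head start on that exercise, here is a minimal sketch that writes `extracted_data` (the list of dicts we built above) to a file – the filename is just an example:

#Write the extracted posts to a JSON file called data.json (name is arbitrary)
with open('data.json', 'w', encoding='utf-8') as outfile:
    json.dump(extracted_data, outfile, indent=4, ensure_ascii=False)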
In the next part of this web scraping tutorial, we will show you how to use a web scraping framework to build a complete scraper that can scale up locally, save data to files, and start, stop, and resume execution.
Next Web Scraping Tutorial:
How to scrape Alibaba.com product data using Scrapy Web Scraping Framework