Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data

In this part of our Web Scraping – Beginners Guide series we’ll show you how to navigate web pages, and parse and extract data from them. Let’s continue from where we left off in the previous post – Beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup. If you are new to this series, we recommend that you start from What is web scraping: Part 1 – Beginner’s guide.

Navigating Pages using a simple Crawler

Now that we have extracted the titles and URLs of the top links, let’s go further. Instead of grabbing just the title and URL of each link, we’ll go to the comments page for each link. Like before, right click on any comments link and choose Inspect Element. The Chrome developer toolbar should pop up, highlighting the a tag that contains the URL of the comments page.
[Screenshot: inspecting a comments link on the Reddit top page]

You can see that it has a class attribute, class='bylink comments may-blank'. But before we decide, let’s make sure the a tags to the comments for the other links also have the same class.

If the comment links don’t all share at least one common class, we would need to find another attribute to use for selecting these tags. Fortunately, in this case, all the comments links have three common classes – bylink, comments, and may-blank.
Let’s modify the code from the previous scraper, keeping everything up to the point where we isolated the div called siteTable and removing the rest.
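Here is a minimal sketch of that starting point – the subreddit URL, the User-Agent header, and siteTable being the div’s id (as on old Reddit) are assumptions on our part:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the top page; the exact URL and headers from Part 2 may differ
url = "https://www.reddit.com/r/worldnews/top/"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

# Isolate the div that holds the list of posts
main_table = soup.find("div", attrs={"id": "siteTable"})
```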

Let’s select all the a tags with the class attribute equal to bylink comments may-blank, using find_all on the HTML we isolated into main_table.
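Something like this – note that matching the full class string this way is order-sensitive; a CSS selector such as main_table.select("a.bylink.comments.may-blank") is an order-insensitive alternative:

```python
# Matches a tags whose class attribute is exactly "bylink comments may-blank"
comment_links = main_table.find_all("a", attrs={"class": "bylink comments may-blank"})
print(len(comment_links))
```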

Now that we have all the a tags with comment links, let’s extract the href attribute from each of them. Some of the values are absolute links and some are relative, so let’s clean them up by prepending https://www.reddit.com to the relative URLs (they already start with a slash) to make them valid.
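A sketch of that cleanup step – checking for an http prefix is our own way of telling absolute links from relative ones:

```python
# Build a list of absolute comment-page URLs
urls = []
for link in comment_links:
    href = link.get("href")
    if not href.startswith("http"):
        # Relative hrefs such as /r/worldnews/comments/... need the domain prepended
        href = "https://www.reddit.com" + href
    urls.append(href)
print(urls[:3])
```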

We’ve extracted and cleaned the URLs to the comments page.

Before we download all the URLs, let’s make a function to extract data from each page. We’ll first download one comment page https://www.reddit.com/r/worldnews/comments/7ug4pb/cancer_vaccine_eliminates_tumors_in_mice_90_of_90/, extract the data from it, and then turn that code into a function.

Let’s take a look at a comments page and find out where the data we are looking for sits in the HTML.

What data are we extracting?

[Screenshot: the fields to extract, highlighted on a Reddit comments page]

We’ll extract these fields from the comments page

  • Title
  • Permalink
  • Original Poster
  • Number of Upvotes
  • Number of Comments
  • Comments (all visible ones)
    • commenter
    • comment text
    • permalink to comment

Finding the data in the HTML

To find where each field we need sits in the HTML, let’s do what we always do – right click the detail and inspect the element. Let’s download and get the HTML body for one URL first. We will later add this to the for loop above.

Download the Page Content
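
A minimal sketch, assuming requests and BeautifulSoup as in the earlier parts (the User-Agent header is our addition, to reduce the chance of Reddit rejecting the default one):

```python
import requests
from bs4 import BeautifulSoup

url = ("https://www.reddit.com/r/worldnews/comments/7ug4pb/"
       "cancer_vaccine_eliminates_tumors_in_mice_90_of_90/")
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# Parse the downloaded page once; the snippets below reuse this page_html object
page_html = BeautifulSoup(response.text, "html.parser")
```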

Defining selection criteria

Selecting the Title

Right click on the post title and inspect the element. The a tag enclosing the title carries a class attribute (as before) and also a data-event-action attribute, and either one could be used to select it.
Let’s test it out.
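A quick test, reusing the page_html soup object from the download step (the markup in the comment is a reconstruction of old Reddit’s title anchor, not copied from the page):

```python
# The title anchor looks roughly like:
# <a class="title may-blank" data-event-action="title" href="...">post title</a>
title_tag = page_html.find("a", attrs={"class": "title"})
print(title_tag.text)
```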
Let’s check whether there are multiple elements that match the selector we have used. Ideally, the selector should match only one element – the one we need to extract. The find method of BeautifulSoup returns the first element that matches the selector, so the selection criteria we choose must be unique to the element (preferred) or at least match it first.
The find_all method of BeautifulSoup selects all the elements that match the selection criteria. We used find_all above to get the a elements that link to the comments pages.
Let’s take a look at each of these methods and what they would return if we used only the class to get the title.
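First, find:

```python
# find returns only the first a tag whose class list contains "title"
first_match = page_html.find("a", attrs={"class": "title"})
print(first_match)
```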
Okay, the find method picked the right element. Now let’s check whether there are multiple matches by using find_all.
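```python
# find_all returns a list of every matching element;
# a unique selector should give us exactly one
matches = page_html.find_all("a", attrs={"class": "title"})
print(len(matches))
```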
Looks like there aren’t any other a tags with title as a class. Let’s also check whether any tags other than a carry the same class. We will leave the name argument (the first parameter) blank so that it matches all tags – not just a tags.
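```python
# With no tag name given, any tag whose class list contains "title" matches
all_title_tags = page_html.find_all(attrs={"class": "title"})
print(len(all_title_tags))
```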
There we go – 9 elements have title as a class attribute. We are lucky that only one of them is an a tag. We showed this to reinforce that you should always try to define a unique selection criterion – one that will not match anything other than what you are looking for.
To get the title, we can therefore use any one of several equivalent selectors.
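For instance (a sketch – the original post showed similar one-liners):

```python
# By class
title = page_html.find("a", attrs={"class": "title"}).text
# By the data-event-action attribute
title = page_html.find("a", attrs={"data-event-action": "title"}).text
# Or by both, for an extra-strict match
title = page_html.find("a", attrs={"class": "title", "data-event-action": "title"}).text
```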
The lesson here is that there are always multiple ways to select the same element, and there is no single correct way. Pick one that works across multiple pages of the same type by testing the criteria on different URLs of the same kind of page – in this case, a few other comment pages.
Now that we have shown in detail how to extract the title, we’ll grab a few other fields without explaining as much. They are easy to figure out.

Selecting the Number of Upvotes

Inspect the upvote count. You should see something similar to this:

Reddit seems to display three divs for the score: one for a user who has downvoted – <div class="score dislikes" title="88324">88.3k</div>, one for a user who hasn’t voted – <div class="score unvoted" title="88325">88.3k</div>, and one for a user who has upvoted – <div class="score likes" title="88326">88.3k</div>. Since our scraper doesn’t log in or upvote, we fall into the second category – unvoted. So let’s pick that as our selector: a div with class score unvoted, from which we can take the text. First, let’s verify whether there are multiple matches for our criteria.
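A sketch of that check, again reusing page_html (as above, the class string score unvoted is matched order-sensitively):

```python
# Verify the selector is unique before relying on it
score_divs = page_html.find_all("div", attrs={"class": "score unvoted"})
print(len(score_divs))

# The visible text is the rounded count; the title attribute holds the exact number
score = page_html.find("div", attrs={"class": "score unvoted"})
print(score.text, score.get("title"))
```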
