XPath Cheat Sheet for Web Scraping

XPath (XML Path Language) is a query language for navigating XML and HTML documents. In web scraping, it is used to locate and extract data from a page. The following is a concise XPath cheat sheet for web scraping.

XPath Basics

<div class="main">
    <h1>
        Bulbasaur
    </h1>
    <div id="description">
        <p>
            Bulbasaur can be seen napping in bright sunlight.
        </p>
    </div>
</div>

Accessing the data from HTML using XPath is an easy process. Let’s try to find <h1> from the above HTML using XPath.

The <h1> is located inside a <div> element with class name "main". The XPath for accessing the element will be:

//div[@class="main"]/h1
Output:
<h1> Bulbasaur </h1>

To access the text inside the <h1>:

//div[@class="main"]/h1/text()
Output:
Bulbasaur
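
One way to try the expressions above from Python is the standard library's xml.etree.ElementTree module. The sketch below rests on a few assumptions: the sample markup is wrapped in a <body> element (added here so that the <div> is not the document root), and because ElementTree supports only a subset of XPath, the path is written relative to the root (leading dot) and text() is replaced by the element's .text attribute. Libraries with full XPath 1.0 support, such as lxml, can evaluate the expressions as written.

```python
import xml.etree.ElementTree as ET

# Sample markup from above, wrapped in <body> (an assumption for this sketch).
HTML = """
<body>
  <div class="main">
    <h1>
        Bulbasaur
    </h1>
    <div id="description">
      <p>
        Bulbasaur can be seen napping in bright sunlight.
      </p>
    </div>
  </div>
</body>
"""

root = ET.fromstring(HTML.strip())

# Equivalent of //div[@class="main"]/h1, written relative to the root element.
h1 = root.find(".//div[@class='main']/h1")

# Equivalent of .../h1/text(): read .text and trim the surrounding whitespace.
title = h1.text.strip()
print(title)  # Bulbasaur
```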

XPath Syntax

XPath uses path expressions to find nodes in XML or HTML documents. A node is selected by following a path made of steps, each of which moves along an axis. The following is a breakdown of the XPath used above.

//div[@class="main"]/h1

[Figure: breakdown of the XPath expression into its parts]

Axes – Axes locate nodes relative to the current node. They make it possible to write XPath expressions that parse data from complex, nested documents.

Step – A step (or path segment) names the node to navigate to.

Here is a list of the most commonly used expressions:

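| Expression | Description |
|---|---|
| // | Selects matching nodes at any depth below the current node (descendants) |
| / | Selects direct children of the current node; at the start of a path, starts from the document root |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |
| nodename | Selects child nodes with the given name |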
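
The expressions in this list can be tried out with Python's standard xml.etree.ElementTree module. This is a minimal sketch, not a full XPath engine: ElementTree evaluates paths relative to the element it is called on and accepts @ only inside a predicate. The small document and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<body>
  <div class="main">
    <h1>Bulbasaur</h1>
    <div id="description">
      <p>Bulbasaur can be seen napping in bright sunlight.</p>
    </div>
  </div>
</body>
"""

root = ET.fromstring(DOC.strip())       # root is the <body> element

current = root.find(".")                # .   the current node (<body>)
children = root.findall("./div")        # /   direct <div> children only
descendants = root.findall(".//div")    # //  <div> nodes at any depth
parent = root.find(".//p/..")           # ..  parent of the <p> node
with_id = root.find(".//div[@id]")      # @   first <div> that has an id attribute

print(current.tag)                      # body
print(len(children), len(descendants))  # 1 2
print(parent.get("id"))                 # description
print(with_id.get("id"))                # description
```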

XPath Locators Cheat Sheet

A larger sample HTML is given below. It will be used for the following examples.

<html>
<head></head>
<body>
    <div class="main">
        <h1>Bulbasaur</h1>
        <img
            src="..."
            alt="Bulbasaur"
            width="100"
            height="100"
        ><br>
        <label for="price">Price: </label>
        <span id="price">63.00</span> <br>
        <label for="sku">SKU: </label>
        <span id="sku">8783</span><br> <br>
        <strong>Description</strong>
        <div class="description">
            <p>
                Bulbasaur can be seen napping in bright sunlight.
                There is a seed on its back. By soaking up the
                sun's rays, the seed grows progressively larger.
            </p>
        </div>

        <h4>Similar products</h4>
        <ul id="similar-products">
            <li>
                <a href="...">Charmander</a>
            </li>
            <li>
                <a href="...">Venusaur</a>
            </li>
            <li>
                <a href="...">Ivysaur</a>
            </li>
        </ul>
    </div>
</body>
</html>

How to select Nodes?

| XPath | Description | Output |
|---|---|---|
| //h1 | Selects all <h1> nodes | <h1>Bulbasaur</h1> |
| /html/body | Absolute path to the <body> node | <body>…</body> |
| //div/span | Selects all <span> nodes that are children of a <div> node | <span id="price">63.00</span> |
| | | <span id="sku">8783</span> |
| /html/body/div/h1 | Absolute path to the <h1> node that is a child of the <div> node | <h1>Bulbasaur</h1> |
| //div//a | Selects all <a> nodes that are descendants of <div> nodes | <a href="…">Charmander</a> |
| | | <a href="…">Venusaur</a> |
| | | <a href="…">Ivysaur</a> |
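
The node selectors in this section can be checked with a short Python sketch. It assumes a trimmed, well-formed copy of the sample page (ElementTree parses XML, so the unclosed <img> and <br> tags are left out) and rewrites each absolute path relative to the <html> root, because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample page (an assumption for this sketch).
PAGE = """
<html>
<body>
  <div class="main">
    <h1>Bulbasaur</h1>
    <span id="price">63.00</span>
    <span id="sku">8783</span>
    <ul id="similar-products">
      <li><a href="...">Charmander</a></li>
      <li><a href="...">Venusaur</a></li>
      <li><a href="...">Ivysaur</a></li>
    </ul>
  </div>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())                          # root is <html>

headings = [e.text for e in root.findall(".//h1")]          # //h1
body = root.find("./body")                                  # /html/body
title = root.find("./body/div/h1").text                     # /html/body/div/h1
spans = [e.get("id") for e in root.findall(".//div/span")]  # //div/span
links = [e.text for e in root.findall(".//div//a")]         # //div//a

print(headings)  # ['Bulbasaur']
print(spans)     # ['price', 'sku']
print(links)     # ['Charmander', 'Venusaur', 'Ivysaur']
```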

How to select Attributes?

| XPath | Description | Output |
|---|---|---|
| //@href | Selects the href attribute of every node that has one | … (once for each of the three <a> nodes) |
| //div/@class | Selects the class attribute of all <div> nodes | main |
| | | description |
| //a/text() | Selects the text content of all <a> nodes | Charmander |
| | | Venusaur |
| | | Ivysaur |
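
With Python's standard xml.etree.ElementTree, attribute and text selection look slightly different, because its XPath subset cannot return attribute or text nodes directly. A hedged sketch of the equivalent lookups, using a trimmed copy of the sample page as an assumption: the path selects the elements, and the value is then read with .get() or .text.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample page (an assumption for this sketch).
PAGE = """
<html>
<body>
  <div class="main">
    <div class="description">
      <p>Bulbasaur can be seen napping in bright sunlight.</p>
    </div>
    <ul id="similar-products">
      <li><a href="...">Charmander</a></li>
      <li><a href="...">Venusaur</a></li>
      <li><a href="...">Ivysaur</a></li>
    </ul>
  </div>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# //@href: keep the elements that carry an href, then read its value.
hrefs = [e.get("href") for e in root.findall(".//*[@href]")]

# //div/@class: same idea for the class attribute of <div> nodes.
classes = [d.get("class") for d in root.findall(".//div[@class]")]

# //a/text(): read the .text of each <a> element.
names = [a.text for a in root.findall(".//a")]

print(classes)  # ['main', 'description']
print(names)    # ['Charmander', 'Venusaur', 'Ivysaur']
```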

Select based on Order

| XPath | Description | Output |
|---|---|---|
| //li[1] | Selects the first <li> node among its siblings | <li> <a href="…">Charmander</a> </li> |
| //li[2] | Selects the second <li> node among its siblings | <li> <a href="…">Venusaur</a> </li> |
| //span[@id][2] | Selects the second of the <span> nodes that have an id attribute | <span id="sku">8783</span> |
| //li[last()] | Selects the last <li> node among its siblings | <li> <a href="…">Ivysaur</a> </li> |
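
Position predicates count among the siblings under each parent, not across the whole document. The sketch below illustrates this with Python's standard xml.etree.ElementTree, which supports [n] and [last()]. The second list (id "other-list") is hypothetical and was added only to show that //li[1] matches the first <li> of every list.

```python
import xml.etree.ElementTree as ET

PAGE = """
<body>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Venusaur</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
  <ul id="other-list">
    <li>First</li>
    <li>Second</li>
  </ul>
</body>
"""

root = ET.fromstring(PAGE.strip())
products = root.find(".//ul[@id='similar-products']")

first = products.find("./li[1]/a").text      # first <li> of this list
second = products.find("./li[2]/a").text     # second <li> of this list
last = products.find("./li[last()]/a").text  # last <li> of this list

# Across the whole document, li[1] matches once per parent list.
firsts = root.findall(".//li[1]")

print(first, second, last)  # Charmander Venusaur Ivysaur
print(len(firsts))          # 2
```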

Select based on Node relation

| XPath | Description | Output |
|---|---|---|
| //label/following-sibling::span | Selects the <span> nodes that are following siblings of a <label> | <span id="price">63.00</span> |
| | | <span id="sku">8783</span> |
| //label/following::p | Selects the <p> nodes that follow a <label> | <p> Bulbasaur can be… </p> |
| //label/preceding-sibling::img | Selects the <img> node that is a preceding sibling of a <label> | <img src="…" alt="Bulbasaur" width="100" height="100"> |
| //a/parent::li | Selects the parent <li> node of each <a> node | <li><a href="…">Charmander</a></li> |
| | | <li><a href="…">Venusaur</a></li> |
| | | <li><a href="…">Ivysaur</a></li> |
| //a/ancestor::ul | Selects the ancestor <ul> node of the <a> nodes | <ul id="similar-products">…</ul> |

Miscellaneous

| XPath | Description | Output |
|---|---|---|
| //label[text()="Price: "] | Text equals | <label for="price">Price: </label> |
| //a[contains(text(), "Charmander")] | Substring match on the text | <a href="…">Charmander</a> |
| //div[contains(@class, "desc")] | Substring match on the class attribute | <div class="description"> <p> … </p> </div> |
| //li[*] | Has child elements | <li> <a href="…">Charmander</a> </li> |
| | | <li> <a href="…">Venusaur</a> </li> |
| | | <li> <a href="…">Ivysaur</a> </li> |
| //ul[li] | Has a child with a specific tag | <ul id="similar-products"> … </ul> |
| //a[text()="Charmander" or text()="Ivysaur"] | Or logic | <a href="…">Charmander</a> |
| | | <a href="…">Ivysaur</a> |
| //span[@id="price"] \| //span[@id="sku"] | Union syntax. Joins two results together. | <span id="price">63.00</span> |
| | | <span id="sku">8783</span> |
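
Some of these predicates can be reproduced with Python's standard xml.etree.ElementTree, with caveats. As a sketch under stated assumptions: ElementTree (Python 3.7 or later) spells the text-equality test as [.='…'] and supports [tag], but contains(), or and the union operator fall outside its XPath subset, so they are emulated here with ordinary Python filtering. The trimmed markup and variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample page (an assumption for this sketch).
PAGE = """
<body>
  <label for="price">Price: </label>
  <span id="price">63.00</span>
  <label for="sku">SKU: </label>
  <span id="sku">8783</span>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Venusaur</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Text equals: ElementTree's form of //label[text()="Price: "].
label = root.find(".//label[.='Price: ']")

# Has a specific child tag: //ul[li].
lists = root.findall(".//ul[li]")

# contains() and "or" are emulated by filtering the matched elements in Python.
links = root.findall(".//a")
substring = [a.text for a in links if "Charm" in (a.text or "")]
either = [a.text for a in links if a.text in ("Charmander", "Ivysaur")]

# Union: concatenate the results of two separate queries.
union = root.findall(".//span[@id='price']") + root.findall(".//span[@id='sku']")

print(label.get("for"))         # price
print(substring, either)        # ['Charmander'] ['Charmander', 'Ivysaur']
print([s.text for s in union])  # ['63.00', '8783']
```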

Axes

Axes locate nodes relative to the current node, which makes it possible to write XPath expressions that parse data from complex, nested documents.

Example usage: //span/following::p

Here, following is the axis. The step following::p selects the <p> nodes that come after the <span> node in document order.

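| Axis | Abbreviation | Notes |
|---|---|---|
| ancestor | | |
| ancestor-or-self | | |
| attribute | @ | @href is short for attribute::href |
| child | | div is short for child::div (child is the default axis) |
| descendant | | |
| descendant-or-self | // | // is short for /descendant-or-self::node()/, so //h1 expands to /descendant-or-self::node()/child::h1 |
| namespace | | Selects all namespace nodes of the current node |
| self | . | . is short for self::node() |
| parent | .. | .. is short for parent::node() |
| following | | |
| following-sibling | | |
| preceding | | |
| preceding-sibling | | |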
