XPath (XML Path Language) is a query language for navigating XML and HTML documents and selecting nodes from them. In web scraping, it is used to extract data from pages. The following is a concise XPath cheat sheet for web scraping.
XPath Basics
<div class="main"> <h1> Bulbasaur </h1> <div id="description"> <p> Bulbasaur can be seen napping in bright sunlight. </p> </div> </div>
Accessing the data from HTML using XPath is an easy process. Let’s try to find the <h1> from the above HTML using XPath. The <h1> is located inside a <div> element with the class name "main". The XPath for accessing the element will be:
//div[@class="main"]/h1
Output: <h1> Bulbasaur </h1>
To access the text inside the <h1>, append text() to the path:
//div[@class="main"]/h1/text()
Output: Bulbasaur
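The same lookup can be sketched in Python. This is a minimal example that assumes the standard library's xml.etree.ElementTree, which implements only a subset of XPath: it has no text() node test, so the text is read from the element's .text attribute instead. The <body> wrapper around the sample is an assumption added here so that the <div> is a descendant of the parsed root.

```python
import xml.etree.ElementTree as ET

# The sample HTML from above, wrapped in a <body> element (an assumption made
# here so that the search can start above the <div>).
html = """
<body>
  <div class="main">
    <h1> Bulbasaur </h1>
    <div id="description">
      <p> Bulbasaur can be seen napping in bright sunlight. </p>
    </div>
  </div>
</body>
"""

root = ET.fromstring(html)

# ElementTree paths are relative to an element, so "//" is written as ".//".
h1 = root.find('.//div[@class="main"]/h1')

# No text() support in ElementTree: read .text and trim the padding spaces.
title = h1.text.strip()
print(title)
```

A full XPath engine would accept the expressions exactly as written in this cheat sheet, including text().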
XPath Syntax
XPath uses path expressions to find nodes from XML or HTML documents. The node is selected with a series of steps and axes. The following is a breakdown of the XPaths used above.
//div[@class="main"]/h1
Axis – An axis locates nodes relative to the current node. Axes make it possible to write XPath expressions that parse data from complex, nested documents.
Step – A step names the node to navigate to (here, div and then h1). The bracketed part, [@class="main"], is a predicate that filters the matched nodes.
Here is a list of the most commonly used expressions:
Expression | Description |
---|---|
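// | Select descendants of the current node at any depth |
/ | Select direct children of the current node; at the start of an expression, select from the document root |
. | Select the current node |
.. | Select the parent of the current node |
@ | Select attributes |
nodename | Select all nodes with the given name |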
XPath Locators Cheat Sheet
A larger sample HTML is given below. It will be used for the following examples.
<html> <head></head> <body> <div class="main"> <h1>Bulbasaur</h1> <img src="..." alt="Bulbasaur" width="100" height="100" ><br> <label for="price">Price: </label> <span id="price">63.00</span> <br> <label for="sku">SKU: </label> <span id="sku">8783</span><br> <br> <strong>Description</strong> <div class="description"> <p> Bulbasaur can be seen napping in bright sunlight. There is a seed on its back. By soaking up the sun's rays, the seed grows progressively larger. </p> </div> <h4>Similar products</h4> <ul id="similar-products"> <li> <a href="...">Charmander</a> </li> <li> <a href="...">Venusaur</a> </li> <li> <a href="...">Ivysaur</a> </li> </ul> </div> </body> </html>
How to select Nodes?
XPath | Description | Output |
---|---|---|
//h1 | Selects all <h1> nodes | <h1>Bulbasaur</h1> |
/html/body | Absolute path to the <body> node | <body>…</body> |
//div/span | Selects all <span> nodes that are children of a <div> node | <span id="price">63.00</span> <span id="sku">8783</span> |
/html/body/div/h1 | Absolute path to the <h1> node that is a child of the <div> node | <h1>Bulbasaur</h1> |
//div//a | Selects all <a> nodes that are descendants of <div> nodes | <a href="…">Charmander</a> <a href="…">Venusaur</a> <a href="…">Ivysaur</a> |
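As a rough illustration, the node selections above can be tried out in Python. The sketch below assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed XML, so it uses a trimmed excerpt of the sample page (the void tags <img> and <br> are left out). Because the parsed root is the <html> element itself, absolute paths such as /html/body are written relative to it.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the sample page (an assumption made here,
# because ElementTree cannot parse unclosed tags such as <br>).
page = """
<html>
  <head></head>
  <body>
    <div class="main">
      <h1>Bulbasaur</h1>
      <label for="price">Price: </label> <span id="price">63.00</span>
      <label for="sku">SKU: </label> <span id="sku">8783</span>
      <div class="description"><p>Bulbasaur can be seen napping.</p></div>
      <ul id="similar-products">
        <li><a href="...">Charmander</a></li>
        <li><a href="...">Venusaur</a></li>
        <li><a href="...">Ivysaur</a></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)  # root is the <html> element

headings = [h.text for h in root.findall(".//h1")]        # //h1
body = root.find("./body")                                # /html/body
spans = [s.text for s in root.findall(".//div/span")]     # //div/span
title = root.find("./body/div/h1").text                   # /html/body/div/h1
links = [a.text for a in root.findall(".//div//a")]       # //div//a

print(headings, spans, title, links)
```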
How to select Attributes?
XPath | Description | Output |
---|---|---|
//@href | Selects the href attribute of every node that has one | href="…" href="…" href="…" |
//div/@class | Selects the class attribute of every <div> node | main description |
//a/text() | Select text content from all <a> nodes | Charmander Venusaur Ivysaur |
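Attribute selection can be sketched in Python as well. The example assumes the standard library's xml.etree.ElementTree, whose XPath subset can filter on attributes but cannot return them as results, so the values are read with .get() and the text with .text. The trimmed, well-formed excerpt of the sample page is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the sample page (an assumption made here,
# because ElementTree needs valid XML).
page = """
<div class="main">
  <div class="description"><p>Bulbasaur can be seen napping.</p></div>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Venusaur</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
</div>
"""

root = ET.fromstring(page)

# //@href: collect the attribute from every element that carries it.
hrefs = [el.get("href") for el in root.iter() if el.get("href") is not None]

# //div/@class: root.iter("div") also visits the root <div> itself.
classes = [div.get("class") for div in root.iter("div") if "class" in div.attrib]

# //a/text(): read .text from each <a> element.
names = [a.text for a in root.iter("a")]

print(hrefs, classes, names)
```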
Select based on Order
XPath | Description | Output |
---|---|---|
//li[1] | Selects the first of the matching <li> nodes | <li> <a href="…">Charmander</a> </li> |
//li[2] | Selects the second of the matching <li> nodes | <li> <a href="…">Venusaur</a> </li> |
//span[@id][2] | Selects the second of the <span> nodes that have an id attribute | <span id="sku">8783</span> |
//li[last()] | Selects the last of the matching <li> nodes | <li> <a href="…">Ivysaur</a> </li> |
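Positional predicates can be checked with a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which supports [n] and [last()] as part of its XPath subset, and a trimmed, well-formed excerpt of the sample page (an assumption made for this example). As in XPath, positions start at 1 and are counted among siblings under the same parent.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the sample page (an assumption made here).
page = """
<div class="main">
  <span id="price">63.00</span>
  <span id="sku">8783</span>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Venusaur</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
</div>
"""

root = ET.fromstring(page)

first = root.find(".//li[1]/a").text        # //li[1]
second = root.find(".//li[2]/a").text       # //li[2]
last = root.find(".//li[last()]/a").text    # //li[last()]

# //span[@id][2]: second <span> among those carrying an id attribute.
second_span = root.find(".//span[@id][2]").get("id")

print(first, second, last, second_span)
```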
Select based on Node relation
XPath | Description | Output |
---|---|---|
//label/following-sibling::span | Selects the <span> nodes that are following siblings of a <label> | <span id="price">63.00</span> <span id="sku">8783</span> |
//label/following::p | Selects the <p> nodes that follow a <label> in the document | <p> Bulbasaur can be… </p> |
//label/preceding-sibling::img | Selects the <img> node that is a preceding sibling of a <label> | <img src="…" alt="Bulbasaur" width="100" height="100"> |
//a/parent::li | Selects the parent <li> node of every <a> node | <li><a href="…">Charmander</a></li> … <li><a href="…">Ivysaur</a></li> |
//a/ancestor::ul | Selects the ancestor <ul> node of all <a> nodes | <ul id="similar-products">…</ul> |
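To make the meaning of these relations concrete, here is a small Python sketch. It assumes the standard library's xml.etree.ElementTree, which supports the parent step (..) but none of the named axes, so following-sibling is emulated by a hypothetical helper, following_siblings, written only for this example. The trimmed, well-formed excerpt of the sample page is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the sample page (an assumption made here).
page = """
<div class="main">
  <label for="price">Price: </label> <span id="price">63.00</span>
  <label for="sku">SKU: </label> <span id="sku">8783</span>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
</div>
"""

root = ET.fromstring(page)

# //a/parent::li: ".." selects the parent of each matched <a>.
parents = root.findall(".//a/..")


def following_siblings(tree_root, node, tag):
    """Hypothetical helper emulating node/following-sibling::tag."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in tree_root.iter() for child in parent}
    siblings = list(parent_of[node])
    position = siblings.index(node)
    return [s for s in siblings[position + 1:] if s.tag == tag]


price_label = root.find('.//label[@for="price"]')
sku_label = root.find('.//label[@for="sku"]')

after_price = [s.get("id") for s in following_siblings(root, price_label, "span")]
after_sku = [s.get("id") for s in following_siblings(root, sku_label, "span")]

print([p.tag for p in parents], after_price, after_sku)
```

The helper shows why the XPath in the table returns both spans: the first <label> is followed by both of them, and XPath merges the results for all labels into one node set.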
Miscellaneous
XPath | Description | Output |
---|---|---|
//label[text()="Price: "] | Text equals | <label for="price">Price: </label> |
//a[contains(text(), "Charmander")] | Substring selection | <a href="…">Charmander</a> |
//div[contains(@class, "desc")] | Substring matching in the class attribute | <div class="description"> <p> … </p> </div> |
//li[*] | Has children | <li> <a href="…">Charmander</a> </li> <li> <a href="…">Venusaur</a> </li> <li> <a href="…">Ivysaur</a> </li> |
//ul[li] | Has a specific tag as a child | <ul id="similar-products"> … </ul> |
//a[text()="Charmander" or text()="Ivysaur"] | Or logic | <a href="…">Charmander</a> <a href="…">Ivysaur</a> |
//span[@id="price"] \| //span[@id="sku"] | Union syntax. Joins two results together. | <span id="price">63.00</span> <span id="sku">8783</span> |
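Some of these predicates can be approximated in Python. The sketch assumes the standard library's xml.etree.ElementTree on Python 3.7 or later, which supports text equality ([.='…']) and has-child predicates ([li]) but not contains(), or, or the union operator, so those three are written as ordinary Python filters. The trimmed, well-formed excerpt of the sample page is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the sample page (an assumption made here).
page = """
<div class="main">
  <label for="price">Price: </label> <span id="price">63.00</span>
  <label for="sku">SKU: </label> <span id="sku">8783</span>
  <ul id="similar-products">
    <li><a href="...">Charmander</a></li>
    <li><a href="...">Venusaur</a></li>
    <li><a href="...">Ivysaur</a></li>
  </ul>
</div>
"""

root = ET.fromstring(page)

# Text equals: ElementTree spells text()="..." as .='...'.
price_label = root.find(".//label[.='Price: ']")

# Has a specific tag as a child: //ul[li]
lists = root.findall(".//ul[li]")

# contains(text(), ...): plain substring test in Python.
partial = [a.text for a in root.iter("a") if "Charm" in (a.text or "")]

# Or logic: membership test in Python.
either = [a.text for a in root.iter("a") if a.text in ("Charmander", "Ivysaur")]

# Union: concatenate two result lists (XPath would also de-duplicate
# and return the nodes in document order).
both = root.findall(".//span[@id='price']") + root.findall(".//span[@id='sku']")

print(price_label.get("for"), partial, either, [s.get("id") for s in both])
```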
Axes
Axes locate nodes relative to the current node. They make it possible to write XPath expressions that parse data from complex, nested documents.
Example usage: //span/following::p
The keyword following is an axis. It tells XPath to jump to the <p> node (::p) that follows the <span> node.
Axis | Abbrev | Notes |
---|---|---|
ancestor | | |
ancestor-or-self | | |
attribute | @ | @href is short for attribute::href |
child | | div is short for child::div (child is the default axis) |
descendant | | |
descendant-or-self | // | //h1 is short for /descendant-or-self::node()/h1 |
namespace | | Selects all namespace nodes of the current node |
self | . | . is short for self::node() |
parent | .. | .. is short for parent::node() |
following | | |
following-sibling | | |
preceding | | |
preceding-sibling | | |