What are the basics of XPath syntax?
XPath uses path expressions to select nodes in HTML and XML documents. The core syntax below covers most everyday selection tasks.
Basic path expressions:
Absolute path (from root):
/html/body/div/p
Selects only the <p> elements reached by this exact path from the document root. Fragile: it breaks as soon as the page structure changes.
Relative path (anywhere in document):
//p
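Selects all <p> elements at any depth. Strictly speaking, // is shorthand for a descendant search that starts at the document root, but it is the most common and flexible form.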
Current node:
.
Parent node:
..
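The path forms above can be compared side by side. This is a minimal sketch using lxml; the sample markup and variable names are invented for illustration and are not from any real page.

```python
from lxml import html

# Hypothetical sample page, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <div><p>intro</p></div>
  <div id="main"><blockquote><p>details</p></blockquote></div>
</body></html>
""")

# Absolute path: matches only <p> reached through exactly html/body/div.
absolute = doc.xpath('/html/body/div/p/text()')

# Descendant search: matches every <p> at any depth.
anywhere = doc.xpath('//p/text()')

# "." is the context node, so ".//p" searches only inside this element.
quote = doc.xpath('//blockquote')[0]
inside_quote = quote.xpath('.//p/text()')

# ".." steps up to the parent of each matched <p>.
parent_tags = [el.tag for el in doc.xpath('//p/..')]

print(absolute, anywhere, inside_quote, parent_tags)
```

Note that a leading // always searches the whole document, even when called on a sub-element; prefix it with a dot (.//p) to stay inside the current node.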
Attribute selection:
Select element with attribute:
//div[@class='product']
Select attribute value:
//img/@src
Multiple attributes:
//input[@type='text'][@name='email']
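As a rough illustration of the difference between selecting an element by attribute and selecting the attribute itself, the sketch below runs both kinds of expression with lxml. The markup is a made-up sample.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<div>
  <img src="/a.png" alt="A">
  <img src="/b.png">
  <input type="text" name="email">
  <input type="text" name="phone">
</div>
""")

# Selecting an attribute returns its values as strings, not elements.
sources = doc.xpath('//img/@src')

# Chained predicates must all hold, so only the email field matches.
email_fields = doc.xpath("//input[@type='text'][@name='email']")

# Matched elements expose their attributes through .get().
names = [el.get('name') for el in email_fields]

print(sources, names)
```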
Predicates (filters):
By position:
//li[1] # First li within each parent
//li[last()] # Last li within each parent
//li[position() < 4] # First 3 li elements within each parent
(//li)[1] # First li in the whole document
By condition:
//div[@class='active']
//span[@id and @class] # Has both id and class
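Positional predicates are evaluated relative to each parent, which is easy to miss. The following sketch, using lxml on an invented two-list sample, shows how //li[1] differs from (//li)[1].

```python
from lxml import html

# Hypothetical sample with two lists, invented for this illustration.
doc = html.fromstring("""
<div>
  <ul><li>a1</li><li>a2</li></ul>
  <ul><li>b1</li><li>b2</li><li>b3</li></ul>
</div>
""")

# Position is counted among siblings, so each <ul> contributes a match.
first_per_parent = doc.xpath('//li[1]/text()')
last_per_parent = doc.xpath('//li[last()]/text()')
first_two_per_parent = doc.xpath('//li[position() < 3]/text()')

# Parentheses collect all <li> first, then pick one from the whole set.
first_in_document = doc.xpath('(//li)[1]/text()')

print(first_per_parent, last_per_parent, first_two_per_parent, first_in_document)
```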
Text matching:
Exact text:
//button[text()='Submit']
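Contains text (in XPath 1.0 this checks only the first text node of each <p>; use contains(., 'Price') to search all nested text):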
//p[contains(text(), 'Price')]
Starts with:
//div[starts-with(@class, 'product-')]
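Text matching has two common pitfalls: surrounding whitespace and text split across child elements. The sketch below demonstrates both with lxml, which implements XPath 1.0; the markup is an invented sample.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<div>
  <button> Submit </button>
  <p>Total <b>Price</b>: 10</p>
  <p>Price: 20</p>
</div>
""")

# Exact comparison fails here because the text node is " Submit ".
exact = doc.xpath("//button[text()='Submit']")

# normalize-space() trims and collapses whitespace before comparing.
normalized = doc.xpath("//button[normalize-space()='Submit']")

# contains(text(), ...) looks only at the first text node of each <p>.
first_text_node = doc.xpath("//p[contains(text(), 'Price')]")

# contains(., ...) uses the full string value, including nested elements.
full_string = doc.xpath("//p[contains(., 'Price')]")

print(len(exact), len(normalized), len(first_text_node), len(full_string))
```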
Axes (navigation):
Child:
//div/child::p
# Same as: //div/p
Parent:
//span/parent::div
# Like //span/.. but matches only when the parent is a div
Following-sibling:
//h2/following-sibling::p
Preceding-sibling:
//p/preceding-sibling::h2
Ancestor:
//span/ancestor::div
Descendant:
//div/descendant::a
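To see what each axis returns in practice, here is a small sketch with lxml. The document structure and the id values are invented for illustration.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<div id="outer">
  <h2>Specs</h2>
  <p>Weight</p>
  <p>Size</p>
  <div id="inner"><span>note</span><a href="/x">link</a></div>
</div>
""")

# Siblings that come after / before the context node.
following = doc.xpath('//h2/following-sibling::p/text()')
preceding = doc.xpath('//p/preceding-sibling::h2/text()')

# ancestor:: walks all the way up; results come back in document order.
ancestors = doc.xpath('//span/ancestor::div/@id')

# parent::div is a filtered parent step; ".." takes any parent.
parent_div = doc.xpath('//span/parent::div/@id')
parent_p = doc.xpath('//span/parent::p')

# descendant:: reaches any depth below the context node.
links = doc.xpath('//div[@id="outer"]/descendant::a/@href')

print(following, preceding, ancestors, parent_div, parent_p, links)
```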
Operators:
And:
//div[@class='product' and @data-available='true']
Or:
//input[@type='text' or @type='email']
Not:
//div[not(@class='hidden')]
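The operators can be tried out as follows. This is a sketch with lxml on invented markup; it also shows that not(@class='hidden') matches elements with no class attribute at all, whereas @class!='hidden' requires the attribute to be present.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<form>
  <input type="text" name="q">
  <input type="email" name="mail">
  <input type="submit" name="go">
  <div class="hidden">x</div>
  <div>y</div>
  <div class="product" data-available="true">z</div>
</form>
""")

# and: both conditions must hold.
available = doc.xpath("//div[@class='product' and @data-available='true']/text()")

# or: either condition is enough.
text_like = doc.xpath("//input[@type='text' or @type='email']/@name")

# not(): also matches the <div> that has no class attribute.
not_hidden = doc.xpath("//div[not(@class='hidden')]/text()")

# !=: only matches elements that have a class, and it differs.
has_other_class = doc.xpath("//div[@class!='hidden']/text()")

print(available, text_like, not_hidden, has_other_class)
```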
Common patterns:
Any element whose class attribute is exactly 'product':
//*[@class='product']
Attribute contains value:
//a[contains(@href, 'product')]
Multiple classes:
//div[contains(@class, 'card') and contains(@class, 'active')]
# contains() is a substring match, so 'card' also matches class="discard"
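Because contains() matches substrings, a stricter whole-token test is sometimes preferable. The sketch below compares three approaches with lxml; the concat/normalize-space idiom is a widely used XPath 1.0 workaround, and the `token` helper string and sample markup are invented for this example.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<div>
  <div class="card active">A</div>
  <div class="discard active">B</div>
  <div class="card">C</div>
</div>
""")

# Exact match: the whole attribute value must equal 'card'.
exact = doc.xpath("//div[@class='card']/text()")

# Substring match: 'discard' also contains 'card'.
loose = doc.xpath(
    "//div[contains(@class, 'card') and contains(@class, 'active')]/text()"
)

# Token match: pad the class list with spaces and look for ' name '.
token = "contains(concat(' ', normalize-space(@class), ' '), ' {} ')"
strict = doc.xpath(
    "//div[{} and {}]/text()".format(token.format('card'), token.format('active'))
)

print(exact, loose, strict)
```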
Python usage:
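from lxml import html
# html_content holds the page source; a small sample string stands in for it here
html_content = '<div class="product"><h2>Widget</h2></div>'
tree = html.fromstring(html_content)
results = tree.xpath('//div[@class="product"]/h2/text()')  # list of matching strings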
Scrapy usage:
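# response is the Response object passed to a spider callback
response.xpath('//div[@class="product"]/h2/text()').get()     # first match, or None
response.xpath('//div[@class="product"]/h2/text()').getall()  # list of all matches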
Best practices:
- Keep paths short and relative (//div, not /html/body/div)
- Use contains() for flexible class matching
- Avoid deep absolute paths (fragile)
- Test expressions in an XPath tester before implementing
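When no XPath tester is at hand, expressions can also be checked from Python. The helper below is a hypothetical sketch, not part of lxml: `try_xpath` is a name invented here, and it relies on lxml raising a subclass of `etree.XPathError` for malformed expressions.

```python
from lxml import etree, html

def try_xpath(doc, expression):
    """Return (matches, error); error is None when the expression is valid."""
    try:
        return doc.xpath(expression), None
    except etree.XPathError as exc:
        # Malformed expressions raise an XPathError subclass in lxml.
        return [], str(exc)

# Hypothetical sample markup, invented for this illustration.
sample = html.fromstring('<div class="product"><h2>Widget</h2></div>')

matches, error = try_xpath(sample, '//div[@class="product"]/h2/text()')
broken_matches, broken_error = try_xpath(sample, '//div[@class="product"')

print(matches, error)
print(broken_matches, broken_error)
```

An empty result with no error means the expression is valid but matches nothing, which usually points to a wrong tag name, attribute value, or nesting assumption.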