What are the basics of XPath syntax?

XPath uses path expressions to navigate HTML/XML documents. Understanding the core syntax is essential.

Basic path expressions:

Absolute path (from root):

/html/body/div/p

Selects only the <p> elements at this exact path from the document root. Fragile: it breaks as soon as the page structure changes.

Descendant search (anywhere in document, often called a relative path):

//p

Selects all <p> elements anywhere. Most common and flexible.
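To make the difference concrete, here is a minimal sketch using lxml. The sample page and the variable names are hypothetical, chosen only for illustration. Note that // is shorthand for /descendant-or-self::node()/, so //p still starts at the document root; it simply ignores depth.

```python
from lxml import html

# Hypothetical sample page, used only for illustration.
page = """
<html><body>
  <div><p>intro</p></div>
  <section><div><p>nested</p></div></section>
</body></html>
"""
tree = html.fromstring(page)

# Absolute path: only <p> elements located at exactly /html/body/div/p.
absolute = [p.text for p in tree.xpath('/html/body/div/p')]

# Descendant search: every <p> in the document, at any depth.
anywhere = [p.text for p in tree.xpath('//p')]

print(absolute)
print(anywhere)
```

The nested paragraph sits under /html/body/section/div/p, so only the // form reaches it.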

Current node:

.

Parent node:

..
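
The current-node and parent-node steps are easiest to see when an expression is evaluated from an element rather than from the whole tree. The following is a small sketch with lxml; the markup and names are assumptions made for the example.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = "<html><body><div id='box'><p>one</p><p>two</p></div></body></html>"
tree = html.fromstring(page)

first_p = tree.xpath('//p')[0]

# "." is the context node itself, so this returns the same <p>.
same = first_p.xpath('.')[0]

# ".." steps up to the parent of the context node.
parent = first_p.xpath('..')[0]

# ".//p" searches only below the context node, unlike "//p",
# which always searches the whole document.
box = tree.xpath("//div[@id='box']")[0]
inside = [p.text for p in box.xpath('.//p')]

print(same.text, parent.get('id'), inside)
```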

Attribute selection:

Select element with attribute:

//div[@class='product']

Select attribute value:

//img/@src

Multiple attributes:

//input[@type='text'][@name='email']
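
The three attribute expressions above can be tried against a small document. This is a sketch only: the HTML and variable names are made up for the example. The point to notice is that a predicate such as [@class='product'] returns elements, while a trailing /@src returns the attribute values themselves as strings.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <div class="product"><img src="/img/a.png"></div>
  <div class="banner"><img src="/img/b.png"></div>
  <form>
    <input type="text" name="email">
    <input type="text" name="city">
  </form>
</body></html>
"""
tree = html.fromstring(page)

# Predicate on an attribute: returns the matching elements.
products = tree.xpath("//div[@class='product']")

# Attribute step: returns the attribute values as strings.
sources = tree.xpath('//img/@src')

# Chained predicates: both conditions must hold.
email_inputs = tree.xpath("//input[@type='text'][@name='email']")

print(len(products), sources, [i.get('name') for i in email_inputs])
```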

Predicates (filters):

By position:

//li[1]               # Each li that is the first li child of its parent
//li[last()]          # Each li that is the last li child of its parent
//li[position() < 4]  # The first 3 li children of each parent
(//li)[1]             # The first li in the whole document

By condition:

//div[@class='active']
//span[@id and @class]  # Has both id and class
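
Positional predicates are a common source of surprises, because a position is counted among siblings under each parent, not across the whole document. The sketch below illustrates this with two lists; the markup and the texts helper are hypothetical and exist only for this example.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <ul><li>a1</li><li>a2</li><li>a3</li><li>a4</li></ul>
  <ul><li>b1</li><li>b2</li></ul>
  <span id="x" class="y">both</span>
  <span id="z">id only</span>
</body></html>
"""
tree = html.fromstring(page)


def texts(expr):
    """Return the text of every element matched by an XPath expression."""
    return [el.text for el in tree.xpath(expr)]


# Position is evaluated per parent, so each <ul> contributes a match.
first_per_list = texts('//li[1]')
last_per_list = texts('//li[last()]')
first_three = texts('//li[position() < 4]')

# Parenthesizing first collects all <li> elements, then picks one.
first_overall = texts('(//li)[1]')

# Existence test: the element must carry both attributes.
both_attrs = texts('//span[@id and @class]')

print(first_per_list, last_per_list, first_three, first_overall, both_attrs)
```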

Text matching:

Exact text:

//button[text()='Submit']

Contains text:

//p[contains(text(), 'Price')]

Starts with:

//div[starts-with(@class, 'product-')]
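
A short sketch of the text-matching functions, run with lxml (which implements XPath 1.0). The sample markup is hypothetical. One detail worth knowing: in XPath 1.0, contains(text(), ...) looks only at the first text node of the element, so text that follows a child tag can be missed; contains(., ...) tests the element's full string value instead.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <button>Submit</button>
  <button>Submit order</button>
  <p>Price: 10</p>
  <p>Our <b>best</b> Price: 8</p>
  <div class="product-card">x</div>
  <div class="my-product-card">y</div>
</body></html>
"""
tree = html.fromstring(page)

# Exact match: only the button whose text is exactly "Submit".
exact = tree.xpath("//button[text()='Submit']")

# contains(text(), ...) checks the first text node only, so the
# second <p> (whose first text node is "Our ") is not matched.
first_text_only = [p.text_content()
                   for p in tree.xpath("//p[contains(text(), 'Price')]")]

# contains(., ...) checks the whole string value of the element.
whole_string = [p.text_content()
                for p in tree.xpath("//p[contains(., 'Price')]")]

# starts-with on an attribute value.
prefixed = [d.text
            for d in tree.xpath("//div[starts-with(@class, 'product-')]")]

print(len(exact), first_text_only, whole_string, prefixed)
```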

Axes (navigation):

Child:

//div/child::p
# Same as: //div/p

Parent:

//span/parent::div
# Like //span/.., but matches only when the parent is a div

Following-sibling:

//h2/following-sibling::p

Preceding-sibling:

//p/preceding-sibling::h2

Ancestor:

//span/ancestor::div

Descendant:

//div/descendant::a
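
The axes above can be exercised on one small document, sketched here with lxml. The markup and variable names are assumptions for illustration. The example also shows that parent::div is stricter than .., since it returns nothing when the parent has a different tag.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <div id="main">
    <h2>Title</h2>
    <p>first</p>
    <p>second</p>
    <ul><li><span>note</span><a href="/x">link</a></li></ul>
  </div>
</body></html>
"""
tree = html.fromstring(page)

# child:: is the default axis, so both forms match the same nodes.
child_p = [p.text for p in tree.xpath('//div/child::p')]
short_p = [p.text for p in tree.xpath('//div/p')]

# parent::div only matches when the parent really is a <div>.
parent_div = tree.xpath('//span/parent::div')
any_parent = [e.tag for e in tree.xpath('//span/..')]

# Sibling axes look sideways under the same parent.
after_h2 = [p.text for p in tree.xpath('//h2/following-sibling::p')]
before_p = [h.text for h in tree.xpath('//p/preceding-sibling::h2')]

# ancestor:: walks up, descendant:: walks down, at any depth.
ancestors = [d.get('id') for d in tree.xpath('//span/ancestor::div')]
links = tree.xpath('//div/descendant::a/@href')

print(child_p, any_parent, after_h2, before_p, ancestors, links)
```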

Operators:

And:

//div[@class='product' and @data-available='true']

Or:

//input[@type='text' or @type='email']

Not:

//div[not(@class='hidden')]
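
Here is a brief sketch of the three operators using lxml, with hypothetical sample markup. It also highlights a subtlety of not(): not(@class='hidden') is true for elements that have no class attribute at all, whereas @class!='hidden' requires the attribute to be present.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <div class="product" data-available="true">A</div>
  <div class="product" data-available="false">B</div>
  <div class="hidden">C</div>
  <div>D</div>
  <input type="text" name="q">
  <input type="email" name="e">
  <input type="submit" name="s">
</body></html>
"""
tree = html.fromstring(page)

# and: both conditions must be true.
available = [d.text for d in tree.xpath(
    "//div[@class='product' and @data-available='true']")]

# or: either condition may be true.
text_or_email = tree.xpath("//input[@type='text' or @type='email']/@name")

# not(...): also matches <div> elements without any class attribute.
not_hidden = [d.text for d in tree.xpath("//div[not(@class='hidden')]")]

# != needs the attribute to exist, so the class-less <div> is excluded.
other_class = [d.text for d in tree.xpath("//div[@class!='hidden']")]

print(available, text_or_email, not_hidden, other_class)
```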

Common patterns:

Any element whose class attribute is exactly 'product':

//*[@class='product']

Attribute contains value:

//a[contains(@href, 'product')]

Multiple classes:

//div[contains(@class, 'card') and contains(@class, 'active')]
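
A sketch of these patterns with lxml follows. The markup, the texts helper, and the has_class helper are all hypothetical, introduced only for this example. Because contains() is a plain substring test, contains(@class, 'card') also matches a class such as discard; one common workaround, shown as an option rather than a requirement, pads the attribute with spaces and matches a whole token.

```python
from lxml import html

# Hypothetical sample markup, for illustration only.
page = """
<html><body>
  <div class="card active">one</div>
  <div class="card">two</div>
  <div class="discard active">three</div>
  <span class="product">s</span>
  <a href="/product/1">p1</a>
  <a href="/about">about</a>
</body></html>
"""
tree = html.fromstring(page)


def texts(expr):
    """Return the text of every element matched by an XPath expression."""
    return [el.text for el in tree.xpath(expr)]


def has_class(name):
    """Build a predicate that matches a whole class token (illustrative helper)."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


# Any tag whose class attribute is exactly "product".
any_product = [el.tag for el in tree.xpath("//*[@class='product']")]

# Substring match on an attribute value.
product_links = texts("//a[contains(@href, 'product')]")

# Substring matching on class also catches "discard".
substring_match = texts(
    "//div[contains(@class, 'card') and contains(@class, 'active')]")

# Token matching only accepts the exact class names.
token_match = texts(f"//div[{has_class('card')} and {has_class('active')}]")

print(any_product, product_links, substring_match, token_match)
```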

Python usage:

from lxml import html

# html_content is assumed to hold the page's HTML as a string
tree = html.fromstring(html_content)
# xpath() returns a list; here, the text of each matching <h2>
results = tree.xpath('//div[@class="product"]/h2/text()')

Scrapy usage:

response.xpath('//div[@class="product"]/h2/text()').get()     # First match, or None
response.xpath('//div[@class="product"]/h2/text()').getall()  # List of all matches

Best practices:
