What are common XPath mistakes to avoid?
Understanding common XPath mistakes helps you write more reliable selectors.
1. Using absolute paths:
Bad:
/html/body/div[1]/div[2]/p
Breaks when HTML structure changes.
Good:
//p[@class='description']
Keeps working when the surrounding layout changes, as long as the class name stays the same.
2. Forgetting text() for text extraction:
Wrong (returns element):
title = tree.xpath('//h1[@class="title"]')[0]
# Returns: <Element h1>
Right (returns text):
title = tree.xpath('//h1[@class="title"]/text()')[0]
# Returns: "Product Title"
3. Class attribute matching issues:
Wrong (exact match required):
//div[@class='product']
Fails for <div class="product featured">
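Right (flexible):
//div[contains(@class, 'product')]
Note that contains() is a substring test, so it also matches class="product-list". To match the class as a whole token:
//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]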
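The difference between the three class tests can be checked with a small sketch. It assumes lxml is installed (the `tree.xpath` calls in this article suggest it), and the markup is made up for illustration.

```python
from lxml import etree

# Hypothetical markup, used only to compare the three predicates.
root = etree.fromstring(
    "<body>"
    "<div class='product'>A</div>"
    "<div class='product featured'>B</div>"
    "<div class='product-list'>C</div>"
    "</body>"
)

# Exact string comparison against the whole class attribute.
exact = root.xpath("//div[@class='product']/text()")
# Substring test: also matches 'product-list'.
loose = root.xpath("//div[contains(@class, 'product')]/text()")
# Whole-token test: pads the attribute with spaces and looks for ' product '.
token = root.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]/text()"
)

print(exact)  # ['A']
print(loose)  # ['A', 'B', 'C']
print(token)  # ['A', 'B']
```

The token form is more verbose, so it is usually worth it only when class names share a common prefix.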
4. Position() vs index:
Wrong (position is 1-indexed):
//li[0] # Returns nothing
Right:
//li[1] # First element
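A short sketch of how XPath positions line up with Python list indexes, assuming lxml and an invented three-item list:

```python
from lxml import etree

# Hypothetical list used only for illustration.
root = etree.fromstring("<ul><li>a</li><li>b</li><li>c</li></ul>")

zero = root.xpath("//li[0]/text()")          # position 0 never exists in XPath
first = root.xpath("//li[1]/text()")         # XPath positions start at 1
last = root.xpath("//li[last()]/text()")     # last() gives the final position
py_first = root.xpath("//li/text()")[0]      # Python lists start at 0

print(zero, first, last, py_first)  # [] ['a'] ['c'] a
```

The off-by-one risk is highest when a position inside the expression is mixed with a Python index on the result list.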
5. Confusion between // and /:
# All divs anywhere
//div
# Matches a div only if it is the document's root element
/div
# All p within specific div
//div[@id='content']//p
# Direct p children only
//div[@id='content']/p
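The four expressions can be compared on a tiny document. This is a sketch assuming lxml; the markup is invented for the example.

```python
from lxml import etree

# Hypothetical page: one p directly under the content div, one nested deeper.
root = etree.fromstring(
    "<html><body>"
    "<div id='content'><p>direct</p><section><p>nested</p></section></div>"
    "</body></html>"
)

root_level = root.xpath("/div")    # the root element is <html>, so no match
anywhere = root.xpath("//div")     # finds the div at any depth
children = root.xpath("//div[@id='content']/p/text()")
descendants = root.xpath("//div[@id='content']//p/text()")

print(len(root_level), len(anywhere))  # 0 1
print(children)                        # ['direct']
print(descendants)                     # ['direct', 'nested']
```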
6. Not handling empty results:
Unsafe:
price = tree.xpath('//span[@class="price"]/text()')[0]
# IndexError if not found
Safe:
prices = tree.xpath('//span[@class="price"]/text()')
price = prices[0] if prices else None
Better (Scrapy):
price = response.xpath('//span[@class="price"]/text()').get()
# Returns None if not found
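For lxml, the safe pattern can be wrapped in a small helper. The function name `first` and the sample markup below are hypothetical, and the helper is meant for expressions that return a list of nodes or strings, not for `count()` or `string()`.

```python
from lxml import etree


def first(tree, expr, default=None):
    """Return the first XPath result, or `default` when nothing matches."""
    results = tree.xpath(expr)
    return results[0] if results else default


# Hypothetical markup used only for illustration.
tree = etree.fromstring("<div><span class='price'>$19.99</span></div>")

price = first(tree, "//span[@class='price']/text()")
missing = first(tree, "//span[@class='discount']/text()", default="n/a")

print(price)    # $19.99
print(missing)  # n/a
```

An explicit default keeps the "not found" case visible at the call site instead of hiding it behind an exception handler.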
7. Confusing text() and string():
# text() - direct text nodes only
//div/text()
# string() - all text including children
string(//div)
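A minimal sketch of the difference, assuming lxml and a made-up div that mixes text with a child element:

```python
from lxml import etree

# Hypothetical markup: text before and after a nested <b> element.
div = etree.fromstring("<div>Price: <b>$10</b> only</div>")

# text() returns only the text nodes that are direct children of the div.
text_nodes = div.xpath("//div/text()")
# string() concatenates all descendant text of the first matching div.
full_text = div.xpath("string(//div)")

print(text_nodes)  # ['Price: ', ' only']
print(full_text)   # Price: $10 only
```

Note that `text()` yields a list while `string()` yields a single string, so the two are not interchangeable in calling code.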
8. Wrong axis for siblings:
Wrong (steps up to the parent first, so it selects siblings of the h2's parent):
//h2/../following-sibling::p
Right (selects p siblings that follow the h2 itself):
//h2/following-sibling::p
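Both expressions are valid XPath but select different nodes, which a small sketch makes visible. It assumes lxml, and the markup is invented for the example.

```python
from lxml import etree

# Hypothetical markup: one p next to the h2, one p next to the h2's parent.
root = etree.fromstring(
    "<body>"
    "<section><h2>Title</h2><p>inside</p></section>"
    "<p>outside</p>"
    "</body>"
)

# Goes up to <section>, then takes the section's following siblings.
via_parent = root.xpath("//h2/../following-sibling::p/text()")
# Takes the following siblings of the h2 itself.
direct = root.xpath("//h2/following-sibling::p/text()")

print(via_parent)  # ['outside']
print(direct)      # ['inside']
```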
9. Position predicates and parentheses:
Per parent:
//div/p[@class='intro'][1]
Returns the first p with class intro inside each div, so it can return several elements.
Whole document:
(//div/p[@class='intro'])[1]
Returns only the first matching p in the entire document.
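The effect of the parentheses shows up as soon as more than one div contains a matching p. A sketch assuming lxml and invented markup:

```python
from lxml import etree

# Hypothetical markup: two divs, each with at least one intro paragraph.
root = etree.fromstring(
    "<body>"
    "<div><p class='intro'>a1</p><p class='intro'>a2</p></div>"
    "<div><p class='intro'>b1</p></div>"
    "</body>"
)

# [1] is applied per div: the first intro p under each parent.
per_parent = root.xpath("//div/p[@class='intro'][1]/text()")
# Parentheses collect all matches first, then [1] picks one overall.
overall = root.xpath("(//div/p[@class='intro'])[1]/text()")

print(per_parent)  # ['a1', 'b1']
print(overall)     # ['a1']
```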
10. Not normalizing whitespace:
# Fails with extra whitespace
//button[text()='Submit']
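# Tolerates leading and trailing whitespace (in XPath 1.0 only the first text node is tested)
//button[normalize-space(text())='Submit']
# Also tolerates text wrapped in child elements
//button[normalize-space()='Submit']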
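The following sketch compares an exact text test with two `normalize-space()` variants. It assumes lxml, and the two buttons are made up: one with padded text, one with the label wrapped in a span.

```python
from lxml import etree

# Hypothetical form with two differently structured Submit buttons.
root = etree.fromstring(
    "<form>"
    "<button>  Submit\n</button>"
    "<button><span>Submit</span></button>"
    "</form>"
)

# Exact comparison fails on padding and on nested markup.
exact = root.xpath("//button[text()='Submit']")
# Normalizes the first direct text node only.
first_text = root.xpath("//button[normalize-space(text())='Submit']")
# Normalizes the full string value of the button, children included.
whole = root.xpath("//button[normalize-space(.)='Submit']")

print(len(exact), len(first_text), len(whole))  # 0 1 2
```

Which variant fits depends on whether nested elements should count as part of the label.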
11. Attribute vs text confusion:
# Wrong - @src already selects the value; attribute nodes have no text() children, so this returns nothing
//img/@src/text()
# Right
//img/@src
12. Case sensitivity:
XPath is case-sensitive:
# Different results
//DIV # Uppercase (XML)
//div # Lowercase (HTML)
HTML parsers such as lxml's normalize tag names to lowercase, so always write HTML element names in lowercase.
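A quick way to see this, assuming lxml's HTML parser and a made-up uppercase snippet:

```python
from lxml import html

# Hypothetical source written with uppercase tags.
doc = html.fromstring("<DIV><P>Hello</P></DIV>")

# The HTML parser stores tag names in lowercase.
print(doc.tag)  # div

lower = doc.xpath("//div/p/text()")
upper = doc.xpath("//DIV/P/text()")

print(lower)  # ['Hello']
print(upper)  # []
```

With an XML parser the original casing is preserved, so the same uppercase expression could match there.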
Debugging tips:
- Test XPath in the browser console: $x('//div[@class="product"]')
- Use XPath tester tools before implementing
- Start simple, add complexity gradually
- Check if elements exist before extracting
- Print results to verify data type and content
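As a sketch of the last tip, the loop below prints the Python type and value each kind of expression returns. It assumes lxml; the markup and the `results` dictionary are only for illustration.

```python
from lxml import etree

# Hypothetical markup used only for illustration.
tree = etree.fromstring("<div><h1 class='title'>Product Title</h1></div>")

expressions = ["//h1", "//h1/text()", "//h1/@class", "count(//h1)", "string(//h1)"]

results = {}
for expr in expressions:
    results[expr] = tree.xpath(expr)
    # Show the type alongside the value: elements, strings, and numbers differ.
    print(f"{expr} -> {type(results[expr]).__name__}: {results[expr]!r}")
```

Node-selecting expressions come back as lists, while `count()` and `string()` return a single number or string, which is easy to miss without printing the type.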
Best practices:
- Keep paths short and relative
- Use contains() for flexible matching
- Always handle missing elements
- Use normalize-space() for text matching
- Test against multiple page samples
- Add fallback selectors for resilience
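One possible shape for fallback selectors is a helper that tries expressions in order. This is a sketch assuming lxml; the name `first_match`, the selector list, and the markup are all hypothetical.

```python
from lxml import etree


def first_match(tree, expressions):
    """Try each XPath in order; return the first stripped string found, else None."""
    for expr in expressions:
        results = tree.xpath(expr)
        if results:
            return results[0].strip()
    return None


# Hypothetical selectors, ordered from most to least preferred.
PRICE_SELECTORS = [
    "//span[@class='price']/text()",
    "//span[@itemprop='price']/text()",
    "//meta[@property='product:price:amount']/@content",
]

# Hypothetical page where only the second selector matches.
page = etree.fromstring(
    "<html><body><span itemprop='price'> 19.99 </span></body></html>"
)

price = first_match(page, PRICE_SELECTORS)
print(price)  # 19.99
```

Keeping the selectors in a list makes it easy to add or reorder fallbacks when a site changes its markup.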