How do I extract text content with XPath?
XPath provides multiple ways to extract text, each with different behaviors.
text() function:
Direct text only:
//p/text()
For HTML <p>Hello</p>, returns: "Hello"
Multiple text nodes:
//div/text()
For <div>Hello <span>World</span>!</div>:
Returns: ["Hello ", "!"] (only direct text, not nested)
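The behavior above can be checked with a minimal sketch using lxml (the same library as in the Python section of this answer). The variable names are illustrative only.

```python
from lxml import html

# Parse the sample markup from the example above.
div = html.fromstring("<div>Hello <span>World</span>!</div>")

# text() selects only the text nodes that are direct children of the div.
direct_text = div.xpath("//div/text()")
print(direct_text)  # ['Hello ', '!']

# The nested "World" belongs to the span, so it needs its own step.
span_text = div.xpath("//div/span/text()")
print(span_text)  # ['World']
```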
string() function:
All text content (including children):
string(//div)
For <div>Hello <span>World</span>!</div>:
Returns: "Hello World!" (all text concatenated). If //div matches several elements, string() uses only the first one in document order.
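One detail worth seeing in code: in XPath 1.0, string() converts a multi-node selection by taking only its first node. The sketch below assumes lxml and a made-up two-div document. Evaluating string(.) per element is one way to get the full text of every match.

```python
from lxml import html

# Hypothetical document with two matching divs.
doc = html.fromstring(
    "<html><body>"
    "<div>Hello <span>World</span>!</div>"
    "<div>Second block</div>"
    "</body></html>"
)

# string() on a multi-node selection uses only the first node.
first_only = doc.xpath("string(//div)")
print(first_only)  # Hello World!

# To get the full text of every match, evaluate string(.) per element.
per_element = [div.xpath("string(.)") for div in doc.xpath("//div")]
print(per_element)  # ['Hello World!', 'Second block']
```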
normalize-space():
Trimmed and normalized text:
normalize-space(//p)
For <p>  Hello   World  </p>:
Returns: "Hello World" (leading and trailing whitespace stripped, internal runs collapsed to a single space)
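A short sketch, assuming lxml, comparing normalize-space() with a common Python-side cleanup. XPath treats only space, tab, carriage return, and line feed as whitespace, so a non-breaking space (&nbsp;) survives normalize-space(), while Python's str.split() also splits on it. The sample markup is made up for illustration.

```python
from lxml import html

p = html.fromstring("<p>  Hello \n\t World  </p>")

# XPath-side cleanup.
xpath_clean = p.xpath("normalize-space(.)")
# Python-side cleanup of the same text.
python_clean = " ".join(p.text_content().split())
print(xpath_clean, "|", python_clean)  # Hello World | Hello World

# A non-breaking space is not XPath whitespace, so it is left in place.
nbsp = html.fromstring("<p>Price:&nbsp;10</p>")
nbsp_xpath = nbsp.xpath("normalize-space(.)")
nbsp_python = " ".join(nbsp.text_content().split())
print(repr(nbsp_xpath), repr(nbsp_python))  # 'Price:\xa010' 'Price: 10'
```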
Common patterns:
Extract element text:
//h1[@class='title']/text()
Extract including nested elements:
//div[@class='content']//text()
Returns all text nodes within div and its descendants.
Clean whitespace:
normalize-space(//div[@class='description'])
Python implementation:
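from lxml import html
# html_content is assumed to hold the page's HTML source as a string
tree = html.fromstring(html_content)
# Direct text (xpath() returns a list; [0] raises IndexError if nothing matches)
title = tree.xpath('//h1/text()')[0]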
# All text (including nested)
description = ''.join(tree.xpath('//div[@class="desc"]//text()'))
# Normalized text
clean_text = tree.xpath('normalize-space(//p[@class="intro"])')
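Indexing with [0] fails when an element is missing from the page. One possible guard is a small helper like the one below. Note that first_text is a hypothetical name introduced here for illustration, not part of lxml, and it assumes the expression selects text nodes.

```python
from lxml import html


def first_text(tree, expr, default=None):
    """Return the first XPath result, stripped, or `default` if nothing matches."""
    results = tree.xpath(expr)
    return results[0].strip() if results else default


tree = html.fromstring("<html><body><h1> Title </h1></body></html>")

title = first_text(tree, "//h1/text()")         # element exists
subtitle = first_text(tree, "//h2/text()", "")  # no <h2>, falls back to ""
print(repr(title), repr(subtitle))  # 'Title' ''
```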
Scrapy:
# Get first text node (None if nothing matches)
title = response.xpath('//h1/text()').get()
# Get all text nodes
texts = response.xpath('//div//text()').getall()
# Join and strip
description = ' '.join(response.xpath('//div//text()').getall()).strip()
Challenges and solutions:
Problem: Text split across nodes
HTML:
<p>Price: <span>$19.99</span></p>
Solution:
normalize-space(//p) # Returns: "Price: $19.99"
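To see why the text is "split", here is a minimal lxml sketch that runs the three approaches against the same markup. Variable names are illustrative.

```python
from lxml import html

p = html.fromstring("<p>Price: <span>$19.99</span></p>")

# Direct text only: the price inside the span is missed.
direct = p.xpath("//p/text()")
print(direct)  # ['Price: ']

# All descendant text nodes: complete, but still a list of pieces.
all_nodes = p.xpath("//p//text()")
print(all_nodes)  # ['Price: ', '$19.99']

# normalize-space() returns one cleaned string.
joined = p.xpath("normalize-space(//p)")
print(joined)  # Price: $19.99
```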
Problem: Extra whitespace
HTML:
<div>
Product Name
</div>
Solution:
normalize-space(//div) # Returns: "Product Name" (surrounding whitespace and newlines removed)
Problem: Only want specific nested text
HTML:
<div class="product">
<h2>Title</h2>
<p>Description</p>
</div>
Get only description:
//div[@class='product']/p/text()
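When a page has several such blocks, a common approach is to select each container first and then use relative paths (starting with ./) inside it. The sketch below assumes lxml and a made-up two-product document. The dictionary keys are illustrative.

```python
from lxml import html

# Hypothetical page with two product blocks.
doc = html.fromstring(
    "<html><body>"
    '<div class="product"><h2>Title A</h2><p>Description A</p></div>'
    '<div class="product"><h2>Title B</h2><p>Description B</p></div>'
    "</body></html>"
)

products = []
for node in doc.xpath("//div[@class='product']"):
    # "./" keeps the search inside the current product;
    # a leading "//" would search the whole document again.
    products.append({
        "title": node.xpath("normalize-space(./h2)"),
        "description": node.xpath("normalize-space(./p)"),
    })

print(products)
```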
Text matching vs extraction:
Matching (for selection):
//button[contains(text(), 'Add')]
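Finds buttons whose first direct text node contains "Add". In XPath 1.0, contains() tests only the first text() node, so use contains(., 'Add') to match against the button's full text, including nested elements.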
Extraction (for data):
//button[contains(text(), 'Add')]/text()
Extracts the direct text nodes of each matching button
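A small lxml sketch of the difference between matching on text() and matching on the element's full string value (.). The button markup and ids are made up for illustration. In the second button the word "Add" sits inside a nested span, so it is not one of the button's direct text nodes.

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    '<button id="a">Add to cart</button>'
    '<button id="b"><span>Add</span> to wishlist</button>'
    "</body></html>"
)

# text() looks at direct text nodes only, so button "b" is missed.
by_text = doc.xpath("//button[contains(text(), 'Add')]/@id")
print(by_text)  # ['a']

# "." is the full string value of the button, nested elements included.
by_value = doc.xpath("//button[contains(., 'Add')]/@id")
print(by_value)  # ['a', 'b']

# Extraction: take the cleaned full text of each matched button.
labels = [
    b.xpath("normalize-space(.)")
    for b in doc.xpath("//button[contains(., 'Add')]")
]
print(labels)  # ['Add to cart', 'Add to wishlist']
```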
Best practices:
- Use text() for simple, direct text extraction
- Use normalize-space() when whitespace is inconsistent
- Use //text() to get all text including nested elements
- Always trim/clean extracted text in your code
- Test with real HTML to handle edge cases