How do I extract text content with XPath?

XPath provides several ways to extract text. They differ in whether text from nested elements is included and in how whitespace is handled.

text() function:

Direct text only:

//p/text()

For HTML <p>Hello</p>, returns: "Hello"

Multiple text nodes:

//div/text()

For <div>Hello <span>World</span>!</div>: Returns: ["Hello ", "!"] (only direct text, not nested)

string() function:

All text content (including children):

string(//div)

For <div>Hello <span>World</span>!</div>: Returns: "Hello World!" (all text concatenated). If several div elements match, string() uses only the first one in document order.
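
The difference between text() and string() is easiest to see by running both against the same element. A minimal sketch, assuming lxml is installed; the sample markup is the div from above wrapped in a full document:

```python
from lxml import html

doc = html.fromstring(
    "<html><body><div>Hello <span>World</span>!</div></body></html>"
)

# text() selects only the text nodes that are direct children of the div
direct_nodes = doc.xpath("//div/text()")

# string() returns the concatenated text of the div and all its descendants
full_text = doc.xpath("string(//div)")

print(direct_nodes)  # ['Hello ', '!']
print(full_text)     # Hello World!
```

Note that text() yields a list while string() yields a single string, so the two results need different handling in Python.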

normalize-space():

Trimmed and normalized text:

normalize-space(//p)

For <p> Hello World </p>: Returns: "Hello World" (leading and trailing whitespace stripped, internal runs of whitespace collapsed to a single space)
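
Like string(), normalize-space() looks only at the first node when it is given several matches. The sketch below (lxml assumed, sample markup invented) shows that behavior, plus one way to normalize every match by evaluating the function relative to each element:

```python
from lxml import html

doc = html.fromstring(
    "<html><body><p>  Hello \n   World  </p><p> Second   paragraph </p></body></html>"
)

# Passing a node-set: only the first <p> is normalized
first_only = doc.xpath("normalize-space(//p)")

# Evaluating relative to each <p> normalizes all of them
each = [p.xpath("normalize-space(.)") for p in doc.xpath("//p")]

print(first_only)  # Hello World
print(each)        # ['Hello World', 'Second paragraph']
```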

Common patterns:

Extract element text:

//h1[@class='title']/text()

Extract including nested elements:

//div[@class='content']//text()

Returns all text nodes within the div and its descendants, as separate nodes in document order.
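
Because //text() returns every text node separately, the result often includes whitespace-only entries that come from the markup's indentation. A small sketch of filtering those out, assuming lxml and an invented content div:

```python
from lxml import html

doc = html.fromstring("""<html><body>
<div class="content">
  <p>First</p>
  <p>Second <b>bold</b></p>
</div>
</body></html>""")

raw_nodes = doc.xpath("//div[@class='content']//text()")

# Strip each node and drop the ones that were whitespace only
pieces = [node.strip() for node in raw_nodes if node.strip()]

print(pieces)            # ['First', 'Second', 'bold']
print(" ".join(pieces))  # First Second bold
```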

Clean whitespace:

normalize-space(//div[@class='description'])

Python implementation:

from lxml import html

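# html_content is the page source as a string
tree = html.fromstring(html_content)

# Direct text (xpath() returns a list; [0] raises IndexError if nothing matched)
title = tree.xpath('//h1/text()')[0]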

# All text (including nested)
description = ''.join(tree.xpath('//div[@class="desc"]//text()'))

# Normalized text (returns a single string, not a list)
clean_text = tree.xpath('normalize-space(//p[@class="intro"])')
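
Indexing a result with [0] fails when the page lacks the element. One way to guard against that is a small wrapper. Here first_text is a hypothetical helper name, not part of lxml, and it is intended only for expressions that select text nodes or return strings:

```python
from lxml import html


def first_text(tree, expression, default=""):
    """Return the first result of a text-producing XPath expression, stripped."""
    result = tree.xpath(expression)
    # String functions such as normalize-space() return a plain str
    if isinstance(result, str):
        return result.strip() or default
    # Node-set expressions return a list, which may be empty
    if result:
        return str(result[0]).strip()
    return default


tree = html.fromstring(
    "<html><body><h1> Widget </h1><p class='intro'>Nice  widget</p></body></html>"
)

print(first_text(tree, "//h1/text()"))                           # Widget
print(first_text(tree, "//h2/text()", default="n/a"))            # n/a
print(first_text(tree, "normalize-space(//p[@class='intro'])"))  # Nice widget
```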

Scrapy:

# Get first text node
title = response.xpath('//h1/text()').get()

# Get all text nodes
texts = response.xpath('//div//text()').getall()

# Join and strip
description = ' '.join(response.xpath('//div//text()').getall()).strip()

Challenges and solutions:

Problem: Text split across nodes

HTML:

<p>Price: <span>$19.99</span></p>

Solution:

normalize-space(//p)

Returns: "Price: $19.99"

Problem: Extra whitespace

HTML:

<div>
  Product Name
</div>

Solution:

normalize-space(//div)

Returns: "Product Name" (leading and trailing whitespace, including the line breaks, removed)

Problem: Only want specific nested text

HTML:

<div class="product">
  <h2>Title</h2>
  <p>Description</p>
</div>

Get only description:

//div[@class='product']/p/text()
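
When a page has several such blocks, a common approach is to select each container first and then evaluate relative expressions (starting with ./) inside it, so that titles and descriptions stay paired. A sketch with lxml and made-up product data:

```python
from lxml import html

doc = html.fromstring("""<html><body>
<div class="product"><h2>Title A</h2><p>Description A</p></div>
<div class="product"><h2>Title B</h2><p>Description B</p></div>
</body></html>""")

products = [
    {
        # ./ keeps the lookup inside the current product div
        "title": node.xpath("string(./h2)"),
        "description": node.xpath("string(./p)"),
    }
    for node in doc.xpath("//div[@class='product']")
]

for product in products:
    print(product["title"], "-", product["description"])
```

An expression that starts with // inside the loop would search the whole document again, which is why the relative form is used here.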

Text matching vs extraction:

Matching (for selection):

//button[contains(text(), 'Add')]

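Finds buttons whose first direct text node contains "Add". In XPath 1.0 (which lxml and Scrapy use), text() inside contains() is reduced to the first text node only. To match against all of the button's text, including nested elements, use contains(., 'Add') instead.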

Extraction (for data):

//button[contains(text(), 'Add')]/text()

Extracts the direct text nodes of each matching button
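
The choice of predicate can change which elements are selected. A sketch comparing text() with . (the element's full string value) inside contains(), assuming lxml and three invented buttons:

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<button>Add item</button>"
    "<button>Buy now<br>Add later</button>"
    "<button><b>Add</b> to cart</button>"
    "</body></html>"
)

# text() inside contains() is reduced to the first direct text node
first_node_matches = doc.xpath("//button[contains(text(), 'Add')]")

# . is the button's full string value, nested elements included
string_value_matches = doc.xpath("//button[contains(., 'Add')]")

print([b.text_content() for b in first_node_matches])
# ['Add item']
print([b.text_content() for b in string_value_matches])
# ['Add item', 'Buy nowAdd later', 'Add to cart']
```

The second button is missed by the text() form because its first text node is "Buy now", and the third because "Add" sits inside a nested element.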

Best practices:

  1. Use text() for simple, direct text extraction
  2. Use normalize-space() when whitespace is inconsistent
  3. Use //text() to get all text including nested elements
  4. Always trim/clean extracted text in your code
  5. Test with real HTML to handle edge cases
