How do I use the Sitemap directive in robots.txt?

The Sitemap directive tells crawlers where to find XML sitemaps, making it easier to discover all pages on a site.

Format in robots.txt (Sitemap lines are independent of User-agent groups and can appear anywhere in the file):

User-agent: *
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-products.xml

Benefits for scrapers:

  • Discover all pages systematically
  • Find pages not linked from main navigation
  • Get page priorities and update frequencies
  • More efficient than crawling all links

Parsing sitemaps in Python:

import requests
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup

# Get sitemap URLs from robots.txt
# (RobotFileParser.site_maps() is available since Python 3.8)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
sitemaps = rp.site_maps() or []

# Fallback for older Python versions: parse the Sitemap: lines yourself
if not sitemaps:
    robots_content = requests.get("https://example.com/robots.txt").text
    sitemaps = [line.split(':', 1)[1].strip()
                for line in robots_content.splitlines()
                if line.lower().startswith('sitemap:')]

# Parse each sitemap (the 'xml' parser requires lxml)
for sitemap_url in sitemaps:
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.content, 'xml')

    # Extract all page URLs listed in <loc> elements
    urls = [loc.text for loc in soup.find_all('loc')]

    for url in urls:
        # Scrape each URL
        print(url)

Sitemap types:

Regular sitemap (sitemap.xml):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
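
Since lastmod, changefreq and priority are all optional, it can help to collect each <url> entry into a plain dict before deciding what to scrape. A minimal sketch (parse_urlset is an illustrative helper, not a library function), assuming soup holds a parsed urlset as in the code above:

def parse_urlset(soup):
    """Collect each <url> entry and its optional metadata into a dict."""
    entries = []
    for url_element in soup.find_all('url'):
        entry = {'loc': url_element.find('loc').text}
        # lastmod, changefreq and priority may each be absent
        for field in ('lastmod', 'changefreq', 'priority'):
            tag = url_element.find(field)
            if tag is not None:
                entry[field] = tag.text
        entries.append(entry)
    return entries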

Sitemap index (links to multiple sitemaps):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-news.xml</loc>
  </sitemap>
</sitemapindex>
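
In a sitemap index, each <loc> points at another sitemap rather than at a page, so the parser needs to recurse. A minimal sketch (fetch_sitemap_urls is an illustrative helper), reusing the requests and BeautifulSoup setup from above:

import requests
from bs4 import BeautifulSoup

def fetch_sitemap_urls(sitemap_url):
    """Return page URLs from a sitemap, descending into sitemap indexes."""
    response = requests.get(sitemap_url, timeout=30)
    soup = BeautifulSoup(response.content, 'xml')

    page_urls = []
    if soup.find('sitemapindex') is not None:
        # Index file: each <loc> is another sitemap, so recurse into it
        for loc in soup.find_all('loc'):
            page_urls.extend(fetch_sitemap_urls(loc.text))
    else:
        # Regular urlset: each <loc> is a page URL
        page_urls.extend(loc.text for loc in soup.find_all('loc'))
    return page_urls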

Using sitemap metadata:

lastmod: When the page was last modified

  • Helps you scrape only updated pages
  • Implement incremental scraping

changefreq: How often the page changes

  • Valid values: always, hourly, daily, weekly, monthly, yearly, never
  • Use to prioritize scraping schedule

priority: Relative importance (0.0 to 1.0)

  • Start with high-priority pages
  • Skip low-priority pages if not needed

Example: Incremental scraping:

from datetime import datetime, timedelta, timezone

# Only scrape pages modified in the last 7 days
cutoff_date = datetime.now(timezone.utc) - timedelta(days=7)

for url_element in soup.find_all('url'):
    loc = url_element.find('loc').text
    lastmod = url_element.find('lastmod')

    if lastmod is not None:
        # lastmod may be a plain date (2024-01-01) or a full W3C timestamp;
        # normalize a trailing 'Z' so fromisoformat() accepts it before Python 3.11
        lastmod_date = datetime.fromisoformat(lastmod.text.replace('Z', '+00:00'))
        if lastmod_date.tzinfo is None:
            lastmod_date = lastmod_date.replace(tzinfo=timezone.utc)
        if lastmod_date > cutoff_date:
            # Scrape this page (it has been updated recently)
            scrape_page(loc)  # scrape_page() is your own scraping function
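
The priority hint can be filtered on in the same way. A minimal sketch that skips low-priority entries (the 0.5 threshold is arbitrary, and scrape_page is the same placeholder as above):

# Skip entries below an arbitrary priority threshold
for url_element in soup.find_all('url'):
    loc = url_element.find('loc').text
    priority = url_element.find('priority')

    # The protocol's default priority is 0.5 when the tag is missing
    value = float(priority.text) if priority is not None else 0.5
    if value >= 0.5:
        scrape_page(loc)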

Best practices:

  1. Always check for sitemaps before crawling
  2. Handle sitemap indexes (recursive parsing)
  3. Respect robots.txt when accessing sitemap URLs
  4. Cache sitemaps to avoid re-downloading
  5. Use lastmod for incremental updates
  6. Handle compressed sitemaps (.gz files)
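
For compressed sitemaps (point 6), note that requests only transparently decodes Content-Encoding responses; a sitemap stored as a .gz file arrives compressed. A minimal sketch of handling both cases (fetch_sitemap_content is an illustrative helper):

import gzip
import requests
from bs4 import BeautifulSoup

def fetch_sitemap_content(sitemap_url):
    """Download a sitemap, decompressing it if it arrives as a gzip file."""
    response = requests.get(sitemap_url, timeout=30)
    content = response.content

    # Gzip files start with the magic bytes 0x1f 0x8b; plain XML does not
    if content[:2] == b'\x1f\x8b':
        content = gzip.decompress(content)
    return BeautifulSoup(content, 'xml')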

Scrapy integration:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'sitemap_spider'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        # Process each URL from sitemap
        yield {'url': response.url}
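
SitemapSpider can also be pointed at robots.txt, in which case Scrapy extracts the Sitemap: lines itself, and sitemap_rules routes matching URLs to specific callbacks. A brief sketch (the '/products/' pattern and parse_product callback are illustrative):

from scrapy.spiders import SitemapSpider

class ProductSpider(SitemapSpider):
    name = 'product_sitemap_spider'
    # Scrapy reads the Sitemap: lines out of robots.txt for you
    sitemap_urls = ['https://example.com/robots.txt']
    # Only follow URLs matching /products/ and send them to parse_product
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        yield {'url': response.url}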

Sitemaps make scraping more efficient and comprehensive.
