How do I use the Sitemap directive in robots.txt?
The Sitemap directive tells crawlers where to find a site's XML sitemaps, making it much easier to discover the site's pages without following every link.
Format in robots.txt:
User-agent: *
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-products.xml
Benefits for scrapers:
- Discover all pages systematically
- Find pages not linked from main navigation
- Get page priorities and update frequencies
- More efficient than crawling all links
Parsing sitemaps in Python:
import requests
from urllib.robotparser import RobotFileParser
from bs4 import BeautifulSoup

# Get sitemap URLs from robots.txt
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# RobotFileParser.site_maps() (Python 3.8+) returns the Sitemap entries,
# or None if there are none
sitemaps = rp.site_maps() or []

# Fallback: extract the Sitemap lines manually
if not sitemaps:
    robots_content = requests.get("https://example.com/robots.txt").text
    sitemaps = [line.split(':', 1)[1].strip()
                for line in robots_content.splitlines()
                if line.lower().startswith('sitemap:')]

# Parse each sitemap (the 'xml' parser requires lxml)
for sitemap_url in sitemaps:
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.content, 'xml')

    # Extract all page URLs
    urls = [loc.text for loc in soup.find_all('loc')]
    for url in urls:
        # Scrape each URL
        print(url)
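Discovered URLs should still be checked against the site's crawl rules before fetching. The rp object above already has those rules loaded, so a check can take the place of the print(url) call in the inner loop. A minimal sketch; the USER_AGENT string is a hypothetical identifier for your crawler:
USER_AGENT = 'my-scraper/1.0'  # hypothetical; use your crawler's real User-Agent

for url in urls:
    # Skip anything robots.txt disallows for this user agent
    if rp.can_fetch(USER_AGENT, url):
        print(url)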
Sitemap types:
Regular sitemap (sitemap.xml):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Sitemap index (links to multiple sitemaps):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-news.xml</loc>
  </sitemap>
</sitemapindex>
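Large sites usually publish a sitemap index, so a crawler has to recurse through the child sitemaps before it reaches any page URLs (see also "Handle sitemap indexes" under best practices below). A minimal sketch using the same requests/BeautifulSoup setup as above; collect_page_urls is a hypothetical helper name:
import requests
from bs4 import BeautifulSoup

def collect_page_urls(sitemap_url):
    """Return all page URLs under a sitemap, recursing into sitemap indexes."""
    response = requests.get(sitemap_url, timeout=30)
    soup = BeautifulSoup(response.content, 'xml')

    if soup.find('sitemapindex') is not None:
        # Index file: every <loc> points at another sitemap, so recurse
        urls = []
        for loc in soup.find_all('loc'):
            urls.extend(collect_page_urls(loc.text))
        return urls

    # Regular sitemap: every <loc> is a page URL
    return [loc.text for loc in soup.find_all('loc')]
Calling collect_page_urls('https://example.com/sitemap.xml') then works whether that URL is a plain sitemap or an index.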
Using sitemap metadata:
lastmod: When the page was last modified
- Helps you scrape only updated pages
- Implement incremental scraping
changefreq: How often the page changes
- Valid values: always, hourly, daily, weekly, monthly, yearly, never
- Use to prioritize scraping schedule
priority: Relative importance (0.0 to 1.0)
- Start with high-priority pages
- Skip low-priority pages if not needed
Example: Incremental scraping:
from datetime import datetime, timedelta

# Only scrape pages modified in the last 7 days
cutoff_date = datetime.now() - timedelta(days=7)

for url_element in soup.find_all('url'):
    loc = url_element.find('loc').text
    lastmod = url_element.find('lastmod')
    if lastmod is not None:
        # Normalize a trailing "Z" so fromisoformat accepts it on Python < 3.11,
        # and drop the timezone so the comparison with the naive cutoff_date works
        lastmod_text = lastmod.text.replace('Z', '+00:00')
        lastmod_date = datetime.fromisoformat(lastmod_text).replace(tzinfo=None)
        if lastmod_date > cutoff_date:
            # Scrape this page (it's been updated recently)
            scrape_page(loc)
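priority can drive ordering in the same way: sort the <url> entries so the highest-priority pages are scraped first. A short sketch against the same soup object; entries without an explicit priority fall back to 0.5, the protocol's default:
def entry_priority(url_element):
    priority = url_element.find('priority')
    # The sitemap protocol treats a missing <priority> as 0.5
    return float(priority.text) if priority is not None else 0.5

# Scrape high-priority pages first
for url_element in sorted(soup.find_all('url'), key=entry_priority, reverse=True):
    scrape_page(url_element.find('loc').text)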
Best practices:
- Always check for sitemaps before crawling
- Handle sitemap indexes (recursive parsing)
- Respect robots.txt when accessing sitemap URLs
- Cache sitemaps to avoid re-downloading
- Use lastmod for incremental updates
- Handle compressed sitemaps (.gz files)
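Compressed sitemaps deserve a concrete note: .gz sitemap files are usually served as plain binary downloads rather than with a gzip Content-Encoding header, so requests will not decompress them for you. A small sketch that checks the gzip magic bytes before parsing; fetch_sitemap is a hypothetical helper:
import gzip
import requests
from bs4 import BeautifulSoup

def fetch_sitemap(sitemap_url):
    """Download a sitemap and parse it, transparently handling .gz files."""
    content = requests.get(sitemap_url, timeout=30).content
    # Gzip streams start with the magic bytes 0x1f 0x8b
    if content[:2] == b'\x1f\x8b':
        content = gzip.decompress(content)
    return BeautifulSoup(content, 'xml')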
Scrapy integration:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'sitemap_spider'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def parse(self, response):
        # Process each URL from the sitemap
        yield {'url': response.url}
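SitemapSpider can also filter what it crawls: sitemap_urls may point at robots.txt directly, sitemap_follow limits which entries of a sitemap index are followed, and sitemap_rules routes page URLs to callbacks by regex. A sketch assuming product pages live under /products/:
from scrapy.spiders import SitemapSpider

class ProductSpider(SitemapSpider):
    name = 'product_sitemap_spider'
    # Pointing at robots.txt makes Scrapy extract the Sitemap directives itself
    sitemap_urls = ['https://example.com/robots.txt']
    # Only follow child sitemaps whose URL matches this pattern
    sitemap_follow = ['products']
    # Route matching page URLs to parse_product; everything else is ignored
    sitemap_rules = [('/products/', 'parse_product')]

    def parse_product(self, response):
        yield {'url': response.url}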
Sitemaps make scraping more efficient and comprehensive.