How do I parse and respect robots.txt in my scraper?
Parsing robots.txt lets your scraper follow a site's published crawling rules and reduces the risk of being blocked.
Python (urllib.robotparser):
import time

import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check if a URL can be fetched ("*" matches any user agent)
url = "https://example.com/products"
can_fetch = rp.can_fetch("*", url)
if can_fetch:
    # Proceed with scraping
    response = requests.get(url)
else:
    print("Blocked by robots.txt")

# Get the crawl delay (in seconds) and honor it if one is declared
crawl_delay = rp.crawl_delay("*")
if crawl_delay:
    time.sleep(crawl_delay)
Python with Scrapy:
Scrapy respects robots.txt automatically:
# In settings.py
ROBOTSTXT_OBEY = True
# With this enabled, Scrapy will automatically:
# - Download robots.txt for each domain
# - Parse its rules
# - Filter out disallowed requests
# Note: Crawl-delay from robots.txt is not applied by this setting;
# pace requests separately with DOWNLOAD_DELAY or AutoThrottle.
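For a per-spider setup, here is a minimal sketch; the spider name, start URL, and delay value are placeholders for illustration, not part of the original answer:

import scrapy

class ProductsSpider(scrapy.Spider):
    # Hypothetical spider for illustration
    name = "products"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # override the project-wide setting per spider
        "DOWNLOAD_DELAY": 1.0,    # request pacing is configured separately from robots.txt
    }

    def parse(self, response):
        # Disallowed URLs never reach this callback; RobotsTxtMiddleware drops them first
        yield {"url": response.url, "title": response.css("title::text").get()}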
Node.js (robots-parser):
const robotsParser = require('robots-parser');
const fetch = require('node-fetch');

async function checkRobots() {
  // Fetch and parse robots.txt
  const response = await fetch('https://example.com/robots.txt');
  const robotsTxt = await response.text();
  const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

  // Check if a specific URL is allowed for your user agent
  const canCrawl = robots.isAllowed('https://example.com/page', 'MyBot');
  if (canCrawl) {
    // Scrape the page
  }

  // Get the crawl delay (seconds, or undefined if not declared)
  const delay = robots.getCrawlDelay('MyBot');
  return { canCrawl, delay };
}
Important directives:
User-agent: Specifies which bot the rules apply to:
- User-agent: * applies to all bots
- User-agent: Googlebot applies only to Google's crawler
- Use * or a descriptive name as your bot's identifier when checking rules
Disallow: Paths that shouldn't be crawled:
- Disallow: /admin/ blocks all URLs starting with /admin/
- Disallow: / blocks everything
- An empty Disallow allows everything
Allow: Overrides Disallow for specific paths:
- Allow: /public/ allows /public/ even if a parent path is disallowed
Crawl-delay: Seconds to wait between requests:
- Crawl-delay: 10 means wait 10 seconds between requests
Sitemap: Location of the XML sitemap:
- Sitemap: https://example.com/sitemap.xml
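To see how these directives are interpreted, here is a small sketch that feeds an example robots.txt (invented for illustration) to urllib.robotparser. Note that this parser applies rules in file order, which is why the Allow line is listed before the Disallow line:

from urllib.robotparser import RobotFileParser

# Example robots.txt, invented for illustration
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/users"))     # False: matches Disallow: /admin/
print(rp.can_fetch("*", "https://example.com/admin/public/x"))  # True: matches Allow: /admin/public/
print(rp.crawl_delay("*"))                                       # 10
print(rp.site_maps())                                            # ['https://example.com/sitemap.xml'] (Python 3.8+)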
Best practices:
- Cache robots.txt per host (don't fetch it on every request); see the sketch after this list
- Refresh the cache periodically (e.g. every 24 hours)
- Handle a missing robots.txt by treating it as "allow all"
- Respect the most specific rule that matches your bot's user agent
- Honor Crawl-delay between requests, and back off exponentially on errors or 429 responses
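A minimal caching sketch for the first two points, assuming urllib.robotparser and a 24-hour refresh window; the helper name and cache layout are illustrative:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

CACHE_TTL = 24 * 60 * 60   # refresh cached robots.txt every 24 hours
_robots_cache = {}          # host -> (RobotFileParser, fetched_at)

def get_robots(url):
    """Return a cached RobotFileParser for the URL's host, refetching when stale."""
    host = urlparse(url).netloc
    cached = _robots_cache.get(host)
    if cached and time.time() - cached[1] < CACHE_TTL:
        return cached[0]
    rp = RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    rp.read()
    _robots_cache[host] = (rp, time.time())
    return rp

# Usage: consult the cached rules before every request
url = "https://example.com/products"
if get_robots(url).can_fetch("MyBot", url):
    pass  # fetch the page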
Error handling:
try:
    rp.read()
except Exception as e:
    # If robots.txt doesn't exist or fails to load,
    # assume crawling is allowed but proceed carefully
    print(f"robots.txt unavailable: {e}")