How do I parse and respect robots.txt in my scraper?

Parsing robots.txt lets your scraper follow a site's crawling rules and reduces the chance of being blocked.

Python (urllib.robotparser):

import time
from urllib.robotparser import RobotFileParser

import requests

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/products"

# Check if the URL can be fetched (here "*" means "any bot")
if rp.can_fetch("*", url):
    # Proceed with scraping
    response = requests.get(url)
else:
    print("Blocked by robots.txt")

# Get crawl delay in seconds (None if the directive is absent)
crawl_delay = rp.crawl_delay("*")
if crawl_delay:
    time.sleep(crawl_delay)

Python with Scrapy:

Scrapy handles robots.txt automatically once ROBOTSTXT_OBEY is enabled:

# In settings.py
ROBOTSTXT_OBEY = True

# Scrapy will automatically:
# - Download robots.txt for each domain
# - Parse the rules
# - Drop requests that robots.txt disallows
#
# Note: Scrapy does not read Crawl-delay from robots.txt; pace requests
# yourself with DOWNLOAD_DELAY or AutoThrottle.
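
As a concrete sketch, a spider can also carry these settings itself via custom_settings; the spider name, domain, user agent, and parse logic below are placeholders, not part of Scrapy:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    # Per-spider settings override settings.py
    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # drop requests disallowed by robots.txt
        "DOWNLOAD_DELAY": 2,      # seconds between requests to the same site
        "USER_AGENT": "MyBot (+https://example.com/bot-info)",
    }

    def parse(self, response):
        # Collect absolute links from the page
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}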

Node.js (robots-parser):

const robotsParser = require('robots-parser');
const fetch = require('node-fetch'); // node-fetch v2 for require(); Node 18+ has a global fetch

async function checkRobots() {
    // Fetch and parse robots.txt
    const response = await fetch('https://example.com/robots.txt');
    const robotsTxt = await response.text();
    const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

    // Check if allowed for this user agent
    const canCrawl = robots.isAllowed('https://example.com/page', 'MyBot');

    if (canCrawl) {
        // Scrape the page
    }

    // Get crawl delay (undefined if not set)
    const delay = robots.getCrawlDelay('MyBot');
}

Important directives:

User-agent: Specifies which bot the rules apply to:

  • User-agent: * applies to all bots
  • User-agent: Googlebot applies only to Google's crawler
  • When checking rules, pass your bot's own user-agent string; the * group is used only if no more specific group matches (see the sketch below)
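
To see the matching in action, the sketch below feeds a small, made-up robots.txt into urllib.robotparser; "MyBot" is a placeholder name:

from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(SAMPLE.splitlines())

# Googlebot matches its own group, which allows everything ...
print(rp.can_fetch("Googlebot", "/private/page"))  # True
# ... while any other bot falls back to the * group
print(rp.can_fetch("MyBot", "/private/page"))      # False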

Disallow: Paths that shouldn't be crawled:

  • Disallow: /admin/ blocks all URLs starting with /admin/
  • Disallow: / blocks everything
  • Empty Disallow allows everything
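
A quick way to check the prefix behaviour is to parse a one-rule file; the /admin/ rule and "MyBot" name below are just for illustration:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: *\nDisallow: /admin/".splitlines())

print(rp.can_fetch("MyBot", "/admin/users"))    # False: starts with /admin/
print(rp.can_fetch("MyBot", "/administrator"))  # True: different prefix
print(rp.can_fetch("MyBot", "/products"))       # True: no rule matches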

Allow: Overrides Disallow for specific paths:

  • Allow: /public/ allows /public/ even if parent is disallowed
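
Parsers differ on precedence here: Google's spec picks the longest matching rule, while urllib.robotparser applies rules in file order. The sketch below lists Allow before Disallow so both interpretations agree; the paths are made up:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /private/public/
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyBot", "/private/public/page"))  # True: Allow wins
print(rp.can_fetch("MyBot", "/private/secret"))       # False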

Crawl-delay: Seconds to wait between requests:

  • Crawl-delay: 10 means wait 10 seconds between requests
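
If you want a single number to sleep between requests, a small helper can combine Crawl-delay with the rarer Request-rate directive, both of which urllib.robotparser exposes; the 1-second default below is an arbitrary assumption:

def polite_delay(rp, user_agent="*", default=1.0):
    """Pick a delay in seconds from a parsed robots.txt, with a fallback."""
    delay = rp.crawl_delay(user_agent)
    if delay is not None:
        return float(delay)
    rate = rp.request_rate(user_agent)  # e.g. "Request-rate: 1/5" = 1 request per 5 s
    if rate is not None:
        return rate.seconds / rate.requests
    return default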

Sitemap: Location of XML sitemap:

  • Sitemap: https://example.com/sitemap.xml
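
On Python 3.8+, urllib.robotparser can also hand you those sitemap URLs for seeding your crawl; the domain below is a placeholder:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# site_maps() returns a list of sitemap URLs, or None if there are none
for sitemap_url in rp.site_maps() or []:
    print(sitemap_url)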

Best practices:

  1. Cache robots.txt per host instead of fetching it before every request (see the sketch after this list)
  2. Refresh the cache periodically (e.g. every 24 hours)
  3. Handle a missing robots.txt by treating it as "allow all"
  4. Respect the most specific rule that matches
  5. Honor Crawl-delay between requests, and back off exponentially on errors such as HTTP 429 or 503
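
A minimal sketch of points 1-3, assuming an in-memory dict and a 24-hour TTL (both arbitrary choices); get_robots, allowed, and "MyBot" are illustrative names:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_CACHE = {}          # host -> (RobotFileParser or None, fetched_at)
_TTL = 24 * 60 * 60  # refresh after 24 hours


def get_robots(url):
    parts = urlparse(url)
    host = parts.netloc
    cached = _CACHE.get(host)
    if cached and time.time() - cached[1] < _TTL:
        return cached[0]

    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{host}/robots.txt")
    try:
        rp.read()
    except Exception:
        rp = None  # robots.txt unreachable: treat as "allow all"
    _CACHE[host] = (rp, time.time())
    return rp


def allowed(url, user_agent="MyBot"):
    rp = get_robots(url)
    return True if rp is None else rp.can_fetch(user_agent, url)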

Error handling:

try:
    rp.read()
except Exception as e:
    # read() only raises when robots.txt cannot be fetched at all
    # (urllib.robotparser already treats a 404 as "allow all" and a
    # 401/403 as "disallow all" without raising).
    # Assume crawling is allowed, but proceed carefully.
    print(f"robots.txt unavailable: {e}")
