How do I parse and respect robots.txt in my scraper?

Parsing robots.txt lets your scraper follow a site's crawling rules and reduces the chance of being blocked.

Python (urllib.robotparser):

import time
from urllib.robotparser import RobotFileParser

import requests

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/products"

# Check if the URL can be fetched (here "*" means "any bot")
if rp.can_fetch("*", url):
    # Proceed with scraping
    response = requests.get(url)
else:
    print("Blocked by robots.txt")

# Get crawl delay in seconds (None if the directive is absent)
crawl_delay = rp.crawl_delay("*")
if crawl_delay:
    time.sleep(crawl_delay)

Python with Scrapy:

Scrapy handles robots.txt automatically once ROBOTSTXT_OBEY is enabled:

# In settings.py
ROBOTSTXT_OBEY = True

# Scrapy will automatically:
# - Download robots.txt for each domain
# - Parse the rules
# - Drop requests that robots.txt disallows
#
# Note: Scrapy does not read Crawl-delay from robots.txt; pace requests
# yourself with DOWNLOAD_DELAY or AutoThrottle.
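
As a concrete sketch, a spider can also carry these settings itself via custom_settings; the spider name, domain, user agent, and parse logic below are placeholders, not part of Scrapy:

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    # Per-spider settings override settings.py
    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # drop requests disallowed by robots.txt
        "DOWNLOAD_DELAY": 2,      # seconds between requests to the same site
        "USER_AGENT": "MyBot (+https://example.com/bot-info)",
    }

    def parse(self, response):
        # Collect absolute links from the page
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}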

Node.js (robots-parser):

const robotsParser = require('robots-parser');
const fetch = require('node-fetch'); // node-fetch v2 for require(); Node 18+ has a global fetch

async function checkRobots() {
    // Fetch and parse robots.txt
    const response = await fetch('https://example.com/robots.txt');
    const robotsTxt = await response.text();
    const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

    // Check if allowed for this user agent
    const canCrawl = robots.isAllowed('https://example.com/page', 'MyBot');

    if (canCrawl) {
        // Scrape the page
    }

    // Get crawl delay (undefined if not set)
    const delay = robots.getCrawlDelay('MyBot');
}

Important directives:

User-agent: Specifies which bot the rules apply to:

  • User-agent: * applies to all bots
  • User-agent: Googlebot applies only to Google's crawler
  • When checking rules, pass your bot's own user-agent string; the * group is used only if no more specific group matches (see the sketch below)
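
To see the matching in action, the sketch below feeds a small, made-up robots.txt into urllib.robotparser; "MyBot" is a placeholder name:

from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(SAMPLE.splitlines())

# Googlebot matches its own group, which allows everything ...
print(rp.can_fetch("Googlebot", "/private/page"))  # True
# ... while any other bot falls back to the * group
print(rp.can_fetch("MyBot", "/private/page"))      # False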

Disallow: Paths that shouldn't be crawled:

  • Disallow: /admin/ blocks all URLs starting with /admin/
  • Disallow: / blocks everything
  • Empty Disallow allows everything
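
A quick way to check the prefix behaviour is to parse a one-rule file; the /admin/ rule and "MyBot" name below are just for illustration:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: *\nDisallow: /admin/".splitlines())

print(rp.can_fetch("MyBot", "/admin/users"))    # False: starts with /admin/
print(rp.can_fetch("MyBot", "/administrator"))  # True: different prefix
print(rp.can_fetch("MyBot", "/products"))       # True: no rule matches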

Allow: Overrides Disallow for specific paths:

  • Allow: /public/ allows /public/ even if parent is disallowed
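
Parsers differ on precedence here: Google's spec picks the longest matching rule, while urllib.robotparser applies rules in file order. The sketch below lists Allow before Disallow so both interpretations agree; the paths are made up:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /private/public/
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyBot", "/private/public/page"))  # True: Allow wins
print(rp.can_fetch("MyBot", "/private/secret"))       # False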

Crawl-delay: Seconds to wait between requests:

  • Crawl-delay: 10 means wait 10 seconds between requests
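
If you want a single number to sleep between requests, a small helper can combine Crawl-delay with the rarer Request-rate directive, both of which urllib.robotparser exposes; the 1-second default below is an arbitrary assumption:

def polite_delay(rp, user_agent="*", default=1.0):
    """Pick a delay in seconds from a parsed robots.txt, with a fallback."""
    delay = rp.crawl_delay(user_agent)
    if delay is not None:
        return float(delay)
    rate = rp.request_rate(user_agent)  # e.g. "Request-rate: 1/5" = 1 request per 5 s
    if rate is not None:
        return rate.seconds / rate.requests
    return default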

Sitemap: Location of XML sitemap:

  • Sitemap: https://example.com/sitemap.xml
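
On Python 3.8+, urllib.robotparser can also hand you those sitemap URLs for seeding your crawl; the domain below is a placeholder:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# site_maps() returns a list of sitemap URLs, or None if there are none
for sitemap_url in rp.site_maps() or []:
    print(sitemap_url)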

Best practices:

  1. Cache robots.txt per host instead of fetching it before every request (see the sketch after this list)
  2. Refresh the cache periodically (e.g. every 24 hours)
  3. Handle a missing robots.txt by treating it as "allow all"
  4. Respect the most specific rule that matches
  5. Honor Crawl-delay between requests, and back off exponentially on errors such as HTTP 429 or 503
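
A minimal sketch of points 1-3, assuming an in-memory dict and a 24-hour TTL (both arbitrary choices); get_robots, allowed, and "MyBot" are illustrative names:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_CACHE = {}          # host -> (RobotFileParser or None, fetched_at)
_TTL = 24 * 60 * 60  # refresh after 24 hours


def get_robots(url):
    parts = urlparse(url)
    host = parts.netloc
    cached = _CACHE.get(host)
    if cached and time.time() - cached[1] < _TTL:
        return cached[0]

    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{host}/robots.txt")
    try:
        rp.read()
    except Exception:
        rp = None  # robots.txt unreachable: treat as "allow all"
    _CACHE[host] = (rp, time.time())
    return rp


def allowed(url, user_agent="MyBot"):
    rp = get_robots(url)
    return True if rp is None else rp.can_fetch(user_agent, url)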

Error handling:

try:
    rp.read()
except Exception as e:
    # read() only raises when robots.txt cannot be fetched at all
    # (urllib.robotparser already treats a 404 as "allow all" and a
    # 401/403 as "disallow all" without raising).
    # Assume crawling is allowed, but proceed carefully.
    print(f"robots.txt unavailable: {e}")
