What is Crawl-delay in robots.txt and how should I implement it?

Crawl-delay is a non-standard robots.txt directive that asks crawlers to wait a given number of seconds between successive requests so they don't overload the server. Support varies (Google ignores it, while Bing and Yandex honor it), but a well-behaved scraper should respect it when present.

Format:

User-agent: *
Crawl-delay: 10

This means: wait 10 seconds between requests.
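
In a fuller robots.txt the directive sits alongside the usual allow/disallow rules, and different user agents can be given different delays. The bot name and path below are illustrative:

User-agent: *
Crawl-delay: 10
Disallow: /admin/

User-agent: ExampleBot
Crawl-delay: 30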

Why it exists:

  • Prevents server overload
  • Ensures fair resource distribution
  • Protects against aggressive crawlers
  • Helps sites with limited capacity

Implementation in Python:

import time
import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Get the crawl delay for your user agent; crawl_delay() returns None if unset
crawl_delay = rp.crawl_delay("*") or 1  # Default to 1 second

urls = ["https://example.com/page1", "https://example.com/page2"]  # your URL list

for url in urls:
    if rp.can_fetch("*", url):
        response = requests.get(url)
        # Process the response here
        time.sleep(crawl_delay)

Implementation with rate limiting:

from time import time, sleep

import requests

class RateLimiter:
    """Keeps at least crawl_delay seconds between consecutive requests."""

    def __init__(self, crawl_delay):
        self.crawl_delay = crawl_delay
        self.last_request = 0  # timestamp of the previous request (0 = none yet)

    def wait(self):
        # Sleep only for whatever part of the delay hasn't already elapsed
        elapsed = time() - self.last_request
        if elapsed < self.crawl_delay:
            sleep(self.crawl_delay - elapsed)
        self.last_request = time()

limiter = RateLimiter(crawl_delay=10)

for url in urls:  # urls defined as in the previous example
    limiter.wait()
    response = requests.get(url)

Common values:

  • 1-5 seconds: Most common for moderate-traffic sites
  • 10-30 seconds: High-traffic or resource-constrained sites
  • 60+ seconds: Very sensitive sites or explicit rate limiting

When no Crawl-delay is specified:

Use reasonable defaults and adjust as you go (a small adaptive sketch follows this list):

  • Start with 1-2 seconds between requests
  • Monitor response times
  • Increase delay if you see errors or slowdowns
  • Decrease cautiously if performance is good
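
A minimal adaptive sketch along those lines, assuming the requests library; the thresholds and factors are illustrative, not prescriptive:

import time
import requests

def crawl_with_adaptive_delay(urls, delay=2.0, min_delay=1.0, max_delay=60.0):
    # Start around 1-2 seconds, back off on trouble, relax cautiously when healthy
    for url in urls:
        response = requests.get(url)
        if response.status_code in (429, 503) or response.elapsed.total_seconds() > 5:
            delay = min(delay * 2, max_delay)    # errors or slow responses: slow down
        elif response.ok:
            delay = max(delay * 0.9, min_delay)  # healthy: ease back a little
        yield url, response
        time.sleep(delay)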

Scrapy implementation:

# settings.py
ROBOTSTXT_OBEY = True             # respect robots.txt allow/disallow rules
DOWNLOAD_DELAY = 2                # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay between 0.5x and 1.5x DOWNLOAD_DELAY

# Note: ROBOTSTXT_OBEY only enforces allow/disallow rules. Scrapy does not apply
# the Crawl-delay directive automatically, so set DOWNLOAD_DELAY yourself to at
# least the site's Crawl-delay value.
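
If you want Scrapy to adapt its delay to observed server latency, its built-in AutoThrottle extension is one option; the values below are illustrative, not recommendations:

# settings.py (optional)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound on the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server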

Advanced: Concurrent requests with delay:

Even with concurrency, respect per-domain rate limits. The example below uses aiohttp and a semaphore to serialize requests:

import asyncio

import aiohttp

async def fetch_with_delay(session, semaphore, url, delay):
    # Allow one request at a time, with a pause before each fetch
    async with semaphore:
        await asyncio.sleep(delay)
        async with session.get(url) as response:
            return await response.text()

async def crawl(urls, crawl_delay):
    semaphore = asyncio.Semaphore(1)  # one request at a time for this domain
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_delay(session, semaphore, url, crawl_delay) for url in urls]
        return await asyncio.gather(*tasks)

# results = asyncio.run(crawl(urls, crawl_delay))
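
When a crawl spans several domains, one semaphore per domain keeps the limit per-site rather than global. A minimal sketch in the same style (domain_locks and fetch_per_domain are names introduced here for illustration):

from collections import defaultdict
from urllib.parse import urlparse

# One semaphore per domain so each site gets its own one-at-a-time limit
domain_locks = defaultdict(lambda: asyncio.Semaphore(1))

async def fetch_per_domain(session, url, delay):
    domain = urlparse(url).netloc
    async with domain_locks[domain]:
        await asyncio.sleep(delay)
        async with session.get(url) as response:
            return await response.text()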

Best practices:

  1. Always implement some delay, even if not specified (1-2 seconds)
  2. Randomize delays slightly to appear more human
  3. Respect crawl-delay even if it seems excessive
  4. Monitor your scraper for 429 (Too Many Requests) errors
  5. Implement exponential backoff on errors (see the sketch after this list)
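
A minimal backoff sketch, assuming the requests library; the retry count and delays are illustrative:

import random
import time
import requests

def fetch_with_backoff(url, max_retries=5, base_delay=2.0):
    # Retry on 429/503 with exponentially growing waits plus jitter
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code not in (429, 503):
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            wait = int(retry_after)              # the server said how long to wait
        else:
            wait = base_delay * (2 ** attempt)   # otherwise double the wait each time
        time.sleep(wait + random.uniform(0, 1))  # jitter so retries don't synchronize
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")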

Red flags that you're going too fast:

  • 429 (Too Many Requests) responses
  • 503 (Service Unavailable) responses
  • Increasing response times
  • Sudden connection errors
  • IP ban or CAPTCHA challenges
