What is Crawl-delay in robots.txt and how should I implement it?
Crawl-delay is a non-standard robots.txt directive that tells crawlers how many seconds to wait between successive requests so they do not overload the server. Not every crawler honors it (Googlebot, for example, ignores the directive), but a polite scraper should treat it as binding.
Format:
User-agent: *
Crawl-delay: 10
This means: wait 10 seconds between requests.
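Crawl-delay can also be set per crawler: a group naming a specific user agent overrides the wildcard group for that crawler. An illustrative robots.txt (the bot name is made up):
User-agent: *
Crawl-delay: 5

User-agent: ExampleBot
Crawl-delay: 30
Disallow: /search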
Why it exists:
- Prevents server overload
- Ensures fair resource distribution
- Protects against aggressive crawlers
- Helps sites with limited capacity
Implementation in Python:
import time

import requests
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Get the crawl delay for our user agent; fall back to 1 second if none is set
crawl_delay = rp.crawl_delay("*") or 1

for url in urls:  # urls: the list of pages you plan to fetch
    if rp.can_fetch("*", url):
        response = requests.get(url)
        # Process response here
        time.sleep(crawl_delay)
Implementation with rate limiting:
import requests
from time import time, sleep

class RateLimiter:
    def __init__(self, crawl_delay):
        self.crawl_delay = crawl_delay
        self.last_request = 0

    def wait(self):
        # Only sleep for whatever part of the crawl delay has not already elapsed
        elapsed = time() - self.last_request
        if elapsed < self.crawl_delay:
            sleep(self.crawl_delay - elapsed)
        self.last_request = time()

limiter = RateLimiter(crawl_delay=10)
for url in urls:
    limiter.wait()
    response = requests.get(url)
Common values:
- 1-5 seconds: Most common for moderate-traffic sites
- 10-30 seconds: High-traffic or resource-constrained sites
- 60+ seconds: Very sensitive sites or explicit rate limiting
When no Crawl-delay is specified:
Use reasonable defaults (a minimal adaptive sketch follows this list):
- Start with 1-2 seconds between requests
- Monitor response times
- Increase delay if you see errors or slowdowns
- Decrease cautiously if performance is good
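A minimal sketch of that adaptive approach, assuming the requests library and the same urls list as the earlier examples; the thresholds and multipliers here are illustrative, not prescribed values:
import time

import requests

delay = 2.0                       # starting default when robots.txt sets no Crawl-delay
MIN_DELAY, MAX_DELAY = 1.0, 30.0

for url in urls:
    start = time.monotonic()
    response = requests.get(url)
    response_time = time.monotonic() - start

    if response.status_code >= 400 or response_time > 5:
        delay = min(delay * 2, MAX_DELAY)    # back off on errors or slow responses
    elif response_time < 1:
        delay = max(delay * 0.9, MIN_DELAY)  # speed up cautiously while things look healthy
    time.sleep(delay)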
Scrapy implementation:
# settings.py
ROBOTSTXT_OBEY = True            # Respect robots.txt allow/disallow rules
DOWNLOAD_DELAY = 2               # Base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Randomize each delay (0.5x to 1.5x DOWNLOAD_DELAY)
# Note: ROBOTSTXT_OBEY only enforces allow/disallow rules; Scrapy does not
# apply a robots.txt Crawl-delay automatically, so set DOWNLOAD_DELAY (or a
# per-spider download_delay) to at least the advertised crawl-delay yourself.
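If you would rather let the delay adapt to how the server is responding, Scrapy's built-in AutoThrottle extension is one option alongside a fixed DOWNLOAD_DELAY; a minimal settings sketch with illustrative values:
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2            # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30             # upper bound when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for roughly one request in flight per server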
Advanced: Concurrent requests with delay:
Even with concurrency, respect per-domain rate limits:
import asyncio
import aiohttp  # assumed async HTTP client (pip install aiohttp)

async def fetch_with_delay(session, url, delay, semaphore):
    # Semaphore(1): one in-flight request per domain; the sleep enforces the crawl delay
    async with semaphore:
        await asyncio.sleep(delay)
        async with session.get(url) as response:
            return await response.text()

async def crawl(urls, crawl_delay):
    semaphore = asyncio.Semaphore(1)  # one request at a time per domain
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_delay(session, url, crawl_delay, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# Run with: results = asyncio.run(crawl(urls, crawl_delay))
Best practices:
- Always implement some delay, even if not specified (1-2 seconds)
- Randomize delays slightly to appear more human
- Respect crawl-delay even if it seems excessive
- Monitor your scraper for 429 (Too Many Requests) errors
- Implement exponential backoff on errors (see the sketch after this list)
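A minimal sketch of that backoff, using the requests library; fetch_with_backoff and its default retry values are illustrative, not a standard API:
import random
import time

import requests

def fetch_with_backoff(url, base_delay=2, max_retries=5):
    # Retry on 429/503, doubling the wait each attempt and adding a little jitter
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code not in (429, 503):
            return response
        wait = base_delay * (2 ** attempt)
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            wait = int(retry_after)  # Retry-After may also be an HTTP date; not handled here
        time.sleep(wait + random.uniform(0, 1))
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")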
Red flags that you're going too fast:
- 429 (Too Many Requests) responses
- 503 (Service Unavailable) responses
- Increasing response times
- Sudden connection errors
- IP ban or CAPTCHA challenges