How do I avoid getting blocked when web scraping with Node.js?

Set proper headers:

  • Use realistic User-Agent strings from recent browsers
  • Include Accept, Accept-Language, Accept-Encoding headers
  • Add Referer header when navigating between pages
  • Maintain consistent header sets that match real browsers
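
A minimal sketch with axios, assuming a recent Chrome User-Agent string and placeholder Referer and URL values (swap in whatever browser profile you want to mimic):

  const axios = require('axios');

  // Header set that mirrors a real browser; keep the values consistent with
  // the User-Agent you claim (the UA string and Referer below are placeholders).
  const browserHeaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br', // only advertise br if your client can decode brotli
    'Referer': 'https://example.com/',      // set to the page you navigated from
  };

  async function fetchPage(url) {
    const res = await axios.get(url, { headers: browserHeaders });
    return res.data;
  }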

Implement rate limiting:

  • Add delays between requests (500ms-2000ms)
  • Limit concurrent requests (3-5 simultaneous connections)
  • Use exponential backoff on errors
  • Respect Retry-After headers in 429 responses
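
A rough sketch that puts the numbers above together: random 500-2000ms delays, at most 3 requests in flight, and Retry-After honored on 429 responses:

  const axios = require('axios');

  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  // Fetch one URL, waiting 500-2000 ms first and honoring Retry-After on 429.
  async function politeGet(url) {
    await sleep(500 + Math.random() * 1500);
    try {
      return await axios.get(url);
    } catch (err) {
      if (err.response && err.response.status === 429) {
        const retryAfter = Number(err.response.headers['retry-after']) || 30;
        await sleep(retryAfter * 1000);
        return axios.get(url);
      }
      throw err;
    }
  }

  // Process URLs with at most `concurrency` requests in flight at a time.
  async function crawl(urls, concurrency = 3) {
    const queue = [...urls];
    const workers = Array.from({ length: concurrency }, async () => {
      while (queue.length) {
        await politeGet(queue.shift());
      }
    });
    await Promise.all(workers);
  }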

Rotate IP addresses:

  • Use proxy services (residential proxies are the hardest to flag, but cost more than datacenter proxies)
  • Rotate proxies between requests or sessions
  • Handle proxy failures gracefully with fallbacks
  • Monitor proxy health and performance
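
One way to rotate through a pool per request, sketched with axios and https-proxy-agent; the proxy URLs are placeholders for whatever your provider issues:

  const axios = require('axios');
  const { HttpsProxyAgent } = require('https-proxy-agent');

  // Placeholder pool; substitute real endpoints from your proxy provider.
  const proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
  ];
  let cursor = 0;

  // Round-robin through the pool, falling back to the next proxy on failure.
  async function fetchViaProxy(url) {
    for (let i = 0; i < proxies.length; i++) {
      const proxyUrl = proxies[cursor++ % proxies.length];
      try {
        return await axios.get(url, {
          httpsAgent: new HttpsProxyAgent(proxyUrl),
          proxy: false, // let the agent handle proxying, not axios itself
        });
      } catch (err) {
        console.warn(`Proxy failed (${proxyUrl}): ${err.message}`);
      }
    }
    throw new Error('All proxies failed for ' + url);
  }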

Use Puppeteer stealth techniques:

  • Install puppeteer-extra-plugin-stealth to hide automation indicators
  • Randomize viewport sizes and user agents
  • Simulate human-like mouse movements and delays
  • Handle CAPTCHAs with solving services when necessary
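
A minimal launch with the stealth plugin wired into puppeteer-extra; the viewport ranges and pauses below are illustrative, not magic numbers:

  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  puppeteer.use(StealthPlugin()); // patches common automation fingerprints

  async function scrape(url) {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Randomize the viewport a little so every run doesn't look identical.
    await page.setViewport({
      width: 1280 + Math.floor(Math.random() * 200),
      height: 720 + Math.floor(Math.random() * 200),
    });

    await page.goto(url, { waitUntil: 'networkidle2' });

    // Human-ish mouse move and pause before reading the page.
    await page.mouse.move(200 + Math.random() * 400, 150 + Math.random() * 300);
    await new Promise((r) => setTimeout(r, 1000 + Math.random() * 2000));

    const html = await page.content();
    await browser.close();
    return html;
  }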

Respect robots.txt:

  • Parse and follow robots.txt rules
  • Check Crawl-delay directives
  • Avoid disallowed paths
  • Consider reaching out for API access or permission
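
A sketch of checking robots.txt before each request, using the robots-parser package (one option among several; the bot name is a placeholder):

  const axios = require('axios');
  const robotsParser = require('robots-parser');

  const USER_AGENT = 'MyScraperBot/1.0'; // placeholder bot name

  // Download and parse robots.txt for a site's origin.
  async function getRobots(origin) {
    const robotsUrl = new URL('/robots.txt', origin).href;
    const res = await axios.get(robotsUrl, { validateStatus: () => true });
    return robotsParser(robotsUrl, res.status === 200 ? res.data : '');
  }

  // Returns whether a URL may be fetched and any Crawl-delay (in seconds).
  async function canFetch(url) {
    const robots = await getRobots(new URL(url).origin);
    return {
      allowed: robots.isAllowed(url, USER_AGENT),
      crawlDelay: robots.getCrawlDelay(USER_AGENT),
    };
  }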

Session management:

  • Maintain cookies across requests using cookie jars
  • Handle authentication properly
  • Reuse an existing session rather than creating a new one for every request
  • Store session state to resume after interruptions
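
A cookie jar shared across requests, sketched with tough-cookie and axios-cookiejar-support; the login endpoint and credentials are placeholders, and other clients such as got accept a cookie jar directly:

  const fs = require('fs');
  const axios = require('axios');
  const { wrapper } = require('axios-cookiejar-support');
  const { CookieJar } = require('tough-cookie');

  // One jar per logical session; every request through this client shares cookies.
  const jar = new CookieJar();
  const client = wrapper(axios.create({ jar }));

  async function loginAndFetch() {
    // Placeholder endpoint and credentials for illustration only.
    await client.post('https://example.com/login', { username: 'user', password: 'pass' });

    // The session cookie set by the login response is sent automatically here.
    const res = await client.get('https://example.com/account/orders');
    return res.data;
  }

  // Persist the jar so a crawl can resume after an interruption.
  function saveSession(path) {
    fs.writeFileSync(path, JSON.stringify(jar.toJSON()));
  }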

Error handling and retries:

  • Implement exponential backoff (retry after 2s, 4s, 8s)
  • Distinguish between temporary (retry) and permanent (skip) errors
  • Log failures for analysis
  • Circuit-break to stop hitting a failing target
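
A sketch of backoff-with-retry that gives up on permanent errors; the status-code split used here is a reasonable default rather than a universal rule:

  const axios = require('axios');

  const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

  // Retry transient failures (network errors, 429, 5xx) after 2s, 4s, 8s;
  // treat other 4xx responses as permanent and give up immediately.
  async function fetchWithRetry(url, maxRetries = 3) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await axios.get(url);
      } catch (err) {
        const status = err.response && err.response.status;
        const permanent = status && status >= 400 && status < 500 && status !== 429;
        if (permanent || attempt === maxRetries) {
          console.error(`Giving up on ${url}: ${status || err.code}`);
          throw err;
        }
        await sleep(2000 * 2 ** attempt); // 2s, 4s, 8s, ...
      }
    }
  }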

Behavioral patterns:

  • Scrape during off-peak hours
  • Vary request timing (don't be too regular)
  • Start slowly and increase rate gradually
  • Alternate between different pages/sections
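
Irregular timing mostly comes down to jitter and ordering; a small sketch:

  // Jittered delay: averages about 1.5 s but never repeats the same interval.
  const jitteredDelay = () =>
    new Promise((r) => setTimeout(r, 800 + Math.random() * 1400));

  // Visit pages in a shuffled order instead of a fixed sequence.
  function shuffle(items) {
    return items
      .map((item) => ({ item, key: Math.random() }))
      .sort((a, b) => a.key - b.key)
      .map(({ item }) => item);
  }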

Technical measures:

  • Use HTTP/2 when supported
  • Enable compression (gzip, brotli)
  • Handle redirects properly
  • Validate SSL certificates correctly
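
With got (v11, which still supports require), most of these are single options; a hedged sketch:

  const got = require('got');

  async function fetchPage(url) {
    const res = await got(url, {
      http2: true,          // negotiate HTTP/2 when the server supports it
      decompress: true,     // gzip/brotli handled transparently (the default)
      followRedirect: true, // follow 3xx responses (the default)
      https: { rejectUnauthorized: true }, // keep certificate validation on
    });
    return res.body;
  }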

Monitoring:

  • Track error rates and response codes
  • Monitor for blocks (403, 429, CAPTCHAs)
  • Log response times to detect throttling
  • Adjust strategy based on patterns
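
Simple counters hung off axios interceptors are often enough to spot trouble; a sketch:

  const axios = require('axios');

  const stats = { total: 0, errors: 0, blocked: 0, totalMs: 0 };
  const client = axios.create();

  client.interceptors.request.use((config) => {
    config.metadata = { start: Date.now() }; // stash the start time on the config
    return config;
  });

  client.interceptors.response.use(
    (res) => {
      stats.total += 1;
      stats.totalMs += Date.now() - res.config.metadata.start;
      return res;
    },
    (err) => {
      stats.total += 1;
      stats.errors += 1;
      const status = err.response && err.response.status;
      if (status === 403 || status === 429) stats.blocked += 1; // likely a block
      return Promise.reject(err);
    }
  );

  // Review periodically and slow down if errors or response times climb.
  setInterval(() => {
    const avgMs = stats.total ? Math.round(stats.totalMs / stats.total) : 0;
    console.log(`requests=${stats.total} errors=${stats.errors} blocked=${stats.blocked} avg=${avgMs}ms`);
  }, 60000);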

Legal and ethical:

  • Review terms of service
  • Avoid scraping personal data without consent
  • Don't overload servers
  • Consider official APIs when available
