How do I avoid getting blocked when web scraping with Node.js?
Set proper headers:
- Use realistic User-Agent strings from recent browsers
- Include Accept, Accept-Language, Accept-Encoding headers
- Add Referer header when navigating between pages
- Maintain consistent header sets that match real browsers
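For example, with Node 18+'s built-in fetch (the header values below are copied from a recent Chrome release and will age; treat them as placeholders):

```js
// Header set mimicking a recent desktop Chrome; update the UA string periodically.
const browserHeaders = {
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  'Accept':
    'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
};

async function fetchPage(url, referer) {
  const headers = { ...browserHeaders };
  if (referer) headers['Referer'] = referer; // set when following links between pages
  const res = await fetch(url, { headers });
  return res.text();
}
```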
Implement rate limiting:
- Add delays between requests (500–2000 ms)
- Limit concurrent requests (3-5 simultaneous connections)
- Use exponential backoff on errors
- Respect Retry-After headers in 429 responses
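A minimal sketch of the retry/backoff part, assuming Node 18+ (global fetch); for capping concurrency, a small library like p-limit does the job:

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const randomDelay = () => 500 + Math.random() * 1500; // 500–2000 ms between requests

async function politeFetch(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status === 429) {
      // Honor Retry-After when present; otherwise back off exponentially: 2s, 4s, 8s.
      // (Retry-After may also be an HTTP date; Number() then yields NaN and we fall back.)
      const retryAfter = Number(res.headers.get('retry-after'));
      await sleep(retryAfter ? retryAfter * 1000 : 2000 * 2 ** attempt);
      continue;
    }
    return res;
  }
  throw new Error(`Gave up on ${url} after ${maxRetries} retries`);
}

// Usage: await politeFetch(url); await sleep(randomDelay());
```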
Rotate IP addresses:
- Use proxy services (residential proxies are blocked far less often than datacenter IPs)
- Rotate proxies between requests or sessions
- Handle proxy failures gracefully with fallbacks
- Monitor proxy health and performance
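A sketch of pool rotation with failover, using undici's ProxyAgent; the proxy URLs are placeholders for whatever pool your provider issues:

```js
import { fetch, ProxyAgent } from 'undici';

const proxies = [
  'http://user:pass@proxy1.example.com:8000',
  'http://user:pass@proxy2.example.com:8000',
];
let cursor = 0;

async function fetchViaProxy(url) {
  // Rotate through the pool; on failure, fall through to the next proxy.
  for (let i = 0; i < proxies.length; i++) {
    const proxy = proxies[(cursor + i) % proxies.length];
    try {
      const res = await fetch(url, { dispatcher: new ProxyAgent(proxy) });
      cursor = (cursor + i + 1) % proxies.length; // next call starts on a fresh proxy
      return res;
    } catch (err) {
      console.warn(`Proxy ${proxy} failed: ${err.message}`); // feed this into health tracking
    }
  }
  throw new Error('All proxies failed');
}
```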
Use Puppeteer stealth techniques:
- Install puppeteer-extra-plugin-stealth to hide automation indicators
- Randomize viewport sizes and user agents
- Simulate human-like mouse movements and delays
- Handle CAPTCHAs with solving services when necessary
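A minimal stealth setup (npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth); the viewport ranges and delays are arbitrary illustrative values:

```js
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

puppeteer.use(StealthPlugin()); // patches common automation indicators

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Randomize the viewport so sessions don't share a single fingerprint.
await page.setViewport({
  width: 1200 + Math.floor(Math.random() * 300),
  height: 700 + Math.floor(Math.random() * 200),
});

await page.goto('https://example.com', { waitUntil: 'networkidle2' });

// Human-ish pause and mouse movement before interacting with the page.
await new Promise((r) => setTimeout(r, 1000 + Math.random() * 2000));
await page.mouse.move(200 + Math.random() * 400, 150 + Math.random() * 300, { steps: 25 });

await browser.close();
```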
Respect robots.txt:
- Parse and follow robots.txt rules
- Check Crawl-delay directives
- Avoid disallowed paths
- Consider reaching out for API access or permission
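A sketch using the robots-parser package; MyScraperBot is a hypothetical user-agent token for your crawler:

```js
import robotsParser from 'robots-parser'; // npm i robots-parser

const BOT_UA = 'MyScraperBot';

async function loadRobots(origin) {
  const robotsUrl = new URL('/robots.txt', origin).href;
  const body = await (await fetch(robotsUrl)).text();
  return robotsParser(robotsUrl, body);
}

const robots = await loadRobots('https://example.com');
const target = 'https://example.com/some/path';

if (robots.isAllowed(target, BOT_UA)) {
  const crawlDelay = robots.getCrawlDelay(BOT_UA); // seconds, or undefined if unset
  // ...fetch the page, waiting crawlDelay seconds between requests when it's set
}
```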
Session management:
- Maintain cookies across requests using cookie jars
- Handle authentication properly
- Don't needlessly start a fresh session for every request
- Store session state to resume after interruptions
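A sketch with got and tough-cookie; cookies.json is an arbitrary path for persisted session state:

```js
import got from 'got';
import { CookieJar } from 'tough-cookie';
import { existsSync, readFileSync, writeFileSync } from 'node:fs';

const JAR_FILE = 'cookies.json';

// Resume a previous session from disk if one exists.
const cookieJar = existsSync(JAR_FILE)
  ? CookieJar.deserializeSync(JSON.parse(readFileSync(JAR_FILE, 'utf8')))
  : new CookieJar();

const client = got.extend({ cookieJar }); // every request through `client` shares the jar

await client('https://example.com/account'); // cookies persist across calls

// Save the jar so an interrupted run can resume where it left off.
writeFileSync(JAR_FILE, JSON.stringify(cookieJar.serializeSync()));
```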
Error handling and retries:
- Implement exponential backoff (retry after 2s, 4s, 8s)
- Distinguish between temporary (retry) and permanent (skip) errors
- Log failures for analysis
- Circuit-break to stop hitting a failing target
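The backoff loop above already covers retries; this sketch adds error classification and a crude circuit breaker (the status list and threshold are illustrative):

```js
const PERMANENT = new Set([401, 404, 410]); // skip these; retrying won't help
const BREAKER_LIMIT = 5;
let consecutiveFailures = 0;

async function guardedFetch(url) {
  if (consecutiveFailures >= BREAKER_LIMIT) {
    throw new Error('Circuit open: target keeps failing, stop hitting it for a while');
  }
  try {
    const res = await fetch(url);
    if (PERMANENT.has(res.status)) {
      console.error(`Permanent error ${res.status} for ${url}; skipping`); // log for analysis
      return null;
    }
    if (!res.ok) throw new Error(`Temporary error ${res.status}`); // let the retry loop handle it
    consecutiveFailures = 0; // success closes the breaker
    return res;
  } catch (err) {
    consecutiveFailures++;
    console.error(`Failure #${consecutiveFailures} for ${url}: ${err.message}`);
    throw err;
  }
}
```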
Behavioral patterns:
- Scrape during off-peak hours
- Vary request timing (don't be too regular)
- Start slowly and increase rate gradually
- Alternate between different pages/sections
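A sketch of irregular pacing: jittered delays, a shuffled crawl order, and a rate that ramps up from a slow start (all constants are illustrative):

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const jitter = (baseMs) => baseMs * (0.5 + Math.random()); // ±50% around the base

async function crawl(urls, startDelayMs = 4000, minDelayMs = 1000) {
  // Fisher–Yates shuffle so pages/sections aren't hit in a predictable order.
  const order = [...urls];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }

  let delay = startDelayMs; // start slowly...
  for (const url of order) {
    await fetch(url);
    await sleep(jitter(delay));
    delay = Math.max(minDelayMs, delay * 0.9); // ...and ramp up gradually
  }
}
```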
Technical measures:
- Use HTTP/2 when supported
- Enable compression (gzip, brotli)
- Handle redirects properly
- Validate TLS certificates (don't disable verification just to silence errors)
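With got, most of this is a few client options (several are shown explicitly here even though they're got's defaults):

```js
import got from 'got';

const client = got.extend({
  http2: true,                         // negotiate HTTP/2 when the server supports it
  decompress: true,                    // accept and decode gzip/brotli (got's default)
  followRedirect: true,                // follow 3xx responses (got's default)
  maxRedirects: 5,
  https: { rejectUnauthorized: true }, // keep TLS certificate validation on (got's default)
});

const { statusCode } = await client('https://example.com');
```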
Monitoring:
- Track error rates and response codes
- Monitor for blocks (403, 429, CAPTCHAs)
- Log response times to detect throttling
- Adjust strategy based on patterns
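A sketch of block detection: count status codes and response times, and flag when too many requests look blocked (the 10% threshold is illustrative):

```js
const stats = { total: 0, blocked: 0, totalMs: 0, byStatus: {} };

async function monitoredFetch(url) {
  const start = Date.now();
  const res = await fetch(url);
  stats.total++;
  stats.totalMs += Date.now() - start; // a rising average can indicate throttling
  stats.byStatus[res.status] = (stats.byStatus[res.status] ?? 0) + 1;
  if (res.status === 403 || res.status === 429) stats.blocked++;

  if (stats.total >= 20 && stats.blocked / stats.total > 0.1) {
    console.warn('Possible block: slow down, rotate identity, or pause', stats);
  }
  return res;
}
```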
Legal and ethical:
- Review terms of service
- Avoid scraping personal data without consent
- Don't overload servers
- Consider official APIs when available