Why do I need to set a User-Agent when web scraping?
Setting a proper User-Agent header is essential for successful web scraping.
Default behavior looks suspicious:
Most HTTP libraries send a default User-Agent header that identifies the client library:
- Python requests: python-requests/2.28.0
- cURL: curl/7.68.0
- Node.js fetch: node-fetch/2.6.1
These values immediately identify your traffic as automated, making it easy for websites to block it.
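You can check what your own client sends before pointing it at a real site; for example, with requests (a minimal check, no network call needed):

import requests

# The default User-Agent string requests attaches, e.g. 'python-requests/2.28.0'
print(requests.utils.default_user_agent())

# A fresh Session is pre-populated with the same value
print(requests.Session().headers['User-Agent'])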
Why websites check User-Agent:
- Identify and block bots and scrapers
- Serve different content to different browsers
- Track browser statistics and compatibility
- Enforce API usage policies
- Prevent automated data extraction
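As a rough, hypothetical illustration of how such a check might work on the server side (the blocked substrings here are assumptions, not any specific site's rules):

# Hypothetical sketch of a naive server-side User-Agent filter
BLOCKED_SUBSTRINGS = ('python-requests', 'curl', 'node-fetch', 'bot', 'spider')

def looks_automated(user_agent):
    # Treat an empty User-Agent or an obvious automation signature as a bot
    ua = (user_agent or '').lower()
    return ua == '' or any(token in ua for token in BLOCKED_SUBSTRINGS)

print(looks_automated('python-requests/2.28.0'))  # True -> request gets blocked
print(looks_automated('Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'))  # False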
Missing User-Agent consequences:
Many servers will:
- Return 403 Forbidden errors
- Serve degraded or blocked content
- Rate limit aggressively
- Return CAPTCHA challenges
- Redirect to error pages
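Because these failures are easy to miss, it helps to check responses explicitly; the sketch below flags the most common symptoms (the status codes and the crude CAPTCHA text check are assumptions about how a block typically surfaces):

import requests

def fetch(url, headers=None):
    # Raise early if the response looks like a block rather than real content
    response = requests.get(url, headers=headers)
    if response.status_code in (403, 429):
        raise RuntimeError(f'Blocked or rate limited: HTTP {response.status_code}')
    if 'captcha' in response.text.lower():
        raise RuntimeError('Received a CAPTCHA challenge instead of content')
    return response.text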
Setting a realistic User-Agent:
In Python (requests):
import requests

url = 'https://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)
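If you make several requests, a common pattern is to set the header once on a Session so every request inherits it (a short sketch, using a placeholder URL):

import requests

session = requests.Session()
# Every request made through this session will carry this header
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
})
response = session.get('https://example.com')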
Best practices:
- Use recent, realistic User-Agent strings
- Match the User-Agent to your target audience (mobile vs desktop)
- Rotate User-Agents if making many requests (see the rotation sketch after this list)
- Keep User-Agents up to date as browsers release new versions
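A minimal sketch of User-Agent rotation, assuming a small, hand-maintained pool of strings (the entries below are illustrative and should be refreshed as browsers update):

import random
import requests

# Illustrative pool; keep it small, realistic, and up to date
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def get_with_random_ua(url):
    # Pick a different User-Agent for each request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = get_with_random_ua('https://example.com')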
A proper User-Agent is the most basic requirement for ethical and successful web scraping.