Why do I need to set a User-Agent when web scraping?

Setting a proper User-Agent header is essential for successful web scraping.

Default behavior looks suspicious:

Most HTTP libraries send identifying User-Agent headers:

  • Python requests: python-requests/2.28.0
  • cURL: curl/7.68.0
  • Node.js (node-fetch): node-fetch/2.6.1

These headers immediately identify your traffic as automated, which makes it easy for websites to block it.
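
You can see exactly what your client sends by echoing the request headers back from a test endpoint. This is a minimal sketch assuming Python requests; httpbin.org is used purely as an illustrative echo service:

import requests

# httpbin.org/headers echoes the request headers back as JSON,
# which makes it easy to inspect the default User-Agent.
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])
# Typically prints something like: python-requests/2.28.0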

Why websites check User-Agent:

  • Identify and block bots and scrapers (see the sketch after this list)
  • Serve different content to different browsers
  • Track browser statistics and compatibility
  • Enforce API usage policies
  • Prevent automated data extraction
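
As a rough illustration of the first point, a server-side filter might reject requests whose User-Agent matches known automation clients. This is a hypothetical sketch using Flask; real sites use far more sophisticated checks:

from flask import Flask, abort, request

app = Flask(__name__)

# Substrings commonly found in default library User-Agents (illustrative list).
BOT_SIGNATURES = ('python-requests', 'curl/', 'node-fetch')

@app.before_request
def block_default_user_agents():
    user_agent = request.headers.get('User-Agent', '').lower()
    # Reject requests with a missing or obviously automated User-Agent.
    if not user_agent or any(sig in user_agent for sig in BOT_SIGNATURES):
        abort(403)

@app.route('/')
def index():
    return 'Hello, browser!'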

Missing User-Agent consequences:

Many servers will:

  • Return 403 Forbidden errors (example after this list)
  • Serve degraded or blocked content
  • Rate limit aggressively
  • Return CAPTCHA challenges
  • Redirect to error pages
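
In practice you usually notice this as a 403 on an otherwise valid URL. A minimal way to detect it with requests (the URL below is a placeholder):

import requests

response = requests.get('https://example.com/data')  # placeholder URL
if response.status_code == 403:
    # The server likely rejected the default User-Agent;
    # retry with a browser-like header as shown in the next section.
    print('Blocked: set a realistic User-Agent and retry')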

Setting a realistic User-Agent:

In Python (requests):

import requests

url = 'https://example.com'  # replace with your target page
headers = {
    # A realistic desktop Chrome User-Agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)

Best practices:

  • Use recent, realistic User-Agent strings
  • Match the User-Agent to your target audience (mobile vs desktop)
  • Rotate User-Agents if making many requests (see the rotation sketch after this list)
  • Keep User-Agents up to date as browsers release new versions
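
A minimal rotation sketch, assuming a small hand-maintained pool of browser strings; the pool and URL here are illustrative, not exhaustive or guaranteed current:

import random
import requests

# Illustrative pool; in practice keep these realistic and up to date.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def fetch(url):
    # Pick a different browser-like User-Agent for each request.
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = fetch('https://example.com')  # placeholder URL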

A proper User-Agent is the most basic requirement for ethical and successful web scraping.
