What are Python web scraping best practices?
Following best practices ensures reliable, maintainable, and ethical scrapers.
1. Use sessions for efficiency:
import requests

session = requests.Session()  # reuses the underlying TCP connection across requests
session.headers.update({'User-Agent': 'Your Bot Name'})
response = session.get(url)
2. Implement proper error handling:
import logging
import requests

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except requests.RequestException as e:
    logging.error(f"Request failed: {e}")
3. Add delays between requests:
import time
time.sleep(1) # Wait 1 second between requests
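A fixed pause works, but many scrapers add a small random jitter so request timing looks less mechanical and load is spread more evenly. A minimal sketch (the helper name, base, and jitter values are illustrative):

import random
import time

def polite_sleep(base=1.0, jitter=0.5):
    # pause between base and base + jitter seconds before the next request
    time.sleep(base + random.uniform(0, jitter))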
4. Validate data before extraction:
element = soup.find('div', class_='price')  # soup is a BeautifulSoup object
price = element.text.strip() if element else None  # avoids AttributeError when the element is missing
5. Use logging instead of print:
import logging
logging.basicConfig(level=logging.INFO)
logging.info(f"Scraped {len(items)} items")
6. Store data incrementally:
Write data as you scrape instead of saving everything at the end, so a crash doesn't wipe out work already completed.
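A minimal sketch of incremental saving, assuming each scraped record is a dict and that scrape_pages is a hypothetical generator yielding one record at a time (the filename and field names are illustrative):

import csv

with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    for record in scrape_pages():  # hypothetical generator of scraped records
        writer.writerow(record)
        f.flush()  # push each row to disk so a crash loses at most one record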
7. Respect robots.txt:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file
can_fetch = rp.can_fetch("*", url)  # True if any user agent may fetch this url
8. Use appropriate selectors:
- Prefer CSS selectors for structure-based extraction
- Use XPath when you need text-based selection or parent navigation
- Make selectors as simple as possible while remaining specific (see the sketch after this list)
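A sketch of both styles against a hypothetical product listing; page_source, the class names, and the "$" text match are all illustrative:

from bs4 import BeautifulSoup
from lxml import html

# CSS selector via BeautifulSoup: concise, structure-based
soup = BeautifulSoup(page_source, 'html.parser')
titles = [el.get_text(strip=True) for el in soup.select('div.product > h2.title')]

# XPath via lxml: text-based matching plus parent navigation
tree = html.fromstring(page_source)
price_rows = tree.xpath('//span[contains(text(), "$")]/parent::div')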
9. Handle pagination properly:
Follow pagination links or API parameters instead of guessing URL patterns.
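A sketch of link-following pagination, assuming each page exposes a rel="next" link (the starting URL and the extraction step are illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://example.com/products'  # illustrative starting page
while url:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract items from soup here ...
    next_link = soup.find('a', rel='next')  # follow the site's own "next" link
    url = urljoin(url, next_link['href']) if next_link else None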
10. Test against multiple samples:
Websites vary between pages - test selectors on multiple examples to ensure robustness.
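One way to spot-check a selector before trusting it, assuming a handful of representative sample URLs (the URLs and the selector are illustrative):

import logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
sample_urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/category/sale',
]
for url in sample_urls:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    element = soup.find('div', class_='price')
    if element:
        logging.info("%s -> %s", url, element.text.strip())
    else:
        logging.warning("Selector missed on %s", url)  # fix before a full run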