What are common regex patterns for web scraping?
Common regex patterns help extract structured data from HTML and text content.
Email addresses:
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- Matches most common email formats (not every address the email specifications allow, such as quoted local parts)
- Useful for extracting contact information
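A minimal sketch of how the email pattern might be applied in Python. The function name `extract_emails` and the sample text are illustrative assumptions, not part of any library.

```python
import re

# Email pattern from this section, compiled once for reuse
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(text: str) -> list[str]:
    """Return unique email-like strings in order of first appearance (hypothetical helper)."""
    seen = {}
    for match in EMAIL_RE.finditer(text):
        # dict keys preserve insertion order, so this de-duplicates without reordering
        seen.setdefault(match.group(0), None)
    return list(seen)

sample = "Contact sales@example.com or support@example.co.uk. Again: sales@example.com"
emails = extract_emails(sample)
print(emails)
```

De-duplicating is a design choice here: scraped pages often repeat the same contact address in the header, body, and footer.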
URLs:
https?:\/\/[^\s<>"]+
- Matches HTTP and HTTPS URLs
- Stops at whitespace or common delimiters
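As a rough illustration, the URL pattern could be used like this in Python. The helper name `extract_urls` and the trailing-punctuation cleanup are assumptions added for the example; the pattern itself only stops at whitespace, angle brackets, and double quotes.

```python
import re

# URL pattern from this section (slashes need no escaping in Python)
URL_RE = re.compile(r'https?://[^\s<>"]+')

def extract_urls(html: str) -> list[str]:
    """Find http(s) URLs and trim punctuation that commonly trails a URL in prose."""
    # Assumption: a trailing ".", ",", ")" etc. belongs to the sentence, not the URL
    return [m.group(0).rstrip(".,;:!?)") for m in URL_RE.finditer(html)]

sample = '<a href="https://example.com/page?id=1">link</a> See http://example.org/docs.'
urls = extract_urls(sample)
print(urls)
```

The cleanup step can strip characters that legitimately end a URL, so it may be worth dropping if the input is known to be attribute values rather than running text.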
Phone numbers (US):
\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
- Matches formats like (123) 456-7890, 123-456-7890, 123.456.7890
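One way the phone pattern might be combined with normalization, sketched in Python. `normalize_phone` is a hypothetical helper, and the output format `NNN-NNN-NNNN` is an arbitrary choice for the example.

```python
import re

# US phone pattern from this section
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def normalize_phone(raw: str) -> str:
    """Reduce a matched number to digits and reformat as NNN-NNN-NNNN (hypothetical helper)."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

sample = "Call (123) 456-7890, 123-456-7890 or 123.456.7890."
phones = [normalize_phone(m.group(0)) for m in PHONE_RE.finditer(sample)]
print(phones)
```

Normalizing after matching makes it easier to compare or de-duplicate numbers that were written in different styles.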
Prices:
\$\d+(?:,\d{3})*(?:\.\d{2})?
- Matches currency amounts like $1,234.56
- Handles optional cents
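A short sketch of turning matched prices into numbers, assuming Python and its standard `decimal` module. The name `parse_prices` is illustrative only.

```python
import re
from decimal import Decimal

# Price pattern from this section: "$", digits, optional thousands groups, optional cents
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

def parse_prices(text: str) -> list[Decimal]:
    """Convert each matched price to a Decimal (hypothetical helper)."""
    prices = []
    for match in PRICE_RE.finditer(text):
        # Remove the currency symbol and thousands separators before converting
        cleaned = match.group(0).lstrip("$").replace(",", "")
        prices.append(Decimal(cleaned))
    return prices

sample = "Was $1,234.56, now $999 (save $235.56)."
prices = parse_prices(sample)
print(prices)
```

`Decimal` is used instead of `float` so that amounts such as 0.10 are not subject to binary rounding error.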
Dates (common formats):
\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}
- Matches MM/DD/YYYY or YYYY-MM-DD formats
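The date patterns only check the shape of the text, not whether the date exists. A possible Python sketch that joins both patterns with `|` and validates each match with `datetime.strptime`; the helper name `parse_dates` is an assumption for this example.

```python
import re
from datetime import date, datetime

# Both date patterns from this section, joined with alternation
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}")

def parse_dates(text: str) -> list[date]:
    """Return real calendar dates found in text, skipping impossible ones (hypothetical helper)."""
    results = []
    for match in DATE_RE.finditer(text):
        raw = match.group(0)
        # Pick the strptime format based on which separator was matched
        fmt = "%m/%d/%Y" if "/" in raw else "%Y-%m-%d"
        try:
            results.append(datetime.strptime(raw, fmt).date())
        except ValueError:
            # The regex matched, but the value is not a valid date (e.g. month 13)
            continue
    return results

sample = "Posted 3/14/2024, updated 2024-04-01, bogus 13/45/2024."
dates = parse_dates(sample)
print(dates)
```

Note that the slash form is read as MM/DD/YYYY here; pages that use DD/MM/YYYY would need a different format string.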
Extracting content between tags:
<tag[^>]*>(.*?)</tag>
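- Captures content between an opening and closing tag (replace tag with the element name, such as h2)
- Lazy matching (.*?) stops at the first closing tag instead of the last one, which prevents over-capturing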
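A hedged Python sketch of the tag pattern with the element name filled in. `extract_tag_content` is a hypothetical helper; the `re.DOTALL` flag, the `\b` after the tag name, and `re.IGNORECASE` are additions for this example, since `.` does not match newlines by default and `<p` would otherwise also match `<pre`.

```python
import re

def extract_tag_content(html: str, tag: str) -> list[str]:
    """Return the stripped text between each <tag ...> and </tag> pair (hypothetical helper)."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",  # lazy group stops at the first closing tag
        re.DOTALL | re.IGNORECASE,          # let "." span newlines; accept <H2> as well
    )
    return [content.strip() for content in pattern.findall(html)]

sample = '<h2 class="title">First</h2>\n<p>body</p>\n<h2>\n  Second\n</h2>'
titles = extract_tag_content(sample, "h2")
print(titles)
```

This approach does not handle an element nested inside another element of the same name; that is one of the cases where an HTML parser is the safer choice.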
Important note:
Regex is useful for simple extraction, but a dedicated HTML parser (BeautifulSoup, Cheerio, lxml) is a better fit for nested or irregular markup. Regex works best for:
- Plain text extraction
- Validating extracted values
- Quick one-off parsing tasks
- Content that isn't proper HTML/XML