What are common regex patterns for web scraping?

Common regex patterns help extract structured data from HTML and text content.

Email addresses:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Matches most common email formats (it is not a full RFC 5322 validator)
Useful for extracting contact information
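
As a minimal sketch of how the email pattern might be applied, assuming Python's standard re module: the helper name extract_emails and the sample text are made up for illustration, and the case-insensitive de-duplication is an added convenience, not part of the pattern itself.

```python
import re

# Email pattern from the text above, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return lowercased, de-duplicated matches in order of first appearance."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


# Hypothetical sample text.
sample = "Contact sales@example.com or Sales@Example.com. Press: press@example.org."
emails = extract_emails(sample)
print(emails)
```

The trailing sentence period is not captured because the pattern must end in at least two letters after the last dot.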

URLs:

https?:\/\/[^\s<>"]+

Matches HTTP and HTTPS URLs
Stops at whitespace, angle brackets, or double quotes
Can still capture trailing punctuation, such as a sentence-ending period
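
A short sketch of the URL pattern in Python, under a few assumptions: the forward slashes are left unescaped because Python does not require escaping them, the helper name extract_urls is hypothetical, and stripping trailing punctuation is an extra cleanup step that could be wrong for URLs that legitimately end in one of those characters.

```python
import re

# URL pattern from the text above; "/" needs no escaping in Python.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def extract_urls(html: str) -> list[str]:
    """Find http/https URLs and trim sentence punctuation from the end."""
    # Assumption: a trailing ".", ",", ";", ":", "!" or "?" belongs to the
    # surrounding sentence rather than to the URL.
    return [url.rstrip(".,;:!?") for url in URL_RE.findall(html)]


# Hypothetical sample markup.
sample = '<a href="https://example.com/a?b=1">docs</a> See http://example.org/page.'
urls = extract_urls(sample)
print(urls)
```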

Phone numbers (US):

\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}

Matches formats like (123) 456-7890, 123-456-7890, 123.456.7890
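
One way this might be used, sketched in Python: find each phone number, then normalize it to a single format. The function name normalize_us_phones and the sample numbers are illustrative assumptions, and the dashed output format is just one possible choice.

```python
import re

# US phone pattern: optional parentheses around the area code,
# optional "-", "." or whitespace between the groups.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def normalize_us_phones(text: str) -> list[str]:
    """Return every match rewritten as NNN-NNN-NNNN."""
    results = []
    for raw in PHONE_RE.findall(text):
        digits = re.sub(r"\D", "", raw)  # keep only the ten digits
        results.append(f"{digits[:3]}-{digits[3:6]}-{digits[6:]}")
    return results


# Hypothetical sample text.
sample = "Call (212) 555-0100, 415-555-0101 or 646.555.0102."
phones = normalize_us_phones(sample)
print(phones)
```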

Prices:

\$\d+(?:,\d{3})*(?:\.\d{2})?

Matches currency amounts like $1,234.56
Handles optional cents
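
A possible way to turn matched prices into numbers, sketched in Python. Using Decimal instead of float is a design choice made here to avoid rounding surprises with money. The helper name extract_prices and the sample text are hypothetical.

```python
import re
from decimal import Decimal

# Price pattern from the text above: "$", digits, optional thousands
# groups, optional two-digit cents.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def extract_prices(text: str) -> list[Decimal]:
    """Return each matched price as a Decimal, without "$" or commas."""
    return [
        Decimal(raw.lstrip("$").replace(",", ""))
        for raw in PRICE_RE.findall(text)
    ]


# Hypothetical sample text.
sample = "Was $1,234.56, now $999 (save $235.56)."
prices = extract_prices(sample)
print(prices)
```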

Dates (common formats):

\d{1,2}/\d{1,2}/\d{4}
\d{4}-\d{2}-\d{2}

Matches MM/DD/YYYY or YYYY-MM-DD formats
Checks digit counts only, so impossible dates such as 13/45/2024 also match
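
Because the date patterns only check the shape of the text, a scraper would usually validate each match afterwards. The sketch below combines both patterns with alternation and lets datetime.strptime reject impossible dates. The word boundaries, the helper name extract_dates, and the assumption that slash dates are month-first are additions for illustration.

```python
import re
from datetime import date, datetime

# Both date shapes from the text above, joined with "|".
# The \b anchors are an addition to avoid matching inside longer digit runs.
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def extract_dates(text: str) -> list[date]:
    """Return only the matches that are real calendar dates."""
    found = []
    for raw in DATE_RE.findall(text):
        # Assumption: slash-separated dates are month-first (US style).
        fmt = "%m/%d/%Y" if "/" in raw else "%Y-%m-%d"
        try:
            found.append(datetime.strptime(raw, fmt).date())
        except ValueError:
            continue  # right shape, impossible date
    return found


# Hypothetical sample text.
sample = "Posted 3/7/2024, updated 2024-03-15, bogus 13/45/2024."
dates = extract_dates(sample)
print(dates)
```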

Extracting content between tags:

<tag[^>]*>(.*?)</tag>

Captures content between HTML tags
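Lazy matching (.*?) stops at the first closing tag instead of the last
In most engines . does not match newlines by default, so enable dot-all mode for multi-line content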
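
A sketch of the tag pattern in Python, with the placeholder "tag" replaced by a real tag name at call time. The function name between_tags is hypothetical. The re.DOTALL and re.IGNORECASE flags and the \b after the tag name are additions so that multi-line content and uppercase tags are handled. Nested tags of the same name are not handled correctly by this approach.

```python
import re


def between_tags(html: str, tag: str) -> list[str]:
    """Return the stripped inner content of every <tag>...</tag> pair."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.*?)</{name}>",
        re.DOTALL | re.IGNORECASE,  # "." spans newlines; <H2> also matches
    )
    return [inner.strip() for inner in pattern.findall(html)]


# Hypothetical sample markup with one multi-line element.
sample = '<h2 class="t">First</h2>\n<h2>\n  Second\n</h2>'
titles = between_tags(sample, "h2")
print(titles)
```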

Important note:

While regex is useful for simple extraction, consider using dedicated HTML parsers (BeautifulSoup, Cheerio, lxml) for complex HTML parsing. Regex works best for:

Plain text extraction
Validating extracted values
Quick one-off parsing tasks
Content that isn't proper HTML/XML
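
To illustrate the split suggested above, here is a rough sketch where a parser handles the markup and regex only validates the extracted values. It uses Python's built-in html.parser to stay dependency-free. BeautifulSoup or lxml would fill the same role. The class TextCollector, the function valid_prices, and the sample markup are all hypothetical.

```python
import re
from html.parser import HTMLParser

PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


class TextCollector(HTMLParser):
    """Collects non-empty text nodes; tags and attributes are ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def valid_prices(html: str) -> list[str]:
    """Return text nodes that are, in their entirety, a price."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # fullmatch validates the whole value instead of searching inside it.
    return [c for c in collector.chunks if PRICE_RE.fullmatch(c)]


# Hypothetical sample markup; the data-price attribute is never seen
# by the regex because only text nodes are collected.
sample = '<ul><li data-price="$5">$19.99</li><li>Call us</li><li>$1,200</li></ul>'
checked = valid_prices(sample)
print(checked)
```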

