What are common regex patterns for web scraping?

Common regex patterns help extract structured data from HTML and text content.

Email addresses:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

  • Matches most common email formats, though not every address permitted by RFC 5322
  • Useful for extracting contact information
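
A minimal sketch of applying the email pattern with Python's re module. The sample text and variable names are made up for illustration.

```python
import re

# Email pattern from this section, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical page text; any string works here.
text = "Contact sales@example.com or support@mail.example.org for help."

# findall returns every non-overlapping match in order of appearance.
emails = EMAIL_RE.findall(text)
print(emails)
```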

URLs:

https?:\/\/[^\s<>"]+

  • Matches HTTP and HTTPS URLs
  • Stops at whitespace, angle brackets, or double quotes
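
As a rough illustration, the URL pattern can be applied to mixed HTML and text as below. In Python the forward slashes need no escaping, so they are written plainly. Stripping trailing punctuation is an extra step assumed here, since the pattern itself keeps a sentence-ending period.

```python
import re

# URL pattern from this section; "/" does not need escaping in Python.
URL_RE = re.compile(r'https?://[^\s<>"]+')

# Hypothetical snippet mixing an attribute value and a URL in running text.
text = '<a href="https://example.com/page?id=1">Docs</a> Visit http://example.org/about.'

raw_urls = URL_RE.findall(text)

# Assumption: trailing sentence punctuation is not part of the URL.
urls = [u.rstrip(".,;:!?") for u in raw_urls]
print(urls)
```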

Phone numbers (US):

\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}

  • Matches formats like (123) 456-7890, 123-456-7890, 123.456.7890
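
A short sketch of extracting US phone numbers and normalizing them to digits only. The normalization step is an illustrative addition, not part of the pattern.

```python
import re

# US phone pattern from this section.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

# Hypothetical text containing the three formats listed above.
text = "Call (123) 456-7890, 123-456-7890, or 123.456.7890."

phones = PHONE_RE.findall(text)

# Remove every non-digit so the different formats compare equal.
normalized = [re.sub(r"\D", "", p) for p in phones]
print(phones)
print(normalized)
```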

Prices:

\$\d+(?:,\d{3})*(?:\.\d{2})?

  • Matches currency amounts like $1,234.56
  • Handles optional cents
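
One possible way to use the price pattern, written with a single optional cents group, and convert the matches to numbers. Using Decimal rather than float is a design choice assumed here to avoid rounding surprises with money.

```python
import re
from decimal import Decimal

# Price pattern: dollar sign, digits, optional thousands groups, optional cents.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

# Hypothetical product text.
text = "Was $1,234.56, now $999 or $12.50 with coupon."

prices = PRICE_RE.findall(text)

# Drop the currency symbol and thousands separators before converting.
amounts = [Decimal(p.lstrip("$").replace(",", "")) for p in prices]
print(prices)
print(amounts)
```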

Dates (common formats):

\d{1,2}/\d{1,2}/\d{4} or \d{4}-\d{2}-\d{2}

  • Matches the MM/DD/YYYY or YYYY-MM-DD shape; it does not check that the month and day are in range
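
A sketch that joins the two date patterns with alternation and then filters the candidates through datetime.strptime, since the regex only checks the digit layout. The helper name parse_date and the sample text are hypothetical.

```python
import re
from datetime import datetime

# Both date shapes from this section, combined with alternation.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2}")

def parse_date(value):
    """Return a date if value is a real calendar date, else None."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

# Hypothetical text; the last candidate has the right shape but is not a date.
text = "Published 3/14/2024, updated 2024-04-01, bogus 13/45/2024."

candidates = DATE_RE.findall(text)
valid_dates = [d for d in map(parse_date, candidates) if d is not None]
print(candidates)
print(valid_dates)
```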

Extracting content between tags:

<tag[^>]*>(.*?)</tag>

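  • Captures content between an opening and closing tag (replace tag with the element name)
  • Lazy matching (.*?) stops at the first closing tag instead of the last one
  • Does not handle nested tags of the same name, and . needs a dot-all flag to match across line breaks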
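
A small sketch of the tag pattern with h2 substituted for tag. The re.DOTALL flag is assumed so that content spanning several lines is captured, and re.IGNORECASE so that uppercase tags also match.

```python
import re

# Tag pattern from this section, specialized to <h2> elements.
H2_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.DOTALL | re.IGNORECASE)

# Hypothetical markup: one heading with attributes, one spanning two lines.
html = '<h2 class="t">First</h2>\n<h2>Second\nline</h2>'

# With one capture group, findall returns only the captured content.
headings = H2_RE.findall(html)
print(headings)
```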

Important note:

While regex is useful for simple extraction, consider using dedicated HTML parsers (BeautifulSoup, Cheerio, lxml) for complex HTML parsing. Regex works best for:

  • Plain text extraction
  • Validating extracted values
  • Quick one-off parsing tasks
  • Content that isn't proper HTML/XML
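
To illustrate the split between parsing and validating, the sketch below collects link targets with a parser and then checks each value with a regex. Python's built-in html.parser stands in here for the parsers named above, purely to keep the example dependency-free. The LinkCollector class and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

URL_RE = re.compile(r'https?://[^\s<>"]+')

class LinkCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

# Hypothetical markup with absolute, relative, and mailto links.
html = ('<p><a href="https://example.com/a">A</a> '
        '<a href="/relative">B</a> '
        '<a class="x" href="mailto:me@example.com">C</a></p>')

collector = LinkCollector()
collector.feed(html)

# The parser finds the values; the regex validates them as absolute URLs.
absolute = [h for h in collector.hrefs if URL_RE.fullmatch(h)]
print(collector.hrefs)
print(absolute)
```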
