How do I extract phone numbers from HTML?

Extracting phone numbers from HTML is challenging due to the wide variety of phone number formats across different countries and inconsistent formatting on websites.

Common formats:

Phone numbers can appear in many formats:

  • (555) 123-4567
  • 555-123-4567
  • +1 555 123 4567
  • 555.123.4567
  • 555 123 4567, often embedded in text such as "call 555 123 4567"

Extraction strategy:

Use regex patterns that match the common formats but tolerate variations in separators and spacing:

  • For international numbers, look for a leading + followed by a country code of one to three digits
  • Check both the visible text content and HTML attributes such as href
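
As a minimal sketch of such a pattern, the Python below matches the formats listed earlier in plain text. The pattern and the names PHONE_RE and find_phone_candidates are illustrative assumptions; it targets 10-digit numbers with an optional country code and is not a complete grammar for international numbers.

```python
import re

# Loose, illustrative pattern: optional "+<country code>", a 3-digit area code
# (with or without parentheses), then 3 and 4 digits with optional separators.
PHONE_RE = re.compile(
    r"""
    (?<![\w+])               # do not start inside a word or a longer number
    (?:\+\d{1,3}[\s.-]?)?    # optional country code, e.g. "+1 "
    (?:\(\d{3}\)|\d{3})      # area code, with or without parentheses
    [\s.-]?
    \d{3}
    [\s.-]?
    \d{4}
    (?!\d)                   # do not stop in the middle of a longer number
    """,
    re.VERBOSE,
)


def find_phone_candidates(text: str) -> list[str]:
    """Return every substring of text that looks like a phone number."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]
```

The results are only candidates: a pattern this loose will also match some order numbers or IDs that happen to have ten digits, so later validation is still needed.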

Structured data in HTML:

Phone numbers often appear in the href attribute of links that use the tel: URI scheme:

<a href="tel:+15551234567">Call us</a>

These provide cleaner, structured data that's easier to extract.
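
One way to collect these links, sketched with Python's standard html.parser module. The class and function names (TelLinkParser, extract_tel_links) are hypothetical, and the sketch only inspects a elements.

```python
from html.parser import HTMLParser


class TelLinkParser(HTMLParser):
    """Collects the number part of every <a href="tel:..."> link."""

    def __init__(self) -> None:
        super().__init__()
        self.numbers: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("tel:"):
                # Keep everything after the "tel:" prefix.
                self.numbers.append(value[len("tel:"):].strip())


def extract_tel_links(html: str) -> list[str]:
    """Return the numbers found in tel: links, in document order."""
    parser = TelLinkParser()
    parser.feed(html)
    parser.close()
    return parser.numbers
```

For the link shown above, extract_tel_links returns a list containing "+15551234567".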

Normalization:

After extraction, normalize phone numbers by:

  • Removing formatting characters (parentheses, hyphens, spaces)
  • Leaving only digits and the + prefix for international codes
  • Converting to a standard format such as E.164 (for example, +15551234567)
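
These steps can be sketched in a few lines of Python. The function name normalize_phone and its default_country_code parameter are assumptions for illustration: the sketch treats any number without a + as a national number of a single known country.

```python
import re


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Reduce a matched number to '+' followed by digits only."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)  # drop parentheses, hyphens, dots, spaces
    if raw.startswith("+"):
        # The number already carries its country code.
        return "+" + digits
    # Assumption: no "+" means a national number for the default country.
    return "+" + default_country_code + digits
```

This does not strip national trunk prefixes (such as the leading 0 used in UK numbers) or a leading 1 written before a US area code, so it is only safe when the input formats are known in advance.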

Using libphonenumber:

Google's libphonenumber (Java, C++, and JavaScript) and its ports to other languages, such as phonenumbers for Python, provide comprehensive parsing and validation. They can:

  • Identify the country or region of a number from its country calling code
  • Check whether a number is possible or valid for that region
  • Convert between different formats
  • Parse extensions and special numbers
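
A minimal sketch using the phonenumbers package, the Python port of libphonenumber, which has to be installed separately. The wrapper name parse_to_e164 and the default region "US" are assumptions; the guarded import only keeps the snippet loadable when the package is missing.

```python
from typing import Optional

try:
    import phonenumbers  # third-party Python port of libphonenumber
except ImportError:  # keep this module importable without the package
    phonenumbers = None


def parse_to_e164(raw: str, default_region: str = "US") -> Optional[str]:
    """Return the E.164 form of a valid number, or None if it fails to parse or validate."""
    if phonenumbers is None:
        raise RuntimeError("install the 'phonenumbers' package to use this sketch")
    try:
        # default_region is only used when the number has no "+<country code>".
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
```

Swapping is_valid_number for is_possible_number gives a looser check based mainly on length, which may be preferable when the source pages contain numbers from ranges the library's metadata does not cover.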

Data cleaning:

  • Deduplicate after normalization, so that differently formatted copies of the same number collapse into one entry
  • Filter out obviously invalid numbers like 123-456-7890 or 000-000-0000 that appear as placeholders
  • Handle edge cases like toll-free numbers, short codes, and emergency numbers
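
The first two steps might look like the sketch below, which assumes the numbers have already been normalized to + followed by digits. The names clean_numbers and PLACEHOLDER_DIGITS are hypothetical, and the placeholder check assumes 10-digit national numbers.

```python
# Illustrative placeholder list; extend it with values seen in your own data.
PLACEHOLDER_DIGITS = {"1234567890", "0000000000"}


def clean_numbers(normalized: list[str]) -> list[str]:
    """Drop placeholders and duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for number in normalized:
        digits = "".join(ch for ch in number if ch.isdigit())
        national = digits[-10:]  # assumption: 10-digit national numbers
        # Skip known placeholders and numbers made of one repeated digit.
        if national in PLACEHOLDER_DIGITS or len(set(national)) == 1:
            continue
        if number in seen:
            continue
        seen.add(number)
        result.append(number)
    return result
```

Toll-free numbers, short codes, and emergency numbers are not handled here; they usually need region-specific rules rather than a fixed list.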
