How do I extract phone numbers from HTML?
Extracting phone numbers from HTML is challenging due to the wide variety of phone number formats across different countries and inconsistent formatting on websites.
Common formats:
Phone numbers can appear in many formats:
- (555) 123-4567
- 555-123-4567
- +1 555 123 4567
- 555.123.4567
- call 555 123 4567
Extraction strategy:
Use regex patterns that match common formats while being flexible enough to handle variations:
- For international numbers, look for patterns starting with + followed by country codes
- Check both visible text content and HTML attributes
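A minimal sketch of the regex approach, assuming North American style numbers like the ones listed above. The pattern and the function name `find_phone_candidates` are illustrative, not a general solution; other countries need different digit groupings.

```python
import re

# Illustrative pattern for US/Canada style numbers only (an assumption).
PHONE_RE = re.compile(
    r"""
    (?<![\w+])               # not preceded by a word character or '+'
    (?:\+?1[\s.-]?)?         # optional country code 1, with or without '+'
    (?:\(\d{3}\)|\d{3})      # area code, with or without parentheses
    [\s.-]?                  # optional separator
    \d{3}
    [\s.-]?                  # optional separator
    \d{4}
    (?!\d)                   # not followed by another digit
    """,
    re.VERBOSE,
)


def find_phone_candidates(text: str) -> list[str]:
    """Return every substring of text that looks like a phone number."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]


sample = "Call (555) 123-4567 or +1 555 123 4567, fax 555.123.4567"
print(find_phone_candidates(sample))
```

The lookaround guards keep the pattern from matching inside longer digit runs such as order IDs. Matches are only candidates and still need validation.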
Structured data in HTML:
Phone numbers often appear in href attributes with tel: protocol links:
<a href="tel:+15551234567">Call us</a>
These provide cleaner, structured data that's easier to extract.
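As a rough illustration, tel: links can be collected with Python's standard-library HTML parser. The class and function names here (`TelLinkParser`, `extract_tel_links`) are made up for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class TelLinkParser(HTMLParser):
    """Collects the number part of every <a href="tel:..."> link."""

    def __init__(self) -> None:
        super().__init__()
        self.numbers: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("tel:"):
                # Drop the scheme and decode percent-escapes such as %20.
                self.numbers.append(unquote(value[4:]).strip())


def extract_tel_links(html: str) -> list[str]:
    parser = TelLinkParser()
    parser.feed(html)
    parser.close()
    return parser.numbers


print(extract_tel_links('<a href="tel:+15551234567">Call us</a>'))
```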
Normalization:
After extraction, normalize phone numbers by:
- Removing formatting characters (parentheses, hyphens, spaces)
- Leaving only digits and the + prefix for international codes
- Converting to a standard format
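The first two steps could look roughly like this. `normalize_phone` is a hypothetical helper. It does not guess a missing country code, so converting to a full international format is left to a later step.

```python
import re


def normalize_phone(raw: str) -> str:
    """Strip formatting, keeping only digits and a single leading '+'."""
    raw = raw.strip()
    has_plus = raw.startswith("+")
    digits = re.sub(r"\D", "", raw)  # remove everything that is not a digit
    return "+" + digits if has_plus else digits


for example in ["(555) 123-4567", "+1 555 123 4567", "555.123.4567"]:
    print(example, "->", normalize_phone(example))
```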
Using libphonenumber:
Google's libphonenumber, which has ports for Python (the phonenumbers package), JavaScript, and other languages, provides comprehensive parsing and validation. It can:
- Identify the country from the number format
- Validate if a number is possible or valid for that region
- Convert between different formats
- Parse extensions and special numbers
Data cleaning:
- Deduplicate on the normalized form, so the same number written in different formats is counted only once
- Filter out obviously invalid numbers like 123-456-7890 or 000-000-0000 that appear as placeholders
- Handle edge cases like toll-free numbers, short codes, and emergency numbers
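Deduplication and placeholder filtering might be sketched as follows. The blocklist `PLACEHOLDER_DIGITS` and the helper `clean_numbers` are hypothetical. The last-10-digits comparison assumes North American numbers, and toll-free numbers, short codes, and emergency numbers are not treated specially here.

```python
import re

# Hypothetical blocklist; extend it with placeholders seen in your own data.
PLACEHOLDER_DIGITS = {"1234567890", "0000000000"}


def clean_numbers(candidates: list[str]) -> list[str]:
    """Drop placeholders and duplicates, keeping the first spelling seen."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in candidates:
        digits = re.sub(r"\D", "", raw)
        # Compare the last 10 digits so "+1 123-456-7890" is caught too.
        # A single repeated digit (or no digits at all) is also rejected.
        if digits[-10:] in PLACEHOLDER_DIGITS or len(set(digits)) <= 1:
            continue
        if digits in seen:
            continue
        seen.add(digits)
        cleaned.append(raw)
    return cleaned


found = ["(555) 123-4567", "555-123-4567", "123-456-7890", "000-000-0000", "555.987.6543"]
print(clean_numbers(found))
```

Because duplicates are compared on bare digits, a number written once with a country code and once without is still treated as two entries. Normalizing to a single international format first avoids that.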