How do I extract all links from HTML?

Extracting links from HTML involves parsing the document and identifying all anchor tags (<a>) and their href attributes, which contain the URLs.

Basic extraction process:

Start by loading the HTML content into an HTML parser such as Cheerio (Node.js) or Beautiful Soup (Python), or query the live page from the browser DevTools console. Select anchors with a CSS selector like a[href] so that you only get <a> tags that actually carry a link, then read the href attribute value from each one.
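
The process above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it uses the standard library's html.parser in place of Beautiful Soup or Cheerio so that it runs without extra packages, and the names AnchorHrefParser, extract_hrefs, and sample_html are hypothetical.

```python
from html.parser import HTMLParser


class AnchorHrefParser(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without href and valueless href attributes.
            if name == "href" and value is not None:
                self.hrefs.append(value)
                break


def extract_hrefs(html):
    """Return the raw href values of all <a href=...> tags, in document order."""
    parser = AnchorHrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample markup for illustration only.
sample_html = """
<p><a href="https://example.com/page">Absolute</a>
<a href="/about">Relative</a>
<a name="top">No href, skipped</a></p>
"""

print(extract_hrefs(sample_html))  # ['https://example.com/page', '/about']
```

The values come back exactly as written in the markup, so relative URLs still need to be resolved before they can be fetched.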

Other link sources:

Don't forget about link sources beyond <a> tags:

<link> tags in the <head> section (stylesheets, favicons, canonical URLs)
<area> tags in image maps
<frame> and <iframe> tags for embedded content (the URL is in the src attribute, not href)
JavaScript-generated links that may be stored in data attributes or created dynamically
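
One way to cover the static sources listed above is a small tag-to-attribute table. The sketch below again assumes Python's built-in html.parser; URL_ATTRIBUTES, LinkSourceParser, and extract_link_sources are hypothetical names, and the sample markup is made up. It does not attempt to find JavaScript-generated links, which cannot be read reliably from static HTML.

```python
from html.parser import HTMLParser

# Which attribute carries the URL for each tag mentioned above.
URL_ATTRIBUTES = {
    "a": "href",
    "link": "href",
    "area": "href",
    "frame": "src",
    "iframe": "src",
}


class LinkSourceParser(HTMLParser):
    """Collects (tag, url) pairs for every tag listed in URL_ATTRIBUTES."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Take the first matching attribute that has a non-empty value.
            if name == wanted and value:
                self.links.append((tag, value))
                break


def extract_link_sources(html):
    """Return (tag, url) pairs in document order."""
    parser = LinkSourceParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample_html = """
<head><link rel="stylesheet" href="/style.css"></head>
<body>
<a href="/about">About</a>
<map name="m"><area shape="rect" coords="0,0,10,10" href="/region"></map>
<iframe src="https://example.com/embed"></iframe>
</body>
"""

for tag, url in extract_link_sources(sample_html):
    print(tag, url)
```

Keeping the tag name next to each URL makes it easy to filter later, for example to separate stylesheets from navigational links.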

URL formats to handle:

Absolute URLs: https://example.com/page
Relative URLs: /about or ../contact
Protocol-relative URLs: //cdn.example.com/script.js
Anchor links: #section
JavaScript pseudo-URLs: javascript:void(0)
Special protocols: mailto: or tel: links
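
A rough classifier for the formats listed above might look like the following. The function name classify_href and the category labels are hypothetical, chosen only for this illustration; treat it as a sketch rather than a complete URL parser.

```python
import re

# A URL scheme is a letter followed by letters, digits, "+", "." or "-", then ":".
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def classify_href(href):
    """Return a coarse label for an href value."""
    href = href.strip()
    if href.startswith("#"):
        return "anchor"
    if href.startswith("//"):
        return "protocol-relative"
    match = SCHEME_RE.match(href)
    if match is None:
        # No scheme: a path such as /about or ../contact.
        return "relative"
    scheme = match.group(1).lower()
    if scheme in ("http", "https"):
        return "absolute"
    if scheme == "javascript":
        return "javascript"
    if scheme in ("mailto", "tel"):
        return "special"
    return "other-scheme"


for example in [
    "https://example.com/page",
    "/about",
    "../contact",
    "//cdn.example.com/script.js",
    "#section",
    "javascript:void(0)",
    "mailto:someone@example.com",
]:
    print(example, "->", classify_href(example))
```

Classifying before normalizing lets you drop javascript:, mailto:, and tel: values early, since they never resolve to a fetchable page.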

URL normalization:

Convert relative URLs to absolute URLs by resolving them against the URL of the page they were found on (or against the <base href> value, if the page sets one). Parse URL fragments and query parameters if needed.
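
In Python, the standard library's urllib.parse covers this step. The sketch below assumes the page was fetched from a made-up address, https://example.com/blog/post.html; normalize_href and page_url are hypothetical names.

```python
from urllib.parse import parse_qs, urldefrag, urljoin, urlsplit


def normalize_href(href, page_url):
    """Resolve href against page_url and split off the fragment and query."""
    absolute = urljoin(page_url, href.strip())
    url_without_fragment, fragment = urldefrag(absolute)
    query = parse_qs(urlsplit(absolute).query)
    return {
        "url": url_without_fragment,
        "fragment": fragment,
        "query": query,
    }


# Assumed address of the page the links were extracted from.
page_url = "https://example.com/blog/post.html"

print(urljoin(page_url, "/about"))
# https://example.com/about
print(urljoin(page_url, "../contact"))
# https://example.com/contact
print(urljoin(page_url, "//cdn.example.com/script.js"))
# https://cdn.example.com/script.js
print(urljoin(page_url, "#section"))
# https://example.com/blog/post.html#section

print(normalize_href("../contact?ref=nav#team", page_url))
```

Removing the fragment before comparing URLs helps with de-duplication, because /page#a and /page#b point at the same document.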

Be aware that some modern sites render links via JavaScript after page load, requiring a headless browser to see all links. Our Link Extractor automatically handles these cases and provides filtering options to distinguish between different link types.
