How do I handle relative URLs when extracting links?
Relative URLs are paths without a full domain, like /about, ../contact, or images/photo.jpg, and they must be converted to absolute URLs to be usable outside the source page context.
Understanding URL types:
- Absolute URLs: Include protocol and domain (`https://example.com/page`)
- Root-relative URLs: Start with `/` and reference from domain root (`/about` → `https://example.com/about`)
- Relative URLs: Reference from current directory (`contact` from `/products/` → `/products/contact`)
- Parent-relative URLs: Use `..` to go up directories (`../about` from `/products/items/` → `/products/about`)
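As a rough illustration of these categories, the Python sketch below labels an href by inspecting its prefix. The function name `classify_url` is hypothetical, and the `protocol-relative` label (for hrefs starting with `//`) is an extra category added here for completeness.

```python
from urllib.parse import urlsplit


def classify_url(href: str) -> str:
    """Label an href with one of the URL types described above (illustrative helper)."""
    parts = urlsplit(href)
    if parts.scheme and parts.netloc:
        return "absolute"            # e.g. https://example.com/page
    if href.startswith("//"):
        return "protocol-relative"   # e.g. //cdn.example.com/app.js
    if href.startswith("/"):
        return "root-relative"       # e.g. /about
    if href == ".." or href.startswith("../"):
        return "parent-relative"     # e.g. ../about
    return "relative"                # e.g. contact or images/photo.jpg


for sample in ["https://example.com/page", "/about", "contact", "../about"]:
    print(sample, "->", classify_url(sample))
```

The label only describes how the href is written. Resolving it still requires a base URL.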
Conversion process:
- Identify the base URL - This is typically the current page's URL, but check for `<base>` tags in the HTML `<head>` which override the default base
- Parse both URLs - Extract components (protocol, domain, path) from both the base URL and relative URL
- Apply resolution rules:
  - If the relative URL starts with `//`, add the protocol
  - If it starts with `/`, replace the entire path
  - If it starts with `..`, navigate up directories
  - Otherwise, append to the current directory
- Normalize the result - Resolve `.` and `..` segments and remove duplicate slashes
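The process above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the names `LinkCollector` and `extract_absolute_links` are hypothetical, only `<a>` tags are collected, and the parsing, rule application, and dot-segment resolution are all delegated to `urllib.parse.urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> values and the first <base href> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base":
            # Only the first <base href> is kept in this sketch.
            if self.base_href is None:
                self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def extract_absolute_links(html: str, page_url: str) -> list[str]:
    """Resolve every collected href against the effective base URL."""
    collector = LinkCollector()
    collector.feed(html)
    # A <base> href may itself be relative, so resolve it against the page URL.
    if collector.base_href:
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in collector.hrefs]


page = """
<html><head><base href="/docs/"></head>
<body>
  <a href="guide/intro.html">Intro</a>
  <a href="../about">About</a>
  <a href="//cdn.example.com/app.js">Script</a>
</body></html>
"""
print(extract_absolute_links(page, "https://example.com/products/items/index.html"))
```

With the `<base href="/docs/">` tag present, the links resolve against `https://example.com/docs/` rather than the page's own directory.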
Common pitfalls:
- Ignoring `<base>` tags that change the base URL
- Incorrectly handling query parameters and fragments
- Failing to decode URL-encoded characters
- Not handling edge cases like empty hrefs or single `#`
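One possible way to guard against the empty-href, `#`, and query/fragment pitfalls is sketched below. The helper name `resolve_href` is hypothetical, and two policies are assumptions made for illustration: empty and fragment-only hrefs are skipped as same-page links, and fragments are dropped from the result while the query string is kept.

```python
from urllib.parse import urldefrag, urljoin


def resolve_href(base_url, href):
    """Return an absolute URL, or None for hrefs that only point back at the page."""
    if href is None:
        return None
    href = href.strip()
    # Empty hrefs and fragment-only hrefs ("#", "#top") refer to the current page.
    if href == "" or href.startswith("#"):
        return None
    absolute = urljoin(base_url, href)
    # Drop the fragment but keep the query string intact.
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment


base = "https://example.com/products/"
for raw in ["", "#", "list?page=2#reviews", "  ../about  "]:
    print(repr(raw), "->", resolve_href(base, raw))
```

Whether to keep or drop fragments depends on the use case. A crawler deduplicating pages usually wants them removed, while a tool that preserves in-page anchors would return `absolute` unchanged.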
Programming language support:
Most languages provide URL resolution libraries:
- Node.js: `new URL(relative, base)`
- Python: `urllib.parse.urljoin()`
- Browsers: native `new URL()`
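For example, Python's `urljoin` can be called directly with a base URL and an href. The base URL and sample hrefs below are illustrative values, not taken from any real page.

```python
from urllib.parse import urljoin

base = "https://example.com/products/items/"

hrefs = [
    "https://example.com/page",   # already absolute: returned unchanged
    "/about",                     # root-relative: replaces the whole path
    "contact",                    # relative: appended to the current directory
    "../about",                   # parent-relative: goes up one directory
    "//cdn.example.com/app.js",   # protocol-relative: inherits the scheme
]

resolved = {href: urljoin(base, href) for href in hrefs}
for href, absolute in resolved.items():
    print(f"{href} -> {absolute}")
```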
Our Link Extractor automatically converts all relative URLs to absolute URLs for easy use.