How do I handle relative URLs when extracting links?

Relative URLs omit the protocol and domain, like /about, ../contact, or images/photo.jpg. They only have meaning relative to the page they appear on, so they must be converted to absolute URLs before they can be used outside that page.

Understanding URL types:

Absolute URLs: Include protocol and domain (https://example.com/page)
Root-relative URLs: Start with / and reference from domain root (/about → https://example.com/about)
Path-relative URLs: Resolve from the current directory (contact from /products/ → /products/contact)
Parent-relative URLs: Use .. to go up directories (../about from /products/items/ → /products/about)
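
As a rough illustration, Python's urllib.parse.urljoin() resolves each of these URL types against a base URL. The base URL, the example.org address, and the resolve() helper below are made up for this example.

```python
from urllib.parse import urljoin

# Hypothetical page the links were found on (note the trailing slash).
BASE = "https://example.com/products/items/"


def resolve(href, base=BASE):
    """Resolve one href against the base URL."""
    return urljoin(base, href)


print(resolve("https://example.org/page"))   # absolute: returned unchanged
print(resolve("/about"))                     # https://example.com/about
print(resolve("contact"))                    # https://example.com/products/items/contact
print(resolve("../about"))                   # https://example.com/products/about
print(resolve("//cdn.example.com/lib.js"))   # https://cdn.example.com/lib.js

# Without a trailing slash, the last path segment of the base is replaced.
print(resolve("contact", "https://example.com/products/items"))
# https://example.com/products/contact
```

The last call shows why the trailing slash on the base URL matters: /products/items is treated as a file inside /products/, not as a directory.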

Conversion process:

Identify the base URL - This is typically the current page's URL, unless the HTML <head> contains a <base> tag with an href, which overrides it
Parse both URLs - Extract components (protocol, domain, path) from both the base URL and relative URL
Apply resolution rules:
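- If the relative URL starts with //, prepend the base URL's protocol
- If it starts with /, keep the base URL's protocol and domain and replace the entire path
- If it starts with .., navigate up one directory for each .. segment
- Otherwise, append to the base URL's directory (the base path up to its last /)
Normalize the result - Resolve . and .. segments. Collapsing duplicate slashes is an optional extra step: RFC 3986 resolution does not do it, and some servers treat // as significant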
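
The steps above can be sketched with Python's standard library. This is a minimal sketch, not a complete extractor: LinkCollector and extract_links are hypothetical names, only <a href> links are collected, and urljoin() is relied on for parsing, applying the resolution rules, and resolving . and .. segments. It assumes that only the first <base> tag with an href should be honored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect raw hrefs from <a> tags and track the first <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url   # default base: the page's own URL
        self.base_seen = False     # only the first <base href> is used
        self.raw_hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "base" and href and not self.base_seen:
            # A <base href> may itself be relative to the page URL.
            self.base_url = urljoin(self.base_url, href)
            self.base_seen = True
        elif tag == "a" and href is not None:
            self.raw_hrefs.append(href)


def extract_links(html, page_url):
    """Return every <a href> in the HTML as an absolute URL."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    # Resolve after parsing so the final base applies to every link.
    return [urljoin(collector.base_url, h) for h in collector.raw_hrefs]


html = """
<html><head><base href="https://cdn.example.com/assets/"></head>
<body>
  <a href="images/photo.jpg">Photo</a>
  <a href="/about">About</a>
</body></html>
"""
print(extract_links(html, "https://example.com/products/items/"))
```

Because of the <base> tag, both links resolve against https://cdn.example.com/assets/ rather than the page's own URL.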

Common pitfalls:

Ignoring <base> tags that change the base URL
Incorrectly handling query parameters and fragments
Failing to decode URL-encoded characters
Not handling edge cases like empty hrefs or single #
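
One way to guard against several of these pitfalls is to filter and clean each href before keeping it. The sketch below is an assumption about what a crawler might want, not a rule: resolve_href and SKIPPED_SCHEMES are hypothetical names, and it chooses to skip empty hrefs, same-page anchors, and non-web schemes, and to drop fragments while keeping query strings untouched.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical list of schemes that do not point to fetchable pages.
SKIPPED_SCHEMES = {"mailto", "tel", "javascript", "data"}


def resolve_href(href, base_url):
    """Return an absolute URL without its fragment, or None to skip the link."""
    if href is None:
        return None
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty href or same-page anchor such as "#" or "#top"
    if urlsplit(href).scheme.lower() in SKIPPED_SCHEMES:
        return None
    absolute = urljoin(base_url, href)
    # urldefrag() removes "#fragment"; the query string is left as written.
    return urldefrag(absolute).url


base = "https://example.com/products/list"
print(resolve_href("#", base))                    # None
print(resolve_href("?page=2#results", base))      # https://example.com/products/list?page=2
print(resolve_href("/search?q=a%20b#top", base))  # https://example.com/search?q=a%20b
```

Percent-encoded characters such as %20 pass through unchanged here; whether to decode them depends on what the extracted links are used for.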

Programming language support:

Most languages provide URL resolution libraries:

Node.js: new URL(relative, base)
Python: urllib.parse.urljoin(base, relative) (note that the base comes first)
Browsers: native new URL(relative, base)

Our Link Extractor automatically converts all relative URLs to absolute URLs for easy use.

