How do I handle relative URLs when extracting links?

Relative URLs are paths without a full domain, like /about, ../contact, or images/photo.jpg. They only make sense relative to the page they appear on, so they must be converted to absolute URLs to be usable outside that page.

Understanding URL types:

  • Absolute URLs: Include protocol and domain (https://example.com/page)
  • Root-relative URLs: Start with / and reference from domain root (/about → https://example.com/about)
  • Relative URLs: Reference from current directory (contact from /products/ → /products/contact)
  • Parent-relative URLs: Use .. to go up directories (../about from /products/items/ → /products/about)
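
The URL types above can be checked with a short Python sketch using the standard library's urllib.parse.urljoin. The example.com page URLs are hypothetical and chosen only to mirror the cases in the list.

```python
from urllib.parse import urljoin

# (base page URL, href found on that page) - illustrative values only.
examples = [
    ("https://example.com/products/", "https://example.com/page"),  # absolute
    ("https://example.com/products/", "/about"),                     # root-relative
    ("https://example.com/products/", "contact"),                    # relative
    ("https://example.com/products/items/", "../about"),             # parent-relative
]

# urljoin takes the base first, then the (possibly relative) URL.
resolved = [urljoin(base, href) for base, href in examples]

for (base, href), url in zip(examples, resolved):
    print(f"{href!r} on {base} -> {url}")
```

An href that is already absolute is returned unchanged, so the same call can be applied to every link without checking its type first.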

Conversion process:

  1. Identify the base URL - This is normally the current page's URL, but a <base> tag in the HTML <head>, if present, overrides it
  2. Parse both URLs - Extract components (protocol, domain, path) from both the base URL and relative URL
  3. Apply resolution rules:
    • If the relative URL starts with //, it is protocol-relative: prepend the base URL's protocol
    • If it starts with /, replace the base URL's entire path
    • If it starts with .., move up one directory for each .. segment
    • Otherwise, append it to the base URL's current directory
  4. Normalize the result - Resolve . and .. segments and remove duplicate slashes
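
The steps above could be sketched in Python roughly as follows. This is a minimal illustration, not a complete extractor: the names LinkCollector and extract_absolute_links are hypothetical, only <a href> links are collected, and urljoin is relied on for parsing, resolution, and dot-segment handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects raw <a href> values and the first <base href> in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "base" and self.base_href is None and attr_map.get("href"):
            # Only the first <base> with an href is honored here.
            self.base_href = attr_map["href"]
        elif tag == "a" and attr_map.get("href") is not None:
            self.hrefs.append(attr_map["href"])


def extract_absolute_links(html, page_url):
    """Return every <a href> in `html` as an absolute URL."""
    collector = LinkCollector()
    collector.feed(html)
    # Step 1: a <base href> (which may itself be relative) overrides the page URL.
    if collector.base_href:
        base_url = urljoin(page_url, collector.base_href)
    else:
        base_url = page_url
    # Steps 2-4: urljoin parses both URLs, applies the resolution rules,
    # and resolves "." and ".." segments in the result.
    return [urljoin(base_url, href) for href in collector.hrefs]


page = (
    '<html><head><base href="/shop/"></head><body>'
    '<a href="item?id=1">Item</a>'
    '<a href="//cdn.example.com/a.js">CDN</a>'
    '<a href="../about">About</a>'
    "</body></html>"
)
links = extract_absolute_links(page, "https://example.com/products/list.html")
print(links)
```

Because the sample page declares <base href="/shop/">, the links resolve against https://example.com/shop/ rather than against the /products/ directory of the page URL.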

Common pitfalls:

  • Ignoring <base> tags that change the base URL
  • Incorrectly handling query parameters and fragments
  • Failing to decode URL-encoded characters
  • Not handling edge cases such as an empty href or a lone #, which resolve back to the base URL rather than to a new page
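
One way to guard against some of these pitfalls is sketched below in Python. The helper name resolve_href is hypothetical, and skipping empty or fragment-only links and stripping fragments is an assumed policy for a link extractor, not the only reasonable one.

```python
from urllib.parse import urljoin, urldefrag


def resolve_href(base_url, href):
    """Return an absolute URL without its fragment, or None for links to skip."""
    if href is None:
        return None
    href = href.strip()
    # Empty and fragment-only hrefs ("", "#", "#top") resolve to the base URL
    # itself, so this sketch skips them instead of emitting a self-link.
    if not href or href.startswith("#"):
        return None
    absolute = urljoin(base_url, href)
    # urldefrag removes the "#fragment" part and leaves the query string intact.
    return urldefrag(absolute).url


base = "https://example.com/products/list?page=2"
for raw in ["", "#", "?page=3", "item.html?x=1#reviews", " /about "]:
    print(repr(raw), "->", resolve_href(base, raw))
```

A query-only href such as ?page=3 keeps the base URL's path and replaces only its query string, which is easy to get wrong when joining strings by hand.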

Programming language support:

Most languages provide URL resolution libraries:

  • Node.js: new URL(relative, base)
  • Python: urllib.parse.urljoin(base, relative) (note that the base comes first)
  • Browsers: native new URL(relative, base)

Our Link Extractor automatically converts all relative URLs to absolute URLs for easy use.
