How do I extract JavaScript-generated links?

JavaScript-generated links are created dynamically by client-side code after the initial HTML loads, making them invisible to simple HTML parsers that only see static content.

Modern single-page applications (SPAs) built with React, Vue, Angular, or vanilla JavaScript often render most or all links via JavaScript.

Headless browser approach (most reliable):

Use tools like Puppeteer, Playwright, or Selenium that run a real browser engine and execute JavaScript:

  1. Navigate to the page with page.goto()
  2. Wait for content to load with page.waitForSelector() (Puppeteer and Playwright) or page.waitForLoadState() (Playwright only)
  3. Optionally interact with the page (scroll, click buttons) to trigger more links
  4. Extract links using page.$$eval('a', links => links.map(a => a.href))

This approach handles nearly any JavaScript pattern, because the links are read from the rendered DOM, but it is slower and more resource-intensive than parsing static HTML.
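The four steps above can be sketched with Playwright roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the playwright package is installed, and the names extractRenderedLinks and uniqueHttpLinks, the a[href] selector, and the 15-second timeout are all hypothetical choices made for this example.

```javascript
// Hypothetical helper: keep only absolute http(s) URLs and drop duplicates.
function uniqueHttpLinks(hrefs) {
  const seen = new Set();
  for (const href of hrefs) {
    if (/^https?:\/\//i.test(href)) seen.add(href);
  }
  return [...seen];
}

// Hypothetical function; assumes `npm install playwright` has been run.
async function extractRenderedLinks(url, options = {}) {
  const { selector = 'a[href]', timeout = 15000 } = options;
  // Loaded lazily so that the helper above stays usable without Playwright.
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    // Step 1: navigate to the page.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout });
    // Step 2: wait until at least one matching element is attached to the DOM.
    await page.waitForSelector(selector, { state: 'attached', timeout });
    // Step 4: a.href is the resolved absolute URL, not the raw attribute text.
    const hrefs = await page.$$eval(selector, (links) => links.map((a) => a.href));
    return uniqueHttpLinks(hrefs);
  } finally {
    await browser.close();
  }
}
```

Calling extractRenderedLinks('https://example.com/') would resolve to an array of URLs. Step 3 (scrolling or clicking) is left out here because it depends on the page being scraped.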

Static analysis approach (faster but limited):

  • Examine the JavaScript code to understand how links are generated
  • Look for URL patterns in JavaScript files or inline <script> tags
  • Find data attributes that might contain URLs (like data-url, data-href)
  • Check for frameworks' routing configurations (React Router, Vue Router)
  • Extract hardcoded URL strings

This requires understanding the site's structure and misses URLs that are only assembled at runtime, but it is much faster because no browser has to be launched.
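As a rough sketch of the static approach, the function below scans raw HTML or JavaScript text for data-url / data-href attributes and for quoted strings that look like absolute URLs or root-relative paths. The name findStaticUrlCandidates and both regular expressions are assumptions made for this example; they are heuristics and will produce false positives and misses on real sites.

```javascript
// Hypothetical helper: pull candidate URLs out of source text without running it.
function findStaticUrlCandidates(source, baseUrl) {
  const found = new Set();
  const add = (raw) => {
    try {
      // Resolve relative paths against the page URL.
      found.add(new URL(raw, baseUrl).href);
    } catch {
      // Ignore strings that cannot be parsed as a URL.
    }
  };

  // 1. data-url="..." and data-href="..." attributes.
  const dataAttr = /\bdata-(?:url|href)\s*=\s*(["'])(.*?)\1/gi;
  for (const match of source.matchAll(dataAttr)) add(match[2]);

  // 2. Quoted literals starting with http(s):// or a single leading slash.
  const literal = /(["'`])((?:https?:\/\/|\/(?!\/))[^"'`\s<>]+)\1/g;
  for (const match of source.matchAll(literal)) add(match[2]);

  return [...found].sort();
}
```

Strings built by concatenation or template interpolation at runtime will not be found this way, which is the main limit of static analysis.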

Hybrid approach:

  • Load the page with a headless browser
  • Wait for initial rendering
  • Inject JavaScript to extract links
  • Trigger any pagination or "load more" buttons, and extract again after each one
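One way to sketch the hybrid flow is shown below. It assumes page is an already-open Playwright Page; the function name collectLinksWithLoadMore, the button.load-more selector, the click limit, and the fixed pause are hypothetical and would need adjusting for a real site.

```javascript
// Hypothetical function; `page` is assumed to be an open Playwright Page.
async function collectLinksWithLoadMore(page, options = {}) {
  const { loadMoreSelector = 'button.load-more', maxClicks = 10, pauseMs = 1000 } = options;
  const links = new Set();

  // Injected into the page: read every anchor's resolved URL from the live DOM.
  const harvest = async () => {
    const hrefs = await page.evaluate(() =>
      Array.from(document.querySelectorAll('a[href]'), (a) => a.href)
    );
    for (const href of hrefs) links.add(href);
  };

  await harvest(); // links present after the initial render

  for (let i = 0; i < maxClicks; i++) {
    const button = page.locator(loadMoreSelector).first();
    if (!(await button.isVisible())) break; // nothing left to load
    await button.click();
    // Crude fixed pause; waiting on a network or DOM signal is usually better.
    await page.waitForTimeout(pauseMs);
    await harvest();
  }
  return [...links];
}
```

Collecting into a Set after every click keeps links that a page might remove from the DOM when it swaps in the next batch.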

Common patterns:

  • Single-page apps with client-side routing (watch for pushState or hash changes)
  • Infinite scrolling (need to scroll to trigger loading)
  • Lazy-loaded content (scroll elements into view so their IntersectionObserver callbacks fire, then wait for the content to appear)
  • Click-to-reveal links (simulate clicks on dropdown menus or expandable sections)
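For the infinite-scrolling pattern in particular, a simple heuristic is to keep scrolling to the bottom until the number of links stops growing. The sketch below assumes page is an open Playwright or Puppeteer page; the name scrollUntilStable and the default round limit and pause are illustrative assumptions, not recommended values.

```javascript
// Hypothetical function; `page` is assumed to be an open Playwright or Puppeteer page.
async function scrollUntilStable(page, options = {}) {
  const { selector = 'a[href]', maxRounds = 20, pauseMs = 800 } = options;
  let previousCount = -1;

  for (let round = 0; round < maxRounds; round++) {
    const count = await page.$$eval(selector, (els) => els.length);
    if (count === previousCount) break; // no new links since the last scroll
    previousCount = count;
    // Jump to the bottom so the page requests its next batch of content.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // Fixed pause to give that batch time to render.
    await page.waitForTimeout(pauseMs);
  }

  // Return the de-duplicated, resolved URLs once the count has settled.
  return page.$$eval(selector, (els) => [...new Set(els.map((a) => a.href))]);
}
```

A slow network can make the count look stable before the next batch arrives, so a longer pause or a second confirmation round may be needed in practice.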

Our Link Extractor works on static HTML; for JavaScript-heavy sites, we recommend using headless browsers or browser DevTools to capture the rendered DOM.
