How do I extract data from HTML tables?

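The right approach depends on how much control you need: pandas reads every table on a page into DataFrames in one call, while BeautifulSoup (Python) and Cheerio (JavaScript) let you walk the rows and cells yourself.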

Using Python (pandas - easiest):

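import pandas as pd  # read_html also needs a parser installed: lxml, or bs4 + html5lib

# Read all tables from a local HTML file (returns a list of DataFrames;
# raises ValueError if the page contains no tables)
tables = pd.read_html('page.html')

# Or from a URL
tables = pd.read_html('https://example.com')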

# Access specific table (0-indexed)
df = tables[0]

# Export to CSV or JSON
df.to_csv('output.csv', index=False)
df.to_json('output.json', orient='records')

Using Python (BeautifulSoup - more control):

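from bs4 import BeautifulSoup

# Load the markup; 'page.html' is a placeholder path
with open('page.html', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'lxml')  # needs lxml installed; 'html.parser' is built in
table = soup.find('table')  # first table on the page, or None if there is none

all_rows = table.find_all('tr')

# Extract headers from the first row only, so <th> cells in later rows are not mixed in
headers = [th.text.strip() for th in all_rows[0].find_all('th')]

# Extract rows
rows = []
for tr in all_rows[1:]:  # Skip header row
    # Match <th> as well as <td>, since body rows may use <th> for row labels
    cells = [cell.text.strip() for cell in tr.find_all(['td', 'th'])]
    rows.append(cells)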
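
Once headers and rows are collected, they still need to be exported. The following is a minimal sketch using only the Python standard library; the function names to_records and to_csv_text, and the sample data, are illustrative assumptions rather than part of any library.

```python
import csv
import io
import json


def to_records(headers, rows):
    """Pair each row with the headers, producing one dict per row."""
    return [dict(zip(headers, row)) for row in rows]


def to_csv_text(headers, rows):
    """Serialize headers and rows to CSV text with the standard csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the extracted headers and rows
headers = ["Name", "Price"]
rows = [["Widget", "9.99"], ["Gadget, large", "24.50"]]

records = to_records(headers, rows)
print(json.dumps(records, indent=2))
print(to_csv_text(headers, rows))
```

The csv module quotes cells that contain commas, which is easy to get wrong when joining strings by hand.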

Using JavaScript (Cheerio):

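const fs = require('fs');
const cheerio = require('cheerio');

// 'page.html' is a placeholder path
const html = fs.readFileSync('page.html', 'utf8');
const $ = cheerio.load(html);

const table = [];
// 'table tr' matches rows in every table on the page;
// narrow the selector by id or class to target a single table
$('table tr').each((i, row) => {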
  const rowData = [];
  $(row).find('td, th').each((j, cell) => {
    rowData.push($(cell).text().trim());
  });
  table.push(rowData);
});

Common challenges:

  • Tables with rowspan/colspan (requires special handling)
  • Nested tables (need to select specific table)
  • Missing or inconsistent headers
  • Extra whitespace or HTML entities
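
One possible way to handle rowspan/colspan is to copy each spanning cell's text into every grid position it covers, so all rows end up the same width. The sketch below uses only the standard library's html.parser so it runs without extra installs. The names SpanTableParser and expand_spans are hypothetical, and the code assumes a single, non-nested, well-formed table with numeric span attributes.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collects every row's cells as (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell tuples per <tr>
        self._cell = None   # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.rows[-1].append((text, rowspan, colspan))
            self._cell = None


def expand_spans(rows):
    """Copy each spanning cell's text into every grid position it covers."""
    grid = []
    pending = {}  # (row, col) -> text carried down from a rowspan above
    for r, cells in enumerate(rows):
        out, c, queue = [], 0, list(cells)
        while queue or (r, c) in pending:
            if (r, c) in pending:
                # This column is occupied by a cell spanning down from an earlier row
                out.append(pending.pop((r, c)))
                c += 1
                continue
            text, rowspan, colspan = queue.pop(0)
            for dc in range(colspan):
                out.append(text)
                for dr in range(1, rowspan):
                    pending[(r + dr, c + dc)] = text
            c += colspan
        grid.append(out)
    return grid


# Hypothetical sample markup
sample_html = """
<table>
  <tr><th rowspan="2">Region</th><th colspan="2">Sales</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>North</td><td>10</td><td>20</td></tr>
</table>
"""

parser = SpanTableParser()
parser.feed(sample_html)
grid = expand_spans(parser.rows)
for row in grid:
    print(row)
```

Repeating the spanned text is only one design choice; leaving the covered positions empty may suit data where duplicated values would be misleading.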

Best practices:

  • Identify the target table by class, id, or position
  • Normalize whitespace in cell content
  • Handle missing cells gracefully
  • Validate extracted data before export
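
The cleanup and validation practices above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: clean_cell, normalize_rows and validate_rows are hypothetical helper names, and the checks shown (row width, duplicate or empty headers) are just examples of what "validate" might mean for your data.

```python
import html


def clean_cell(text, decode_entities=True):
    """Collapse whitespace (including non-breaking spaces) to single spaces.

    Set decode_entities=False if your parser has already decoded entities,
    as BeautifulSoup's .text and Cheerio's .text() do.
    """
    if decode_entities:
        text = html.unescape(text)
    return " ".join(text.split())


def normalize_rows(headers, rows, fill=""):
    """Clean every cell and pad short rows up to the header width."""
    width = len(headers)
    cleaned = []
    for row in rows:
        cells = [clean_cell(cell) for cell in row]
        cells += [fill] * (width - len(cells))  # no-op when the row is already full
        cleaned.append(cells)
    return cleaned


def validate_rows(headers, rows):
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    if len(set(headers)) != len(headers):
        problems.append("duplicate header names")
    if any(not h for h in headers):
        problems.append("empty header name")
    for i, row in enumerate(rows):
        if len(row) != len(headers):
            problems.append(f"row {i} has {len(row)} cells, expected {len(headers)}")
    return problems


# Hypothetical sample data with messy whitespace, a short row, and a long row
sample_headers = ["Name", "Price"]
raw_rows = [["  Widget&nbsp;A ", "9.99"], ["Gadget\n  B"], ["Extra", "1", "2"]]

clean_rows = normalize_rows(sample_headers, raw_rows)
problems = validate_rows(sample_headers, clean_rows)
print(clean_rows)
print(problems)
```

Short rows are padded rather than dropped so that no data is silently lost; rows that are too long are left intact and reported, since truncating them would hide a likely parsing problem.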

When to use a table extractor tool:

  • Quick one-off extraction without coding
  • Testing selectors before writing code
  • Previewing table structure
  • Converting to CSV/JSON for non-programmers
