How do I extract data from HTML tables?

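Most HTML tables can be extracted with a few lines of code. pandas covers the common case in a single call, while BeautifulSoup (Python) and Cheerio (JavaScript) give finer control over which cells are read and how.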

Using Python (pandas - easiest):

import pandas as pd

# Read all tables from a local HTML file (returns a list of DataFrames;
# requires lxml, or beautifulsoup4 with html5lib)
tables = pd.read_html('page.html')

# Or from a URL
tables = pd.read_html('https://example.com')

# Access specific table (0-indexed)
df = tables[0]

# Export to CSV or JSON
df.to_csv('output.csv', index=False)
df.to_json('output.json', orient='records')

Using Python (BeautifulSoup - more control):

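from bs4 import BeautifulSoup

# Load the HTML to parse (here from a local file)
with open('page.html', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'lxml')  # Requires lxml; 'html.parser' is the built-in alternative
table = soup.find('table')  # First table; returns None if the page has none
table_rows = table.find_all('tr')

# Extract headers from the first row only
headers = [th.text.strip() for th in table_rows[0].find_all('th')]

# Extract data rows
rows = []
for tr in table_rows[1:]:  # Skip header row
    cells = [td.text.strip() for td in tr.find_all(['td', 'th'])]  # Include row-header cells
    rows.append(cells)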
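
The BeautifulSoup example stops once the headers and rows lists are built. A minimal sketch of exporting them with the standard csv module follows; the function name write_table and the sample data are illustrative assumptions, not part of the original example.

```python
import csv
import io


def write_table(fileobj, headers, rows):
    """Write a header row followed by data rows as CSV to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(headers)
    writer.writerows(rows)


# Hypothetical data shaped like the headers/rows lists built above
sample_headers = ["Name", "Price"]
sample_rows = [["Widget", "9.99"], ["Gadget, large", "19.50"]]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_table(buffer, sample_headers, sample_rows)
csv_text = buffer.getvalue()
```

To write to disk instead, open the target with open('output.csv', 'w', newline='', encoding='utf-8') and pass that file object. The csv module quotes cells that contain commas, so values such as "Gadget, large" survive the round trip.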

Using JavaScript (Cheerio):

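const fs = require('fs');
const cheerio = require('cheerio');

// Load the HTML to parse (here from a local file)
const html = fs.readFileSync('page.html', 'utf8');
const $ = cheerio.load(html);

// Collect every row of every table as an array of trimmed cell strings
const table = [];
$('table tr').each((i, row) => {
  const rowData = [];
  $(row).find('td, th').each((j, cell) => {
    rowData.push($(cell).text().trim());
  });
  table.push(rowData);
});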

Common challenges:

Tables with rowspan/colspan (require special handling)
Nested tables (need to select specific table)
Missing or inconsistent headers
Extra whitespace or HTML entities

Best practices:

Identify the target table by class, id, or position
Normalize whitespace in cell content
Handle missing cells gracefully
Validate extracted data before export
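
Several of these practices can be applied in one pass over rows that have already been extracted. The sketch below assumes headers and rows lists shaped like those in the BeautifulSoup example; the function name clean_rows, the fill parameter, and the sample data are hypothetical.

```python
def clean_rows(headers, rows, fill=""):
    """Normalize whitespace, pad short rows, and reject rows wider than the header."""
    width = len(headers)
    cleaned = []
    for i, row in enumerate(rows):
        # split() with no arguments also treats non-breaking spaces as whitespace
        cells = [" ".join(cell.split()) for cell in row]
        if not any(cells):
            continue  # drop fully empty rows, such as spacer rows
        if len(cells) > width:
            raise ValueError(f"row {i} has {len(cells)} cells, expected at most {width}")
        cells += [fill] * (width - len(cells))  # pad missing trailing cells
        cleaned.append(cells)
    return cleaned


# Hypothetical input: messy whitespace, a short row, and an empty row
headers = ["Name", "Price", "Stock"]
rows = [["  Widget\n A ", "9.99", "4"], ["Gadget\xa0B", "19.50"], []]
cleaned = clean_rows(headers, rows)
```

Padding assumes that missing cells are always at the end of a row. When cells can be missing in the middle, the gap cannot be recovered from the row alone, and raising an error may be the safer choice.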

When to use a table extractor tool:

Quick one-off extraction without coding
Testing selectors before writing code
Previewing table structure
Converting to CSV/JSON for non-programmers
