How do I extract data from HTML tables?
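Most HTML tables can be extracted in a few lines of code. Use pandas for the quickest route to a DataFrame, or an HTML parser such as BeautifulSoup or Cheerio when you need control over individual rows and cells.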
Using Python (pandas - easiest):
import pandas as pd
# Read all tables from an HTML file (returns a list of DataFrames;
# needs lxml, or beautifulsoup4 with html5lib, installed)
tables = pd.read_html('page.html')
# Or from URL
tables = pd.read_html('https://example.com')
# Access specific table (0-indexed)
df = tables[0]
# Export to CSV or JSON
df.to_csv('output.csv', index=False)
df.to_json('output.json', orient='records')
Using Python (BeautifulSoup - more control):
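from bs4 import BeautifulSoup
# Load the HTML to parse (here, from a local file)
with open('page.html', encoding='utf-8') as f:
    html = f.read()
soup = BeautifulSoup(html, 'lxml')  # Needs lxml installed; 'html.parser' is the built-in alternative
table = soup.find('table')  # First table in the document
# Extract headers from the first row only
headers = [th.text.strip() for th in table.find('tr').find_all('th')]
# Extract data rows
rows = []
for tr in table.find_all('tr')[1:]:  # Skip header row
    cells = [td.text.strip() for td in tr.find_all('td')]
    rows.append(cells)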
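Once headers and rows have been collected as plain lists, they can be exported much like the pandas DataFrame. The following is a minimal sketch using only the Python standard library. The helper names rows_to_records and rows_to_csv are illustrative and not part of BeautifulSoup, and the sample data is made up.

```python
import csv
import io
import json


def rows_to_records(headers, rows):
    """Pair each row with the headers to get one dict per row."""
    return [dict(zip(headers, row)) for row in rows]


def rows_to_csv(headers, rows):
    """Return the table as CSV text, header line first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample data standing in for the extracted headers and rows
headers = ["Name", "Score"]
rows = [["Ann", "90"], ["Bob", "85"]]

print(json.dumps(rows_to_records(headers, rows)))
print(rows_to_csv(headers, rows))
```

Note that zip stops at the shorter sequence, so a row with fewer cells than headers yields a dict with missing keys rather than an error.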
Using JavaScript (Cheerio):
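const fs = require('fs');
const cheerio = require('cheerio');
// Load the HTML to parse (here, from a local file)
const html = fs.readFileSync('page.html', 'utf8');
const $ = cheerio.load(html);
const table = [];
// Note: 'table tr' matches rows in every table on the page; narrow the selector to target one
$('table tr').each((i, row) => {
  const rowData = [];
  // Collect header (th) and data (td) cells in document order
  $(row).find('td, th').each((j, cell) => {
    rowData.push($(cell).text().trim());
  });
  table.push(rowData);
});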
Common challenges:
- Tables with rowspan/colspan (spanned cells must be copied into every row and column they cover, or the columns drift out of alignment)
- Nested tables (need to select specific table)
- Missing or inconsistent headers
- Extra whitespace or HTML entities
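The rowspan/colspan and nested-table cases above can be handled by expanding each spanned cell into every grid slot it covers. The sketch below shows one way to do this with Python's standard-library html.parser, so it runs without extra packages. TableGridParser, expand_spans, and table_to_grid are hypothetical names, and the code assumes a single top-level table with numeric span attributes. The same expansion logic could be applied to cells found with BeautifulSoup.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect cells of the top-level table as [text, rowspan, colspan] per row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # cell currently being read, if any
        self._depth = 0    # table nesting depth; 1 means the top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        elif self._depth == 1 and tag == "tr":
            self.rows.append([])
        elif self._depth == 1 and tag in ("td", "th") and self.rows:
            a = dict(attrs)
            rowspan = max(1, int(a.get("rowspan") or 1))
            colspan = max(1, int(a.get("colspan") or 1))
            self._cell = ["", rowspan, colspan]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag == "table":
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        # Text inside nested tables (depth > 1) is ignored
        if self._cell is not None and self._depth == 1:
            self._cell[0] += data


def expand_spans(rows):
    """Copy each spanned cell into every (row, column) slot it covers."""
    filled = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in filled:  # slot already taken by a rowspan from above
                c += 1
            value = " ".join(text.split())  # normalize whitespace
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = value
            c += colspan
    if not filled:
        return []
    n_cols = max(col for _, col in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(len(rows))]


def table_to_grid(html):
    parser = TableGridParser()
    parser.feed(html)
    return expand_spans(parser.rows)


sample = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>
"""
print(table_to_grid(sample))
# → [['Name', 'Score', 'Score'], ['Name', 'Math', 'Art'], ['Ann', '90', '85']]
```

Repeating the spanned value in each slot keeps every row the same width, which is what CSV export and DataFrame construction expect. Leaving the extra slots empty instead is an equally valid choice when duplicates would be misleading.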
Best practices:
- Identify the target table by class, id, or position
- Normalize whitespace in cell content
- Handle missing cells gracefully
- Validate extracted data before export
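The whitespace, missing-cell, and validation practices above can be sketched as a small cleanup step that runs after extraction and before export. This is an illustrative example in plain Python. The function names normalize_cell, clean_rows, and validate are hypothetical, and the rules checked (unique non-empty headers, one cell per header) are assumptions about what counts as valid for a given table.

```python
def normalize_cell(text):
    """Collapse runs of whitespace, including non-breaking spaces, to single spaces."""
    return " ".join(text.split())


def clean_rows(headers, rows, fill=""):
    """Normalize every cell and pad short rows so they match the header width."""
    width = len(headers)
    cleaned = []
    for row in rows:
        cells = [normalize_cell(cell) for cell in row]
        cells += [fill] * (width - len(cells))  # no-op when the row is already wide enough
        cleaned.append(cells)
    return cleaned


def validate(headers, rows):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if any(not h for h in headers):
        problems.append("empty header name")
    if len(set(headers)) != len(headers):
        problems.append("duplicate header names")
    for i, row in enumerate(rows):
        if len(row) != len(headers):
            problems.append(f"row {i}: expected {len(headers)} cells, got {len(row)}")
    return problems


# Made-up sample data with typical defects
headers = ["Name", "Score"]
rows = [["  Ann\n ", "90"], ["Bob"], ["Cy", "70", "extra"]]

cleaned = clean_rows(headers, rows)
print(cleaned)
# → [['Ann', '90'], ['Bob', ''], ['Cy', '70', 'extra']]
print(validate(headers, cleaned))
# → ['row 2: expected 2 cells, got 3']
```

Rows that are too long are reported rather than truncated, so that unexpected extra cells are noticed instead of silently dropped.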
When to use a table extractor tool:
- Quick one-off extraction without coding
- Testing selectors before writing code
- Previewing table structure
- Converting to CSV/JSON for non-programmers