How do I scrape HTML tables with rowspan and colspan?
Tables with rowspan and colspan require special handling to extract correctly.
The challenge:
Rowspan and colspan create cells that span multiple rows or columns:
- <td rowspan="2"> spans two rows
- <td colspan="3"> spans three columns
- These create "missing" cells in the structure
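To make the "missing" cells concrete, the sketch below counts the td/th elements that physically appear in each row of a small made-up table. It uses only Python's standard-library html.parser; the CellCounter class and the sample markup are illustrative assumptions, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical sample: "North" spans two rows, "Total" spans two columns.
SAMPLE = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Sales</th></tr>
  <tr><td rowspan="2">North</td><td>Q1</td><td>100</td></tr>
  <tr><td>Q2</td><td>120</td></tr>
  <tr><td colspan="2">Total</td><td>220</td></tr>
</table>
"""


class CellCounter(HTMLParser):
    """Counts the td/th elements that physically appear in each tr."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)       # start a new row
        elif tag in ("td", "th") and self.counts:
            self.counts[-1] += 1        # one more cell in the current row


counter = CellCounter()
counter.feed(SAMPLE)
print(counter.counts)  # [3, 3, 2, 2]
```

The grid is three columns wide, yet the last two rows contain only two cell elements each. A naive row-by-row scrape would shift their values into the wrong columns.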
Python solution with pandas:
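pandas.read_html expands rowspan and colspan automatically by copying a spanned cell's value into every position it covers:
from io import StringIO
import pandas as pd
# Wrap the HTML string in StringIO; passing literal HTML directly is deprecated in recent pandas versions
tables = pd.read_html(StringIO(html))
df = tables[0]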
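As a quick check of that behavior, here is a minimal sketch that runs read_html on a small made-up table. The SAMPLE markup is an assumption for illustration, and read_html needs a parser backend installed (lxml, or BeautifulSoup with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: "North" spans two rows, "Total" spans two columns.
SAMPLE = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Sales</th></tr>
  <tr><td rowspan="2">North</td><td>Q1</td><td>100</td></tr>
  <tr><td>Q2</td><td>120</td></tr>
  <tr><td colspan="2">Total</td><td>220</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(SAMPLE))[0]
print(df)
```

The spanned values are repeated rather than left blank, so "North" should appear in both rows it covers and "Total" in both columns it covers. If you need blanks instead, you have to deduplicate after the fact.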
Manual handling with BeautifulSoup:
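from bs4 import BeautifulSoup
def extract_table_with_spans(table):
    # Values reserved by a rowspan in an earlier row: {(row, col): text}
    occupied = {}
    matrix = []
    for r, tr in enumerate(table.find_all('tr')):
        row_data = []
        col = 0
        for cell in tr.find_all(['td', 'th']):
            # Fill columns already claimed by a rowspan from a row above
            while (r, col) in occupied:
                row_data.append(occupied.pop((r, col)))
                col += 1
            text = cell.text.strip()
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))
            # Add cell value repeated for colspan
            for _ in range(colspan):
                row_data.append(text)
                # Reserve the same column in the rows below for rowspan
                for offset in range(1, rowspan):
                    occupied[(r + offset, col)] = text
                col += 1
        # Fill trailing columns claimed by a rowspan from a row above
        while (r, col) in occupied:
            row_data.append(occupied.pop((r, col)))
            col += 1
        matrix.append(row_data)
    return matrix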
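Using lxml with XPath:
from lxml import html
tree = html.fromstring(html_content)
tables = tree.xpath('//table')
# Walk each table row by row using XPath
for table in tables:
    rows = table.xpath('.//tr')
    for row in rows:
        # Direct child cells only; rowspan/colspan still need expanding as in the BeautifulSoup version
        cells = [cell.text_content().strip() for cell in row.xpath('./td | ./th')]
        # Process cells...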
Challenges with complex tables:
- Nested tables within cells
- Irregular structures (different column counts per row)
- Headers that span multiple levels
- Tables used for layout, not data
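For irregular structures in particular, one common workaround is to pad short rows so the result is rectangular before loading it into a DataFrame or CSV. The sketch below is a minimal illustration; pad_rows and the sample data are hypothetical names, and padding on the right assumes the missing cells are at the end of each row, which is not always true.

```python
def pad_rows(matrix, fill=""):
    """Return a copy of matrix in which every row has the same length."""
    width = max((len(row) for row in matrix), default=0)
    # Append filler values to the right of any row shorter than the widest one.
    return [list(row) + [fill] * (width - len(row)) for row in matrix]


# Hypothetical extraction result with different column counts per row
ragged = [
    ["Region", "Quarter", "Sales"],
    ["North", "Q1"],
    ["South"],
]
rect = pad_rows(ragged)
print(rect)
```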
Best practices:
- Try pandas first - it handles most cases
- If pandas fails, inspect the table structure manually
- Consider if you really need all the data or just key cells
- For very complex tables, target specific cells with XPath/CSS selectors
- Test extraction against multiple pages to ensure consistency
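The last point can be automated with a small check that runs over the tables extracted from several pages. This is only a sketch under assumptions: check_table, the expected header, and the sample page data are all made up for illustration.

```python
def check_table(matrix, expected_header):
    """Return a list of problems found in one extracted table (empty if none)."""
    if not matrix:
        return ["table is empty"]
    problems = []
    if matrix[0] != expected_header:
        problems.append(f"unexpected header: {matrix[0]}")
    width = len(expected_header)
    # Every data row should be as wide as the header.
    for i, row in enumerate(matrix[1:], start=1):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} cells, expected {width}")
    return problems


# Hypothetical results scraped from two pages of the same site
pages = {
    "page1": [["Region", "Sales"], ["North", "100"]],
    "page2": [["Region", "Sales"], ["South"]],
}
header = ["Region", "Sales"]
report = {name: check_table(matrix, header) for name, matrix in pages.items()}
print(report)
```

An empty list means the page matched expectations; anything else points at the row to inspect by hand.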
When manual extraction is needed:
- The table has complex nested structures
- You need to preserve the exact visual layout
- You want to extract only specific cells
- The table structure varies significantly
Alternative approach:
For extremely complex tables, consider saving the raw HTML and inspecting it visually, or extracting only the specific cells you need instead of reconstructing the entire table structure.