How do I scrape HTML tables with rowspan and colspan?

Tables with rowspan and colspan require special handling to extract correctly.

The challenge:

Rowspan and colspan create cells that span multiple rows or columns:

  • <td rowspan="2"> spans two rows
  • <td colspan="3"> spans three columns
  • Rows covered by a span contain fewer <td>/<th> elements than the table has columns, so a naive row-by-row read produces misaligned data
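
To make the "missing" cells concrete, here is a minimal sketch that counts the cell tags literally present in each row. It uses only the standard library's html.parser; CellCounter, raw_cell_counts and SAMPLE_HTML are illustrative names, not part of any library. The sample table is three columns wide, yet the first two rows contain only two cell tags each.

```python
from html.parser import HTMLParser

# Hypothetical sample: a 3-column table whose header uses both span types
SAMPLE_HTML = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>
"""


class CellCounter(HTMLParser):
    """Count the <td>/<th> tags that literally appear in each <tr>."""

    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Start a new counter for this row
            self.counts.append(0)
        elif tag in ("td", "th") and self.counts:
            self.counts[-1] += 1


def raw_cell_counts(html_text):
    """Return the number of cell tags found in each row, in document order."""
    parser = CellCounter()
    parser.feed(html_text)
    return parser.counts


print(raw_cell_counts(SAMPLE_HTML))  # [2, 2, 3]
```

Any extractor that treats each row's cells as columns 0, 1, 2... will put "Math" under "Name" here, which is the misalignment the approaches below are meant to avoid.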

Python solution with pandas:

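pandas.read_html expands rowspan and colspan for you, copying the spanned value into every cell it covers. It needs an HTML parser installed (lxml, or BeautifulSoup with html5lib):

from io import StringIO

import pandas as pd

# Wrap literal HTML in StringIO; newer pandas versions deprecate passing a raw HTML string.
# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]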

Manual handling with BeautifulSoup:

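from bs4 import BeautifulSoup

def extract_table_with_spans(table):
    # Build a rectangular grid: spanned values are copied into every cell they cover.
    # Assumes no nested tables and numeric rowspan/colspan attributes.
    matrix = []
    pending = {}  # column index -> (rows still to fill, text) from earlier rowspans

    for tr in table.find_all('tr'):
        row_data = []
        cells = tr.find_all(['td', 'th'], recursive=False)
        i = 0

        while i < len(cells) or len(row_data) in pending:
            col = len(row_data)

            if col in pending:
                # Column is occupied by a rowspan that started in an earlier row
                rows_left, text = pending.pop(col)
                row_data.append(text)
                if rows_left > 1:
                    pending[col] = (rows_left - 1, text)
                continue

            cell = cells[i]
            i += 1
            text = cell.get_text(' ', strip=True)
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))

            # Repeat the value across the columns it spans
            for _ in range(colspan):
                if rowspan > 1:
                    # Reserve this column in the rows below
                    pending[len(row_data)] = (rowspan - 1, text)
                row_data.append(text)

        matrix.append(row_data)

    return matrix

# Usage:
# soup = BeautifulSoup(html, 'html.parser')
# rows = extract_table_with_spans(soup.find('table'))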

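Parsing with lxml and XPath:

lxml parses quickly and lets you select rows and cells with XPath, but it does not expand spans on its own. The cells it returns still need the same span handling as the BeautifulSoup version.

from lxml import html

tree = html.fromstring(html_content)
tables = tree.xpath('//table')

# Collect the raw cell text and span sizes of each row; spans are NOT expanded here
for table in tables:
    for row in table.xpath('.//tr'):
        cells = row.xpath('./td | ./th')
        values = [cell.text_content().strip() for cell in cells]
        spans = [(int(cell.get('rowspan', 1)), int(cell.get('colspan', 1))) for cell in cells]
        # Feed values and spans into your grid-filling logic...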

Challenges with complex tables:

  • Nested tables within cells
  • Irregular structures (different column counts per row)
  • Headers that span multiple levels
  • Tables used for layout, not data
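
Headers that span multiple levels usually show up as two or more header rows once the spans have been expanded. A minimal sketch of one way to collapse them into a single row, assuming the header rows are already equal-length lists of strings (flatten_headers, its sep parameter and the sample labels are illustrative, not part of any library):

```python
def flatten_headers(header_rows, sep=" / "):
    """Join stacked header labels column by column into one header row.

    Empty labels and labels already seen in the same column are skipped, so a
    cell that was copied downward by a rowspan does not appear twice.
    """
    flat = []
    for column in zip(*header_rows):
        parts = []
        for label in column:
            if label and label not in parts:
                parts.append(label)
        flat.append(sep.join(parts))
    return flat


# Two header rows as they look after rowspan/colspan expansion
header_rows = [
    ["Name", "Score", "Score"],
    ["Name", "Math", "Art"],
]
print(flatten_headers(header_rows))  # ['Name', 'Score / Math', 'Score / Art']
```

If you load the table with pandas instead, multi-level headers typically arrive as a column MultiIndex, and the same joining idea can be applied to its tuples.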

Best practices:

  1. Try pandas first - it handles most cases
  2. If pandas fails, inspect the table structure manually
  3. Consider whether you need the whole table or only a few key cells
  4. For very complex tables, target specific cells with XPath/CSS selectors
  5. Test extraction against multiple pages to ensure consistency
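
For the last point, a small sanity check can be run on every extracted table before the data is used. This is only a sketch: it assumes the table has already been extracted into a list of rows (lists of strings), and check_grid and expected_header are hypothetical names chosen for illustration.

```python
def check_grid(matrix, expected_header=None):
    """Return a list of human-readable problems found in an extracted grid."""
    if not matrix:
        return ["table is empty"]

    problems = []
    width = len(matrix[0])

    # After span expansion every row should have the same number of cells
    for i, row in enumerate(matrix):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} cells, expected {width}")

    # Optionally confirm that the page still has the header you rely on
    if expected_header is not None and list(matrix[0]) != list(expected_header):
        problems.append("header row does not match the expected header")

    return problems


good = [["Name", "Math"], ["Ann", "90"]]
bad = [["Name", "Math"], ["Ann"]]
print(check_grid(good, expected_header=["Name", "Math"]))  # []
print(check_grid(bad))  # ['row 1 has 1 cells, expected 2']
```

Running a check like this across several pages makes it easier to spot the pages where the table structure differs from the one you developed against.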

When manual extraction is needed:

  • The table has complex nested structures
  • You need to preserve the exact visual layout
  • You want to extract only specific cells
  • The table structure varies significantly between pages

Alternative approach:

For extremely complex tables, consider extracting the raw table HTML and then either inspecting the rendered table by eye or pulling out only the specific cells you need, rather than reconstructing the entire table structure.
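
As one illustration of pulling out a single cell without rebuilding the grid, the sketch below returns the text of the cell that follows a known label. It uses only the standard library and assumes a flat table with no nested cells; CellTextParser, value_after_label and the sample HTML are hypothetical names and data used for illustration.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._in_cell:
            self.cells[-1] += data


def value_after_label(html_text, label):
    """Return the text of the cell right after the cell equal to `label`, or None."""
    parser = CellTextParser()
    parser.feed(html_text)
    cells = [text.strip() for text in parser.cells]
    for i, text in enumerate(cells[:-1]):
        if text == label:
            return cells[i + 1]
    return None


sample = (
    '<table><tr><td colspan="2">Summary</td></tr>'
    "<tr><td>Total</td><td>42</td></tr></table>"
)
print(value_after_label(sample, "Total"))  # 42
```

Because the lookup is keyed on the label text rather than on a row and column position, spans elsewhere in the table do not affect it.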
