How do I scrape HTML tables with rowspan and colspan?

Tables with rowspan and colspan require special handling to extract correctly.

The challenge:

Rowspan and colspan create cells that span multiple rows or columns:

<td rowspan="2"> spans two rows
<td colspan="3"> spans three columns
The positions a span covers have no <td> of their own, so a row can contain fewer cell elements than the table has columns
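To make the "missing" cells concrete, the following minimal sketch counts the cell elements that physically appear in each row of a small table. It uses only the standard library's html.parser; the CellCounter class and the SAMPLE markup are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts the <td>/<th> elements that physically appear in each <tr>."""

    def __init__(self):
        super().__init__()
        self.counts = []  # one entry per <tr>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)
        elif tag in ("td", "th") and self.counts:
            self.counts[-1] += 1


# Hypothetical sample: a 3-column grid with a rowspan and a colspan in the header
SAMPLE = """
<table>
  <tr><th rowspan="2">Region</th><th colspan="2">Sales</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>North</td><td>10</td><td>20</td></tr>
</table>
"""

counter = CellCounter()
counter.feed(SAMPLE)
print(counter.counts)  # [2, 2, 3]
```

The grid is three columns wide, yet the first two rows each contain only two cell elements. A naive row-by-row scrape would therefore misalign the header columns.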

Python solution with pandas:

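pandas.read_html expands rowspan and colspan automatically, repeating the spanned cell's value in every position it covers. It needs a parser installed (lxml, or BeautifulSoup with html5lib):

from io import StringIO

import pandas as pd

# html is the page source as a string. Wrapping it in StringIO avoids the
# deprecation warning that pandas 2.1+ emits for literal HTML strings.
tables = pd.read_html(StringIO(html))
df = tables[0]  # read_html returns a list of DataFrames, one per <table>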

Manual handling with BeautifulSoup:

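from bs4 import BeautifulSoup

def extract_table_with_spans(table):
    """Expand rowspan/colspan so every row has one value per grid column.

    `table` is a BeautifulSoup <table> tag, e.g.
    BeautifulSoup(html, 'html.parser').find('table').
    """
    matrix = []
    # column index -> (text, rows still to fill) for rowspans opened above
    pending = {}

    # Note: find_all('tr') also returns rows of any nested tables
    for tr in table.find_all('tr'):
        row_data = []
        cells = tr.find_all(['td', 'th'], recursive=False)
        i = 0

        while i < len(cells) or len(row_data) in pending:
            col = len(row_data)

            # This column is occupied by a rowspan from an earlier row
            if col in pending:
                text, remaining = pending.pop(col)
                row_data.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                continue

            cell = cells[i]
            i += 1
            text = cell.text.strip()
            rowspan = int(cell.get('rowspan', 1))
            colspan = int(cell.get('colspan', 1))

            # Add cell value repeated for colspan
            for _ in range(colspan):
                if rowspan > 1:
                    # Mark the cells below this column as occupied
                    pending[len(row_data)] = (text, rowspan - 1)
                row_data.append(text)

        matrix.append(row_data)

    return matrix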

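Alternative parser - lxml:

lxml is fast and supports XPath, but like BeautifulSoup it does not expand rowspan/colspan for you. The same span tracking is still needed.

from lxml import html

# html_content is the page source as a string
tree = html.fromstring(html_content)
tables = tree.xpath('//table')

# Extract rows and cells using XPath
for table in tables:
    for row in table.xpath('.//tr'):
        cells = row.xpath('./td | ./th')
        texts = [cell.text_content().strip() for cell in cells]
        # cell.get('rowspan') and cell.get('colspan') hold the span attributes
        # Process texts...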

Challenges with complex tables:

Nested tables within cells
Irregular structures (different column counts per row)
Headers that span multiple levels
Tables used for layout, not data
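Two of these challenges, irregular row lengths and multi-level headers, can be smoothed over after extraction. The sketch below assumes the spans have already been expanded into a list of rows (by pandas or by manual span tracking). The helper names pad_rows and flatten_headers are illustrative, not from any library. Nested tables and layout tables are not addressed here.

```python
def pad_rows(rows, fill=""):
    """Right-pad every row with `fill` so all rows share the widest row's length."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


def flatten_headers(header_rows, sep=" "):
    """Merge stacked header rows into one label per column."""
    flat = []
    for column in zip(*pad_rows(header_rows)):
        parts = []
        for label in column:
            # Skip blanks and labels repeated downwards by an expanded rowspan
            if label and (not parts or parts[-1] != label):
                parts.append(label)
        flat.append(sep.join(parts))
    return flat


header_rows = [
    ["Region", "Sales", "Sales"],  # "Sales" repeated by colspan="2"
    ["Region", "Q1", "Q2"],        # "Region" repeated by rowspan="2"
]
print(flatten_headers(header_rows))        # ['Region', 'Sales Q1', 'Sales Q2']
print(pad_rows([["a", "b", "c"], ["d"]]))  # [['a', 'b', 'c'], ['d', '', '']]
```

Padding on the right is only a reasonable default. If short rows are missing cells on the left or in the middle, the padded values will land in the wrong columns, so check the source markup first.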

Best practices:

Try pandas first - it handles most cases
If pandas fails, inspect the table structure manually
Consider whether you really need all the data or just key cells
For very complex tables, target specific cells with XPath/CSS selectors
Test extraction against multiple pages to ensure consistency
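The last point can be partly automated. As a rough sketch, the hypothetical check_table helper below takes one extracted table as a list of rows and reports two common signs of a broken extraction: an unexpected header and rows of differing widths. Neither the function name nor the checks come from any library. Adapt them to what "consistent" means for your pages.

```python
def check_table(rows, expected_header):
    """Return a list of problems found; an empty list means no problem was detected."""
    if not rows:
        return ["table is empty"]

    problems = []
    if rows[0] != expected_header:
        problems.append(f"unexpected header: {rows[0]!r}")

    # After spans are expanded, every row should have the same number of columns
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(f"rows have differing widths: {sorted(widths)}")
    return problems


good = [["Region", "Q1", "Q2"], ["North", "10", "20"]]
bad = [["Region", "Q1", "Q2"], ["North", "10"]]

print(check_table(good, ["Region", "Q1", "Q2"]))  # []
print(check_table(bad, ["Region", "Q1", "Q2"]))   # ['rows have differing widths: [2, 3]']
```

Running a check like this over the output from several pages makes structural drift visible early, before misaligned values reach downstream code.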

When manual extraction is needed:

The table has complex nested structures
You need to preserve the exact visual layout
You want to extract only specific cells
The table structure varies significantly

Alternative approach:

For extremely complex tables, consider extracting the raw HTML and processing it visually or using the specific cells you need rather than the entire table structure.
