Databricks Community

MinuN · ‎07-03-2025

Hi all,

I'm working on a process in Databricks to convert multiple Excel files to CSV format. These Excel files follow a similar structure but with some variations. Here's the situation:

Each file contains two header rows:
- The first row contains merged cells representing main categories.
- The second row contains subheadings relevant to the main categories.
The number and names of subheadings under each main heading can vary between files.

Since I'm not using the Pandas read_excel() function, I extract the table content using BeautifulSoup from HTML-formatted .xls files, and then build a DataFrame using Pandas.

Here's a simplified version of my current code:

def extract_table_from_xml(file_path):
try:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
content = file.read()

soup = BeautifulSoup(content, 'html.parser')
tables = soup.find_all('table')

if not tables:
return None

table = tables[0]

# Extract rows
rows = []
for row in table.find_all('tr'):
row_data = []
for cell in row.find_all(['td', 'th']):
cell_text = cell.get_text(strip=True)
row_data.append(cell_text)
if row_data:
rows.append(row_data)

# First two rows are header rows
merged_header = rows[0]
sub_header = rows[1]

# Now need to generate a repeated merged header row
# based on number of subcolumns under each merged header
# ...

# After generating correct headers, continue building DataFrame
# data_rows = rows[2:] ...
```

My Requirement:

When saving the data to CSV, I want to repeat the merged (main) headings in the first row so that each subheading column has a clear parent. Since the columns are not merged in CSV, the first row must contain the main heading repeated across the subheadings it belongs to.

For example, if the Excel structure is like this:

Heading 1	Heading 2	Heading 3
Sub Heading1	Sub Heading 2	Sub Heading 3	Sub Heading 4	Sub Heading 5

Then the desired CSV header should look like:

Heading 1	Heading 1	Heading 2	Heading 3	Heading 3
Sub Heading 1	Sub Heading 2	Sub Heading 3	Sub Heading 4	Sub Heading 5

My Question:

- How can I programmatically generate this repeated first header row based on the actual structure in the Excel file?
- Considering that:
- The merged cells are read as single values with trailing empty strings.
- The number of subheadings can change per file.

I’d appreciate any guidance or best practices on how to design this logic in Python so I can use the result to build the correct column names before saving the file to CSV.

Brahmareddy · ‎07-13-2025

Hi MinuN,

How are you doing today?

That’s a great question, and you're definitely on the right path using BeautifulSoup to extract the table structure from .xls HTML-like files. To generate the repeated first row of main headings for the CSV, one practical approach is to loop through the merged_header row and count how many subheadings follow each non-empty cell. You can then repeat each non-empty heading for the number of subheadings under it. For example, as you iterate through merged_header, whenever you find a non-empty value (like "Heading 1"), you start counting how many consecutive empty strings follow it (which represent merged cells in Excel). Then, you repeat that heading name for each of those subheadings in the final list. Combine this with the sub_header row to create a two-row header (or a flattened list if needed). This way, when you create the DataFrame, you'll have a clear and consistent set of column names like ["Heading 1 - Sub Heading 1", ..., "Heading 3 - Sub Heading 5"]. This logic is flexible enough to handle variations across files. If needed, I can also help you script out that part to plug into your existing function.

Regards,

Brahma