Hi all,
I'm working on a process in Databricks to convert multiple Excel files to CSV format. These Excel files follow a similar structure but with some variations. Here's the situation:
Since I'm not using the Pandas read_excel() function, I extract the table content using BeautifulSoup from HTML-formatted .xls files, and then build a DataFrame using Pandas.
Here's a simplified version of my current code:
def extract_table_from_xml(file_path):
try:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
content = file.read()
soup = BeautifulSoup(content, 'html.parser')
tables = soup.find_all('table')
if not tables:
return None
table = tables[0]
# Extract rows
rows = []
for row in table.find_all('tr'):
row_data = []
for cell in row.find_all(['td', 'th']):
cell_text = cell.get_text(strip=True)
row_data.append(cell_text)
if row_data:
rows.append(row_data)
# First two rows are header rows
merged_header = rows[0]
sub_header = rows[1]
# Now need to generate a repeated merged header row
# based on number of subcolumns under each merged header
# ...
# After generating correct headers, continue building DataFrame
# data_rows = rows[2:] ...
```
My Requirement:
When saving the data to CSV, I want to repeat the merged (main) headings in the first row so that each subheading column has a clear parent. Since the columns are not merged in CSV, the first row must contain the main heading repeated across the subheadings it belongs to.
For example, if the Excel structure is like this:
Heading 1 | Heading 2 | Heading 3 |
Sub Heading1 | Sub Heading 2 | Sub Heading 3 | Sub Heading 4 | Sub Heading 5 |
| | | | |
Then the desired CSV header should look like:
Heading 1 | Heading 1 | Heading 2 | Heading 3 | Heading 3 |
Sub Heading 1 | Sub Heading 2 | Sub Heading 3 | Sub Heading 4 | Sub Heading 5 |
| | | | |
My Question:
- How can I programmatically generate this repeated first header row based on the actual structure in the Excel file?
- Considering that:
- The merged cells are read as single values with trailing empty strings.
- The number of subheadings can change per file.
I’d appreciate any guidance or best practices on how to design this logic in Python so I can use the result to build the correct column names before saving the file to CSV.