<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Handling Merged Heading Rows When Converting Excel to CSV in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handling-merged-heading-rows-when-converting-excel-to-csv-in/m-p/123960#M47094</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I'm working on a process in Databricks to convert multiple Excel files to CSV format. These Excel files follow a similar structure but with some variations. Here's the situation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Each file contains &lt;STRONG&gt;two header rows&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The &lt;STRONG&gt;first row&lt;/STRONG&gt; contains merged cells representing &lt;STRONG&gt;main categories&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The &lt;STRONG&gt;second row&lt;/STRONG&gt; contains &lt;STRONG&gt;subheadings&lt;/STRONG&gt; relevant to the main categories.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The number and names of subheadings under each main heading can vary between files.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Since I'm not using the Pandas read_excel() function, I extract the table content using &lt;STRONG&gt;BeautifulSoup&lt;/STRONG&gt; from HTML-formatted .xls files, and then build a DataFrame using Pandas.&lt;/P&gt;&lt;P&gt;Here's a simplified version of my current code:&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;def extract_table_from_xml(file_path):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;try:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;content = file.read()&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;soup = BeautifulSoup(content, 'html.parser')&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;tables = soup.find_all('table')&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;if not tables:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;return None&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;table = tables[0]&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# Extract rows&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;rows = []&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;for row in table.find_all('tr'):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;row_data = []&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;for cell in row.find_all(['td', 'th']):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;cell_text = cell.get_text(strip=True)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;row_data.append(cell_text)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;if row_data:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;rows.append(row_data)&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# First two rows are header rows&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;merged_header = rows[0]&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;sub_header = rows[1]&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# Now need to generate a repeated merged header row&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# based on number of subcolumns under each merged header&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# ...&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# After generating correct headers, continue building DataFrame&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# data_rows = rows[2:] ...&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;```&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;My Requirement:&lt;/P&gt;&lt;P&gt;When saving the data to CSV, I want to &lt;STRONG&gt;repeat the merged (main) headings&lt;/STRONG&gt; in the first row so that each subheading column has a clear parent. Since the columns are not merged in CSV, the first row must contain the &lt;STRONG&gt;main heading repeated across the subheadings it belongs to&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;For example, if the Excel structure is like this:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Heading 2&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;Heading 3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading1&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 2&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 3&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 4&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading&amp;nbsp;5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Then the desired CSV header should look like:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 2&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 3&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;Sub Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 2&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 3&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 4&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;My Question:&lt;/P&gt;&lt;P&gt;- How can I programmatically generate this&lt;STRONG&gt; repeated first header row&lt;/STRONG&gt; based on the actual structure in the Excel file?&lt;BR /&gt;- Considering that:&lt;BR /&gt;- The merged cells are read as single values with trailing empty strings.&lt;BR /&gt;- The number of subheadings can change per file.&lt;/P&gt;&lt;P&gt;I’d appreciate any guidance or best practices on how to design this logic in Python so I can use the result to build the correct column names before saving the file to CSV.&lt;/P&gt;</description>
    <pubDate>Fri, 04 Jul 2025 03:58:49 GMT</pubDate>
    <dc:creator>MinuN</dc:creator>
    <dc:date>2025-07-04T03:58:49Z</dc:date>
    <item>
      <title>Handling Merged Heading Rows When Converting Excel to CSV in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-merged-heading-rows-when-converting-excel-to-csv-in/m-p/123960#M47094</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I'm working on a process in Databricks to convert multiple Excel files to CSV format. These Excel files follow a similar structure but with some variations. Here's the situation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Each file contains &lt;STRONG&gt;two header rows&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;The &lt;STRONG&gt;first row&lt;/STRONG&gt; contains merged cells representing &lt;STRONG&gt;main categories&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The &lt;STRONG&gt;second row&lt;/STRONG&gt; contains &lt;STRONG&gt;subheadings&lt;/STRONG&gt; relevant to the main categories.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The number and names of subheadings under each main heading can vary between files.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Since I'm not using the Pandas read_excel() function, I extract the table content using &lt;STRONG&gt;BeautifulSoup&lt;/STRONG&gt; from HTML-formatted .xls files, and then build a DataFrame using Pandas.&lt;/P&gt;&lt;P&gt;Here's a simplified version of my current code:&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;def extract_table_from_xml(file_path):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;try:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;content = file.read()&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;soup = BeautifulSoup(content, 'html.parser')&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;tables = soup.find_all('table')&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;if not tables:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;return None&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;table = tables[0]&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# Extract rows&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;rows = []&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;for row in table.find_all('tr'):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;row_data = []&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;for cell in row.find_all(['td', 'th']):&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;cell_text = cell.get_text(strip=True)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;row_data.append(cell_text)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;if row_data:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;rows.append(row_data)&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# First two rows are header rows&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;merged_header = rows[0]&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;sub_header = rows[1]&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# Now need to generate a repeated merged header row&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# based on number of subcolumns under each merged header&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# ...&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#333399"&gt;# After generating correct headers, continue building DataFrame&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;# data_rows = rows[2:] ...&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT color="#333399"&gt;```&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;My Requirement:&lt;/P&gt;&lt;P&gt;When saving the data to CSV, I want to &lt;STRONG&gt;repeat the merged (main) headings&lt;/STRONG&gt; in the first row so that each subheading column has a clear parent. Since the columns are not merged in CSV, the first row must contain the &lt;STRONG&gt;main heading repeated across the subheadings it belongs to&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;For example, if the Excel structure is like this:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Heading 2&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;Heading 3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading1&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 2&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 3&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading 4&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;Sub Heading&amp;nbsp;5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%" height="30px"&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Then the desired CSV header should look like:&lt;/P&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="20%"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 2&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 3&lt;/TD&gt;&lt;TD width="20%"&gt;Heading 3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;Sub Heading 1&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 2&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 3&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 4&lt;/TD&gt;&lt;TD width="20%"&gt;Sub Heading 5&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="20%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;My Question:&lt;/P&gt;&lt;P&gt;- How can I programmatically generate this&lt;STRONG&gt; repeated first header row&lt;/STRONG&gt; based on the actual structure in the Excel file?&lt;BR /&gt;- Considering that:&lt;BR /&gt;- The merged cells are read as single values with trailing empty strings.&lt;BR /&gt;- The number of subheadings can change per file.&lt;/P&gt;&lt;P&gt;I’d appreciate any guidance or best practices on how to design this logic in Python so I can use the result to build the correct column names before saving the file to CSV.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Jul 2025 03:58:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-merged-heading-rows-when-converting-excel-to-csv-in/m-p/123960#M47094</guid>
      <dc:creator>MinuN</dc:creator>
      <dc:date>2025-07-04T03:58:49Z</dc:date>
    </item>
    <item>
      <title>Re: Handling Merged Heading Rows When Converting Excel to CSV in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-merged-heading-rows-when-converting-excel-to-csv-in/m-p/125065#M47331</link>
      <description>&lt;P&gt;Hi&amp;nbsp;MinuN,&lt;/P&gt;&lt;P&gt;How are you doing today?&lt;/P&gt;&lt;P&gt;That’s a great question, and you're definitely on the right path using BeautifulSoup to extract the table structure from .xls HTML-like files. To generate the repeated first row of main headings for the CSV, one practical approach is to loop through the merged_header row and count how many subheadings follow each non-empty cell. You can then repeat each non-empty heading for the number of subheadings under it. For example, as you iterate through merged_header, whenever you find a non-empty value (like "Heading 1"), you start counting how many consecutive empty strings follow it (which represent merged cells in Excel). Then, you repeat that heading name for each of those subheadings in the final list. Combine this with the sub_header row to create a two-row header (or a flattened list if needed). This way, when you create the DataFrame, you'll have a clear and consistent set of column names like ["Heading 1 - Sub Heading 1", ..., "Heading 3 - Sub Heading 5"]. This logic is flexible enough to handle variations across files. If needed, I can also help you script out that part to plug into your existing function.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Sun, 13 Jul 2025 18:07:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-merged-heading-rows-when-converting-excel-to-csv-in/m-p/125065#M47331</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2025-07-13T18:07:39Z</dc:date>
    </item>
  </channel>
</rss>

