<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: (Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Editi in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/135656#M740</link>
    <description>&lt;P&gt;Thanks for this,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;.&amp;nbsp; Hugely beneficial.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Oct 2025 09:43:56 GMT</pubDate>
    <dc:creator>SHIFTY</dc:creator>
    <dc:date>2025-10-22T09:43:56Z</dc:date>
    <item>
      <title>(Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Edition</title>
      <link>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133192#M698</link>
      <description>&lt;P&gt;&lt;FONT size="7"&gt;&lt;STRONG&gt;&lt;U&gt;Episode 2: Reading Excel Files&lt;BR /&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;EM&gt;Learning Databricks one brick at a time, using the Free Edition.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;You can download the accompanying Notebook and Excel files used in the demonstration over on my GitHub:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;U&gt;Excel Files&amp;nbsp;&lt;/U&gt; &amp;amp; &lt;U&gt;Notebook&lt;/U&gt;:&amp;nbsp;&lt;BR /&gt;&lt;A href="https://github.com/BSanalyst/Databricks-Community/tree/b2ab4de941395029192db9df6aa3d80fb893ec60/Community%20Blogs/Episode%202%20-%20Reading%20Excel%20Files" target="_blank" rel="noopener"&gt;https://github.com/BSanalyst/Databricks-Community/tree/b2ab4de941395029192db9df6aa3d80fb893ec60/Community%20Blogs/Episode%202%20-%20Reading%20Excel%20Files&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Project Intro&lt;BR /&gt;&lt;/STRONG&gt;&lt;/FONT&gt;Welcome to everyone reading. My name’s Ben, a.k.a&amp;nbsp;&lt;STRONG&gt;BS_THE_ANALYST&lt;/STRONG&gt;, and I’m going to share my experiences as I learn the world of Databricks. My objective is to master Data Engineering on Databricks, weave in AI &amp;amp; ML, and perform analyses.&lt;/P&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Today’s Challenge&lt;BR /&gt;&lt;/STRONG&gt;&lt;/FONT&gt;Picture this: &lt;U&gt;I'm facing the following challenges&lt;/U&gt; when reading excel files into Databricks&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Ingesting Multiple Excel Files&lt;/LI&gt;&lt;LI&gt;Ingesting Multiple Sheets on an Excel File&lt;/LI&gt;&lt;LI&gt;Ingesting Multiple Files with Multiple Sheets. 
&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Let's face it, we'll always need a solution to hand to deal with these pesky Excel reports that get generated!&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="BS_THE_ANALYST_0-1759097401864.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20275i9B5E3D1AF4F710C5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_0-1759097401864.png" alt="BS_THE_ANALYST_0-1759097401864.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;Solution&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;For this solution, we'll leverage Python and the Pandas library.&lt;/P&gt;&lt;P&gt;Excel sheets are capped at 1,048,576 rows each (roughly 1 million), so pandas, which works in memory on a single machine, will suffice. However, pandas can also scale to work in a distributed environment if you leverage the Pandas API on Spark. Read more about that here:&amp;nbsp;&lt;A href="https://spark.apache.org/pandas-on-spark/" target="_blank" rel="noopener"&gt;https://spark.apache.org/pandas-on-spark/&lt;/A&gt;&amp;nbsp;- the benchmarks are interesting &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;&lt;U&gt;Notebook setup&lt;BR /&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/FONT&gt;In the picture below, you'll spot:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I need to install and import libraries into my notebook&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;%pip install openpyxl
import pandas as pd 
import openpyxl&lt;/LI-CODE&gt;&lt;UL&gt;&lt;LI&gt;In the Notebook I reference a&amp;nbsp;&lt;STRONG&gt;Volume&lt;/STRONG&gt; in Unity Catalog where my Excel Files have been uploaded to. Feel free to change this for your needs.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="BS_THE_ANALYST_1-1759097401869.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20274iE9F35EF719125556/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_1-1759097401869.png" alt="BS_THE_ANALYST_1-1759097401869.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;&lt;U&gt;Ingesting A Single Excel File with Multiple&lt;/U&gt;&lt;/STRONG&gt;&lt;U&gt; &lt;STRONG&gt;Sheets&lt;/STRONG&gt;&lt;BR /&gt;&lt;/U&gt;&lt;/FONT&gt;Now comes the fun part. In the picture below, you'll see three numbers: 1,2,3. Only two of them contribute to reading ALL sheets from an Excel file into a single Dataframe. &lt;EM&gt;Pandas is amazing for this&lt;/EM&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Read Excel file into a dataframe 
dfs = pd.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx", sheet_name=None)&lt;/LI-CODE&gt;&lt;P&gt;(1) The important parameter used in the&amp;nbsp;&lt;STRONG&gt;.read_excel&lt;/STRONG&gt; method is&lt;STRONG&gt; sheet_name=None&lt;/STRONG&gt;. If you supply this, it'll read all sheets in. In a similar fashion, you could supply one or more sheets if you don't want to read them all in. Just supply a list ["Sheet1_here","Sheet2_here"]. If you leave this parameter out, the default (sheet_name=0) reads just the first sheet. Worth checking out the docs for the other options - there's your homework! &lt;span class="lia-unicode-emoji" title=":smirking_face:"&gt;😏&lt;/span&gt;.&lt;BR /&gt;&lt;BR /&gt;(2) You can see "&lt;STRONG&gt;dfs.keys()&lt;/STRONG&gt;". This shows all the sheets that have been read in. I always like to check &lt;span class="lia-unicode-emoji" title=":nerd_face:"&gt;🤓&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;(3) To stack all the sheets on top of one another, it's as simple as wrapping&amp;nbsp;&lt;STRONG&gt;dfs.values()&lt;/STRONG&gt;, i.e. all the different sheet DataFrames, in&amp;nbsp;&lt;STRONG&gt;pd.concat()&lt;/STRONG&gt;, which lets us stack DataFrames vertically (or horizontally).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="BS_THE_ANALYST_2-1759097401874.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20273iAA5F207254F8DAFF/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_2-1759097401874.png" alt="BS_THE_ANALYST_2-1759097401874.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;&lt;U&gt;Ingesting Multiple Excel Files with Multiple&lt;/U&gt;&lt;/STRONG&gt;&lt;U&gt; &lt;STRONG&gt;Sheets&lt;/STRONG&gt;&lt;BR /&gt;&lt;/U&gt;&lt;/FONT&gt;Ooft, the code in the picture below is very concise.
I originally started by creating a For Loop to build on the previous step. However, I thought I'd add a little flair for this section. List Comprehension is a thing of beauty, especially for the Pythonistas out there &lt;span class="lia-unicode-emoji" title=":snake:"&gt;🐍&lt;/span&gt;. The code below will read in multiple files and even all the sheets in these files! It's very compact. Adjust it as you see fit &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="BS_THE_ANALYST_3-1759097401882.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20276i7EDEC15929344460/image-size/large?v=v2&amp;amp;px=999" role="button" title="BS_THE_ANALYST_3-1759097401882.png" alt="BS_THE_ANALYST_3-1759097401882.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I encourage you to get your hands dirty and step through the loops if you don't feel you fully understand it. It's a really fun and rewarding exercise &lt;span class="lia-unicode-emoji" title=":raising_hands:"&gt;🙌&lt;/span&gt;. Below, I'll outline what the logic achieves:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;directory = dbutils.fs.ls(...) → stores a &lt;STRONG&gt;list of file objects&lt;/STRONG&gt; from the specified Databricks folder in the variable directory. This is where the Excel files live.&lt;/LI&gt;&lt;LI&gt;The list comprehension loops over:&lt;/LI&gt;&lt;UL&gt;&lt;LI&gt;Each file in directory.&lt;/LI&gt;&lt;LI&gt;Each (sheet, data) pair in file (all sheets of that Excel file).&lt;/LI&gt;&lt;LI&gt;Reads the Excel file using pd.read_excel, stripping the "dbfs:" prefix to make it a local path for pandas.
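&lt;BR /&gt;A quick illustration of that cleanup (the path below is made up for demonstration):

```python
# dbutils.fs.ls returns scheme-prefixed paths such as "dbfs:/Volumes/...";
# pandas expects the plain "/Volumes/..." form, so drop the scheme.
raw_path = "dbfs:/Volumes/workspace/some_folder/report.xlsx"  # illustrative
local_path = raw_path.replace("dbfs:", "")
# local_path is now "/Volumes/workspace/some_folder/report.xlsx"
```

&lt;BR /&gt;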
Note, if you don't strip out "dbfs:"&amp;nbsp;then you'll encounter an error which prompts you to remove it &lt;span class="lia-unicode-emoji" title=":grinning_squinting_face:"&gt;😆&lt;/span&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;LI&gt;Collects all the individual sheet DataFrames into a &lt;STRONG&gt;list&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;pd.concat([...]) → stacks all these DataFrames into &lt;STRONG&gt;one combined DataFrame&lt;/STRONG&gt;.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;FONT size="6"&gt;&lt;STRONG&gt;&lt;U&gt;Till next time&lt;BR /&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/FONT&gt;That’s all for now, folks! I can't wait for our next episode. We're currently wandering through ingestion ... but where do we stop? Should we delve into Databases? Should we hit APIs? Perhaps we'll bounce around a little! As long as we're having fun, we're moving forward, right? &lt;span class="lia-unicode-emoji" title=":raising_hands:"&gt;🙌&lt;/span&gt;.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Sun, 28 Sep 2025 22:27:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133192#M698</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-09-28T22:27:08Z</dc:date>
    </item>
    <item>
      <title>Re: (Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Editi</title>
      <link>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133198#M699</link>
      <description>&lt;P&gt;Well written&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;As we know, Spark is a general-purpose in-memory compute engine that enables distributed processing over distributed storage.&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, with the default partition size of 128 MB, a 1 GB file is split into 8 blocks of 128 MB each, which lets us process large amounts of data like the above in parallel.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 29 Sep 2025 00:11:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133198#M699</guid>
      <dc:creator>Khaja_Zaffer</dc:creator>
      <dc:date>2025-09-29T00:11:23Z</dc:date>
    </item>
    <item>
      <title>Re: (Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Editi</title>
      <link>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133261#M700</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp;that final snippet of code looks very clean! I saw the "sheet_name=None" part and was a bit confused why you'd written that, as I assumed that was just the default. Turns out the default is "sheet_name=0", which simply reads the first sheet. Definitely learnt a lot from this post, thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 29 Sep 2025 13:32:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/133261#M700</guid>
      <dc:creator>Pilsner</dc:creator>
      <dc:date>2025-09-29T13:32:21Z</dc:date>
    </item>
    <item>
      <title>Re: (Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Editi</title>
      <link>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/135656#M740</link>
      <description>&lt;P&gt;Thanks for this,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;.&amp;nbsp; Hugely beneficial.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Oct 2025 09:43:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/episode-2-reading-excel-files-learning-databricks-one-brick-at-a/m-p/135656#M740</guid>
      <dc:creator>SHIFTY</dc:creator>
      <dc:date>2025-10-22T09:43:56Z</dc:date>
    </item>
  </channel>
</rss>

