(Episode 2: Reading Excel Files) - Learning Databricks one brick at a time, using the Free Edition

BS_THE_ANALYST
Esteemed Contributor III

Episode 2: Reading Excel Files
Learning Databricks one brick at a time, using the Free Edition.

You can download the accompanying Notebook and Excel files used in the demonstration over on my GitHub:

Project Intro
Welcome to everyone reading. My name’s Ben, a.k.a BS_THE_ANALYST, and I’m going to share my experiences as I learn the world of Databricks. My objective is to master Data Engineering on Databricks, weave in AI & ML, and perform analyses.

Today’s Challenge
Picture this: I'm facing the following challenges when reading Excel files into Databricks:

  1. Ingesting Multiple Excel Files
  2. Ingesting Multiple Sheets on an Excel File
  3. Ingesting Multiple Files with Multiple Sheets. 

Let's face it, we'll always need a solution to hand to deal with these pesky Excel reports that get generated!


Solution
For this solution, we'll leverage Python and the Pandas library.

Excel files have a limit of roughly 1 million rows per sheet, so pandas, with its single-machine, in-memory way of working, will suffice. However, pandas can also scale to a distributed environment if you leverage the Pandas API on Spark. Read more about that here: https://spark.apache.org/pandas-on-spark/ - the benchmarks are interesting 😊.
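(Not needed for today's challenge, but if you're curious, a minimal sketch of that route might look like the lines below - treat it as an untested teaser, with the file path borrowed from later in this post.)

import pyspark.pandas as ps  # the Pandas API on Spark

# Same pandas-style call, but execution is distributed across the cluster
psdf = ps.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx")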

Notebook setup
In the picture below you'll spot:

  • I need to install and import libraries into my notebook:
%pip install openpyxl
# openpyxl is the engine pandas uses to read .xlsx files
import pandas as pd
import openpyxl
  • In the Notebook, I reference a Volume in Unity Catalog where my Excel files have been uploaded. Feel free to change this path to suit your needs (there's a quick sanity-check sketch just below the screenshot).

[Image: notebook setup - installing/importing libraries and referencing the Unity Catalog Volume]
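Before reading anything in, you can sanity-check the Volume with a quick listing. A minimal sketch (the folder path is the one used later in this post - swap in your own):

# List the files sitting in the Unity Catalog Volume
for f in dbutils.fs.ls("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/"):
    print(f.name)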

 

Ingesting A Single Excel File with Multiple Sheets
Now comes the fun part. In the picture below, you'll see three numbers: 1, 2, 3. Only two of them contribute to reading ALL sheets from an Excel file into a single DataFrame. Pandas is amazing for this.

# Read every sheet of the Excel file into a dict of DataFrames, keyed by sheet name
dfs = pd.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx", sheet_name=None)

(1) The important parameter in the .read_excel method is sheet_name=None. If you supply this, it reads every sheet in. In a similar fashion, you can supply one or more sheets if you don't want to read them all in - just pass a list, e.g. ["Sheet1_here","Sheet2_here"]. If you leave this parameter out, I think it may just read the first sheet in. Worth checking out the docs and finding out for yourselves - there's your homework! 😏.

(2) You can see "dfs.keys()". This shows all the sheets that have been read in. I always like to check 🤓

(3) To stack all the sheets on top of one another, it's as simple as wrapping dfs.values() (i.e. all the different sheet DataFrames) in a pd.concat() call, which lets us stack DataFrames vertically (or horizontally).

[Image: annotated screenshot (1-3) - reading all sheets with pd.read_excel(sheet_name=None), checking dfs.keys(), and stacking with pd.concat()]
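To make points (2) and (3) concrete, here's a minimal sketch you can run, continuing from the dfs dict created above (the ignore_index=True flag and the named-sheets variant are my additions):

# (2) Inspect which sheets were read in - the dict keys are the sheet names
print(dfs.keys())

# (3) Stack every sheet's DataFrame vertically into one combined DataFrame
df_all = pd.concat(dfs.values(), ignore_index=True)

# (1) variant: pass a list to sheet_name to read only specific sheets
# (the sheet names below are placeholders)
selected = pd.read_excel(
    "/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx",
    sheet_name=["Sheet1_here", "Sheet2_here"],
)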

 

Ingesting Multiple Excel Files *with Multiple Sheets*
Ooft, the code in the picture below is very concise. I originally started by creating a for loop to build on the previous step. However, I thought I'd add a little flair for this section. List comprehensions are a thing of beauty, especially for the Pythonistas out there 🐍. The code below will read in multiple files and even all the sheets in those files! It's very compact. Adjust it as you see fit 😊.

[Image: list comprehension that reads multiple Excel files, including all of their sheets, into one combined DataFrame]

I encourage you to get your hands dirty and step through the loops if you don't feel you can understand it. It's a really fun and rewarding exercise 🙌. Below, I'll outline what the logic achieves:

  1. directory = dbutils.fs.ls(...) → stores a list of file objects from the specified Databricks folder in the variable directory. This is where the Excel files live.
  2. The list comprehension loops over:
    • Each file in directory.
    • Each (sheet, data) pair for that file, i.e. all sheets of that Excel file.
    • Each file is read with pd.read_excel, stripping the "dbfs:" prefix so pandas sees a local path. Note, if you don't strip out "dbfs:" you'll encounter an error which prompts you to remove it 😆.
  3. Collects all the individual sheet DataFrames into a list.
  4. pd.concat([...]) → stacks all these DataFrames into one combined DataFrame.
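In case you'd rather copy-paste than retype from the screenshot, here's a rough text version of that list comprehension (the folder path and the .xlsx filter below are illustrative - double-check them against the image above):

# Rough sketch - adjust the Volume path for your own workspace
directory = dbutils.fs.ls("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/")

all_sheets = [
    data                                          # one DataFrame per sheet
    for file in directory                         # each file object in the folder
    if file.name.endswith(".xlsx")                # keep only Excel files (illustrative filter)
    for sheet, data in pd.read_excel(
        file.path.replace("dbfs:", ""),           # strip the dbfs: prefix for pandas
        sheet_name=None,                          # read every sheet as a dict
    ).items()
]

combined_df = pd.concat(all_sheets, ignore_index=True)  # stack everything into one DataFrame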

Till next time
That’s all for now, folks! I can't wait for our next episode. We're currently wandering through ingestion ... but where do we stop? Should we delve into databases? Should we hit APIs? Perhaps we'll bounce around a little! As long as we're having fun, we're moving forward, right? 🙌

All the best,
BS

3 REPLIES

Khaja_Zaffer
Contributor III

Well written @BS_THE_ANALYST 

As we know, Spark is a general-purpose, in-memory compute engine; it helps with distributed processing over distributed storage.

For example, if the default partition size is 128 MB, a 1 GB file will be split into roughly 8 blocks of 128 MB each, which helps us process large amounts of data like the above.
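(For reference, a quick sketch of how to check that setting, assuming the default configuration:)

# Max bytes Spark packs into a single input partition - defaults to 134217728 bytes (128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# So a ~1 GB file works out to roughly 1024 MB / 128 MB = 8 input partitions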

 

Pilsner
Contributor III

@BS_THE_ANALYST that final snippet of code looks very clean! I saw the sheet_name=None part and was a bit confused why you'd written that, as I assumed that was just the default. Turns out the default is sheet_name=0, which is simply the first sheet. Definitely learnt a lot from this post, thank you!
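(For example, a one-line sketch of that default in action, using the same file from the post:)

# With the default sheet_name=0, pandas returns a single DataFrame: the first sheet only
first_sheet = pd.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx")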

SHIFTY
Contributor II

Thanks for this, @BS_THE_ANALYST.  Hugely beneficial.