Episode 2: Reading Excel Files
Learning Databricks one brick at a time, using the Free Edition.
You can download the accompanying Notebook and Excel files used in the demonstration over on my GitHub.
Project Intro
Welcome to everyone reading. My name's Ben, a.k.a. BS_THE_ANALYST, and I'm going to share my experiences as I learn the world of Databricks. My objective is to master Data Engineering on Databricks, weave in AI & ML, and perform analyses.
Today's Challenge
Picture this: I'm facing the following challenges when reading Excel files into Databricks:
- Ingesting Multiple Excel Files
- Ingesting Multiple Sheets on an Excel File
- Ingesting Multiple Files with Multiple Sheets
Let's face it, we'll always need a solution to hand to deal with these pesky Excel reports that get generated!

Solution
For this solution, we'll leverage Python and the Pandas library.
Excel files are capped at 1,048,576 rows per sheet (roughly 1 million), so pandas, with its single-machine, in-memory way of working, will usually suffice. However, pandas can also scale to a distributed environment if you leverage the Pandas API on Spark. Read more about that here: https://spark.apache.org/pandas-on-spark/ - the benchmarks are interesting.
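As a quick taste, here's a minimal sketch of what that swap might look like - the import changes, but the call shape stays familiar. The file path below is a placeholder, so point it at your own Volume:

import pyspark.pandas as ps

# Same read_excel shape as pandas, but backed by Spark's distributed engine
# (the path is a placeholder - swap in your own Volume path)
psdf = ps.read_excel("/Volumes/<catalog>/<schema>/<volume>/example.xlsx")
print(psdf.head())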
Notebook setup
In the picture below you'll spot:
- I need to install and import libraries into my notebook (note the %pip magic below - that's how you install packages from inside a Databricks notebook cell):
%pip install openpyxl
import pandas as pd
import openpyxl
- In the Notebook I reference a Volume in Unity Catalog where my Excel files have been uploaded. Feel free to change this for your needs.
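If you've not used Volumes before, the path follows a /Volumes/<catalog>/<schema>/<volume>/... pattern. Here's the folder I'm working from in this episode (the variable name is just my own choice):

# Unity Catalog Volume paths follow /Volumes/<catalog>/<schema>/<volume>/...
excel_folder = "/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/"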

Ingesting A Single Excel File with Multiple Sheets
Now comes the fun part. In the picture below, you'll see three numbers: 1, 2, 3. Only two of them contribute to reading ALL sheets from an Excel file into a single DataFrame. Pandas is amazing for this.
# Read Excel file into a dataframe
dfs = pd.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx", sheet_name=None)
(1) The important parameter in the .read_excel method is sheet_name=None. Supply it and pandas reads every sheet in. In a similar fashion, you can read just one or more specific sheets by supplying a list, e.g. sheet_name=["Sheet1_here", "Sheet2_here"]. If you leave the parameter out entirely, pandas defaults to sheet_name=0 and reads only the first sheet. The docs cover a few more options - there's your homework!
(2) You can see "dfs.keys()". This lists all the sheets that have been read in (they're the keys of the dict pandas returns). I always like to check.
(3) To stack all the sheets on top of one another, it's as simple as wrapping dfs.values(), i.e. all the individual sheet DataFrames, in a pd.concat() call, which lets us stack DataFrames vertically (or horizontally). See the sketch below.
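Putting (1), (2) and (3) together, here's a minimal end-to-end sketch using the same file as above (ignore_index=True is my own addition - it just resets the row index after stacking):

# (1) sheet_name=None reads every sheet into a dict of {sheet_name: DataFrame}
dfs = pd.read_excel("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/all regions sales.xlsx", sheet_name=None)

# (2) Check which sheets were read in
print(dfs.keys())

# (3) Stack all the sheet DataFrames vertically into one
combined_df = pd.concat(dfs.values(), ignore_index=True)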

Ingesting Multiple Excel Files *with Multiple Sheets*
Ooft, the code in the picture below is very concise. I originally started by creating a for loop to build on the previous step. However, I thought I'd add a little flair for this section. A list comprehension is a thing of beauty, especially for the Pythonistas out there. The code below will read in multiple files and even all the sheets in those files! It's very compact. Adjust it as you see fit.

I encourage you to get your hands dirty and step through the loops if the logic doesn't click at first. It's a really fun and rewarding exercise. Below, I'll outline what the logic achieves:
- directory = dbutils.fs.ls(...) → stores a list of file objects from the specified Databricks folder in the variable directory. This is where the Excel files live.
- The list comprehension loops over:
- Each file in directory.
- Each (sheet, data) pair produced by reading that file with sheet_name=None (i.e. all sheets of that Excel file).
- Reads the Excel file using pd.read_excel, stripping the "dbfs:" prefix to make it a local path for pandas. Note, if you don't strip out "dbfs:" then you'll encounter an error which prompts you to remove it.
- Collects all the individual sheet DataFrames into a list.
- pd.concat([...]) → stacks all these DataFrames into one combined DataFrame.
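Since the code itself lives in the picture (and in the notebook on GitHub), here's a sketch of the logic those bullets describe - treat the variable names as my own choices rather than the exact notebook code:

# List the file objects sitting in the Volume folder
directory = dbutils.fs.ls("/Volumes/workspace/databricks_episodes/data_dump/multiple_excel_sheets_load/")

# For every file, read every sheet, collecting each sheet's DataFrame into a list
all_sheets = [
    data
    for file in directory
    for sheet, data in pd.read_excel(file.path.replace("dbfs:", ""), sheet_name=None).items()
]

# Stack everything into one combined DataFrame
combined_df = pd.concat(all_sheets, ignore_index=True)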
Till next time
That's all for now, folks! I can't wait for our next episode. We're currently wandering through ingestion ... but where do we stop? Should we delve into Databases? Should we hit APIs? Perhaps we'll bounce around a little! As long as we're having fun, we're moving forward, right?
All the best,
BS