Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to read excel file from Volume

kmodelew
New Contributor III

Hi, I'm trying to read an Excel file directly from a Volume (not the workspace or FileStore); all the examples on the internet use the workspace or FileStore. The Volume is an external location, so I can read from there, but I would like to read directly from the Volume. I have confidential data, which is the reason for reading it from a Volume with secure access.

I'm using com.crealytics:spark-excel_2.13:3.5.1_0.20.4 on a 16.4 LTS cluster with Scala 2.13.

Code: 

location = 'Volumes/directory1/file.xlsx'
df = (spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(location)
)

Error: 

 Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
 

 

10 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @kmodelew ,

Could you also tell us which access mode your cluster uses?

Anyway, I've tried to recreate your scenario. Here's my cluster:

[screenshot: cluster configuration]


I've installed the same version of the library as you:

[screenshot: library installation]

And I was able to read the Excel file without any issue. Pay attention here: you need to provide an absolute path to the file, so add the missing leading forward slash to your location path.

[screenshot: successful read using the absolute path]
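
In code, the corrected read looks roughly like this (a sketch; the catalog/schema/volume segments are placeholders for your own path):

# Absolute path with a leading forward slash - segments are placeholders
location = "/Volumes/my_catalog/my_schema/my_volume/file.xlsx"

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(location))

display(df)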

 

ck7007
Contributor

@kmodelew The issue is that your volume path is missing the leading forward slash. Also, the library isn't loading properly.

Quick Fix

# Correct path format for Volumes
location = '/Volumes/catalog/schema/volume_name/file.xlsx'  # Note the leading slash

# For DBR 16.4, use pandas instead (built-in, no extra library needed)
import pandas as pd

# catalog/schema/volume below are placeholders for your own path segments
df_pandas = pd.read_excel(f"/Volumes/{catalog}/{schema}/{volume}/file.xlsx")
df = spark.createDataFrame(df_pandas)

If You Must Use spark-excel

  1. Install the library correctly:

     # In a notebook cell
     %pip install spark-excel
     dbutils.library.restartPython()

  2. Then read with the proper path:

     location = '/Volumes/catalog/schema/volume/file.xlsx'
     df = (spark.read.format("com.crealytics.spark.excel")
           .option("header", "true")
           .option("inferSchema", "true")
           .option("dataAddress", "'Sheet1'!A1")  # Specify sheet if needed
           .load(location))

  3. Alternative: Native Approach (Most Reliable)

     # Using openpyxl directly for better control
     import openpyxl
     from pyspark.sql import Row

     file_path = "/Volumes/catalog/schema/volume/file.xlsx"
     workbook = openpyxl.load_workbook(file_path, read_only=True)
     sheet = workbook.active

     # Convert to a Spark DataFrame
     rows = []
     headers = [cell.value for cell in sheet[1]]
     for row in sheet.iter_rows(min_row=2, values_only=True):
         rows.append(Row(**dict(zip(headers, row))))

     df = spark.createDataFrame(rows)

Security Note: Volumes are already secured with ACL controls. Your approach is correct for confidential data.

Which catalog/schema/volume are you using? The exact path structure matters.

szymon_dybczak
Esteemed Contributor III

Guys, if you use LLMs, please at least verify what is returned. There's no such Python library as spark-excel.

BS_THE_ANALYST
Esteemed Contributor II

@szymon_dybczak, that response made me chuckle. It's true though: if we use LLMs to assist, we either need to build out a notebook to show the solution working, or at the very least preface the response with "I was unsure, but AI provided me with ____".

For newer topics, especially those where I'm uncertain, I like to ask the AI and also request documentation references. This gives me a starting point, and if I agree, I can pass that knowledge on knowing it's been vetted.

@szymon_dybczak  Reading the LLM response be like:  

[GIF: The Rock, "Are you sure about that?"]


All the best,
BS

BS_THE_ANALYST
Esteemed Contributor II

@ck7007 good spot on the Volumes path though. I assume the leading slash is necessary 🙂

all the best,
BS

szymon_dybczak
Esteemed Contributor III

Exactly as you wrote, @BS_THE_ANALYST. I have nothing against using LLMs; they're an excellent tool for learning and productivity. But you always need to verify what you get in the response, especially when we're talking about Databricks, where new features are added every month while the models were trained several months earlier.

TheOC
Contributor III

@szymon_dybczak 
I think we also need to be conscious of the damage blind LLM usage can do.
I'd hope it would get caught early in a Community message, and in this case the hallucination was relatively harmless. However, there are plenty of reports on Reddit, Twitter, and other forums of vibe coding deleting someone's database when it hit an error, e.g.:
https://forum.cursor.com/t/agent-deleted-databases-willy-nilly/71892

Cheers,
TheOC

kmodelew
New Contributor III

@ck7007 

Your suggestion to use pandas works. Thank you! For others, my test results:

On two separate 16.4 LTS clusters with Scala 2.13, I tested the same code and got different results:

Cluster 1) PySparkTypeError: Exception thrown when converting pandas.Series (object) with name 'column1' to Arrow Array (None).

Cluster 2) Works fine, with warnings.

On a cluster with runtime 17.1, it works without warnings.

THIS CODE DOES NOT WORK using DATABRICKS CONNECT (when you develop from an IDE).
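
If you do need this from an IDE, one possible workaround (an untested sketch using the Databricks SDK Files API; the path is a placeholder) is to download the file bytes first and hand them to pandas:

# Untested sketch: under Databricks Connect, pandas runs on your local machine,
# so /Volumes/... is not a local filesystem path. Fetch the bytes through the
# Databricks SDK Files API instead, then parse them with pandas.
import io
import pandas as pd
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up your local Databricks auth configuration
resp = w.files.download("/Volumes/catalog/schema/volume/file.xlsx")
df_pandas = pd.read_excel(io.BytesIO(resp.contents.read()))
df = spark.createDataFrame(df_pandas)  # spark here is the Databricks Connect session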

@szymon_dybczak thank you for checking this.

ck7007
Contributor

@szymon_dybczak @BS_THE_ANALYST @TheOC:

The Actual Working Solution

The pandas approach works (as @kmodelew confirmed), but here's the complete, tested solution for reading Excel from Volumes:

# For Excel files in Unity Catalog Volumes
import pandas as pd

# Correct path format (with leading /)
file_path = '/Volumes/catalog/schema/volume/file.xlsx'

# Read and convert to Spark
df_pandas = pd.read_excel(file_path)
df = spark.createDataFrame(df_pandas)

For the Crealytics Library (if needed)

The Crealytics spark-excel library requires Maven installation, not pip:
# Install via cluster libraries UI or init script:
# Maven coordinates: com.crealytics:spark-excel_2.13:3.5.1_0.20.4

@BS_THE_ANALYST Thanks for catching the path issue; yes, the leading / is critical for Volumes!

@kmodelew Great that pandas worked! For the PySparkTypeError on cluster 1, it's likely a schema inference issue. You can fix it with:

df = spark.createDataFrame(df_pandas.astype(str))  # Convert all columns to string first
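
If casting everything to string loses too much type information, an alternative sketch (column names and types below are hypothetical; match them to your file) is to coerce only the problematic column and pass an explicit schema, so nothing has to be inferred from mixed object columns:

# Sketch with hypothetical column names - adjust to your file's layout
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Coerce only the mixed-type column; keep numeric columns as-is
df_pandas["column1"] = df_pandas["column1"].astype(str)
df = spark.createDataFrame(df_pandas, schema=schema)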

You're all right about LLM verification—I should have tested the pip install command before suggesting it. Lesson learned. The pandas approach is simpler and built-in anyway.

Thanks for keeping the community accurate! 👍

BS_THE_ANALYST
Esteemed Contributor II

@ck7007 thanks for the update. Absolutely love that you've tested the solution too! Big props 👍. As you mention, if we keep the community accurate, it'll mean that when someone else searches for the thread, they don't end up using an incorrect solution.

Nice touch with the pandas solution.

All the best,
BS
