09-03-2025 08:04 AM
Hi, I'm trying to read an Excel file directly from a Volume (not the workspace or FileStore) -> all the examples on the internet use the workspace or FileStore. The Volume is an external location, so I can read from there, but I would like to read directly from the Volume. I have confidential data - that is the reason for reading it from a Volume with secure access.
I'm using com.crealytics:spark-excel_2.13:3.5.1_0.20.4 on a 16.4 LTS cluster with Scala 2.13.
Code:
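A minimal sketch of what I'm running (catalog/schema/volume names are placeholders):
# Read the Excel file from the Volume with spark-excel
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("Volumes/my_catalog/my_schema/my_volume/file.xlsx"))
display(df)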
Error:
09-03-2025 08:05 AM - edited 09-03-2025 08:26 AM
Hi @kmodelew ,
Could you also share what access mode you have on your cluster?
Anyway, I've tried to recreate your scenario. Here's my cluster:
I've installed the same version of library as you:
And I was able to read the Excel file without any issue. Note that you need to provide the absolute path to the file - so add the missing leading forward slash to your location path.
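For example (illustrative placeholder paths; only the second form resolves):
# Missing the leading slash - this relative path won't resolve:
location = 'Volumes/catalog/schema/volume/file.xlsx'
# Absolute Volume path - note the leading /:
location = '/Volumes/catalog/schema/volume/file.xlsx'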
09-03-2025 08:24 AM
@kmodelew The issue is that your volume path is missing the leading forward slash. Also, the library isn't loading properly.
# Correct path format for Volumes
location = '/Volumes/catalog/schema/volume_name/file.xlsx'  # Note the leading /
# For DBR 16.4, use pandas instead (built-in, no library needed)
import pandas as pd
df_pandas = pd.read_excel(f"/Volumes/{catalog}/{schema}/{volume}/file.xlsx")
df = spark.createDataFrame(df_pandas)
# Using openpyxl directly for better control
import openpyxl
from pyspark.sql import Row

file_path = "/Volumes/catalog/schema/volume/file.xlsx"
workbook = openpyxl.load_workbook(file_path, read_only=True)
sheet = workbook.active

# Convert to Spark DataFrame: first row as headers, remaining rows as data
headers = [cell.value for cell in sheet[1]]
rows = []
for row in sheet.iter_rows(min_row=2, values_only=True):
    rows.append(Row(**dict(zip(headers, row))))
workbook.close()  # read_only mode keeps the file handle open until closed

df = spark.createDataFrame(rows)
Security Note: Volumes are already secure with ACL controls. Your approach is correct for confidential data.
Which catalog/schema/volume are you using? The exact path structure matters.
09-03-2025 08:38 AM
Guys, if you use LLMs, please at least verify what is returned. There's no such Python library as spark-excel.
09-03-2025 08:51 AM
@szymon_dybczak, that response made me chuckle. It's true though: if we use LLMs to assist, I think we either need to build out a notebook to show it working, or at the very least preface the response with something like "I was unsure, but AI provided me with ____".
For newer topics, especially those where I'm uncertain, I like to ask the AI and also request documentation references. That gives me a starting point, and if I agree, I can then pass that knowledge on, knowing it's been vetted.
@szymon_dybczak Reading the LLM response be like:
All the best,
BS
09-03-2025 08:55 AM
@ck7007 good spot on the Volumes path though. I assume the leading slash is necessary 🙂
All the best,
BS
09-03-2025 09:03 AM
Exactly as you wrote, @BS_THE_ANALYST. I have nothing against using LLMs; they're an excellent tool for learning and productivity.
But you always need to verify what you get in the response, especially when we're talking about Databricks, where new features are added every month while the models were trained several months earlier.
09-03-2025 09:31 AM
@szymon_dybczak
I think we also need to be conscious of the damage blind LLM usage can do.
I'd hope it'd get caught early in a Community message, and in this case the hallucination was relatively harmless. However, there are plenty of instances on Reddit, Twitter, and other forums where vibe coding deleted someone's database when it hit an error, e.g.:
https://forum.cursor.com/t/agent-deleted-databases-willy-nilly/71892
09-04-2025 01:59 AM
Your suggestion to use pandas works. Thank you! For others, my test results:
I tested the same code on two separate 16.4 LTS clusters with Scala 2.13 and got different results:
Cluster 1) PySparkTypeError: Exception thrown when converting pandas.Series (object) with name 'column1' to Arrow Array (None).
Cluster 2) Works fine, with warnings.
On a cluster with runtime 17.1 it works without warnings.
THIS CODE DOES NOT WORK using DATABRICKS CONNECT (when you develop from an IDE), presumably because pandas then reads from the local filesystem, where the /Volumes path does not exist.
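A possible workaround for Databricks Connect (an untested sketch, assuming the databricks-sdk Files API and configured authentication) is to download the file through the Files API first, since /Volumes paths are not visible to the local Python process:
# Untested sketch: fetch the file via the Files API, then read it locally
import io
import pandas as pd
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up your configured Databricks auth
resp = w.files.download("/Volumes/catalog/schema/volume/file.xlsx")
df_pandas = pd.read_excel(io.BytesIO(resp.contents.read()))
df = spark.createDataFrame(df_pandas)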
@szymon_dybczak thank you for your check.
09-04-2025 03:04 PM
@szymon_dybczak @BS_THE_ANALYST @TheOC:
The pandas approach works (as @kmodelew confirmed), but here's the complete, tested solution for reading Excel from Volumes:
# For Excel files in Unity Catalog Volumes
import pandas as pd
# Correct path format (with leading /)
file_path = '/Volumes/catalog/schema/volume/file.xlsx'
# Read and convert to Spark
df_pandas = pd.read_excel(file_path)
df = spark.createDataFrame(df_pandas)
The Crealytics spark-excel library requires Maven installation, not pip:
# Install via cluster libraries UI or init script:
# Maven coordinates: com.crealytics:spark-excel_2.13:3.5.1_0.20.4
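Once the library is attached from Maven, the read itself would look something like this (a sketch with a placeholder path):
# Read directly with spark-excel after the Maven install
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/Volumes/catalog/schema/volume/file.xlsx"))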
@BS_THE_ANALYST Thanks for catching the path issue. Yes, the leading / is critical for Volumes!
@kmodelew Great that pandas worked! For the PySparkTypeError on cluster 1, it's likely a schema inference issue. You can fix it with:
df = spark.createDataFrame(df_pandas.astype(str))  # Convert all to string first
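If you'd rather keep numeric types, an alternative sketch (the column names below are hypothetical; substitute your own) is to pass an explicit schema so nothing has to be inferred from object-dtype columns:
# Hypothetical column names - replace with your actual columns
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("column1", StringType(), True),   # stringify only the problem column
    StructField("amount", DoubleType(), True),
])
df = spark.createDataFrame(df_pandas.astype({"column1": str}), schema=schema)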
You're all right about LLM verification; I should have tested the pip install command before suggesting it. Lesson learned. The pandas approach is simpler and built in anyway.
Thanks for keeping the community accurate! 👍
09-05-2025 01:36 AM - edited 09-05-2025 01:36 AM
@ck7007 thanks for the update. Absolutely love that you've tested the solution too! Big props 👍. As you mention, if we keep the community accurate, it'll mean that when someone else searches for the thread, they don't end up using an incorrect solution.
Nice touch with the pandas solution.
All the best,
BS