Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to read excel file from Volume

kmodelew
New Contributor III

Hi, I'm trying to read an Excel file directly from a Volume (not the workspace or FileStore); all the examples on the internet use the workspace or FileStore. The Volume is an external location, so I can read from there, but I would like to read directly from the Volume. I have confidential data, which is the reason for reading it from a Volume with secure access.

I'm using com.crealytics:spark-excel_2.13:3.5.1_0.20.4 on a 16.4 LTS cluster with Scala 2.13.

Code: 

location = 'Volumes/directory1/file.xlsx'
df = (spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(location)
)

Error: 

 Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
 

 

10 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @kmodelew ,

Could you also tell us which access mode your cluster uses?

Anyway, I've tried to recreate your scenario. Here's my cluster:

[screenshot: cluster configuration]


I've installed the same version of the library as you:

[screenshot: library installation]

And I was able to read the Excel file without any issue. Pay attention here: you need to provide an absolute path to the file, so add the missing leading forward slash to your location path.

[screenshot: successful read using the absolute path]
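
In code, the corrected read looks roughly like this (a sketch; the catalog/schema/volume segments are placeholders for your own path):

# Absolute path with a leading forward slash - segments are placeholders
location = "/Volumes/my_catalog/my_schema/my_volume/file.xlsx"

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(location))

display(df)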

 

ck7007
Contributor

@kmodelew The issue is that your volume path is missing the leading forward slash. Also, the library isn't loading properly.

Quick Fix

# Correct path format for Volumes
location = '/Volumes/catalog/schema/volume_name/file.xlsx'  # Note the leading slash

# For DBR 16.4, use pandas instead (built-in, no extra library needed)
import pandas as pd

# catalog/schema/volume below are placeholders for your own path segments
df_pandas = pd.read_excel(f"/Volumes/{catalog}/{schema}/{volume}/file.xlsx")
df = spark.createDataFrame(df_pandas)

If You Must Use spark-excel

  1. Install the library correctly:

     # In a notebook cell
     %pip install spark-excel
     dbutils.library.restartPython()

  2. Then read with the proper path:

     location = '/Volumes/catalog/schema/volume/file.xlsx'
     df = (spark.read.format("com.crealytics.spark.excel")
           .option("header", "true")
           .option("inferSchema", "true")
           .option("dataAddress", "'Sheet1'!A1")  # Specify sheet if needed
           .load(location))

  3. Alternative: Native Approach (Most Reliable)

     # Using openpyxl directly for better control
     import openpyxl
     from pyspark.sql import Row

     file_path = "/Volumes/catalog/schema/volume/file.xlsx"
     workbook = openpyxl.load_workbook(file_path, read_only=True)
     sheet = workbook.active

     # Convert to a Spark DataFrame
     rows = []
     headers = [cell.value for cell in sheet[1]]
     for row in sheet.iter_rows(min_row=2, values_only=True):
         rows.append(Row(**dict(zip(headers, row))))

     df = spark.createDataFrame(rows)

Security Note: Volumes are already secured with ACL controls. Your approach is correct for confidential data.

Which catalog/schema/volume are you using? The exact path structure matters.

szymon_dybczak
Esteemed Contributor III

Guys, if you use LLMs, please at least verify what is returned. There's no such Python library as spark-excel.

BS_THE_ANALYST
Esteemed Contributor II

@szymon_dybczak, that response made me chuckle. It's true though: if we use LLMs to assist, we either need to build out a notebook to show the solution working, or at the very least preface the response with "I was unsure, but AI provided me with ____".

For newer topics, especially those where I'm uncertain, I like to ask the AI and also request documentation references. This gives me a starting point, and if I agree, I can pass that knowledge on knowing it's been vetted.

@szymon_dybczak  Reading the LLM response be like:  

[GIF: The Rock, "Are you sure about that?"]


All the best,
BS

BS_THE_ANALYST
Esteemed Contributor II

@ck7007 good spot on the Volumes path though. I assume the leading slash is necessary 🙂

all the best,
BS

szymon_dybczak
Esteemed Contributor III

Exactly as you wrote, @BS_THE_ANALYST. I have nothing against using LLMs; they're an excellent tool for learning and productivity. But you always need to verify what you get in the response, especially when we're talking about Databricks, where new features are added every month while the models were trained several months earlier.

TheOC
Contributor III

@szymon_dybczak 
I think we also need to be conscious of the damage blind LLM usage can do.
I'd hope it would get caught early in a Community message, and in this case the hallucination was relatively harmless. However, there are plenty of reports on Reddit, Twitter, and other forums of vibe coding deleting someone's database when it hit an error, e.g.:
https://forum.cursor.com/t/agent-deleted-databases-willy-nilly/71892

Cheers,
TheOC

kmodelew
New Contributor III

@ck7007 

Your suggestion to use pandas works. Thank you! For others, my test results:

On two separate 16.4 LTS clusters with Scala 2.13, I tested the same code and got different results:

Cluster 1) PySparkTypeError: Exception thrown when converting pandas.Series (object) with name 'column1' to Arrow Array (None).

Cluster 2) Works fine, with warnings.

On a cluster with runtime 17.1, it works without warnings.

THIS CODE DOES NOT WORK using DATABRICKS CONNECT (when you develop from an IDE).
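
If you do need this from an IDE, one possible workaround (an untested sketch using the Databricks SDK Files API; the path is a placeholder) is to download the file bytes first and hand them to pandas:

# Untested sketch: under Databricks Connect, pandas runs on your local machine,
# so /Volumes/... is not a local filesystem path. Fetch the bytes through the
# Databricks SDK Files API instead, then parse them with pandas.
import io
import pandas as pd
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up your local Databricks auth configuration
resp = w.files.download("/Volumes/catalog/schema/volume/file.xlsx")
df_pandas = pd.read_excel(io.BytesIO(resp.contents.read()))
df = spark.createDataFrame(df_pandas)  # spark here is the Databricks Connect session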

@szymon_dybczak thank you for checking this.

ck7007
Contributor

@szymon_dybczak @BS_THE_ANALYST @TheOC:

The Actual Working Solution

The pandas approach works (as @kmodelew confirmed), but here's the complete, tested solution for reading Excel from Volumes:

# For Excel files in Unity Catalog Volumes
import pandas as pd

# Correct path format (with leading /)
file_path = '/Volumes/catalog/schema/volume/file.xlsx'

# Read and convert to Spark
df_pandas = pd.read_excel(file_path)
df = spark.createDataFrame(df_pandas)

For the Crealytics Library (if needed)

The Crealytics spark-excel library requires Maven installation, not pip:
# Install via cluster libraries UI or init script:
# Maven coordinates: com.crealytics:spark-excel_2.13:3.5.1_0.20.4

@BS_THE_ANALYST Thanks for catching the path issue; yes, the leading / is critical for Volumes!

@kmodelew Great that pandas worked! For the PySparkTypeError on cluster 1, it's likely a schema inference issue. You can fix it with:

df = spark.createDataFrame(df_pandas.astype(str))  # Convert all columns to string first
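
If casting everything to string loses too much type information, an alternative sketch (column names and types below are hypothetical; match them to your file) is to coerce only the problematic column and pass an explicit schema, so nothing has to be inferred from mixed object columns:

# Sketch with hypothetical column names - adjust to your file's layout
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Coerce only the mixed-type column; keep numeric columns as-is
df_pandas["column1"] = df_pandas["column1"].astype(str)
df = spark.createDataFrame(df_pandas, schema=schema)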

You're all right about LLM verification—I should have tested the pip install command before suggesting it. Lesson learned. The pandas approach is simpler and built-in anyway.

Thanks for keeping the community accurate! 👍

BS_THE_ANALYST
Esteemed Contributor II

@ck7007 thanks for the update. Absolutely love that you've tested the solution too! Big props 👍. As you mention, if we keep the community accurate, it'll mean that when someone else searches for the thread, they don't end up using an incorrect solution.

Nice touch with the pandas solution.

All the best,
BS
