a week ago
I want to read an Excel xlsx file on DBR 17.3. On the cluster, the library dev.mauch:spark-excel_2.13:4.0.0_0.31.2 is installed.
The V1 implementation works fine:
df = spark.read.format("dev.mauch.spark.excel").schema(schema).load(excel_file)
display(df)
V2 throws [EXCEL_DATA_SOURCE_NOT_ENABLED] Excel data source is not enabled in this cluster. SQLSTATE: 56038
df = spark.read.format("excel").schema(schema).load(excel_file)
display(df)
https://docs.databricks.com/aws/en/error-messages/excel-file-error-error-class
I wonder: why are there Excel-specific error codes, and how do we enable the Excel data source?
a week ago
If I build the spark-excel library with another short name (for example "excelv2"), everything works fine.
https://github.com/nightscape/spark-excel/issues/896#issuecomment-3486861693
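For illustration, a minimal usage sketch; the "excelv2" short name only exists in such a custom build (schema and excel_file as in the first post):
# Assumes a custom build of spark-excel whose DataSourceRegister short name
# was changed to "excelv2"; the stock library does not register this format.
df = spark.read.format("excelv2").schema(schema).load(excel_file)
display(df)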
a week ago
Hi @der
Thank you for reaching out with your question; this is very interesting behavior.
I attempted to reproduce your issue using DBR 17.3 LTS but was unable to install the specific library version "dev.mauch:spark-excel_2.13:4.0.0_0.31.2". However, I installed the next available version, "dev.mauch:spark-excel_2.13:4.0.0_0.32.2-prerelease0".
With this version, I was able to use spark.read.format("excel") without any issue.
spark.read.format("excel").load(excel_file)
Aside from the links you shared, I couldn't find documentation explaining why the error occurred with the "excel" format. It appears the issue may have been resolved in this newer version, possibly due to a missing internal registration.
I recommend updating your dev.mauch:spark-excel library to the newer version.
Alternatively, you can handle Excel (xlsx) files with the pandas API on Spark using the read_excel function, then convert the result to a Spark DataFrame using the code below; just ensure you have openpyxl installed (via pip install).
import pyspark.pandas as ps

# Read the workbook with the pandas API on Spark (requires openpyxl).
df = ps.read_excel(excel_file)
# Convert the pandas-on-Spark frame to a regular Spark DataFrame.
sp_df = df.to_spark()
display(sp_df)
I hope this helps and if it does, please Accept as Solution.
Thank you!
a week ago
Hi @mmayorga
Thank you for your response. I can install both versions, but both fail.
My current Azure Databricks cluster configuration JSON:
{
"cluster_name": "test spark-excel",
"spark_version": "17.3.x-scala2.13",
"node_type_id": "Standard_F4s",
"autotermination_minutes": 60,
"single_user_name": "abc@company.com",
"data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
"runtime_engine": "STANDARD",
"kind": "CLASSIC_PREVIEW",
"is_single_node": true
}
Package installation:
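The spark-excel package is installed from Maven; trimmed to the relevant part, the library spec (shape as in the Databricks Libraries API) looks like this:
{
  "libraries": [
    { "maven": { "coordinates": "dev.mauch:spark-excel_2.13:4.0.0_0.31.2" } }
  ]
}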
What does your setup look like?
a week ago
Another thing is the specific error message.
I still don't get why there is this specific "excel" error message.
If I run:
spark.read.format("test").load(excel_file)
the error message is different:
Py4JJavaError: An error occurred while calling o495.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: test. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
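A quick side-by-side probe of the two error classes (a diagnostic sketch, using the same spark session and excel_file as above):
# "test" is an unregistered format and fails the normal data-source lookup (42K02);
# "excel" is intercepted first by a Databricks-specific check (56038).
for fmt in ["test", "excel"]:
    try:
        spark.read.format(fmt).load(excel_file)
    except Exception as e:
        print(fmt, "->", type(e).__name__)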
a week ago
Hi @der
First of all, thank you for your patience and for providing more information about your case.
I replicated your cluster config exactly in Azure. Without installing any library, I was able to run and load the xlsx file using .format("excel"). Unfortunately the documentation does not mention this; at this point I was not using the "dev.mauch.spark.excel" library.
Then I checked the Spark UI and, under "SQL / DataFrame Properties", found "spark.databricks.sql.excel.enabled" set to "true".
You can check this configuration by executing the following code:
spark.conf.get("spark.databricks.sql.excel.enabled")
Next, I disabled it to test:
spark.conf.set("spark.databricks.sql.excel.enabled","false")
With this configuration, I was able to replicate the issue.
With that said, please enable this configuration on your cluster and try reading the file again:
spark.conf.set("spark.databricks.sql.excel.enabled","true")
Then I installed "dev.mauch.spark.excel" and was able to use it as well.
This behavior shows that the two implementations are separate and that .format("excel") resolves to the built-in "spark.databricks.sql.excel" source instead of the expected "dev.mauch.spark.excel" one. I agree it is a bit confusing when updating API versions, since both use "excel" as the format value.
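One way to see which implementation served a given read is to inspect the physical plan; a diagnostic sketch (the exact scan node names are runtime-specific, so treat the output as a hint):
# With spark.databricks.sql.excel.enabled set to true, the scan node should
# come from the built-in Databricks excel source rather than the dev.mauch reader.
df = spark.read.format("excel").load(excel_file)
df.explain(mode="formatted")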
I hope this helps and if it does, please Accept as Solution.
Monday
Hi @mmayorga
Thank you very much for your insights.
I am not sure the format "excel" has anything to do with open-source Spark. I think this is more a Databricks thing: spark.databricks.sql.excel.enabled.
Current state:
| Databricks Runtime | Spark Excel library | Default of config spark.databricks.sql.excel.enabled | V2 reader .format("excel") |
| --- | --- | --- | --- |
| 16.4.x-scala2.12 | com.crealytics:spark-excel_2.12:3.5.0_0.20.3 | cannot be found | works |
| 17.3.x-scala2.13 | dev.mauch:spark-excel_2.13:4.0.0_0.31.2 | false | does not work |
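To check the "default of config" column on a given runtime without hitting an exception (on 16.4 the key simply does not exist), a small probe:
# conf.get raises if the key is unset and no default is supplied; passing a
# default makes the check safe on runtimes where the flag does not exist.
print(spark.conf.get("spark.databricks.sql.excel.enabled", "cannot be found"))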
How can I use the V2 data source .format("excel") of dev.mauch:spark-excel_2.13:4.0.0_0.31.2 with 17.3.x-scala2.13?