
EXCEL_DATA_SOURCE_NOT_ENABLED Excel data source is not enabled in this cluster

der
Contributor II

I want to read an Excel xlsx file on DBR 17.3. On the cluster, the library dev.mauch:spark-excel_2.13:4.0.0_0.31.2 is installed.

The V1 implementation works fine:

df = spark.read.format("dev.mauch.spark.excel").schema(schema).load(excel_file)
display(df)

The V2 implementation throws [EXCEL_DATA_SOURCE_NOT_ENABLED] Excel data source is not enabled in this cluster. SQLSTATE: 56038

df = spark.read.format("excel").schema(schema).load(excel_file)
display(df)

https://learn.microsoft.com/en-gb/azure/databricks/error-messages/error-classes#excel_data_source_no...

https://docs.databricks.com/aws/en/error-messages/excel-file-error-error-class

I wonder: why are there Excel-specific error codes, and how do we enable the Excel data source?

6 REPLIES

der
Contributor II

If I build the spark-excel library with another short name (for example, "excelv2"), everything works fine.

https://github.com/nightscape/spark-excel/issues/896#issuecomment-3486861693
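
For reference, the read then looks like this (a sketch; "excelv2" is just the custom short name my rebuilt jar registers, with schema and excel_file as above):

# Read via the rebuilt library's alternative short name. "excelv2" only
# exists because the jar was rebuilt to register that name instead of "excel".
df = spark.read.format("excelv2").schema(schema).load(excel_file)
display(df)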

mmayorga
Databricks Employee

Hi @der

Thank you for reaching out with your question; this is very interesting behavior.

I attempted to reproduce your issue using DBR 17.3 LTS but was unable to install the specific library version "dev.mauch:spark-excel_2.13:4.0.0_0.31.2". However, I installed the next available version, "dev.mauch:spark-excel_2.13:4.0.0_0.32.2-prerelease0".

With this version, I was able to use spark.read.format("excel") without any issue.

spark.read.format("excel").load(excel_file)

Aside from the links you shared, I couldn't find documentation explaining why the error occurred with the "excel" format. It appears that the issue may have been resolved in this newer version, possibly due to a missing internal registration.

I recommend updating your dev.mauch:spark-excel library to the newer version.

Alternatively, you can handle Excel (xlsx) files using the pandas API on Spark with the read_excel function and then convert the result to a Spark DataFrame using the code below; just ensure you have "openpyxl" installed (via pip install).

import pyspark.pandas as ps

# Read the workbook with the pandas API on Spark (requires openpyxl),
# then convert the result to a regular Spark DataFrame.
df = ps.read_excel(excel_file)
sp_df = df.to_spark()
display(sp_df)

I hope this helps, and if it does, please Accept as Solution.
Thank you!

der
Contributor II

Hi @mmayorga 

Thank you for your response. I can install both versions, but both fail.

My current Azure Databricks cluster configuration JSON:

{
    "cluster_name": "test spark-excel",
    "spark_version": "17.3.x-scala2.13",
    "node_type_id": "Standard_F4s",
    "autotermination_minutes": 60,
    "single_user_name": "abc@company.com",
    "data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
    "runtime_engine": "STANDARD",
    "kind": "CLASSIC_PREVIEW",
    "is_single_node": true
}

Package installation:

[screenshot: library installation on the cluster]

What does your setup look like?

der
Contributor II

@mmayorga 

Another thing is the specific error message.

https://learn.microsoft.com/en-gb/azure/databricks/error-messages/error-classes#excel_data_source_no...

I still don't get why there is this specific "excel" error message.

If I run:

spark.read.format("test").load(excel_file)

 The error message is different:

Py4JJavaError: An error occurred while calling o495.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: test. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
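
A quick side-by-side comparison (just a sketch, with excel_file as above):

# Probe a few format names and print the first line of the error each one
# raises; only "excel" surfaces the special-cased
# EXCEL_DATA_SOURCE_NOT_ENABLED class instead of DATA_SOURCE_NOT_FOUND.
for fmt in ["excel", "test"]:
    try:
        spark.read.format(fmt).load(excel_file)
    except Exception as e:
        print(fmt, "->", str(e).splitlines()[0])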

mmayorga
Databricks Employee
Databricks Employee

Hi @der

First of all, thank you for your patience and for providing more information about your case.

Use of ".format("excel")"

I replicated your cluster config exactly in Azure. Without installing any library, I was able to load the xlsx file using ".format("excel")". Unfortunately, the documentation does not mention this; at this point I was not using the "dev.mauch.spark.excel" library.

[screenshot: xlsx file loaded with .format("excel") and no library installed]

Then I checked the Spark UI and, under "SQL / DataFrame Properties", found "spark.databricks.sql.excel.enabled" set to "true".

You can check this configuration by executing the following code:

spark.conf.get("spark.databricks.sql.excel.enabled")
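
Note that spark.conf.get raises an error if the key does not exist at all (as on older runtimes, see the table further down); passing a default makes the check safe. A small sketch:

# Returns the configured value, or "not set" on runtimes where the key
# does not exist.
print(spark.conf.get("spark.databricks.sql.excel.enabled", "not set"))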

Next I disabled it to test it:

spark.conf.set("spark.databricks.sql.excel.enabled","false")

With this configuration, I was now able to replicate the issue:

[screenshot: EXCEL_DATA_SOURCE_NOT_ENABLED error after disabling the config]

With this said, please enable this configuration on your cluster and try reading the file again:

spark.conf.set("spark.databricks.sql.excel.enabled","true")
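
Putting it together (a sketch, with schema and excel_file as in your original post):

# Enable the built-in Excel reader, then retry the V2-style read.
spark.conf.set("spark.databricks.sql.excel.enabled", "true")
df = spark.read.format("excel").schema(schema).load(excel_file)
display(df)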

Use of ".format("dev.mauch.spark.excel")"

Then I installed the "dev.mauch.spark.excel" library and was able to use it as well.

[screenshot: file read using the dev.mauch.spark.excel library]

Conclusion

This behavior highlights that these two implementations are separate and that ".format("excel")" resolves to the built-in "spark.databricks.sql.excel" reader instead of the expected "dev.mauch.spark.excel". I agree that this is a bit confusing when updating API versions, since both use "excel" as the format value.
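
One way to keep the two implementations apart in code is to address each by its unambiguous name (a sketch, using the setup described above):

# Built-in Databricks reader, gated by spark.databricks.sql.excel.enabled.
df_builtin = spark.read.format("excel").schema(schema).load(excel_file)

# Third-party library, addressed by its full provider name so it cannot
# collide with the built-in "excel" short name.
df_library = spark.read.format("dev.mauch.spark.excel").schema(schema).load(excel_file)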

I hope this helps, and if it does, please Accept as Solution.

der
Contributor II

Hi @mmayorga 

Thank you very much for your insights. 

I am not sure the "excel" format has anything to do with open-source Spark. I think this is more a Databricks-specific thing (spark.databricks.sql.excel.enabled).

Current state:

Databricks Runtime | Spark Excel library | Default of spark.databricks.sql.excel.enabled | V2 Reader .format("excel")
16.4.x-scala2.12 | com.crealytics:spark-excel_2.12:3.5.0_0.20.3 | cannot be found | works
17.3.x-scala2.13 | dev.mauch:spark-excel_2.13:4.0.0_0.31.2 | false | does not work

How to use V2 data source .format("excel") of dev.mauch:spark-excel_2.13:4.0.0_0.31.2 with 17.3.x-scala2.13?