a week ago
I want to read an Excel xlsx file on DBR 17.3. On the cluster, the library dev.mauch:spark-excel_2.13:4.0.0_0.31.2 is installed.
The V1 implementation works fine:
df = spark.read.format("dev.mauch.spark.excel").schema(schema).load(excel_file)
display(df)
V2 throws [EXCEL_DATA_SOURCE_NOT_ENABLED] Excel data source is not enabled in this cluster. SQLSTATE: 56038
df = spark.read.format("excel").schema(schema).load(excel_file)
display(df)
https://docs.databricks.com/aws/en/error-messages/excel-file-error-error-class
I wonder: why are there Excel-specific error codes, and how do we enable the Excel data source?
a week ago
If I build the spark-excel library with another short name (for example "excelv2"), everything works fine.
https://github.com/nightscape/spark-excel/issues/896#issuecomment-3486861693
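For illustration, a minimal usage sketch; the "excelv2" short name only exists in such a custom build (schema and excel_file as in the first post):
# Assumes a custom build of spark-excel whose DataSourceRegister short name
# was changed to "excelv2"; the stock library does not register this format.
df = spark.read.format("excelv2").schema(schema).load(excel_file)
display(df)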
a week ago
Hi @der
Thank you for reaching out with your question; this is very interesting behavior.
I attempted to reproduce your issue using DBR 17.3 LTS but was unable to install the specific library version "dev.mauch:spark-excel_2.13:4.0.0_0.31.2". However, I installed the next available version, "dev.mauch:spark-excel_2.13:4.0.0_0.32.2-prerelease0".
With this version, I was able to use spark.read.format("excel") without any issue.
spark.read.format("excel").load(excel_file)
Aside from the links you shared, I couldn't find documentation explaining why the error occurred with the "excel" format. It appears the issue may have been resolved in this newer version, possibly due to a missing internal registration.
I recommend updating your dev.mauch:spark-excel library to the newer version.
Alternatively, you can handle Excel (xlsx) files with the pandas API on Spark using the read_excel function, then convert the result to a Spark DataFrame using the code below; just ensure you have openpyxl installed (via pip install).
import pyspark.pandas as ps

# Read the workbook with the pandas API on Spark (requires openpyxl).
df = ps.read_excel(excel_file)
# Convert the pandas-on-Spark frame to a regular Spark DataFrame.
sp_df = df.to_spark()
display(sp_df)
I hope this helps and if it does, please Accept as Solution.
Thank you!
a week ago
Hi @mmayorga
Thank you for your response. I can install both versions, but both fail.
My current Azure Databricks cluster configuration JSON:
{
"cluster_name": "test spark-excel",
"spark_version": "17.3.x-scala2.13",
"node_type_id": "Standard_F4s",
"autotermination_minutes": 60,
"single_user_name": "abc@company.com",
"data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
"runtime_engine": "STANDARD",
"kind": "CLASSIC_PREVIEW",
"is_single_node": true
}
Package installation:
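The spark-excel package is installed from Maven; trimmed to the relevant part, the library spec (shape as in the Databricks Libraries API) looks like this:
{
  "libraries": [
    { "maven": { "coordinates": "dev.mauch:spark-excel_2.13:4.0.0_0.31.2" } }
  ]
}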
What does your setup look like?
a week ago
Another thing is the specific error message.
I still don't get why there is this specific "excel" error message.
If I run:
spark.read.format("test").load(excel_file)
the error message is different:
Py4JJavaError: An error occurred while calling o495.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: test. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
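A quick side-by-side probe of the two error classes (a diagnostic sketch, using the same spark session and excel_file as above):
# "test" is an unregistered format and fails the normal data-source lookup (42K02);
# "excel" is intercepted first by a Databricks-specific check (56038).
for fmt in ["test", "excel"]:
    try:
        spark.read.format(fmt).load(excel_file)
    except Exception as e:
        print(fmt, "->", type(e).__name__)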
a week ago
Hi @der
First of all, thank you for your patience and for providing more information about your case.
I replicated your cluster config exactly in Azure. Without installing any library, I was able to run and load the xlsx file using .format("excel"). Unfortunately the documentation does not mention this; at this point I was not using the "dev.mauch.spark.excel" library.
Then I checked the Spark UI and, under "SQL / DataFrame Properties", found "spark.databricks.sql.excel.enabled" set to "true".
You can check this configuration by executing the following code:
spark.conf.get("spark.databricks.sql.excel.enabled")
Next, I disabled it to test:
spark.conf.set("spark.databricks.sql.excel.enabled","false")
With this configuration, I was able to replicate the issue.
With that said, please enable this configuration on your cluster and try reading the file again:
spark.conf.set("spark.databricks.sql.excel.enabled","true")
Then I installed "dev.mauch.spark.excel" and was able to use it as well.
This behavior shows that the two implementations are separate and that .format("excel") resolves to the built-in "spark.databricks.sql.excel" source instead of the expected "dev.mauch.spark.excel" one. I agree it is a bit confusing when updating API versions, since both use "excel" as the format value.
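One way to see which implementation served a given read is to inspect the physical plan; a diagnostic sketch (the exact scan node names are runtime-specific, so treat the output as a hint):
# With spark.databricks.sql.excel.enabled set to true, the scan node should
# come from the built-in Databricks excel source rather than the dev.mauch reader.
df = spark.read.format("excel").load(excel_file)
df.explain(mode="formatted")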
I hope this helps and if it does, please Accept as Solution.
Monday
Hi @mmayorga
Thank you very much for your insights.
I am not sure the format "excel" has anything to do with open-source Spark. I think this is more a Databricks thing: spark.databricks.sql.excel.enabled.
Current state:
| Databricks Runtime | Spark Excel library | Default of config spark.databricks.sql.excel.enabled | V2 reader .format("excel") |
| --- | --- | --- | --- |
| 16.4.x-scala2.12 | com.crealytics:spark-excel_2.12:3.5.0_0.20.3 | cannot be found | works |
| 17.3.x-scala2.13 | dev.mauch:spark-excel_2.13:4.0.0_0.31.2 | false | does not work |
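To check the "default of config" column on a given runtime without hitting an exception (on 16.4 the key simply does not exist), a small probe:
# conf.get raises if the key is unset and no default is supplied; passing a
# default makes the check safe on runtimes where the flag does not exist.
print(spark.conf.get("spark.databricks.sql.excel.enabled", "cannot be found"))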
How can I use the V2 data source .format("excel") of dev.mauch:spark-excel_2.13:4.0.0_0.31.2 with 17.3.x-scala2.13?