Hello @Sahil0007
Thanks for sharing the code and error. This specific error means Spark can’t find the Excel data source on your cluster.
What the error means
The message “[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel” is raised when the provider isn’t available on the cluster (not installed, incompatible, or not loadable).
How to fix it on Databricks
The most common causes and resolutions:
- Install the Excel connector as a JVM/Maven library on the cluster (not with pip). The package is not a Python wheel, so it must be installed as a cluster-level JVM library using Maven coordinates (Compute > your cluster > Libraries > Install new > Maven). If you prefer to script the install, see the REST API sketch after this list.
- Pick the Maven coordinate that matches your cluster's Spark and Scala versions. The artifact needs the correct Scala suffix (for example, "_2.12" vs "_2.13") and a version aligned with your Spark version, so check both against your cluster before selecting from Maven Central. An example known to work on Spark 3.5/Scala 2.12 clusters: com.crealytics:spark-excel_2.12:3.5.0_0.20.3.
- If you're using a serverless cluster, be aware that installing arbitrary Maven libraries isn't supported. Use a classic/all-purpose cluster or another supported approach; otherwise you'll keep getting the "data source not found" error even after attempting the install via API.
- After installing a new cluster library, restart the cluster so Spark loads it on the driver and executors. This is standard Databricks practice and required for new JVM libraries to become visible.
- Use the correct format string for the version you installed:
  - For the classic com.crealytics package, use format("com.crealytics.spark.excel").
  - Some newer releases/forks register the short name "excel", so format("excel") works as well; this depends on the specific artifact (for example, the dev.mauch fork on newer DBRs).
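If you prefer scripting to the UI, here is a minimal sketch using the Databricks Libraries and Clusters REST APIs. The workspace URL, token, and cluster ID are placeholders, and the coordinate is the Spark 3.5/Scala 2.12 example above; adjust all of them to your environment.

import requests

HOST = "https://<your-workspace-url>"      # placeholder workspace URL
TOKEN = "<personal-access-token>"          # placeholder personal access token
CLUSTER_ID = "<cluster-id>"                # placeholder cluster ID
headers = {"Authorization": f"Bearer {TOKEN}"}

# Install the Maven coordinate as a cluster library (same effect as the UI flow above).
requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers=headers,
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [{"maven": {"coordinates": "com.crealytics:spark-excel_2.12:3.5.0_0.20.3"}}],
    },
).raise_for_status()

# Restart the cluster so the driver and executors pick up the new JVM library.
requests.post(
    f"{HOST}/api/2.0/clusters/restart",
    headers=headers,
    json={"cluster_id": CLUSTER_ID},
).raise_for_status()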
Quick verification steps
1) Confirm cluster runtime and versions (to select the right coordinate):
print("Spark:", spark.version)
print("DBR:", spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "n/a"))
Then install the Maven coordinate in Compute > Libraries > Install new > Maven; search Maven Central and select the artifact that matches your Scala suffix and Spark version.
2) Restart the cluster.
3) Re-run your code (this is fine as-is):
df = (spark.read.format("com.crealytics.spark.excel")
.option("header", "true")
.option("inferSchema", "true")
.load("abfss://container_name@storage_account.dfs.core.windows.net/dop_testing/PrivilegeSheet.xlsx"))
df.show(5)
If you installed a version that registers the short name, you can alternatively try:
df = (spark.read.format("excel")
.option("header", "true")
.option("inferSchema", "true")
.load("abfss://container_name@storage_account.dfs.core.windows.net/dop_testing/PrivilegeSheet.xlsx"))
Workarounds if you can’t install the library
- Read with pandas on the driver, then convert to Spark:
import pandas as pd
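# Note: pandas can only open an abfss:// URI if fsspec and adlfs are available and
# storage credentials are configured; otherwise copy the file down first (see the
# sketch after this snippet).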
pdf = pd.read_excel("abfss://container_name@storage_account.dfs.core.windows.net/dop_testing/PrivilegeSheet.xlsx")
df = spark.createDataFrame(pdf)
This avoids the JVM data source but is less scalable for very large files.
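If fsspec/adlfs isn't set up for pandas, a simple alternative is to copy the workbook to the driver's local disk with dbutils and read it from there. A minimal sketch (the /tmp path is just a placeholder):

# Copy the file from ADLS to local driver storage, then read it with pandas.
dbutils.fs.cp(
    "abfss://container_name@storage_account.dfs.core.windows.net/dop_testing/PrivilegeSheet.xlsx",
    "file:/tmp/PrivilegeSheet.xlsx",
)

import pandas as pd
pdf = pd.read_excel("/tmp/PrivilegeSheet.xlsx")  # .xlsx needs the openpyxl engine
df = spark.createDataFrame(pdf)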
Cheers, Louis.