Databricks Community

Livingstone · ‎08-19-2024

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.

Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.

headers = {
    'Authorization': f'Bearer {TOKEN}',
}

data = {
  "cluster_id": CLUSTER_ID,
  "libraries": [
    {
      "maven": {
        "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
      }
    }
  ]
}


response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)

But when I try to save the file in Excel format, it returns an error

[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02

How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?

Nurota · ‎10-31-2024

I have a similar issue: how to install maven package in the notebook when running with a serverless cluster?

I need to install com.crealytics:spark-excel_2.12:3.4.2_0.20.3 in the notebook like the way pypl libraries installed in the notebook. e.g. %pip install package_name for pypl libraries.

I don't want to use environment sidebar and dependencies. First of all, adding the maven package in dependencies did not work ( I am guessing because it's not Pypl library). Secondly, I will be running the notebook in a workflow via Git, and even if applying the library via dependencies tab worked, it would not know about it when running the notebook from Git, so would not work.

VincentS · ‎03-07-2025

I have the exact same question and have not found any way to do it

GalenSwint · ‎03-13-2025

I also have this question and wondered what the options were / are

QuanSun · ‎04-30-2025

Hi @Livingstone , thanks for this questions.

Could you please share how did you get cluster id of serverless compute?

Louis_Frolio · ‎04-30-2025

As you stated, ou cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.

However, there are alternative approaches to export data to Excel with minimal latency.

Solutions to Export Excel Files Without Maven on Serverless Clusters:

1. Use Pandas with XLSXWriter:
Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly.
```python
# Convert to Pandas DataFrame and save as Excel
pandas_df = spark_df.toPandas()
pandas_df.to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```
- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook.
- Limitations: This approach works for smaller datasets ( 90G < Pandas is memory-intensive).

2. Switch to a Standard or Single-Node Cluster:
- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API.
- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support.

3. Use CSV as an Intermediate Format:
Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely.
```python
spark_df.write.csv("/dbfs/path/output.csv", header=True)
```

Why the Serverless Cluster Approach Fails:
- Serverless clusters **do not support custom Maven libraries** via the UI, REST API, or init scripts.
- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful[1].

---

Recommended Workflow for Minimal Latency:
1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays.
2. For large datasets, combine Spark and Pandas:
```python
# Export partitioned data to CSV, then merge into Excel
spark_df.repartition(1).write.csv("/dbfs/path/partition")
pandas.read_csv("/dbfs/path/partition/...").to_excel("output.xlsx")
```

Hope this helps, Big Roux

Databricks Community

Install maven package to serverless cluster

🌟 Community Pulse: Your Weekly Roundup! June 22 – 28, 2026

Solution Accelerator Series | Product Quality Inspection

Upcoming Community BrickTalk: Bringing (Geo)Spatial Awareness to your Conversational Agents

Databricks Community Champion - June 2026 - Amira Bedhiafi

Build apps without jumping through hoops