Re: Install maven package to serverless cluster

Louis_Frolio · 04-30-2025

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.

However, there are alternative approaches to export data to Excel with minimal latency.

Solutions to Export Excel Files Without Maven on Serverless Clusters:

1. Use Pandas with XlsxWriter:
Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly.
```python
# Convert to Pandas DataFrame and save as Excel
pandas_df = spark_df.toPandas()
pandas_df.to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```
- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook.
- Limitations: This approach works only for smaller datasets, because `toPandas()` collects every row onto the driver and Pandas is memory-intensive.
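
Since the conversion pulls the whole table onto the driver, it can help to check the table's shape first. The following is a minimal sketch rather than part of the original answer: `fits_in_excel_sheet` is a hypothetical helper, and the limits it checks are Excel's per-worksheet maximums for `.xlsx` files (1,048,576 rows and 16,384 columns). It does not estimate memory use, and it does not count the index column that `to_excel` writes by default.

```python
# Hypothetical helper (not from the original post): checks whether a table
# fits on a single .xlsx worksheet before calling toPandas().
EXCEL_MAX_ROWS = 1_048_576  # per-worksheet row limit in .xlsx
EXCEL_MAX_COLS = 16_384     # per-worksheet column limit in .xlsx


def fits_in_excel_sheet(n_rows: int, n_cols: int, header: bool = True) -> bool:
    """Return True if n_rows x n_cols (plus an optional header row) fits on one sheet."""
    total_rows = n_rows + (1 if header else 0)
    return total_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS


# Usage in a notebook (assumes an existing `spark_df`):
# if fits_in_excel_sheet(spark_df.count(), len(spark_df.columns)):
#     spark_df.toPandas().to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```

Note that `spark_df.count()` triggers a Spark job of its own, so the check is not free on large tables.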

2. Switch to a Standard or Single-Node Cluster:
- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API.
- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support.
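
For the API route on a non-serverless cluster, the request body might look like the sketch below. It assumes the Databricks Libraries API endpoint `POST /api/2.0/libraries/install`; `build_maven_install_payload` is a hypothetical helper, and the cluster ID and library version are placeholders that the original post does not specify.

```python
import json

# Sketch only: builds the JSON body for the Databricks Libraries API
# (POST /api/2.0/libraries/install). The cluster ID and the library version
# below are placeholders, not values from the original post.


def build_maven_install_payload(cluster_id: str, coordinates: str) -> dict:
    """Return the request body that installs one Maven library on a cluster."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }


payload = build_maven_install_payload(
    cluster_id="<cluster-id>",
    coordinates="com.crealytics:spark-excel_2.12:<version>",
)
print(json.dumps(payload, indent=2))
```

The payload still has to be sent with an authenticated request to your workspace URL; that part is omitted here.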

3. Use CSV as an Intermediate Format:
Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely.
```python
spark_df.write.csv("/dbfs/path/output.csv", header=True)
```
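
One caveat: Spark treats the output path as a directory and writes one or more `part-*.csv` files into it, alongside marker files such as `_SUCCESS`. A minimal sketch for locating the actual CSV files follows; `find_csv_parts` is a hypothetical helper, and the directory you pass in is assumed to be readable through the local file system.

```python
import glob
import os
from typing import List


def find_csv_parts(output_dir: str) -> List[str]:
    """Return the CSV part files Spark wrote into output_dir, sorted by name."""
    # Spark names data files part-<number>-...; marker files such as _SUCCESS
    # do not match this pattern and are skipped.
    return sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
```

With a single partition there is exactly one part file, which can be downloaded and opened in Excel.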

Why the Serverless Cluster Approach Fails:
- Serverless clusters **do not support custom Maven libraries** via the UI, REST API, or init scripts.
- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful[1].

---

Recommended Workflow for Minimal Latency:
1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays.
2. For large datasets, combine Spark and Pandas:
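```python
import pandas as pd

# Write one CSV part file with Spark (dbfs:/ scheme), then convert it with Pandas,
# which reads the same location through the /dbfs mount
spark_df.repartition(1).write.csv("dbfs:/path/partition", header=True)
pd.read_csv("/dbfs/path/partition/...").to_excel("output.xlsx")
```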
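
When the data is too large to coalesce into one partition comfortably, the part files can instead be concatenated on the driver. This is a minimal sketch using only the standard library, assuming each part file was written with a header row (`header=True`); `merge_csv_parts` is a hypothetical helper, not part of Spark or Pandas.

```python
import csv
from typing import Iterable


def merge_csv_parts(part_paths: Iterable[str], merged_path: str) -> int:
    """Concatenate CSV part files that each start with the same header row.

    Writes the header once and returns the number of data rows written.
    """
    rows_written = 0
    header_written = False
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="") as part:
                reader = csv.reader(part)
                header = next(reader, None)
                if header is None:
                    continue  # empty part file, nothing to copy
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

The merged file can then be opened in Excel directly or passed to `read_csv` and `to_excel`, subject to the same memory limits as the Pandas approach.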

Hope this helps, Big Roux