08-19-2024 07:54 AM
My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.
Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.
```python
import requests

headers = {
    'Authorization': f'Bearer {TOKEN}',
}
data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ]
}
response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)
```
But when I try to save the file in Excel format, it returns an error:
```
[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
```
How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?
10-31-2024 12:30 PM
I have a similar issue: how do you install a Maven package in a notebook when running on a serverless cluster?
I need to install com.crealytics:spark-excel_2.12:3.4.2_0.20.3 in the notebook the same way PyPI libraries are installed, e.g. `%pip install package_name`.
I don't want to use the environment sidebar and its dependencies tab. First, adding the Maven package under dependencies did not work (I am guessing because it's not a PyPI library). Second, I will be running the notebook in a workflow via Git, and even if applying the library via the dependencies tab worked, it would not be picked up when running the notebook from Git.
03-07-2025 02:06 AM
I have the exact same question and have not found any way to do it
03-13-2025 07:47 PM
I also have this question and wondered what the options were / are
04-30-2025 09:59 AM
Hi @Livingstone , thanks for this question.
Could you please share how you got the cluster ID of the serverless compute?
04-30-2025 11:15 AM
As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.
However, there are alternative approaches to export data to Excel with minimal latency.
Solutions to Export Excel Files Without Maven on Serverless Clusters:
1. Use Pandas with XLSXWriter:
Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly.
```python
# Convert to Pandas DataFrame and save as Excel
pandas_df = spark_df.toPandas()
pandas_df.to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```
- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook.
- Limitations: Suitable for smaller datasets only, since `toPandas()` collects every row into driver memory.
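One more limitation worth knowing: an `.xlsx` sheet holds at most 1,048,576 rows, so larger exports must be split across sheets. Below is a minimal sketch of that chunking logic; the names `sheet_ranges` and `EXCEL_MAX_ROWS` are illustrative helpers, not a pandas or xlsxwriter API.

```python
# Sketch: split a large dataset across multiple Excel sheets.
# Excel caps each sheet at 1,048,576 rows, so large exports must be chunked.
# sheet_ranges / EXCEL_MAX_ROWS are illustrative names, not a real API.

EXCEL_MAX_ROWS = 1_048_576  # hard per-sheet limit of the .xlsx format

def sheet_ranges(n_rows, max_rows=EXCEL_MAX_ROWS - 1):  # reserve one row for the header
    """Yield (start, end) row slices, one per sheet."""
    for start in range(0, n_rows, max_rows):
        yield start, min(start + max_rows, n_rows)

# With pandas + xlsxwriter this would drive the export, e.g.:
# with pd.ExcelWriter("/dbfs/path/output.xlsx", engine="xlsxwriter") as writer:
#     for i, (start, end) in enumerate(sheet_ranges(len(pandas_df))):
#         pandas_df.iloc[start:end].to_excel(writer, sheet_name=f"part_{i}")

ranges = list(sheet_ranges(2_500_000))
print(ranges)
```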
2. Switch to a Standard or Single-Node Cluster:
- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API.
- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support.
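To make option 2 concrete, here is a hedged sketch of the two REST payloads involved: a driver-only cluster spec for the Clusters API and the install body for `/api/2.0/libraries/install`. Field names follow the Databricks REST API; the cluster name, runtime version, node type, and cluster ID below are placeholders you would replace with your own values.

```python
# Sketch: driver-only (single-node) cluster spec plus the Libraries API
# install payload for spark-excel. Values marked "placeholder" are examples.

single_node_cluster = {
    "cluster_name": "excel-export",       # placeholder name
    "spark_version": "13.3.x-scala2.12",  # pick a runtime matching the _2.12 package
    "node_type_id": "Standard_DS3_v2",    # example Azure node type
    "num_workers": 0,                     # driver-only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

def install_payload(cluster_id, coordinates):
    """Build the POST body for /api/2.0/libraries/install."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }

payload = install_payload(
    "0123-456789-abcdef",  # placeholder cluster ID
    "com.crealytics:spark-excel_2.12:3.4.2_0.20.3",
)
```

The same payload would then be POSTed with `requests` as in the original question, but against a standard cluster rather than serverless compute.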
3. Use CSV as an Intermediate Format:
Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely.
```python
# Note: Spark writes a directory of part files, not a single output.csv
spark_df.write.csv("/dbfs/path/output.csv", header=True)
```
Why the Serverless Cluster Approach Fails:
- Serverless clusters **do not support custom Maven libraries** via the UI, REST API, or init scripts.
- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful.
---
Recommended Workflow for Minimal Latency:
1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays.
2. For large datasets, combine Spark and Pandas:
```python
# Export to a single CSV partition, then convert that file to Excel
import glob
import pandas as pd

spark_df.repartition(1).write.csv("/dbfs/path/partition", header=True)
part_file = glob.glob("/dbfs/path/partition/part-*.csv")[0]  # Spark names the part file
pd.read_csv(part_file).to_excel("output.xlsx", index=False)
```
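If you keep multiple partitions instead of repartitioning to one, the merge step itself needs no third-party libraries. Below is a standard-library sketch that concatenates Spark's `part-*.csv` files into a single CSV while keeping only the first header; it demos against a temp directory rather than DBFS, and `merge_parts` is an illustrative helper, not a Spark or Databricks API.

```python
# Stdlib sketch of the merge step: combine the part-*.csv files Spark writes
# into one CSV that Excel can open. Paths here are a temp-dir demo, not DBFS.
import csv
import glob
import os
import tempfile

def merge_parts(part_dir, out_path):
    """Concatenate part CSVs, keeping the header from the first file only."""
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*.csv")))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, part in enumerate(parts):
            with open(part, newline="") as f:
                rows = list(csv.reader(f))
                writer.writerows(rows if i == 0 else rows[1:])  # skip repeated headers

# Demo with two fake part files:
tmp = tempfile.mkdtemp()
for name, rows in [("part-00000.csv", [["id", "x"], ["1", "a"]]),
                   ("part-00001.csv", [["id", "x"], ["2", "b"]])]:
    with open(os.path.join(tmp, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)

merged = os.path.join(tmp, "merged.csv")
merge_parts(tmp, merged)
print(open(merged).read().splitlines())  # header once, then both data rows
```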