<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Install maven package to serverless cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</link>
    <description>&lt;P&gt;I have the exact same question and have not found any way to do it&lt;/P&gt;</description>
    <pubDate>Fri, 07 Mar 2025 10:06:38 GMT</pubDate>
    <dc:creator>VincentS</dc:creator>
    <dc:date>2025-03-07T10:06:38Z</dc:date>
    <item>
      <title>Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/83468#M36937</link>
      <description>&lt;P&gt;My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.&lt;/P&gt;&lt;P&gt;Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;headers = {
    'Authorization': f'Bearer {TOKEN}',
}

data = {
  "cluster_id": CLUSTER_ID,
  "libraries": [
    {
      "maven": {
        "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
      }
    }
  ]
}


response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But when I try to save the file in Excel format, it returns an error&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2024 14:54:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/83468#M36937</guid>
      <dc:creator>Livingstone</dc:creator>
      <dc:date>2024-08-19T14:54:14Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/97106#M39427</link>
      <description>&lt;P&gt;I have a similar issue: how do I install a Maven package in a notebook when running on a serverless cluster?&lt;/P&gt;&lt;P&gt;I need to install &lt;SPAN&gt;com.crealytics:spark-excel_2.12:3.4.2_0.20.3&lt;/SPAN&gt; in the notebook the same way PyPI libraries are installed, e.g. %pip install package_name.&lt;/P&gt;&lt;P&gt;I don't want to use the environment sidebar and its dependencies tab. First, adding the Maven package to the dependencies did not work (I am guessing because it's not a PyPI library). Second, I will be running the notebook in a workflow via Git, so even if applying the library via the dependencies tab worked, the workflow would not know about it when running the notebook from Git.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 19:30:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/97106#M39427</guid>
      <dc:creator>Nurota</dc:creator>
      <dc:date>2024-10-31T19:30:46Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</link>
      <description>&lt;P&gt;I have the exact same question and have not found any way to do it&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 10:06:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</guid>
      <dc:creator>VincentS</dc:creator>
      <dc:date>2025-03-07T10:06:38Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/112532#M44243</link>
      <description>&lt;P&gt;I also have this question and wondered what the options were / are&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 02:47:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/112532#M44243</guid>
      <dc:creator>GalenSwint</dc:creator>
      <dc:date>2025-03-14T02:47:32Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117186#M45448</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116594"&gt;@Livingstone&lt;/a&gt;, thanks for this question.&lt;/P&gt;&lt;P&gt;Could you please share how you got the cluster ID of the serverless compute?&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 16:59:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117186#M45448</guid>
      <dc:creator>QuanSun</dc:creator>
      <dc:date>2025-04-30T16:59:11Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117190#M45452</link>
      <description>&lt;DIV class="paragraph"&gt;
&lt;P&gt;As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.&lt;/P&gt;
&lt;P&gt;However, there are alternative approaches to export data to Excel with minimal latency.&lt;/P&gt;
&lt;P&gt;Solutions to Export Excel Files Without Maven on Serverless Clusters:&lt;/P&gt;
&lt;P&gt;1. Use Pandas with XlsxWriter:&lt;BR /&gt;Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly. &lt;BR /&gt;```python&lt;BR /&gt;# Convert to Pandas DataFrame and save as Excel&lt;BR /&gt;pandas_df = spark_df.toPandas()&lt;BR /&gt;pandas_df.to_excel("/dbfs/path/output.xlsx", index=False, engine="xlsxwriter")&lt;BR /&gt;```&lt;BR /&gt;- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook. &lt;BR /&gt;- Limitations: This approach only works for smaller datasets that fit in driver memory (Pandas is memory-intensive).&lt;/P&gt;
&lt;P&gt;2. Switch to a Standard or Single-Node Cluster:&lt;BR /&gt;- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API. &lt;BR /&gt;- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support.&lt;/P&gt;
&lt;P&gt;3. &lt;STRONG&gt;Use CSV as an Intermediate Format:&lt;/STRONG&gt;&lt;BR /&gt;Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely. &lt;BR /&gt;```python&lt;BR /&gt;spark_df.write.csv("/dbfs/path/output.csv", header=True)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Why the Serverless Cluster Approach Fails:&lt;/STRONG&gt;&lt;BR /&gt;- Serverless clusters &lt;STRONG&gt;do not support custom Maven libraries&lt;/STRONG&gt; via the UI, REST API, or init scripts. &lt;BR /&gt;- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful.&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Recommended Workflow for Minimal Latency&lt;/STRONG&gt;:&lt;BR /&gt;1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays. &lt;BR /&gt;2. For large datasets, combine Spark and Pandas: &lt;BR /&gt;```python&lt;BR /&gt;# Export the data to a single CSV partition, then convert it to Excel&lt;BR /&gt;spark_df.repartition(1).write.csv("/dbfs/path/partition", header=True)&lt;BR /&gt;pandas.read_csv("/dbfs/path/partition/...").to_excel("output.xlsx", index=False)&lt;BR /&gt;```&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Big Roux&lt;/DIV&gt;</description>
      <pubDate>Wed, 30 Apr 2025 18:15:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117190#M45452</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-30T18:15:53Z</dc:date>
    </item>
  </channel>
</rss>

