Install Maven package on a serverless cluster

Livingstone
New Contributor II

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.

Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.

 

import requests

headers = {
    'Authorization': f'Bearer {TOKEN}',
}

data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ]
}

response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)

 

But when I try to save the file in Excel format, it returns an error:

 

[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02

 

 

How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?

 

5 REPLIES

Nurota
New Contributor II

I have a similar issue: how do you install a Maven package in the notebook when running on a serverless cluster?

I need to install com.crealytics:spark-excel_2.12:3.4.2_0.20.3 in the notebook the same way PyPI libraries are installed, e.g. %pip install package_name.

I don't want to use the Environment sidebar and its dependencies. First of all, adding the Maven package to the dependencies did not work (I am guessing because it is not a PyPI library). Secondly, I will be running the notebook in a workflow via Git, and even if adding the library via the dependencies tab worked, that setting would not be picked up when the notebook runs from Git, so it would not work.

VincentS
New Contributor II

I have the exact same question and have not found any way to do it.

GalenSwint
New Contributor II

I also have this question and am wondering what the options are.

 

QuanSun
New Contributor II

Hi @Livingstone, thanks for this question.

Could you please share how you got the cluster ID of the serverless compute?

BigRoux
Databricks Employee

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.

However, there are alternative approaches to export data to Excel with minimal latency.

Solutions to Export Excel Files Without Maven on Serverless Clusters:

1. Use Pandas with XLSXWriter:
Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly.
```python
# Convert to Pandas DataFrame and save as Excel
pandas_df = spark_df.toPandas()
pandas_df.to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```
- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook.
- Limitations: This approach works only for datasets that fit in the driver's memory, since converting to Pandas is memory-intensive.

2. Switch to a Standard or Single-Node Cluster:
- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API.
- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support; a sketch of both steps via the REST API follows below.
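
To make option 2 concrete, here is a minimal sketch of both steps against the Clusters and Libraries REST APIs, reusing `HOST` and `TOKEN` from the post above. The cluster name, runtime version, and node type are placeholders to adjust for your workspace, and the Maven coordinates (the 2.12 build mentioned earlier in this thread) must match the cluster's Scala/Spark version:
```python
import requests

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Create a driver-only (single-node) cluster; runtime and node type are examples.
cluster_spec = {
    "cluster_name": "excel-export",           # hypothetical name
    "spark_version": "15.4.x-scala2.12",      # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",              # pick a node type available in your cloud
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 30,
}
cluster_id = requests.post(
    f"{HOST}/api/2.1/clusters/create", headers=headers, json=cluster_spec
).json()["cluster_id"]

# 2. Install the spark-excel Maven package on that cluster.
requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers=headers,
    json={
        "cluster_id": cluster_id,
        "libraries": [
            {"maven": {"coordinates": "com.crealytics:spark-excel_2.12:3.4.2_0.20.3"}}
        ],
    },
)
```
Once the cluster is running with the library attached, the `com.crealytics.spark.excel` data source from the error message should resolve when the export job runs on that cluster.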

3. Use CSV as an Intermediate Format:
Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely.
```python
spark_df.write.csv("/dbfs/path/output.csv", header=True)
```

Why the Serverless Cluster Approach Fails:
- Serverless clusters **do not support custom Maven libraries** via the UI, REST API, or init scripts.
- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful.

---

Recommended Workflow for Minimal Latency:
1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays.
2. For large datasets, combine Spark and Pandas:
```python
import glob
import pandas as pd

# Export the data as a single CSV partition, then convert that file to Excel
spark_df.repartition(1).write.csv("/dbfs/path/partition", header=True)
part_file = glob.glob("/dbfs/path/partition/part-*.csv")[0]  # Spark names its output files part-*
pd.read_csv(part_file).to_excel("output.xlsx", index=False)
```

 
Hope this helps, Big Roux
