<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Install maven package to serverless cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</link>
    <description>&lt;P&gt;I have the exact same question and have not found any way to do it&lt;/P&gt;</description>
    <pubDate>Fri, 07 Mar 2025 10:06:38 GMT</pubDate>
    <dc:creator>VincentS</dc:creator>
    <dc:date>2025-03-07T10:06:38Z</dc:date>
    <item>
      <title>Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/83468#M36937</link>
      <description>&lt;P&gt;My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.&lt;/P&gt;&lt;P&gt;Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;headers = {
    'Authorization': f'Bearer {TOKEN}',
}

data = {
  "cluster_id": CLUSTER_ID,
  "libraries": [
    {
      "maven": {
        "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
      }
    }
  ]
}


response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But when I try to save the file in Excel format, it returns an error&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P&gt;How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2024 14:54:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/83468#M36937</guid>
      <dc:creator>Livingstone</dc:creator>
      <dc:date>2024-08-19T14:54:14Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/97106#M39427</link>
      <description>&lt;P&gt;I have a similar issue: how do I install a Maven package in a notebook when running on a serverless cluster?&lt;/P&gt;&lt;P&gt;I need to install &lt;SPAN&gt;com.crealytics:spark-excel_2.12:3.4.2_0.20.3&lt;/SPAN&gt; in the notebook the same way PyPI libraries are installed, e.g. %pip install package_name.&lt;/P&gt;&lt;P&gt;I don't want to use the environment sidebar and its dependencies tab. First, adding the Maven package to the dependencies did not work (I am guessing because it's not a PyPI library). Second, I will be running the notebook in a workflow via Git, so even if applying the library via the dependencies tab worked, the workflow would not know about it when running the notebook from Git.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 19:30:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/97106#M39427</guid>
      <dc:creator>Nurota</dc:creator>
      <dc:date>2024-10-31T19:30:46Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</link>
      <description>&lt;P&gt;I have the exact same question and have not found any way to do it&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 10:06:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/111987#M44063</guid>
      <dc:creator>VincentS</dc:creator>
      <dc:date>2025-03-07T10:06:38Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/112532#M44243</link>
      <description>&lt;P&gt;I also have this question and wondered what the options were / are&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 02:47:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/112532#M44243</guid>
      <dc:creator>GalenSwint</dc:creator>
      <dc:date>2025-03-14T02:47:32Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117186#M45448</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116594"&gt;@Livingstone&lt;/a&gt;, thanks for this question.&lt;/P&gt;&lt;P&gt;Could you please share how you got the cluster ID of the serverless compute?&lt;/P&gt;</description>
      <pubDate>Wed, 30 Apr 2025 16:59:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117186#M45448</guid>
      <dc:creator>QuanSun</dc:creator>
      <dc:date>2025-04-30T16:59:11Z</dc:date>
    </item>
    <item>
      <title>Re: Install maven package to serverless cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117190#M45452</link>
      <description>&lt;DIV class="paragraph"&gt;
&lt;P&gt;As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.&lt;/P&gt;
&lt;P&gt;However, there are alternative approaches to export data to Excel with minimal latency.&lt;/P&gt;
&lt;P&gt;Solutions to Export Excel Files Without Maven on Serverless Clusters:&lt;/P&gt;
&lt;P&gt;1. Use Pandas with XlsxWriter:&lt;BR /&gt;Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly. &lt;BR /&gt;```python&lt;BR /&gt;# Convert to Pandas DataFrame and save as Excel&lt;BR /&gt;pandas_df = spark_df.toPandas()&lt;BR /&gt;pandas_df.to_excel("/dbfs/path/output.xlsx", index=False, engine="xlsxwriter")&lt;BR /&gt;```&lt;BR /&gt;- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook. &lt;BR /&gt;- Limitations: This approach only works for smaller datasets that fit in driver memory (Pandas is memory-intensive).&lt;/P&gt;
&lt;P&gt;2. Switch to a Standard or Single-Node Cluster:&lt;BR /&gt;- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API. &lt;BR /&gt;- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support.&lt;/P&gt;
&lt;P&gt;3. &lt;STRONG&gt;Use CSV as an Intermediate Format:&lt;/STRONG&gt;&lt;BR /&gt;Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely. &lt;BR /&gt;```python&lt;BR /&gt;spark_df.write.csv("/dbfs/path/output.csv", header=True)&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Why the Serverless Cluster Approach Fails:&lt;/STRONG&gt;&lt;BR /&gt;- Serverless clusters &lt;STRONG&gt;do not support custom Maven libraries&lt;/STRONG&gt; via the UI, REST API, or init scripts. &lt;BR /&gt;- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful.&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Recommended Workflow for Minimal Latency&lt;/STRONG&gt;:&lt;BR /&gt;1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays. &lt;BR /&gt;2. For large datasets, combine Spark and Pandas: &lt;BR /&gt;```python&lt;BR /&gt;# Export the data to a single CSV partition, then convert it to Excel&lt;BR /&gt;spark_df.repartition(1).write.csv("/dbfs/path/partition", header=True)&lt;BR /&gt;pandas.read_csv("/dbfs/path/partition/...").to_excel("output.xlsx", index=False)&lt;BR /&gt;```&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Hope this helps, Big Roux&lt;/DIV&gt;</description>
      <pubDate>Wed, 30 Apr 2025 18:15:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/m-p/117190#M45452</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-30T18:15:53Z</dc:date>
    </item>
  </channel>
</rss>

