Install Maven package on a serverless cluster

Livingstone
New Contributor II

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.

Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, Serverless clusters do not allow the installation of additional libraries as regular clusters do. Therefore, I attempted to install it using the REST API.

 

import requests

headers = {
    'Authorization': f'Bearer {TOKEN}',
}

data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ]
}

response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)

 

But when I try to save the file in Excel format, it returns an error:

 

[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02

 

 

How can this issue be resolved? Are there any other ways to export an Excel file ASAP without waiting for the cluster to start up?

 

5 REPLIES

Nurota
New Contributor II

I have a similar issue: how do you install a Maven package in the notebook when running on a serverless cluster?

I need to install com.crealytics:spark-excel_2.12:3.4.2_0.20.3 in the notebook the same way PyPI libraries are installed, e.g. %pip install package_name.

I don't want to use the Environment sidebar and its dependencies. First of all, adding the Maven package to the dependencies did not work (I am guessing because it is not a PyPI library). Secondly, I will be running the notebook in a workflow via Git, and even if adding the library via the dependencies tab worked, that setting would not be picked up when the notebook runs from Git, so it would not work.

VincentS
New Contributor II

I have the exact same question and have not found any way to do it.

GalenSwint
New Contributor II

I also have this question and am wondering what the options are.

 

QuanSun
New Contributor II

Hi @Livingstone, thanks for this question.

Could you please share how you got the cluster ID of the serverless compute?

BigRoux
Databricks Employee

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities.

However, there are alternative approaches to export data to Excel with minimal latency.

Solutions to Export Excel Files Without Maven on Serverless Clusters:

1. Use Pandas with XLSXWriter:
Convert the Spark DataFrame to a Pandas DataFrame and export to Excel directly.
```python
# Convert to Pandas DataFrame and save as Excel
pandas_df = spark_df.toPandas()
pandas_df.to_excel("/dbfs/path/output.xlsx", engine="xlsxwriter")
```
- Requirements: Install `xlsxwriter` via `%pip install xlsxwriter` in the notebook.
- Limitations: This approach works only for datasets that fit in the driver's memory, since converting to Pandas is memory-intensive.

2. Switch to a Standard or Single-Node Cluster:
- Standard Cluster: Create a non-serverless cluster and install the Maven package `com.crealytics:spark-excel_2.12` via the UI or API.
- Single-Node Cluster: Use a driver-only cluster (set workers to 0) to reduce startup time while retaining Maven support; a sketch of both steps via the REST API follows below.
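
To make option 2 concrete, here is a minimal sketch of both steps against the Clusters and Libraries REST APIs, reusing `HOST` and `TOKEN` from the post above. The cluster name, runtime version, and node type are placeholders to adjust for your workspace, and the Maven coordinates (the 2.12 build mentioned earlier in this thread) must match the cluster's Scala/Spark version:
```python
import requests

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Create a driver-only (single-node) cluster; runtime and node type are examples.
cluster_spec = {
    "cluster_name": "excel-export",           # hypothetical name
    "spark_version": "15.4.x-scala2.12",      # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",              # pick a node type available in your cloud
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "autotermination_minutes": 30,
}
cluster_id = requests.post(
    f"{HOST}/api/2.1/clusters/create", headers=headers, json=cluster_spec
).json()["cluster_id"]

# 2. Install the spark-excel Maven package on that cluster.
requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers=headers,
    json={
        "cluster_id": cluster_id,
        "libraries": [
            {"maven": {"coordinates": "com.crealytics:spark-excel_2.12:3.4.2_0.20.3"}}
        ],
    },
)
```
Once the cluster is running with the library attached, the `com.crealytics.spark.excel` data source from the error message should resolve when the export job runs on that cluster.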

3. Use CSV as an Intermediate Format:
Export the DataFrame as CSV and let Excel open it directly. This avoids dependencies entirely.
```python
spark_df.write.csv("/dbfs/path/output.csv", header=True)
```

Why the Serverless Cluster Approach Fails:
- Serverless clusters **do not support custom Maven libraries** via the UI, REST API, or init scripts.
- The error `DATA_SOURCE_NOT_FOUND` confirms the `spark-excel` package is not recognized, even if the REST API call appears successful.

---

Recommended Workflow for Minimal Latency:
1. Use a pre-started single-node cluster (configured with the `spark-excel` library) to avoid cold-start delays.
2. For large datasets, combine Spark and Pandas:
```python
import glob
import pandas as pd

# Export the data as a single CSV partition, then convert that file to Excel
spark_df.repartition(1).write.csv("/dbfs/path/partition", header=True)
part_file = glob.glob("/dbfs/path/partition/part-*.csv")[0]  # Spark names its output files part-*
pd.read_csv(part_file).to_excel("output.xlsx", index=False)
```

 
Hope this helps, Big Roux
