My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster.
PySpark has no built-in XLSX writer, so the Maven package spark-excel_2.12 has to be installed. Unlike regular clusters, however, Serverless clusters do not allow installing additional libraries, so I tried to install it through the REST API:
import requests

# TOKEN, HOST and CLUSTER_ID are defined elsewhere
headers = {
    'Authorization': f'Bearer {TOKEN}',
}
data = {
    "cluster_id": CLUSTER_ID,
    "libraries": [
        {
            "maven": {
                "coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"
            }
        }
    ]
}
# Request installation of the Maven library on the cluster
response = requests.post(f'{HOST}/api/2.0/libraries/install', headers=headers, json=data)
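An install request being accepted does not by itself show that the library ended up on the cluster, so it may help to query the install status afterwards. The following is a minimal sketch, assuming the `/api/2.0/libraries/cluster-status` endpoint is reachable for this cluster and returns a `library_statuses` list with a `status` field per library, which is how I understand the Libraries API. The function names `fetch_library_statuses` and `summarize_statuses` are hypothetical helpers, not part of any SDK.

```python
def fetch_library_statuses(host, token, cluster_id):
    """Ask the Libraries API for the status of every library on one cluster."""
    # Third-party dependency; imported here so summarize_statuses can be
    # used on a saved response without requests being installed.
    import requests

    response = requests.get(
        f"{host}/api/2.0/libraries/cluster-status",
        headers={"Authorization": f"Bearer {token}"},
        params={"cluster_id": cluster_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def summarize_statuses(payload):
    """Map each library (Maven coordinates where present) to its reported status."""
    summary = {}
    # Assumed response shape: {"library_statuses": [{"library": {...}, "status": "..."}]}
    for entry in payload.get("library_statuses", []):
        library = entry.get("library", {})
        name = library.get("maven", {}).get("coordinates") or str(library)
        summary[name] = entry.get("status", "UNKNOWN")
    return summary
```

For example, `summarize_statuses(fetch_library_statuses(HOST, TOKEN, CLUSTER_ID))` would show whether the spark-excel entry is reported as installed, pending, or failed.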
However, when I try to save the file in Excel format, I get the following error:
[DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.crealytics.spark.excel. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
How can this issue be resolved? Is there another way to export an Excel file quickly, without waiting for a cluster to start up?
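One alternative that avoids the Maven package entirely is to collect the result to the driver and write it with pandas. This is only a sketch under assumptions: that pandas and an Excel engine such as openpyxl are available in the environment, and that the data is small enough to fit in driver memory. `export_to_excel` is a hypothetical helper name.

```python
EXCEL_MAX_ROWS = 1_048_576  # maximum rows in a single XLSX worksheet


def export_to_excel(spark_df, path, sheet_name="Sheet1"):
    """Collect a small Spark DataFrame to the driver and write it as XLSX via pandas."""
    row_count = spark_df.count()
    # The header occupies one worksheet row, so leave room for it.
    if row_count + 1 > EXCEL_MAX_ROWS:
        raise ValueError(
            f"{row_count} rows plus a header exceed the worksheet limit of {EXCEL_MAX_ROWS}"
        )
    # toPandas() pulls all rows to the driver; only suitable for small results.
    pandas_df = spark_df.toPandas()
    # Assumes an Excel writer engine (for example openpyxl) is installed.
    pandas_df.to_excel(path, sheet_name=sheet_name, index=False)
    return row_count
```

This trades Spark's distributed writer for a single-node write, so it only makes sense when the export is small.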