I have a daily scheduled job that occasionally fails with the error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached." After I get the notification that the scheduled run failed, I manually run the job and it completes successfully. This article states that the failure may be caused by using collect or by a shared cluster (although this is the only job that runs on the shared cluster at the scheduled time). Below is a portion of the code in the cell that throws the driver error.
for forecast_id in weekly_id_list:
    LOGGER.info(f"Get forecasts from {forecast_id}")
    # Get forecast_date of the forecast_id
    query1 = (f"SELECT F.forecast_date FROM cosmosCatalog.{cosmos_database_name}.Forecasts F WHERE F.id = '{forecast_id}';")
    forecast_date = spark.sql(query1).collect()[0][0]
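Since the article points at collect, one thing I have considered is batching the lookups so collect() runs once for the whole list instead of once per forecast_id. A rough sketch of what I mean (it reuses weekly_id_list and cosmos_database_name from above; batch_query and forecast_dates are just illustrative names):

# Sketch only: fetch all forecast_dates in one query so collect() is called
# once for the whole batch instead of once per forecast_id.
id_list_sql = ", ".join(f"'{fid}'" for fid in weekly_id_list)
batch_query = (
    f"SELECT F.id, F.forecast_date "
    f"FROM cosmosCatalog.{cosmos_database_name}.Forecasts F "
    f"WHERE F.id IN ({id_list_sql})"
)
forecast_dates = {row["id"]: row["forecast_date"] for row in spark.sql(batch_query).collect()}

I don't know whether that alone would stop the driver restarts, which is why my main question is about the retry behaviour below.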
When the scheduled run fails due to the spark driver issue, the job automatically retries, and the retry throws a new error: Library installation failed for library due to user error for pypi {package: "pandas"}. I think Databricks already has pandas installed on the cluster, so I shouldn't need to install it again, but my assumption is that the retry would throw the same error for other packages, because it stems from the driver having just been restarted or terminated in the original run.
So I was wondering if there is a way to define a gap (sleep time) between the original run and the retry run, so that the retry doesn't throw a library installation error caused by the driver having just been restarted or terminated.
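To make the question concrete, this is the kind of setting I have in mind: reading the job's current settings through the Jobs API and setting a minimum retry interval on the task so the retry waits before starting. The host, token, and job ID below are placeholders for my workspace, and I'm assuming min_retry_interval_millis also applies to retries triggered by this driver failure:

import requests

# Sketch only: bump the retry gap on the existing job by reading its current
# settings and writing them back with min_retry_interval_millis set.
# DATABRICKS_HOST, TOKEN and JOB_ID are placeholders for my workspace values.
DATABRICKS_HOST = "https://<my-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 12345

headers = {"Authorization": f"Bearer {TOKEN}"}

# Read the job's current settings.
job = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/get",
    headers=headers,
    params={"job_id": JOB_ID},
).json()

settings = job["settings"]
for task in settings.get("tasks", []):
    task["max_retries"] = 1
    task["min_retry_interval_millis"] = 10 * 60 * 1000  # wait 10 minutes before retrying

# Write the full settings back (jobs/reset overwrites all settings).
requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/reset",
    headers=headers,
    json={"job_id": JOB_ID, "new_settings": settings},
).raise_for_status()

Is something along these lines the right way to add that gap, or is there a better option for avoiding the library installation error on the retry?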