I have a daily scheduled job that occasionally fails with the error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached." After I get the notification that the scheduled run failed, I manually run the job and it completes successfully. This article states that the failure may be caused by using collect or by a shared cluster (although this is the only job that runs on the shared cluster at the scheduled time). Below is a portion of the code in the cell that throws the driver error.
for forecast_id in weekly_id_list:
    LOGGER.info(f"Get forecasts from {forecast_id}")
    # Get forecast_date of the forecast_id
    query1 = (f"SELECT F.forecast_date FROM cosmosCatalog.{cosmos_database_name}.Forecasts F WHERE F.id = '{forecast_id}';")
    forecast_date = spark.sql(query1).collect()[0][0]
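Since the article points at collect, one thing I have considered is batching the lookups so collect() runs once for the whole list instead of once per forecast_id. A rough sketch of what I mean (it reuses weekly_id_list and cosmos_database_name from above; batch_query and forecast_dates are just illustrative names):

# Sketch only: fetch all forecast_dates in one query so collect() is called
# once for the whole batch instead of once per forecast_id.
id_list_sql = ", ".join(f"'{fid}'" for fid in weekly_id_list)
batch_query = (
    f"SELECT F.id, F.forecast_date "
    f"FROM cosmosCatalog.{cosmos_database_name}.Forecasts F "
    f"WHERE F.id IN ({id_list_sql})"
)
forecast_dates = {row["id"]: row["forecast_date"] for row in spark.sql(batch_query).collect()}

I don't know whether that alone would stop the driver restarts, which is why my main question is about the retry behaviour below.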
When the scheduled run fails due to the spark driver issue, the job automatically retries, and the retry throws a new error: Library installation failed for library due to user error for pypi {package: "pandas"}. I think Databricks already has pandas installed on the cluster, so I shouldn't need to install it again, but my assumption is that the retry would throw the same error for other packages, because it stems from the driver having just been restarted or terminated in the original run.
So I was wondering if there is a way to define a gap (sleep time) between the original run and the retry run, so that the retry doesn't throw a library installation error caused by the driver having just been restarted or terminated.
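To make the question concrete, this is the kind of setting I have in mind: reading the job's current settings through the Jobs API and setting a minimum retry interval on the task so the retry waits before starting. The host, token, and job ID below are placeholders for my workspace, and I'm assuming min_retry_interval_millis also applies to retries triggered by this driver failure:

import requests

# Sketch only: bump the retry gap on the existing job by reading its current
# settings and writing them back with min_retry_interval_millis set.
# DATABRICKS_HOST, TOKEN and JOB_ID are placeholders for my workspace values.
DATABRICKS_HOST = "https://<my-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 12345

headers = {"Authorization": f"Bearer {TOKEN}"}

# Read the job's current settings.
job = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/get",
    headers=headers,
    params={"job_id": JOB_ID},
).json()

settings = job["settings"]
for task in settings.get("tasks", []):
    task["max_retries"] = 1
    task["min_retry_interval_millis"] = 10 * 60 * 1000  # wait 10 minutes before retrying

# Write the full settings back (jobs/reset overwrites all settings).
requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/reset",
    headers=headers,
    json={"job_id": JOB_ID, "new_settings": settings},
).raise_for_status()

Is something along these lines the right way to add that gap, or is there a better option for avoiding the library installation error on the retry?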