Data Engineering

How to delay a new job run after a job failure

oleole
Contributor

I have a daily job that occasionally fails with the error: The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached. After I get the notification that the scheduled run failed, I manually run the job and it succeeds. This article states that it may fail due to using collect() or running on a shared cluster (although this is the only job that runs on the shared cluster at the scheduled time). Below is a portion of the code inside the cell that throws the driver error.

for forecast_id in weekly_id_list:
  LOGGER.info(f"Get forecasts from {forecast_id}")
  # Get forecast_date of the forecast_id
  query1 = (f"SELECT F.forecast_date FROM cosmosCatalog.{cosmos_database_name}.Forecasts F WHERE F.id = '{forecast_id}';")
  forecast_date = spark.sql(query1).collect()[0][0]

When it fails due to the Spark driver issue, it automatically does a retry, which throws a new error: Library installation failed for library due to user error for pypi {package: "pandas"}. I believe Databricks already has pandas installed on the cluster, so I shouldn't need to install it again; my assumption is that the retry would throw the same error for other packages too, because it stems from the driver having just been restarted or terminated in the original run.

So I was wondering if there is a way to define a gap (sleep time) between the original run and the retry run, so that the retry doesn't hit a library installation error caused by the driver being recently restarted or terminated.


Accepted Solution

oleole
Contributor

According to this documentation, you can specify the wait time between the start of the failed run and the start of the retry run.
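For reference, in the Jobs API 2.1 this corresponds to the task-level min_retry_interval_millis setting, measured from the start of the failed run to the start of the retry. A minimal sketch using plain requests; the workspace URL, token, job id, and task key below are placeholders, not values from this thread:

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# Ask for one retry, with at least a 10-minute gap between the start of
# the failed run and the start of the retry run.
# Note: fields inside new_settings replace the corresponding top-level
# job fields, so the tasks list here should contain the job's full task list.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": 123,  # placeholder
        "new_settings": {
            "tasks": [
                {
                    "task_key": "daily_forecast",  # placeholder
                    "max_retries": 1,
                    "min_retry_interval_millis": 600_000,  # 10 minutes
                }
            ]
        },
    },
)
resp.raise_for_status()

The same retry count and minimum interval can typically also be set from the task's Retries option in the Jobs UI.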


Replies

Anonymous
Not applicable

@Jay Yang:

Yes, you can add a sleep time before retrying, to avoid errors caused by the driver being recently restarted or terminated. Here's an example of how you can modify your code to add a sleep-and-retry:

import time

for forecast_id in weekly_id_list:
  LOGGER.info(f"Get forecasts from {forecast_id}")
  # Get forecast_date of the forecast_id
  query1 = (f"SELECT F.forecast_date FROM cosmosCatalog.{cosmos_database_name}.Forecasts F WHERE F.id = '{forecast_id}';")

  try:
    forecast_date = spark.sql(query1).collect()[0][0]
  except Exception:
    # If the query fails (e.g. the driver was just restarted),
    # sleep for 5 seconds and try once more
    time.sleep(5)
    forecast_date = spark.sql(query1).collect()[0][0]

In this example, the try block attempts to collect the forecast date as usual. If it encounters an error due to the driver being recently restarted or terminated, it will sleep for 5 seconds before attempting to collect the forecast date again. You can adjust the sleep time as needed.
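If one retry isn't enough, a bounded retry loop with a growing delay generalizes this. A minimal sketch, assuming the notebook's spark session is in scope; the helper name collect_with_retry and the retry counts are illustrative, not from the original reply:

import time

def collect_with_retry(sql_text, attempts=3, base_delay=5):
  # Run a Spark SQL query and return the first value of the first row,
  # retrying with a linearly growing delay on failure.
  for attempt in range(1, attempts + 1):
    try:
      return spark.sql(sql_text).collect()[0][0]
    except Exception:
      if attempt == attempts:
        raise  # out of retries: surface the original error
      time.sleep(base_delay * attempt)  # 5s, then 10s, ...

forecast_date = collect_with_retry(query1)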

Anonymous
Not applicable

Hi @Jay Yang,

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? If not, please let us know if you need more help.

We'd love to hear from you.

Thanks!

