Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Job scheduling - continuous mode

smurug
New Contributor II

When scheduling a Databricks job in continuous mode, what happens if the job is configured to run on a job cluster?

At the end of each run, will the cluster be terminated and re-created for the next run? The official documentation is not clear; it only mentions that there will be a slight delay of less than 60 seconds between runs.

However, a quick practical check points in the direction of the cluster being re-created: a simple do-nothing notebook takes around 2 minutes to complete, and from the logs it looks like different clusters are used. Not conclusive, though.

I'd appreciate any thoughts on this. Logically, the continuous option should re-use the cluster (to save on start-up time); otherwise the value this option brings is limited.

4 REPLIES

Tharun-Kumar
Databricks Employee

@smurug 

A job cluster is designed to be unique to each run of a job, so each run of your job will run against a new job cluster.

If you want your job to run continuously without any delay and to re-use the cluster, I would recommend using a dedicated interactive cluster. In that case, the cluster is retained across job runs, and each run starts immediately after the previous run completes.
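To make that re-use explicit, the continuous job's task can point at an existing interactive cluster instead of defining a job cluster. Below is a minimal sketch against the Jobs API 2.1, assuming the API's continuous and existing_cluster_id fields; the job name, cluster ID, and notebook path are placeholders, not values from this thread.

# Sketch: create a continuous job that runs on an existing interactive cluster.
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

job_settings = {
    "name": "continuous-demo",                          # hypothetical job name
    "continuous": {"pause_status": "UNPAUSED"},         # run in continuous mode
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/do_nothing"},  # placeholder
            "existing_cluster_id": "0123-456789-abcdefgh",  # placeholder interactive cluster ID
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_settings,
)
resp.raise_for_status()
print(resp.json())  # expect a response containing the new job_id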

smurug
New Contributor II

Thanks for the response. Yes, we are doing this currently (using an interactive cluster); however, the following points are being considered in re-evaluating this approach and arriving at a possible alternative:

1) The cost difference between interactive and job clusters.

2) In the production environment, the following error is received every now and then:

run failed with error message Context ExecutionContextId(1496834584910869936) is disconnected.

While this error can occur for multiple reasons, cluster resource constraints are understood to be one of the main causes. Hence the idea is to give different jobs their own job clusters, which can be scaled independently, so that each job gets dedicated resources rather than sharing one interactive cluster across all jobs (a sketch of such a per-job cluster definition follows this list). It may not be feasible to create many interactive clusters given the cost, so using job clusters could offset some of that cost and help reduce the overall spend.
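As a rough illustration of point 2, each job could carry its own job cluster definition with an independent autoscale range. The sketch below shows such a payload for the Jobs API 2.1; the runtime version, node type, worker counts, and notebook path are placeholders, not recommendations.

# Sketch: a job with its own dedicated, independently autoscaling job cluster.
per_job_cluster_settings = {
    "name": "job-with-dedicated-cluster",
    "job_clusters": [
        {
            "job_cluster_key": "dedicated",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",    # placeholder runtime
                "node_type_id": "i3.xlarge",            # placeholder node type
                "autoscale": {"min_workers": 1, "max_workers": 4},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "main",
            "job_cluster_key": "dedicated",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/etl"},  # placeholder
        }
    ],
}
# This dict could be posted to /api/2.1/jobs/create in the same way as the sketch under the first reply.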
 
Further, searching around the net, I found this article, https://medium.com/@24chynoweth/continuous-jobs-and-file-triggers-in-databricks-e7ba51a0c93a, which mentions resources being re-used.
 
Also, the official documentation, https://docs.databricks.com/workflows/jobs/schedule-jobs.html, does not say anything clearly about re-use versus termination, but it does mention a slight delay of less than 60 seconds. If the cluster needs to be re-created, I don't think a delay of only 60 seconds can be guaranteed.

youssefmrini
Databricks Employee

When a Databricks job is configured to run with a job cluster in continuous mode, the cluster will be kept alive between job runs and reused for subsequent runs.

The cluster will not be terminated and recreated between runs, as that would defeat the purpose of continuous mode, which is designed to reduce job startup time and increase the efficiency of cluster usage.

Instead, Databricks will keep the cluster alive and attempt to assign subsequent job runs to the same cluster to avoid the costs and delay of launching a new cluster each time. There may be slight variations in startup time between subsequent runs due to factors like node availability, but the delay should be less than 60 seconds in most cases.

In your specific case, if a simple do-nothing notebook is taking around 2 minutes to complete and it is unclear whether the same cluster is being used each time, it's possible that other factors are impacting cluster performance (e.g., cluster configuration, node availability) or resource usage (e.g., other running jobs) and contributing to the delay.

I would recommend reviewing the Databricks job logs and cluster utilization metrics to get a better understanding of the job's performance and resource usage over time. If you continue to experience issues, consider reaching out to Databricks support for further assistance.
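One way to check empirically which cluster serves each run, along the lines of the log review suggested above, is to list recent runs of the job through the Jobs API and compare the cluster IDs attached to them. A sketch, assuming Jobs API 2.1 and that each run's tasks expose a cluster_instance field with a cluster_id (the job ID is a placeholder):

# Sketch: compare cluster IDs across recent runs of a continuous job.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123456789, "limit": 10, "expand_tasks": "true"},  # placeholder job_id
)
resp.raise_for_status()

cluster_ids = set()
for run in resp.json().get("runs", []):
    for task in run.get("tasks", []):
        cluster_instance = task.get("cluster_instance", {})
        if "cluster_id" in cluster_instance:
            cluster_ids.add(cluster_instance["cluster_id"])

# One distinct ID suggests the cluster is being reused; many distinct IDs suggest re-creation.
print(cluster_ids)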

Jo5h
New Contributor II

Hello @youssefmrini 

So how are the DBUs calculated? Since the cluster is reused, will DBUs be calculated per hour across all the jobs run in that hour, or will they be calculated per run?

I would like to understand the cost calculation when running a continuous job.
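As a rough back-of-the-envelope illustration only (the rates below are made up, and the actual billing model is exactly what this question asks about): if the cluster is indeed kept alive and reused as described above, DBU consumption would track cluster uptime rather than the number of runs.

# Illustration with hypothetical numbers, assuming billing follows cluster uptime:
dbu_per_hour = 4.0          # hypothetical DBU emission rate for the cluster
dbu_price = 0.15            # hypothetical price per DBU in USD
uptime_hours = 1.0          # cluster stays up for one hour
runs_in_that_hour = 10      # number of continuous-mode runs completed in that hour

cost = dbu_per_hour * uptime_hours * dbu_price
print(cost)                 # 0.60 USD for the hour, regardless of runs_in_that_hour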
