Skip Task Without Spinning Up Cluster

Dave_Nithio
Contributor

I have a Job Workflow with multiple sequential tasks that execute R or Python scripts. Currently, we can skip one of these tasks (if it has already been run) by passing a parameter and having the script exit early. However, this still requires a full spin-up of compute resources just to skip the task, which generally takes around 8-10 minutes. Across multiple tasks, this wastes a lot of time and spins up task compute that is not needed. Is there a better method for skipping a task in a refresh?

2 REPLIES

Anonymous
Not applicable

@Dave Wilson:

Yes, there are a few ways to restructure your job workflow so that skipping an already-completed task does not require starting compute for it:

  1. Implement a check for completed tasks: Instead of passing a parameter to the script and deciding inside it, record completed tasks somewhere outside the task itself, for example in a database or on a file system. The workflow can then query that record to determine which tasks still need to run, before any compute is started for them.
  2. Use caching: Depending on the nature of your tasks, you may be able to cache their results so they are not re-executed. For example, if a task processes data, store the processed data and reuse it in subsequent executions rather than recomputing it. This reduces both compute startup and overall execution time.
  3. Implement conditional execution: Some workflow engines support conditional execution, which lets you skip a task based on a condition. For example, you can specify that a task runs only if a certain file exists or a certain condition is met, so a skipped task never starts its compute.
  4. Use a task queue: A task queue lets you queue up tasks and execute them as resources become available, so compute that is already running picks up the next task instead of new compute being started for each one.
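
As a rough illustration of approaches 1 and 3, here is a minimal Python sketch that tracks completed tasks in a small JSON file and filters the task list before anything is launched. The file name, the task names, and the helper functions (`load_completed`, `mark_completed`, `tasks_to_run`) are all hypothetical and not part of any Databricks API. The sketch assumes the check runs somewhere lightweight that is already up (for example, a small driver script), not on the task's own cluster.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_completed(state_file: Path) -> set[str]:
    """Return the task names recorded as completed (empty set if no file yet)."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def mark_completed(state_file: Path, task: str) -> None:
    """Record a task as completed; calling this twice for a task is harmless."""
    completed = load_completed(state_file)
    completed.add(task)
    state_file.write_text(json.dumps(sorted(completed)))


def tasks_to_run(all_tasks: list[str], state_file: Path) -> list[str]:
    """Return the tasks that still need to run, preserving their original order."""
    done = load_completed(state_file)
    return [task for task in all_tasks if task not in done]
```

A driver could call `tasks_to_run(["extract", "transform", "report"], Path("completed.json"))` and submit only the returned tasks, calling `mark_completed` after each one succeeds. How the remaining tasks are actually submitted depends on your workflow engine and is left out here.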

Combining one or more of these approaches should let you skip tasks that have already been run without starting compute for them, which avoids the wasted startup time.

Anonymous
Not applicable

Hi @Dave Wilson

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
