Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Skip Task Without Spinning Up Cluster

Dave_Nithio
Contributor

I have a Job Workflow with multiple sequential tasks executing R or Python scripts. Currently, we can skip one of these tasks (if it has already been run) by passing a parameter and skipping within the script. However, this still requires a full spin-up of compute resources just to skip the task, which generally takes around 8-10 minutes. Across multiple tasks, this wastes a lot of time and spins up task compute that is not needed. Is there a better method for skipping a task in a refresh?

2 REPLIES

Anonymous
Not applicable

@Dave Wilson:

Yes, there are a few ways to restructure your job workflow so that skipping an already-completed task does not require spinning up compute:

  1. Implement a check for completed tasks: Instead of passing a parameter to the script, have the workflow itself check which tasks are done. For example, you can record completed tasks in a database or on a file system, and the workflow can query that record to decide which tasks still need to run. Because the decision is made before the task starts, no compute is spun up for a task that will be skipped.
  2. Use caching: Depending on the nature of your tasks, you may be able to cache results rather than re-execute tasks that have already run. For example, if a task processes data, you can store the processed data and reuse it in subsequent executions. This reduces both compute spin-up and overall execution time.
  3. Implement conditional execution: Some workflow engines support conditional execution, which lets you skip a task based on a condition. For example, you can specify that a task runs only if a certain file exists or a certain condition is met, so a skipped task never starts its compute.
  4. Use a task queue: A task queue lets you queue up tasks and execute them as resources become available. This avoids spinning up compute that is not needed and helps tasks run as efficiently as possible.

Implementing one or more of these approaches should reduce the compute spin-up and time currently wasted on skipping tasks that have already been run.
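As a rough illustration of the first and third suggestions, here is a minimal Python sketch of tracking completed tasks with marker files and deciding which tasks still need to run before any task compute is started. Everything in it is an assumption made for illustration: the function names, the `.done` marker layout, the task names, and the use of a date as the refresh key are hypothetical, and it uses a local temporary directory in place of whatever shared storage your workflow can read without starting a cluster.

```python
import json
import tempfile
from pathlib import Path


def marker_path(state_dir: Path, task_name: str, run_key: str) -> Path:
    """Location of the marker recording that task_name finished for run_key."""
    return state_dir / run_key / f"{task_name}.done"


def mark_completed(state_dir: Path, task_name: str, run_key: str) -> None:
    """Record a completed task; intended to be called as the task's last step."""
    path = marker_path(state_dir, task_name, run_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"task": task_name, "run_key": run_key}))


def tasks_to_run(state_dir: Path, task_names: list[str], run_key: str) -> list[str]:
    """Return, in their original order, the tasks with no completion marker yet."""
    return [t for t in task_names if not marker_path(state_dir, t, run_key).exists()]


if __name__ == "__main__":
    # Hypothetical pipeline and refresh key, using a throwaway directory as the state store.
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp)
        pipeline = ["extract", "transform", "report"]
        mark_completed(state, "extract", "2024-01-01")
        print(tasks_to_run(state, pipeline, "2024-01-01"))  # ['transform', 'report']
```

The point of the design is that `tasks_to_run` only reads small marker files, so it can run somewhere cheap (an orchestrator, a lightweight first step, or a condition check if your workflow engine supports one), and only the tasks it returns need compute at all.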

Anonymous
Not applicable

Hi @Dave Wilson

Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
