Data Engineering

Limiting concurrent tasks in a job

tgburrin-afs
New Contributor

I have a job with > 10 tasks in it that interacts with an external system outside of Databricks. At the moment that external system cannot handle more than 3 of the tasks executing concurrently. How can I limit the number of tasks that concurrently execute in a job? I'm not particularly worried about the order in which they execute, only that the number at any one time is limited to 3.

The cluster I execute this on currently has only 1 worker, and I'm looking to limit what takes place on that single worker.

2 REPLIES

Mounika_Tarigop
Databricks Employee

To limit the number of tasks that concurrently execute in a job to 3, you can use the max_concurrent_runs parameter in your job configuration. This parameter allows you to specify the maximum number of concurrent runs for a job, ensuring that no more than the specified number of tasks run at the same time.

When creating or updating your job, set the max_concurrent_runs parameter to 3. This will limit the number of concurrent tasks to 3.

The max_concurrent_runs parameter will handle the concurrency limit regardless of the cluster size.
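
For reference, max_concurrent_runs is a top-level field in the job's settings. A minimal sketch of a job JSON definition (the job name and task keys are hypothetical; notebook, cluster, and other task details are omitted):

{
  "name": "external-system-job",
  "max_concurrent_runs": 3,
  "tasks": [
    { "task_key": "task_a" },
    { "task_key": "task_b" }
  ]
}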

filipniziol
Contributor III

Hi @tgburrin-afs, @Mounika_Tarigop,

As I understand it, the question is about running concurrent tasks within a single job rather than running concurrent jobs.

max_concurrent_runs controls how many times a whole job can run simultaneously, not the concurrency of tasks within a single job run.

There is currently no direct feature in Databricks Jobs to specify a maximum number of concurrently running tasks within a single job run. Instead, you need to control concurrency through task dependencies or application logic.

Approaches to Limit Concurrent Tasks Within a Single Job Run

  1. Use Task Dependencies to Limit Parallelism:
    Structure your job so that no more than three tasks run in the same "layer." For example:
    • Suppose you have 12 tasks total. Instead of having all 12 start at once, arrange them in four "waves" of three tasks each.
    • In the Job UI or JSON configuration (see the sketch after this list):
      • Start with three tasks (A, B, C) that have no upstream dependencies. They run simultaneously.
      • The next set of three tasks (D, E, F) only starts after A, B, and C all complete.
      • Repeat this pattern until all tasks have run. This ensures that at most three tasks are active at the same time.
  2. Implement Concurrency Control in Your Code:

    • If each Databricks task itself runs code that executes operations against the external system, implement concurrency control in your application logic.

    • For example, if a single task processes multiple items and you don’t want more than three operations to hit the external system concurrently, use a thread pool within the task’s code:

 

from concurrent.futures import ThreadPoolExecutor

# Limit to 3 concurrent operations; the pool never runs more
# than max_workers callables at the same time.
with ThreadPoolExecutor(max_workers=3) as executor:
    # process_item and items_to_process are your own function and work items
    futures = [executor.submit(process_item, item) for item in items_to_process]
    # result() blocks until each future finishes and re-raises worker exceptions
    results = [f.result() for f in futures]

 

  • This approach requires merging multiple pieces of logic into a single task and controlling concurrency at the code level.
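
To illustrate approach 1, here is a trimmed sketch of the tasks array from a job JSON definition, using depends_on to gate the second wave on the first (task keys are hypothetical; notebook and cluster settings are omitted):

"tasks": [
  { "task_key": "A" },
  { "task_key": "B" },
  { "task_key": "C" },
  {
    "task_key": "D",
    "depends_on": [ { "task_key": "A" }, { "task_key": "B" }, { "task_key": "C" } ]
  },
  {
    "task_key": "E",
    "depends_on": [ { "task_key": "A" }, { "task_key": "B" }, { "task_key": "C" } ]
  },
  {
    "task_key": "F",
    "depends_on": [ { "task_key": "A" }, { "task_key": "B" }, { "task_key": "C" } ]
  }
]

Each subsequent wave repeats the same pattern, depending on the three tasks of the wave before it.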

 

 
